Dog Ratings - Wrangling, Analysis, Visualizations

Created: 10/20/2018
Latest Update: 10/23/2018
By: Can Bekleyici - bekleydata.com

Table of Contents

Introduction

This project aims to to get insightful information about dog ratings from the twitter page WeRateDogs™, while demonstrating advanced data wrangling and visualization techniques using various Python libraries. WeRateDogs™ is a community page on twitter designated to rating dogs on their appearences and stories, and which was formed by the user @dog_rates in 2015. The page has since grown extremely in popularity, with many users sharing its content and requesting their dogs being rated aswell. The site is a driving force for the development of the 'dog culture', with it's famous terms like "pupper", "mlem", "floof" etc. They also developed their own unusual rating system over time, in which almost every dog is rated above 10/10, because "they're good dogs".

Part I - Gathering

In [1]:
# import packages
import pandas as pd
import numpy as np
import requests
import tweepy as tp
import json
In [2]:
# load csv file into df
df = pd.read_csv('twitter-archive-enhanced.csv')
df.head()
Out[2]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None
In [3]:
# download 'image-predictions.tsv'
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
with open('image-predictions.tsv', mode='wb') as file:
    file.write(r.content)
In [4]:
# read 'image-predictions.tsv' in a df
df2 = pd.read_csv('image-predictions.tsv', sep='\t')
df2.head()
Out[4]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [ ]:
# connect to the twitter api
consumer_key = 'xxxx xxxx xxxx xxxx'
consumer_secret = 'xxxx xxxx xxxx xxxx'
access_token = 'xxxx xxxx xxxx xxxx'
access_secret = 'xxxx xxxx xxxx xxxx'

auth = tp.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tp.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
In [ ]:
# retreives a list of dictionaries from the twitter api, with informations about each tweets id, favorite count, and retweet count
df_list = []
errors = []
for id in df['tweet_id']:
    try:
        tweet = api.get_status(id, tweet_mode='extended')
        df_list.append({'tweet_id': str(tweet.id),
                        'favorite_count': int(tweet.favorite_count),
                        'retweet_count': int(tweet.retweet_count)})
    except Exception as e:
        print(str(id) + " : " + str(e))
        errors.append(id)
In [ ]:
# returns the number of unretreivable id's
len(errors)
In [ ]:
# saves the list of dictionaries in a json formatted txt file
with open('tweet_json.txt', 'w') as outfile:  
    json.dump(df_list, outfile)
In [5]:
# loads the json file into a dataframe
with open('tweet_json.txt', 'r') as file:
    df3 = pd.DataFrame(json.load(file), columns=['tweet_id', 'favorite_count', 'retweet_count'])
In [6]:
df3.head()
Out[6]:
tweet_id favorite_count retweet_count
0 892420643555336193 38302 8404
1 892177421306343426 32827 6196
2 891815181378084864 24720 4100
3 891689557279858688 41634 8531
4 891327558926688256 39797 9249

Part II - Assessing the Data

In [7]:
df.sample(5)
Out[7]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1224 714214115368108032 NaN NaN 2016-03-27 22:14:49 +0000 <a href="http://twitter.com/download/iphone" r... Happy Easter from the squad! 🐇🐶 13/10 for all ... NaN NaN NaN https://twitter.com/dog_rates/status/714214115... 13 10 None None None None None
770 776819012571455488 NaN NaN 2016-09-16 16:24:19 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: Everybody look at this beautifu... 6.798284e+17 4.196984e+09 2015-12-24 00:58:27 +0000 https://twitter.com/dog_rates/status/679828447... 13 10 None None None pupper None
738 780800785462489090 NaN NaN 2016-09-27 16:06:28 +0000 <a href="http://twitter.com/download/iphone" r... This is Koda. He has a weird relationship with... NaN NaN NaN https://twitter.com/dog_rates/status/780800785... 11 10 Koda None None None None
226 848690551926992896 NaN NaN 2017-04-03 00:16:10 +0000 <a href="http://twitter.com/download/iphone" r... Please stop sending in animals other than dogs... NaN NaN NaN https://twitter.com/dog_rates/status/848690551... 12 10 None None None None None
2257 667728196545200128 NaN NaN 2015-11-20 15:36:22 +0000 <a href="http://twitter.com" rel="nofollow">Tw... Meet Olive. He comes to spot by tree to remini... NaN NaN NaN https://twitter.com/dog_rates/status/667728196... 11 10 Olive None None None None
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [9]:
# checking rating denominators
df['rating_denominator'].value_counts()
Out[9]:
10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
In [10]:
# checking rating numerators
df['rating_numerator'].value_counts()
Out[10]:
12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
In [11]:
# inspecting the reason for extremely high rating numerators
for i in df[df['rating_numerator'] > 20]['text']:
    print(i)
@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
@markhoppus 182/10
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu…
The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd
Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS
This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq
Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1
Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12
From left to right:
Cletus, Jerome, Alejandro, Burp, &amp; Titson
None know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK
Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa
Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ
Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD
IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw
After so many requests... here you go.

Good dogg. 420/10 https://t.co/yfAAo1gdeY
In [12]:
# checks for duplicated entries
df[df.tweet_id.duplicated()]
Out[12]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
In [13]:
df['name'].value_counts()[:10]
Out[13]:
None       745
a           55
Charlie     12
Cooper      11
Oliver      11
Lucy        11
Tucker      10
Penny       10
Lola        10
Winston      9
Name: name, dtype: int64
In [14]:
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [15]:
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2341 entries, 0 to 2340
Data columns (total 3 columns):
tweet_id          2341 non-null object
favorite_count    2341 non-null int64
retweet_count     2341 non-null int64
dtypes: int64(2), object(1)
memory usage: 54.9+ KB
In [16]:
df3.retweet_count.describe()
Out[16]:
count     2341.000000
mean      2961.133276
std       4971.707929
min          0.000000
25%        595.000000
50%       1382.000000
75%       3452.000000
max      84382.000000
Name: retweet_count, dtype: float64

Quality Issues (to be fixed):

  • the rating denominators are inconsistent
  • some rating numerators are relatively high
  • a high number of entries in df are retweets or replies
  • a number of entries with 'name' being 'a', 'an', 'the'
  • missing data in 'expanded_urls' in df
  • dog's stages should be categorical data
  • missing dog names in df, recorded as 'None' string instead of NaN
  • inconsistency in dog breeds in df2
  • 'tweet_id' columns should be string types
  • timestamp column should be of datetime type

Tidiness Issues (to be fixed):

  • tables are seperated, although they contain the same observations
  • the variable for the dog's stage (e.g. doggo, floofer, pupper etc.) is spread in different columns

Part IIIa - Cleaning (tidiness)

  • tables are seperated, although they contain the same observations

Definition: Joining the 3 dataframes in one master dataframe on the 'tweet_id' primary key. Adding the predicted dog breeds, favorite counts, and retweet counts.

In [17]:
df2.head(1)
Out[17]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
In [18]:
# creates a predicted dog breed column, based on the the confidence level of minimum 20% and 'is dog' statements
df2['breed_pred'] = [i['p1'] if i['p1_dog'] == True and i['p1_conf'] > 0.2 
                     else i['p2'] if i['p2_dog'] == True and i['p2_conf'] > 0.2
                     else i['p3'] if i['p3_dog'] == True and i['p3_conf'] > 0.2
                     else np.nan for index, i in df2.iterrows()]
In [19]:
# testing
df2.breed_pred.value_counts()[:10]
Out[19]:
golden_retriever      152
Labrador_retriever     99
Pembroke               90
Chihuahua              79
pug                    56
Samoyed                44
chow                   41
toy_poodle             39
Pomeranian             37
malamute               30
Name: breed_pred, dtype: int64
In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [21]:
# join 'breed_pred' from df2 with df into a master df, keep only relevant collumns of both dataframes:
table1 = df[['tweet_id', 'in_reply_to_status_id', 'retweeted_status_id', 'timestamp', 'text', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]
table2 = df2[['tweet_id', 'breed_pred']]

df_master = pd.merge(table1, table2, on=['tweet_id'], how='left')
In [22]:
# test
df_master.head(3)
Out[22]:
tweet_id in_reply_to_status_id retweeted_status_id timestamp text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo breed_pred
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 This is Phineas. He's a mystical boy. Only eve... https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None NaN
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 This is Tilly. She's just checking pup on you.... https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None Chihuahua
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 This is Archie. He is a rare Norwegian Pouncin... https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None Chihuahua
In [23]:
# convert the tweet id column in df3 into int64 type for merging purposes
df3['tweet_id'] = df3['tweet_id'].astype('int64')
# join the 'favorite count' and 'retweet count' columns from df3 on the master df:
df_master = pd.merge(df_master, df3, on=['tweet_id'], how='left')
# return the tweet id column dtypes into object
df_master['tweet_id'] = df_master['tweet_id'].astype('object')
In [24]:
# test
df_master.head(3)
Out[24]:
tweet_id in_reply_to_status_id retweeted_status_id timestamp text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo breed_pred favorite_count retweet_count
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 This is Phineas. He's a mystical boy. Only eve... https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None NaN 38302.0 8404.0
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 This is Tilly. She's just checking pup on you.... https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None Chihuahua 32827.0 6196.0
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 This is Archie. He is a rare Norwegian Pouncin... https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None Chihuahua 24720.0 4100.0
  • The variable for the dog's stage (e.g. doggo, floofer, pupper etc.) is spread in different columns

Definition: Create a new categorical column, which defines, wether the dog has been described as doggo, floofer/floof, pupper, or puppo

In [25]:
# create the new 'stage' column
df_master['stage'] = ['doggo' if 'doggo' in i
                     else 'pupper' if 'pupper' in i
                     else 'floofer' if 'floofer' in i
                     else 'floof' if 'floof' in i
                     else 'puppo' if 'puppo' in i
                     else np.nan for i in df_master['text'].str.lower()]
# turn into categorical
df_master['stage'] = df_master['stage'].astype('category')
# drop other columns
df_master.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1, inplace=True)
In [26]:
# test
df_master['stage'].dtype
Out[26]:
CategoricalDtype(categories=['doggo', 'floof', 'floofer', 'pupper', 'puppo'], ordered=False)
In [27]:
df_master.head()
Out[27]:
tweet_id in_reply_to_status_id retweeted_status_id timestamp text expanded_urls rating_numerator rating_denominator name breed_pred favorite_count retweet_count stage
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 This is Phineas. He's a mystical boy. Only eve... https://twitter.com/dog_rates/status/892420643... 13 10 Phineas NaN 38302.0 8404.0 NaN
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 This is Tilly. She's just checking pup on you.... https://twitter.com/dog_rates/status/892177421... 13 10 Tilly Chihuahua 32827.0 6196.0 NaN
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 This is Archie. He is a rare Norwegian Pouncin... https://twitter.com/dog_rates/status/891815181... 12 10 Archie Chihuahua 24720.0 4100.0 NaN
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 This is Darla. She commenced a snooze mid meal... https://twitter.com/dog_rates/status/891689557... 13 10 Darla NaN 41634.0 8531.0 NaN
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 This is Franklin. He would like you to stop ca... https://twitter.com/dog_rates/status/891327558... 12 10 Franklin basset 39797.0 9249.0 NaN

Part IIIb - Cleaning (quality)

  • missing dog names in df, recorded as 'None' string instead of NaN
  • a number of entries in name being 'a', 'an', 'the'

Definition:

  • replace 'None' with NaN in name column
  • replace 'a', 'an', 'the' with NaN in name column
In [28]:
df_master['name'] = df_master['name'].replace('None', np.nan)
In [29]:
# test
assert sum(df_master['name'] == 'None') == 0
In [30]:
df_master['name'] = df_master['name'].replace(['a', 'an', 'the'], np.nan)
In [31]:
# test
assert sum(df_master['name'] == 'a') + sum(df_master['name'] == 'an') + sum(df_master['name'] == 'a') == 0
  • inconsistency in dog breeds

Definition: replace all '_' and '-' with spaces, and everything in lower cases.

In [32]:
df_master['breed_pred'] = df_master['breed_pred'].astype('str')
df_master['breed_pred'] = df_master['breed_pred'].str.lower().str.replace('_', ' ').str.replace('-', ' ')
# NaN's back to non strings
df_master['breed_pred'] = df_master['breed_pred'].replace('nan', np.nan)
In [33]:
# test (first 10 rows)
df_master['breed_pred'].value_counts()[:10]
Out[33]:
golden retriever      152
labrador retriever     99
pembroke               90
chihuahua              79
pug                    56
samoyed                44
chow                   41
toy poodle             39
pomeranian             37
malamute               30
Name: breed_pred, dtype: int64
  • missing data in 'expanded_urls' in df

Definition: for the missing entries, generate the url using the tweet_id's

In [34]:
df_master['expanded_urls'] = [str('https://twitter.com/dog_rates/status/' + str(id)) 
                              for id, url in zip(df_master['tweet_id'], df_master['expanded_urls'])]
In [35]:
# test
assert df_master['expanded_urls'].isnull().any() == False
  • a high number of entries in df are retweets or replies

Definition: Identify retweet and reply rows, using regex in text column. Delete the rows, which contain retweets.

In [36]:
# drop the rows
df_master = df_master.drop(df_master[(df_master['in_reply_to_status_id'].isnull() == False) | (df_master['retweeted_status_id'].isnull() == False)].index)
# also drop the reply status and retweet status columns
df_master = df_master.drop(['in_reply_to_status_id', 'retweeted_status_id'], axis=1)
In [37]:
# test
df_master.shape
Out[37]:
(2097, 11)
  • the rating denominators are inconsistent
  • some rating numerators are relatively high

Definition: I figured that some of the dog ratings are exaggerated, eg. 420/10, and would distort the hole analysis, since rating are usually between 10/10 and 15/10. Those rows will be dropped and excluded from the analysis.

Then some of the ratings have denominators other than the usual */10. I figured, that this can be the case for a group of dogs (e.g. a group of 5 dogs with 11/10 rating each would result in a 55/50 rating). If that is the case, the rating will be standardized to a denominator of 10.

In [38]:
# standardizing to a denominator of 10 for groups of dogs:
df_master['rating_num'] = [int(round(num/(denom/10)))  if denom != 10 and num/denom <= 2 
                           else num for num, denom in zip(df_master['rating_numerator'], df_master['rating_denominator'])]
df_master['rating_denom'] = [10 if denom != 10 and num/denom <= 2
                             else denom for num, denom in zip(df_master['rating_numerator'], df_master['rating_denominator'])]
# drop other columns
df_master = df_master.drop(['rating_numerator', 'rating_denominator'], axis=1)
In [39]:
# test denominators
df_master['rating_denom'].unique()
Out[39]:
array([10,  7], dtype=int64)
In [40]:
# test numerators
df_master['rating_num'].unique()
Out[40]:
array([  13,   12,   14,    5,   11,    6,   10,    0,   24,   75,   27,
          3,    7,    8,    9,    4, 1776,    2,   26,    1,  420],
      dtype=int64)

Most of the high numerators have been standardized in the process, but there are still some outliers. Proceeding with removing the rows which couldn't be standardized and the rows which contain unusual high ratings (chosen as above 20/10)

In [41]:
df_master = df_master.drop(df_master[((df_master['rating_denom'] != 10) | (df_master['rating_num'] > 20))].index)
In [42]:
# test
df_master['rating_denom'].unique(), df_master['rating_num'].unique()
Out[42]:
(array([10], dtype=int64),
 array([13, 12, 14,  5, 11,  6, 10,  0,  3,  7,  8,  9,  4,  2,  1],
       dtype=int64))
  • timestamp column should be of datetime type

Definition: Change the timestamp column to datetime format with the to_datetime function

In [43]:
df_master['timestamp'] = pd.to_datetime(df_master['timestamp'], format='%Y-%m-%d %H:%M:%S')
In [44]:
# test
df_master['timestamp'].describe()
Out[44]:
count                    2091
unique                   2091
top       2016-09-12 15:10:21
freq                        1
first     2015-11-15 22:32:08
last      2017-08-01 16:23:56
Name: timestamp, dtype: object
In [45]:
df_master.sample(5)
Out[45]:
tweet_id timestamp text expanded_urls name breed_pred favorite_count retweet_count stage rating_num rating_denom
1542 689599056876867584 2016-01-20 00:03:21 Here we see 33 dogs posing for a picture. All ... https://twitter.com/dog_rates/status/689599056... NaN NaN 12295.0 5954.0 NaN 11 10
2322 666430724426358785 2015-11-17 01:40:41 Oh boy what a pup! Sunglasses take this one to... https://twitter.com/dog_rates/status/666430724... NaN NaN 316.0 197.0 NaN 6 10
1269 709519240576036864 2016-03-14 23:19:03 This is Vince. He's a Gregorian Flapjeck. Whit... https://twitter.com/dog_rates/status/709519240... Vince cocker spaniel 1569.0 257.0 NaN 9 10
1115 732726085725589504 2016-05-18 00:14:46 This is Aldrick. He looks wise af. Also except... https://twitter.com/dog_rates/status/732726085... Aldrick pomeranian 3719.0 951.0 NaN 11 10
2085 670804601705242624 2015-11-29 03:20:54 Meet Mason. He's a total frat boy. Pretends to... https://twitter.com/dog_rates/status/670804601... Mason pomeranian 2017.0 991.0 NaN 10 10
In [46]:
# save the master dataframe to a csv file
df_master.to_csv('twitter_archive_master.csv', index=False)

Part IV - Exploratory Data Analysis

Now that the dataset is ready for analysis, I would like to explore it's content and collect some additional information from it.

In [47]:
# import matplotlib, seaborn and the csv
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.25)
% matplotlib inline

df = pd.read_csv('twitter_archive_master.csv')
In [48]:
# truncates the timestamp column
df['timestamp'] = [i[:10] for i in df['timestamp']]
# retrieves information about the time frame of the data set
df['timestamp'].min(), df['timestamp'].max()
Out[48]:
('2015-11-15', '2017-08-01')
In [49]:
# number of entries and columns in the dataset
df.shape
Out[49]:
(2091, 11)
In [50]:
# shows the top 10 represented dog breeds in the dataset
df['breed_pred'].value_counts()[:10]
Out[50]:
golden retriever      137
labrador retriever     93
pembroke               88
chihuahua              74
pug                    53
samoyed                40
chow                   39
toy poodle             37
pomeranian             36
malamute               29
Name: breed_pred, dtype: int64
In [51]:
# shows the top 10 represented dog names in the dataset
df['name'].value_counts()[:10]
Out[51]:
Lucy       11
Charlie    11
Oliver     10
Cooper     10
Tucker      9
Penny       9
Sadie       8
Winston     8
Lola        8
Daisy       7
Name: name, dtype: int64

Information about the dog's ratings:

In [52]:
# shows descriptive information about the dog's ratings
df['rating_num'].describe()
Out[52]:
count    2091.000000
mean       10.598278
std         2.161710
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        14.000000
Name: rating_num, dtype: float64
In [53]:
# the distribution of ratings on a histogram
plt.figure(figsize=[8,6])
plt.hist(df['rating_num'], bins=np.arange(0.5, 14.5, 1), alpha=0.8, color='#64818e')
plt.xlabel('Rating')
plt.ylabel('Amount')
plt.title('Rating Distribution');

Information about favorite_count and retweet_count:

In [54]:
# shows descriptive information about favorite counts
df['favorite_count'].describe()
Out[54]:
count      2090.000000
mean       8862.364115
std       12814.740968
min          80.000000
25%        1975.250000
50%        4034.500000
75%       11064.500000
max      165078.000000
Name: favorite_count, dtype: float64
In [55]:
# shows descriptive information about retweet counts
df['retweet_count'].describe()
Out[55]:
count     2090.000000
mean      2746.163636
std       4787.423447
min         12.000000
25%        606.500000
50%       1329.500000
75%       3128.500000
max      84382.000000
Name: retweet_count, dtype: float64

Trends over time:

In [56]:
# shows the distribution of ratings over time
import calendar
import warnings
warnings.filterwarnings("ignore")

labels = ([str(i + ' 2015') for i in calendar.month_name[11:13]] 
          + [str(i + ' 2016') for i in calendar.month_name[1:13]] 
          + [str(i + ' 2017') for i in calendar.month_name[1:8]])

plt.figure(figsize=[16,6])
plt.title('Rating over Time')
plt.xticks(np.arange(1, 645, 30), labels, rotation=75)
ax = sns.lineplot(x='timestamp', y='rating_num', data=df, color='#64818e')
ax.set(xlabel='Date', ylabel='Dog Ratings')
plt.savefig('visuals/rating_over_time.png', bbox_inches='tight')
plt.show();
In [57]:
# shows the correlation coefficient
np.corrcoef(list(reversed(df['timestamp'].index.values.astype(float))), df['rating_num'])
Out[57]:
array([[1.       , 0.5359766],
       [0.5359766, 1.       ]])
In [58]:
# shows favorite counts over time
plt.figure(figsize=[8,6])
plt.scatter(x=df['favorite_count'], y=df['rating_num'], alpha=0.1, color='#64818e')
plt.xlim(0, 50000)
plt.xlabel('Favorite Count')
plt.ylabel('Rating')
plt.title('Favorite Counts on Ratings');
In [59]:
# shows retweet counts over time
plt.figure(figsize=[8,6])
plt.scatter(x=df['retweet_count'], y=df['rating_num'], alpha=0.1, color='#64818e')
plt.xlim(0, 20000)
plt.xlabel('Retweet Count')
plt.ylabel('Rating')
plt.title('Retweet Counts on Ratings');

EDA Conclusion:

The dataset has in total 2091 entries of dog rate tweets between November 2015 and August 2017. The top 3 most represented dog breeds in this dataset are Golden Retrievers, Labrador Retrievers and Pembroke Corgis, while Charlie and Lucy being the most represented dog names with 11 dog rates each. The rating distribution is left-skewed with a median rating of 11 and an average rating of 10.598278. For all of the dataset's tweets, the average favorite count is 8862 and the average retweet count is 2746.

A line plot revealed a trend in the past, where ratings have become higher over time. There is a moderate positive correlation between rating and time, which can also be derived from the correlaton coefficient of 0.5359766. Scatterplots revealed a positive correlation between the rating of the dog and the favorite count or retweet count.

Note: A question to be discussed is, wether the ratings should be normalized for further analysis. It is doubtful to assume that dog's have gotten "better" or "cuter" within a one and a half year time span. The inconsistency of ratings within the dataset would favor more recent dog ratings, and therefore distort the reality. But since the analysis in this project does not aim to give conclusions about the population, no further adjustments will be made at this point.

Part Va - Asking Questions

  1. Which dog breed recieved the highest ratings on average?
  2. Which dog stage recieved the highest ratings on average?
  3. Which dog breed recieved the highest favorite count on average?
  4. Which dog stage recieved the highest favorite count on average?

1. Which dog breed recieved the highest ratings on average?

To answer this question, I only included dog breeds in the analysis, which recieved at least 10 different ratings. Any breed category below 10 entries are likely to be coincidential. The assumption is, that certain type of dogs would recieve higher ratings, than others.

In [60]:
# groups the dataset by dog breeds and computes the average rating for each group
breed_groups = df.groupby('breed_pred')['rating_num'].mean()

# filters out dog breed that appear less than 10 times
breed_groups = breed_groups[df['breed_pred'].value_counts() > 10].sort_values(ascending=False)
breed_groups
Out[60]:
breed_pred
eskimo dog                        11.875000
samoyed                           11.725000
chow                              11.641026
golden retriever                  11.532847
great pyrenees                    11.500000
pembroke                          11.409091
siberian husky                    11.315789
bernese mountain dog              11.272727
cardigan                          11.250000
cocker spaniel                    11.240000
french bulldog                    11.222222
rottweiler                        11.187500
pomeranian                        11.166667
pekinese                          11.166667
labrador retriever                11.118280
toy poodle                        11.054054
german shepherd                   11.000000
kuvasz                            11.000000
malamute                          11.000000
american staffordshire terrier    11.000000
shih tzu                          10.866667
vizsla                            10.833333
miniature pinscher                10.777778
shetland sheepdog                 10.750000
basset                            10.733333
chihuahua                         10.702703
chesapeake bay retriever          10.681818
west highland white terrier       10.666667
staffordshire bullterrier         10.666667
border collie                     10.583333
beagle                            10.375000
lakeland terrier                  10.285714
pug                               10.264151
airedale                          10.181818
italian greyhound                  9.833333
maltese dog                        9.500000
Name: rating_num, dtype: float64

Plotting a horizontal bar chart with breed_groups

In [61]:
# define axis values
x_axis = list(breed_groups)
y_axis = [i.title() for i in breed_groups.index]

# plot the bar chart
plt.figure(figsize=[10,14])
plt.xlim(8,13)
plt.xlabel('Average Rating')
plt.title('Average Rating for each Dog Breed on WeRateDogs™')
colors = ['#8dafbe' if (x < max(x_axis)) else '#f8e863' for x in x_axis]
ax = sns.barplot(x=x_axis, y=y_axis, label=y_axis, color="b", palette=colors)
plt.savefig('visuals/rating_each_breed.png', bbox_inches='tight')
plt.show();

Answer:

Eskimo dogs recieved the highest ratings on WeRateDogs with an average rating of 11.875, followed by Samoyeds with an average rating of 11.725 and Chows with an average rating of 11.641.

2. Which dog stage recieved the highest ratings on average?

In [62]:
# puts floofs and floofers into the same category 'floof'
df['stage'] = df['stage'].replace('floofer', 'floof')
In [63]:
# groups the dataset by dog stages and computes the average rating for each group
stage_groups = df.groupby('stage')['rating_num'].mean()

# sorts the stage groups by descending order
stage_groups = stage_groups.sort_values(ascending=False)
stage_groups
Out[63]:
stage
puppo     12.107143
doggo     11.725275
floof     11.694444
pupper    10.688525
Name: rating_num, dtype: float64
In [64]:
# define axis values
x_axis = list(stage_groups)
y_axis = [i.title() for i in stage_groups.index]

# plot the bar chart
plt.figure(figsize=[10,5])
plt.xlim(8,13)
plt.xlabel('Average Rating')
plt.title('Average Rating for each Dog Stage on WeRateDogs™')
colors = ['#8dafbe' if (x < max(x_axis)) else '#f8e863' for x in x_axis]
ax = sns.barplot(x=x_axis, y=y_axis, label=y_axis, color="b", palette=colors)
plt.savefig('visuals/rating_each_stage.png', bbox_inches='tight')
plt.show();

Answer:

Puppo's, which is a term for teenage dogs, recieved the highest ratings on WeRateDogs with an average rating of 12.107, while pupper's, which basically are puppies, recieved the lowest ratings with an average rating of 10.689.

3. Which dog breed recieved the highest favorite counts on average?

In [65]:
# groups the dataset by dog breeds and computes the average favorite count for each group
breed_groups_fav = df.groupby('breed_pred')['favorite_count'].mean()

# filters out dog breed that appear less than 10 times
breed_groups_fav = breed_groups_fav[df['breed_pred'].value_counts() > 10].sort_values(ascending=False)
breed_groups_fav
Out[65]:
breed_pred
french bulldog                    17949.038462
eskimo dog                        15911.750000
cardigan                          14303.875000
samoyed                           14001.300000
great pyrenees                    12451.785714
labrador retriever                11998.677419
golden retriever                  11963.686131
basset                            11832.400000
cocker spaniel                    11812.800000
malamute                          11729.482759
pembroke                          11212.772727
chesapeake bay retriever          10640.045455
italian greyhound                 10231.833333
staffordshire bullterrier          9893.888889
german shepherd                    9806.550000
chow                               9802.871795
chihuahua                          8951.986486
border collie                      8653.750000
shetland sheepdog                  8317.562500
miniature pinscher                 8210.555556
rottweiler                         7768.625000
pekinese                           7371.500000
american staffordshire terrier     7291.076923
west highland white terrier        7166.833333
vizsla                             7063.833333
bernese mountain dog               7046.636364
siberian husky                     6939.052632
pomeranian                         6762.750000
toy poodle                         6535.405405
pug                                5723.301887
kuvasz                             5679.142857
airedale                           5437.363636
beagle                             4666.437500
lakeland terrier                   4500.000000
shih tzu                           4122.866667
maltese dog                        2938.625000
Name: favorite_count, dtype: float64
In [66]:
# define axis values
x_axis = list(breed_groups_fav)
y_axis = [i.title() for i in breed_groups_fav.index]

# plot the bar chart
plt.figure(figsize=[10,14])
plt.xlim(0, 20000)
plt.xlabel('Average Favorite Count')
plt.title('Average Favorite Count for each Dog Breed on WeRateDogs™')
colors = ['#8dafbe' if (x < max(x_axis)) else '#f8e863' for x in x_axis]
ax = sns.barplot(x=x_axis, y=y_axis, label=y_axis, color="b", palette=colors)
plt.show();

Answer:

Regarding the popularity of dog breeds, French Bulldogs have scored the highest between all dog breeds with an average favorite count of 17949 per tweet, contrary to the average ratings, in which they only placed 11th. The second most popular dog breed on WeRateDogs is the Eskimo Dog with an average favorite counnt of 15911 per tweet, which also placed 1st on the average rating chart.

4. Which dog stage recieved the highest favorite count on average?

In [67]:
# groups the dataset by dog stages and computes the average favorite count for each group
stage_groups_fav = df.groupby('stage')['favorite_count'].mean()

# sorts the stage groups by descending order
stage_groups_fav = stage_groups_fav.sort_values(ascending=False)
stage_groups_fav
Out[67]:
stage
puppo     22091.857143
doggo     17623.032967
floof      8518.222222
pupper     6948.307377
Name: favorite_count, dtype: float64
In [68]:
# define axis values
x_axis = list(stage_groups_fav)
y_axis = [i.title() for i in stage_groups_fav.index]

# plot the bar chart
plt.figure(figsize=[10,5])
plt.xlim(0,25000)
plt.xlabel('Average Favorite Count')
plt.title('Average Favorite Count for each Dog Stage on WeRateDogs™')
colors = ['#8dafbe' if (x < max(x_axis)) else '#f8e863' for x in x_axis]
ax = sns.barplot(x=x_axis, y=y_axis, label=y_axis, color="b", palette=colors)
plt.show();

Answer:

Puppos have recieved the most favorite counts (or "likes") on average between the other dog stages, with an average favorite count of 22091 per tweet. This makes puppos not only the highest rated dog stage, but also the most popular within the WeRateDogs community in the observed time period.

Part Vb - Additional Visualizations

To convert the charts of the last chapter into interesting visualizations, a word cloud has been framed onto a picture of each sections highest placing dogs in addition to their key metrics.

For this task, following steps have been made:

  • Parse all the texts for each group, eskimo dogs, french bulldogs, and puppos
  • Generate a word cloud frame for each observation using Andreas Mueller's wordcloud package for python.
  • Merge appropriate images with the word clouds using photoshop.
In [69]:
# Parse texts for Question 1
import re
q1_text = str()
for i in df[df['breed_pred'] == 'eskimo dog'].text:
    q1_text += re.search('(.*)[^http.-{}]', i).group(0)
    
# Parse texts for Question 2 & 4
q24_text = str()
for i in df[df['stage'] == 'puppo'].text:
    q24_text += re.search('(.*)[^http.-{}]', i).group(0)

# Parse texts for Question 3
q3_text = str()
for i in df[df['breed_pred'] == 'french bulldog'].text:
    q3_text += re.search('(.*)[^http.-{}]', i).group(0)
In [70]:
# Import packages for the wordcloud
from os import path
from PIL import Image
import os

from wordcloud import WordCloud, STOPWORDS
In [71]:
# get data directory
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
In [72]:
# read the mask
doggy_mask = np.array(Image.open(path.join(d, "visuals/doggymask.png")))

stopwords = set(STOPWORDS)
stopwords.add("said")

wc = WordCloud(background_color="#e6e6e6", min_font_size=10, max_font_size=70
               , max_words=4000, mask=doggy_mask, stopwords=stopwords)
In [73]:
# generate word cloud and store for each category
wc.generate(q1_text)
wc.to_file(path.join(d, "visuals/eskimo_cloud.png"))

wc.generate(q24_text)
wc.to_file(path.join(d, "visuals/puppo_cloud.png"))

wc.generate(q3_text)
wc.to_file(path.join(d, "visuals/frenchy_cloud.png"))
Out[73]:
<wordcloud.wordcloud.WordCloud at 0x1d46d701a20>

The Word Cloud results (after Photoshop), because why not?

Dog Breed with Highest Average Rating:

Dog Breed with Highest Average Rating

Dog Breed with Highest Average Favorite Count:

Dog Breed with Highest Average Favorite Count

Dog Stage with Highest Rating and Favorite Count:

Dog Stage with Highest Rating and Favorite Count

Summary

To achieve the goals of the analysis, data have been gathered from different sources, including the twitter api as well as other web data. The potentials of Python's pandas library has been used extensively on the assessment and data cleaning parts. For the conclusion part, statements regarding the performance of each dog breed and dog stage (age group) have been given. To sum up the conclusions, visualizations in forms of word clouds have been plotted in accordance to the observations.