import pandas as pd
Creating, Reading and Writing
DataFrame and Series
2 core objects- - DataFrame - array of inidividual entries (contains row and column)
keys = ‘column names’, values = list of entries
rows = Index
- Series- sequence of data values
don’t have any column name
row names defined by index parameter aswell
#DataFrame_integer
'Yes' : [390, 233], 'No' : [1,23]}) pd.DataFrame({
Yes | No | |
---|---|---|
0 | 390 | 1 |
1 | 233 | 23 |
# DataFrame_Strings
'Suzaine': ['I liked chocolate', 'Lets have some fun'],
pd.DataFrame({'Marie': ['butterscotch worked fine', 'wow, its raining']},
= ['topic_1', 'topic_2']) index
Suzaine | Marie | |
---|---|---|
topic_1 | I liked chocolate | butterscotch worked fine |
topic_2 | Lets have some fun | wow, its raining |
# series
1, 2, 3],
pd.Series([= ['2014_sales', '2015_sales', '2016_sales'],
index= 'Product A') name
2014_sales 1
2015_sales 2
2016_sales 3
Name: Product A, dtype: int64
# example
= pd.Series(['4 cups', '1 cup', '2 large', '1 can'],
Dinner = ['Flour', 'Milk', 'Eggs', 'Spam'],
index = 'Dinner')
name print(Dinner)
Flour 4 cups
Milk 1 cup
Eggs 2 large
Spam 1 can
Name: Dinner, dtype: object
Writing data files
"Dinner.csv") Dinner.to_csv(
Reading data files
= pd.read_csv('Reactions.csv')
reactions print(reactions.shape)
(25553, 5)
print(reactions.head())
Unnamed: 0 Content ID \
0 0 97522e57-d9ab-4bd6-97bf-c24d952602d2
1 1 97522e57-d9ab-4bd6-97bf-c24d952602d2
2 2 97522e57-d9ab-4bd6-97bf-c24d952602d2
3 3 97522e57-d9ab-4bd6-97bf-c24d952602d2
4 4 97522e57-d9ab-4bd6-97bf-c24d952602d2
User ID Type Datetime
0 NaN NaN 2021-04-22 15:17:15
1 5d454588-283d-459d-915d-c48a2cb4c27f disgust 2020-11-07 09:43:50
2 92b87fa5-f271-43e0-af66-84fac21052e6 dislike 2021-06-17 12:22:51
3 163daa38-8b77-48c9-9af6-37a6c1447ac2 scared 2021-04-18 05:13:58
4 34e8add9-0206-47fd-a501-037b994650a2 disgust 2021-01-06 19:13:01
Indexing, Selecting and Assigning
= pd.read_csv("winemag-data-130k-v2.csv")
data 'display.max_rows', 5)
pd.set_option(print(data.head())
Unnamed: 0 country description \
0 0 Italy Aromas include tropical fruit, broom, brimston...
1 1 Portugal This is ripe and fruity, a wine that is smooth...
2 2 US Tart and snappy, the flavors of lime flesh and...
3 3 US Pineapple rind, lemon pith and orange blossom ...
4 4 US Much like the regular bottling from 2012, this...
designation points price province \
0 Vulkà Bianco 87 NaN Sicily & Sardinia
1 Avidagos 87 15.0 Douro
2 NaN 87 14.0 Oregon
3 Reserve Late Harvest 87 13.0 Michigan
4 Vintner's Reserve Wild Child Block 87 65.0 Oregon
region_1 region_2 taster_name \
0 Etna NaN Kerin O’Keefe
1 NaN NaN Roger Voss
2 Willamette Valley Willamette Valley Paul Gregutt
3 Lake Michigan Shore NaN Alexander Peartree
4 Willamette Valley Willamette Valley Paul Gregutt
taster_twitter_handle title \
0 @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna)
1 @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro)
2 @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley)
3 NaN St. Julian 2013 Reserve Late Harvest Riesling ...
4 @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child...
variety winery
0 White Blend Nicosia
1 Portuguese Red Quinta dos Avidagos
2 Pinot Gris Rainstorm
3 Riesling St. Julian
4 Pinot Noir Sweet Cheeks
print(data.columns)
Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
'price', 'province', 'region_1', 'region_2', 'taster_name',
'taster_twitter_handle', 'title', 'variety', 'winery'],
dtype='object')
print(data.country)
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
print(data['country']) #handles reserved characters
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
print(data['country'][4])
US
Indexing
index based or numerical position based (.iloc operator used)
- python's std. library appraoch (0:10 selects 0, 1, ...9)
label based or value based (.loc operator used)
-indexes inclusively. So 0:10 will select entries 0,...,10
# selecting first row
0] data.iloc[
Unnamed: 0 0
country Italy
...
variety White Blend
winery Nicosia
Name: 0, Length: 14, dtype: object
3, 1] data.iloc[:
0 Italy
1 Portugal
2 US
Name: country, dtype: object
-5:] #selecting last 5 rows, plus all columns data.iloc[
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
129966 | 129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | 129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | 129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | 129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
'taster_name', 'variety', 'winery']] data.loc[:, [
taster_name | variety | winery | |
---|---|---|---|
0 | Kerin O’Keefe | White Blend | Nicosia |
1 | Roger Voss | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... |
129969 | Roger Voss | Pinot Gris | Domaine Marcel Deiss |
129970 | Roger Voss | Gewürztraminer | Domaine Schoffit |
129971 rows × 3 columns
Manipulating the index
'title') #now first column is title data.set_index(
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
title | |||||||||||||
Nicosia 2013 Vulkà Bianco (Etna) | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | White Blend | Nicosia |
Quinta dos Avidagos 2011 Avidagos Red (Douro) | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | 129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Pinot Gris | Domaine Marcel Deiss |
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace) | 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
# conditional selection
# selects data with US in columns names for countries
== 'US'] data.loc[data.country
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129952 | 129952 | US | This Zinfandel from the eastern section of Nap... | NaN | 90 | 22.0 | California | Chiles Valley | Napa | Virginie Boone | @vboone | Houdini 2011 Zinfandel (Chiles Valley) | Zinfandel | Houdini |
129967 | 129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
54504 rows × 14 columns
# selecting particular rows
= [1, 2, 3, 5, 8]
indices = data.loc[indices]
sample_rows print(sample_rows)
Unnamed: 0 country description \
1 1 Portugal This is ripe and fruity, a wine that is smooth...
2 2 US Tart and snappy, the flavors of lime flesh and...
3 3 US Pineapple rind, lemon pith and orange blossom ...
5 5 Spain Blackberry and raspberry aromas show a typical...
8 8 Germany Savory dried thyme notes accent sunnier flavor...
designation points price province region_1 \
1 Avidagos 50 15.0 Douro NaN
2 NaN 50 14.0 Oregon Willamette Valley
3 Reserve Late Harvest 50 13.0 Michigan Lake Michigan Shore
5 Ars In Vitro 50 15.0 Northern Spain Navarra
8 Shine 50 12.0 Rheinhessen NaN
region_2 taster_name taster_twitter_handle \
1 NaN Roger Voss @vossroger
2 Willamette Valley Paul Gregutt @paulgwine
3 NaN Alexander Peartree NaN
5 NaN Michael Schachner @wineschach
8 NaN Anna Lee C. Iijima NaN
title variety \
1 Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red
2 Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris
3 St. Julian 2013 Reserve Late Harvest Riesling ... Riesling
5 Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... Tempranillo-Merlot
8 Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe... Gewürztraminer
winery
1 Quinta dos Avidagos
2 Rainstorm
3 St. Julian
5 Tandem
8 Heinz Eifel
# selecting costly wines from US
== 'US') & (data.price >= 75)] data.loc[(data.country
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
60 | 60 | US | Syrupy and dense, this wine is jammy in plum a... | Estate | 86 | 100.0 | California | Napa Valley | Napa | Virginie Boone | @vboone | Okapi 2013 Estate Cabernet Sauvignon (Napa Val... | Cabernet Sauvignon | Okapi |
73 | 73 | US | Juicy plum, raspberry and pencil lead lead the... | Bella Vetta Vineyard | 86 | 75.0 | California | Howell Mountain | Napa | Virginie Boone | @vboone | Hindsight 2013 Bella Vetta Vineyard Cabernet S... | Cabernet Sauvignon | Hindsight |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129919 | 129919 | US | This ripe, rich, almost decadently thick wine ... | Reserve | 91 | 105.0 | Washington | Walla Walla Valley (WA) | Columbia Valley | Paul Gregutt | @paulgwine | Nicholas Cole Cellars 2004 Reserve Red (Walla ... | Red Blend | Nicholas Cole Cellars |
129967 | 129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
3629 rows × 14 columns
# wines from Australia and New Zealand
data.loc['Australia', 'New Zealand']))
(data.country.isin([ ]
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
77 | 77 | Australia | This medium-bodied Chardonnay features aromas ... | Made With Organic Grapes | 50 | 18.0 | South Australia | South Australia | NaN | Joe Czerwinski | @JoeCz | Yalumba 2016 Made With Organic Grapes Chardonn... | Chardonnay | Yalumba |
83 | 83 | Australia | Pale copper in hue, this wine exudes passion f... | Jester Sangiovese | 50 | 20.0 | South Australia | McLaren Vale | NaN | Joe Czerwinski | @JoeCz | Mitolo 2016 Jester Sangiovese Rosé (McLaren Vale) | Rosé | Mitolo |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129956 | 129956 | New Zealand | The blend is 44% Merlot, 33% Cabernet Sauvigno... | Gimblett Gravels Merlot-Cabernet Sauvignon-Malbec | 50 | 19.0 | Hawke's Bay | NaN | NaN | Joe Czerwinski | @JoeCz | Esk Valley 2011 Gimblett Gravels Merlot-Cabern... | Bordeaux-style Red Blend | Esk Valley |
129958 | 129958 | New Zealand | This blend of Cabernet Sauvignon-Merlot and Ca... | Irongate | 50 | 35.0 | Hawke's Bay | NaN | NaN | Joe Czerwinski | @JoeCz | Babich 2010 Irongate Red (Hawke's Bay) | Bordeaux-style Red Blend | Babich |
3748 rows × 14 columns
# selecting rows and columns
= ['price', 'region_1', 'region_2']
columns = [1, 10, 100]
rows = data.loc[rows, columns]
df print(df)
price region_1 region_2
1 15.0 NaN NaN
10 19.0 Napa Valley Napa
100 18.0 Finger Lakes Finger Lakes
# selecting notnull values
data.loc[data.price.notnull()]
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | 129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
120975 rows × 14 columns
Assigning data
'points'] = 50
data[print(data['points'])
0 50
1 50
..
129969 50
129970 50
Name: points, Length: 129971, dtype: int64
Renaming and Combining
# renaming columns
print(data.rename(columns={'points' : 'score'}))
Unnamed: 0 country \
0 0 Italy
1 1 Portugal
... ... ...
129969 129969 France
129970 129970 France
description \
0 Aromas include tropical fruit, broom, brimston...
1 This is ripe and fruity, a wine that is smooth...
... ...
129969 A dry style of Pinot Gris, this is crisp with ...
129970 Big, rich and off-dry, this is powered by inte...
designation score price province \
0 Vulkà Bianco 87 NaN Sicily & Sardinia
1 Avidagos 87 15.0 Douro
... ... ... ... ...
129969 NaN 90 32.0 Alsace
129970 Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace
region_1 region_2 taster_name taster_twitter_handle \
0 Etna NaN Kerin O’Keefe @kerinokeefe
1 NaN NaN Roger Voss @vossroger
... ... ... ... ...
129969 Alsace NaN Roger Voss @vossroger
129970 Alsace NaN Roger Voss @vossroger
title variety \
0 Nicosia 2013 Vulkà Bianco (Etna) White Blend
1 Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red
... ... ...
129969 Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris
129970 Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer
winery
0 Nicosia
1 Quinta dos Avidagos
... ...
129969 Domaine Marcel Deiss
129970 Domaine Schoffit
[129971 rows x 14 columns]
# renaming indexes
print(data.rename(index={0:'first_entry', 1: 'second_entry'}))
Unnamed: 0 country \
first_entry 0 Italy
second_entry 1 Portugal
... ... ...
129969 129969 France
129970 129970 France
description \
first_entry Aromas include tropical fruit, broom, brimston...
second_entry This is ripe and fruity, a wine that is smooth...
... ...
129969 A dry style of Pinot Gris, this is crisp with ...
129970 Big, rich and off-dry, this is powered by inte...
designation points price province \
first_entry Vulkà Bianco 87 NaN Sicily & Sardinia
second_entry Avidagos 87 15.0 Douro
... ... ... ... ...
129969 NaN 90 32.0 Alsace
129970 Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace
region_1 region_2 taster_name taster_twitter_handle \
first_entry Etna NaN Kerin O’Keefe @kerinokeefe
second_entry NaN NaN Roger Voss @vossroger
... ... ... ... ...
129969 Alsace NaN Roger Voss @vossroger
129970 Alsace NaN Roger Voss @vossroger
title \
first_entry Nicosia 2013 Vulkà Bianco (Etna)
second_entry Quinta dos Avidagos 2011 Avidagos Red (Douro)
... ...
129969 Domaine Marcel Deiss 2012 Pinot Gris (Alsace)
129970 Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...
variety winery
first_entry White Blend Nicosia
second_entry Portuguese Red Quinta dos Avidagos
... ... ...
129969 Pinot Gris Domaine Marcel Deiss
129970 Gewürztraminer Domaine Schoffit
[129971 rows x 14 columns]
# renaming axis
"wines", axis = 'rows').rename_axis('fields', axis = 'columns') data.rename_axis (
fields | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
wines | ||||||||||||||
0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | 129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 14 columns
# combining with concat(), join(), and merge()
= 'CAvideos.csv'
file1 = pd.read_csv(file1)
CAdata CAdata
video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | video_error_or_removed | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | n1WpP7iowLc | 17.14.11 | Eminem - Walk On Water (Audio) ft. Beyoncé | EminemVEVO | 10 | 2017-11-10T17:00:03.000Z | Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... | 17158579 | 787425 | 43420 | 125882 | https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg | False | False | False | Eminem's new track Walk on Water ft. Beyoncé i... |
1 | 0dBIkQ4Mz1M | 17.14.11 | PLUSH - Bad Unboxing Fan Mail | iDubbbzTV | 23 | 2017-11-13T17:00:00.000Z | plush|"bad unboxing"|"unboxing"|"fan mail"|"id... | 1014651 | 127794 | 1688 | 13030 | https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg | False | False | False | STill got a lot of packages. Probably will las... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40879 | lbMKLzQ4cNQ | 18.14.06 | Trump Advisor Grovels To Trudeau | The Young Turks | 25 | 2018-06-13T04:00:05.000Z | 180612__TB02SorryExcuse|"News"|"Politics"|"The... | 115225 | 2115 | 182 | 1672 | https://i.ytimg.com/vi/lbMKLzQ4cNQ/default.jpg | False | False | False | Peter Navarro isn’t talking so tough now. Ana ... |
40880 | POTgw38-m58 | 18.14.06 | 【完整版】遇到恐怖情人該怎麼辦?2018.06.13小明星大跟班 | 我愛小明星大跟班 | 24 | 2018-06-13T16:00:03.000Z | 吳宗憲|"吳姍儒"|"小明星大跟班"|"Sandy"|"Jacky wu"|"憲哥"|"中天... | 107392 | 300 | 62 | 251 | https://i.ytimg.com/vi/POTgw38-m58/default.jpg | False | False | False | 藝人:李妍瑾、玉兔、班傑、LaLa、小優、少少專家:陳筱屏(律師)、Wendy(心理師)、羅... |
40881 rows × 16 columns
= 'FRvideos.csv'
file2 = pd.read_csv(file2)
FRdata FRdata
video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | video_error_or_removed | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ro6eob0LrCY | 17.14.11 | Malika LePen : Femme de Gauche - Trailer | Le Raptor Dissident | 24 | 2017-11-13T17:32:55.000Z | Raptor"|"Dissident"|"Expliquez"|"moi"|"cette"|... | 212702 | 29282 | 1108 | 3817 | https://i.ytimg.com/vi/Ro6eob0LrCY/default.jpg | False | False | False | Dimanche.\n18h30.\nSoyez présents pour la vidé... |
1 | Yo84eqYwP98 | 17.14.11 | LA PIRE PARTIE ft Le Rire Jaune, Pierre Croce,... | Le Labo | 24 | 2017-11-12T15:00:02.000Z | [none] | 432721 | 14053 | 576 | 1161 | https://i.ytimg.com/vi/Yo84eqYwP98/default.jpg | False | False | False | Le jeu de société: https://goo.gl/hhG1Ta\n\nGa... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40722 | NlxE_QQMRzg | 18.14.06 | Նռան հատիկ, Սերիա 192 / Pomegranate seed / Nra... | PanArmenian TV | 1 | 2018-06-13T18:30:00.000Z | Նռան հատիկ|"Սերիա 192"|"Pomegranate seed"|"Nra... | 78117 | 244 | 74 | 46 | https://i.ytimg.com/vi/NlxE_QQMRzg/default.jpg | False | False | False | Follow Armenia TV on social platforms:Instagra... |
40723 | _LgKglfnqlc | 18.14.06 | Mandoumbé ak Koor Gui 2018 Episode 28 | Yesdakar | 24 | 2018-06-13T19:45:14.000Z | ramadan2018|"koorgui"|"Mandoumbé" | 46604 | 947 | 37 | 127 | https://i.ytimg.com/vi/_LgKglfnqlc/default.jpg | False | False | False | NaN |
40724 rows × 16 columns
# joining
= CAdata.set_index(['title', 'trending_date'])
left = FRdata.set_index(['title', 'trending_date'])
right
= '_CAN', rsuffix = '_FR') left.join(right, lsuffix
video_id_CAN | channel_title_CAN | category_id_CAN | publish_time_CAN | tags_CAN | views_CAN | likes_CAN | dislikes_CAN | comment_count_CAN | thumbnail_link_CAN | ... | tags_FR | views_FR | likes_FR | dislikes_FR | comment_count_FR | thumbnail_link_FR | comments_disabled_FR | ratings_disabled_FR | video_error_or_removed_FR | description_FR | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
title | trending_date | |||||||||||||||||||||
!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting Over It - Part 7 | 18.04.01 | PNn8sECd7io | Markiplier | 20 | 2018-01-03T19:33:53.000Z | getting over it|"markiplier"|"funny moments"|"... | 835930 | 47058 | 1023 | 8250 | https://i.ytimg.com/vi/PNn8sECd7io/default.jpg | ... | getting over it"|"markiplier"|"funny moments"|... | 835930.0 | 47058.0 | 1023.0 | 8250.0 | https://i.ytimg.com/vi/PNn8sECd7io/default.jpg | False | False | False | Getting Over It continues with RAGE BEYOND ALL... |
#1 Fortnite World Rank - 2,323 Solo Wins! | 18.09.03 | DvPW66IFhMI | AlexRamiGaming | 20 | 2018-03-09T07:15:52.000Z | PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... | 212838 | 5199 | 542 | 11 | https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 | 18.07.05 | Wt9Gkpmbt44 | TheBigJackpot | 24 | 2018-05-07T06:58:59.000Z | Slot Machine|"win"|"Gambling"|"Big Win"|"raja"... | 28973 | 2167 | 175 | 10 | https://i.ytimg.com/vi/Wt9Gkpmbt44/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
🚨Active Shooter at YouTube Headquarters - LIVE BREAKING NEWS COVERAGE | 18.04.04 | Az72jrKbANA | Right Side Broadcasting Network | 25 | 2018-04-03T23:12:37.000Z | YouTube shooter|"YouTube active shooter"|"acti... | 103513 | 1722 | 181 | 76 | https://i.ytimg.com/vi/Az72jrKbANA/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
40881 rows × 28 columns
Summary Functions and Maps
# some of the summary functions include- describe, mean, unique, value_counts
print(data.columns)
Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
'price', 'province', 'region_1', 'region_2', 'taster_name',
'taster_twitter_handle', 'title', 'variety', 'winery'],
dtype='object')
print(data.points.describe())
count 129971.000000
mean 88.447138
...
75% 91.000000
max 100.000000
Name: points, Length: 8, dtype: float64
# to see the list of unique values
print(data.taster_name.unique)
<bound method Series.unique of 0 Kerin O’Keefe
1 Roger Voss
...
129969 Roger Voss
129970 Roger Voss
Name: taster_name, Length: 129971, dtype: object>
print(data.taster_name.value_counts)
<bound method IndexOpsMixin.value_counts of 0 Kerin O’Keefe
1 Roger Voss
...
129969 Roger Voss
129970 Roger Voss
Name: taster_name, Length: 129971, dtype: object>
# best_bargain_wine- wine with the highest points-to-price ratio
= (data.points / data.price).idxmax()
bargain_idx = data.loc[bargain_idx, 'title']
bargain_wine print(bargain_wine)
Bandit NV Merlot (California)
Maps
takes one set of values and ‘maps’ them to another set of values
example usage - remean the scores of wines received to 0
use apply if you wish to call custom method on each row
= data.points.mean()
review_points_mean map(lambda p:p - review_points_mean) data.points.
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
= data.points.mean()
data_points_mean map(lambda p:p - data_points_mean) data.points.
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
# create descriptor_counts from description for 'tropical' and 'fruity'
= data.description.map(lambda desc:'tropical' in desc).sum()
n_tropical # desc signifies description
= data.description.map(lambda desc:'fruity' in desc).sum()
n_fruity = pd.Series([n_tropical, n_fruity], index= ['tropical', 'fruity'])
descriptor_counts print(descriptor_counts)
tropical 3607
fruity 9090
dtype: int64
simplify with star ratings
95 and above = 3 stars
between 85 and 95 = 2 stars
less than 85 = 1 star
plus, any wines from Canada should get 3 stars
print(data.columns)
Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
'price', 'province', 'region_1', 'region_2', 'taster_name',
'taster_twitter_handle', 'title', 'variety', 'winery'],
dtype='object')
# categorizing using map for points
= data.points.map(lambda
cat 'three_stars' if p>=95
p:else 'two stars' if p >= 85
else 'one star')
#count
= cat.value_counts()
star_rating
print(star_rating)
points
two stars 115125
one star 12430
three_stars 2416
Name: count, dtype: int64
# categorizing using apply for points and Country
= data.apply(lambda row:
cat2 'three stars' if (row['points'] >= 95 or row['country'] == 'Canada')
else 'two stars' if (row['points'] >= 85)
else 'one star', axis = 1)
= cat2.value_counts()
star_rating2 print(star_rating2)
two stars 114877
one star 12421
three stars 2673
Name: count, dtype: int64
# simple way without mapping
def stars(row):
if row.country == 'Canada':
return 3
elif row.points >= 95:
return 3
elif row.points >= 85:
return 2
else:
return 1
= data.apply(stars, axis = 'columns')
star_ratings print(star_ratings)
0 2
1 2
..
129969 2
129970 2
Length: 129971, dtype: int64
def data_points(row):
= row.points - data_points_mean
row.points return row
apply(data_points, axis = 'columns') data.
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | -1.447138 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | -1.447138 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | 129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 1.552862 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 1.552862 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 14 columns
1) data.head(
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
- operation (below) between a lot of values on the left-hand side > and a single value on the right-hand side (the mean value).
= data.points.mean()
data_points_mean - data_points_mean data.points
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
+ "-" + data.region_1 data.country
0 Italy-Etna
1 NaN
...
129969 France-Alsace
129970 France-Alsace
Length: 129971, dtype: object
Grouping and Sorting
use groupby to group data
apply() method can fetch us the data that matches the group
# groupwise analysis
'points').points.count() data.groupby(
points
80 397
81 692
...
99 33
100 19
Name: points, Length: 21, dtype: int64
# ascending or descending order
'points').price.min() data.groupby(
points
80 5.0
81 5.0
...
99 44.0
100 80.0
Name: price, Length: 21, dtype: float64
#grouping in countries and sorting
'country', 'province']).apply(lambda df:df.loc[df.points.idxmax()]) data.groupby([
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | province | ||||||||||||||
Argentina | Mendoza Province | 82754 | Argentina | If the color doesn't tell the full story, the ... | Nicasia Vineyard | 97 | 120.0 | Mendoza Province | Mendoza | NaN | Michael Schachner | @wineschach | Bodega Catena Zapata 2006 Nicasia Vineyard Mal... | Malbec | Bodega Catena Zapata |
Other | 78303 | Argentina | Take note, this could be the best wine Colomé ... | Reserva | 95 | 90.0 | Other | Salta | NaN | Michael Schachner | @wineschach | Colomé 2010 Reserva Malbec (Salta) | Malbec | Colomé | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Uruguay | San Jose | 39898 | Uruguay | Baked, sweet, heavy aromas turn earthy with ti... | El Preciado Gran Reserva | 87 | 50.0 | San Jose | NaN | NaN | Michael Schachner | @wineschach | Castillo Viejo 2005 El Preciado Gran Reserva R... | Red Blend | Castillo Viejo |
Uruguay | 39361 | Uruguay | Cherry and berry aromas are ripe, healthy and ... | Blend 002 Limited Edition | 91 | 22.0 | Uruguay | NaN | NaN | Michael Schachner | @wineschach | Narbona NV Blend 002 Limited Edition Tannat-Ca... | Tannat-Cabernet Franc | Narbona |
425 rows × 14 columns
help(pd.Series.idxmax)
Help on function idxmax in module pandas.core.series:
idxmax(self, axis: 'Axis' = 0, skipna: 'bool' = True, *args, **kwargs) -> 'Hashable'
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that
value is returned.
Parameters
----------
axis : {0 or 'index'}
Unused. Parameter needed for compatibility with DataFrame.
skipna : bool, default True
Exclude NA/null values. If the entire Series is NA, the result
will be NA.
*args, **kwargs
Additional arguments and keywords have no effect but might be
accepted for compatibility with NumPy.
Returns
-------
Index
Label of the maximum value.
Raises
------
ValueError
If the Series is empty.
See Also
--------
numpy.argmax : Return indices of the maximum values
along the given axis.
DataFrame.idxmax : Return index of first occurrence of maximum
over requested axis.
Series.idxmin : Return index *label* of the first occurrence
of minimum of values.
Notes
-----
This method is the Series version of ``ndarray.argmax``. This method
returns the label of the maximum, while ``ndarray.argmax`` returns
the position. To get the position, use ``series.values.argmax()``.
Examples
--------
>>> s = pd.Series(data=[1, None, 4, 3, 4],
... index=['A', 'B', 'C', 'D', 'E'])
>>> s
A 1.0
B NaN
C 4.0
D 3.0
E 4.0
dtype: float64
>>> s.idxmax()
'C'
If `skipna` is False and there is an NA value in the data,
the function returns ``nan``.
>>> s.idxmax(skipna=False)
nan
'country']).price.agg([len, 'min', 'max']) data.groupby([
len | min | max | |
---|---|---|---|
country | |||
Argentina | 3800 | 4.0 | 230.0 |
Armenia | 2 | 14.0 | 15.0 |
... | ... | ... | ... |
Ukraine | 14 | 6.0 | 13.0 |
Uruguay | 109 | 10.0 | 130.0 |
43 rows × 3 columns
Multi-indexes
can help to convert to regular index
= data.groupby(['country', 'province']).description.agg([len])
countries_reviewed print(countries_reviewed)
len
country province
Argentina Mendoza Province 3264
Other 536
... ...
Uruguay San Jose 3
Uruguay 24
[425 rows x 1 columns]
= countries_reviewed.index
mi type(mi)
pandas.core.indexes.multi.MultiIndex
countries_reviewed.reset_index()
country | province | len | |
---|---|---|---|
0 | Argentina | Mendoza Province | 3264 |
1 | Argentina | Other | 536 |
... | ... | ... | ... |
423 | Uruguay | San Jose | 3 |
424 | Uruguay | Uruguay | 24 |
425 rows × 3 columns
# create a series of price and points. sort values by price (ascending)
= data.groupby('price')['points'].max().sort_index()
rating print(rating)
price
4.0 86
5.0 87
..
2500.0 96
3300.0 88
Name: points, Length: 390, dtype: int64
= data.groupby('variety').price.agg('max', 'min')
df print(df)
variety
Abouriou 75.0
Agiorgitiko 66.0
...
Çalkarası 19.0
Žilavka 15.0
Name: price, Length: 707, dtype: float64
Sorting
#ascending by defalt
= countries_reviewed.reset_index()
countries_reviewed = 'len') countries_reviewed.sort_values(by
country | province | len | |
---|---|---|---|
179 | Greece | Muscat of Kefallonian | 1 |
192 | Greece | Sterea Ellada | 1 |
... | ... | ... | ... |
415 | US | Washington | 8639 |
392 | US | California | 36247 |
425 rows × 3 columns
# descending
= 'len', ascending= False) countries_reviewed.sort_values(by
country | province | len | |
---|---|---|---|
392 | US | California | 36247 |
415 | US | Washington | 8639 |
... | ... | ... | ... |
63 | Chile | Coelemu | 1 |
149 | Greece | Beotia | 1 |
425 rows × 3 columns
# sorting index_values
countries_reviewed.sort_index()
country | province | len | |
---|---|---|---|
0 | Argentina | Mendoza Province | 3264 |
1 | Argentina | Other | 536 |
... | ... | ... | ... |
423 | Uruguay | San Jose | 3 |
424 | Uruguay | Uruguay | 24 |
425 rows × 3 columns
# sorting more than one column
=['country', 'len']) countries_reviewed.sort_values(by
country | province | len | |
---|---|---|---|
1 | Argentina | Other | 536 |
0 | Argentina | Mendoza Province | 3264 |
... | ... | ... | ... |
424 | Uruguay | Uruguay | 24 |
419 | Uruguay | Canelones | 43 |
425 rows × 3 columns
Data Types and Missing Data
missing values are given the value NaN - ‘Not a Number’- float64 dtype
# find the data type
data.price.dtype
dtype('float64')
# for every column
print(data.dtypes)
Unnamed: 0 int64
country object
...
variety object
winery object
Length: 14, dtype: object
# transform data type
'float64') data.points.astype(
0 87.0
1 87.0
...
129969 90.0
129970 90.0
Name: points, Length: 129971, dtype: float64
# finding values in country by NaN
data[pd.isnull(data.country)]
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
913 | 913 | NaN | Amber in color, this wine has aromas of peach ... | Asureti Valley | 87 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Gotsa Family Wines 2014 Asureti Valley Chinuri | Chinuri | Gotsa Family Wines |
3131 | 3131 | NaN | Soft, fruity and juicy, this is a pleasant, si... | Partager | 83 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager Red | Red Blend | Barton & Guestier |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129590 | 129590 | NaN | A blend of 60% Syrah, 30% Cabernet Sauvignon a... | Shah | 90 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Büyülübağ 2012 Shah Red | Red Blend | Büyülübağ |
129900 | 129900 | NaN | This wine offers a delightful bouquet of black... | NaN | 91 | 32.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Psagot 2014 Merlot | Merlot | Psagot |
63 rows × 14 columns
# replacing missing values
'Unknown') data.country.fillna(
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
# replacing ('what?','bywhat?')
'NaN', '@Unknown') data.price.replace(
0 NaN
1 15.0
...
129969 32.0
129970 21.0
Name: price, Length: 129971, dtype: float64
# missing price values and count them
sum() data.price.isnull().
8996
# arrange region_1 in ascending order of values
'Unkown').value_counts().sort_values(ascending= False) data.region_1.fillna(
region_1
Unkown 21247
Napa Valley 4480
...
Geelong 1
Paestum 1
Name: count, Length: 1230, dtype: int64