Using Pandas groupby
Using "groupby"

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Using "groupby" in pandas

In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/drinksbycountry'
drinks = pd.read_csv(url)
In [3]:
drinks.head()
Out[3]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [5]:
# get mean of the beer_servings' column
drinks.beer_servings.mean()
Out[5]:
106.16062176165804
In [6]:
# using .groupby
drinks.groupby('continent').beer_servings.mean()
Out[6]:
continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64
In [9]:
# here we are accessing all of Africa in the column "continent
drinks[drinks.continent=='Africa'].head()
Out[9]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
2 Algeria 25 0 14 0.7 Africa
4 Angola 217 57 45 5.9 Africa
18 Benin 34 4 13 1.1 Africa
22 Botswana 173 35 35 5.4 Africa
26 Burkina Faso 25 7 7 4.3 Africa
In [10]:
drinks[drinks.continent=='Africa'].mean()
Out[10]:
beer_servings                   61.471698
spirit_servings                 16.339623
wine_servings                   16.264151
total_litres_of_pure_alcohol     3.007547
dtype: float64
In [11]:
drinks[drinks.continent=='Africa'].beer_servings.mean()
Out[11]:
61.471698113207545
In [14]:
drinks[drinks.continent=='Europe'].beer_servings.mean()
Out[14]:
193.77777777777777

This is the same as the number given when we used .groupby

  • This is because we are grouping beer_servings by the continent

.groupby max and min

In [15]:
drinks.groupby('continent').beer_servings.max()
Out[15]:
continent
Africa           376
Asia             247
Europe           361
North America    285
Oceania          306
South America    333
Name: beer_servings, dtype: int64
In [16]:
drinks.groupby('continent').beer_servings.min()
Out[16]:
continent
Africa            0
Asia              0
Europe            0
North America     1
Oceania           0
South America    93
Name: beer_servings, dtype: int64

Aggregate findings

In [18]:
drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max', 'mean'])
Out[18]:
count min max mean
continent
Africa 53 0 376 61.471698
Asia 44 0 247 37.045455
Europe 45 0 361 193.777778
North America 23 1 285 145.434783
Oceania 16 0 306 89.687500
South America 12 93 333 175.083333

You can get mean of all numeric columns instead of specifying beer_servings

In [19]:
drinks.groupby('continent').mean()
Out[19]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
Africa 61.471698 16.339623 16.264151 3.007547
Asia 37.045455 60.840909 9.068182 2.170455
Europe 193.777778 132.555556 142.222222 8.617778
North America 145.434783 165.739130 24.521739 5.995652
Oceania 89.687500 58.437500 35.625000 3.381250
South America 175.083333 114.750000 62.416667 6.308333

Visualization

In [20]:
# allow plots to appear in notebook using matplotlib
%matplotlib inline
In [24]:
data = drinks.groupby('continent').mean()
data
Out[24]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
Africa 61.471698 16.339623 16.264151 3.007547
Asia 37.045455 60.840909 9.068182 2.170455
Europe 193.777778 132.555556 142.222222 8.617778
North America 145.434783 165.739130 24.521739 5.995652
Oceania 89.687500 58.437500 35.625000 3.381250
South America 175.083333 114.750000 62.416667 6.308333
In [22]:
data.plot(kind='bar')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1179ad1d0>
Tags: pandas