Using Pandas index
Using Pandas Index

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Pandas Index

In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/drinksbycountry'
drinks = pd.read_csv(url)
In [3]:
drinks.head()
Out[3]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [4]:
drinks.index
Out[4]:
RangeIndex(start=0, stop=193, step=1)

The index is from 0 to 193 (0, 1, 2, 3, 4... 193)

In [5]:
drinks.columns
Out[5]:
Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')
In [6]:
# index is not part of the DataFrame
drinks.shape
Out[6]:
(193, 6)
In [7]:
# rarely people leave columns without headers
url2 = 'http://bit.ly/movieusers'
pd.read_table(url2, header=None, sep='|').head()
Out[7]:
0 1 2 3 4
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

What are indexes for?

  1. Identification
  2. Selection
  3. Alignment

1. Identification

In [8]:
# you can identify what rows we are working with here
drinks[drinks.continent=='South America']
Out[8]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
6 Argentina 193 25 221 8.3 South America
20 Bolivia 167 41 8 3.8 South America
23 Brazil 245 145 16 7.2 South America
35 Chile 130 124 172 7.6 South America
37 Colombia 159 76 3 4.2 South America
52 Ecuador 162 74 3 4.2 South America
72 Guyana 93 302 1 7.1 South America
132 Paraguay 213 117 74 7.3 South America
133 Peru 163 160 21 6.1 South America
163 Suriname 128 178 7 5.6 South America
185 Uruguay 115 35 220 6.6 South America
188 Venezuela 333 100 3 7.7 South America

2. Selection

In [9]:
# .loc method to retrieve element/cell
drinks.loc[23, 'beer_servings']
Out[9]:
245
In [10]:
# inplace=True makes the change 
# sets the index to 'country'
drinks.set_index('country', inplace=True)
drinks.head()
Out[10]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa
In [11]:
drinks.index
Out[11]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)
In [12]:
# country is no longer one of the columns
drinks.columns
Out[12]:
Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')
In [13]:
# we can select based on country instead of a number 
# we can select more easily by setting a meaningful index
drinks.loc['Brazil', 'beer_servings']
Out[13]:
245

'country' is the name of the index
We can clear this out

In [14]:
# clearing index name
drinks.index.name = None
drinks.head()
Out[14]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa
In [15]:
# say you prefer to use the default index and you want back the column of countries
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()
Out[15]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [16]:
drinks.describe()
Out[16]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000
In [21]:
type(drinks.describe())
# you can see this is a DataFrame so we can interact with it accordingly
Out[21]:
pandas.core.frame.DataFrame
In [18]:
drinks.describe().index
Out[18]:
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
In [19]:
drinks.describe().columns
Out[19]:
Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol'],
      dtype='object')
In [23]:
# .loc is a DataFrame method
# format of .loc
# .loc['index_name_or_number', 'column_name]
drinks.describe().loc['25%', 'beer_servings']
Out[23]:
20.0

3. Alignment

In [24]:
drinks.head()
Out[24]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [26]:
drinks.continent.head()
Out[26]:
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object
In [27]:
drinks.set_index('country', inplace=True)
In [28]:
drinks.head()
Out[28]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa
In [29]:
drinks.continent.head()
Out[29]:
country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object
In [30]:
type(drinks.continent.head())
Out[30]:
pandas.core.series.Series
In [31]:
drinks.continent.value_counts()
Out[31]:
Africa           53
Europe           45
Asia             44
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64
In [32]:
type(drinks.continent.value_counts())
Out[32]:
pandas.core.series.Series
In [33]:
drinks.continent.value_counts().values
Out[33]:
array([53, 45, 44, 23, 16, 12])
In [35]:
# we can use the index to select values from the series
# this is similar to .loc for DataFrame
# because series does not have multiple columns, we can do this
drinks.continent.value_counts()['Africa']
Out[35]:
53
In [37]:
# sort based on values in the Series
drinks.continent.value_counts().sort_values()
Out[37]:
South America    12
Oceania          16
North America    23
Asia             44
Europe           45
Africa           53
Name: continent, dtype: int64
In [38]:
# sort index based on ascending order
drinks.continent.value_counts().sort_index()
Out[38]:
Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64
In [50]:
# creating a a pandas series
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people
Out[50]:
Albania    3000000
Andorra      85000
Name: population, dtype: int64
In [51]:
drinks.beer_servings.head()
Out[51]:
country
Afghanistan      0
Albania         89
Algeria         25
Andorra        245
Angola         217
Name: beer_servings, dtype: int64
In [53]:
# you can do math based on shared index
drinks.beer_servings * people
Out[53]:
Afghanistan                     NaN
Albania                 267000000.0
Algeria                         NaN
Andorra                  20825000.0
Angola                          NaN
Antigua & Barbuda               NaN
Argentina                       NaN
Armenia                         NaN
Australia                       NaN
Austria                         NaN
Azerbaijan                      NaN
Bahamas                         NaN
Bahrain                         NaN
Bangladesh                      NaN
Barbados                        NaN
Belarus                         NaN
Belgium                         NaN
Belize                          NaN
Benin                           NaN
Bhutan                          NaN
Bolivia                         NaN
Bosnia-Herzegovina              NaN
Botswana                        NaN
Brazil                          NaN
Brunei                          NaN
Bulgaria                        NaN
Burkina Faso                    NaN
Burundi                         NaN
Cabo Verde                      NaN
Cambodia                        NaN
                           ...     
Sudan                           NaN
Suriname                        NaN
Swaziland                       NaN
Sweden                          NaN
Switzerland                     NaN
Syria                           NaN
Tajikistan                      NaN
Tanzania                        NaN
Thailand                        NaN
Timor-Leste                     NaN
Togo                            NaN
Tonga                           NaN
Trinidad & Tobago               NaN
Tunisia                         NaN
Turkey                          NaN
Turkmenistan                    NaN
Tuvalu                          NaN
USA                             NaN
Uganda                          NaN
Ukraine                         NaN
United Arab Emirates            NaN
United Kingdom                  NaN
Uruguay                         NaN
Uzbekistan                      NaN
Vanuatu                         NaN
Venezuela                       NaN
Vietnam                         NaN
Yemen                           NaN
Zambia                          NaN
Zimbabwe                        NaN
dtype: float64
In [55]:
# axis=1, column concatenation
# beauty of automatic alignment using index
pd.concat([drinks, people], axis=1).head()
Out[55]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent population
Afghanistan 0 0 0 0.0 Asia NaN
Albania 89 132 54 4.9 Europe 3000000.0
Algeria 25 0 14 0.7 Africa NaN
Andorra 245 138 312 12.4 Europe 85000.0
Angola 217 57 45 5.9 Africa NaN
Tags: pandas