Making DataFrame smaller and faster in pandas
Making DataFrame Smaller and Faster

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Making pandas DataFrame smaller and faster

In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/drinksbycountry'
drinks = pd.read_csv(url)
In [3]:
drinks.head()
Out[3]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [4]:
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB
  • object usually means there's a string
  • memory usage
    • DataFrame takes at least 9.1kb of memory
    • It might be a lot more depending on what's in those object columns
    • In this case, they're just strings of countries and continents
In [6]:
# we can count the actual memory usage using the following command
drinks.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.4 KB
In [10]:
# we can check how much space each column is actually taking
# the numbers are in bytes, not kilobytes
drinks.memory_usage(deep=True)
Out[10]:
Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64
In [11]:
type(drinks.memory_usage(deep=True))
Out[11]:
pandas.core.series.Series
In [13]:
# since it is a series, we can use .sum()
drinks.memory_usage(deep=True).sum()
Out[13]:
31176
In [21]:
# there are only 6 unique values of continent
# we can replace strings with digits to save space
sorted(drinks.continent.unique())
Out[21]:
['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']
In [20]:
drinks.continent.head()
Out[20]:
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object
In [24]:
# converting continent from object to category 
# it stores the strings as integers
drinks['continent'] = drinks.continent.astype('category')
In [23]:
drinks.dtypes
Out[23]:
country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object
In [26]:
drinks.continent.head()
Out[26]:
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]
In [30]:
# .cat is similar to .str
# we can do more stuff after .cat
# we can see here how pandas represents the continents as integers
drinks.continent.cat.codes.head()
Out[30]:
0    1
1    2
2    0
3    2
4    0
dtype: int8
In [32]:
# before this conversion, it was over 12332 bytes
# now it is 584 bytes
drinks.memory_usage(deep=True)
Out[32]:
Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         584
dtype: int64
In [34]:
# we can convert country to a category too
drinks.dtypes
Out[34]:
country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object
In [35]:
drinks['country'] = drinks.country.astype('category')
In [39]:
# this is larger! 
# this is because we've too many categories
drinks.memory_usage(deep=True)
Out[39]:
Index                              80
country                         12974
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         584
dtype: int64
In [37]:
# now we've 193 digits
# it points to a lookup table with 193 strings!
drinks.country.cat.categories
Out[37]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=193)

The key to converting to category is to ensure that there are few categories to save memory usage. If there are too many, we should not convert.

In [46]:
# passing a dictionary {} to the DataFrame method = 
id_list =[100, 101, 102, 103]
quality_list = ['good', 'very good', 'good', 'excellent']
df = pd.DataFrame({'ID': id_list, 'quality': quality_list })
df
Out[46]:
ID quality
0 100 good
1 101 very good
2 102 good
3 103 excellent
In [52]:
# this sorts using alphabetical order
# but there is a logical ordering to these categories, we need to tell pandas there is a logical ordering
df.sort_values('quality')
Out[52]:
ID quality
0 100 good
2 102 good
1 101 very good
3 103 excellent
In [49]:
# how do we tell pandas there is a logical order?
quality_list_ordered = ['good', 'very good', 'excellent']
df['quality'] = df.quality.astype('category', categories=quality_list_ordered, ordered=True)
In [53]:
# here we have good < very good < excellent
df.quality
Out[53]:
0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]
In [56]:
# now it sorts using the logical order we defined
df.sort_values('quality')
Out[56]:
ID quality
0 100 good
2 102 good
1 101 very good
3 103 excellent
In [58]:
# we can now use boolean conditions with this
# here we want all columns where the row > good
df.loc[df.quality > 'good', :]
Out[58]:
ID quality
1 101 very good
3 103 excellent
Tags: pandas