Handling missing values in pandas
Handling Missing Values

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Handling missing values in pandas

In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/uforeports'
ufo = pd.read_csv(url)
In [4]:
ufo.tail()
Out[4]:
City Colors Reported Shape Reported State Time
18236 Grant Park NaN TRIANGLE IL 12/31/2000 23:00
18237 Spirit Lake NaN DISK IA 12/31/2000 23:00
18238 Eagle River NaN NaN WI 12/31/2000 23:45
18239 Eagle River RED LIGHT WI 12/31/2000 23:45
18240 Ybor NaN OVAL FL 12/31/2000 23:59

NaN shows missing values

In [5]:
# checking which are NaN (True) using isnull
# creates DataFrame of True's and False's
ufo.isnull().tail()
Out[5]:
City Colors Reported Shape Reported State Time
18236 False True False False False
18237 False True False False False
18238 False True True False False
18239 False False False False False
18240 False True False False False
In [6]:
ufo.notnull().tail()
Out[6]:
City Colors Reported Shape Reported State Time
18236 True False True True True
18237 True False True True True
18238 True False False True True
18239 True True True True True
18240 True False True True True
In [14]:
# count number of missing values in each column
# sum True's
ufo.isnull().sum()
Out[14]:
City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64
In [11]:
# we create a pandas series of booleans
booleans = pd.Series([True, False, True])
In [13]:
# use sum() on series
# this would sum all True
booleans.sum()

# sum() uses axis=0 by default
# the following code does the same thing
# booleans.sum(axis=0)
Out[13]:
2
In [17]:
# this allows us to see the 25 rows of missing values in the column City
ufo[ufo.City.isnull()]
Out[17]:
City Colors Reported Shape Reported State Time
21 NaN NaN NaN LA 8/15/1943 0:00
22 NaN NaN LIGHT LA 8/15/1943 0:00
204 NaN NaN DISK CA 7/15/1952 12:30
241 NaN BLUE DISK MT 7/4/1953 14:00
613 NaN NaN DISK NV 7/1/1960 12:00
1877 NaN YELLOW CIRCLE AZ 8/15/1969 1:00
2013 NaN NaN NaN NH 8/1/1970 9:30
2546 NaN NaN FIREBALL OH 10/25/1973 23:30
3123 NaN RED TRIANGLE WV 11/25/1975 23:00
4736 NaN NaN SPHERE CA 6/23/1982 23:00
5269 NaN NaN NaN AZ 6/30/1985 21:30
6735 NaN NaN FORMATION TX 4/1/1992 2:00
7208 NaN NaN CIRCLE MI 10/4/1993 17:30
8828 NaN NaN TRIANGLE WA 10/30/1995 21:30
8967 NaN NaN VARIOUS CA 12/8/1995 18:00
9273 NaN NaN TRIANGLE OH 5/1/1996 3:00
9388 NaN NaN OVAL CA 6/12/1996 12:00
9587 NaN NaN EGG FL 8/24/1996 15:00
10399 NaN NaN TRIANGLE IL 6/15/1997 23:00
11625 NaN NaN CIRCLE TX 6/7/1998 7:00
12441 NaN RED FIREBALL WA 10/26/1998 17:58
15767 NaN NaN RECTANGLE NV 1/21/2000 11:30
15812 NaN NaN LIGHT NV 2/2/2000 3:00
16054 NaN GREEN NaN FL 3/11/2000 3:30
16608 NaN NaN SPHERE NY 6/15/2000 15:00

What do we do about the missing values?

Method 1: drop missing values

In [18]:
ufo.shape
Out[18]:
(18241, 5)
In [19]:
# drop rows if any of the 5 columns have a missing value
# how='any' is the default, you need not include this
ufo.dropna(how='any').shape
Out[19]:
(2486, 5)
In [20]:
# no changes are made
# you can change using inplace='true'
# ufo.dropna(how='any', inplace=True).shape
In [21]:
# drop row if all of the columns are missing
ufo.dropna(how='all').shape
Out[21]:
(18241, 5)
In [23]:
# drop row if either City or Shape Reported are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
Out[23]:
(15576, 5)
In [25]:
# drop row if both City and Shape Reported are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
Out[25]:
(18237, 5)

Method 2: Filling missing values

In [28]:
# this shows missing values (NaN)
ufo['Shape Reported'].value_counts(dropna=False)
Out[28]:
LIGHT        2803
NaN          2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
HEXAGON         1
PYRAMID         1
FLARE           1
DOME            1
Name: Shape Reported, dtype: int64
In [29]:
# inplace=True makes the change to the data
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)
In [32]:
ufo['Shape Reported'].value_counts(dropna=False)
Out[32]:
VARIOUS      2977
LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
HEXAGON         1
FLARE           1
DOME            1
PYRAMID         1
Name: Shape Reported, dtype: int64
Tags: pandas