Finding, filtering and converting Series to NaN

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Filtering and Converting Series to NaN

The short version: use a single .loc call, rather than chained indexing, whenever you select part of a DataFrame in order to assign to it.

In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/imdbratings'
movies = pd.read_csv(url)
In [3]:
movies.head()
Out[3]:
   star_rating  title                     content_rating  genre   duration  actors_list
0  9.3          The Shawshank Redemption  R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4  8.9          Pulp Fiction              R               Crime   154       [u'John Travolta', u'Uma Thurman', u'Samuel L....
In [4]:
# counting missing values
movies.content_rating.isnull().sum()
Out[4]:
3
In [5]:
movies.loc[movies.content_rating.isnull(), :]
Out[5]:
     star_rating  title                               content_rating  genre      duration  actors_list
187  8.2          Butch Cassidy and the Sundance Kid  NaN             Biography  110       [u'Paul Newman', u'Robert Redford', u'Katharin...
649  7.7          Where Eagles Dare                   NaN             Action     158       [u'Richard Burton', u'Clint Eastwood', u'Mary ...
936  7.4          True Grit                           NaN             Adventure  128       [u'John Wayne', u'Kim Darby', u'Glen Campbell']
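
As a rough sketch of what the two cells above do, here is the same pattern on a small made-up DataFrame. The `toy` frame and its values are my own assumption, not part of the IMDb data. `isnull()` gives a boolean Series, summing it counts the `True` values, and passing it to `.loc` keeps only the matching rows.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for the movies DataFrame
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'content_rating': ['R', np.nan, 'NOT RATED', np.nan],
})

# isnull() returns a boolean Series; True counts as 1 when summed
missing_mask = toy.content_rating.isnull()
n_missing = missing_mask.sum()

# the same mask used as the row indexer selects the rows with a missing rating
missing_rows = toy.loc[missing_mask, :]
print(n_missing)
print(missing_rows)
```

Note that 'NOT RATED' is an ordinary string, so `isnull()` does not count it as missing.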
In [12]:
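# counting content_rating unique values, including missing ones
# originally there are 65 'NOT RATED' and 3 NaN; we want to combine them into 68 NaN
# note: the output shown here already lists 68 NaN and no 'NOT RATED' row,
# so it appears to have been captured on a re-run after the conversion further down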
movies.content_rating.value_counts(dropna=False)
Out[12]:
R           460
PG-13       189
PG          123
NaN          68
APPROVED     47
UNRATED      38
G            32
PASSED        7
NC-17         7
X             4
GP            3
TV-MA         1
Name: content_rating, dtype: int64
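
To make the effect of `dropna=False` easier to see, here is a minimal sketch on a made-up Series (the `ratings` values are hypothetical). By default `value_counts()` leaves missing values out; with `dropna=False` they get their own row.

```python
import numpy as np
import pandas as pd

# hypothetical ratings, one of them missing
ratings = pd.Series(['R', 'R', 'NOT RATED', np.nan, 'PG'])

# default: NaN is excluded from the counts
counts_default = ratings.value_counts()

# dropna=False: NaN is counted like any other value
counts_with_nan = ratings.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```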
In [13]:
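# examining the rows where content_rating is 'NOT RATED'
# (the empty output here appears to come from a re-run after the conversion further down)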
movies.loc[movies.content_rating=='NOT RATED', :]
Out[13]:
star_rating title content_rating genre duration actors_list
In [8]:
# filtering only 1 column
movies.loc[movies.content_rating=='NOT RATED', 'content_rating']
Out[8]:
5      NOT RATED
6      NOT RATED
41     NOT RATED
63     NOT RATED
66     NOT RATED
72     NOT RATED
83     NOT RATED
87     NOT RATED
88     NOT RATED
89     NOT RATED
93     NOT RATED
100    NOT RATED
104    NOT RATED
105    NOT RATED
108    NOT RATED
109    NOT RATED
111    NOT RATED
116    NOT RATED
122    NOT RATED
128    NOT RATED
132    NOT RATED
133    NOT RATED
134    NOT RATED
140    NOT RATED
149    NOT RATED
165    NOT RATED
167    NOT RATED
169    NOT RATED
174    NOT RATED
178    NOT RATED
         ...    
215    NOT RATED
231    NOT RATED
234    NOT RATED
246    NOT RATED
252    NOT RATED
254    NOT RATED
255    NOT RATED
263    NOT RATED
265    NOT RATED
315    NOT RATED
328    NOT RATED
343    NOT RATED
405    NOT RATED
419    NOT RATED
427    NOT RATED
453    NOT RATED
478    NOT RATED
481    NOT RATED
491    NOT RATED
528    NOT RATED
531    NOT RATED
546    NOT RATED
573    NOT RATED
592    NOT RATED
647    NOT RATED
665    NOT RATED
673    NOT RATED
763    NOT RATED
827    NOT RATED
899    NOT RATED
Name: content_rating, dtype: object
In [9]:
import numpy as np
In [14]:
type(movies.loc[movies.content_rating=='NOT RATED', 'content_rating'])
Out[14]:
pandas.core.series.Series
In [15]:
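# assigning through a single .loc call raises no warning
# with chained indexing, e.g. movies[movies.content_rating=='NOT RATED'].content_rating = np.nan,
# pandas may emit a SettingWithCopyWarning and the assignment may not reach movies

# assigning np.nan to this selection converts every 'NOT RATED' to NaN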
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan
In [17]:
# the number of missing values has gone from 3 to 68 (3 original NaN + 65 'NOT RATED')
movies.content_rating.isnull().sum()
Out[17]:
68
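
The whole conversion can be sketched on a small made-up DataFrame (the `toy` frame is hypothetical, not the IMDb data). The last lines show `Series.replace` as one possible alternative; that is my own addition, not something from the original Q&A.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one real NaN and two 'NOT RATED' strings
toy = pd.DataFrame({'content_rating': ['R', 'NOT RATED', np.nan, 'NOT RATED', 'PG']})

missing_before = toy.content_rating.isnull().sum()

# select rows and column in one .loc call, then assign NaN to that selection
toy.loc[toy.content_rating == 'NOT RATED', 'content_rating'] = np.nan

missing_after = toy.content_rating.isnull().sum()
print(missing_before, missing_after)

# alternative sketch: replace() returns a new Series with the string swapped for NaN
replaced = pd.Series(['R', 'NOT RATED']).replace('NOT RATED', np.nan)
print(replaced.isnull().sum())
```

Because the row filter and the column name sit inside the same `.loc` call, the assignment goes to the original DataFrame rather than to a temporary object.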

Second example: SettingWithCopyWarning

In [18]:
# select top_movies
top_movies = movies.loc[movies.star_rating >= 9, :]
In [19]:
top_movies
Out[19]:
   star_rating  title                     content_rating  genre   duration  actors_list
0  9.3          The Shawshank Redemption  R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
In [22]:
# there's a SettingWithCopyWarning here because pandas is not sure whether top_movies is a view of movies or a copy
top_movies.loc[0, 'duration'] = 150
/Users/ritchieng/anaconda3/envs/py3k/lib/python3.5/site-packages/pandas/core/indexing.py:465: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
In [23]:
top_movies
Out[23]:
   star_rating  title                     content_rating  genre   duration  actors_list
0  9.3          The Shawshank Redemption  R               Crime   150       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
In [25]:
# to avoid the warning, make an explicit copy with .copy() whenever you plan to modify the subset

top_movies = movies.loc[movies.star_rating >= 9, :].copy()
In [27]:
top_movies.loc[0, 'duration'] = 150
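
To check what `.copy()` buys you, here is a minimal sketch on a made-up DataFrame (the `df` frame and its values are hypothetical). The subset is an independent object, so changing it leaves the original untouched.

```python
import pandas as pd

# hypothetical stand-in for the movies DataFrame
df = pd.DataFrame({
    'star_rating': [9.3, 9.2, 8.0],
    'duration': [142, 175, 110],
})

# explicit copy: top is now clearly its own DataFrame
top = df.loc[df.star_rating >= 9, :].copy()

# modifying the copy raises no warning
top.loc[0, 'duration'] = 150

print(top.loc[0, 'duration'])   # value in the copy
print(df.loc[0, 'duration'])    # value in the original
```

If the change is meant to land in the original DataFrame instead, assign through `.loc` on the original directly, as in the first example.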