Filtering rows of a pandas DataFrame by column value
Filtering

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Filtering rows of a pandas DataFrame by column value

In [1]:
import pandas as pd
In [2]:
# url

url = 'http://bit.ly/imdbratings'

# create DataFrame called movies
movies = pd.read_csv(url)
In [3]:
movies.head()
Out[3]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
In [4]:
movies.shape
Out[4]:
(979, 6)
In [6]:
# booleans
type(True)
type(False)
Out[6]:
bool

We want to create a list of booleans with the same number of rows as the movies' DataFrame

  • True if duration > 200
  • False if otherwise
In [7]:
# create list
booleans = []

# loop
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)
In [8]:
booleans[0:5]
Out[8]:
[False, False, True, False, False]
In [9]:
# len(booleans) is the same as the number of rows in movies' DataFrame
len(booleans)
Out[9]:
979
In [11]:
# convert booleans into a Pandas series
is_long = pd.Series(booleans)
In [12]:
is_long.head()
Out[12]:
0    False
1    False
2     True
3    False
4    False
dtype: bool
In [16]:
# pulls out genre
movies['genre']
Out[16]:
0          Crime
1          Crime
2          Crime
3         Action
4          Crime
5          Drama
6        Western
7      Adventure
8      Biography
9          Drama
10     Adventure
11        Action
12        Action
13         Drama
14     Adventure
15     Adventure
16         Drama
17         Drama
18     Biography
19        Action
20        Action
21         Crime
22         Drama
23         Crime
24         Drama
25        Comedy
26       Western
27         Drama
28         Crime
29        Comedy
         ...    
949       Comedy
950        Crime
951        Drama
952       Comedy
953    Adventure
954       Action
955        Drama
956       Comedy
957       Comedy
958        Drama
959       Comedy
960       Comedy
961    Biography
962       Comedy
963       Action
964    Biography
965      Mystery
966    Animation
967       Action
968        Drama
969        Crime
970        Drama
971       Comedy
972        Drama
973        Drama
974       Comedy
975    Adventure
976       Action
977       Horror
978        Crime
Name: genre, dtype: object
In [17]:
# this pulls out duration >= 200mins
movies[is_long]
Out[17]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

Faster method without a for loop

In [20]:
# this line of code replaces the for loop
# when you use a series name using pandas and use a comparison operator, it will loop through each row
is_long = movies.duration >= 200
is_long.head()
Out[20]:
0    False
1    False
2     True
3    False
4    False
Name: duration, dtype: bool
In [21]:
movies[is_long]
Out[21]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

Even better way to simplify movies[is_long]

In [25]:
movies[movies.duration >= 200]
Out[25]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

Additional tip: we want to study the duration and only the genre instead of all the columns

In [28]:
# this is a DataFrame, we use dot or bracket notation to get what we want
movies[movies.duration >= 200]['genre']
movies[movies.duration >= 200].genre
Out[28]:
2          Crime
7      Adventure
17         Drama
78         Crime
85     Adventure
142    Adventure
157        Drama
204    Adventure
445    Adventure
476        Drama
630    Biography
767       Action
Name: genre, dtype: object
In [29]:
# best practice is to use .loc instead of what we did above by selecting columns
movies.loc[movies.duration >= 200, 'genre']
Out[29]:
2          Crime
7      Adventure
17         Drama
78         Crime
85     Adventure
142    Adventure
157        Drama
204    Adventure
445    Adventure
476        Drama
630    Biography
767       Action
Name: genre, dtype: object
Tags: pandas