Removing duplicate rows

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Finding and removing duplicate rows in pandas

In [4]:
import pandas as pd
In [6]:
# read a dataset of movie reviewers (modifying the default parameter values for read_table)
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
url = 'http://bit.ly/movieusers'
users = pd.read_table(url, sep='|', header=None, names=user_cols, index_col='user_id')
In [7]:
users.head()
Out[7]:
age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
In [8]:
users.shape
Out[8]:
(943, 4)

If we want to identify duplicate zip_code rows

In [15]:
# use df.cat_name.duplicated()
# output True if row above is the same
users.zip_code.duplicated()
Out[15]:
user_id
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29      True
30     False
       ...  
914    False
915    False
916    False
917    False
918    False
919     True
920    False
921    False
922     True
923    False
924    False
925    False
926    False
927    False
928     True
929     True
930    False
931    False
932    False
933    False
934     True
935    False
936    False
937     True
938    False
939    False
940     True
941    False
942    False
943    False
Name: zip_code, dtype: bool
In [10]:
# type
type(users.zip_code.duplicated())
Out[10]:
pandas.core.series.Series
In [12]:
# we can use .count() since it's a series
# there're 148 duplicates
users.zip_code.duplicated().sum()
Out[12]:
148
In [13]:
# it will output True if entire row is duplicated (row above)
users.duplicated()
Out[13]:
user_id
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
30     False
       ...  
914    False
915    False
916    False
917    False
918    False
919    False
920    False
921    False
922    False
923    False
924    False
925    False
926    False
927    False
928    False
929    False
930    False
931    False
932    False
933    False
934    False
935    False
936    False
937    False
938    False
939    False
940    False
941    False
942    False
943    False
dtype: bool
In [17]:
# examine duplicated rows
users.loc[users.duplicated(), :]
Out[17]:
age gender occupation zip_code
user_id
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301
In [18]:
# keep='first'
# mark duplicates as True except for the first occurence
users.loc[users.duplicated(keep='first'), :]
Out[18]:
age gender occupation zip_code
user_id
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301
In [19]:
# keep='last'
# 7 rows that are counted as duplicates, keeping the later one

# this is useful for splitting the data
users.loc[users.duplicated(keep='last'), :]
Out[19]:
age gender occupation zip_code
user_id
67 17 M student 60402
85 51 M educator 20003
198 21 F student 55414
350 32 M student 97301
428 28 M student 55414
437 27 F other 20009
460 44 F other 60630
In [22]:
# mark all duplicates as True
# this combines the two tables above
users.loc[users.duplicated(keep=False), :]
Out[22]:
age gender occupation zip_code
user_id
67 17 M student 60402
85 51 M educator 20003
198 21 F student 55414
350 32 M student 97301
428 28 M student 55414
437 27 F other 20009
460 44 F other 60630
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301

Dropping duplicates

In [23]:
# drops the 7 rows
users.drop_duplicates(keep='first').shape
Out[23]:
(936, 4)
In [24]:
# drops the last version of the 7 duplicate rows
users.drop_duplicates(keep='last').shape
Out[24]:
(936, 4)
In [25]:
# drops all 14 rows
users.drop_duplicates(keep=False).shape
Out[25]:
(929, 4)
In [28]:
# only consider "age" and "zip_code" as the relevant columns
users.duplicated(subset=['age', 'zip_code']).sum()
Out[28]:
16
Tags: pandas