Reading a subset of columns or rows, iterating through a Series or DataFrame, dropping all non-numeric columns, and passing arguments
Examining Dataset

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

1. Reading a subset of columns or rows

In [1]:
import pandas as pd
In [2]:
link = 'http://bit.ly/uforeports'
ufo = pd.read_csv(link)
In [3]:
ufo.columns
Out[3]:
Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')
In [4]:
# select columns by name (a list of strings)
cols = ['City', 'State']

ufo = pd.read_csv(link, usecols=cols)
In [5]:
ufo.head()
Out[5]:
                   City  State
0                Ithaca     NY
1           Willingboro     NJ
2               Holyoke     CO
3               Abilene     KS
4  New York Worlds Fair     NY
In [6]:
# select columns by position (a list of integers, counting from 0)
cols2 = [0, 4]

ufo = pd.read_csv(link, usecols=cols2)
In [7]:
ufo.head()
Out[7]:
                   City             Time
0                Ithaca   6/1/1930 22:00
1           Willingboro  6/30/1930 20:00
2               Holyoke  2/15/1931 14:00
3               Abilene   6/1/1931 13:00
4  New York Worlds Fair  4/18/1933 19:00
In [8]:
# read only the first n rows (here, 3)
ufo = pd.read_csv(link, nrows=3)
In [9]:
ufo
Out[9]:
          City  Colors Reported  Shape Reported  State             Time
0       Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1  Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2      Holyoke              NaN            OVAL     CO  2/15/1931 14:00
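
The two options above can also be combined in a single call. The following is a minimal sketch, not part of the original notebook: it assumes a small in-memory CSV (the hypothetical `csv_text`, built from the three rows shown above) instead of the downloaded file, so it runs without a network connection.

```python
import io
import pandas as pd

# Hypothetical stand-in for the UFO file: the header plus the three rows shown above.
csv_text = """City,Colors Reported,Shape Reported,State,Time
Ithaca,,TRIANGLE,NY,6/1/1930 22:00
Willingboro,,OTHER,NJ,6/30/1930 20:00
Holyoke,,OVAL,CO,2/15/1931 14:00
"""

# usecols and nrows together: keep two columns (by name) and read only two rows.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=['City', 'State'], nrows=2)

# The same idea with positions: column 0 is City, column 4 is Time.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 4], nrows=2)

print(by_name)
print(by_position)
```

Limiting both columns and rows at read time is mostly useful for a quick look at a large file, since only the requested part ends up in the DataFrame.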

2. Iterating through a Series and DataFrame

In [11]:
# a Series can be iterated over directly, just like a list
for c in ufo.City:
    print(c)
Ithaca
Willingboro
Holyoke
In [12]:
# for a DataFrame, iterrows() yields (index, row) pairs
# each row is a Series, so its values can be read by column name
for index, row in ufo.iterrows():
    print(index, row.City, row.State)
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO
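
As a possible alternative to `iterrows()`, pandas also offers `itertuples()`, which yields one namedtuple per row. The sketch below is my own addition, not from the original Q&A; `ufo_sample` is a hypothetical three-row DataFrame that mirrors the rows printed above.

```python
import pandas as pd

# Hypothetical three-row frame mirroring the rows printed above.
ufo_sample = pd.DataFrame({
    'City': ['Ithaca', 'Willingboro', 'Holyoke'],
    'State': ['NY', 'NJ', 'CO'],
})

# itertuples() yields a namedtuple per row; the index is available as row.Index.
lines = [f"{row.Index} {row.City} {row.State}" for row in ufo_sample.itertuples()]
print("\n".join(lines))

# When the goal is a new column rather than printing, a vectorized
# expression avoids the Python-level loop entirely.
combined = ufo_sample['City'] + ', ' + ufo_sample['State']
print(combined)
```

Attribute access such as `row.City` works here because the column names are valid Python identifiers; a name like `Colors Reported` would not be reachable this way.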

3. Dropping non-numeric columns from a DataFrame

In [13]:
link = 'http://bit.ly/drinksbycountry'
drinks = pd.read_csv(link)
In [14]:
# country and continent are the two non-numeric (object) columns
drinks.dtypes
Out[14]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
In [17]:
import numpy as np
# select_dtypes returns a new DataFrame with only the numeric columns; drinks itself is unchanged
drinks.select_dtypes(include=[np.number]).dtypes
Out[17]:
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object
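
`select_dtypes` also accepts the string `'number'` in place of `np.number`, and an `exclude` argument for the opposite selection. A minimal sketch, assuming a hypothetical two-row `drinks_sample` with illustrative values rather than the real dataset:

```python
import pandas as pd

# Hypothetical miniature of the drinks DataFrame; the values are illustrative only.
drinks_sample = pd.DataFrame({
    'country': ['A', 'B'],
    'beer_servings': [0, 89],
    'total_litres_of_pure_alcohol': [0.0, 4.9],
    'continent': ['Asia', 'Europe'],
})

# 'number' selects every numeric dtype (ints and floats) without importing NumPy.
numeric_only = drinks_sample.select_dtypes(include='number')

# exclude does the opposite: everything that is not numeric.
non_numeric = drinks_sample.select_dtypes(exclude='number')

print(numeric_only.dtypes)
print(non_numeric.columns.tolist())
```

To actually drop the non-numeric columns, assign the result back, for example `drinks_sample = drinks_sample.select_dtypes(include='number')`.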

4. Passing arguments: when to use a list or a string

In [19]:
# here you pass a single string: 'all' is a special value meaning every column
drinks.describe(include='all')
Out[19]:
        country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
count       193     193.000000       193.000000     193.000000                    193.000000        193
unique      193            NaN              NaN            NaN                           NaN          6
top     Bahrain            NaN              NaN            NaN                           NaN     Africa
freq          1            NaN              NaN            NaN                           NaN         53
mean        NaN     106.160622        80.994819      49.450777                      4.717098        NaN
std         NaN     101.143103        88.284312      79.697598                      3.773298        NaN
min         NaN       0.000000         0.000000       0.000000                      0.000000        NaN
25%         NaN      20.000000         4.000000       1.000000                      1.300000        NaN
50%         NaN      76.000000        56.000000       8.000000                      4.200000        NaN
75%         NaN     188.000000       128.000000      59.000000                      7.200000        NaN
max         NaN     376.000000       438.000000     370.000000                     14.400000        NaN
In [21]:
# here you pass a list of dtypes
# in Jupyter, press Shift+Tab inside the parentheses to see which arguments a method accepts
list_include = ['object', 'float64']
drinks.describe(include=list_include)
Out[21]:
        country  total_litres_of_pure_alcohol  continent
count       193                    193.000000        193
unique      193                           NaN          6
top     Bahrain                           NaN     Africa
freq          1                           NaN         53
mean        NaN                      4.717098        NaN
std         NaN                      3.773298        NaN
min         NaN                      0.000000        NaN
25%         NaN                      1.300000        NaN
50%         NaN                      4.200000        NaN
75%         NaN                      7.200000        NaN
max         NaN                     14.400000        NaN
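
To make the string-versus-list distinction concrete, here is a small sketch of the three common ways to call `describe`. It is my own illustration, assuming a hypothetical `drinks_sample` with made-up values; the `exclude` call is an addition that does not appear in the notebook above.

```python
import pandas as pd

# Hypothetical miniature of the drinks DataFrame; the values are illustrative only.
drinks_sample = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'beer_servings': [0, 89, 25],
    'total_litres_of_pure_alcohol': [0.0, 4.9, 0.7],
    'continent': ['Asia', 'Europe', 'Africa'],
})

# A single string: the special value 'all' summarizes every column.
summary_all = drinks_sample.describe(include='all')

# A list: one or more dtypes to include; 'number' covers ints and floats.
summary_numeric = drinks_sample.describe(include=['number'])

# exclude takes a list in the same way and leaves out the listed dtypes.
summary_other = drinks_sample.describe(exclude=['number'])

print(summary_numeric)
print(summary_other)
```

A rough rule of thumb: when an argument names one special option (such as `'all'`), pass a string; when it names a collection of things (dtypes, column names), pass a list, even if the list has only one element.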