Calculate variance, interquartile range and other measures of variability
Variability of Data with Pandas

Variability of Data

Theory

  • Visualizations of data
    • Histogram
    • Boxplots
  • Range = max - min
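    • Changes whenever a new value falls outside the current min or max
      • Hence, the range is very sensitive to outliers
      • To reduce this sensitivity, statisticians typically cut off the top and bottom 25% of the data
        • The spread of the remaining middle 50% is the interquartile range (IQR) = Q3 - Q1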
  • Quartiles
    • Split the sorted data into two halves
      • Median of everything = Q2
      • First half's median = Q1
      • Second half's median = Q3
      • IQR = Q3 - Q1
        • About 50% of data falls within the IQR
        • IQR is not affected by every value in the dataset
        • IQR is not affected by outliers
  • Outliers
    • A common rule of thumb flags a value x as an outlier if
      • x < Q1 - 1.5*IQR, or
      • x > Q3 + 1.5*IQR
  • Deviation from mean = x_i - x_mean
  • Mean absolute deviation = sum(|x_i - x_mean|) / n
    • The absolute value is needed because the raw deviations always sum to zero
    • n is the number of examples
  • Squared deviation = (x_i - x_mean)^2
  • Mean squared deviation = variance = sum((x_i - x_mean)^2) / n
    • Sum of squares (SS) = sum((x_i - x_mean)^2)
  • Standard deviation (SD) = variance^0.5
    • For approximately normally distributed data:
      • About 68% of data falls within 1 SD of the mean
      • About 95% of data falls within 2 SD of the mean
      • About 99.7% of data falls within 3 SD of the mean
  • Bessel's Correction
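    • In general, a sample underestimates the variability of the population it was drawn from
      • This is because most sampled values come from the middle of the distribution, and deviations are measured from the sample mean rather than the true population mean
      • We can correct for this using Bessel's Correction
        • We divide by n - 1 instead of n (in pandas, ddof = 1)
        • This makes the variance and standard deviation slightly bigger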
    • In summary
      • If we are estimating the standard deviation of a population from a sample, we divide by n - 1
      • If we have data for the entire population, we divide by n
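
The IQR and the 1.5*IQR outlier rule from the theory above can be sketched in pandas as follows. This is a minimal illustration, not the only way to do it: the variable names (`values`, `lower_fence`, `upper_fence`) are made up for this example, and the data is the same ten values used in the pandas section below. Note that `Series.quantile` interpolates linearly by default, so Q1 and Q3 can differ slightly from the "median of each half" method described above.

```python
import pandas as pd

# Illustrative data: the same ten values used in the pandas section below.
values = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])

# Quartiles via linear interpolation (the pandas default).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Rule-of-thumb fences: anything beyond them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(iqr)
print(outliers.tolist())
```

With these values, only 88830 lies beyond the upper fence; 78070 is large but still inside it.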

Calculating variability of data using pandas

In [1]:
import pandas as pd
In [31]:
lst = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]
sample = pd.Series(lst)
In [32]:
type(sample)
Out[32]:
pandas.core.series.Series
In [33]:
sample
Out[33]:
0    33219
1    36254
2    38801
3    46335
4    46840
5    47596
6    55130
7    56863
8    78070
9    88830
dtype: int64
In [34]:
sample.mean()
Out[34]:
52793.800000000003
In [35]:
sample.median()
Out[35]:
47218.0
In [47]:
# standard deviation 
# default ddof = 1
# divided by n - 1
sample.std()
Out[47]:
18000.701849279834
In [48]:
# standard deviation 
# ddof = 0
# divided by n
sample.std(ddof=0)
Out[48]:
17076.965197598776
In [43]:
# variance with ddof = 0
# sum((x_i - x_mean)^2) / n
sample.var(ddof=0)
Out[43]:
291622740.35999984
In [44]:
# variance with ddof = 1
# sum((x_i - x_mean)^2) / (n-1)
sample.var(ddof=1)
Out[44]:
324025267.06666648
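
To connect these results back to the formulas in the theory section, here is a rough sketch that computes the sum of squares by hand and divides it by n and by n - 1. The names `ss`, `var_population` and `var_sample` are chosen for this example only; the data is the same `sample` as above.

```python
import pandas as pd

sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])

n = len(sample)

# Deviation of each value from the mean, then the sum of squares (SS).
deviations = sample - sample.mean()
ss = (deviations ** 2).sum()

# Dividing by n treats the data as the whole population (ddof=0).
var_population = ss / n
# Dividing by n - 1 applies Bessel's correction (ddof=1, the pandas default).
var_sample = ss / (n - 1)

# The standard deviation is the square root of the variance.
std_sample = var_sample ** 0.5

print(var_population, sample.var(ddof=0))
print(var_sample, sample.var(ddof=1))
print(std_sample, sample.std())
```

Each printed pair should agree up to floating-point rounding, and the n - 1 version is always the larger of the two variances.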
In [45]:
# mean (average) absolute deviation
# Series.mad() was removed in pandas 2.0, so compute it directly
(sample - sample.mean()).abs().mean()
Out[45]:
13543.560000000001

Summary

In [54]:
lst2 = [38946, 43420, 49191, 50430, 50557, 52580, 53595, 54135, 60181, 62076]
In [55]:
sample2 = pd.Series(lst2)
In [61]:
print(sample2.std(ddof=0))
print(sample2.mean())
print((sample2 - sample2.mean()).abs().mean())
6557.16326547
51511.1
5002.3
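
To compare the spread of the two samples side by side, one option is a small helper that collects several measures at once. The function `spread_summary` below is a hypothetical helper written for this example, not part of pandas, and the choice of measures is just one reasonable selection from the theory section.

```python
import pandas as pd


def spread_summary(s):
    """Return a few spread measures for a numeric Series (hypothetical helper)."""
    return pd.Series({
        "range": s.max() - s.min(),
        "iqr": s.quantile(0.75) - s.quantile(0.25),
        "std_ddof0": s.std(ddof=0),               # divide by n
        "mad": (s - s.mean()).abs().mean(),       # mean absolute deviation
    })


sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])
sample2 = pd.Series([38946, 43420, 49191, 50430, 50557,
                     52580, 53595, 54135, 60181, 62076])

# One column per sample, one row per measure.
summary = pd.DataFrame({
    "sample": spread_summary(sample),
    "sample2": spread_summary(sample2),
})
print(summary)
```

Every measure comes out smaller for `sample2`, which matches the impression that its values are packed more tightly around the mean.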

Reading from a csv

In [63]:
path = './salary.csv'
salary = pd.read_csv(path)
In [65]:
# read_csv returns a DataFrame, here with a single 'salary' column
salary.head()
Out[65]:
salary
0 59147.29
1 61379.14
2 55683.19
3 56272.76
4 52055.88
In [67]:
# standard deviation
# ddof = 0
# divided by n instead of n - 1
salary.std(ddof=0)
Out[67]:
salary    10656.952669
dtype: float64
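
Because `read_csv` returns a DataFrame, `salary.std(ddof=0)` gives one value per column, which is why the result above is labelled `salary`. Selecting the column first yields a Series and a plain number. The sketch below assumes nothing about the real `salary.csv` beyond its column name: it uses an in-memory stand-in built from only the five rows shown by `salary.head()`, so its result will not match the 10656.952669 computed from the full file.

```python
import io

import pandas as pd

# Stand-in for salary.csv: ONLY the five rows shown by salary.head() above.
csv_text = "salary\n59147.29\n61379.14\n55683.19\n56272.76\n52055.88\n"
salary = pd.read_csv(io.StringIO(csv_text))

# Selecting a single column returns a Series.
salary_series = salary["salary"]

# On a Series, std() returns a scalar.
std_scalar = salary_series.std(ddof=0)

# On a DataFrame, std() returns a Series with one entry per column.
std_per_column = salary.std(ddof=0)

print(type(salary_series))
print(std_scalar)
print(std_per_column)
```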