Calculate variance, interquartile range and other measures of variability
Variability of Data with Pandas

Variability of Data

Theory

• Visualizations of data
• Histogram
• Boxplots
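• Range = max - min
• The range changes whenever a new data point falls below the current min or above the current max
• Hence, it is very sensitive to outliers
• To reduce this sensitivity, statisticians typically cut off the top and bottom 25% of the data
• The spread of the remaining middle 50% is called the interquartile range (IQR) = Q3 - Q1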
• Quartiles
• Sort the data and split it in half
• Median of everything = Q2
• Median of the lower half = Q1
• Median of the upper half = Q3
• IQR = Q3 - Q1
• About 50% of data falls within the IQR
• IQR is not affected by every value in the dataset
• IQR is not affected by outliers
• Outliers
• A common rule of thumb flags a value as an outlier if it lies far outside the quartiles
• Outlier < Q1 - 1.5*IQR
• Outlier > Q3 + 1.5*IQR
• Deviation from mean = x_i - x_mean
• Mean absolute deviation = sum(|x_i - x_mean|) / n
• n is the number of examples
• Squared deviation = (x_i - x_mean)^2
• Mean squared deviation = variance = sum((x_i - x_mean)^2) / n
• Sum of squares (SS) = sum((x_i - x_mean)^2)
• Standard deviation (SD) = variance^0.5
• For approximately normally distributed data, about 68% of data falls within 1 SD of the mean
• About 95% of data falls within 2 SD of the mean
• About 99.7% of data falls within 3 SD of the mean
• Bessel's Correction
• In general, samples underestimate the variability of a population
• This is because most population values lie near the middle, so a sample tends to miss the extremes
• We can correct for this using Bessel's Correction
• We divide by n - 1 instead of n (delta degrees of freedom, ddof = 1)
• This will make the standard deviation bigger
• In summary
• If we are estimating the standard deviation of the population from a sample, we divide by n - 1
• If we have the whole population and are measuring its standard deviation directly, we divide by n
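The quartile and outlier rules above can be sketched in plain Python. This is a minimal illustration, assuming the "median of each half" definition of Q1 and Q3 from the notes; the values in `data` are made up for the example and are not from the notebook.

```python
from statistics import median

# Illustrative values only (not the notebook's data)
data = sorted([1, 2, 4, 5, 7, 8, 9, 30])

n = len(data)
lower_half = data[: n // 2]        # values below the median
upper_half = data[(n + 1) // 2:]   # values above the median (middle value skipped when n is odd)

data_range = data[-1] - data[0]    # range = max - min
q1 = median(lower_half)
q2 = median(data)
q3 = median(upper_half)
iqr = q3 - q1

# 1.5 * IQR rule for flagging outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(data_range, q1, q2, q3, iqr, outliers)
```

The single large value stretches the range to 29, while the IQR stays at 5.5 and the value itself is flagged as an outlier.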

Calculating variability of data using pandas

In [1]:
import pandas as pd

In [31]:
lst = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]
sample = pd.Series(lst)

In [32]:
type(sample)

Out[32]:
pandas.core.series.Series
In [33]:
sample

Out[33]:
0    33219
1    36254
2    38801
3    46335
4    46840
5    47596
6    55130
7    56863
8    78070
9    88830
dtype: int64
In [34]:
sample.mean()

Out[34]:
52793.800000000003
In [35]:
sample.median()

Out[35]:
47218.0
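The IQR and the 1.5*IQR outlier rule can also be computed with pandas. The sketch below rebuilds the same series so that it runs on its own. Note that `Series.quantile` interpolates linearly between data points by default, so its Q1 and Q3 can differ slightly from the "median of each half" definition in the theory section.

```python
import pandas as pd

sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])

# Quartiles (linear interpolation is the pandas default)
q1 = sample.quantile(0.25)
q3 = sample.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR rule for flagging outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, q3, iqr)
print(outliers.tolist())
```

With this data only the largest value, 88830, lies beyond the upper fence.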
In [47]:
# standard deviation
# default ddof = 1
# divided by n - 1
sample.std()

Out[47]:
18000.701849279834
In [48]:
# standard deviation
# ddof = 0
# divided by n
sample.std(ddof=0)

Out[48]:
17076.965197598776
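As a rough check of the 68% / 95% / 99.7% rule, we can count the share of values within 1, 2 and 3 standard deviations of the mean. This is only a sketch: the rule describes approximately normal data, and a sample of 10 values can match it only loosely. The name `share_within` is introduced here for illustration.

```python
import pandas as pd

sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])

mean = sample.mean()
sd = sample.std(ddof=0)  # treat the 10 values as the whole population

# Fraction of values whose distance from the mean is at most k standard deviations
share_within = {}
for k in (1, 2, 3):
    share_within[k] = ((sample - mean).abs() <= k * sd).mean()
    print(k, share_within[k])
```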
In [43]:
# variance with ddof = 0
# sum((x_i - x_mean)^2) / n
sample.var(ddof=0)

Out[43]:
291622740.35999984
In [44]:
# variance with ddof = 1
# sum((x_i - x_mean)^2) / (n-1)
sample.var(ddof=1)

Out[44]:
324025267.06666648
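To make the effect of `ddof` concrete, the two variances can be reproduced by hand from the sum of squares. This is a minimal sketch that follows the formulas in the theory section; the variable names are chosen here for illustration.

```python
import pandas as pd

sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])

n = len(sample)
deviations = sample - sample.mean()       # x_i - x_mean
sum_of_squares = (deviations ** 2).sum()  # SS

var_population = sum_of_squares / n       # same as sample.var(ddof=0)
var_sample = sum_of_squares / (n - 1)     # Bessel's correction, same as sample.var(ddof=1)

print(var_population, sample.var(ddof=0))
print(var_sample, sample.var(ddof=1))
```

Dividing by n - 1 instead of n scales the variance by n / (n - 1), which is why the corrected value is always the larger of the two.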
In [45]:
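# mean (average) absolute deviation
# computed explicitly, since Series.mad() was removed in pandas 2.0
(sample - sample.mean()).abs().mean()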

Out[45]:
13543.560000000001

Summary

In [54]:
lst2 = [38946, 43420, 49191, 50430, 50557, 52580, 53595, 54135, 60181, 62076]

In [55]:
sample2 = pd.Series(lst2)

In [61]:
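print(sample2.std(ddof=0))
print(sample2.mean())
# mean absolute deviation (the third value printed below)
print((sample2 - sample2.mean()).abs().mean())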

6557.16326547
51511.1
5002.3


In [63]:
path = './salary.csv'

In [65]:
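# read the data into a pandas DataFrame
# (the loading code is missing from the source; pd.read_csv is assumed here)
salary = pd.read_csv(path)
salary.head()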

Out[65]:
salary
0 59147.29
1 61379.14
2 55683.19
3 56272.76
4 52055.88
In [67]:
# standard deviation
# ddof = 0
# divided by n instead of n - 1
salary.std(ddof=0)

Out[67]:
salary    10656.952669
dtype: float64