Keywords: bernoulli distribution |  uniform distribution |  empirical distribution |  elections

Let us consider a table of probabilities that PredictWise published on October 2, 2012 for the US presidential election. PredictWise aggregated polling data and, for each state, estimated the probability that Obama or Romney would win. Here are those estimated probabilities, loaded into a pandas dataframe:

# imports used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

predictwise = pd.read_csv('data/predictwise.csv')
predictwise.head()
   Obama  Romney      States  Votes
0  0.000   1.000     Alabama      9
1  0.000   1.000      Alaska      3
2  0.062   0.938     Arizona     11
3  0.000   1.000    Arkansas      6
4  1.000   0.000  California     55

We can set the states as the index; after all, each state has a unique name.

predictwise = predictwise.set_index('States')
predictwise
Obama Romney Votes
States
Alabama 0.000 1.000 9
Alaska 0.000 1.000 3
Arizona 0.062 0.938 11
Arkansas 0.000 1.000 6
California 1.000 0.000 55
Colorado 0.807 0.193 9
Connecticut 1.000 0.000 7
Delaware 1.000 0.000 3
District of Columbia 1.000 0.000 3
Florida 0.720 0.280 29
Georgia 0.004 0.996 16
Hawaii 1.000 0.000 4
Idaho 0.000 1.000 4
Illinois 1.000 0.000 20
Indiana 0.036 0.964 11
Iowa 0.837 0.163 6
Kansas 0.000 1.000 6
Kentucky 0.000 1.000 8
Louisiana 0.000 1.000 8
Maine 1.000 0.000 4
Maryland 1.000 0.000 10
Massachusetts 1.000 0.000 11
Michigan 0.987 0.013 16
Minnesota 0.982 0.018 10
Mississippi 0.000 1.000 6
Missouri 0.074 0.926 10
Montana 0.046 0.954 3
Nebraska 0.000 1.000 5
Nevada 0.851 0.149 6
New Hampshire 0.857 0.143 4
New Jersey 0.998 0.002 14
New Mexico 0.985 0.015 5
New York 1.000 0.000 29
North Carolina 0.349 0.651 15
North Dakota 0.025 0.975 3
Ohio 0.890 0.110 18
Oklahoma 0.000 1.000 7
Oregon 0.976 0.024 7
Pennsylvania 0.978 0.022 20
Rhode Island 1.000 0.000 4
South Carolina 0.000 1.000 9
South Dakota 0.001 0.999 3
Tennessee 0.001 0.999 11
Texas 0.000 1.000 38
Utah 0.000 1.000 6
Vermont 1.000 0.000 3
Virginia 0.798 0.202 13
Washington 0.999 0.001 12
West Virginia 0.002 0.998 5
Wisconsin 0.925 0.075 10
Wyoming 0.000 1.000 3

numpy

Let's do a short intro to numpy. You can obtain a numpy array from a pandas dataframe:

predictwise.Obama.values
array([ 0.   ,  0.   ,  0.062,  0.   ,  1.   ,  0.807,  1.   ,  1.   ,
        1.   ,  0.72 ,  0.004,  1.   ,  0.   ,  1.   ,  0.036,  0.837,
        0.   ,  0.   ,  0.   ,  1.   ,  1.   ,  1.   ,  0.987,  0.982,
        0.   ,  0.074,  0.046,  0.   ,  0.851,  0.857,  0.998,  0.985,
        1.   ,  0.349,  0.025,  0.89 ,  0.   ,  0.976,  0.978,  1.   ,
        0.   ,  0.001,  0.001,  0.   ,  0.   ,  1.   ,  0.798,  0.999,
        0.002,  0.925,  0.   ])

A numpy array has a shape:

predictwise.Obama.values.shape
(51,)

And a type, which is what makes numpy arrays efficient:

predictwise.Obama.values.dtype
dtype('float64')

You can construct a sub-dataframe in pandas with the double-bracket selection notation below:

predictwise[['Obama', 'Romney']]
Obama Romney
States
Alabama 0.000 1.000
Alaska 0.000 1.000
Arizona 0.062 0.938
Arkansas 0.000 1.000
California 1.000 0.000
Colorado 0.807 0.193
Connecticut 1.000 0.000
Delaware 1.000 0.000
District of Columbia 1.000 0.000
Florida 0.720 0.280
Georgia 0.004 0.996
Hawaii 1.000 0.000
Idaho 0.000 1.000
Illinois 1.000 0.000
Indiana 0.036 0.964
Iowa 0.837 0.163
Kansas 0.000 1.000
Kentucky 0.000 1.000
Louisiana 0.000 1.000
Maine 1.000 0.000
Maryland 1.000 0.000
Massachusetts 1.000 0.000
Michigan 0.987 0.013
Minnesota 0.982 0.018
Mississippi 0.000 1.000
Missouri 0.074 0.926
Montana 0.046 0.954
Nebraska 0.000 1.000
Nevada 0.851 0.149
New Hampshire 0.857 0.143
New Jersey 0.998 0.002
New Mexico 0.985 0.015
New York 1.000 0.000
North Carolina 0.349 0.651
North Dakota 0.025 0.975
Ohio 0.890 0.110
Oklahoma 0.000 1.000
Oregon 0.976 0.024
Pennsylvania 0.978 0.022
Rhode Island 1.000 0.000
South Carolina 0.000 1.000
South Dakota 0.001 0.999
Tennessee 0.001 0.999
Texas 0.000 1.000
Utah 0.000 1.000
Vermont 1.000 0.000
Virginia 0.798 0.202
Washington 0.999 0.001
West Virginia 0.002 0.998
Wisconsin 0.925 0.075
Wyoming 0.000 1.000

…and when you ask for its .values, you get a 2D numpy array (more on 2D arrays further down in this document):

predictwise[['Obama', 'Romney']].values
array([[ 0.   ,  1.   ],
       [ 0.   ,  1.   ],
       [ 0.062,  0.938],
       [ 0.   ,  1.   ],
       [ 1.   ,  0.   ],
       [ 0.807,  0.193],
       [ 1.   ,  0.   ],
       [ 1.   ,  0.   ],
       [ 1.   ,  0.   ],
       [ 0.72 ,  0.28 ],
       [ 0.004,  0.996],
       [ 1.   ,  0.   ],
       [ 0.   ,  1.   ],
       [ 1.   ,  0.   ],
       [ 0.036,  0.964],
       [ 0.837,  0.163],
       [ 0.   ,  1.   ],
       [ 0.   ,  1.   ],
       [ 0.   ,  1.   ],
       [ 1.   ,  0.   ],
       [ 1.   ,  0.   ],
       [ 1.   ,  0.   ],
       [ 0.987,  0.013],
       [ 0.982,  0.018],
       [ 0.   ,  1.   ],
       [ 0.074,  0.926],
       [ 0.046,  0.954],
       [ 0.   ,  1.   ],
       [ 0.851,  0.149],
       [ 0.857,  0.143],
       [ 0.998,  0.002],
       [ 0.985,  0.015],
       [ 1.   ,  0.   ],
       [ 0.349,  0.651],
       [ 0.025,  0.975],
       [ 0.89 ,  0.11 ],
       [ 0.   ,  1.   ],
       [ 0.976,  0.024],
       [ 0.978,  0.022],
       [ 1.   ,  0.   ],
       [ 0.   ,  1.   ],
       [ 0.001,  0.999],
       [ 0.001,  0.999],
       [ 0.   ,  1.   ],
       [ 0.   ,  1.   ],
       [ 1.   ,  0.   ],
       [ 0.798,  0.202],
       [ 0.999,  0.001],
       [ 0.002,  0.998],
       [ 0.925,  0.075],
       [ 0.   ,  1.   ]])

A “reshape” can convert a 1D numpy array into a 2D one:

predictwise['Obama'].values.reshape(-1,1)
array([[ 0.   ],
       [ 0.   ],
       [ 0.062],
       [ 0.   ],
       [ 1.   ],
       [ 0.807],
       [ 1.   ],
       [ 1.   ],
       [ 1.   ],
       [ 0.72 ],
       [ 0.004],
       [ 1.   ],
       [ 0.   ],
       [ 1.   ],
       [ 0.036],
       [ 0.837],
       [ 0.   ],
       [ 0.   ],
       [ 0.   ],
       [ 1.   ],
       [ 1.   ],
       [ 1.   ],
       [ 0.987],
       [ 0.982],
       [ 0.   ],
       [ 0.074],
       [ 0.046],
       [ 0.   ],
       [ 0.851],
       [ 0.857],
       [ 0.998],
       [ 0.985],
       [ 1.   ],
       [ 0.349],
       [ 0.025],
       [ 0.89 ],
       [ 0.   ],
       [ 0.976],
       [ 0.978],
       [ 1.   ],
       [ 0.   ],
       [ 0.001],
       [ 0.001],
       [ 0.   ],
       [ 0.   ],
       [ 1.   ],
       [ 0.798],
       [ 0.999],
       [ 0.002],
       [ 0.925],
       [ 0.   ]])

You can construct a numpy array directly as well:

my_array = np.array([1, 2, 3, 4])
my_array
array([1, 2, 3, 4])

In general you should manipulate numpy arrays using numpy module functions (np.mean, for example): these are implemented in compiled code and are much faster than looping in Python.

print(my_array.mean())
print(np.mean(my_array))
2.5
2.5
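
The speed difference is easy to see; here is a rough, machine-dependent sketch using timeit (the array size and repetition count are arbitrary choices for illustration):

import timeit

arr = np.random.random(1000000)
lst = arr.tolist()
# vectorized numpy mean vs. a pure-python mean over a list
t_np = timeit.timeit(lambda: np.mean(arr), number=10)
t_py = timeit.timeit(lambda: sum(lst)/len(lst), number=10)
print("numpy: %.4fs, pure python: %.4fs" % (t_np, t_py))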

The way we constructed the numpy array above seems redundant; after all, we already had a regular python list. It is the other ways of constructing numpy arrays that make them truly useful.

There are many such numpy array constructors. Here are some commonly used constructors. Look them up in the documentation.

print(np.ones(10))
np.ones(10, dtype='int') # generates 10 integer ones
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

np.dtype(float).itemsize # in bytes
8

Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a float of size appropriate for the machine (64 bit on a 64 bit machine).
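
For example, numpy infers (or is told) the dtype at construction time; a minimal sketch:

print(np.array([1, 2, 3]).dtype)                     # int64 on most 64-bit machines
print(np.array([1.0, 2, 3]).dtype)                   # float64: one float promotes the whole array
print(np.array([1, 2, 3]).astype('float32').dtype)   # astype converts explicitly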

print(np.zeros(10))
print(np.random.random(10))
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.09500079  0.56750287  0.88898257  0.03858948  0.22709491  0.46560001
  0.9110299   0.87729626  0.47123426  0.64749469]
The sample mean and standard deviation are 0.031173 and 0.986213, respectively.

Let’s plot a histogram of these normally distributed samples:

plt.hist(normal_array, bins=30);

[figure: histogram of normal_array with 30 bins]

You can normalize this histogram so that the bar areas sum to 1:

plt.hist(normal_array, bins=30, density=True, alpha=0.5);  # density=True replaces the deprecated normed=True

[figure: normalized histogram of normal_array]

Even better, you can use Seaborn’s distplot, which overlays a kernel density estimate (in newer seaborn versions, distplot is deprecated in favor of histplot or displot).

sns.distplot(normal_array)
<matplotlib.axes._subplots.AxesSubplot at 0x120d1f2b0>

[figure: seaborn distplot of normal_array with kernel density overlay]

np.arange builds an evenly spaced grid, and np.random.choice draws samples from it, without or with replacement:

grid = np.arange(0., 1.01, 0.1)
print(grid)
np.random.choice(grid, 5, replace=False)
[ 0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1. ]

array([ 0.8,  0.1,  0.2,  0.4,  0.6])

np.random.choice(grid, 25, replace=True)
array([ 0. ,  0.8,  0. ,  0.6,  1. ,  0.6,  0. ,  0.8,  0.6,  0.3,  0.6,
        0. ,  0.7,  1. ,  0.6,  0.8,  0.2,  0.3,  0.2,  0.6,  0.6,  0.8,
        0.8,  0.6,  0.9])

Vector operations

What does this mean? It means that instead of adding two arrays, element by element, you can just say: add the two arrays. Note that this behavior is very different from python lists.

first = np.ones(5)
second = np.ones(5)
first + second
array([ 2.,  2.,  2.,  2.,  2.])
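
To see how different this is from python lists, compare the same + operation on a list and on an array:

print([1, 2, 3] + [4, 5, 6])                      # lists concatenate: [1, 2, 3, 4, 5, 6]
print(np.array([1, 2, 3]) + np.array([4, 5, 6]))  # arrays add elementwise: [5 7 9]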

Numpy supports a concept known as broadcasting, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.

print(first + 1)
print(first*4)
print(first*second) # itemwise
print(first@second) # dot product, identical to np.dot(first, second)
[ 2.  2.  2.  2.  2.]
[ 4.  4.  4.  4.  4.]
[ 1.  1.  1.  1.  1.]
5.0
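
Broadcasting also extends to arrays of different shapes; peeking ahead to the two-dimensional arrays introduced in the next section, a (4,) row and a (3, 1) column each stretch to fill a (3, 4) array. A minimal sketch:

a = np.ones((3, 4))
row = np.arange(4)                  # shape (4,): broadcast across each of the 3 rows
col = np.arange(3).reshape(-1, 1)   # shape (3, 1): broadcast across each of the 4 columns
print(a + row)
print(a + col)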

2D arrays

Similarly, we can create two-dimensional arrays.

my_array2d = np.array([ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12] ])

# 3 x 4 array of ones
ones_2d = np.ones([3, 4])
print(ones_2d)
# 3 x 4 array of ones with random noise
ones_noise = ones_2d + .01*np.random.randn(3, 4)
print(ones_noise)
# 3 x 3 identity matrix
my_identity = np.eye(3)
print(my_identity)
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
[[ 0.99510521  0.99719296  0.99950497  0.99616235]
 [ 0.9913809   1.00196707  1.00545023  1.00486261]
 [ 1.01868421  0.99814142  1.00291637  0.99805381]]
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

Numpy arrays have a fixed shape (the array dimensions), can be sliced, and can be iterated over with loops. Below is a schematic illustrating slicing of two-dimensional arrays.

[figure: schematic of 2D array slicing]
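
In code, the slicing conventions look like this (a small illustrative array):

arr2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(arr2d[0, :])    # first row: [1 2 3 4]
print(arr2d[:, 1])    # second column: [ 2  6 10]
print(arr2d[1:, 2:])  # lower-right 2 x 2 block: [[ 7  8] [11 12]]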

Earlier, when we generated the one-dimensional arrays of ones and random numbers, we gave ones and random the number of elements we wanted in the array. In two dimensions, we need to provide the shape of the array, i.e., the number of rows and columns.

onesarray = np.ones([3,4])
onesarray
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
print(onesarray.shape)
onesarray.T
(3, 4)

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

onesarray.T.shape
(4, 3)

Matrix multiplication is accomplished by np.dot (or @).

print(np.dot(onesarray, onesarray.T)) # 3 x 3 matrix
np.dot(onesarray.T, onesarray) # 4 x 4 matrix
[[ 4.  4.  4.]
 [ 4.  4.  4.]
 [ 4.  4.  4.]]

array([[ 3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.]])

np.sum(onesarray)
12.0

Axis 0 runs downwards, along the rows (the $y$-axis, so to speak), whereas axis 1 runs across, along the columns (the $x$-axis). You will often use functions such as mean and sum with an axis argument.

np.sum(onesarray, axis=0), np.sum(onesarray, axis=1)
(array([ 3.,  3.,  3.,  3.]), array([ 4.,  4.,  4.]))
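
With a less symmetric array, the two axes give visibly different results; a small illustrative example:

arr2d = np.arange(12).reshape(3, 4)
print(np.mean(arr2d, axis=0))  # means down each column, shape (4,): [ 4.  5.  6.  7.]
print(np.mean(arr2d, axis=1))  # means across each row, shape (3,): [ 1.5  5.5  9.5]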

Simulating a simple election model

Say you toss a coin and have a model which says that the probability of heads is 0.5 (you have figured this out from symmetry, or physics, or something). Still, there will be sequences of flips in which more or fewer than half the flips are heads. These fluctuations induce a distribution on the number of heads (say $k$) in $N$ coin tosses: this is the binomial distribution.

Similarly, here, if the probability of Romney winning in Arizona is 0.938, it means that if there were 10,000 replications (as if we ran the election in 10,000 parallel universes), Romney would win Arizona in 9,380 of them on average. Some replications would have Romney winning more often, and some less. We can run these simulated universes, or replications, on a computer, though not in real life.
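
To make the fluctuation idea concrete, here is a minimal sketch using numpy's np.random.binomial; the universe and toss counts are illustrative:

# 10000 "universes", each consisting of N=1000 fair coin tosses
heads = np.random.binomial(1000, 0.5, size=10000)
print(heads.mean())              # close to 500 on average across universes
print(heads.min(), heads.max())  # but individual universes fluctuate around 500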

To do this, we will assume that the outcome in each state is the result of an independent coin flip whose probability of coming up Obama is given by the PredictWise state-wise win probabilities. Let's write a function simulate_election that uses this predictive model to simulate the outcome of the election given a table of probabilities.

Bernoulli Random Variables (in scipy.stats)

The Bernoulli Distribution represents the distribution for coin flips. Let the random variable X represent such a coin flip, where X=1 is heads, and X=0 is tails. Let us further say that the probability of heads is p (p=0.5 is a fair coin).

We then say:

\[X \sim Bernoulli(p),\]

which is to be read as: X has distribution Bernoulli(p). The probability mass function (pmf) associated with the Bernoulli distribution is

\[\begin{eqnarray} P(X = 1) &=& p \\ P(X = 0) &=& 1 - p \end{eqnarray}\]

for $p$ in the range 0 to 1. The pmf, i.e. the probability that the random variable $X$ takes the value $x$, may thus be written compactly as

\[P(X=x) = p^x(1-p)^{1-x}\]

for $x$ in the set $\{0,1\}$.
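
As a quick sanity check, scipy's bernoulli.pmf agrees with this closed form; a minimal sketch (p=0.3 is an arbitrary choice):

from scipy.stats import bernoulli

p = 0.3
for x in [0, 1]:
    # the closed form p^x * (1-p)^(1-x) should match scipy's pmf
    print(x, bernoulli.pmf(x, p), p**x * (1 - p)**(1 - x))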

The PredictWise probability of Obama winning in each state is a Bernoulli parameter. You can think of it as a different loaded coin being tossed in each state, and thus there is a Bernoulli distribution for each state.

Note: **some of the code, and ALL of the visual style for the distribution plots below, was shamelessly stolen from https://gist.github.com/mattions/6113437/**.

from scipy.stats import bernoulli
#bernoulli random variable
brv=bernoulli(p=0.3)
print(brv.rvs(size=20))
event_space=[0,1]
plt.figure(figsize=(12,8))
colors=sns.color_palette()
for i, p in enumerate([0.1, 0.2, 0.5, 0.7]):
    ax = plt.subplot(1, 4, i+1)
    plt.bar(event_space, bernoulli.pmf(event_space, p), label=p, color=colors[i], alpha=0.5)
    plt.plot(event_space, bernoulli.cdf(event_space, p), color=colors[i], alpha=0.5)

    ax.xaxis.set_ticks(event_space)
   
    plt.ylim((0,1))
    plt.legend(loc=0)
    if i == 0:
        plt.ylabel("PDF at $k$")
plt.tight_layout()
[1 1 0 0 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1]

[figure: Bernoulli pmf bars and cdf lines for p = 0.1, 0.2, 0.5, 0.7]

Running the simulation using the Uniform distribution

In the code below, each column of the simulations array represents one simulated election across the 50 states + DC: for each state we draw a random number between 0 and 1, and Obama wins that state in that simulation if the random number is $<$ his win probability. If he wins the state, we add in its electoral votes; otherwise we don't. We do this n_sim times and return the total number of Obama electoral votes in each simulation.

n_sim = 10000
simulations = np.random.uniform(size=(51, n_sim))
obama_votes = (simulations < predictwise.Obama.values.reshape(-1, 1)) * predictwise.Votes.values.reshape(-1, 1)
#summing over rows gives the total electoral votes for each simulation
obama_votes.sum(axis=0)
array([328, 319, 308, ..., 329, 272, 332])

The first thing to pick up on here is that np.random.uniform gives you a random number between 0 and 1, uniformly. In other words, the number is equally likely to be between 0 and 0.1, 0.1 and 0.2, and so on. This is a very intuitive idea, but it is formalized by the notion of the Uniform Distribution.

We then say:

\[X \sim Uniform([0,1]),\]

which is to be read as: X has distribution Uniform([0,1]). The probability density function (pdf) $f(x)$ associated with the Uniform distribution is

\[\begin{eqnarray} f(x) &=& 1 \,\, \textrm{for} \,\, x \in [0,1] \\ f(x) &=& 0 \,\, \textrm{for} \,\, x \notin [0,1] \end{eqnarray}\]

(For a continuous random variable the probability of any single exact value is 0; it is the density, not the probability, that equals 1 on $[0,1]$.)

Assigning the state to Obama when the random number drawn from the Uniform distribution is less than the PredictWise probability of an Obama win (which is a Bernoulli parameter) does the following for us: if we run a large number of simulations and $p_{Obama}=0.7$, then 70% of the time the random numbers drawn will fall below 0.7. Counting those draws as Obama wins thus hews to the frequentist notion of the probability of an Obama win. But remember that in the other 30% of the simulations Obama won't win, and this induces fluctuations, and hence a distribution, on the total number of electoral college votes that Obama gets. This is what we will see in the histogram below.
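
To see this frequentist idea in isolation, here is a minimal sketch (the value 0.7 and the number of draws are illustrative):

p_obama = 0.7  # an illustrative Bernoulli parameter
draws = np.random.uniform(size=100000)
print((draws < p_obama).mean())  # the fraction of draws below 0.7 approaches 0.7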

def simulate_election(model, n_sim):
    simulations = np.random.uniform(size=(51, n_sim))
    obama_votes = (simulations < model.Obama.values.reshape(-1, 1)) * model.Votes.values.reshape(-1, 1)
    #summing over rows gives the total electoral votes for each simulation
    return obama_votes.sum(axis=0)

Running the simulation using the Bernoulli distribution

We can use the Bernoulli distribution directly instead:

n_sim=10000
simulations = np.zeros(shape=(51, n_sim))
obama_votes = np.zeros(shape=(51, n_sim))
for i in range(51):
    simulations[i,:] = bernoulli(p=predictwise.Obama.values[i]).rvs(size=n_sim)
    obama_votes[i,:] = simulations[i]*predictwise.Votes.values[i]
obama_votes.sum(axis=0)
array([ 333.,  327.,  333., ...,  332.,  322.,  299.])
def simulate_election2(model, n_sim):
    simulations = np.zeros(shape=(51, n_sim))
    obama_votes = np.zeros(shape=(51, n_sim))
    for i in range(51):
        # one biased coin per state, tossed n_sim times
        simulations[i,:] = bernoulli(p=model.Obama.values[i]).rvs(size=n_sim)
        obama_votes[i,:] = simulations[i,:]*model.Votes.values[i]
    return obama_votes.sum(axis=0)

The following code takes the probabilities from the PredictWise data and runs 10,000 simulations. If you think of this in terms of our coins, it is like having 51 biased coins, one for each state, and tossing each of them 10,000 times.

We use the results to compute the number of simulations, according to this predictive model, in which Obama wins the election (i.e., receives 269 or more electoral college votes):

result = simulate_election(predictwise, 10000)
print((result >= 269).sum())
9960
result2 = simulate_election2(predictwise, 10000)
print((result2 >= 269).sum())
9947
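
Equivalently, dividing the count by the number of simulations gives a direct estimate of the win probability:

# the fraction of simulations in which Obama reaches 269 votes
print((result >= 269).mean())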

There are roughly only 50 simulations in which Romney wins the election!

Displaying the prediction

Now, let's visualize the simulation. We will build a histogram from the result of simulate_election, normalizing it by dividing the frequency of each vote tally by the number of simulations. We’ll overplot the “victory threshold” of 269 votes as a vertical black line and the actual outcome (Obama winning 332 votes) as a vertical red line.

We also compute the number of votes at the 5th and 95th percentiles, whose difference we call the spread, and display it (this is an estimate of the uncertainty of the outcome). By the 5th percentile we mean that if we ordered the numbers of votes Obama gets across the simulations in increasing order, the 5th percentile is the value below which 5% of the simulations lie.

We also display the probability of an Obama victory:

def plot_simulation(simulation):    
    plt.hist(simulation, bins=np.arange(200, 538, 1), 
             label='simulations', align='left', density=True)  # density=True replaces the deprecated normed=True
    plt.axvline(332, 0, .5, color='r', label='Actual Outcome')
    plt.axvline(269, 0, .5, color='k', label='Victory Threshold')
    p05 = np.percentile(simulation, 5.)
    p95 = np.percentile(simulation, 95.)
    iq = int(p95 - p05)
    pwin = ((simulation >= 269).mean() * 100)
    plt.title("Chance of Obama Victory: %0.2f%%, Spread: %d votes" % (pwin, iq))
    plt.legend(frameon=False, loc='upper left')
    plt.xlabel("Obama Electoral College Votes")
    plt.ylabel("Probability")
    sns.despine()
with sns.plotting_context('poster'):
    plot_simulation(result)

[figure: normalized histogram of simulated Obama electoral college votes, with victory threshold and actual outcome marked]

Combining the probabilities we obtained from PredictWise with the simulation of a biased coin flip in each state, with the corresponding win probability, gives us a histogram of election outcomes. Since we are plotting the probabilities of predicted outcomes, we call this distribution over outcomes the predictive distribution. Simulating from our model and plotting a histogram lets us visualize this predictive distribution. In general, such a set of probabilities is called a probability mass function.

Empirical Distribution

This is an empirical probability mass function.

Let's summarize how the mass function arose here: we ran 10,000 simulated elections, tossing one biased coin per state in each; depending on the outcome of each toss, we assigned the state to Obama or Romney, and then summed the electoral votes over the states.
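
To make the empirical pmf explicit, we can tabulate it from the result array produced by simulate_election above:

votes, counts = np.unique(result, return_counts=True)
pmf = counts / len(result)   # empirical probability of each electoral-vote total
print(pmf.sum())             # sanity check: the probabilities sum to 1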

There is a second, very useful question we can ask of any such probability mass or probability density: what is the probability that the random variable is less than some value, i.e. $P(X < x)$? This is also a probability distribution and is called the Cumulative Distribution Function, or CDF (sometimes just called the distribution, as opposed to the density or mass function). It is obtained by “summing” the probability mass function for all $X$ less than $x$.

CDF = lambda x: np.sum(result < x)/result.shape[0]  # np.float is deprecated; plain division already returns a float
for votes in [200, 300, 320, 340, 360, 400, 500]:
    print("Obama Win CDF at votes=", votes, " is ", CDF(votes))
Obama Win CDF at votes= 200  is  0.0
Obama Win CDF at votes= 300  is  0.152
Obama Win CDF at votes= 320  is  0.4561
Obama Win CDF at votes= 340  is  0.8431
Obama Win CDF at votes= 360  is  0.9968
Obama Win CDF at votes= 400  is  1.0
Obama Win CDF at votes= 500  is  1.0
votelist=np.arange(0, 540, 5)
plt.plot(votelist, [CDF(v) for v in votelist], '.-');
plt.xlim([200,400])
plt.ylim([-0.1,1.1])
plt.xlabel("votes for Obama")
plt.ylabel("probability of Obama win");

[figure: empirical CDF of Obama electoral college votes]