Keywords: normal-normal model |  hierarchical normal-normal model |  hierarchical |  divergences |  non-centered hierarchical model |  sampling distribution |  Download Notebook


%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import pymc3 as pm

From Gelman:

a simple hierarchical model based on the normal distribution, in which observed data are normally distributed with a different mean for each ‘group’ or ‘experiment,’ with known observation variance, and a normal population distribution for the group means. This model is sometimes termed the one-way normal random-effects model with known data variance and is widely applicable, being an important special case of the hierarchical normal linear model,…

Statement of the model

The particular example we will deal with is called the 8-schools example, and is described thus:

A study was performed for the Educational Testing Service to analyze the effects of special coaching programs on test scores. Separate randomized experiments were performed to estimate the effects of coaching programs for the SAT-V (Scholastic Aptitude Test-Verbal) in each of eight high schools. The outcome variable in each study was the score on a special administration of the SAT-V, a standardized multiple choice test administered by the Educational Testing Service and used to help colleges make admissions decisions; the scores can vary between 200 and 800, with mean about 500 and standard deviation about 100. The SAT examinations are designed to be resistant to short-term efforts directed specifically toward improving performance on the test; instead they are designed to reflect knowledge acquired and abilities developed over many years of education. Nevertheless, each of the eight schools in this study considered its short-term coaching program to be successful at increasing SAT scores. Also, there was no prior reason to believe that any of the eight programs was more effective than any other or that some were more similar in effect to each other than to any other.

the estimated coaching effects are $\bar{y}_j$, and their sampling variances, $\sigma_j^2$… The estimates $\bar{y}_j$ are obtained by independent experiments and have approximately normal sampling distributions with sampling variances that are known, for all practical purposes, because the sample sizes in all of the eight experiments were relatively large, over thirty students in each school

Notice the bar on the y’s and the mention of standard errors (rather than standard deviations) in the third column in the table above. Why is this?

The answer is that these are means taken over many (> 30) students in each school. With a within-school standard deviation of roughly $\sigma \approx 100$ and $n_j > 30$ students per school, the standard error of each mean is $\sigma/\sqrt{n_j} \approx 100/\sqrt{30} \approx 18$, which is the scale of the values we use below. The general structure of this model can be written thus:

Consider $J$ independent experiments, with experiment $j$ estimating the parameter $\theta_j$ from $n_j$ independent normally distributed data points, $y_{ij}$, each with known error variance $\sigma^2$; that is,

\[y_{ij} \vert \theta_j \sim N(\theta_j, \sigma^2), \, i = 1,...,n_j; j = 1,...,J.\]

We label the sample mean of each group $j$ as

\[\bar{y_j} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}\]

with sampling variance:

\[\sigma_j^2 = \sigma^2/n_j\]

We can then write the likelihood for each $\theta_j$ using the sufficient statistics, $\bar{y}_j$:

\[\bar{y_j} \vert \theta_j \sim N(\theta_j,\sigma_j^2).\]

This is

a notation that will prove useful later because of the flexibility in allowing a separate variance $\sigma_j^2$ for the mean of each group $j$. …all expressions will be implicitly conditional on the known values $\sigma_j^2$…. Although rarely strictly true, the assumption of known variances at the sampling level of the model is often an adequate approximation.

The treatment of the model provided … is also appropriate for situations in which the variances differ for reasons other than the number of data points in the experiment. In fact, the likelihood can appear in much more general contexts than that stated here. For example, if the group sizes $n_j$ are large enough, then the means $\bar{y_j}$ are approximately normally distributed, given $\theta_j$, even when the data $y_{ij}$ are not.
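
To make the reduction to sufficient statistics concrete, here is a small illustrative sketch with hypothetical raw scores (the class sizes, effects, and seed below are made up; the actual per-student ETS records are not part of this dataset): each school’s $\bar{y}_j$ is just the mean of its students’ scores, and its standard error is $\sigma/\sqrt{n_j}$.

# hypothetical per-student scores, only to show how ybar_j and sigma_j arise
np.random.seed(0)
sigma_within = 100.0                              # within-school sd, roughly the SAT-V sd
n_j = np.array([40, 35, 50, 45, 60, 38, 42, 33])  # made-up class sizes, all > 30
theta_true = np.random.normal(8, 5, size=8)       # made-up "true" coaching effects

ybar_j = np.array([np.random.normal(th, sigma_within, n).mean()
                   for th, n in zip(theta_true, n_j)])
sigma_j = sigma_within / np.sqrt(n_j)             # standard error of each school mean

print(np.round(ybar_j, 1))
print(np.round(sigma_j, 1))                       # same ballpark as the 9-18 used below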

Setting up the hierarchical model

We’ll set up the model in what is called a “centered” parametrization, which tells us how $\theta_j$ is modelled: it is written as directly drawn from a normal distribution whose parameters are the hyper-parameters.

J = 8
y = np.array([28,  8, -3,  7, -1,  1, 18, 12])
sigma = np.array([15, 10, 16, 11,  9, 11, 10, 18])

We set up our priors in a hierarchical model to use this centered parametrization. We can say: each $\theta_j$ is drawn from a normal hyper-prior distribution with parameters $\mu$ and $\tau$. Given a $\theta_j$, we then draw an observed mean $\bar{y}_j$ with known standard error $\sigma_j$; one such draw corresponds to our data.

\[\mu \sim \mathcal{N}(0, 5)\\ \tau \sim \text{Half-Cauchy}(0, 5)\\ \theta_{j} \sim \mathcal{N}(\mu, \tau)\\ \bar{y_{j}} \sim \mathcal{N}(\theta_{j}, \sigma_{j})\]

where $j \in \{1, \ldots, 8\}$ and the $\{\bar{y}_{j}, \sigma_{j}\}$ are given as data.

with pm.Model() as schools1:

    mu = pm.Normal('mu', 0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5)
    theta = pm.Normal('theta', mu=mu, sd=tau, shape=J)
    obs = pm.Normal('obs', mu=theta, sd=sigma, observed=y)
with schools1:
    trace1 = pm.sample(5000, init=None, njobs=2, tune=500)
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [theta, tau, mu]
Sampling 2 chains: 100%|██████████| 11000/11000 [00:21<00:00, 510.72draws/s]
There were 80 divergences after tuning. Increase `target_accept` or reparameterize.
There were 139 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
pm.summary(trace1)

|  | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 4.207586 | 3.387222 | 0.128527 | -2.364228 | 10.815423 | 443.438897 | 1.003844 |
| theta__0 | 6.212825 | 5.831719 | 0.159452 | -4.128980 | 18.410220 | 977.723211 | 1.000012 |
| theta__1 | 4.745148 | 4.785493 | 0.135035 | -3.945520 | 14.722419 | 869.961535 | 1.001362 |
| theta__2 | 3.688946 | 5.384133 | 0.135945 | -6.881554 | 14.602095 | 1194.801444 | 1.002863 |
| theta__3 | 4.599909 | 5.106523 | 0.142366 | -5.636394 | 14.969515 | 866.237232 | 1.001245 |
| theta__4 | 3.345345 | 4.826723 | 0.134081 | -6.022840 | 12.653606 | 1026.652251 | 1.003787 |
| theta__5 | 3.861150 | 4.892589 | 0.126074 | -6.373854 | 13.138098 | 1186.571117 | 1.003271 |
| theta__6 | 6.239858 | 5.293900 | 0.156773 | -3.199409 | 17.364483 | 803.695943 | 1.000267 |
| theta__7 | 4.747230 | 5.385314 | 0.132700 | -5.646697 | 15.551957 | 1463.209855 | 1.000123 |
| tau | 3.884377 | 3.099556 | 0.118285 | 0.424971 | 10.005174 | 572.684459 | 1.003743 |

The Gelman-Rubin statistics seem fine, but notice how small the effective sample sizes (n_eff) are? Something is not quite right. Let’s look at the traceplots.

pm.traceplot(trace1);

png

It’s hard to pick the individual thetas out, but $\tau$ looks not so white-noisy. Let’s zoom in:

pm.traceplot(trace1, varnames=['tau_log__']);

png

There seems to be some stickiness at lower values in the trace. Zooming in even more helps us see this better:

plt.plot(trace1['tau_log__'], alpha=0.6)
plt.axvline(5000, color="r")  # boundary between the two chains of 5000 samples
<matplotlib.lines.Line2D at 0x119f09978>

png

We plot the cumulative mean of $\log(\tau)$ as the iterations accumulate. This definitely shows a problem: it is biased above the value you would expect from a very long run.

# cumulative running mean of log(tau)
logtau = trace1['tau_log__']
mlogtau = [np.mean(logtau[:i]) for i in np.arange(1, len(logtau))]
plt.figure(figsize=(15, 4))
plt.axhline(0.7657852, lw=2.5, color='gray')  # reference value of E[log(tau)] from a long run
plt.plot(mlogtau, lw=2.5)
plt.ylim(0, 2)
plt.xlabel('Iteration')
plt.ylabel('MCMC mean of log(tau)')
plt.title('MCMC estimation of cumsum log(tau)')
Text(0.5, 1.0, 'MCMC estimation of cumsum log(tau)')

png

The problem with curvature

In fact, the “sticky” stretches in the traceplot are trying to drag the $\tau$ trace down to the true value, eventually. If we wait long enough (think heat-death of the universe), this will happen, given MCMC’s asymptotic guarantees. But clearly we want to get there with a finite number of samples.

We can diagnose what’s going on by looking for divergent traces in pymc3. In newer versions, these are obtained as a special boolean component of the trace.

divergent = trace1['diverging']
print('Number of Divergent %d' % divergent.nonzero()[0].size)
divperc = divergent.nonzero()[0].size/len(trace1)
print('Percentage of Divergent %.5f' % divperc)
Number of Divergent 219
Percentage of Divergent 0.04380
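
The same flags are exposed through the sampler-stats interface of the PyMC3 trace (shown here as an equivalent access path; the array should match trace1['diverging']):

divergent_alt = trace1.get_sampler_stats('diverging')
print(divergent_alt.sum())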

What does divergent mean? These are situations in which our symplectic integrator has gone kaput, as illustrated in the diagram below from Betancourt’s review:

When an MCMC sampler (the problem is worse in non-HMC samplers) encounters a region of high curvature, it gets stuck there and then wanders off elsewhere after a while. In the HMC scenario, the symplectic integrator diverges: it encounters a region of high curvature that our timestep $\epsilon$ is not able to resolve. Things become numerically unstable, and the integrator veers off towards infinite energies, clearly no longer conserving energy.
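
To see this mechanism concretely, here is a minimal, self-contained sketch (this is not PyMC3’s integrator, and the target is a Neal-style funnel standing in for the $\theta$/$\log(\tau)$ geometry rather than the exact 8-schools posterior): a hand-rolled leapfrog trajectory started in the neck of the funnel, with its momentum aimed further in. For a large step size the energy error explodes, which is exactly the kind of trajectory that gets flagged as divergent; for small step sizes the energy is nearly conserved.

# hand-rolled leapfrog on a Neal-style funnel (illustrative sketch only):
# v = log(tau) ~ N(0, 3^2), theta | v ~ N(0, exp(v))
def U(q):
    v, th = q
    return v**2 / 18.0 + 0.5 * v + 0.5 * th**2 * np.exp(-v)   # -log p, up to a constant

def grad_U(q):
    v, th = q
    return np.array([v / 9.0 + 0.5 - 0.5 * th**2 * np.exp(-v),
                     th * np.exp(-v)])

def energy_error(q0, p0, eps, L):
    # absolute change in the Hamiltonian after L leapfrog steps of size eps;
    # an exact integrator would keep this at zero
    q, p = np.array(q0, dtype=float), np.array(p0, dtype=float)
    H0 = U(q) + 0.5 * p @ p
    with np.errstate(over='ignore', invalid='ignore'):
        for _ in range(L):
            p = p - 0.5 * eps * grad_U(q)
            q = q + eps * p
            p = p - 0.5 * eps * grad_U(q)
        return np.abs(U(q) + 0.5 * p @ p - H0)

q_neck = [-4.0, 0.05]          # deep in the neck: tau = exp(-4)
p_in = [-1.0, 1.0]             # momentum aimed further into the neck
for eps in [0.5, 0.05, 0.005]:
    print("eps = %.3f, |energy error| = %.3g" % (eps, energy_error(q_neck, p_in, eps, L=15)))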

Where is this curvature coming from? Things become easier to understand if we plot the joint distribution of one of the thetas against a hyper-parameter: we see a triangular funnel structure, with the hyper-parameter spanning orders of magnitude in value. Using pymc3 we can plot where the divergences occur; while they can occur anywhere, they cluster in the neck of the distribution.

theta_trace = trace1['theta']
theta0 = theta_trace[:, 0]
plt.figure(figsize=(10, 6))
plt.scatter(theta0[divergent == 0], logtau[divergent == 0], color='r', s=10, alpha=0.05)
plt.scatter(theta0[divergent == 1], logtau[divergent == 1], color='g', s=10, alpha=0.9)
plt.axis([-20, 50, -6, 4])
plt.ylabel('log(tau)')
plt.xlabel('theta[0]')
plt.title('scatter plot between log(tau) and theta[0]')
plt.show()

png

Two things have now happened: we have an increasing inability to integrate in the neck of this funnel, and we have lost confidence that our sampler is actually characterizing this funnel well.

The pymc3 warning system also captures this, and the information can be drawn from there as well:

trace1.report.ok
False
trace1.report._chain_warnings[0][0]
SamplerWarning(kind=<WarningType.DIVERGENCE: 1>, message='Energy change in leapfrog step is too large: 1458.41782758.', level='debug', step=2, exec_info=None, extra={'theta': array([ 4.87711111,  2.25999236,  2.34774179,  2.1574986 ,  3.79069814,
        4.34545496,  3.76649258,  3.32757565]), 'tau_log__': array(-0.3984309958799402), 'mu': array(4.635052778361571)})

Funnels in hierarchical models

As discussed in Betancourt and Girolami (2015), from where the following diagrams are taken, a funnel structure is common in hierarchical models; it reflects strong correlations between down-tree parameters, such as the thetas, and up-tree parameters, such as $\phi = (\mu, \tau)$ here.

The funnel between $v = \log(\tau)$ and $\theta_i$ in the hierarchical normal-normal model looks like this:

The problem is that a sampler must explore both the light and the dark regions, as both carry enough probability mass.
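
We can see the same geometry directly from the prior of our centered model (a quick sketch, fixing $\mu = 0$ for simplicity): conditional on $\tau$, each $\theta_j \sim N(0, \tau)$, so small $\tau$ squeezes the thetas into the narrow neck while large $\tau$ fans them out.

# prior draws from the centered model with mu fixed at 0 (illustration only)
np.random.seed(42)
tau_draws = np.abs(5 * np.random.standard_cauchy(5000))  # Half-Cauchy(0, 5)
theta_draws = np.random.normal(0, tau_draws)             # theta_j | tau
plt.figure(figsize=(10, 6))
plt.scatter(theta_draws, np.log(tau_draws), s=5, alpha=0.1)
plt.axis([-20, 50, -6, 4])
plt.xlabel('theta')
plt.ylabel('log(tau)')
plt.title('prior funnel: theta against log(tau), mu = 0');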

Divergences are good things

This is because they can help us diagnose problems in our samplers. Chances are that in a region with divergences the sampler is having problems exploring.

There is a second reason, besides curvature and symplectic integration, which affects the efficiency of HMC: the range of transitions an HMC sampler can make. For a Euclidean mass-matrix sampler, the transitions are of the order of the range in kinetic energy, which is itself chi-squared distributed (since $p$ is Gaussian, and a sum of squared Gaussians is chi-squared distributed). Thus, in expectation, the variation in $K$ is of order $d/2$, where $d$ is the dimension of the target. Since hierarchical structures correlate variables between levels, they also induce large changes in energy density, which our transitions don't explore well.
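
A quick numerical check of that claim (not part of the original analysis): with $p \sim N(0, I_d)$, the kinetic energy $K = p \cdot p / 2$ has mean $d/2$, and $2K$ is chi-squared with $d$ degrees of freedom.

np.random.seed(0)
for d in [2, 10, 100]:
    p = np.random.randn(100000, d)   # momenta for a d-dimensional target
    K = 0.5 * (p ** 2).sum(axis=1)   # kinetic energies
    print(d, K.mean(), K.std())      # mean ~ d/2, spread ~ sqrt(d/2)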

Non-centered Model

So what is to be done? We could change the kinetic energy using methods such as Riemannian HMC, but that’s beyond our scope. But, just as in the case of the regression example earlier, a re-parametrization comes to our rescue. We want to reduce the levels in the hierarchy, as shown here:

We change our model to:

\[\mu \sim \mathcal{N}(0, 5)\\ \tau \sim \text{Half-Cauchy}(0, 5)\\ \nu_{j} \sim \mathcal{N}(0, 1)\\ \theta_{j} = \mu + \tau\nu_j \\ \bar{y_{j}} \sim \mathcal{N}(\theta_{j}, \sigma_{j})\]

Notice how we have factored the dependency of $\theta$ on $\phi = (\mu, \tau)$ into a deterministic transformation between the layers, leaving the actively sampled variables uncorrelated.

This does two things for us: it reduces steepness and curvature, making for better stepping. It also reduces the strong change in densities, and makes sampling from the transition distribution easier.

with pm.Model() as schools2:
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5)
    nu = pm.Normal('nu', mu=0, sd=1, shape=J)
    theta = pm.Deterministic('theta', mu + tau * nu)
    obs = pm.Normal('obs', mu=theta, sd=sigma, observed=y)
with schools2:
    trace2 = pm.sample(5000, init=None, njobs=2, tune=500)
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [nu, tau, mu]
Sampling 2 chains: 100%|██████████| 11000/11000 [00:09<00:00, 1102.28draws/s]
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
pm.summary(trace2)

|  | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 4.347176 | 3.286204 | 0.033158 | -2.076250 | 10.809128 | 9280.625148 | 0.999958 |
| nu__0 | 0.333415 | 0.998524 | 0.008853 | -1.542916 | 2.345821 | 10770.381759 | 0.999930 |
| nu__1 | 0.095830 | 0.925314 | 0.008496 | -1.748745 | 1.913819 | 12623.663126 | 0.999962 |
| nu__2 | -0.090668 | 0.973473 | 0.010201 | -2.003096 | 1.811096 | 11384.279142 | 0.999909 |
| nu__3 | 0.071129 | 0.940188 | 0.007578 | -1.732506 | 1.882510 | 12046.790749 | 0.999905 |
| nu__4 | -0.153937 | 0.951805 | 0.008784 | -2.014699 | 1.727808 | 12065.548667 | 0.999984 |
| nu__5 | -0.070166 | 0.944777 | 0.009325 | -1.890132 | 1.813137 | 10954.542775 | 0.999954 |
| nu__6 | 0.358106 | 0.981190 | 0.011142 | -1.515111 | 2.341151 | 9123.178041 | 1.000136 |
| nu__7 | 0.065285 | 0.970545 | 0.008104 | -1.823876 | 1.990456 | 13758.237973 | 0.999901 |
| tau | 3.552034 | 3.185068 | 0.036809 | 0.002162 | 9.845447 | 7236.920715 | 0.999933 |
| theta__0 | 6.188267 | 5.563705 | 0.054012 | -3.875958 | 18.548927 | 9263.543222 | 1.000048 |
| theta__1 | 4.863323 | 4.650979 | 0.043388 | -4.070784 | 14.487559 | 10923.364832 | 0.999942 |
| theta__2 | 3.873658 | 5.244453 | 0.060124 | -6.738831 | 14.406468 | 8794.224031 | 1.000048 |
| theta__3 | 4.721012 | 4.754331 | 0.043573 | -4.633450 | 14.515977 | 10975.024446 | 1.000119 |
| theta__4 | 3.580124 | 4.644715 | 0.046571 | -6.319482 | 12.434017 | 10236.673808 | 0.999948 |
| theta__5 | 4.015354 | 4.769508 | 0.047606 | -6.101058 | 13.014267 | 10498.887075 | 0.999901 |
| theta__6 | 6.215937 | 5.086602 | 0.052506 | -3.160603 | 17.322192 | 9862.673399 | 1.000478 |
| theta__7 | 4.755671 | 5.289659 | 0.053902 | -6.135223 | 15.392630 | 10221.686590 | 1.000014 |
pm.traceplot(trace2);

png

pm.traceplot(trace2, varnames=['tau_log__']);

png

Ok, so this seems to look better!

plt.plot(trace2['tau_log__'], alpha=0.6)
plt.axvline(5000, color="r")  # boundary between the two chains of 5000 samples
<matplotlib.lines.Line2D at 0x11af00dd8>

png

And the effective sample sizes have improved as well:

pm.diagnostics.gelman_rubin(trace2), pm.diagnostics.effective_n(trace2)
({'mu': 0.99995806128958298,
  'nu': array([ 0.99993023,  0.99996187,  0.99990919,  0.99990459,  0.99998409,
          0.99995408,  1.00013553,  0.99990075]),
  'tau': 0.99993259236404675,
  'theta': array([ 1.00004779,  0.99994164,  1.00004808,  1.00011865,  0.99994786,
          0.9999015 ,  1.00047802,  1.00001387])},
 {'mu': 9280.6251483004617,
  'nu': array([ 10770.38175948,  12623.66312606,  11384.27914234,  12046.79074917,
          12065.54866653,  10954.54277482,   9123.17804143,  13758.23797319]),
  'tau': 7236.9207149620033,
  'theta': array([  9263.54322197,  10923.36483198,   8794.22403058,  10975.02444636,
          10236.67380755,  10498.88707531,   9862.67339945,  10221.6865905 ])})

And we reach the true value faster as the number of samples increases, decreasing our bias:

# cumulative running mean of log(tau)
logtau = trace2['tau_log__']
mlogtau = [np.mean(logtau[:i]) for i in np.arange(1, len(logtau))]
plt.figure(figsize=(15, 4))
plt.axhline(0.7657852, lw=2.5, color='gray')  # reference value of E[log(tau)] from a long run
plt.plot(mlogtau, lw=2.5)
plt.ylim(0, 2)
plt.xlabel('Iteration')
plt.ylabel('MCMC mean of log(tau)')
plt.title('MCMC estimation of cumsum log(tau)')
Text(0.5, 1.0, 'MCMC estimation of cumsum log(tau)')

png

How about our divergences? They have decreased too.

divergent = trace2['diverging']
print('Number of Divergent %d' % divergent.nonzero()[0].size)
divperc = divergent.nonzero()[0].size/len(trace2)
print('Percentage of Divergent %.5f' % divperc)
Number of Divergent 4
Percentage of Divergent 0.00080
theta_trace = trace2['theta']
theta0 = theta_trace[:, 0]
plt.figure(figsize=(10, 6))
plt.scatter(theta0[divergent == 0], logtau[divergent == 0], color='r', s=10, alpha=0.05)
plt.scatter(theta0[divergent == 1], logtau[divergent == 1], color='g', s=20, alpha=0.9)
plt.axis([-20, 50, -6, 4])
plt.ylabel('log(tau)')
plt.xlabel('theta[0]')
plt.title('scatter plot between log(tau) and theta[0]')
plt.show()

png

Look how much longer the funnel actually is, and how much better we have explored it.

theta01 = trace1['theta'][:, 0]
logtau1 = trace1['tau_log__']

theta02 = trace2['theta'][:, 0]
logtau2 = trace2['tau_log__']


plt.figure(figsize=(10, 6))
plt.scatter(theta01, logtau1, alpha=.05, color="b", label="original", s=10)
plt.scatter(theta02, logtau2, alpha=.05, color="r", label='reparametrized', s=10)
plt.axis([-20, 50, -6, 4])
plt.ylabel('log(tau)')
plt.xlabel('theta[0]')
plt.title('scatter plot between log(tau) and theta[0]')
plt.legend()

<matplotlib.legend.Legend at 0x116002860>

png

It may not be possible in all models to achieve this sort of decoupling. In that case Riemannian HMC, which generalizes the mass matrix to depend on position and thus explicitly tackles high curvature, can help.
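
Another, more mechanical knob, the one the sampler warnings above suggest, is to raise the NUTS acceptance target so that the integrator takes smaller steps. A hedged sketch (the exact keyword depends on the PyMC3 version: older releases take a nuts_kwargs dict, newer ones accept target_accept directly; trace3 is just an illustrative name):

with schools2:
    # smaller steps via a higher acceptance target: slower, but fewer divergences
    trace3 = pm.sample(5000, tune=1000, nuts_kwargs=dict(target_accept=0.95))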