Data: wine_quality_missing.csv
Contents
- Question 1: Give Me the Full Monte, Carlo Ancelotti
- Question 2: Rally to Me!
- Question 3: Still Missing!
Harvard University
Fall 2018
Instructors: Rahul Dave
**Due Date:** Saturday, September 22nd, 2018 at 11:59pm
Instructions:
- Upload your final answers in the form of a Jupyter notebook containing all work to Canvas.
- Structure your notebook and your work to maximize readability.
import numpy as np
import scipy.stats
import scipy.special
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib import cm
import pandas as pd
%matplotlib inline
Question 1: Give Me the Full Monte, Carlo Ancelotti
Coding required
In the quiet moments that transpire just before the sun rises that find us taking the walk of shame, we can only send up quiet prayers to deities unknown that our path has not unknowingly taken us down the Boulevard of Broken Dreams (Green Day). Along this road you’ll find scattered the shattered hearts of formerly aspiring humorists – the sorts of folk who might admire “the giggle at a funeral” (Hozier) distributed according to the function $\heartsuit(\theta) \sim \frac{\sin^{24}\theta}{\theta^2}$ for $0 < \theta < \infty$ and $0$ otherwise. As a current aspiring humorist, it is your job to try to integrate $\heartsuit(\theta)$ in order to size up the jar you’ll take with you as you go heart collecting (Christina Perri). Who do you think you are anyway?
1.1. Visualize $\heartsuit(\theta)$. Make sure your plot includes a title and axes labels.
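A minimal plotting sketch for 1.1 (the upper limit of 20 on the plotting window is an arbitrary choice made only to show a few oscillations; the imports from the cell above are assumed to have been run):

```python
# Plot the heart function on a grid; start just above 0 to avoid a 0/0 at theta = 0
theta = np.linspace(1e-6, 20, 2000)
heart_vals = np.sin(theta)**24 / theta**2

plt.figure(figsize=(8, 4))
plt.plot(theta, heart_vals)
plt.title(r"$\heartsuit(\theta) = \sin^{24}(\theta)\,/\,\theta^2$")
plt.xlabel(r"$\theta$")
plt.ylabel(r"$\heartsuit(\theta)$")
plt.show()
```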
1.2. The domain of $\heartsuit(\theta)$ is unbounded. The version of Monte Carlo that we’ve explored so far requires a bounded domain. Make an argument that we can integrate this function over the bounded domain [0, M] and get an accurate result. What value of M should you choose to get a result within 0.001 of the exact solution?
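One crude but sufficient line of reasoning (a sketch, not the only possible choice of $M$): since $0 \le \sin^{24}\theta \le 1$, the truncated tail satisfies
$$\int_{M}^{\infty} \heartsuit(\theta)\, d\theta \;\le\; \int_{M}^{\infty} \frac{1}{\theta^{2}}\, d\theta \;=\; \frac{1}{M},$$
so $M = 1000$ guarantees the truncation error is at most $0.001$.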
1.3. Write a function `simulate_heart_collection` to estimate $\int_{0}^{\infty}\heartsuit(\theta)\, d\theta$ using the standard Monte Carlo method with $N=100000$. Use the bounds you justified in 1.2. What is your estimate?
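A minimal sketch of one possible implementation, assuming $M = 1000$ carried over from the bound sketched after 1.2:

```python
def heart(theta):
    """The integrand sin^24(theta) / theta^2."""
    return np.sin(theta)**24 / theta**2

def simulate_heart_collection(N=100000, M=1000):
    """Standard Monte Carlo estimate of the integral of heart() over [0, M]."""
    # Uniform samples on (0, M); the tiny offset avoids an (astronomically
    # unlikely) 0/0 at theta = 0.
    theta = np.random.uniform(low=1e-12, high=M, size=N)
    return M * np.mean(heart(theta))

estimate = simulate_heart_collection()
print(estimate)
```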
1.4. It turns out that integrals of the form $\int_{0}^{\infty} \frac{\sin^{2n} x}{x^2}\, dx$ have the closed-form solution $\frac{\pi}{2^{2n-1}} \binom{2n-2}{n-1}$. How accurate was your estimate?
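For reference, the closed form can be evaluated directly with `scipy.special.comb` (already imported above via `scipy.special`), using $n = 12$ since the integrand is $\sin^{24}\theta/\theta^2$:

```python
# Exact value of the integral from the stated closed form, with n = 12
n = 12
exact = np.pi / 2**(2*n - 1) * scipy.special.comb(2*n - 2, n - 1)
print(exact)
```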
1.5. The teaching staff of AM 207 has gone on and on in class and lab about putting error bars on estimates and finding confidence intervals. In order to do this you need to run your experiment a number of times. Repeat your estimation process 1000 times and plot a histogram of your results marking the exact answer and your estimate with a vertical line.
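One possible sketch of the repeated experiment and the histogram (1000 replicates of the estimator sketched above; `exact` is the closed-form value from 1.4):

```python
# Re-run the Monte Carlo estimator 1000 times and histogram the replicates
estimates = np.array([simulate_heart_collection() for _ in range(1000)])

plt.figure(figsize=(8, 4))
plt.hist(estimates, bins=50)
plt.axvline(exact, color='red', label='exact value')
plt.axvline(estimates.mean(), color='black', label='mean of the estimates')
plt.title("Monte Carlo estimates of the heart integral (1000 replicates)")
plt.xlabel("estimate")
plt.ylabel("count")
plt.legend()
plt.show()
```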
1.6. Based on your experiments, find the standard error of your estimate as well as a 95% confidence interval. Was the true value of $\int_{0}^{\infty}\heartsuit (\theta)$ within the 95% confidence interval?
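One common way to report these quantities, assuming `estimates` is the array of 1000 replicates from 1.5 (the normal-approximation interval is an assumption, not the only valid construction):

```python
# Standard error = spread of the sampling distribution of a single estimate
se = np.std(estimates)
# Normal-approximation 95% confidence interval
ci = (np.mean(estimates) - 1.96 * se, np.mean(estimates) + 1.96 * se)
print(se, ci)
```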
1.7. It turns out that an appropriately chosen change of variables will allow you to estimate the integral on the part of the domain you truncated in 1.2 and 1.3. Execute this change of variables and use Monte Carlo integration to evaluate $\int_{M}^{\infty}\heartsuit(\theta)\, d\theta$.
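One substitution that does the trick (a sketch; other changes of variables also work) is $u = 1/\theta$, so that $d\theta = -du/u^{2}$ and
$$\int_{M}^{\infty} \frac{\sin^{24}\theta}{\theta^{2}}\, d\theta = \int_{0}^{1/M} \sin^{24}\!\left(\tfrac{1}{u}\right) du,$$
a bounded integral on $[0, 1/M]$ that the same uniform-sampling Monte Carlo estimator can handle.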
1.8. Based on your answer in 1.7, was your choice of $M$ justified?
Answers
1.1. Visualize $\heartsuit(\theta)$. Make sure your plot includes a title and axes labels.
*Your Answer Here*
1.2. The domain of $\heartsuit(\theta)$ is unbounded. The version of Monte Carlo that we've explored so far requires a bounded domain. Make an argument that we can integrate this function over a bounded domain and get an accurate result. What bounds should you choose to get a result within 0.001 of the exact solution?
*Your Answer Here*
1.3. Write a function `simulate_heart_collection` to estimate $\int_{0}^{\infty}\heartsuit(\theta)\, d\theta$ using the standard Monte Carlo method with $N=100000$. Use the bounds you justified in 1.2. What is your estimate?
*Your Answer Here*
1.4. It turns out that integrals of the form $\int_{0}^{\infty} \frac{\sin^{2n} x}{x^2}\, dx$ have the closed-form solution $\frac{\pi}{2^{2n-1}} \binom{2n-2}{n-1}$. How accurate was your estimate?
*Your Answer Here*
1.5. The teaching staff of AM 207 has gone on and on in class and lab about putting error bars on estimates and finding confidence intervals. In order to do this you need to run your experiment a number of times. Repeat your estimation process 1000 times and plot a histogram of your results marking the exact answer and your estimate with a vertical line.
*Your Answer Here*
1.6. Based on your experiments, find the standard error of your estimate as well as a 95% confidence interval. Was the true value of $\int_{0}^{\infty}\heartsuit (\theta)$ within the 95% confidence interval?
*Your Answer Here*
1.7. It turns out that an appropriately chosen change of variables will allow you to estimate the integral on the part of the domain you truncated in 1.2 and 1.3. Execute this change of variables and use Monte Carlo integration to evaluate $\int_{M}^{\infty}\heartsuit(\theta)\, d\theta$.
*Your Answer Here*
1.8. Based on your answer in 1.7, was your choice of $M$ justified?
*Your Answer Here*
Question 2: Rally to Me!
Some Coding required
Suppose you observe the following data set $\mathbf{x}^{(0)} = (0.5, 2.5), \mathbf{x}^{(1)} = (3.2, 1.3), \mathbf{x}^{(2)} = (2.72, 5.84), \mathbf{x}^{(3)}= (10.047, 0.354)$. By convention, for any vector $\mathbf{x}$, we will denote the first component of $\mathbf{x}$ by $x_{1}$ and the second component by $x_{2}$. Suppose that the data is drawn from the same two-dimensional probability distribution with pdf $f_X$, that is, $\mathbf{x}^{(i)} \overset{iid}{\sim} f_X$, where \(f_X(\mathbf{x}) = 4\lambda_1^2 x_{1}x_{2} \exp\left\{-\lambda_0 (x^2_{1} + x^2_{2}) \right\}.\) You should assume that $\lambda_1, \lambda_0 > 0$ and that $f_X$ is supported on the nonnegative quadrant of $\mathbb{R}^2$ (i.e. $f_X$ is zero when either component is negative).
2.1. What are the values for $\lambda_0$ and $\lambda_1$ that maximize the likelihood of the observed data? Support your answer with full and rigorous analytic derivations.
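As a purely mechanical starting point (not the full derivation the question asks for), the log-likelihood of the four observations under $f_X$ is
$$\ell(\lambda_0, \lambda_1) = \sum_{i=0}^{3}\left[\log 4 + 2\log\lambda_1 + \log x^{(i)}_{1} + \log x^{(i)}_{2} - \lambda_0\left( \left(x^{(i)}_{1}\right)^{2} + \left(x^{(i)}_{2}\right)^{2}\right)\right],$$
which is then to be maximized over $\lambda_0$ and $\lambda_1$.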
2.2. Visualize the data along with the distribution you determined in 2.1 (in two dimensions or three).
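A possible visualization sketch for 2.2; `lam0` and `lam1` below are placeholders for the maximizing values found in 2.1, not the actual answers:

```python
# Scatter of the observed points over a filled contour plot of the fitted density
lam0 = lam1 = 1.0  # placeholders; replace with the MLEs from 2.1
data = np.array([[0.5, 2.5], [3.2, 1.3], [2.72, 5.84], [10.047, 0.354]])

x1, x2 = np.meshgrid(np.linspace(0.01, 11, 200), np.linspace(0.01, 7, 200))
density = 4 * lam1**2 * x1 * x2 * np.exp(-lam0 * (x1**2 + x2**2))

plt.figure(figsize=(8, 5))
plt.contourf(x1, x2, density, levels=30, cmap=cm.viridis)
plt.colorbar(label=r"$f_X(\mathbf{x})$")
plt.scatter(data[:, 0], data[:, 1], color='red', label='observed data')
plt.title("Observed data and fitted density")
plt.xlabel(r"$x_1$")
plt.ylabel(r"$x_2$")
plt.legend()
plt.show()
```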
Answers
2.1. What are the values for $\lambda_0$ and $\lambda_1$ that maximize the likelihood of the observed data?
*Your Answer Here*
2.2. Visualize the data along with the distribution you determined in 2.1 (in two dimensions or three).
*Your Answer Here*
Question 3: Still Missing!
Coding required
Recall from Homework 1 Question 2 that we explored working with missing data using the wine quality dataset from the UCI Machine Learning Repository. Re-read the data in `wine_quality_missing.csv` into a pandas dataframe and store the dataframe in the variable `wine_df`.
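A one-line re-read, assuming the CSV sits in the same directory as the notebook:

```python
# Load the wine data with missing values into a dataframe
wine_df = pd.read_csv("wine_quality_missing.csv")
wine_df.head()
```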
3.1. Drop impute `wine_df` and re-calculate estimates of the mean and standard deviation of the values of the Ash feature in the drop imputed dataset.
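One reading of "drop impute" is to discard every row that contains any missing value; a minimal sketch under that reading (the column is assumed to be named `Ash`, as in Homework 1):

```python
# Drop all rows with missing values, then summarize the Ash feature
wine_drop_df = wine_df.dropna()

ash_drop_mean = wine_drop_df['Ash'].mean()
ash_drop_std = wine_drop_df['Ash'].std()
print(ash_drop_mean, ash_drop_std)
```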
3.2. Use non-parametric bootstrap on the drop imputed dataset to find the standard errors for both your mean and standard deviation estimates.
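A minimal non-parametric bootstrap sketch, reusing `wine_drop_df` from the sketch above (the helper name and the number of resamples are arbitrary choices):

```python
def bootstrap_se(values, stat, n_boot=10000):
    """Standard error of `stat` via resampling the data with replacement."""
    values = np.asarray(values)
    replicates = np.array([
        stat(np.random.choice(values, size=len(values), replace=True))
        for _ in range(n_boot)
    ])
    # The bootstrap standard error is the spread of the replicate statistics
    return replicates.std()

se_mean = bootstrap_se(wine_drop_df['Ash'].values, np.mean)
se_std = bootstrap_se(wine_drop_df['Ash'].values, np.std)
print(se_mean, se_std)
```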
3.3. Mean impute `wine_df` and re-calculate estimates of the mean and standard deviation of the values of the Ash feature in the mean imputed dataset.
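A possible sketch of mean imputation: fill each missing value with its column's observed mean, then recompute the Ash summaries:

```python
# Replace missing entries column-by-column with the column means
wine_mean_df = wine_df.fillna(wine_df.mean())

ash_mean_mean = wine_mean_df['Ash'].mean()
ash_mean_std = wine_mean_df['Ash'].std()
print(ash_mean_mean, ash_mean_std)
```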
3.4. Use non-parametric bootstrap on the mean imputed dataset to find the standard errors for both your mean and standard deviation estimates.
3.5. Compare the standard errors between the two different types of imputation. Do they differ? If so what might be the cause of the difference?
Answers
3.1. Drop impute `wine_df` and re-calculate estimates of the mean and standard deviation of the values of the Ash feature in the drop imputed dataset.
*Your Answer Here*
3.2. Use non-parametric bootstrap on the drop imputed dataset to find the standard errors for both your mean and standard deviation estimates.
*Your Answer Here*
3.3. Mean impute `wine_df` and re-calculate estimates of the mean and standard deviation of the values of the Ash feature in the mean imputed dataset.
*Your Answer Here*
3.4. Use non-parametric bootstrap on the mean imputed dataset to find the standard errors for both your mean and standard deviation estimates.
*Your Answer Here*
3.5. Compare the standard errors between the two different types of imputation. Do they differ? If so what might be the cause of the difference?
*Your Answer Here*