Data: wine_quality_missing.csv |  Download Notebook

Contents

Harvard University
Fall 2018
Instructors: Rahul Dave
**Due Date: ** Saturday, September 15th, 2018 at 11:59pm

Instructions:

import numpy as np
import scipy.stats
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib import cm
import pandas as pd
%matplotlib inline

Question 1: Don’t Be Sensitive, We’re Looking For False Positivity

Coding not required

As the U.S. aims to increase early identification and treatment of people with HIV, a greater focus has been placed on determining the accuracy of HIV tests in real-world settings in order to better identify individuals during the early (acute) stages of infection when transmission risk is especially high. In order to quantify this, researchers from the University of California, San Francisco conducted a review of over 21,234 HIV tests given between the years 2003 and 2008 in some of the city’s highest prevalence populations. Two of the tests included in the study – the OraQuick Advance Blood Rapid Antibody Test (a 3rd generation fingerstick blood test we’ll henceforth call the BRT) and the OraQuick Advance Saliva Rapid Antibody Test (a 3rd generation saliva test we’ll henceforth call the SRT) – can be considered Rapidly Administered tests. According to Wikipedia, the overall prevalence of adult HIV in the United States is 0.3%

You and your partner decide to undergo tests for HIV and the two of you are administered the SRT at an HIV screening clinic. You’re aware that the test you’ve been administered (one of the tests featured in the study) has a sensitivity (true positive rate) of 86.6% and a specificity (true negative rate) of 98.6%.

1.1. In the unfortunate event that your partner tests positive what is the probability that he/she has HIV?

1.2. Should you be concerned enough to ask for further testing or treatment for your partner? Justify your answer quantitatively.

1.3. The medical staff at the screening clinic decides to independently administer the BRT as it is the only other test they can administer on site. Based on the UCSF study above, you’re aware that this test has a sensitivity (true positive rate) of 91.9% and a specificity of 99.6%. What is the probability that your partner has the HIV if both tests come up positive (Make the assumption that the independent administration means that the tests are independent of each other)?

It turns out the screening clinic didn’t have the second test and you and your partner were only able to be administered the SRT. However, on your way home you’re reminded that your partner has spent almost all his/her life in the Bahamas where the prevalence of adult HIV is 3.3%.

1.4. What is the probability that your Bahamian partner has HIV given that he/she received a positive test result on the SRT?

1.5. How does the new information quantifiably affect your decision of whether you should be concerned enough to ask for treatment or testing for your partner?

Make sure to justify your decisions using your calculations.

Solution:

------------------------

1.1. In the unfortunate event that your partner tests positive what is the probability that he/she has HIV?


*Your Answer Here*
------------------------

1.2. Should you be concerned enough to ask for further testing or treatment for your partner? Justify your answer quantitatively.


*Your Answer Here*
------------------------

1.3. The medical staff at the screening clinic decides to independently administer the BRT as it is the only other test they can administer on site. Based on the UCSF study above, you're aware that this test has a sensitivity (true positive rate) of 91.9% and a specificity of 99.6%. What is the probability that your partner has the HIV if both tests come up positive (Make the assumption that the independent administration means that the tests are independent of each other)?


*Your Answer Here*
------------------------

1.4. What is the probability that your Bahamian partner has HIV given that he/she received a positive test result on the SRT?


*Your Answer Here*
------------------------

1.5. How does the new information quantifiably affect your decision of whether you should be concerned enough to ask for treatment or testing for your partner?


*Your Answer Here*
------------------------

Question 2: Visualize Yourself Missing Your Data

Some Coding required

Missing data is a very important topic in statistics and machine learning and we may touch upon it a few times throughout the course. Let us begin to explore how to handle missing data in our analysis. We’ll be working with a dataset from the UCI Machine Learning Repository that uses a variety of wine chemical predictors to classify wines grown in the same region in Italy. Each line represents 13 (mostly chemical) predictors of the response variable wine class, including things like alcohol content, hue, and phenols. Unfortunately some of the predictor values were lost in measurement. You’ll find the dataset in a file called wine_quality_missing.csv.

2.1. Read the data in the wine_quality_missing.csv into a pandas dataframe and store the dataframe in the variable wine_df. How many observations are in the dataset?

2.2. One way to handle missing data is to just totally ignore it by removing any rows that have any missing values. This is called drop imputation. Use drop imputation on our wine quality dataframe and store the resulting dataframe in the variable wine_drop. How many observations does the drop imputed dataset have?

2.3. Visualize using a normed histogram the values of Ash predictor in the drop imputed dataframe. Overlay a fitted normal distribution on your plot. Make sure to title the plot, include a legend and label axes.

2.4. What are the mean and standard deviation of the values of the Ash feature in your drop imputed dataset?

2.5. Another way to handle missing data is to replace any missing value with the mean of the non-missing values in that column. This is called mean imputation. How many rows does our mean imputed dataset have?

2.6. Visualize using a normed histogram the values of Ash predictor in the mean imputed dataframe. Overlay a fitted normal distribution on your plot. Make sure to title the plot, include a legend and label axes.

2.7. What are the mean and standard deviation of the Ash predictor values in your mean imputed dataset? How do they compare to the mean and standard deviation of the drop imputed dataset?

Part 3: Walk Softly and Carry a Broken Stick

Some Coding required

3.1 A well-known probability problem, the Broken Stick Problem can be formulated as follows. A stick is broken up at two points, chosen at random along its length. Show that the probability that the pieces obtained form a triangle is 1/4. Write a function simulate_broken_stick that runs takes an integer parameter n_sims, runs that many simulations of the broken stick problem and returns a floating point number representing an estimate of the probability that the pieces of a stick randomly broken in two places form a triangle.

Hints:

  1. Use either np.random.rand or scipy.stats.uniform.rvs to simulate stick-breaking
  2. Think about the triangle inequality

3.2 Run your function simulate_broken_stick with n_sims set to be equal to 50 and 1000. What are your two estimates of the solution to the broken stick problem?

3.3 (Optional) Provide an analytic proof of the Broken Stick Problem.

Hints:

  1. There’s a very nice geometric proof taking advantage of Viviani’s Theorem. Argue that every point in the equilateral triangle with height equal to the length of the stick maps to a particular breaking of the stick. Divide the equilateral triangle into subtriangles by joining the midpoints. What does it mean for a point to be outside the central equilateral subtriangle?
  2. There’s a more straightforward proof treating the breaking of the stick as joint uniform random variables. What conditions does the triangle inequality impose on the joint random variable? How do you calculate the probability that the joint uniform meets those conditions?

Solution:

3.1 A well-known probability problem, the **Broken Stick Problem** can be formulated as follows. *A stick is broken up at two points, chosen at random along its length. Show that the probability that the pieces obtained form a triangle is 1/4.* Write a function `simulate_broken_stick` that runs takes an integer parameter `n_sims`, runs that many simulations of the broken stick problem and returns a floating point number representing an estimate of the probability that the pieces of a stick randomly broken in two places form a triangle.

def simulate_broken_stick(n_sims=500):
    
    # define your probability estimate and initialize it to 0
    probability_estimate = 0
    
    # run your simulation and calculate probability_estimate
    
    # YOUR CODE HERE
    
    return probability_estimate

3.2 Run your function `simulate_broken_stick` with `n_sims` set to be equal to 50 and 1000. What are your two estimates of the solution to the broken stick problem?

estimate_50 = simulate_broken_stick(n_sims=50)  # calculate estimate for 50 simulations
estimate_1000 = simulate_broken_stick(n_sims=1000)  # calculate estimate for 1000 simulations
## Make sure you present your answers somewhere here

3.3 (Optional) Provide an analytic proof of the **Broken Stick Problem**.


*Your Answer Here*