STA261: Assignment 2

Shahriar Shams

Winter, 2020

Submission deadline: April 18, 2020, 11.59pm (Toronto time) (Late submissions

will not be accepted)

This entire assignment should take 1 hour to 6 hours (max). I am giving you almost 15 days

to complete it in order to make it inclusive to students in different time zones and students

with accomodations and also considering the fact that you have other assignments for other

courses. So please start early and do not request for an extension.

Instructions on completing the assignment The goal of this assignment is not to test

you. Rather it’s my try to “force” you to study the materials of last few weeks (specially

Week-9 to Week-12). If you haven’t studied the materials of last few weeks, please study

them first at which point this assignment will seem like bunch of exercises.

Instructions on creating documents for submission

• Please create 3 separate pdfs (one for each question).

• I recommend using R-markdown(if you are familiar with it). If you are not familiar

with R-markdown, you can write your answers using Microsoft word and in the end

save it as pdfs. Pdf is the only acceptable format of files.

• Question 1(a) is the only part where you can submit a hand written answer. If you are

writing it by hand, take a picture of your answer paste it inside you word document or

attach it in R-markdown report. For any of the other questions, you can not submit

anything hand written, you need to type up.

• Use your judgement on formatting your answers. Make sure to attach any R-code that

you use either as part of appendices or as part of your actual report.

• We will use crowdmark for submission and grading. You will have to upload 3 separate

documents as your answers to 3 separate questions. Crowdmark links to upload your

documents will be emailed to you in couple of days.

Academic Integrity

Each student will work alone. If you need clarification on any of these questions, you are

allowed to ask questions on Piazza. Don’t ask for solutions to anyone. Do not share your

codes or answers on any platform.

1

Question 1 [12 points] (Relates to Likelihood Ratio Test)

(a) Suppose X1, X2, ..., Xn iid∼ N(µ, σ2) where σ2 is known and µ is unknown.

We want to test H0 : µ = µ0 vs H1 : µ 6= µ0 at the level of significance, α.

Here, we know that X¯ is the maximum likelihood estimator of µ (ie. µˆ = X¯).

Option-1: we can do a z-test with the test statistic

Z = X¯ − µ0

σ/

√

n

Option-2: we can do a likelihood ratio test with the test statistic (lets call it W )

W = −2logL(µ0)

L(µˆ)

Show that W = Z2

The goal of part (b) and (c) (below) is to "see" the distribution of the test statistic, W

(defined in part(a)) under H0. In other words, we want to see, if H0 is really true, what will

be the distribution of W . We will do this under two scenarios.

(b) Suppose X1, X2, ..., X10 iid∼ N(µ, σ2 = 9). Treat σ2 = 9 as the known constant.

We want to test H0 : µ = 5 vs H1 : µ 6= 5 at level of significance, α.

(i) Write a function in R that

• generates 10 samples from a N(µ = 5, σ2 = 9) distribution

• evaluates the likelihood function at µ = 5 (save it under the name L_theta0)

• evaluates the likelihood function at µ = x¯ (save it under the name L_theta1)

• calculates and returns −2 ∗ log(L_theta0/L_theta1)

(ii) Run this function (100000 times) using the replicate() command (or something similar)

and save the output under the name LRT_vec.

(iii) Plot a density histogram using LRT_vec.

code hint: use hist() with options freq=FALSE, breaks=100.

(iv) Overlay a χ2(df=1) density curve on top of this histogram.

code hint: generate 100000 random samples from a χ2(df=1), use density() and lines()

2

(c) (we will repeat the process of part(b) but with a different distribution here)

Suppose X1, X2, ..., X10 iid∼ Pois(λ).

We want to test H0 : λ = 5 vs H1 : λ 6= 5 at level of significance, α.

(i) Write a function in R that

• generates 10 samples from a Pois(λ = 5) distribution

• evaluates the likelihood function at λ = 5 (save it under the name L_theta0)

• evaluates the likelihood function at λ = x¯ (save it under the name L_theta1)

• calculates and returns −2 ∗ log(L_theta0/L_theta1)

(ii) Run this function (100000 times) using the replicate() command (or something similar)

and save the output under the name LRT_vec.

(iii) Plot a density histogram using LRT_vec.

(iv) Overlay a χ2(df=1) density curve on top of this histogram.

(d) In both part (b) and (c), your histograms should match(almost if not completely) with

the χ2(df=1) density. Make a brief comment on what role you expect the sample size to play in

the closeness of the histograms and the density. (In other words, do you expect these type of

closeness irrespective of the value of n?)

3

Question-2 [6 points] (Bayesian Inference)

Data: According to Public Health Canada, as of April 03, there are 295,065 Canadians who

have been tested for COVID-19 and 12,784 of them tested positive.

Prior: Public health officials believe that there is a 95% chance that the proportion of

COVID-19 cases in Canada (θ) is between 0.011 and and 0.065.

A little try and error approach in R (using function qbeta()) suggests that the 2.5th and

97.5th percentiles of a Beta(5,150) distribution are approximately 0.011 and 0.065.

(The outputs of (a), (b) and (c) are numeric numbers)

(a) Calculate the mean of θ (expected proportion) only using the prior distribution. (Let’s

call it the prior mean)

(b) Calculate the observed proportion only using the data. (Let’s call it the sample mean)

(c) Calculate the mean of θ using the posterior distribution of θ. (Let’s call it the posterior

mean)

(d) As a statistician if you are asked to pick any of these three numbers and forward it to

the health ministry to help with decision making (though the numbers are very close),

which one will you pick? Why do you think your pick is better than the other two

options.

(There is no right or wrong answer here, you can pick any of these you like, briefly

justify your preference)

4

Question-3 [12 points] (Regression Analysis by hand)

For this question you are not allowed to use the lm() command in R or the equivalent of lm()

in python or other software. I want you to “manually” calculate everything and show every

calculation. If you want to use a software, only use it as a calculator. So if you are using R,

you may only use mean, sum, qt(), qchisq() and of course +,-,*,/. (Similar restrictions for

Python, excel or others).

The data set will be different for each of you. The following little R code will generate

the data for you. Copy this code, paste it in R. Change only the line where it says

“student_id=261”. Remove the number 261 and write your student id there. Now

highlight and run the whole thing and it will give you two sets of numbers under the names

x and y.

Those numbers will be the one that you will use to answer the questions on the next page.

(The grader will rerun the code with your student id and check your numbers, you will get a

zero for the entire question if they don’t match).

# Only change this following line, remove 261 and put your student id

student_id=261

# do not change anything below

set.seed(student_id)

x= round(rnorm(15,mean=18,sd=4),2)

y= round(50+1.5*x+rnorm(15, mean=0, sd=5),2)

x

y

Congrats! You have successfully generated your data.

5

Note: It is possible to complete the rest of the questions (a-h) without using any software.

Suppose we are fitting a regression model

Yi|X = xi ∼ N(β1 + β2xi, σ2)

X represents number of hours studied during the week before the final exam of STA261. Y

represents the score on the final exam.

Yi’s are independent. We have 15 observations (the data that we generated on the previous

page). Think we have observed data from 15 students.

Show detailed calculation for each of the followings

(a) Calculate the maximum likelihood estimates of β1 and β2 (Let’s call them b1 and b2)

(b) Interpret b1 and b2

(c) Construct a 95% confidence interval for β2

(d) At 5% level of significance, test H0 : β2 = 1.5

(e) Calculate an estimate of σ2 using an unbiased estimator.

(f) Complete the following ANOVA table

Source df Sum of Square (SS) Mean SS = SS/df F= Mean SS for XMean SS for error

X ? ? ? ?

Error ? ? ? -

Total ? ? - -

(g) Compute and interpret the coefficient of determination (R2)

提供高质量的essay代写,Paper代写,留学作业代写-天才代写
(h) At 5% level of significance, test H0 : σ2 = 25