```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(cvTools)
library(glmnet)
library(sandwich)
library(stargazer)
library(lmtest)
library(alr4)
library(sciplot)
```
## Model Comparison (5 points)
Use the `Puromycin` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Puromycin).
For this problem, you will run through various methods of model comparison - Adjusted R-squared, AIC, LRT, and LOOCV.
1. Create two models that predict `rate`. The first model will use `conc` as the predictor. The second model will use `conc` and `state` as predictors. Do not include any interactions or other transformations. Run each of your models through the `summary()` function.
```{r}
data(Puromycin)
# Write your code here
attach(Puromycin)
m1a<-lm(rate~conc)
m1b<-lm(rate~conc+state)
stargazer(m1a, m1b, type="text")
```
According to adjusted R-squared, the second model is preferred: it explains about 69% of the variance in the dependent variable, compared to about 63% for the first model.
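As a quick check, the adjusted R-squared values quoted above can be pulled directly from the `summary()` objects rather than read off the stargazer table:
```{r}
# Adjusted R-squared for each model, extracted from summary()
summary(m1a)$adj.r.squared
summary(m1b)$adj.r.squared
```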
2. Use the `AIC()` function to calculate the AIC for each model.
```{r}
AIC(m1a)
AIC(m1b)
```
The second model is also preferred according to AIC: its AIC of about 221 is lower than the first model's AIC of about 224.
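For reference, the AIC can be reproduced by hand from the log-likelihood as $AIC = 2k - 2\log L$, where $k$ is the number of estimated parameters (including the error variance); a minimal sketch for the first model:
```{r}
# AIC = 2k - 2*logLik; k is stored as the "df" attribute of logLik()
ll1 <- logLik(m1a)
2 * attr(ll1, "df") - 2 * as.numeric(ll1)
```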
3. Run a likelihood ratio test between the two models.
```{r}
lrtest(m1a, m1b)
```
The likelihood ratio test is significant (p = 0.03 < 0.05), so adding `state` significantly improves the fit and the second model is preferred.
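As a sketch of what `lrtest()` is doing, the test statistic and p-value can be reproduced from the two log-likelihoods (the larger model has one extra parameter, so 1 degree of freedom):
```{r}
# LR statistic: twice the difference in log-likelihoods, compared to chi-squared(1)
lr_stat <- as.numeric(2 * (logLik(m1b) - logLik(m1a)))
c(statistic = lr_stat, p_value = pchisq(lr_stat, df = 1, lower.tail = FALSE))
```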
4. Use the `cvFit` function within the `cvTools` package to run LOOCV for both models.
```{r}
# Setting K equal to the number of observations gives leave-one-out CV
cvFit(m1a, y = rate, K = nrow(Puromycin), data = Puromycin)
cvFit(m1b, y = rate, K = nrow(Puromycin), data = Puromycin)
```
The estimated prediction error is larger for the first model; therefore, the second model is preferred according to LOOCV as well.
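As an optional cross-check, the LOOCV error of a linear model can also be computed in closed form from the leverages (the leave-one-out "shortcut" formula); the small helper below is defined only for this check and should agree closely with the `cvFit()` output:
```{r}
# Closed-form LOOCV RMSE for OLS: residuals scaled by (1 - leverage)
loocv_rmse <- function(fit) sqrt(mean((residuals(fit) / (1 - hatvalues(fit)))^2))
loocv_rmse(m1a)
loocv_rmse(m1b)
```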
***
## Lasso (5 points)
Use the `swiss` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/swiss).
For this problem, you will use Lasso regression to predict `Fertility` based on all other predictors. You will need to utilize the `glmnet()` and `cv.glmnet()` functions within the `glmnet` package. See the Lecture Notes for details.
1. Create the design matrix using the `model.matrix()` function. This is used as one of the arguments for `glmnet()` when running Lasso.
```{r}
data(swiss)
# Write your code here
attach(swiss)
x <- model.matrix(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = swiss)[, -1]
```
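A quick, optional sanity check that the design matrix has one column per predictor, with the intercept column dropped:
```{r}
# Dimensions and column names of the lasso design matrix
dim(x)
colnames(x)
```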
2. Run Lasso using different values of $\lambda$ to determine which variable is the first to have its estimated slope coefficient go to 0. This may take multiple iterations. Remember to set the value of $\alpha$ to 1, as we did in the Lecture Notes.
```{r}
# Write your code here
# Fit the lasso once over a grid of lambda values and collect the
# estimated coefficients at each value of lambda.
lambdas <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 2, 6, 8, 10)
lasso_path <- glmnet(x, y = Fertility, alpha = 1, lambda = lambdas)
coeffsx <- as.matrix(coef(lasso_path))
colnames(coeffsx) <- paste0("lambda=", lasso_path$lambda)
round(coeffsx, 4)
```
Agriculture seems to be the first variable to have its estimated slope coefficient go to 0.
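A coefficient-path plot is a convenient way to double-check the order in which the coefficients reach 0; a minimal sketch using `glmnet`'s built-in plot method:
```{r}
# Coefficient paths: each curve is one predictor, and curves flatten to 0
# as log(lambda) increases (moving to the right)
lasso_path_full <- glmnet(x, y = Fertility, alpha = 1)
plot(lasso_path_full, xvar = "lambda", label = TRUE)
```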
3. At what (approximate) value of $\lambda$ do **all** of the estimated slope coefficients go to 0? An integer value is fine here!
```{r}
# Write your code here
print(coeffsx)
```
All of the estimated slope coefficients are 0 by approximately $\lambda = 10$.
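If an exact threshold is wanted, `glmnet`'s default path starts at the smallest $\lambda$ that sets every slope coefficient to 0, so its largest default $\lambda$ gives the precise cutoff:
```{r}
# Smallest lambda at which every slope coefficient is shrunk to exactly 0
max(glmnet(x, y = Fertility, alpha = 1)$lambda)
```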
4. Use the `cv.glmnet()` function to determine the optimal value for $\lambda$.
```{r}
# Write your code here
#Find the optimal value of lambda that minimizes the cross-validation error:
cv.lasso<-cv.glmnet(x, y=Fertility, alpha=1)
plot(cv.lasso)
cv.lasso$lambda.min
```
What value of $\lambda$ is returned? Note: This will vary from trial-to-trial (that's okay!).
$\lambda \approx 0.023$ on this run; the exact value varies from run to run because the cross-validation folds are chosen at random.
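For reference, `cv.glmnet()` also stores `lambda.1se`, the largest $\lambda$ whose cross-validation error is within one standard error of the minimum; it is a common, more conservative alternative to `lambda.min`:
```{r}
# More heavily regularized alternative to lambda.min
cv.lasso$lambda.1se
```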
5. Run lasso regression again using your optimal $\lambda$ from Part 4.
```{r}
# Write your code here
lasso2 <- glmnet(x, y=Fertility, alpha = 1, lambda = cv.lasso$lambda.min)
coef(lasso2)
```
What variables are in the model?
All five predictors remain in the model at the optimal $\lambda$; none of the slope coefficients is shrunk all the way to 0.
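For comparison (not required by the prompt), the unpenalized OLS coefficients can be placed next to the lasso coefficients above to see how much shrinkage this small $\lambda$ actually applies:
```{r}
# OLS coefficients on the same predictors, for comparison with the lasso fit
coef(lm(Fertility ~ ., data = swiss))
```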
***
## Variances (12 points)
Use the `stopping` data set from the `alr4` package. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/alr4/versions/1.0.5/topics/stopping).
For this problem, you will run through various methods of dealing with nonconstant variance - using OLS, WLS (assuming known weights), WLS (with a sandwich estimator), and bootstrap.
1. Create a scatterplot of `Distance` versus `Speed` (Y vs X).
```{r}
data(stopping, package = "alr4")
# Write your code here
attach(stopping)
scatterplot(Distance ~ Speed, data = stopping)  # Distance (Y) versus Speed (X)
```
2. Briefly explain why this graph supports fitting a quadratic regression model.
The points in the scatterplot follow a curved pattern: `Distance` rises more steeply at higher values of `Speed`, so a quadratic curve fits the trend better than a straight line.
3. Fit a quadratic model with constant variance (OLS). Remember to use the `I()` function to inhibit the squared term.
```{r}
# Write your code here
quadraticM<-lm(Distance~Speed+I(Speed^2))
stargazer(quadraticM, type="text", report="vcs*")
```
4. What are the standard errors for the two terms?
0.556 for the first-order (linear) term and 0.013 for the quadratic term.
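These values can be read off the stargazer table above, or extracted directly from the estimated covariance matrix of the OLS coefficients:
```{r}
# Standard errors are the square roots of the diagonal of vcov()
sqrt(diag(vcov(quadraticM)))
```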
5. Refit the quadratic model, but use WLS with the assumption that $Var(Distance|Speed) = Speed*\sigma^2$.
```{r, echo=FALSE, warning=FALSE}
# Write your code here
# Var(Distance|Speed) = Speed * sigma^2 implies weights proportional to 1/Speed
wlsQM <- lm(Distance ~ Speed + I(Speed^2), weights = 1/Speed)
stargazer(wlsQM, type = "text", report = "vcs*")
```
6. What are the standard errors for the two terms?
0.766 for the linear term and 0.016 for the quadratic term.