Assignment 3/4
This is a double assignment, split into two parts. The first part covers double lasso and DAGs, while the second part covers bootstrapping and decision trees.
Part 1: Double Lasso and DAGs (20 points)
-
Consider the US census data from the year 2015 to analyse the effect of college graduate (clg) status
and its interaction effects with gender (sex), location (mw, so, we, ne), and both jointly on wage. All
other variables denote other socio-economic characteristics, e.g. marital status, occupation, and
experience.
-
Generate the dataset with all the two-way interactions between variables. Make sure that the categorical variables are properly transformed to dummies. Also, note that the resulting dataset contains the treatment and its interactions with the other variables of interest, so you don't need to generate them separately. (2 pts)
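A minimal sketch of this step on toy data, not the census file. The helpers `one_hot` and `two_way_interactions` and the toy values are assumptions made for illustration. The sketch dummy-encodes a categorical column (dropping one baseline level) and then appends the product of every pair of columns.

```python
from itertools import combinations

import numpy as np


def one_hot(values, name):
    """Dummy-encode a categorical array, dropping the first level as baseline."""
    levels, codes = np.unique(values, return_inverse=True)
    dummies = np.eye(len(levels))[codes][:, 1:]
    names = [f"{name}_{lvl}" for lvl in levels[1:]]
    return dummies, names


def two_way_interactions(X, names):
    """Append the product of every pair of distinct columns to X."""
    pairs = list(combinations(range(X.shape[1]), 2))
    inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    inter_names = [f"{names[i]}:{names[j]}" for i, j in pairs]
    return np.hstack([X, inter]), names + inter_names


# Toy data (hypothetical values, not the census sample).
region = np.array(["mw", "so", "we", "ne", "so", "mw"])
clg = np.array([1, 0, 1, 1, 0, 0], dtype=float)
sex = np.array([0, 1, 1, 0, 1, 0], dtype=float)

region_d, region_names = one_hot(region, "region")
X = np.column_stack([clg, sex, region_d])
names = ["clg", "sex"] + region_names
X_full, full_names = two_way_interactions(X, names)
print(X_full.shape, full_names)
```

Products of two dummies that come from the same categorical variable are identically zero, so those columns can be dropped before estimation.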
-
Use the double lasso technique to find the effect of the treatment and its relevant interactions on
the wage. To tune the penalization parameter in the lasso step, cross-validate it. (4 pts)
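One common variant of double lasso is partialling-out: run a cross-validated lasso of the outcome on the controls, another of the treatment on the controls, and then regress the first set of residuals on the second. The sketch below illustrates that variant on simulated data, assuming scikit-learn's `LassoCV`; the function name `double_lasso` and the data-generating process are assumptions, and the course may have used a different variant (for example double selection).

```python
import numpy as np
from sklearn.linear_model import LassoCV


def double_lasso(y, d, W, cv=5):
    """Partialling-out double lasso for one parameter of interest.

    Returns the point estimate and a heteroskedasticity-robust (HC0)
    standard error.
    """
    # Step 1: lasso of the outcome on the controls, keep the residuals.
    res_y = y - LassoCV(cv=cv).fit(W, y).predict(W)
    # Step 2: lasso of the treatment on the controls, keep the residuals.
    res_d = d - LassoCV(cv=cv).fit(W, d).predict(W)
    # Step 3: OLS of outcome residuals on treatment residuals.
    theta = float(res_d @ res_y / (res_d @ res_d))
    eps = res_y - theta * res_d
    se = float(np.sqrt(np.sum(res_d**2 * eps**2)) / np.sum(res_d**2))
    return theta, se


# Simulated data (assumption): true effect is 1.5, W[:, 0] is a confounder.
rng = np.random.default_rng(0)
n, p = 500, 50
W = rng.normal(size=(n, p))
d = W[:, 0] + rng.normal(size=n)
y = 1.5 * d + 2.0 * W[:, 0] + W[:, 1] + rng.normal(size=n)

theta, se = double_lasso(y, d, W)
print(f"estimate = {theta:.3f}, 95% CI = "
      f"[{theta - 1.96 * se:.3f}, {theta + 1.96 * se:.3f}]")
```

With several parameters of interest (the treatment and each of its interactions), the same three steps can be repeated for each one, treating the remaining columns as controls.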
-
Report a summary of the estimation of the parameters of interest. (2 pts)
-
Interpret your results. In which group does college graduate status have the largest impact on the
wage? (2 pts)
-
For the following examples, draw a coherent Directed Acyclic Graph and indicate the confounders, colliders and the proper controls (if they exist).
-
You are trying to study the effect of youth smoking on lung function. Your dataset contains the
following variables: (5 pts)
i. Individual smoking behavior (Treatment)
ii. Forced respiratory volume (Outcome)
iii. Age
iv. Height
v. Sex
-
You are trying to study the effect of breastfeeding on the number of infections a baby is likely to
have. Your dataset contains the following variables: (5 pts)
i. Breast fed (Treatment)
ii. Number of infections of the baby (Outcome)
iii. Marital status
iv. Family income
v. Education
vi. Number of children in the house
vii. Childcare outside the home
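Once a candidate graph is drawn, the roles of its nodes can be checked mechanically. The sketch below uses a generic toy graph with hypothetical nodes `T`, `Y`, `Z`, `C` (it is not a proposed answer to either example) and simplified definitions: a confounder is a common cause of treatment and outcome, with a path to the outcome that does not pass through the treatment, and a collider is a common effect of both.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def classify(dag, treatment, outcome):
    """Return (confounders, colliders) under the simplified definitions."""
    nodes = set(dag) | {c for children in dag.values() for c in children}
    others = nodes - {treatment, outcome}
    # Copy of the graph without the treatment node, used to find paths
    # into the outcome that do not run through the treatment.
    no_t = {k: [c for c in v if c != treatment]
            for k, v in dag.items() if k != treatment}
    confounders = {z for z in others
                   if treatment in descendants(dag, z)
                   and outcome in descendants(no_t, z)}
    colliders = {z for z in others
                 if z in descendants(dag, treatment)
                 and z in descendants(dag, outcome)}
    return confounders, colliders


# Toy DAG as an adjacency mapping parent -> children (hypothetical).
toy_dag = {"Z": ["T", "Y"], "T": ["Y", "C"], "Y": ["C"]}
confounders, colliders = classify(toy_dag, "T", "Y")
print("control for:", confounders, "| do not control for:", colliders)
```

In this toy graph `Z` opens a backdoor path and is a proper control, while conditioning on `C` would induce a spurious association between `T` and `Y`.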
Part 2: Bootstrapping and Decision Trees (20 points)
-
Consider the Hitters dataset provided by the ISLR package. This dataset contains several features of
Major League Baseball hitters from the 1986 and 1987 seasons.
-
Generate the dataset such that the categorical variables are transformed to dummies. Make sure you
drop the missing observations if there are any. (2 pts)
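A minimal sketch of this cleaning step, assuming pandas and a small toy frame with made-up values in place of the real dataset (which has many more columns). Rows with missing values are dropped first, then string columns are converted to dummies with one baseline level removed.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for the real data.
df = pd.DataFrame({
    "Hits": [81, 130, 141, 87],
    "League": ["A", "N", "A", "N"],
    "Salary": [np.nan, 475.0, 480.0, 500.0],
})

# Drop incomplete rows, then dummy-encode the categorical columns.
clean = pd.get_dummies(df.dropna(), drop_first=True, dtype=float)
print(clean)
```

Dropping the first level of each categorical variable avoids perfect collinearity with the intercept in the OLS step.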
-
Divide the sample into two sets: training (90%) and testing (10%). (2 pts)
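A sketch of a random 90/10 split using index permutation; the helper name `split_indices`, the seed, and the sample size of 200 are assumptions for illustration.

```python
import numpy as np


def split_indices(n, test_share=0.10, seed=0):
    """Return (train_idx, test_idx) from a random permutation of range(n)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(test_share * n))
    return perm[n_test:], perm[:n_test]


train_idx, test_idx = split_indices(200)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, so both models are later evaluated on the same test observations.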
-
Fit an OLS regression to predict the salary of the hitters using all the features of your dataset and
provide bootstrap confidence intervals. Follow these steps:
-
Calculate the OLS point estimate $\hat{\beta}$ using the training set. (2 pts)
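A sketch of the point estimate via least squares, on simulated training data (the coefficients and sample size are assumptions). The helper `ols` adds an intercept as the first column.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients; the first entry is the intercept."""
    Xc = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta


# Simulated training data (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (1.0 + X_train @ np.array([2.0, -1.0, 0.5])
           + rng.normal(scale=0.1, size=200))

beta_hat = ols(X_train, y_train)
print(beta_hat)
```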
-
Use a loop to generate 10 000 bootstrap estimates. That is, 10 000 times, sample the pairs $\left(y_i,X_i\right)_{i=1}^{N _ {train}}$ with replacement and, for each sample, estimate the vector of parameters $\hat{\beta}$. You must end up with an array of size (10 000, # features) that contains the sequence ${\hat{\beta}} _ {boots}=\left(\hat{\beta}^{(1)},\hat{\beta}^{(2)},\dots,\hat{\beta}^{(10 000)}\right)$, so that each row is a bootstrapped vector of $\hat{\beta}$. (2 pts)
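A sketch of the pairs bootstrap loop on simulated data. The function name `bootstrap_ols` is an assumption, and the usage line draws only 500 replications to keep the example quick; the default matches the 10 000 asked for. Because the sketch adds an intercept, the returned array has one more column than there are features.

```python
import numpy as np


def bootstrap_ols(X, y, n_boot=10_000, seed=0):
    """Pairs bootstrap: one row of OLS coefficients per resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])
    betas = np.empty((n_boot, Xc.shape[1]))
    for b in range(n_boot):
        # Resample row indices with replacement, keeping (y_i, X_i) together.
        idx = rng.integers(0, n, size=n)
        betas[b] = np.linalg.lstsq(Xc[idx], y[idx], rcond=None)[0]
    return betas


# Simulated training data (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)

beta_boots = bootstrap_ols(X, y, n_boot=500)
print(beta_boots.shape)
```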
-
Calculate the 95% confidence intervals ${\hat{\beta}} _ {boots}^{lower}$ and ${\hat{\beta}} _ {boots}^{upper}$ using the empirical approach, which reflects the bootstrap deviations $\hat{\beta}^{(b)} - \hat{\beta}$ around the point estimate. These are defined as follows:
$$\hat{\beta} _ {boots}^{lower} = 2\hat{\beta} - \hat{\beta} _ {boots}^{97.5}$$
$$\hat{\beta} _ {boots}^{upper} = 2\hat{\beta} - \hat{\beta} _ {boots}^{2.5}$$
where $\hat{\beta} _ {boots}^{\alpha}$ is the $\alpha$-th percentile of the $\hat{\beta} _ {boots}$ distribution. (2 pts)
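A sketch of the empirical interval, under the assumption that the percentiles are taken over the bootstrap deviations $\hat{\beta}^{(b)} - \hat{\beta}$. Subtracting those percentiles from the point estimate is algebraically the same as taking $2\hat{\beta}$ minus the percentiles of the bootstrap draws themselves. The function name `empirical_ci` and the simulated draws are assumptions.

```python
import numpy as np


def empirical_ci(beta_hat, beta_boots, level=0.95):
    """Empirical (basic) bootstrap interval, computed column by column."""
    alpha = 100 * (1 - level) / 2
    dev = beta_boots - beta_hat  # bootstrap deviations from the estimate
    lower = beta_hat - np.percentile(dev, 100 - alpha, axis=0)
    upper = beta_hat - np.percentile(dev, alpha, axis=0)
    return lower, upper


# Simulated bootstrap draws around a hypothetical point estimate.
rng = np.random.default_rng(0)
beta_hat = np.array([1.0, 2.0])
beta_boots = beta_hat + rng.normal(scale=[0.1, 0.2], size=(5000, 2))

lower, upper = empirical_ci(beta_hat, beta_boots)
print(np.column_stack([lower, beta_hat, upper]))
```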
-
Calculate the out-of-sample mean squared error of the model.
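A sketch of the out-of-sample error: fit on the training rows only, then average the squared prediction errors on the held-out rows. The data are simulated and the 90/10 cut by position is a simplifying assumption.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared prediction error."""
    return float(np.mean((y_true - y_pred) ** 2))


# Simulated data with an intercept column (assumption).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)

X_train, X_test = X[:90], X[90:]
y_train, y_test = y[:90], y[90:]

# Estimate on the training set only, evaluate on the test set only.
beta_hat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
test_mse = mse(y_test, X_test @ beta_hat)
print(f"test MSE = {test_mse:.3f}")
```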
-
Fit a regression tree to predict the salary using all the features of your dataset. Follow these steps:
-
Using the training data, fit a tree and prune it. To choose the pruning parameter, cross-validate it as we did in class. (4 pts)
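One way to cross-validate the pruning parameter is cost-complexity pruning: take the candidate penalties from the pruning path of a fully grown tree and pick the one with the lowest cross-validated error. The sketch below assumes scikit-learn's `DecisionTreeRegressor` and `GridSearchCV` on simulated data; the procedure used in class may differ.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Simulated training data with a single true split (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(X[:, 0] > 0, 3.0, -3.0) + rng.normal(scale=0.5, size=300)

# Candidate penalties from the cost-complexity path of a full tree.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas.clip(min=0.0))

# 5-fold cross-validation over the candidate penalties.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
pruned_tree = search.best_estimator_
print(search.best_params_, pruned_tree.get_n_leaves())
```

`GridSearchCV` refits the best tree on the full training data, so `pruned_tree` can be used directly for test-set predictions.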
-
Calculate the out-of-sample mean squared error of the model. (2 pts)
-
Which model performs better in terms of predictive accuracy? (2 pts)