
Group Assignment 3/4 2024 - 2 #1114

@jgoicochea

Assignment 3/4

This is a double assignment, split into two parts. The first part covers double lasso and DAGs, while the second part covers bootstrapping and decision trees.

Part 1: Double Lasso and DAGs (20 points)

  1. Consider the US census data from 2015 to analyse the effect of college graduate (clg) status
    and its interactions with gender (sex), with location (mw, so, we, ne), and with both jointly, on wage. All
    other variables denote other socio-economic characteristics, e.g. marital status, occupation, and
    experience.

    • Generate the dataset with all the two-way interactions between variables. Make sure that the categorical variables are properly transformed to dummies. Note that the resulting dataset already contains the treatment and its interactions with the other variables of interest, so you don’t need to generate them separately. (2 pts)

    • Use the double lasso technique to find the effect of the treatment and its relevant interactions on
      the wage. Tune the penalization parameter in each lasso step by cross-validation. (4 pts)

    • Report a summary of the estimation of the parameters of interest. (2 pts)

    • Interpret your results. For which group does college graduate status have the largest impact on
      the wage? (2 pts)
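
The double lasso steps above can be sketched as follows. This is only an illustration on synthetic data, not the census dataset: the variable names (`base`, `d`, `y`), the helper `double_lasso_effect`, and the use of the partialling-out variant with scikit-learn's `LassoCV` are all assumptions, and the target regressor here is continuous rather than a dummy. For several parameters of interest (the treatment and each of its interactions), the same routine would be repeated with each one in turn as `d` and the remaining columns as controls.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures


def double_lasso_effect(y, d, X, cv=5, seed=0):
    """Partialling-out double lasso for a single target regressor d (hypothetical helper)."""
    # Step 1: lasso of the outcome on the controls, penalty chosen by cross-validation
    res_y = y - LassoCV(cv=cv, random_state=seed).fit(X, y).predict(X)
    # Step 2: lasso of the target regressor on the controls, same tuning
    res_d = d - LassoCV(cv=cv, random_state=seed).fit(X, d).predict(X)
    # Step 3: OLS of the residualized outcome on the residualized regressor
    beta = (res_d @ res_y) / (res_d @ res_d)
    # Heteroskedasticity-robust (HC0) standard error for the slope
    eps = res_y - beta * res_d
    se = np.sqrt(np.sum(res_d**2 * eps**2)) / np.sum(res_d**2)
    return beta, se


# Synthetic data (assumption): 5 base controls expanded with all two-way interactions
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=(n, 5))
X = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(base)

d = 0.5 * base[:, 0] + rng.normal(size=n)                         # regressor of interest
y = 1.0 * d + base[:, 0] - 0.5 * base[:, 1] + rng.normal(size=n)  # true effect is 1.0

beta_hat, se_hat = double_lasso_effect(y, d, X)
print(f"estimate = {beta_hat:.3f}, robust se = {se_hat:.3f}")
```

With real data, the categorical columns would first be turned into dummies (for example with `pandas.get_dummies`) before building the interactions.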

  2. For the following examples, draw a coherent Directed Acyclic Graph and indicate the confounders, colliders and the proper controls (if they exist).

    • You are trying to study the effect of youth smoking on lung function. Your dataset contains the
      following variables : (5 pts)

      i. Individual smoking behavior (Treatment)
      ii. Forced respiratory volume (Outcome)
      iii. Age
      iv. Height
      v. Sex

    • You are trying to study the effect of breastfeeding on the number of infections a baby is likely to
      have. Your dataset contains the following variables : (5 pts)

      i. Breast fed (Treatment)
      ii. Number of infections of the baby (Outcome)
      iii. Marital status
      iv. Family income
      v. Education
      vi. Number of children in the house
      vii. Childcare outside the home
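
As an aid for checking a hand-drawn graph, the sketch below encodes a DAG as a dictionary of edges, verifies that it is acyclic, and flags confounders and colliders. The node names (`Z`, `D`, `Y`, `C`) are a hypothetical toy example, not a solution to either exercise, and the definitions used are deliberately simplified: a confounder is taken to be a common cause of treatment and outcome, and a collider a common effect of both.

```python
from graphlib import CycleError, TopologicalSorter


def descendants(edges, node, blocked=frozenset()):
    """Nodes reachable from `node` along directed edges, never entering `blocked` nodes."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in edges.get(current, ()):
            if child not in seen and child not in blocked:
                seen.add(child)
                stack.append(child)
    return seen


def is_acyclic(edges):
    """True if the directed graph has no cycle."""
    nodes = set(edges) | {c for children in edges.values() for c in children}
    # TopologicalSorter expects a mapping node -> set of predecessors
    preds = {n: set() for n in nodes}
    for parent, children in edges.items():
        for child in children:
            preds[child].add(parent)
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False


def classify(edges, treatment, outcome):
    """Return (confounders, colliders) under the simplified definitions above."""
    if not is_acyclic(edges):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    nodes = set(edges) | {c for children in edges.values() for c in children}
    others = nodes - {treatment, outcome}
    # Common cause: reaches the treatment without passing through the outcome,
    # and reaches the outcome without passing through the treatment
    confounders = {
        n for n in others
        if treatment in descendants(edges, n, blocked={outcome})
        and outcome in descendants(edges, n, blocked={treatment})
    }
    # Common effect: reached from the treatment (not via the outcome) and from the outcome
    colliders = {
        n for n in others
        if n in descendants(edges, treatment, blocked={outcome})
        and n in descendants(edges, outcome)
    }
    return confounders, colliders


# Toy graph (assumption): Z -> D, Z -> Y, D -> Y, D -> C, Y -> C
toy = {"Z": ["D", "Y"], "D": ["Y", "C"], "Y": ["C"]}
confounders, colliders = classify(toy, treatment="D", outcome="Y")
print("confounders:", confounders, "colliders:", colliders)
```

In this toy graph, `Z` would be a proper control, while conditioning on `C` would open a non-causal path between `D` and `Y`.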

Part 2: Bootstrapping and Decision Trees (20 points)

  1. Consider the Hitters dataset provided by the ISLR package. This dataset contains several features of
    Major League Baseball hitters from the 1986 and 1987 seasons.

    • Generate the dataset such that the categorical variables are transformed to dummies. Make sure you
      drop the observations with missing values, if there are any. (2 pts)

    • Divide the sample into two sets: training (90%) and testing (10%). (2 pts)

    • Fit an OLS regression to predict the salary of the hitters using all the features of your dataset and
      provide bootstrap confidence intervals. Follow these steps:

      • Calculate the OLS point estimate using the training set $\hat{\beta}$. (2 pts)

      • Use a loop to generate 10 000 bootstrap estimates. That is, sample 10 000 times pairs $\left(y_i,X_i\right)_{i=1}^{N _ {train}}$ with replacement and, for each, estimate the vector of parameters $\hat{\beta}$. You must end up with an array of size (10 000,# features) that contains the sequence ${\hat{\beta}} _ {boots}=\left(\hat{\beta}^{(1)},\hat{\beta}^{(2)},\dots,\hat{\beta}^{(10 000)}\right)$ so each row is a bootstrapped vector of $\hat{\beta}$. (2 pts)

      • Calculate the 95% confidence intervals ${\hat{\beta}} _ {boots}^{lower}$ and ${\hat{\beta}} _ {boots}^{upper}$ using the empirical approach. These are defined as follows:

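        $$\hat{\beta} _ {boots}^{lower} = 2\hat{\beta} - \hat{\beta} _ {boots}^{97.5}$$

        $$\hat{\beta} _ {boots}^{upper} = 2\hat{\beta} - \hat{\beta} _ {boots}^{2.5}$$

        where $\hat{\beta} _ {boots}^{\alpha}$ is the $\alpha$-th percentile of the $\hat{\beta} _ {boots}$ distribution. Equivalently, each bound is $\hat{\beta}$ minus the corresponding percentile of the centered deviations $\hat{\beta}^{(b)} - \hat{\beta}$. (2 pts)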

      • Calculate the out-of-sample mean squared error of the model.
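
The bootstrap steps above can be sketched with NumPy alone. This is a minimal illustration on synthetic data rather than on Hitters: the helper names (`ols`, `pairs_bootstrap`), the simulated design, and the reduced number of draws (1 000 instead of 10 000, to keep the example fast) are assumptions. The interval is built in the standard empirical (basic) bootstrap form, that is, $\hat{\beta}$ minus the percentiles of the centered deviations $\hat{\beta}^{(b)} - \hat{\beta}$.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients by least squares; X is assumed to already contain a constant column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def pairs_bootstrap(X, y, n_boot=1000, seed=0):
    """Resample (y_i, X_i) pairs with replacement and re-estimate OLS each time."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    betas = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        betas[b] = ols(X[idx], y[idx])
    return betas  # shape (n_boot, number of features)


# Synthetic data (assumption): a constant plus 3 regressors, 90/10 train/test split
rng = np.random.default_rng(1)
n_train, n_test = 180, 20
n_total = n_train + n_test
X_all = np.column_stack([np.ones(n_total), rng.normal(size=(n_total, 3))])
y_all = X_all @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n_total)
X_train, y_train = X_all[:n_train], y_all[:n_train]
X_test, y_test = X_all[n_train:], y_all[n_train:]

beta_hat = ols(X_train, y_train)                 # point estimate on the training set
beta_boots = pairs_bootstrap(X_train, y_train)   # one bootstrapped vector per row

# Percentiles of the bootstrap distribution, computed coefficient by coefficient
q_lo, q_hi = np.percentile(beta_boots, [2.5, 97.5], axis=0)
ci_lower = 2 * beta_hat - q_hi   # beta_hat - (q_hi - beta_hat)
ci_upper = 2 * beta_hat - q_lo   # beta_hat - (q_lo - beta_hat)

# Out-of-sample mean squared error of the OLS fit
mse_ols = np.mean((y_test - X_test @ beta_hat) ** 2)
print(np.column_stack([ci_lower, beta_hat, ci_upper]))
print("test MSE:", mse_ols)
```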

    • Fit a regression tree to predict the salary using all the features of your dataset. Follow these steps:

      • Using the training data, fit a tree and prune it. To choose the pruning parameter, cross-validate it as we did in class. (4 pts)

      • Calculate the out-of-sample mean squared error of the model. (2 pts)

    • Which model performs better in terms of predictive accuracy? (2 pts)
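
One possible way to carry out the tree steps is sketched below with scikit-learn. It assumes that the pruning parameter is the cost-complexity penalty `ccp_alpha` and that it is chosen by 5-fold cross-validation over the values on the pruning path; the procedure used in class may differ. The data are synthetic, not Hitters.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a step in the first feature plus a linear term and noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 4))
y = np.where(X[:, 0] > 0, 3.0, -1.0) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

# 90% training, 10% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Candidate penalties: the effective alphas along the cost-complexity pruning path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
# Clip guards against tiny negative values from floating-point error; thin the grid for speed
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))[::5]

# Cross-validate the penalty on the training set only
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
pruned_tree = search.best_estimator_  # refit on the full training set with the best alpha

# Out-of-sample mean squared error of the pruned tree
mse_tree = np.mean((y_test - pruned_tree.predict(X_test)) ** 2)
print("best ccp_alpha:", search.best_params_["ccp_alpha"])
print("leaves:", pruned_tree.get_n_leaves(), "test MSE:", mse_tree)
```

Comparing `mse_tree` with the test MSE of the OLS fit, computed on the same test set, answers the predictive accuracy question.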
