dnieuw · stenw · Oct 16, 2025 · Oct 16, 2025 · Nov 14, 2025
diff --git a/1-10_Indexing_Subsetting_answers.qmd b/1-10_Indexing_Subsetting_answers.qmd
@@ -67,61 +67,150 @@ Sometimes we want to obtain a subset of the data sets before investigating the d
 
 Using the **heart** data set:\
 
--   Select the first row.\
--   Select the first column.\
--   Select the column `surgery`.
+-   Select the first row of the `data.frame` using `[]`.\
+-   Select the second and third column of the `data.frame` using `[]`.\
 
 #### Solution 1
 
 ```{r ind1-solution, solution = TRUE}
 heart[1, ]
-heart[, 1]
-heart["surgery"]
-heart[["surgery"]]
-heart[, "surgery"]
+heart[, c(2, 3)] 
+
+
 ```
 :::
 
 ::: {.panel-tabset .nav-pills}
 #### Task 2
 
-Create a matrix that takes the values 1:4 and has 2 rows and 2 columns. You can name this object `mat`. Select the second row of all columns.
+Using the **heart** data set:\
+
+-   Select the column `surgery` of the `data.frame` in multiple ways:
+    * As a `data.frame` with a single column, using single square brackets and a single index.
+    * As a `data.frame` with a single column, using single square brackets and a double index with `drop = FALSE`.
+    * As a vector, using double square brackets.
+    * As a vector, using single square brackets and a double index.
+    * Using the dollar sign (`$`) operator.
+-   Verify the class of the returned objects in each case using the `class()` function.
 
 #### Solution 2
 
-```{r ind2-solution, solution = TRUE}
-mat <- matrix(1:4, 2, 2)
-mat[2, ]
+```{r ind1b-solution, solution = TRUE}
+(a <- heart["surgery"])
+(b <- heart[ , "surgery", drop=FALSE])
+(c <- heart[["surgery"]])
+(d <- heart[, "surgery"])
+(e <- heart$surgery)
+
+class(a)  # data.frame
+class(b)  # data.frame
+class(c)  # numeric
+class(d)  # numeric
+class(e)  # numeric  
+
 ```
 :::
 
+
 ::: {.panel-tabset .nav-pills}
 #### Task 3
 
-Create an array that consists of 2 matrices. Matrix 1 will consist of the values 1:4 and matrix 2 will consist of the values 5:8. Both matrices will have 2 columns and 2 rows. Give the name `ar1` to the this array. Select the 2nd row of all columns from each matrix.
+Create a matrix that takes the values 1:6 and has 3 rows and 2 columns. You can name this object `mat`.
+
+-   Select the second row of all columns.
+-   Select the first column.
+-   Select the element in the 3rd row and 2nd column.
+-   Select the first and second rows of the second column.
 
 #### Solution 3
 
+```{r ind2-solution, solution = TRUE}
+mat <- matrix(1:6, 3, 2)
+mat[2, ]
+mat[, 1]
+mat[3, 2]
+mat[1:2, 2] # or mat[-3, 2]
+```
+:::
+
+::: callout-advanced
+
+::: {.panel-tabset .nav-pills}
+#### Task 4
+
+From the matrix in the previous task, select the element in the 1st row and 2nd column and the element in the 3rd row and 1st column. Use a matrix as the index.
+
+#### Solution 4
+
 ```{r ind3-solution, solution = TRUE}
-ar1 <- array(data = 1:8, dim = c(2, 2, 2))
-ar1[2, , ]
+i <- matrix(c(1, 2,   # element in row 1, column 2
+              3, 1), ncol = 2, byrow = TRUE)  # element in row 3, column 1
+mat[i]
+
+```
+:::
+
+
+::: {.panel-tabset .nav-pills}
+#### Task 5
+
+The following table contains the average temperatures (in °C) for January and July in 3 different cities.
+
+| City | Month | 2020 | 2021 | 2023 |
+|:---|---|---|---|---|
+| Rotterdam | January | 6.2 | 3.8 | 5.8 |
+|  | July | 17.0 | 18.0 | 18.0 |
+| Berlin | January | 4.0 | 1.0 | 4.0 |
+|  | July | 17.8 | 19.2 | 20.0 |
+| Athens | January | 8.0 | 11.0 | 11.0 |
+|  | July | 28.0 | 29.0 | 31.0 |
+
+Create an array in which the rows denote the years, the columns the months, and the layers the cities. Then list the temperature in January 2023 in each of the cities. Also calculate the average July temperature in Rotterdam over the three years (use the `mean` function).
+
+#### Solution 5
+
+```{r ind5-solution, solution = TRUE}
+all_data <- c(6.2, 3.8, 5.8,  # Rotterdam, Jan (2020, 2021, 2023)
+              17.0, 18.0, 18.0, # Rotterdam, July (2020, 2021, 2023)
+              4.0, 1.0, 4.0,  # Berlin, Jan (2020, 2021, 2023)
+              17.8, 19.2, 20.0, # Berlin, July (2020, 2021, 2023)
+              8.0, 11.0, 11.0, # Athens, Jan (2020, 2021, 2023)
+              28.0, 29.0, 31.0) # Athens, July (2020, 2021, 2023)
+
+# Define the names for each dimension (used below to index by name)
+dim_years <- c("2020", "2021", "2023")
+dim_months <- c("January", "July")
+dim_cities <- c("Rotterdam", "Berlin", "Athens")
+
+temps_array <- array(all_data, 
+                     dim = c(3, 2, 3), 
+                     dimnames = list(dim_years, dim_months, dim_cities))
+
+print(temps_array["2023", "January", ])
+
+rotterdam_july_temps <- temps_array[, "July", "Rotterdam"]
+print(mean(rotterdam_july_temps))
+
 ```
 :::
 
-### Subsetting {.tabset .tabset-fade .tabset-pills}
+
+::: 
+
+### Subsetting a data set {.tabset .tabset-fade .tabset-pills}
 
 ::: {.panel-tabset .nav-pills}
 #### Task 1
 
 Using the **retinopathy** data set:\
 
 -   Select the `futime` for all `adult` patients.\
--   Select all the variables for patients that received treatment.\
+-   Select all the variables for patients that received treatment (`trt == 1`).\
 
 #### Solution 1
 
 ```{r sub1-solution, solution = TRUE}
 retinopathy$futime[retinopathy$type == "adult"]
+# or use: retinopathy[retinopathy$type == "adult", "futime"]
 retinopathy[retinopathy$trt == 1, ]
 ```
 :::
@@ -131,15 +220,21 @@ retinopathy[retinopathy$trt == 1, ]
 
 Using the **retinopathy** data set:\
 
--   Select the `age` for patients that have `futime` more than 20.\
+-   Select the `age` for patients that have `futime` more than 20. (If you have time, can you think of a second way to do this?)\
 -   Select the `age` for patients that have `futime` more than 20 and are adults.\
--   Select patients that have no missing values in `age`.
+-   Select only the rows of the left eye. If needed, look in the documentation of the data set to find out how the variables are encoded.
+-   Select only the rows of adult patients.
+-   Select all rows for patients that have no missing values in `age`.
 
 #### Solution 2
 
 ```{r sub2-solution, solution = TRUE}
 retinopathy$age[retinopathy$futime > 20]
+# or
+retinopathy[retinopathy$futime > 20, "age"]
 retinopathy$age[retinopathy$futime > 20 & retinopathy$type == "adult"]
+retinopathy[retinopathy$eye == "left", ]
+retinopathy[retinopathy$type == "adult", ]
 retinopathy[!is.na(retinopathy$age), ]
 ```
 :::
@@ -148,14 +243,43 @@ retinopathy[!is.na(retinopathy$age), ]
 #### Task 3
 
 Using the **retinopathy** data set:\
 
--   Select only the rows of the left eye.
--   Select only the rows of adult patients.
+-   Calculate the mean risk score for the treated eyes.
+-   Calculate the mean age of the patients whose `age` is not missing.
+-   Create a `summary` of the juvenile patients.
 
 #### Solution 3
 
 ```{r sub3-solution, solution = TRUE}
-retinopathy[retinopathy$eye == "left", ]
-retinopathy[retinopathy$type == "adult", ]
+mean(retinopathy$risk[retinopathy$trt == 1])
+mean(retinopathy[!is.na(retinopathy$age), "age"])
+# or split this up in steps. For example:
+age_vec <- retinopathy$age
+age_vec_no_na <- age_vec[!is.na(age_vec)]
+mean(age_vec_no_na)
+# You could also have used the na.rm argument of the mean function:
+mean(retinopathy$age, na.rm = TRUE)
+###
+summary(retinopathy[retinopathy$type == "juvenile", ])
+
 ```
 :::
+
+::: callout-advanced
+
+::: {.panel-tabset .nav-pills}
+#### Task 4
+
+- Why does `colnames(retinopathy[1])` return `id`, while `colnames(retinopathy[[1]])` returns `NULL`?
+
+- Without executing the code, try to determine what the output of `colnames(retinopathy[,1])` would be.
+
+#### Solution 4
+
+When we use single brackets `[]` with a single index to subset a data frame, the result is still a data frame. In this case, `retinopathy[1]` returns a data frame with one column (the first column of `retinopathy`). Therefore, when we use `colnames()` on this result, it returns the name of that column, which is `id`. In contrast, when we use double brackets `[[]]`, we extract the column itself as a vector (a list element of the data frame). A vector has no column names, so `colnames()` returns `NULL`.
+
+
+The output of `colnames(retinopathy[,1])` would also be `NULL`. This is because when we use the comma notation to subset a data frame and specify only one column (like `retinopathy[,1]`), R returns a vector by default (unless we set `drop = FALSE`). Since the result is a vector, it does not have column names, and thus `colnames()` would return `NULL`.
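+
+As a minimal sketch of the same behaviour, the block below uses a small made-up data frame instead of `retinopathy` (`df` and its columns are hypothetical and not part of the exercise data):
+
+```r
+# Hypothetical toy data frame, used only for illustration
+df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))
+
+colnames(df[1])                  # one-column data frame, so "id"
+colnames(df[[1]])                # plain vector, so NULL
+colnames(df[, 1])                # dropped to a vector, so NULL
+colnames(df[, 1, drop = FALSE])  # stays a data frame, so "id"
+```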
+
+:::
+
+:::