170 changes: 147 additions & 23 deletions 1-10_Indexing_Subsetting_answers.qmd
Expand Up @@ -67,61 +67,150 @@ Sometimes we want to obtain a subset of the data sets before investigating the d

Using the **heart** data set:\

- Select the first row.\
- Select the first column.\
- Select the column `surgery`.
- Select the first row of the `data.frame` using `[]`.\
- Select the second and third column of the `data.frame` using `[]`.\

#### Solution 1

```{r ind1-solution, solution = TRUE}
heart[1, ]
heart[, 1]
heart["surgery"]
heart[["surgery"]]
heart[, "surgery"]
heart[, c(2, 3)]


```
:::

::: {.panel-tabset .nav-pills}
#### Task 2

Create a matrix that takes the values 1:4 and has 2 rows and 2 columns. You can name this object `mat`. Select the second row of all columns.
Using the **heart** data set:\

- Select the column `surgery` of the `data.frame` in multiple ways: \
* As a `data.frame` with a single column using single square brackets and a single index
* As a `data.frame` with a single column using single square brackets and a double index using `drop=FALSE` \
* As a vector using double square brackets. \
* As a vector using single square brackets and a double index. \
* Using the dollar sign (`$`) operator. \
- Verify the class of the returned objects in each case using the `class()` function.

#### Solution 2

```{r ind2-solution, solution = TRUE}
mat <- matrix(1:4, 2, 2)
mat[2, ]
```{r ind1b-solution, solution = TRUE}
(a <- heart["surgery"])
(b <- heart[ , "surgery", drop=FALSE])
(c <- heart[["surgery"]])
(d <- heart[, "surgery"])
(e <- heart$surgery)

class(a) # data.frame
class(b) # data.frame
class(c) # numeric
class(d) # numeric
class(e) # numeric

```
:::


::: {.panel-tabset .nav-pills}
#### Task 3

Create an array that consists of 2 matrices. Matrix 1 will consist of the values 1:4 and matrix 2 will consist of the values 5:8. Both matrices will have 2 columns and 2 rows. Give the name `ar1` to this array. Select the 2nd row of all columns from each matrix.
Create a matrix that takes the values 1:6 and has 3 rows and 2 columns. You can name this object `mat`.
- Select the second row of all columns.
- Select the first column.
- Select the element in the 3rd row and 2nd column.
- Select the first and second row of the second column.

#### Solution 3

```{r ind2-solution, solution = TRUE}
mat <- matrix(1:6, 3, 2)
mat[2, ]
mat[, 1]
mat[3, 2]
mat[1:2, 2] # or mat[-3, 2]
```
:::

::: callout-advanced

::: {.panel-tabset .nav-pills}
#### Task 4

From the matrix in the previous task, select the element in the 1st row and 2nd column and the element in the 3rd row and 1st column. Use a matrix as the index.

#### Solution 4

```{r ind3-solution, solution = TRUE}
ar1 <- array(data = 1:8, dim = c(2, 2, 2))
ar1[2, , ]
i <- matrix(c(1,2,
3,1), ncol=2, byrow=TRUE)
mat[i]

```
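
As a rough illustration of why a matrix index works here: when the index is a two-column numeric matrix, R reads each of its rows as one (row, column) pair and returns one element per pair. The sketch below uses a separate, hypothetical matrix `m` and index `idx` (names that are not part of the exercise), so it can be run on its own.

```r
# Hypothetical 3 x 2 matrix, filled column by column:
# column 1 holds 11, 12, 13 and column 2 holds 14, 15, 16
m <- matrix(11:16, nrow = 3, ncol = 2)

# Each row of idx is one (row, column) pair
idx <- rbind(c(1, 2),   # element in row 1, column 2
             c(3, 1))   # element in row 3, column 1

# Indexing with the two-column matrix returns one value per pair
picked <- m[idx]
print(picked)
```

The result is a plain vector with as many elements as `idx` has rows, not a sub-matrix.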
:::


::: {.panel-tabset .nav-pills}
#### Task 5

The following table contains the average temperatures (in °C) for January and July in 3 different cities.

| City | Month | 2020 | 2021 | 2023 |
|:---|---|---|---|---|
| Rotterdam | January | 6.2 | 3.8 | 5.8 |
| | July | 17.0 | 18.0 | 18.0 |
| Berlin | January | 4.0 | 1.0 | 4.0 |
| | July | 17.8 | 19.2 | 20.0 |
| Athens | January | 8.0 | 11.0 | 11.0 |
| | July | 28.0 | 29.0 | 31.0 |

Create an array where the rows denote the years, the columns the months and the layers the cities. Then list the temperature in January 2023 in each of the cities. Also calculate the average July temperature in Rotterdam (use the `mean` function).

#### Solution 5

```{r ind5-solution, solution = TRUE}
all_data <- c(6.2, 3.8, 5.8, # Rotterdam, Jan (2020, 2021, 2023)
17.0, 18.0, 18.0, # Rotterdam, July (2020, 2021, 2023)
4.0, 1.0, 4.0, # Berlin, Jan (2020, 2021, 2023)
17.8, 19.2, 20.0, # Berlin, July (2020, 2021, 2023)
8.0, 11.0, 11.0, # Athens, Jan (2020, 2021, 2023)
28.0, 29.0, 31.0) # Athens, July (2020, 2021, 2023)

# Optionally: Define the names for each dimension
dim_years <- c("2020", "2021", "2023")
dim_months <- c("January", "July")
dim_cities <- c("Rotterdam", "Berlin", "Athens")

temps_array <- array(all_data,
dim = c(3, 2, 3),
dimnames = list(dim_years, dim_months, dim_cities))

print(temps_array["2023", "January", ])

rotterdam_july_temps <- temps_array[, "July", "Rotterdam"]
print(mean(rotterdam_july_temps))

```
:::

### Subsetting {.tabset .tabset-fade .tabset-pills}

:::

### Subsetting a data set {.tabset .tabset-fade .tabset-pills}

::: {.panel-tabset .nav-pills}
#### Task 1

Using the **retinopathy** data set:\

- Select the `futime` for all `adult` patients.\
- Select all the variables for patients that received treatment.\
- Select all the variables for patients that received treatment (`trt==1`).\

#### Solution 1

```{r sub1-solution, solution = TRUE}
retinopathy$futime[retinopathy$type == "adult"]
# all variables for patients that received treatment
retinopathy[retinopathy$trt == 1, ]
```
:::
Expand All @@ -131,15 +220,21 @@ retinopathy[retinopathy$trt == 1, ]

Using the **retinopathy** data set:\

- Select the `age` for patients that have `futime` more than 20.\
- Select the `age` for patients that have `futime` more than 20. (If you have time, can you think of a second way to do this?)\
- Select the `age` for patients that have `futime` more than 20 and are adults.\
- Select patients that have no missing values in `age`.
- Select only the rows of the left eye. If needed, look in the documentation of the data set to find out how the variables are encoded.
- Select only the rows of adult patients.
- Select all rows for patients that have no missing values in `age`.

#### Solution 2

```{r sub2-solution, solution = TRUE}
retinopathy$age[retinopathy$futime > 20]
# or
retinopathy[retinopathy$futime > 20, "age"]
retinopathy$age[retinopathy$futime > 20 & retinopathy$type == "adult"]
retinopathy[retinopathy$eye == "left", ]
retinopathy[retinopathy$type == "adult", ]
retinopathy[!is.na(retinopathy$age), ]
```
:::
Expand All @@ -148,14 +243,43 @@ retinopathy[!is.na(retinopathy$age), ]
#### Task 3

Using the **retinopathy** data set:\

- Select only the rows of the left eye.
- Select only the rows of adult patients.
- Calculate the mean risk score for the treated eyes.
- Calculate the mean age of the patients for whom the variable `age` is not missing.
- Create a `summary` of the juvenile patients.

#### Solution 3

```{r sub3-solution, solution = TRUE}
retinopathy[retinopathy$eye == "left", ]
retinopathy[retinopathy$type == "adult", ]
mean(retinopathy$risk[retinopathy$trt==1])
mean(retinopathy[!is.na(retinopathy$age), 'age'])
# or split this up in steps. For example:
age_vec <- retinopathy$age
age_vec_no_na <- age_vec[!is.na(age_vec)]
mean(age_vec_no_na)
# You could also have used the na.rm argument of the mean function:
mean(retinopathy$age, na.rm = TRUE)
###
summary(retinopathy[retinopathy$type == "juvenile", ])

```
:::

::: callout-advanced

::: {.panel-tabset .nav-pills}
#### Task 4

- Why does `colnames(retinopathy[1])` return `id`, while `colnames(retinopathy[[1]])` returns `NULL`?

- Without executing the code, try to determine what the output of `colnames(retinopathy[,1])` would be.

#### Solution 4

When we use single brackets `[]` with a single index to subset a data frame, the result is still a data frame. In this case, `retinopathy[1]` returns a data frame with one column (the first column of `retinopathy`). Therefore, when we use `colnames()` on this result, it returns the name of that column, which is `id`. In contrast, when we use double brackets `[[]]`, we extract the actual vector (the list element) from the data frame. A vector has no column names, so `colnames()` returns `NULL`.


The output of `colnames(retinopathy[,1])` would also be `NULL`. This is because when we use the comma notation to subset a data frame and specify only one column (like `retinopathy[,1]`), R returns a vector by default (unless we set `drop = FALSE`). Since the result is a vector, it does not have column names, and thus `colnames()` would return `NULL`.
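
The behaviour described above can be checked on a small toy example. This is a minimal sketch that uses a hypothetical data frame `df` (not the **retinopathy** data), so it runs without loading any package.

```r
# Hypothetical toy data frame with two columns
df <- data.frame(id = 1:3, age = c(30, 41, 25))

# Single brackets, single index: still a data frame with one column
one_col_df <- df[1]
print(class(one_col_df))
print(colnames(one_col_df))

# Double brackets: the column is extracted as a vector, so no column names
as_vector <- df[[1]]
print(class(as_vector))
print(colnames(as_vector))

# Comma notation with one column drops to a vector by default
dropped <- df[, 1]
print(colnames(dropped))

# drop = FALSE keeps the data frame structure and the column name
kept <- df[, 1, drop = FALSE]
print(colnames(kept))
```

Only the two variants that return a data frame (`df[1]` and `df[, 1, drop = FALSE]`) keep the column name.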

:::

:::