From 08cd747d8ff668a9fb3d13cb8ae5f382a1367937 Mon Sep 17 00:00:00 2001 From: Sten Willemsen Date: Thu, 16 Oct 2025 11:53:55 +0200 Subject: [PATCH 1/2] Add Book.Rproj and myfile.rds (is this smart?) --- Book.Rproj | 1 + myfile.rds | Bin 120 -> 120 bytes 2 files changed, 1 insertion(+) diff --git a/Book.Rproj b/Book.Rproj index 8e3c2eb..5797260 100644 --- a/Book.Rproj +++ b/Book.Rproj @@ -1,4 +1,5 @@ Version: 1.0 +ProjectId: 24347c35-7a41-468a-b7b4-74373c1938b8 RestoreWorkspace: Default SaveWorkspace: Default diff --git a/myfile.rds b/myfile.rds index 1ac268782c6bf18bc410f4b5a96e45e303e39db9..daf4f88ab5d062d11dd01fffb2ec9866c355772a 100644 GIT binary patch delta 36 ncmb=Z5S8!dU;qQQ?gvQ;2?=QliHRu_MWZ>kY?F@O3X}l=uzLzM delta 36 ncmb=Z5S8!dU;qQQ?gvQ;2?=QliD^j_MWZ>?`IkM~3X}l=t&<8C From 0c7c82045a94fb8b835edc7bd8327db3aa2b1f51 Mon Sep 17 00:00:00 2001 From: Sten Willemsen Date: Fri, 14 Nov 2025 10:23:28 +0100 Subject: [PATCH 2/2] Revision of the first chapter --- 1-10_Indexing_Subsetting_answers.qmd | 170 ++++++++++++++++++++++---- 1-1_Introduction.qmd | 129 +++++++++++++++---- 1-2_Working_with_R.qmd | 13 +- 1-3_Working_with_R_practical.qmd | 43 ++----- 1-4_Working_with_R_answers.qmd | 8 +- 1-5_Common_Objects.qmd | 149 +++++++++++++--------- 1-6_Common_Objects_practical.qmd | 89 ++++++++------ 1-7_Common_Objects_answers.qmd | 132 +++++++++++++++----- 1-8_Indexing_Subsetting.qmd | 43 +++++-- 1-9_Indexing_Subsetting_practical.qmd | 70 ++++++++--- _brand.yml | 26 ++++ _quarto.yml | 17 ++- custom.css | 52 ++++++++ custom.tex | 61 +++++++++ 14 files changed, 755 insertions(+), 247 deletions(-) create mode 100644 _brand.yml create mode 100644 custom.css create mode 100644 custom.tex diff --git a/1-10_Indexing_Subsetting_answers.qmd b/1-10_Indexing_Subsetting_answers.qmd index c49315b..bfe6348 100644 --- a/1-10_Indexing_Subsetting_answers.qmd +++ b/1-10_Indexing_Subsetting_answers.qmd @@ -67,48 +67,136 @@ Sometimes we want to obtain a subset of the data sets before investigating 
the d Using the **heart** data set:\ -- Select the first row.\ -- Select the first column.\ -- Select the column `surgery`. +- Select the first row of the `data.frame` using `[]`.\ +- Select the second and third columns of the `data.frame` using `[]`.\ #### Solution 1 ```{r ind1-solution, solution = TRUE} heart[1, ] -heart[, 1] -heart["surgery"] -heart[["surgery"]] -heart[, "surgery"] +heart[, c(2, 3)] + + ``` ::: ::: {.panel-tabset .nav-pills} #### Task 2 -Create a matrix that takes the values 1:4 and has 2 rows and 2 columns. You can name this object `mat`. Select the second row of all columns. +Using the **heart** data set:\ + +- Select the column `surgery` of the `data.frame` in multiple ways: \ + * As a `data.frame` with a single column using single square brackets and a single index + * As a `data.frame` with a single column using single square brackets and a double index using `drop=FALSE` \ + * As a vector using double square brackets. \ + * As a vector using single square brackets and a double index. \ + * Using the dollar sign (`$`) operator. \ +- Verify the class of the returned objects in each case using the `class()` function. #### Solution 2 -```{r ind2-solution, solution = TRUE} -mat <- matrix(1:4, 2, 2) -mat[2, ] +```{r ind1b-solution, solution = TRUE} +(a <- heart["surgery"]) +(b <- heart[ , "surgery", drop=FALSE]) +(c <- heart[["surgery"]]) +(d <- heart[, "surgery"]) +(e <- heart$surgery) + +class(a) # data.frame +class(b) # data.frame +class(c) # numeric +class(d) # numeric +class(e) # numeric + ``` ::: + ::: {.panel-tabset .nav-pills} #### Task 3 -Create an array that consists of 2 matrices. Matrix 1 will consist of the values 1:4 and matrix 2 will consist of the values 5:8. Both matrices will have 2 columns and 2 rows. Give the name `ar1` to the this array. Select the 2nd row of all columns from each matrix. +Create a matrix that takes the values 1:6 and has 3 rows and 2 columns. You can name this object `mat`.
+- Select the second row of all columns. +- Select the first column. +- Select the element in the 3rd row and 2nd column. +- Select the first and second rows of the second column. #### Solution 3 +```{r ind2-solution, solution = TRUE} +mat <- matrix(1:6, 3, 2) +mat[2, ] +mat[, 1] +mat[3, 2] +mat[1:2, 2] # or mat[-3, 2] +``` +::: + +::: callout-advanced + +::: {.panel-tabset .nav-pills} +#### Task 4 + +From the matrix in the previous task, select the element in the 1st row and 2nd column and the element in the 3rd row and 1st column. Use a matrix as an index. + +#### Solution 4 + ```{r ind3-solution, solution = TRUE} -ar1 <- array(data = 1:8, dim = c(2, 2, 2)) -ar1[2, , ] +i <- matrix(c(1,2, + 3,1), ncol=2, byrow=TRUE) +mat[i] + +``` +::: + + +::: {.panel-tabset .nav-pills} +#### Task 5 + +The following table contains the average temperatures (in °C) for January and July in 3 different cities. + +| City | Month | 2020 | 2021 | 2023 | +|:---|---|---|---|---| +| Rotterdam | January | 6.2 | 3.8 | 5.8 | +| | July | 17.0 | 18.0 | 18.0 | +| Berlin | January | 4.0 | 1.0 | 4.0 | +| | July | 17.8 | 19.2 | 20.0 | +| Athens | January | 8.0 | 11.0 | 11.0 | +| | July | 28.0 | 29.0 | 31.0 | + +Create an array where the rows denote the different years, the columns the months and the layers the cities. Now list the temperature in January 2023 in each of the cities. Also calculate the average temperature in July in Rotterdam (use the `mean` function).
+ +#### Solution 5 + +```{r ind5-solution, solution = TRUE} +all_data <- c(6.2, 3.8, 5.8, # Rotterdam, Jan (2020, 2021, 2023) + 17.0, 18.0, 18.0, # Rotterdam, July (2020, 2021, 2023) + 4.0, 1.0, 4.0, # Berlin, Jan (2020, 2021, 2023) + 17.8, 19.2, 20.0, # Berlin, July (2020, 2021, 2023) + 8.0, 11.0, 11.0, # Athens, Jan (2020, 2021, 2023) + 28.0, 29.0, 31.0) # Athens, July (2020, 2021, 2023) + +# Optionally: Define the names for each dimension +dim_years <- c("2020", "2021", "2023") +dim_months <- c("January", "July") +dim_cities <- c("Rotterdam", "Berlin", "Athens") + +temps_array <- array(all_data, + dim = c(3, 2, 3), + dimnames = list(dim_years, dim_months, dim_cities)) + +print(temps_array["2023", "January", ]) + +rotterdam_july_temps <- temps_array[, "July", "Rotterdam"] +print(mean(rotterdam_july_temps)) + ``` ::: -### Subsetting {.tabset .tabset-fade .tabset-pills} + +::: + +### Subsetting a data set {.tabset .tabset-fade .tabset-pills} ::: {.panel-tabset .nav-pills} #### Task 1 @@ -116,12 +204,13 @@ ar1[2, , ] Using the **retinopathy** data set:\ - Select the `futime` for all `adult` patients.\ -- Select all the variables for patients that received treatment.\ +- Select all the variables for patients that received treatment (`trt==1`).\ #### Solution 1 ```{r sub1-solution, solution = TRUE} retinopathy$futime[retinopathy$type == "adult"] +# or use retinopathy[retinopathy$trt == 1, ] ``` ::: @@ -131,15 +220,21 @@ retinopathy[retinopathy$trt == 1, ] Using the **retinopathy** data set:\ -- Select the `age` for patients that have `futime` more than 20.\ +- Select the `age` for patients that have `futime` more than 20. (When you have time can you think of a second way to do this?)\ - Select the `age` for patients that have `futime` more than 20 and are adults.\ -- Select patients that have no missing values in `age`. +- Select only the rows of the left eye. If needed look in the documentation of the data set to find out how variables are encoded. 
+- Select only the rows of adult patients. +- Select all rows for patients that have no missing values in `age`. #### Solution 2 ```{r sub2-solution, solution = TRUE} retinopathy$age[retinopathy$futime > 20] +# or +retinopathy[retinopathy$futime > 20, "age"] retinopathy$age[retinopathy$futime > 20 & retinopathy$type == "adult"] +retinopathy[retinopathy$eye == "left", ] +retinopathy[retinopathy$type == "adult", ] retinopathy[!is.na(retinopathy$age), ] ``` ::: @@ -148,14 +243,43 @@ retinopathy[!is.na(retinopathy$age), ] #### Task 3 Using the **retinopathy** data set:\ - -- Select only the rows of the left eye. -- Select only the rows of adult patients. +- Calculate the mean risk score for the treated eyes. +- Calculate the mean age of patients for which the variable age is not missing +- Create a `summary` of the juvenile patients. #### Solution 3 ```{r sub3-solution, solution = TRUE} -retinopathy[retinopathy$eye == "left", ] -retinopathy[retinopathy$type == "adult", ] +mean(retinopathy$risk[retinopathy$trt==1]) +mean(retinopathy[!is.na(retinopathy$age), 'age']) +# or split this up in steps. For example: +age_vec <- retinopathy$age +age_vec_no_na <- age_vec[!is.na(age_vec)] +mean(age_vec_no_na) +# You could also have used the na.rm argument of the mean function: +mean(retinopathy$age, na.rm = TRUE) +### +summary(retinopathy[retinopathy$type == "juvenile", ]) + ``` ::: + +::: callout-advanced + +::: {.panel-tabset .nav-pills} +#### Task 4 + +- Why does `colnames(retinopathy[1])` return `id`, while `colnames(retinopathy[[1]])` returns `NULL`? + +- Without executing the code, try to determine what the output of `colnames(retinopathy[,1])` would be. + +#### Solution 4 + +When we use single brackets `[]` to subset a data frame, the result is still a data frame. In this case, `retinopathy[1]` returns a data frame with one column (the first column of `retinopathy`). Therefore, when we use `colnames()` on this result, it returns the name of that column, which is `id`. 
In contrast , when we use double brackets `[[]]`, we extract the actual vector (or list element) from the data frame and we cannot use `colnames()` on a vector, so it returns `NULL`. + + +The output of `colnames(retinopathy[,1])` would also be `NULL`. This is because when we use the comma notation to subset a data frame and specify only one column (like `retinopathy[,1]`), R returns a vector by default (unless we set `drop = FALSE`). Since the result is a vector, it does not have column names, and thus `colnames()` would return `NULL`. + +::: + +::: diff --git a/1-1_Introduction.qmd b/1-1_Introduction.qmd index 3170afd..f11cec1 100644 --- a/1-1_Introduction.qmd +++ b/1-1_Introduction.qmd @@ -1,22 +1,51 @@ +# Introduction and History **authors: Karl Brand, Elizabeth Ribble and S. Willemsen** -# Introduction and History +::: callout-advanced +test advanced2 +::: + +::: callout-advanced +test1 fgf +::: + +::: callout-xxx +test2 +::: + +::: callout-xxx +test3 +::: + +::: callout-advanced +test advanced3 +::: ## Course Overview -In the first part of the course we are introduced to the `R` software package, or **environment**, and learn about `R`, how to interact with it and it's basic programming elements. In the second part of the course we'll learn how to use `R` for it's intended function: doing statistics quickly and effectively. +In the first part of the course we are introduced to the `R` software package, or **environment**, and learn about `R`, how to interact with it and it's basic programming elements. In the second part of the course you will learn about cleaning data, doing simple statistical analyses and making plot. Throughout the course we will use various illustrate the concepts and methods. The third part of the course focuses more on programming in R using functions and language constructs that control the flow of your program. 
We will cover essential topics including how to use `R` and why we use it; objects, classes and functions; and creating, importing, saving, manipulating, combining, sub-setting and plotting data. -We will cover essential topics including how to use `R` and why we use it; objects, classes and functions; and creating, importing, saving, manipulating, combining, sub-setting and plotting data. +By the end of this course we expect that you can write your own simple scripts and have a reasonable understanding of what's going on when you cut and paste other people's code, which is typical of how people get started with `R` (or any other language for that matter). -By the end of the first couple of introductory days I expect that you can write your own code and have a reasonable understanding of what's going on when you cut and paste other people's code, which is typical of how people get started with `R` (or any other language for that matter). +As mentioned, in this chapter we will cover the basics and learn: how to install R and RStudio; the basics of working with R and RStudio; and how to get help. We will also provide some troubleshooting tips. Then in the next chapters of this part of the course we will discuss the basic objects you will work with in R and how to make selections so you can work with subsets of your data. ## Introduction +### What is R + +R is a programming language and software environment that is mainly used for statistics and creating graphics. It runs on all standard computing platforms, including Windows, macOS and Linux. It is free, open-source software, so it is available for everybody to download and use without cost. It is also highly extensible, with thousands of packages available for various statistical techniques and data manipulation tasks. + +One of the success factors of `R` is the large and active user community (especially among statisticians). This means that if you have a problem, it is relatively easy to find help.
There are many forums and mailing lists where you can ask questions and get answers from other users. There are also many online resources, such as tutorials, blogs, and documentation, that can help you learn `R` and solve problems. + +The R language is modular: base R only has limited functionality, but this functionality can be extended by means of various **packages**. Because of the popularity of R among statisticians new statistical methods will frequently be implemented in `R` first, often by the method's original author. Nowadays there is an enormous amount of extension packages. And the number of packages still seems to be growing at an exponential rate. The quality of these packages varies although there is some kind of quality control on the packages distributed through the official channel [CRAN](https://cran.r-project.org/). + +::: {callout-background} ### History -Before `R` there was the statistical analysis program, `S`, developed at Bell Laboratories in 1976 by John Chambers. `S` was later commercialised as `S-PLUS`. John Chambers is also a current member of the board of the `R` Foundation for Statistical Computing. +Before `R` there was the statistical analysis program, `S`, developed at Bell Laboratories in 1976 by John Chambers. It was developed with the purpose of being useful for interactive data analyses and programming. `S` was later commercialised as `S-PLUS`. S-PLUS contained some features that made working with the language easier (such as a graphical user interface (GUI)). John Chambers is also a current member of the board of the `R` Foundation for Statistical Computing. -Ross Ihaka and Robert Gentleman of Dept of Statistics, University of Auckland, New Zealand begin coding (1991) the `S` clone, `R`, as an open source alternative for academic use. Together with the `R` core group' they released version 1 under the GNU Public License (GPL) v2 and v3 in 2000. The name, `R`, probably derives from the first letter of the creator's names. 
http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-is-R-named-R_003f +In 1991, Ross Ihaka and Robert Gentleman of Dept of Statistics, University of Auckland, New Zealand created the `S` clone, `R`, as an open source alternative for academic use. Together with the `R` core group' they released version 1 under the GNU Public License (GPL) v2 and v3 in 2000. The name, `R`, probably derives from the first letter of the creator's names. http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-is-R-named-R_003f +::: ## Get it @@ -42,26 +71,7 @@ Emacs/ESS: http://ess.r-project.org/index.php?Section=download Here's a full list available editors: https://en.wikipedia.org/wiki/R\_(programming_language)#Interfaces -## What **is** `R` and how do we use it - -### Definition(s) - -- **Wikipedia:** \`R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software\[7\] and data analysis.' http://en.wikipedia.org/wiki/R\_(programming_language) - -- **Comprehensive `R` Archive Network (CRAN):** \`R is \`\`GNU S'', a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.' https://cloud.r-project.org/. - -- **The R Journal - Facets of R:** \`This paper considers six characteristics, which we will call facets. They characterize R as: - - - an interface to computational procedures of many kinds; - - interactive, hands-on in real time; - - functional in its model of programming; - - object-oriented, β€œeverything is an object”; - - modular, built from standardized pieces; and, - - collaborative, a world-wide, open-source effort.' - -By [John M. 
Chambers](https://journal.r-project.org/archive/2009-1/RJournal_2009-1_Chambers.pdf) - -- **My definition:** A command based, object oriented and functional language; in contrast to excel which is a cell centric, spreadsheet managing graphical user interface (GUI). +## How do we use `R` ### Working with **R** @@ -104,6 +114,27 @@ Assigning a result in a variable is done using `<-`. For example: > five + 1 [1] 6 ``` +::: callout-note + +Note that when we write `five` at the R command prompt, R will print the value of the variable, that is, the value stored in this variable. If we wanted to print the exact text `five` we would have to put it in quotes, like this: + +``` +> "five" +[1] "five" +``` + +::: + +In R it does not matter if you use single or double quotes as long as they match. Working with text values will be further discussed in the next section. + +As we will see, an advantage of using R is that we can do calculations on multiple values (or vectors) at once. For example: + +``` +> c(1, 2) + c(1, 3) +[1] 2 5 +``` + +This is of course very convenient in the common setting where we are working on a data set with many observations and want to do the same calculation on all of them. This will be further discussed later. ### How we interact with `R` @@ -130,6 +161,48 @@ A few other things about RStudio: - the little asterisk next your source code file name indicates unsaved code - **save your source code!** +## R Packages + +Base **R** has many functions, but many more come from various extension packages. These can be installed using `install.packages()`: + +```{r} +#| eval: false +install.packages("packageName", + lib = "/directory/to/my custom R library", + repos = "http://cran.xl-mirror.nl") + +# usually lib and repos can be omitted (left at the default) +``` + +The package name must be quoted when installing.
+ +Later when you want to use the package you have to load it first using `library()` : + +```{r} +#| eval: false +library("packageName") ## quotes are optional when loading a package +``` + +::: callout-note +You have to install a package only once, but you have to load it every time you start a new R session and want to use it. Usually somewhere at the top of a script, you will find a section with these library commands so it is clear which packages are needed to run in. It is considered bad practice to use `install.packages()` inside a script because it may install packages without the user's consent. +::: + +## Comments + + +To make your scripts more clear you can add comments, to explain what the code is doing, and why it is doing this. Text that represents a comment is indicated by the hash (#) symbol. For example in the code: + +```{r, eval = FALSE} +# load the survival package because we will need the data sets +library(survival) +``` + +The first line does not actually do anything (it is skipped by the interpreter). But it explains to the user why the `library()` call on the next line is included. It is also possible to add comments at the end of a line of code: + +```{r, eval = FALSE} +library(survival) # for data sets +``` + ## Backslashes One more thing I want to mention is how `R` interprets backslashes and forwardslashes, i.e. \\ and / respectively, as relates to defining the locations of files on your PC. James McDonald does a better job of clarifying this than I otherwise would http://cran.r-project.org/doc/contrib/Lemon-kickstart/kr_scrpt.html: @@ -178,6 +251,9 @@ rm(list=ls()) ls() # quotes are needed ``` +::: callout-advanced +If the above does not work, try these: + - Start `R` without any customisations, i.e., omit loading the `.Rprofile` and `.Renviron` customisation files. From the command line: ```{bash} @@ -187,6 +263,7 @@ R --vanilla ``` - `trace.back()` function: very helpful pinpointing the source of an error, and thus its cause. 
+::: ### Further resources diff --git a/1-2_Working_with_R.qmd b/1-2_Working_with_R.qmd index e1a0ef8..42aeb66 100644 --- a/1-2_Working_with_R.qmd +++ b/1-2_Working_with_R.qmd @@ -1,12 +1,10 @@ # Importing and Saving Data - - Because the typical way of using `R` involves writing text into a file with the `.r` or `.R` extension (e.g. `my-r-file.r`). Hopefully you're you are also doing this for the R commands you will use for this course. Sometimes you will also want to save your data. ## Creating and saving an R script -You can create a new R file using the menu (File > New File > R script) or by using the keyboard combination {{< kbd win=Ctrl+Shift+N mac=Cmd+Shift+N >}}. You are then prompted for a file name. When you want to save the file you can do so using the menu again (File > Save) or by pressing {{< kbd win=Ctrl+S mac=Cmd+S >}}. Selecting 'File > Save as ...' saves the active file under another name. +You can create a new R file using the menu (File \> New File \> R script) or by using the keyboard combination {{< kbd win=Ctrl+Shift+N mac=Cmd+Shift+N >}}. You are then prompted for a file name. When you want to save the file you can do so using the menu again (File \> Save) or by pressing {{< kbd win=Ctrl+S mac=Cmd+S >}}. Selecting 'File \> Save as ...' saves the active file under another name. ## Saving and restoring your session @@ -49,7 +47,7 @@ Restore or load previous sessions or objects: load("mySession.RData") # load from the working directory ``` -When you save and load objects this way, loaded variables will have the same names as when they were stored. There is also a different way of saving objects where loaded objects are assigned explicitly. +When you save and load objects this way, loaded variables will have the same names as when they were stored, possibly overwriting existing variables. There is also a different way of saving objects where loaded objects are assigned explicitly. 
```{r} myobject <- list(vec=c(1, 2, 3), mat=matrix(1:4,2)) @@ -74,10 +72,3 @@ read_excel(path='path/to/file/excelworkbook.xls') ``` This function has an optional argument `sheet` to indicate the sheet in the excel file if it contains multiple sheets. Optionally, with `range` a cell range can be indicated. - -Load `.Rhistory` file to access the history from the console: - -```{r} -#| eval: false -loadhistory() -``` diff --git a/1-3_Working_with_R_practical.qmd b/1-3_Working_with_R_practical.qmd index 34b211e..1debed5 100644 --- a/1-3_Working_with_R_practical.qmd +++ b/1-3_Working_with_R_practical.qmd @@ -56,7 +56,7 @@ https://stat.ethz.ch/R-manual/R-devel/library/survival/html/heart.html https://stat.ethz.ch/R-manual/R-devel/library/survival/html/retinopathy.html - +We will learn more about the format of data sets (or `data.frame`s) later. ## Basic calculations @@ -72,7 +72,6 @@ Use R as a calculator to calculate the following values: - $17 ^4$, - $2 ^ (1/3)$, - #### *Hint 1* Enter the calculations in an R-script and then run the lines using {{< kbd win=Ctrl+Enter mac=Cmd+Enter >}}. @@ -85,13 +84,12 @@ Enter the calculations in an R-script and then run the lines using {{< kbd win=C ``` ::: - ### Assignment {.tabset .tabset-fade .tabset-pills} -::: panel-tabset +::: panel-tabset #### Task 2 -Now use assignment to calculate (45-2)*3 again. Assign names `a`, `b` and `c` to each of the three values involved, then do the calculation while assigning the name `d` to the result. +Now use assignment to calculate (45-2)\*3 again. Assign names `a`, `b` and `c` to each of the three values involved, then do the calculation while assigning the name `d` to the result. #### Hint 2 @@ -105,13 +103,10 @@ d <- (a - b) * c ``` `a`, `b`, `c`, and `d` are variables. To see their values, you can just type the variable name (e.g. `a`) and use the shortcut {{< kbd win=Ctrl+Enter mac=Cmd+Enter >}} or use the command `print(a)`. 
- - ::: ::: panel-tabset - -R can do calculations on whole vectors at once. +R can do calculations on whole vectors at once. #### Task 3 @@ -127,11 +122,9 @@ v1 <- c(3, 4) # now compute the square of each value # create v2 in the same way as v1 and calculate the sum ``` - ::: ::: panel-tabset - #### Task 4 Create a vector of numbers using the function call `seq(1, 9)`. Theck that `1:9` does the same thing. What do `seq(1, 9, by=2)` and `9:1` do? @@ -140,30 +133,25 @@ Create a vector of numbers using the function call `seq(1, 9)`. Theck that `1:9` Execute the following code. - ```{r} # Just type in the given code fragments and see what they do. ``` - - ::: ## Loading packages and using the help function ### Loading packages {.tabset .tabset-fade .tabset-pills} -We will use data sets from the `survival` package. It is important to know how to load packages as they contain most of the functionality of R. - +We will use data sets from the `survival` package. It is important to know how to load packages as they contain most of the functionality of R. ::: {.panel-tabset .nav-pills} #### Task 1 - Load the `survival` package using the `library` function. +Load the `survival` package using the `library` function. #### *Hint 1* Use the `library(...)` function. There should be no need to install the package as it is already installed with `R`. - ::: ### Getting help {.tabset .tabset-fade .tabset-pills} @@ -171,28 +159,25 @@ Use the `library(...)` function. There should be no need to install the package ::: {.panel-tabset .nav-pills} #### Task 2 - Ask for the documentation of the `heart` data set, using the `help` function. Then do the same for the `retinopathy` data set using `?`. +Ask for the documentation of the `heart` data set, using the `help` function. Then do the same for the `retinopathy` data set using `?`. #### *Hint 2* -Use the `help(...)` function and `?`. Remember that the first form requires quotation marks. - +Use the `help(...)` function and `?`. 
Remember that the first form requires quotation marks. ::: ::: {.panel-tabset .nav-pills} #### Task 3 -Also ask for help on the `c`, `ls`, `rm` and `seq` functions. Check if you can also obtain the documentation for the `c` function if you use `??combine` +Also ask for help on the `c`, `ls`, `rm` and `seq` functions. Check if you can also obtain the documentation for the `c` function if you use `??combine` #### *Hint 3* -Use the `help(...)` fuction, and `?` or `??`. - +Use the `help(...)` fuction, and `?` or `??`. ::: ## Importing and Saving Data - ### Save your work {.tabset .tabset-fade .tabset-pills} It is important to save your work. You can save the whole workspace using the function `save.image`. If you want to save only specific objects, you can use the function `save`. @@ -208,11 +193,11 @@ Use the function `save(...)`. Note that you need to set the working directory. ::: ::: {.panel-tabset .nav-pills} -#### Task 2 \* +#### Task 2 Save the vectors `events <- heart$event` and `eyes <- retinopathy$eye`. Use the name `vectors_survival`. -#### *Hint 2* \* +#### *Hint Use the function `save(...)`. Note that you need to set the working directory. ::: @@ -233,7 +218,6 @@ Use the function `load(...)`. At this point it is interesting to check the objects that are in the workspace. To do so look at the Environment tab in the upper-right hand pane. You can also use the `ls()` function to list all objects in the workspace. - ### Remove your work {.tabset .tabset-fade .tabset-pills} Remove unnecessary objects. @@ -258,5 +242,4 @@ Remove the vectors `events` and `eyes`. Use the function `rm(...)`. ::: - -At this point you can inspect the History tab in the upper-right hand pane to see the commands you have executed. If you have made a syntax file check if you can run it in it's entirety without any errors (if you have not you can piece it together from the history). Save the file to disk. 
\ No newline at end of file +At this point you can inspect the History tab in the upper-right hand pane to see the commands you have executed. If you have made a syntax file check if you can run it in it's entirety without any errors (if you have not you can piece it together from the history). Save the file to disk. diff --git a/1-4_Working_with_R_answers.qmd b/1-4_Working_with_R_answers.qmd index f57d63c..d4bcbe8 100644 --- a/1-4_Working_with_R_answers.qmd +++ b/1-4_Working_with_R_answers.qmd @@ -56,6 +56,8 @@ https://stat.ethz.ch/R-manual/R-devel/library/survival/html/heart.html https://stat.ethz.ch/R-manual/R-devel/library/survival/html/retinopathy.html +We will learn more about the format of data sets (or `data.frame`s) later. + ## Basic calculations ### Expressions {.tabset .tabset-fade .tabset-pills} @@ -346,15 +348,15 @@ rm(numbers, numbers_2, treatment) ::: ::: {.panel-tabset .nav-pills} -#### Task 2 \* +#### Task 2 Remove the vectors `events` and `eyes`. -#### *Hint 2* \* +#### *Hint 2* Use the function `rm(...)`. -#### Solution 2 \* +#### Solution 2 ```{r, eval=F} rm(events, eyes) diff --git a/1-5_Common_Objects.qmd b/1-5_Common_Objects.qmd index 4f41c3b..b9bfe1a 100644 --- a/1-5_Common_Objects.qmd +++ b/1-5_Common_Objects.qmd @@ -1,37 +1,37 @@ # Common Objects - **Authors: Sten Willemsen, Karl Brand and Elizabeth Ribble** - # Objects in R ## Introduction -All the things we work with in R (such as the numbers in the calculations in the previous section) are called **objects**. All objects have a **mode** and a **class**. The **mode** describes how the data is stored in **R** and how it can be used. Text variables for example have to be handled differently than numbers, we cannot multiply two words. +All the things we work with in R (such as the numbers in the calculations in the previous section) are called **objects**. All objects have a **mode** and a **class**. 
The mode (or type) determines how data is stored, and the class determines how functions work with them. For simple objects, these are usually the same. Text variables for example have to be handled differently than numbers, we cannot multiply two words. Elementary objects can be combined with each other to form more complex objects this leads to several types of containers, like `lists` and `data.frames` The **class** of an object is an attribute that can be used to further specify how an object is used in **R**. For example it can be used to indicate how it should be printed and plotted. For many objects the class will simply be equal to the mode. -Functions are special objects that can do things with other objectsl, for example the `print` function displays the contents of an object in the console. Functions are the topic of an other chapter but because we obviously want to do something with the various objects we will see we cannot avoid them altogether. +Functions are special objects that can do things with other objects, for example the `print` function displays the contents of an object in the console. Functions are the topic of an other chapter but because we obviously want to do something with the various objects we will see we cannot avoid them altogether. ## The elementary data types -The simplest variables just have a single value of a certain data type (Or mode^[`mode`, `storage.mode` and `type` are closely related concepts, we will not discuss the differences here. See also `?mode`.]). The most important data types in R are:^[There are a few more like `complex` and `raw` - which we will not discuss.]: +The simplest variables just have a single value of a certain data type (Or mode[^1-5_common_objects-1]). The most important data types in R are:[^1-5_common_objects-2]: + +[^1-5_common_objects-1]: `mode`, `storage.mode` and `type` are closely related concepts, we will not discuss the differences here. See also `?mode`. 
------------- --------------------------------------------------- - mode description ------------- --------------------------------------------------- - numeric: Posibly fractional numerical values like 1.0, 1.2 or 1e12 (that is 10 raised to the power of 12) - character: text, for example 'man', 'woman', 'censored', etc. - logical: TRUE and FALSE -------------------------------------------------------------------- +[^1-5_common_objects-2]: There are a few more like `complex` and `raw` which we will not discuss. + +| mode | description | | +|-----------|----------------------------------------------------------------------------------------------| +| numeric | Possibly fractional numerical values like 1.0, 1.2 or 1e12 (that is 10 raised to the power of 12) | +| character | Text, for example 'man', 'woman', 'censored', etc. | +| logical | TRUE and FALSE | Let's examine variables of these data types in a bit more detail: ### Numbers + ```{r} mode(1) # for most basic data types the class is the same as the mode @@ -54,6 +54,7 @@ class(1L) # explicit integer ``` ### Text + ```{r} # character values should be surrounded by quotes "or ' class("a") @@ -65,7 +66,9 @@ mode("1") ``` ### Logical + It can only be a **yes** or a **no**. More specifically, a `TRUE` or a `FALSE`. + ```{r} class(TRUE) class(FALSE) @@ -78,23 +81,28 @@ Often logical values are he result of comparisons: |----------|--------------------------| | == | Equal to | | != | Not equal to | -| > | Greater than | -| < | Smaller than | -| >= | Greater than or equal to | -| <= | Smaller than or equal to | +| \> | Greater than | +| \< | Smaller than | +| \>= | Greater than or equal to | +| \<= | Smaller than or equal to | Logival values can be combined using `&` (and) and `|` or, and inverted using `!` (not). 
```{r} -a <- c(TRUE, TRUE, FALSE, FALSE) -b <- c(TRUE, FALSE, TRUE, FALSE) - -a & b -a | b -!a +true_val <- TRUE +false_val <- FALSE +true_val & true_val +true_val & false_val +false_val & true_val +false_val & false_val +true_val | true_val +true_val | false_val +false_val | true_val +false_val | false_val +!true_val +!false_val ``` - ## Vectors Vectors of these data types are the most elementary data structure in R. All other structures (like the `data.table`) are constructed using these vectors. In R there is also no structure that is smaller than a vector. A single number is not treated differently from a numeric vector of length ten; In fact R sees the single number simply as a numeric vector of length 1. The `length()` of a vector can be obtained by using the function `length()`. @@ -116,23 +124,30 @@ In the output we see that R shows the row number of the first element of each ro c(1, 2, 3) * c(4, 5, 6) ``` -Note that when the lengths do not match they are **recycled**. +When you perform operations on vectors of different lengths, R automatically **recycles** (repeats) the shorter vector to match the length of the longer one. This can be useful, but it can also lead to unexpected results if you're not careful. One instance in which recycling is handy is when we are working with a vector and a single value. When the longer vector's length is not an exact multiple of the shorter vector's length, R will still recycle, but it will give you a warning: + ```{r} -c(1, 2, 3, 4) * c(4, 5) +c(1, 2, 3, 4) * 2 # if the larger length is not an exact multiple # of the smaller this often indicates a mistake # and a warning is given c(1, 2, 3) * c(4, 5) +# When the larger length is an exact multiple +# no warning is given, but this still +# may be a mistake, so be careful!
+c(1, 2, 3, 4) * c(4, 5) ``` We can also give names to the elements of a vector: + ```{r} named_v <- c(foo=1, bar=2) print(named_v) mode(named_v) class(named_v) ``` -When we try to create a vector that consists of different data types they will be converted to a data type that is capable of containing all of them. For example: + +When we try to create a vector that consists of different data types they will be converted to a data type that is capable of containing all of them. For example: ```{r} c("eleven", 12) @@ -140,13 +155,13 @@ c("eleven", 12) The second element of the resulting vector is now also of type `character`. In general it is better not to trust this implicit conversion. Instead to it explicitly, in this case by using the function `as.character()`. -An other way to create a vector is by using the function `vector`. `vector('numeric', 8)` creates a numeric vector of length 8. The `vector` function is often used to pre-allocate room where the results of future computations can be stored. - +An other way to create a vector is by using the function `vector`. `vector('numeric', 8)` creates a numeric vector of length 8. The `vector` function is often used to pre-allocate room where the results of future computations can be stored. +One kind of vector that is often used is a sequence of numbers. One way to create them is by using the colon operator `:`. A syntax `a:b` will create a sequence of integers from `a` up to and including `b`. For example `1:4` is equal to the vector `c(1, 2, 3, 4)`. ## Matrices -Vectors have just a single dimension (every element is characterized by a single number (index) that indicates its position within the vector). They can be generalized to vectors that are two dimensional, that is -they have both rows and columns. Like simple vectors all elements of a matrix have the same \emph{mode}. 
+ +Vectors have just a single dimension (every element is characterized by a single number (index) that indicates its position within the vector). They can be generalized to vectors that are two dimensional, that is they have both rows and columns. Like simple vectors all elements of a matrix have the same \emph{mode}. ```{r} my_matrix <- matrix(data = c(1,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)) @@ -163,8 +178,11 @@ mode(char_mat) length(char_mat) #just counts the number of elements ``` +::: callout-advanced + ### Arrays -We do not have to stop with two dimensions. Arrays are even more general than matrices as they can have any number of dimensions. All elements have to be of the same type. + +We do not have to stop with two dimensions. Arrays are even more general than matrices as they can have any number of dimensions. All elements have to be of the same type. You can create an array using the function `array`. ```{r} letters[1:12] @@ -174,9 +192,17 @@ class(my_array) mode(my_array) ``` +An array can be useful when you are working with data that has more than two dimensions. For example you could use an array to store the results of an experiment that was repeated several times (the first dimension would indicate the patient, the second the variable and the third dimension would then represent the repetition number). Another example where arrays are useful is spatial data or image data where you might have separate dimensions for the patient, x-coordinate of the pixel, y-coordinate of the pixel and the channel (either red, green or blue). + +However in practice arrays with more than two dimensions are not used that often. Most of the time when we have data with more than two dimensions we will use other structures like `data.frames` or `lists` (see below). + +An array with two dimensions is equivalent to a matrix. + +::: + ## Lists -Elements of a vector, matrix or array are always of the same type. 
A `list` differs from a vector by also allowing its elements to be of a different type. We can make a `list` using the function `list`. +Elements of a vector, matrix or array are always of the same type. A `list` differs from a vector by also allowing its elements to be of a different type. We can make a `list` using the function `list`. ```{r} list1 <- list("eleven", 12) @@ -184,11 +210,13 @@ mode(list1) class(list1) list2 <- list(c(1, 2, 3), c('foo', 'bar')) ``` + We can also assign a name to the elements of a list: ```{r} list3 <- list(numbers=c(1, 2, 3), chars=c('foo', 'bar')) ``` + It is also possible for a list to contain other lists. ```{r} @@ -197,15 +225,16 @@ list4 <- list(numbers=c(1, 2, 3), chars=c('aap', 'noot'), ``` A useful function for lists is `str`. It gives us its structure. + ```{r} str(list4) ``` +This versatility makes that `list`s are used in many contexts in R. As an example we will see that the object returned by many statistical procedures in R is a `list`. The different elements of the list then contain different parts of the results of the analysis such as the estimated coefficients, standard errors, test statistics, p-values etc. ### data.frames -This is a rectangular table in which every column contains a variable and every row an observation. They are similar to the spreadsheets: a -series, or `list` of equal length columns, or `vectors`. Notably, the columns (`vector`s) can be of different -classes, **unlike** a matrix or array. + +We saved this data structure for last. But when we will perform data analyses we will work with `data.frame`s a lot. This is a rectangular table in which every column contains a variable and every row an observation. They are similar to the spreadsheets: a series, or `list` of equal length columns, or `vectors`. Notably, the columns (`vector`s) can be of different classes, **unlike** a matrix or array. 
```{r} my_df <- data.frame("vec" = c(12, 48), "lets" = letters[1:12]) @@ -220,7 +249,7 @@ Most data sets come in the form of `data.frame`s or can be converted to them so ## factors -A `factor`is a special kind of vector for categorical data. The factor contains different integers one for every category. Each unique value has an associated 'label' that tell us what the code means. Factors are frequently used when we model categorical data. An advantage of a factor over a `character` is that we can limit the number of possible outcomes. It is also less likely to make mistakes due to typing errors. Factors can be created by means of the function `factor()`. +A `factor`is a special kind of vector for categorical data. The factor contains different integers one for every category. Each unique value has an associated 'label' that tell us what the code means. Factors are frequently used when we model categorical data. An advantage of a factor over a `character` is that we can limit the number of possible outcomes. It is also less likely to make mistakes due to typing errors. Factors can be created by means of the function `factor()`. ```{r} f <- c('male', 'female', 'male') @@ -228,12 +257,11 @@ factor(f) levels(f) ``` -When a `factor` is displayed R also shows us the unique values the variable can take. These are called the 'levels' of the factor. +When a `factor` is displayed R also shows us the unique values the variable can take. These are called the 'levels' of the factor. ## functions -Functions are used to do something. We have already seen several of them. -Like `mode`, `class` and `as.integer` and you will see lots more. Note that in **R** functions are objects too. +Functions are used to do something. We have already seen several of them. Like `mode`, `class` and `as.integer` and you will see lots more. Note that in **R** functions are objects too. 
```{r} mode(mode) # like all objects functions have a mode @@ -242,40 +270,47 @@ print(mode) # do not worry if you do not understand the meaning of what is printed yet ``` -Note that operators are functions to. When you want to use them in the same way as normal functions just put them between back-ticks ("`"): +Note that operators are functions too. When you want to use them in the same way as normal functions just put them between back-ticks ("\`"): + ```{r} mode(`+`) `+`(1, 1) ``` +Functions will be discussed in more detail later. -Base **R** has many factions but many more are come from various extension packages. These can be installed using `install.packages()` : +::: callout-advanced +Some functions behave differently depending on the **class** of the object they are applied to. For example the `summary` function will give different kinds of summaries depending on its argument. Try for example: ```{r} -#| eval: false -install.packages("packageName", - lib = "/directory/to/my custom R library", - repos = "http://cran.xl-mirror.nl") - -# usually lib and repos can be omitted (left at the default) +summary(c(1, 2, 3, 4, 5)) +summary(factor(c('m', 'm', 'f'))) +library(survival) +summary(rats) ``` -The package name must be quoted when installing. -```{r} -#| eval: false -library("packageName") ## quotes are optional when loading a package -``` +These kinds of functions are called **generic** functions in R. We will not go into detail here but you should be aware that this exists. + +::: -Functions will be discussed in more detail later. ## Missing values -Whenever the value of a variable is missing this is denoted by `NA` in R. Usually this means that the values exists however we do not know it. Sometimes the result of a calculation is not finite (for example when we define a positive or negative number by zero). In this case the result is defined to be `Inf` of `-Inf` in R.
When a value cannot be computed at all (for example when we divide zero by zero) R will define the result as `NaN`, which stands for 'Not a Number'. Finally, R sometimes uses the special value `NULL` to indicate that a variable is not yet defined. Here we will mostly deal with data that is just missing, that is `NA`. +Whenever the value of a variable is missing this is denoted by `NA` in R. Usually this means that the value exists but we do not know it. Sometimes the result of a calculation is not finite (for example when we divide a positive or negative number by zero). In this case the result is defined to be `Inf` or `-Inf` in R. + +```{r} +a <- NA +is.na(a) +a +1 # almost any operation with NA results in NA +``` + +::: callout-advanced + +When a value cannot be computed at all (for example when we divide zero by zero) R will define the result as `NaN`, which stands for 'Not a Number'. Finally, R sometimes uses the special value `NULL` to indicate that a variable is not yet defined. Here we will mostly deal with data that is just missing, that is `NA`. ```{r} a <- c(1, -1, 0, NA, NULL) -a/0 is.na(a) is.finite(a) is.null(a) # note this looks at the whole object @@ -284,3 +319,5 @@ l[1] <- NULL # deletes elements from a list l ``` + +::: \ No newline at end of file diff --git a/1-6_Common_Objects_practical.qmd b/1-6_Common_Objects_practical.qmd index 8a22ada..5223c0f 100644 --- a/1-6_Common_Objects_practical.qmd +++ b/1-6_Common_Objects_practical.qmd @@ -66,8 +66,6 @@ Explore the **heart** and **retinopathy** data sets - print the first six and la #### Hint 1 Use the functions `head(...)` and `tail(...)` to investigate the data set. Replace the dots with the name of the data set. - - ::: ## Common R Objects @@ -86,8 +84,6 @@ View the vectors `event` and `age` from the **heart** data set. #### *Hint 1* Use the dollar sign to select the variables.
- - ::: ::: {.panel-tabset .nav-pills} @@ -98,8 +94,6 @@ View the vectors `eye` and `risk` from the **retinopathy** data set. #### *Hint 2* Use the dollar sign to select the variables. - - ::: ::: {.panel-tabset .nav-pills} @@ -110,8 +104,6 @@ Create a numerical vector that consists of the values: 34, 24, 19, 23, 16. Give #### *Hint 3* Use the c(...) function. Replace the dots with the numbers. - - ::: ::: {.panel-tabset .nav-pills} @@ -122,8 +114,6 @@ Create a numerical vector that takes the integer values from 1 until 200. Give t #### *Hint 4* Use the c(...) function. Replace the dots with the numbers. - - ::: ::: {.panel-tabset .nav-pills} @@ -134,8 +124,6 @@ Create a categorical vector that consists of the values: yes, yes, no, no, no, y #### *Hint 5* \* Use the c(...) function. Replace the dots with the categories. - - ::: ### Matrices and Data Frames {.tabset .tabset-fade .tabset-pills} @@ -145,13 +133,11 @@ Let's investigate some matrices and data frames. ::: {.panel-tabset .nav-pills} #### Task 1 -Create a matrix using the vectors `id` and `age` from the **heart** data set. This matrix should have 2 columns where each column represents each variable. +Create a matrix using the vectors `id` and `age` from the **heart** data set. This matrix should have 2 columns where each column represents each variable. An easy way to do this is to first create a long vector combining the data of both variables and then using the `matrix` function. -#### *Hint 1* +#### Hint 1 Use the function matrix(...). - - ::: ::: {.panel-tabset .nav-pills} @@ -159,65 +145,88 @@ Use the function matrix(...). Create a data frame using the vectors `id`, `type` and `trt` from the **retinopathy** data set. This data frame should have 3 columns, where each column represents each variable. -#### *Hint 2* +#### Hint 2 Use the function data.frame(...). - - ::: -### Arrays {.tabset .tabset-fade .tabset-pills} +### Lists {.tabset .tabset-fade .tabset-pills} -Let's investigate some arrays. 
+Let's investigate some lists. ::: {.panel-tabset .nav-pills} #### Task 1 -Create an array that consists of 2 matrices. Matrix 1 will consist of the values 1:4 and matrix 2 will consist of the values 5:8. Both matrices will have 2 columns and 2 rows. +Create a list using the vectors `stop` from the **heart** data set and `id`, `risk` from the **retinopathy** data set. Give the names `stop_heart`, `id_reti` and `risk_reti`. -#### *Hint 1* +#### Hint 1 -Use the function array(...). +Use the function list(...). +::: +::: {.panel-tabset .nav-pills} +#### Task 2 🎓 +Create a list using the vectors `numbers`, `numbers_2` and `treatment`. These variables can be taken from the exercise called `Vectors`. Give the names: `numbers`, `many_numbers` and `treatment`. + +#### Hint 2 🎓 + +Use the function list(...). ::: -::: {.panel-tabset .nav-pills} -#### Task 2 \* +::: callout-advanced -Give the name `ar1` to the previous array. Furthermore, investigate the argument dimnames and change the names of the rows, columns and matrices. +### Arrays 🎓 {.tabset .tabset-fade .tabset-pills} -#### *Hint 2* \* +An array is a multi-dimensional data structure, perfect for storing data with more than two dimensions (e.g., rows, columns, and "layers"). We will set up a data structure to store grades for 3 students, across 2 different subjects, over 2 academic semesters. An `array` is the perfect tool for this job because our data has three dimensions: +students, subjects and semesters. Let's build it step-by-step. -Use the function array(...). Check the help page for the dimnames argument. Note that this must be in a list format. +::: {.panel-tabset .nav-pills} +#### Task 1 🎓 +First, we need a single vector containing all the data points (grades) that will fill our array. We have 3 students, 2 subjects, and 2 semesters. The total number of grades we need is $3 \times 2 \times 2 = 12$. Let's create a vector with 12 sample grades.
R will fill the array "column-wise" (filling the first dimension, then the second, then the third). Create a vector named `all_grades` containing the following 12 grades. +```{r} +#| eval: false + +all_grades <- c(8, 9, 5, 8, 9, 8, 7, 9, 8, 8, 9, 8) +``` + +Next, we need to tell R the shape of our array. This is done with a numeric vector passed to the `dim` argument of the `array` function. We want 3 students, 2 subjects, and 2 semesters. Set up the vector `dimensions` containing the number of students, subjects and semesters in that order. ::: -### Lists {.tabset .tabset-fade .tabset-pills} +::: {.panel-tabset .nav-pills} +#### Task 2 🎓 -Let's investigate some lists. +We can build the array using the `array()` function. Pass our variables (`all_grades` and `dimensions`) to the correct arguments (`data` and `dim`). Use `?array` if you need help. Call the resulting object `grades_array`. + +#### Hint 2 + +Use `array(data = ... , dim = ...)` +::: ::: {.panel-tabset .nav-pills} -#### Task 1 +#### Task 3 🎓 -Create a list using the vectors `stop` from the **heart** data set and `id`, `risk` from the **retinopathy** data set. Give the names `stop_heart`, `id_reti` and `risk_reti`. +An array is much more useful if its dimensions have names. Let's set up the names for our rows, columns and layers. We use the `dimnames` argument, which takes a list of character vectors. The order of the list must match the order of our dimensions (students, subjects, semesters). The students are called: Anna, Ben and Clara. The subjects are called: Math and History. The semesters are called: Fall and Spring. Recreate the array `grades_array` with the correct dimension names. -#### *Hint 1* +#### *Hint 3* -Use the function list(...). +Use the function `array(...)`. Check the help page for the `dimnames` argument. Note that this must be in a `list` format.
::: ::: {.panel-tabset .nav-pills} -#### Task 2 \* +#### Task 4 🎓 -Create a list using the vectors `numbers`, `numbers_2` and `treatment`. These variables can be taken from the exercise called `Vectors`. Give the names: `numbers`, `many_numbers` and `treatment`. +Now that you have created the `array`, let's see what it looks like and confirm its structure. Use +the `print`, `str`, `dim` and `dimnames` functions to display the contents, structure, and dimensions of the `grades_array`. -#### *Hint 2* \* - -Use the function list(...). +``` +::: ::: + + diff --git a/1-7_Common_Objects_answers.qmd b/1-7_Common_Objects_answers.qmd index ec15b34..cc22220 100644 --- a/1-7_Common_Objects_answers.qmd +++ b/1-7_Common_Objects_answers.qmd @@ -143,12 +143,12 @@ Create a numerical vector that takes the integer values from 1 until 200. Give t #### *Hint 4* -Use the c(...) function. Replace the dots with the numbers. +Use the colon (`:`) operator. `a:b` generates a vector consisting of the sequence of numbers from `a` to `b`. #### Solution 4 ```{r vec4-solution, solution = TRUE} -numbers_2 <- c(1:200) +numbers_2 <- 1:200 ``` ::: @@ -175,23 +175,23 @@ Let's investigate some matrices and data frames. ::: {.panel-tabset .nav-pills} #### Task 1 -Create a matrix using the vectors `id` and `age` from the **heart** data set. This matrix should have 2 columns where each column represents each variable. +Create a matrix using the vectors `id` and `age` from the **heart** data set. This matrix should have 2 columns where each column represents each variable. An easy way to do this is to first create a long vector combining the data of both variables and then using the `matrix` function. #### *Hint 1* -Use the function matrix(...). +Use `matrix(c(..., ...), ncol=...)`. Fill in the dots.
#### Solution 1 ```{r mat-solution-1, solution = TRUE} -matrix(c(heart$id, heart$age), , 2) +matrix(c(heart$id, heart$age), ncol=2) ``` ::: ::: {.panel-tabset .nav-pills} #### Task 2 -Create a data frame using the vectors `id`, `type` and `trt` from the **retinopathy** data set. This data frame should have 3 columns, where each column represents each variable. +Create a `data.frame` using the vectors `id`, `type` and `trt` from the **retinopathy** data set. This data frame should have 3 columns, where each column represents each variable. #### *Hint 2* @@ -204,75 +204,145 @@ data.frame(id = retinopathy$id, type = retinopathy$type, trt = retinopathy$trt) ``` ::: -### Arrays {.tabset .tabset-fade .tabset-pills} +### Lists {.tabset .tabset-fade .tabset-pills} -Let's investigate some arrays. +Let's investigate some lists. ::: {.panel-tabset .nav-pills} #### Task 1 -Create an array that consists of 2 matrices. Matrix 1 will consist of the values 1:4 and matrix 2 will consist of the values 5:8. Both matrices will have 2 columns and 2 rows. +Create a list using the vectors `stop` from the **heart** data set and `id`, `risk` from the **retinopathy** data set. Give the names `stop_heart`, `id_reti` and `risk_reti`. #### *Hint 1* -Use the function array(...). +Use the function list(...). #### Solution 1 -```{r ar-solution-1, solution = TRUE} -array(data = 1:8, dim = c(2, 2, 2)) +```{r list-solution, solution = TRUE} +list(stop_heart = heart$stop, id_reti = retinopathy$id, risk_reti = retinopathy$risk) ``` ::: ::: {.panel-tabset .nav-pills} #### Task 2 \* -Give the name `ar1` to the previous array. Furthermore, investigate the argument dimnames and change the names of the rows, columns and matrices. +Create a list using the vectors `numbers`, `numbers_2` and `treatment`. These variables can be taken from the exercise called `Vectors`. Give the names: `numbers`, `many_numbers` and `treatment`. #### *Hint 2* \* -Use the function array(...). 
Check the help page for the dimnames argument. Note that this must be in a list format. +Use the function list(...). #### Solution 2 \* -```{r ar-solution-2, solution = TRUE} -ar1 <- array(data = 1:8, dim = c(2, 2, 2), - dimnames = list(c("Row1", "Row2"), c("Col1", "Col2"), c("Mat1", "Mat2"))) +```{r list2-solution, solution = TRUE} +list(numbers = numbers, many_numbers = numbers_2, treatment = treatment) ``` ::: -### Lists {.tabset .tabset-fade .tabset-pills} +::: callout-advanced + +### Arrays {.tabset .tabset-fade .tabset-pills} + +An array is a multi-dimensional data structure, perfect for storing data with more than two dimensions (e.g., rows, columns, and "layers"). We will set up a data structure to store grades for 3 students, across 2 different subjects, over 2 academic semesters. An `array` is the perfect tool for this job because our data has three dimensions: +students, subjects and semesters. Let's build it step-by-step. + -Let's investigate some lists. ::: {.panel-tabset .nav-pills} #### Task 1 -Create a list using the vectors `stop` from the **heart** data set and `id`, `risk` from the **retinopathy** data set. Give the names `stop_heart`, `id_reti` and `risk_reti`. +First, we need a single vector containing all the data points (grades) that will fill our array. We have 3 students, 2 subjects, and 2 semesters. The total number of grades we need is $3 \times 2 \times 2 = 12$. Let's create a vector with 12 sample grades. R will fill the array "column-wise" (filling the first dimension, then the second, then the third). Create a vector named `all_grades` containing the following 12 grades. +```{r} +#| eval: false -#### *Hint 1* +all_grades <- c(8, 9, 5, 8, 9, 8, 7, 9, 8, 8, 9, 8) +``` + +Next, we need to tell R the shape of our array. This is done with a numeric vector passed to the `dim` argument of the `array` function. We want 3 students, 2 subjects, and 2 semesters.
Set up the vector `dimensions` containing the number of students, subjects and semesters in that order. -Use the function list(...). #### Solution 1 -```{r list-solution, solution = TRUE} -list(stop_heart = heart$stop, id_reti = retinopathy$id, risk_reti = retinopathy$risk) +```{r ar-solution-1, solution = TRUE} +dimensions <- c(3, 2, 2) ``` ::: ::: {.panel-tabset .nav-pills} -#### Task 2 \* +#### Task 2 -Create a list using the vectors `numbers`, `numbers_2` and `treatment`. These variables can be taken from the exercise called `Vectors`. Give the names: `numbers`, `many_numbers` and `treatment`. +We can build the array using the `array()` function. Pass our variables (`all_grades` and `dimensions`) to the correct arguments (`data` and `dim`). Use `?array` if you need help. Call the resulting object `grades_array`. -#### *Hint 2* \* -Use the function list(...). +#### *Hint 2* -#### Solution 2 \* +Use `array(data = ... , dim = ...)` + + + +#### Solution 2 + +```{r ar-solution-1b, solution = TRUE} +all_grades <- c(8, 9, 5, 8, 9, 8, 7, 9, 8, 8, 9, 8) +dimensions <- c(3, 2, 2) +grades_array <- array(data = all_grades, dim = dimensions) +``` +::: + + + + +::: {.panel-tabset .nav-pills} +#### Task 3 + +An array is much more useful if its dimensions have names. Let's set up the names for our rows, columns and layers. We use the `dimnames` argument, which takes a list of character vectors. The order of the list must match the order of our dimensions (students, subjects, semesters). The students are called: Anna, Ben and Clara. The subjects are called: Math and History. The semesters are called: Fall and Spring. Recreate the array `grades_array` with the correct dimension names. + +#### *Hint 3* + +Use the function `array(...)`. Check the help page for the `dimnames` argument. Note that this must be in a `list` format.
+ +#### Solution 3 + +```{r ar-solution-2, solution = TRUE} +# Create a list of names for each dimension +student_names <- c("Anna", "Ben", "Clara") +subject_names <- c("Math", "History") +semester_names <- c("Fall", "Spring") + +dimension_names <- list(student_names, subject_names, semester_names) +grades_array <- array( + data = all_grades, + dim = dimensions, + dimnames = dimension_names +) -```{r list2-solution, solution = TRUE} -list(numbers = numbers, many_numbers = numbers_2, treatment = treatment) ``` ::: + +::: {.panel-tabset .nav-pills} +#### Task 4 + +Now that you have created the `array`, let's see what it looks like and confirm its structure. Use +the `print`, `str`, `dim` and `dimnames` functions to display the contents, structure, and dimensions of the `grades_array`. + +#### Solution 4 + +```{r ar-solution-2b, solution = TRUE} +# Print the array to the console +print(grades_array) + +# Check the structure of the object +str(grades_array) + +# Check the dimensions +dim(grades_array) + +# Check the dimension names +dimnames(grades_array) +``` +::: + +::: + + diff --git a/1-8_Indexing_Subsetting.qmd b/1-8_Indexing_Subsetting.qmd index 8be52cc..bbc5490 100644 --- a/1-8_Indexing_Subsetting.qmd +++ b/1-8_Indexing_Subsetting.qmd @@ -1,6 +1,6 @@ # Indexing and Subsetting {.unnumbered} -**Author: {Karl Brand, Elizabeth Ribble and Sten Willemsen** \# Manipulating / Selecting Data} +**Authors: Karl Brand, Elizabeth Ribble and Sten Willemsen** Often we want to calculate some things only for a specific subgroup of our patient group. (For example we want to calculate the average for the variables age and weight, but only for the women in the data set and not for the men). So it is important to select specific variables and observations. In R making these selections is called indexing. We start by showing how we can make selections within a vector.
Afterwards we will see how selecting variables and observations in more complex data structures such as matrices, `data.frame`s and `list`s. @@ -120,8 +120,8 @@ m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), ncol = 3) m[1, ] # first row m[ , 2] # second column m[c(2, 4), 1:2] # combination of rows and colums -m[, -c(1, 3)] # negative indices -m[m[,1]>=3,] #logical indices +m[ , -c(1, 3)] # negative indices +m[m[ , 1]>=3, ] #logical indices ``` Note that when a single row or column is selected the object is converted to a vector; a frequent source of errors. if you want to prevent this you can use `drop=FALSE`. @@ -138,6 +138,17 @@ m m["A", ] ``` +As shown above we usually use two indices to select elements in a matrix. However, there it is also possible to use a single index. In this case R treats the matrix as a vector by stacking the columns on top of each other. + +```{r} +m[6] + +``` + +::: callout-advanced + +## Using a single index + There exist a special way of using a matrix with two columns to select individual elements out of a matrix based on their two-dimensional coordinates. ```{r} @@ -146,6 +157,8 @@ a m[a] ``` +## Arrays + Indexing for arrays is similar as for matrices. ```{r} @@ -158,9 +171,19 @@ a[,,'a'] ``` +We can use `drop=FALSE` here as well to prevent R from converting the result to a lower dimension. + +```{r} +a[c(TRUE, FALSE), , ] +a[c(TRUE, FALSE), , , drop=FALSE] +``` + + +::: + ## Making selections in a list -To make selections in a list, we can use single square brackets, double square brackets and the dollar sign. The use of single square brackets works in the same way as it does for vectors. +To make selections in a list, we can use single square brackets, double square brackets (`[[` and `]]`) and the dollar sign. The use of single square brackets works in the same way as it does for vectors. 
```{r} mylist <- list( @@ -205,7 +228,7 @@ It is not possible to select more than one element using double brackets; The re ## Selecting observations and variables in a `data.frame` -Selecting observations and variables in a `data.frame` works more or less the same for `data.frames` as it does for lists. However because a `data.frame` is two dimensional we can two indices between the square brackets. The first one corresponds to the observations (rows) and the second corresponds to the variables (columns). So, as an example, we can select the first two observations from the third variable in the `data.frame` using the syntax: +Making selections in a `data.frame` sometimes behaves like making selections in a matrix, and sometimes like making selections in a list. When we use a double index, selecting observations and variables in a `data.frame` works more or less the same as for a `matrix`. Because a `data.frame` is two dimensional we can use two indices between the square brackets. The first one corresponds to the observations (rows) and the second corresponds to the variables (columns). So, as an example, we can select the first two observations from the third variable in the `data.frame` using the syntax: ```{r} mydata <- data.frame(id=c(1, 2, 3, 4, 5), @@ -217,7 +240,7 @@ mydata[c(1, 2), 3] When the first or second position is left blank all rows or columns are selected. For example: ```{r} -mydata[, 3] # sex (3rd variable) for all patients +mydata[, c(2, 3)] # sex and weight (2nd and 3rd variable) for all patients mydata[c(1,2), ] # all variables for the first two patients mydata[c(-3,-4), 'sex'] # Negative numbers and names can also be used ``` @@ -229,11 +252,17 @@ mydata[ , 3] mydata[ , 3, drop=FALSE] ``` +Note that if we select a single row the result is still a `data.frame`. Conversion to a vector would not be possible in this case as a row of a `data.frame` may contain multiple data types. 
+```{r} +mydata[2 , ] + +``` + We can also use a single index between the square brackets. This works as if the `data.frame` was a list of variables (it's columns). ```{r} mydata[1] -mydata[[1]] +mydata[[1]] # double brackets convert the result to a vector ``` A `data.frame` has `row.names` (note the dot) as well as variable names we can use for selection. Let's look at an example where row names are gene symbols and column names are sample IDs: diff --git a/1-9_Indexing_Subsetting_practical.qmd b/1-9_Indexing_Subsetting_practical.qmd index 8b1513b..1d8032d 100644 --- a/1-9_Indexing_Subsetting_practical.qmd +++ b/1-9_Indexing_Subsetting_practical.qmd @@ -67,28 +67,66 @@ Sometimes we want to obtain a subset of the data sets before investigating the d Using the **heart** data set:\ -- Select the first row.\ -- Select the first column.\ -- Select the column `surgery`. - +- Select the first row of the `data.frame` using `[]`.\ +- Select the second and third column of the `data.frame` using `[]`.\ ::: ::: {.panel-tabset .nav-pills} #### Task 2 -Create a matrix that takes the values 1:4 and has 2 rows and 2 columns. You can name this object `mat`. Select the second row of all columns. +Using the **heart** data set:\ +- Select the column `surgery` of the `data.frame` in multiple ways: \ + * As a `data.frame` with a single column using single square brackets and a single index + * As a `data.frame` with a single column using single square brackets and a double index using `drop=FALSE` \ + * As a vector using double square brackets. \ + * As a vector using single square brackets and a double index. \ + * Using the dollar sign (`$`) operator. \ +- Verify the class of the returned objects in each case using the `class()` function. ::: ::: {.panel-tabset .nav-pills} #### Task 3 -Create an array that consists of 2 matrices. Matrix 1 will consist of the values 1:4 and matrix 2 will consist of the values 5:8. 
Both matrices will have 2 columns and 2 rows. Give the name `ar1` to the this array. Select the 2nd row of all columns from each matrix. +Create a matrix that takes the values 1:6 and has 3 rows and 2 columns. You can name this object `mat`. +- Select the second row of all columns. +- Select the first column. +- Select the element in the 3rd row and 2nd column. +::: + + + +::: callout-advanced + + +::: {.panel-tabset .nav-pills} +#### Task 4 + +From the matrix in the previous task, select the element in the 1st row and 2nd column and the element in the 3rd row and 1st column. Use a matrix as an index. + +::: + +::: {.panel-tabset .nav-pills} +#### Task 5 + +The following table contains the average temperatures (in °C) for January and July in 3 different cities. + +| City | Month | 2020 | 2021 | 2023 | +|:---|---|---|---|---| +| Rotterdam | January | 6.2 | 3.8 | 5.8 | +| | July | 17.0 | 18.0 | 18.0 | +| Berlin | January | 4.0 | 1.0 | 4.0 | +| | July | 17.8 | 19.2 | 20.0 | +| Athens | January | 8.0 | 11.0 | 11.0 | +| | July | 28.0 | 29.0 | 31.0 | +Create an array where the rows denote the different years, the columns the months and the layers the cities. Now list the temperature in January, 2023 in each of the cities. Also calculate the average temperature in July in Rotterdam (use the `mean` function). ::: -### Subsetting {.tabset .tabset-fade .tabset-pills} +::: + +### Subsetting a data set {.tabset .tabset-fade .tabset-pills} ::: {.panel-tabset .nav-pills} #### Task 1 @@ -96,9 +134,7 @@ Create an array that consists of 2 matrices. 
Matrix 1 will consist of the values Using the **retinopathy** data set:\ - Select the `futime` for all `adult` patients.\ -- Select all the variables for patients that received treatment.\ - - +- Select all the variables for patients that received treatment (`trt==1`).\ ::: ::: {.panel-tabset .nav-pills} @@ -106,20 +142,20 @@ Using the **retinopathy** data set:\ Using the **retinopathy** data set:\ -- Select the `age` for patients that have `futime` more than 20.\ +- Select the `age` for patients that have `futime` more than 20. (When you have time can you think of a second way to do this?)\ - Select the `age` for patients that have `futime` more than 20 and are adults.\ -- Select patients that have no missing values in `age`. - - +- Select only the rows of the left eye. If needed look in the documentation of the data set to find out how variables are encoded. +- Select only the rows of adult patients. +- Select all rows for patients that have no missing values in `age`. ::: ::: {.panel-tabset .nav-pills} #### Task 3 Using the **retinopathy** data set:\ - -- Select only the rows of the left eye. -- Select only the rows of adult patients. +- Calculate the mean risk score for the treated eyes. +- Calculate the mean age of patients for which the variable age is not missing. +- Create a `summary` of the juvenile patients. 
::: diff --git a/_brand.yml b/_brand.yml new file mode 100644 index 0000000..58a5950 --- /dev/null +++ b/_brand.yml @@ -0,0 +1,26 @@ +color: + palette: + offwhite: "#fafafa" + black: "#000" + emcdblue: "#0c2074" + emclblue: "#86d2ed" + emcmblue: "#0078d1" + emcdbluevar: "#00216d" + emclbluevar: "#98DDF5" + emcverylightblue: "#cfedf8" + emcveryverylightblue: "#dcf2fa" + emcred: "#DB3324" + emcpurple: "#5C37B4" + emclila: "#b281e2" + emcgreen: "#008080" + emcaccent: "#df4512" + emcaccent2: "#afcc46" + emclightgrey: "#faf9f9" + emcdarkgrey: "#868686" + background: offwhite + foreground: black + primary: emcdblue + secondary: emcdarkgrey + tertiary: emcaccent + success: emcgreen + danger: emcred diff --git a/_quarto.yml b/_quarto.yml index 5b4403b..5952ee9 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -3,8 +3,8 @@ project: output-dir: docs book: title: "Basic R Course" - author: "Sara Baart, David Nieuwenhuijse, Sten Willemsen, Sanne Hoeks, Elrozy Andrinopoulou, Elizabeth Ribble, Karl Brand" - date: "04-06-2024" + author: "Sara Baart, David Nieuwenhuijse, Sten Willemsen, Elrozy Andrinopoulou, Elizabeth Ribble, Karl Brand" + date: "04-10-2025" chapters: - index.qmd - part: 1-1_Introduction.qmd @@ -62,12 +62,23 @@ book: format: html: + css: custom.css theme: light: cosmo dark: solar pdf: documentclass: scrreprt + pdf-engine: lualatex # or xelatex + include-in-header: + - custom.tex + - text: | + \usepackage{tcolorbox} + \usepackage{fontawesome5} + \definecolor{erasmusblue}{HTML}{00A1E4} + \definecolor{erasmuslightblue}{HTML}{E6F5FB} + +editor: source + -editor: visual number-depth: 2 \ No newline at end of file diff --git a/custom.css b/custom.css new file mode 100644 index 0000000..b626488 --- /dev/null +++ b/custom.css @@ -0,0 +1,52 @@ +/* Base box */ +.callout-bg { + position: relative; + background-color: #dcf2fa; + border-left: 5px solid #afcc46; + padding: 0.85rem 1rem 0.85rem 2.5rem; /* space for icon */ + border-radius: .25rem; +} + +/* The icon */ 
+.callout-bg::before { + content: "📚"; /* your emoji/icon */ + position: absolute; + left: 0.75rem; /* inside the left gutter */ + top: 0.9rem; /* tweak for vertical alignment */ + font-size: 1.15rem; + line-height: 1; +} + +/* Advanced */ +.callout-advanced { + position: relative; + background-color: #dcf2fa; + border-left: 5px solid #df4512; + padding: 0.85rem 1rem 0.85rem 2.5rem; + border-radius: .25rem; +} +.callout-advanced::before { + content: "🎓"; /* or use the SVG background (below) */ + position: absolute; + left: 0.75rem; + top: 0.9rem; + font-size: 1.15rem; + line-height: 1; +} + +/* XXX */ +.callout-xxx { + position: relative; + background-color: #AFA4F6; + border-left: 5px solid #FF9800; + padding: 0.85rem 1rem 0.85rem 2.5rem; + border-radius: .25rem; +} +.callout-xxx::before { + content: "🎓"; + position: absolute; + left: 0.75rem; + top: 0.9rem; + font-size: 1.15rem; + line-height: 1; +} \ No newline at end of file diff --git a/custom.tex b/custom.tex new file mode 100644 index 0000000..2b20647 --- /dev/null +++ b/custom.tex @@ -0,0 +1,61 @@ +\usepackage{fontspec} % needed for LuaLaTeX/XeLaTeX +% Choose an emoji font you have; Noto Color Emoji is common +\newfontfamily\emojifont{Noto Color Emoji}[Renderer=Harfbuzz] +\newcommand{\emoji}[1]{{\emojifont #1}} + +% Colors +\usepackage{xcolor} +\definecolor{erasmusblue}{HTML}{00A1E4} +\definecolor{erasmuslightblue}{HTML}{E6F5FB} +\definecolor{advancedorange}{HTML}{FF9800} +\definecolor{advancedlightorange}{HTML}{FFF4E6} + +\usepackage[most]{tcolorbox} +\tcbset{ + enhanced, + breakable, + sharp corners, + boxsep=0pt, + left=28pt, % room for icon + gutter + right=10pt, + top=10pt, + bottom=10pt, + leftrule=5pt, + toprule=0pt, + bottomrule=0pt, + rightrule=0pt, +} + +% A small helper to place an icon at top-left for all split segments +\newcommand{\CalloutIconOverlays}[1]{% + overlay unbroken={\node[anchor=north west] at (frame.north west) {\raisebox{.2ex}{\large \emoji{#1}}};}, + overlay 
first={\node[anchor=north west] at (frame.north west) {\raisebox{.2ex}{\large \emoji{#1}}};}, + overlay middle={\node[anchor=north west] at (frame.north west) {\raisebox{.2ex}{\large \emoji{#1}}};}, + overlay last={\node[anchor=north west] at (frame.north west) {\raisebox{.2ex}{\large \emoji{#1}}};}, +} + +% Redefine callout environments for custom types +\newenvironment{callout-background} +{% + \begin{tcolorbox}[ + colback=erasmuslightblue, + colframe=erasmusblue, + \CalloutIconOverlays{📚} + ] +} +{% + \end{tcolorbox} +} + +% Advanced callout +\newenvironment{callout-advanced} +{% + \begin{tcolorbox}[ + colback=advancedlightorange, + colframe=advancedorange, + \CalloutIconOverlays{🎓} + ] +} +{% + \end{tcolorbox} +} \ No newline at end of file