statslab/data.qmd at main · cluhmann/statslab · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
# Working with Data {#sec-working-with-data}

```{r}
#| echo: false
#| output: false
library(tidyverse)
```

## Loading and saving data {#sec-load}


There are several different types of files that are likely to be relevant to us when doing data analysis. There are three in particular that are especially important from the perspective of this book:

- *Comma separated value (CSV) files* are those with a .csv file extension. These are just regular old text files, and they can be opened with almost any software. This means that storing data in CSV files does not tie users to any particular software and keeps things simple.
- *Workspace files* are those with a .Rdata file extension. This is the standard kind of file that R uses to store data and variables. They're called "workspace files" because you can use them to save your whole workspace.

### Importing data from CSV files using `read_csv` {#sec-loadingcsv}

One quite commonly used data format is the humble "comma separated value" file, also called a CSV file, and usually bearing the file extension .csv. CSV files are just plain old-fashioned text files, and what they store is basically just a table of data. This is illustrated in Figure @fig-booksalescsv, which shows a file called booksales.csv that I've created. As you can see, each row corresponds to a variable, and each row represents the book sales data for one month. The first row doesn't contain actual data though: it has the names of the variables.

![The booksales.csv data file. On the left, I've opened the file in using a spreadsheet program (OpenOffice), which shows that the file is basically a table. On the right, the same file is open in a standard text editor (the TextEdit program on a Mac), which shows how the file is formatted. The entries in the table are wrapped in quote marks and separated by commas.](images/mechanics/booksalescsv.jpg){#fig-booksalescsv}


The easiest way to open this file will be to use the `read_csv()` function. This function is pretty flexible and there are lots of options that you can tweak. For now there's only a single argument to the function that I'll mention:

- `file`. This should be a character string that specifies a path to the file that needs to be loaded. You can use an absolute path or a relative path to do so.

Therefore, to import the CSV file, the command I need is:

```{r}
#| eval: false
books <- read_csv("./data/booksales.csv")
```

```{r}
#| echo: false
#| warning: false
#| output: false
books <- read_csv("./data/booksales.csv")
```

There are two very important points to notice here. First, notice that I *didn't* try to use the `load()` function, because that function is only meant to be used for .Rdata files. If you try to use `load()` on other types of data, you get an error. Second, notice that when I imported the CSV file I assigned the result to a variable, which I imaginatively called `books` file. There's a reason for this. The idea behind an `.Rdata` file is that it stores a whole workspace. So, if you had the ability to look inside the file yourself you'd see that the data file keeps track of all the variables and their names. So when you `load()` the file, R restores all those original names. CSV files are treated differently: as far as R is concerned, the CSV only stores *one* variable, but that variable is big table. So when you import that table into the workspace, R expects *you* to give it a name. Let's have a look at what we've got:

```{r}
print(books)
```

You can instead open the data in RStudio's spreadsheet viewer:

```{r}
#| eval: false
View(books)
```

The `books` data set is quite small, so just calling `print()` is fine.  But for larger data sets, the `View()` and the spreadsheet viewer allows for a more thorough inspection.

Clearly, it's worked, but the format of this output is a bit unfamiliar. We haven't seen anything like this before. What you're looking at is a *data frame*, which is a very important kind of variable in R, and one I'll discuss at length in @sec-dataframes. For now, let's just be happy that we imported the data and that it looks about right.

::: {.callout-note}
In a lot of books you'll see the `read.table()` or `read.csv()` functions used for this purpose instead of `read_csv()`. They're similar functions, but `read_csv()` is a function from the `readr` package, one of the many packages that make up "the tidyverse".  The tidyverse is a set of interoperable packages we will be using throughout this course. Further information about the tidyverse can be found in @sec-tidyverse.
:::


### Alternative CSV importation {#sec-importing}

::: {.callout-warning}
As detailed above, `read_csv()` is how we will import CSV data files into R. This section details an alternative way to import CSV data files, but it will not produce the same kind of variable.  Using RStudio's built-in "Import Dataset" functionality will produce a dataframe, whereas `read_csv` produces a "tibble". We will Further information about the difference between plain dataframes and tibbles can be found in @sec-dataframes.
:::

In the environment panel in RStudio you should see a button called "Import Dataset". Click on that, and it will give you a couple of options: select the "From Text File..." option, and it will open up a very familiar dialog box asking you to select a file: if you're on a Mac, it'll look like the usual Finder/Explorer window that you use to choose a file. I'm assuming that you're familiar with your own computer, so you should have no problem finding the CSV file that you want to import! Find the one you want, then click on the "Open" button. When you do this, you'll see a window that looks like the one in @fig-import.

![A dialog box on a Mac asking you to select the CSV file R should try to import. Mac users will recognise this immediately: it's the usual way in which a Mac asks you to find a file. Windows users won't see this: they'll see the usual explorer window that Windows always gives you when it wants you to select a file.](images/rstudio_screenshots/openscreen.png){#fig-import}


The import data set window is relatively straightforward to understand.


![The RStudio window for importing a CSV file into R.](images/rstudio_screenshots/import.png)

In the top left corner, you need to type the name of the variable you R to create. By default, that will be the same as the file name: our file is called `booksales.csv`, so RStudio suggests the name `booksales`. If you're happy with that, leave it alone. If not, type something else. Immediately below this are a few things that you can tweak to make sure that the data gets imported correctly:

- Heading. Does the first row of the file contain raw data, or does it contain headings for each variable? The `booksales.csv` file has a header at the top, so I selected "yes".
- Separator. What character is used to separate different entries? In most CSV files this will be a comma (it is "comma separated" after all). But you can change this if your file is different.
- Decimal. What character is used to specify the decimal point? In English speaking countries, this is almost always a period (i.e., `.`). That's not universally true: many European countries use a comma. So you can change that if you need to.
- Quote. What character is used to denote a block of text? That's usually going to be a double quote mark. It is for the `booksales.csv` file, so that's what I selected.

One nice thing about the RStudio window is that it shows you the raw data file at the top of the window, and it shows you a preview of the data at the bottom. If the data at the bottom doesn't look right, try changing some of the settings on the left hand side. Once you're happy, click "Import". When you do, two commands appear in the R console:

```{r}
#| eval: false
booksales <- read.csv("./data/booksales.csv")
View(booksales)
```

The first of these commands is the one that loads the data. The second one will display a pretty table showing the data in RStudio.

Note, however, that this variable looks a bit different from the `sales` variable we created using `read_csv`. This is because RStudio has used `read.csv()` rather than `read_csv()` as we did in the previous section.  Again, the differences will be discussed more in @sec-dataframes.


### Loading workspace files using R

When I used the `list.files()` command to list the contents of the `data` directory, the output referred to a file called booksales.Rdata. Let's say I want to load the data from this file into my workspace. The way I do this is with the `load()` function. There are two arguments to this function, but the only one we're interested in is

- `file`. This should be a character string that specifies a path to the file that needs to be loaded. You can use an absolute path or a relative path to do so.

Using the absolute file path, the command would look like this:

```{r}
#| eval: false
load( file = "/home/christian/Documents/teaching/statslab/data/booksales.Rdata" )
```

but this is long and ugly. Given that the working directory is `/home/christian/Documents/teaching/statslab`, I could use a relative file path, like so:

```{r}
#| eval: false
load( file = "./data/booksales.Rdata" )
```

Another strategy would be to first change the working directory to whatever directory contains the desired file (`setwd()` or Session->Set Working Directory) and then load the file using only the file name (e.g., `load(file = "booksales.Rdata")`).  This may seem tempting, but we will avoid doing this because changing the working directory can be confusing if you're not paying close attention. Instead, we will assume that a) our working directory is the "statslab" directory we created previously (@sec-rlangactivities) and b) all of our data files (CSVs and workspaces) are located in the "data" subdirectory. Outside of this class, however, you may not be able to move data files and/or R scripts. In such cases, it may be easier to modify the working directory.  Just be aware that you may need to change the working directory both **before** and **after** loading the desired file(s). Don't get lost (you can always figure out what the current working directory is with `getwd()`)!

### Loading workspace files using RStudio

Okay, so how do we open an .Rdata file using the RStudio file panel? It's terribly simple. First, use the file panel to find the folder that contains the file you want to load. If you look at Figure @fig-filepanel, you can see that there are several .Rdata files listed. Let's say I want to load the `booksales.Rdata` file. All I have to do is click on the file name. RStudio brings up a little dialog box asking me to confirm that I do want to load this file. I click yes. The following command then turns up in the console,

```{r}
#| eval: false
load("./data/booksales.Rdata")
```

and the new variables will appear in the workspace (you'll see them in the Environment panel in RStudio, or if you type `who()`). So easy it barely warrants having its own section.

![The file panel is the area shown in the lower right hand corner. It provides a very easy way to browse and navigate your computer using R. See main text for details.](images/mechanics/filepanel.png){#fig-filepanel}


### Saving a workspace file using `save`

Not surprisingly, saving data is very similar to loading data. Although RStudio provides a simple way to save files (see below), it's worth understanding the actual commands involved. There are two commands you can use to do this, `save()` and `save.image()`. If you're happy to save *all* of the variables in your workspace into the data file, then you should use `save.image()`. And if you're happy for R to save the file into the current working directory, all you have to do is this:

```{r}
#| eval: false
save.image( file = "myfile.Rdata" )
```

Since `file` is the first argument, you can shorten this to `save.image("myfile.Rdata")`; and if you want to save to a different directory, then (as always) you need to be more explicit about specifying the path to the file. Suppose, however, I have several variables in my workspace, and I only want to save some of them. For instance, I might have this as my workspace:

```{r}
#| eval: false
who()
##   -- Name --   -- Class --   -- Size --
##   data         data.frame    3 x 2
##   handy        character     1
##   junk         numeric       1
```

I want to save `data` and `handy`, but not `junk`. But I don't want to delete `junk` right now, because I want to use it for something else later on. This is where the `save()` function is useful, since it lets me indicate exactly which variables I want to save. Here is one way I can use the `save` function to solve my problem:

```{r}
#| eval: false
save(data, handy, file = "myfile.Rdata")
```
Importantly, you *must* specify the name of the `file` argument. The reason is that if you don't do so, R will think that `"myfile.Rdata"` is actually a *variable* that you want to save, and you'll get an error message. Finally, I should mention a second way to specify which variables the `save()` function should save, which is to use the `list` argument. You do so like this:
```{r}
#| eval: false
vars_to_save <- c("data", "handy")   # the variables to be saved
save( file = "booksales2.Rdata", list = vars_to_save )   # the command to save them
```

### Saving a workspace file using RStudio {#sec-save1}

RStudio allows you to save the workspace pretty easily. In the environment panel (@fig-workspace) you can see the "save" button. There's no text, but it's the same icon that gets used on every computer everywhere: it's the one that looks like a floppy disk. You know, those things that haven't been used in about 25 years.

![The RStudio Environment panel shows you the contents of the workspace. The view shown above is the list view. To switch to the grid view, click on the menu item on the top right that currently reads list. Select grid from the dropdown menu, and then it will switch to a view like the one shown in the other workspace figure](images/mechanics/workspacepanel.png){#fig-workspace}

Alternatively, go to the "Session" menu and click on the "Save Workspace As..." option.  This will bring up the standard "save" dialog box for your operating system. Type in the name of the file that you want to save it to, and all the variables in your workspace will be saved to disk. You'll see an R command like this:

```{r}
#| eval: false
save.image("~/Desktop/Untitled.RData")
```

Pretty straightforward, really.

## Exercises {#sec-dataactivities}

- Download data (see @sec-datafiles)
- Inside the "statslab" directory/folder you created at the end of @sec-rlangactivities, create a sub-directory/folder and name it "data"
- Unzip the data files and place them in the "data" directory/folder you just created
- Open NBA data file (`nba_all_seasons.csv`) in your favorite spreadsheet software (e.g., Microsoft Excel).
- Briefly review the content of this file.