# R as a Calculator
```{r}
#| echo: false
#| output: false
library(knitr)
```
## Commands at the console
One of the easiest things you can do with R is use it as a simple calculator, so it's a good place to start. For instance, try typing `10 + 20`, and hitting enter. The simple act of typing it rather than "just reading" makes a big difference. It makes the concepts more concrete, and it ties the abstract ideas (programming and statistics) to the actual context in which you need to use them. Statistics is something you *do*, not just something you read about in a textbook. When you type `10 + 20` and hit enter, you've entered a ***command***, and R will "execute" that command. What you see on screen now will be this:
```
> 10 + 20
[1] 30
```
There shouldn't be much surprise here. But there's a few things worth talking about, even with such a simple example. First, it's important that you understand how to read the code example. In this example, what was typed into the RStudio console was the `10 + 20` part. The `>` symbol was not typed. That's just the command prompt and isn't part of the actual command. The `[1] 30` part was also not typed into the console. That's what R printed out in response to the `10 + 20` code.
Second, it's important to understand how the output is formatted. Obviously, the correct answer to the sum `10 + 20` is `30`, and not surprisingly R has printed that out as part of its output. But it's also printed out this `[1]` part, which probably doesn't make a lot of sense to you right now. You're going to see that a lot. I'll talk about what this means in a bit more detail later on, but for now you can think of `[1] 30` as if R were saying "the answer to the 1st question you asked is 30". That's not quite accurate, but it's close enough for now. And in any case it's not really very interesting at the moment: we only asked R to calculate one thing, so obviously there's only one answer. Later on this will change, and the `[1]` part will start to make a bit more sense. For now, I just don't want you to get confused or concerned by it.
### An important digression about formatting
Now that I've taught you these rules I'm going to change them pretty much immediately. That is because I want you to be able to copy code from the book directly into R if you want to test things or conduct your own analyses. However, if you copy this kind of code (that shows the command prompt and the results) directly into R you will get an error:
```{r}
#| error: TRUE
> 10 + 20
```
So instead, I'm going to provide code in a slightly different format so that it looks like this...
```{r}
10 + 20
```
There are two main differences.
- In your console, the ">" is the prompt and you type your code after (to the right of) this prompt.
- We'll often show the output of a bit of code, but the output will be displayed after the block of code itself.
For your purposes, this also means that you can easily copy code from any of these code blocks and paste it into your RStudio console in order to execute.
### Be very careful to avoid typos
Before we go on to talk about other types of calculations that we can do with R, there's a few other things I want to point out. The first thing is that, though R is good software, it's still software. R, like any programming language, is pretty stupid and because it's stupid it can't handle typos. It takes it on faith that you meant to type *exactly* what you actually typed. For example, suppose that you forgot to hit the shift key when trying to type `+`, and as a result your command ended up being `10 = 20` rather than `10 + 20`. Here's what happens:
```{r}
#| error: true
10 = 20
```
What's happened here is that R has attempted to interpret `10 = 20` as a command, and spits out an error message because the command doesn't make any sense to it. When a *human* looks at this, and then looks down at his or her keyboard and sees that `+` and `=` are on the same key, it's pretty obvious that the command was a typo. But R doesn't know this, so it gets upset. And, if you look at it from its perspective, this makes sense. All that R "knows" is that `10` is a legitimate number, `20` is a legitimate number, and `=` is a legitimate part of the language too. In other words, from its perspective this really does look like the user meant to type `10 = 20`, since all the individual parts of that statement are legitimate and it's too stupid to realize that this is probably a typo. Therefore, R takes it on faith that this is exactly what you meant... it only "discovers" that the command is nonsense when it tries to follow your instructions, typo and all. And then it complains by spitting out an error.
Even more subtle is the fact that some typos won't produce errors at all, because they happen to correspond to "well-formed" R commands. For instance, suppose that not only did I forget to hit the shift key when trying to type `10 + 20`, I also managed to press the key next to the one I meant to. The resulting typo would produce the command `10 - 20`. Clearly, R has no way of knowing that I meant to *add* 20 to 10, not *subtract* 20 from 10, so what happens this time is this:
```{r}
10 - 20
```
In this case, R produces the right answer, but to the wrong question.
To some extent, I'm stating the obvious here, but it's important. The people who wrote R are smart. You, the user, are smart. But R is a programming language and programming languages are a way to tell computers what to do and computers are dumb. And because they are dumb, they are mindlessly obedient. R does *exactly* what you tell it to do. R will not try and second-guess what you "actually meant"; there is no "autocorrect". This is for good reason. When doing advanced stuff -- and even the simplest of statistics is pretty advanced in a lot of ways -- it's risky to let a mindless automaton like R try to overrule the human user. So it's your responsibility to be careful. Always make sure you type *exactly what you mean*. When dealing with computers, it's not enough to type "approximately" the right thing. In general, you absolutely *must* be precise in what you tell R to do ... like all machines it is too stupid to be anything other than absurdly literal in its interpretation.
### R is (a bit) flexible with spacing
Of course, now that I've been so uptight about the importance of always being precise, I should point out that there are some exceptions. Or, more accurately, there are some situations in which R does show a bit more flexibility than my previous description suggests. The first thing R is smart enough to do is ignore redundant spacing. What I mean by this is that, when I typed `10 + 20` before, I could equally have done this
```{r}
10    +    20
```
or this
```{r}
10+20
```
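As a quick check of the spacing rule, the sketch below compares differently spaced versions of the same sum. The `identical()` function isn't introduced in the text; I'm using it here only as a convenient way to confirm that two results match.

```{r}
# Three spellings of the same command; R ignores the extra spaces
10 + 20
10+20
10      +      20

# identical() returns TRUE when its two arguments are exactly the same
identical(10 + 20, 10+20)
```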
## Simple calculations
Okay, now that we've discussed some of the tedious details associated with typing R commands, let's get back to learning how to use the most powerful piece of statistical software in the world as a \$2 calculator. So far, all we know how to do is addition. Clearly, a calculator that only did addition would be a bit stupid, so we'll discuss other simple calculations you can perform using R. But first, some more terminology. Addition is an example of an "operation" that you can perform (specifically, an arithmetic operation), and the ***operator*** that performs it is `+`. To people with a programming or mathematics background, this terminology probably feels pretty natural, but to other people it might feel like I'm trying to make something very simple (addition) sound more complicated than it is (by calling it an arithmetic operation). To some extent, that's true: if addition was the only operation that we were interested in, it'd be a bit silly to introduce all this extra terminology. However, as we go along, we'll start using more and more different kinds of operations, so it's probably a good idea to get the language straight now, while we're still talking about very familiar concepts like addition!
### Adding, subtracting, multiplying and dividing
So, now that we have the terminology, let's learn how to perform some arithmetic operations in R. To that end, @fig-arithmetic1 lists the operators that correspond to the basic arithmetic we learned in primary school: addition, subtraction, multiplication and division. It also lists the power operator, which we'll get to shortly.
```{r}
#| label: fig-arithmetic1
#| echo: FALSE
#| fig-cap: "Basic arithmetic operations in R. These five operators are used very frequently throughout the text, so it's important to be familiar with them at the outset."
knitr::kable(
  rbind(
    c("addition", "`+`", "10 + 2", 12),
    c("subtraction", "`-`", "9 - 3", 6),
    c("multiplication", "`*`", "5 * 5", 25),
    c("division", "`/`", "10 / 3", 3.333333),
    c("power", "`^`", "5 ^ 2", 25)
  ),
  col.names = c("operation", "operator", "example input", "example output"),
  align = "lccc",
  booktabs = TRUE
)
```
As you can see, R uses fairly standard symbols to denote each of the different operations you might want to perform: addition is done using the `+` operator, subtraction is performed by the `-` operator, and so on. So if I wanted to find out what 57 times 61 is (and who wouldn't?), I can use R instead of a calculator, like so:
```{r}
57 * 61
```
So that's handy.
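To see each of the four basic operators in action, here is a small sketch built around a made-up cafe bill. The numbers are invented purely for illustration.

```{r}
12 + 9     # addition: a 12 dollar meal plus a 9 dollar dessert
50 - 21    # subtraction: the change left over from a 50 dollar note
3 * 21     # multiplication: three people ordering the same thing
63 / 4     # division: splitting that total four ways
```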
### Taking powers
The first four operations listed in @fig-arithmetic1 are things we all learned at a young age, but they aren't the only arithmetic operations built into R. There are three other arithmetic operations that I should probably mention: taking powers, doing integer division, and calculating a modulus. Of the three, the most important is probably taking powers.
For those of you who can still remember your high school math, this should be familiar. And if not, it's not complicated. As I'm sure everyone will probably remember the moment they read this, the act of multiplying a number $x$ by itself $n$ times is called "raising $x$ to the $n$-th power". Mathematically, this is written as $x^n$. Some values of $n$ have special names: in particular $x^2$ is called $x$-squared, and $x^3$ is called $x$-cubed. So, the 4th power of 5 is calculated like this:
$$
5^4 = 5 \times 5 \times 5 \times 5
$$
One way that we could calculate $5^4$ in R would be to type in the complete multiplication as it is shown in the equation above. That is, we could do this
```{r}
5 * 5 * 5 * 5
```
but it does seem a bit tedious. It would be very annoying indeed if you wanted to calculate $5^{15}$, since the command would end up being quite long. Therefore, to make our lives easier, we use the power operator instead. When we do that, our command to calculate $5^4$ goes like this:
```{r}
5 ^ 4
```
Much easier.
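The other two operations mentioned above, integer division and modulus, aren't demonstrated in this section. As a brief sketch, R writes them as `%/%` and `%%`. The numbers below are just illustrative.

```{r}
# Long powers stay short to type
5 ^ 15

# Integer division: how many whole times does 4 go into 42?
42 %/% 4

# Modulus: what is left over after that division?
42 %% 4
```

Here `42 %/% 4` gives `10` and `42 %% 4` gives `2`, since 4 goes into 42 ten whole times with 2 left over.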
### Doing calculations in the right order {#sec-bedmas}
Okay. At this point, you know how to take one of the most powerful pieces of statistical software in the world, and use it as a \$2 calculator. And as a bonus, you've learned a few very basic programming concepts. That's not nothing (you could argue that you've just saved yourself \$2) but on the other hand, it's not very much either. In order to use R more effectively, we need to introduce more programming concepts.
In most situations where you would want to use a calculator, you might want to do multiple calculations. R lets you do this, just by typing in longer commands. In fact, we've already seen an example of this earlier, when I typed in `5 * 5 * 5 * 5`. However, let's try a slightly different example:
```{r}
1 + 2 * 4
```
Clearly, this isn't a problem for R either. However, it's worth stopping for a second, and thinking about what R just did. Clearly, since it gave us an answer of `9` it must have multiplied `2 * 4` (to get an interim answer of 8) and then added 1 to that. But, suppose it had decided to just go from left to right: if R had decided instead to add `1+2` (to get an interim answer of 3) and then multiplied by 4, it would have come up with an answer of `12`.
So why did R do the multiplication first? To answer this, you need to know the **_order of operations_** that R uses. If you remember back to your high school maths classes, it's actually the same order that you were taught there: the "**_BEDMAS_**" order. That is, first calculate things inside **B**rackets `()`, then calculate **E**xponents `^`, then **D**ivision `/` and **M**ultiplication `*`, then **A**ddition `+` and **S**ubtraction `-`. So, to continue the example above, if we want to force R to calculate the `1+2` part before the multiplication, all we would have to do is enclose it in brackets:
```{r}
(1 + 2) * 4
```
This is a fairly useful thing to be able to do. The only other thing I should point out about order of operations is what to expect when you have two operations that have the same priority: that is, how does R resolve ties? For instance, multiplication and division are actually the same priority, but what should we expect when we give R a problem like `4 / 2 * 3` to solve? If it evaluates the multiplication first and then the division, it would calculate a value of two-thirds. But if it evaluates the division first it calculates a value of 6. The answer, in this case, is that R goes from *left to right*, so in this case the division step would come first:
```{r}
4 / 2 * 3
```
All of the above being said, it's helpful to remember that *parentheses always come first*. So, if you're ever unsure about what order R will do things in, an easy solution is to enclose the thing *you* want it to do first in parentheses. In addition, making the order of operations explicit makes your code more readable. By enclosing the division in parentheses (e.g., `(4 / 2) * 3`) we make it clear which thing happens first.
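A few more commands along the same lines may help the rules sink in. These particular examples are my own additions, chosen so that each pair differs only in where the brackets go:

```{r}
2 * 3 ^ 2      # exponent first: 2 * 9, which is 18
(2 * 3) ^ 2    # brackets first: 6 ^ 2, which is 36
10 - 4 + 3     # same priority, so left to right: (10 - 4) + 3, which is 9
10 - (4 + 3)   # brackets force the addition to happen first, giving 3
4 / (2 * 3)    # the "two-thirds" alternative discussed above
```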
## Storing a number as a variable {#sec-assign}
One of the most important things to be able to do in R (or any programming language, for that matter) is to store information in **_variables_**. At a conceptual level you can think of a variable as a *label* for a certain piece of information, or even several different pieces of information. For example, when using R as a calculator, there may be times when you want to store an intermediate result along the way. When calculating an average (the sum divided by the count), say, you might wish to save the sum before dividing it by the count. Let's look at the very basics of how we create variables and work with them.
### Variable assignment using `<-`
Since we've been working with numbers so far, let's start by creating variables to store our numbers. And since most people like concrete examples, let's invent one. Suppose I'm trying to calculate how much money I'm going to make selling this book. There's several different numbers I might want to store. Firstly, I need to figure out how many copies I'll sell. This isn't exactly *Harry Potter*, so let's assume I'm only going to sell one copy per student in my class. Let's assume there are 30 students, so that's 30 sales. Let's create a variable called `sales`. What I want to do is assign a **_value_** to my variable `sales`, and that value should be `30`. We do this by using the **_assignment operator_**, which is `<-`. Here's how we do it:
```{r}
sales <- 30
```
When you hit enter, R doesn't print out any output. It just gives you another command prompt. However, behind the scenes R has created a variable called `sales` and assigned the value `30` to it. You can check that this has happened by asking R to print the variable on screen. And the simplest way to do *that* is to type the name of the variable and hit enter.
```{r}
sales
```
So that's nice to know. Anytime you can't remember what R has got stored in a particular variable, you can just type the name of the variable and hit enter.
Okay, so now we know how to assign variables. Actually, there's a bit more you should know. Firstly, one of the curious features of R is that there are several different ways of making assignments. In addition to the `<-` operator, we can also use `->` and `=`, and it's pretty important to understand the differences between them. Let's start by considering `->`, since that's the easy one (we'll discuss the use of `=` in @sec-functionarguments). As you might expect from just looking at the symbol, it's almost identical to `<-`. It's just that the arrow (i.e., the assignment) goes from left to right. So if I wanted to define my `sales` variable using `->`, I would write it like this:
```{r}
30 -> sales
```
This has the same effect. And, just to be confusing, this also has the same effect:
```{r}
sales = 30
```
... and so does this:
```{r}
assign("sales", 30)
```
::: {.callout-caution}
Apart from superficial differences, these various approaches to assignment are functionally identical in simple cases like this one. Despite this equivalence, you are **strongly** encouraged to use the `<-` operator. Because the use of `<-` is so conventional within the R language, those familiar with R will have a much more difficult time reading R code that uses anything else. Soon enough, you too will be familiar with R and will thus come to expect the use of `<-`.
:::
### Calculations using variables
Okay, let's get back to my original story. In my quest to become rich, I've written this textbook. To figure out how good a strategy this is, I've started creating some variables in R. In addition to defining a `sales` variable that counts the number of copies I'm going to sell, I can also create a variable called `royalty`, indicating how much money I get per copy. Let's say that my royalties are about \$7 per book:
```{r}
sales <- 30
royalty <- 7
```
The nice thing about variables (in fact, the whole point of having variables) is that we can do anything with a variable that we ought to be able to do with the information that it stores. That is, since R allows me to multiply `30` by `7`
```{r}
30 * 7
```
it also allows me to multiply `sales` by `royalty`
```{r}
sales * royalty
```
As far as R is concerned, the `sales * royalty` command is the same as the `30 * 7` command. Not surprisingly, I can assign the output of this calculation to a new variable, which I'll call `revenue`. And when we do this, the new variable `revenue` gets the value `210`. So let's do that, and then get R to print out the value of `revenue` so that we can verify that it's done what we asked:
```{r}
revenue <- sales * royalty
revenue
```
That's fairly straightforward. A slightly more subtle thing we can do is reassign the value of my variable, based on its current value. For instance, suppose that one of my students (no doubt under the influence of psychotropic drugs) loves the book so much that he or she donates an extra \$550 to me. The simplest way to capture this is by a command like this:
```{r}
revenue <- revenue + 550
revenue
```
In this calculation, R has taken the old value of `revenue` (i.e., 210) and added 550 to that value, producing a value of 760. This new value is assigned to the `revenue` variable, overwriting its previous value. In any case, we now know that I'm expecting to make \$760 off this. Pretty sweet, I thinks to myself. Or at least, that's what I thinks until I do a few more calculations and work out what the implied hourly wage I'm making off this looks like.
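The averaging example mentioned at the start of this section follows the same pattern: store an intermediate result, then reuse it. Here is a minimal sketch, using three made-up test scores and the hypothetical variable names `scoreSum`, `scoreCount` and `scoreAverage`:

```{r}
# Store the intermediate sum first ...
scoreSum <- 70 + 85 + 91
scoreCount <- 3

# ... then reuse it to get the average (the sum divided by the count)
scoreAverage <- scoreSum / scoreCount
scoreAverage
```

Because the sum lives in its own variable, I can print `scoreSum` to double-check it before trusting the average.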
### Rules and conventions for naming variables
In the examples that we've seen so far, my variable names (`sales` and `revenue`) have just been English-language words written using lowercase letters. However, R allows a lot more flexibility when it comes to naming your variables, as the following list of rules illustrates:
- Variable names can only use the upper case alphabetic characters `A`-`Z` as well as the lower case characters `a`-`z`. You can also include numeric characters `0`-`9` in the variable name, as well as the period `.` or underscore `_` character. In other words, you can use `SaL.e_s` as a variable name (though I can't think why you would want to), but you can't use `Sales?`.
- Variable names cannot include spaces: therefore `my sales` is not a valid name, but `my.sales` is.
- Variable names are case sensitive: that is, `Sales` and `sales` are *different* variable names.
- Variable names must start with a letter or a period. You can't use something like `_sales` or `1sales` as a variable name. You can use `.sales` as a variable name if you want, but it's not usually a good idea. By convention, variables starting with a `.` are used for special purposes, so you should avoid doing so.
- Variable names cannot be one of the reserved keywords. These are special names that R needs to keep "safe" from us mere users, so you can't use them as the names of variables. The keywords are: `if`, `else`, `repeat`, `while`, `function`, `for`, `in`, `next`, `break`, `TRUE`, `FALSE`, `NULL`, `Inf`, `NaN`, `NA`, `NA_integer_`, `NA_real_`, `NA_complex_`, and finally, `NA_character_`. Don't feel especially obliged to memorize these: if you make a mistake and try to use one of the keywords as a variable name, R will complain about it like the whiny little automaton it is.
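
If you're ever unsure whether a name follows these rules, one way to check is base R's `make.names()` function, which converts any piece of text into a syntactically valid name. This is an aside rather than something the rest of the text relies on. A name that comes back unchanged was already valid:

```{r}
make.names("SaL.e_s")    # comes back unchanged, so this is a valid name
make.names("Sales?")     # the ? is not allowed, so it gets replaced
make.names("my sales")   # spaces are not allowed either
make.names("1sales")     # names can't start with a digit
make.names("if")         # reserved keywords get altered too
```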
In addition to those rules that R enforces, there are some informal conventions that people tend to follow when naming variables. One of them you've already seen: i.e., don't use variables that start with a period. But there are several others. You aren't obliged to follow these conventions, and there are many situations in which it's advisable to ignore them, but it's generally a good idea to follow them when you can:
- Use informative variable names. As a general rule, using meaningful names like `sales` and `revenue` is preferred over arbitrary ones like `variable1` and `variable2`. Otherwise it's very hard to remember what the contents of different variables are, and it becomes hard to understand what your commands actually do.
- Use short variable names. Typing is a pain and no-one likes doing it. So we much prefer to use a name like `sales` over a name like `sales.for.this.book.that.you.are.reading`. Obviously there's a bit of a tension between using informative names (which tend to be long) and using short names (which tend to be meaningless), so use a bit of common sense when trading off these two conventions.
- Use one of the conventional naming styles for multi-word variable names. Suppose I want to name a variable that stores "my new salary". Obviously I can't include spaces in the variable name, so how should I do this? There are two main conventions that you sometimes see R users employing. First, there is "camel case" in which you use capital letters at the beginning of each constituent word (except the first one), which gives you `myNewSalary` as the variable name. Second, there is "snake case" in which you separate words using underscores, as in `my_new_salary`. Finally, you may also see some R users separating words using periods, which would give you `my.new.salary`.
::: {.callout-caution}
Though you may sometimes see R users separating words within a variable name using periods (e.g., `my.new.salary`), it is syntactically ambiguous for those who know other programming languages. Many languages use periods to indicate hierarchical relationships (e.g., `obj.func` will refer to the `func` that belongs to `obj`). I thus **strongly discourage** such practices because it makes your code difficult to read/understand. Camel case is recommended.
:::
## Functions {#sec-usingfunctions}
The symbols `+`, `-`, `*`, etc. are examples of operators. As we've seen, you can do quite a lot of calculations just by using these operators. However, in order to do more advanced calculations (and later on, to do actual statistics), you're going to need to start using **_functions_**. We'll see more detail about functions and how they work later, but for now let's just dive in and use a few. To get started, suppose I wanted to take the square root of 225. The square root, in case your high school math is a bit rusty, is just the opposite of squaring a number. So, for instance, since "5 squared is 25" I can say that "5 is the square root of 25". This is the usual notation:
$$
\sqrt{25} = 5
$$
Sometimes you'll also see it written like this:
$$
25^{0.5} = 5
$$
This second way of writing it is kind of useful to "remind" you of the mathematical fact that "square root of $x$" is actually the same as "raising $x$ to the power of 0.5". Personally, I've never found this to be terribly meaningful psychologically, though I have to admit it's quite convenient mathematically. Anyway, it's not important. What is important is that you remember what a square root is, since we're going to need it later on.
You may be able to calculate the square root of 25 in your head. But it gets more difficult when the numbers get bigger, and pretty much impossible if they're not whole numbers. This is where something like R comes in very handy. Let's say I wanted to calculate $\sqrt{225}$, the square root of 225. There's two ways I could do this using R. First, since the square root of 225 is the same thing as raising 225 to the power of 0.5, we could use the power operator `^`, just like we did earlier:
```{r}
225^0.5
```
However, there's a second way that we can do this, since R also provides a ***square root function***, `sqrt()`. To calculate the square root of 225 using this function, what I do is insert the number `225` in the parentheses. That is, the command I type is this:
```{r}
sqrt(225)
```
As you might expect from our previous discussion, the spaces in between the parentheses are purely cosmetic. We could have typed `sqrt(225)` or `sqrt( 225 )` and gotten the same result. When we use a function to do something, we generally refer to this as **_calling_** the function, and the values that we type into the function (there can be more than one) are referred to as the **_arguments_** of that function.
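As a quick check that the two approaches agree, and to illustrate the earlier point about numbers that aren't whole, here is a small sketch. The `all.equal()` function isn't introduced in the text; it compares two numbers while allowing for tiny rounding differences.

```{r}
# Both routes to the square root give the same answer
all.equal(sqrt(225), 225^0.5)

# Not something most of us can do in our heads
sqrt(10)
10^0.5
```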
Obviously, the `sqrt()` function doesn't really give us any new functionality, since we already knew how to do square root calculations by using the power operator `^`, though it may be more explicit, clearer, and thus easier to read if you use `sqrt()`. However, there are lots of other functions in R: in fact, almost everything of interest that I'll talk about in this book is an R function of some kind. For example, one function that we will need to use in this book is the ***absolute value function***. Compared to the square root function, it's extremely simple: it just converts negative numbers to positive numbers, and leaves positive numbers alone. Mathematically, the absolute value of $x$ is written $|x|$ or sometimes $\mbox{abs}(x)$. Calculating absolute values in R is pretty easy, since R provides the `abs()` function that you can use for this purpose. When you feed it a positive number...
```{r}
abs(21)
```
the absolute value function does nothing to it at all. But when you feed it a negative number, it spits out the positive version of the same number, like this:
```{r}
abs(-13)
```
In all honesty, there's nothing that the absolute value function does that you couldn't do just by looking at the number and erasing the minus sign if there is one. However, there's a few places later in the book where we have to use absolute values, so I thought it might be a good idea to explain the meaning of the term early on.
Before moving on, it's worth noting that -- in the same way that R allows us to put multiple operations together into a longer command, like `1 + (2*4)` for instance -- it also lets us put functions together and even combine functions with operators if we so desire. For example, the following is a perfectly legitimate command:
```{r}
sqrt( 1 + abs(-8) )
```
When R executes this command, it starts out by calculating the value of `abs(-8)`, which produces an intermediate value of `8`. Having done so, the command simplifies to `sqrt( 1 + 8 )`. To solve the square root it first needs to add `1 + 8` to get `9`, at which point it evaluates `sqrt(9)`, and so it finally outputs a value of `3`.
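The same evaluation order can be spelled out by hand, storing each intermediate result in a variable. The names `innerValue` and `total` are made up for this illustration:

```{r}
innerValue <- abs(-8)      # step 1: the innermost function call gives 8
total <- 1 + innerValue    # step 2: the addition inside sqrt() gives 9
sqrt(total)                # step 3: the outer function call gives 3
```

Writing it out this way takes more typing than the one-line version, but it can make a complicated command easier to check.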
### Function arguments, their names and their defaults {#sec-functionarguments}
There's two more fairly important things that you need to understand about how functions work in R, and that's the use of "named" arguments, and "default values" for arguments. That's not to say that this is the last we'll hear about how functions work, but these are the last things we desperately need to discuss in order to get you started. To understand what these two concepts are all about, I'll introduce another function. The `round()` function can be used to round some value to the nearest whole number. For example, I could type this:
```{r}
round(3.1415)
```
Pretty straightforward, really. However, suppose I only wanted to round it to two decimal places: that is, I want to get `3.14` as the output. The `round()` function supports this, by allowing you to input a second argument to the function that specifies the number of decimal places that you want to round the number to. In other words, I could do this:
```{r}
round(3.1415, 2)
```
What's happening here is that I've specified *two* arguments: the first argument is the number that needs to be rounded (i.e., `3.1415`), the second argument is the number of decimal places that it should be rounded to (i.e., `2`), and the two arguments are separated by a comma. In this simple example, it's quite easy to remember which argument comes first and which one comes second, but for more complicated functions this is not easy. Fortunately, most R functions make use of ***argument names***. For the `round()` function, for example, the number that needs to be rounded is specified using the `x` argument, and the number of decimal places that you want it rounded to is specified using the `digits` argument. Because we have these names available to us, we can specify the arguments to the function by name. We do so like this:
```{r}
round(x=3.1415, digits=2)
```
Notice that this is kind of similar in spirit to variable assignment, except that `=` is used here, rather than `<-`. In both cases we're associating a specific value with a label. However, there are some differences between what we were doing earlier on when creating variables, and what we're doing here when specifying arguments, and so as a consequence it's important that you use `=` in this context.
As you can see, specifying the arguments by name involves a lot more typing, but it's also explicit and thus a lot easier to read. Because of this, the commands in this book will usually specify arguments by name, since that makes it clearer to you what I'm doing. However, one important thing to note is that when specifying the arguments using their names, it doesn't matter what order you type them in. But if you don't use the argument names, then you have to input the arguments in the correct order. In other words, these three commands all produce the same output...
```{r}
round(x=3.1415, 2)
round(x=3.1415, digits=2)
round(digits=2, x=3.1415)
```
but this one does not...
```{r}
round(2, 3.1415)
```
How do you find out what the correct order is? There's a few different ways, but the easiest one is to look at the help documentation for the function (e.g., `? round`). However, if you're ever unsure, it's probably best to actually type in the argument name.
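If you'd like a quick reminder of the argument order without opening the full help page, one option is the `args()` function, which prints a function's argument list. The sketch below also shows that an unnamed call is matched by position; `by.position` and `by.name` are just illustrative variable names:

```{r}
args(round)                                # lists the arguments in order
by.position <- round(2, 3.1415)            # 2 is taken as x, 3.1415 as digits
by.name <- round(x = 2, digits = 3.1415)   # the same call, spelled out
by.position
by.name
```

Both variables hold the same value, because R fills in unnamed arguments in the order the function lists them.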
Okay, so that's the first thing I said you'd need to know: argument names. The second thing you need to know about is default values. Notice that the first time I called the `round()` function I didn't actually specify the `digits` argument at all, and yet R somehow knew that this meant it should round to the nearest whole number. How did that happen? The answer is that the `digits` argument has a ***default value*** of `0`, meaning that if you decide not to specify a value for `digits` then R will act as if you had typed `digits = 0`. This is quite handy: the vast majority of the time when you want to round a number you want to round it to the nearest whole number, and it would be pretty annoying to have to specify the `digits` argument every single time. On the other hand, sometimes you actually do want to round to something other than the nearest whole number, and it would be even more annoying if R didn't allow this! Thus, by having `digits = 0` as the default value, we get the best of both worlds.
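As a small check on how the default behaves, the following sketch calls `round()` three ways: leaving `digits` out, writing the default explicitly, and overriding it. The value `1` in the last line is an arbitrary choice for illustration:

```{r}
round(3.1415)               # digits is left at its default
round(3.1415, digits = 0)   # the default written out explicitly
round(3.1415, digits = 1)   # overriding the default
```

The first two commands give the same answer, which is exactly what having a default value means.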
## Storing many numbers as a vector {#sec-vectors}
At this point we've covered functions in enough detail to get us safely through most of the rest of the book, so let's return to our discussion of variables. When variables were introduced in @sec-assign we saw how we can use variables to store a single number. In this section, we'll extend this idea and look at how to store multiple numbers within the one variable. In R, a variable that stores multiple values is called a **_vector_**. So let's create one.
### Creating a vector
Let's stick to my silly "get rich quick by textbook writing" example. Suppose the textbook company (if there actually was one, that is) sends sales data on a monthly basis. Since my class starts in late February, we might expect most of the sales to occur towards the start of the year. Let's suppose that I have 100 sales in February, 200 sales in March and 50 sales in April, and no other sales for the rest of the year. What I would like to do is have a variable -- let's call it `sales.by.month` -- that stores all this sales data. The first number stored should be `0` since I had no sales in January, the second should be `100`, and so on. The simplest way to do this in R is to use the **_combine_** function, `c()`. To do so, all we have to do is type all the numbers we want to store in a comma separated list, like this:
```{r}
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
sales.by.month
```
To use the correct terminology, we have a single variable here called `sales.by.month`: this variable is a vector that consists of 12 **_elements_**.
### A handy digression
Now that we've learned how to put information into a vector, the next thing to understand is how to pull that information back out again. However, before I do so it's worth taking a slight detour. If you've been following along, typing all the commands into R yourself, it's possible that the output that you saw when we printed out the `sales.by.month` vector was slightly different to what I showed above. This would have happened if the window (or the RStudio panel) that contains the R console is really, really narrow. If that were the case, you might have seen output that looks something like this:
```{r}
#| echo: false
#| output: false
orig_width = getOption("width")
options(width = 20)
```
```{r}
sales.by.month
```
```{r}
#| echo: false
#| output: false
options(width = orig_width)
```
Because there wasn't much room on the screen, R has printed out the results over three lines. But that's not the important thing to notice. The important point is that the first line has a `[1]` in front of it, whereas the second line starts with `[5]` and the third with `[9]`. It's pretty clear what's happening here. For the first row, R has printed out the 1st element through to the 4th element, so it starts that row with a `[1]`. For the second row, R has printed out the 5th element of the vector through to the 8th one, and so it begins that row with a `[5]` so that you can tell where it's up to at a glance. It might seem a bit odd to you that R does this, but in some ways it's a kindness, especially when dealing with larger data sets!
### Getting information out of vectors {#vectorsubset}
To get back to the main story, let's consider the problem of how to get information out of a vector. At this point, you might have a sneaking suspicion that the answer has something to do with the `[1]` and `[9]` things that R has been printing out. And of course you are correct. Suppose I want to pull out the February sales data only. February is the second month of the year, so let's try this:
```{r}
sales.by.month[2]
```
Yep, that's the February sales all right. But there's a subtle detail to be aware of here: notice that R outputs `[1] 100`, *not* `[2] 100`. This is because R is being extremely literal. When we typed in `sales.by.month[2]`, we asked R to find exactly *one* thing, and that one thing happens to be the second element of our `sales.by.month` vector. So, when it outputs `[1] 100` what R is saying is that the first number *that we just asked for* is `100`. This behavior makes more sense when you realize that we can use this trick to create new variables. For example, I could create a `february.sales` variable like this:
```{r}
february.sales <- sales.by.month[2]
february.sales
```
Obviously, the new variable `february.sales` should only have one element and so when I print out this new variable, the R output begins with a `[1]` because `100` is the value of the first (and only) element of `february.sales`. The fact that this also happens to be the value of the second element of `sales.by.month` is irrelevant. We'll pick this topic up again shortly (@sec-indexing).
### Altering the elements of a vector
Sometimes you'll want to change the values stored in a vector. Imagine my surprise when the publisher rings me up to tell me that the sales data for May are wrong. There were actually an additional 25 books sold in May, but there was an error or something so they hadn't told me about it. How can I fix my `sales.by.month` variable? One possibility would be to assign the whole vector again from the beginning, using `c()`. But that's a lot of typing. Also, it's a little wasteful: why should R have to redefine the sales figures for all 12 months, when only the 5th one is wrong? Fortunately, we can tell R to change only the 5th element, using this trick:
```{r}
sales.by.month[5] <- 25
sales.by.month
```
Another way to edit variables is to use the `edit()` or `fix()` functions. I won't discuss them in detail right now, but you can check them out on your own.
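Returning to the publisher's correction for a moment: because May started at zero, assigning `25` and adding `25` come to the same thing. If the month had already contained some sales, you'd want to add to the existing value instead. Here's a small sketch of that, using a made-up copy called `sales.copy` so that the real data are left alone:

```{r}
# Hypothetical copy of the monthly figures, so the original stays untouched
sales.copy <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
sales.copy[5] <- sales.copy[5] + 25   # add 25 to whatever May already holds
sales.copy[5]
```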
### Useful things to know about vectors {#veclength}
Before moving on, I want to mention a couple of other things about vectors. Firstly, you often find yourself wanting to know how many elements there are in a vector (usually because you've forgotten). You can use the `length()` function to do this. It's quite straightforward:
```{r}
length(x = sales.by.month)
```
Secondly, you often want to alter all of the elements of a vector at once. For instance, suppose I wanted to figure out how much money I made in each month. Since I'm earning an exciting \$7 per book (no seriously, that's actually pretty close to what authors get on the very expensive textbooks that you're expected to purchase), what I want to do is multiply each element in the `sales.by.month` vector by `7`. R makes this pretty easy, as the following example shows:
```{r}
sales.by.month * 7
```
In other words, when you multiply a vector by a single number, all elements in the vector get multiplied. The same is true for addition, subtraction, division and taking powers. So that's neat. On the other hand, suppose I wanted to know how much money I was making per day, rather than per month. Since not every month has the same number of days, I need to do something slightly different. Firstly, I'll create two new vectors:
```{r}
days.per.month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
profit <- sales.by.month * 7
```
Obviously, the `profit` variable holds the same values we calculated a moment ago, and the `days.per.month` variable is pretty straightforward. What I want to do is divide every element of `profit` by the *corresponding* element of `days.per.month`. Again, R makes this pretty easy:
```{r}
profit / days.per.month
```
I still don't like all those zeros, but that's not what matters here. Notice that the second element of the output is 25, because R has divided the second element of `profit` (i.e. 700) by the second element of `days.per.month` (i.e. 28). Similarly, the third element of the output is equal to 1400 divided by 31, and so on. We'll talk more about calculations involving vectors later on, but that's enough detail for now.
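As a quick check on both kinds of arithmetic, here's a small sketch with two made-up vectors (`books` and `days` are hypothetical names, and they only cover the first four months). It also uses `round()` from earlier to tidy up the result:

```{r}
books <- c(0, 100, 200, 50)           # hypothetical sales for four months
days <- c(31, 28, 31, 30)             # days in those months
books + 10                            # a single number is applied to every element
books * 7 / days                      # vector by vector: element-by-element division
round(books * 7 / days, digits = 2)   # the same figures, rounded to two decimal places
```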
## Storing text data {#sec-text}
A lot of the time your data will be numeric in nature, but not always. Sometimes your data really needs to be described using text, not using numbers. To address this, we need to consider the situation where our variables store text. To create a variable that stores the word "hello", we can type this:
```{r}
greeting <- "hello"
greeting
```
When interpreting this, it's important to recognise that the quote marks here *aren't* part of the string itself. They're just something that we use to make sure that R knows to treat the characters that they enclose as a piece of text data, known as a **_character string_**. In other words, R treats `"hello"` as a string containing the word "hello"; but if I had typed `hello` instead, R would go looking for a variable by that name! You can also use `'hello'` to specify a character string.
Okay, so that's how we store the text. Next, it's important to recognise that when we do this, R stores the entire word `"hello"` as a *single* element: our `greeting` variable is *not* a vector of five different letters. Rather, it has only the one element, and that element corresponds to the entire character string `"hello"`. To illustrate this, if I actually ask R to find the first element of `greeting`, it prints the whole string:
```{r}
greeting[1]
```
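One way to double check this is to ask for the number of elements using `length()`. The sketch below uses a separate variable, `word`, which is just an illustrative name:

```{r}
word <- "hello"
length(word)   # counts elements, not letters
```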
Of course, there's no reason why I can't create a vector of character strings. For instance, if we were to continue with the example of my attempts to look at the monthly sales data for my book, one variable I might want would include the names of all 12 `months`. To do so, I could type in a command like this
```{r}
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November",
            "December")
```
This is a **_character vector_** containing 12 elements, each of which is the name of a month. So if I wanted R to tell me the name of the fourth month, all I would do is this:
```{r}
months[4]
```
### Working with text {#simpletext}
Working with text data is somewhat more complicated than working with numeric data. There is much to discuss here, but for purposes of the current chapter we only need this bare bones sketch. The only other thing I want to do before moving on is show you an example of a function that can be applied to text data. So far, most of the functions that we have seen (i.e., `sqrt()`, `abs()` and `round()`) only make sense when applied to numeric data (e.g., you can't calculate the square root of "hello"), and we've seen one function that can be applied to pretty much any variable or vector (i.e., `length()`). So it might be nice to see an example of a function that can be applied to text.
The function I'm going to introduce you to is called `nchar()`, and what it does is count the number of individual characters that make up a string. Recall that our `greeting` variable contains only the one string, which happens to be `"hello"`, so if we asked R for its `length()` the answer would be `1`. But what if I want to know how many letters there are in the word? Sure, I could *count* them, but that's boring, and more to the point it's a terrible strategy if what I wanted to know was the number of letters in *War and Peace*. That's where the `nchar()` function is helpful:
```{r}
nchar( x = greeting )
```
That makes sense, since there are in fact 5 letters in the string `"hello"`. Better yet, you can apply `nchar()` to whole vectors. So, for instance, if I want R to tell me how many letters there are in the names of each of the 12 months, I can do this:
```{r}
nchar( x = months )
```
So that's nice to know. The `nchar()` function can do a bit more than this, and there's a lot of other functions that you can use to extract more information from text or do all sorts of fancy things. However, the goal here is not to teach any of that! The goal right now is just to see an example of a function that actually does work when applied to text.
## Storing "true or false" data {#logicals}
Time to move onto a third kind of data. A key concept that a lot of R relies on is the idea of a **_logical value_** (or Boolean value). A logical value is an assertion about whether something is true or false. This is implemented in R in a pretty straightforward way. There are two logical values, namely `TRUE` and `FALSE`. Despite the simplicity, logical values are very useful things. Let's see how they work.
### Assessing mathematical truths
In George Orwell's classic book *1984*, one of the slogans used by the totalitarian Party was "two plus two equals five", the idea being that the political domination of human freedom becomes complete when it is possible to subvert even the most basic of truths. It's a terrifying thought, especially when the protagonist Winston Smith finally breaks down under torture and agrees to the proposition. "Man is infinitely malleable", the book says. I'm pretty sure that this isn't true of humans but it's definitely not true of R. R is not infinitely malleable. It has rather firm opinions on the topic of what is and isn't true, at least as regards basic mathematics. If I ask it to calculate `2 + 2`, it always gives the same answer, and it's not bloody 5:
```{r}
2 + 2
```
Of course, so far R is just doing the calculations. I haven't asked it to explicitly assert that $2+2 = 4$ is a true statement. If I want R to make an explicit judgement, I can use a command like this:
```{r}
2 + 2 == 4
```
What I've done here is use the **_equality operator_**, `==`, to force R to make a "true or false" judgement. Okay, let's see what R thinks of the Party slogan:
```{r}
2+2 == 5
```
Booyah! Freedom and ponies for all! Or something like that. Anyway, it's worth having a look at what happens if I try to *force* R to believe that two plus two is five by making an assignment statement like `2 + 2 = 5` or `2 + 2 <- 5`. When I do this, here's what happens:
```{r}
#| error: TRUE
2 + 2 = 5
```
R doesn't like this very much. It recognizes that `2 + 2` is *not* a variable (that's what the "non-language object" part is saying), and it won't let you try to "reassign" it. While R is pretty flexible, and actually does let you do some quite remarkable things to redefine parts of R itself, there are just some basic, primitive truths that it refuses to give up. It won't change the laws of addition, and it won't change the definition of the number `2`.
That's probably for the best.
### Logical operations
So now we've seen logical operations at work, but so far we've only seen the simplest possible example. You probably won't be surprised to discover that we can combine logical operations with other operations and functions in a more complicated way, like this:
```{r}
3*3 + 4*4 == 5*5
```
or this
```{r}
sqrt( 25 ) == 5
```
Not only that, but as @tbl-logicals illustrates, there are several other logical operators that you can use, corresponding to some basic mathematical concepts.
```{r}
#| label: tbl-logicals
#| echo: FALSE
knitr::kable(rbind(
c("less than ", "<", "2 < 3", "`TRUE`"),
c("less than or equal to", "<=", "2 <= 2", "`TRUE`"),
c("greater than", ">", "2 > 3", "`FALSE`"),
c("greater than or equal to", ">=", "2 >= 2" , "`TRUE`"),
c("equal to", "==", "2 == 3" , "`FALSE`"),
c("not equal to", "!=", "2 != 3" , "`TRUE`")),
caption = 'Some logical operators. Technically I should be calling these "binary relational operators", but quite frankly I don\'t want to. It\'s my book so no-one can make me.',
col.names = c("operation", "operator", "example input", "answer"),
booktabs = TRUE
)
```
Hopefully these are all pretty self-explanatory: for example, the **_less than_** operator `<` checks to see if the number on the left is less than the number on the right. If it's less, then R returns an answer of `TRUE`:
```{r}
99 < 100
```
but if the two numbers are equal, or if the one on the right is larger, then R returns an answer of `FALSE`, as the following two examples illustrate:
```{r}
100 < 100
100 < 99
```
In contrast, the **_less than or equal to_** operator `<=` will do exactly what it says. It returns a value of `TRUE` if the number on the left hand side is less than or equal to the number on the right hand side. So if we repeat the previous two examples using `<=`, here's what we get:
```{r}
100 <= 100
100 <= 99
```
And at this point I hope it's pretty obvious what the **_greater than_** operator `>` and the **_greater than or equal to_** operator `>=` do! Next on the list of logical operators is the **_not equal to_** operator `!=` which -- as with all the others -- does what it says it does. It returns a value of `TRUE` when things on either side are not identical to each other. Therefore, since $2+2$ isn't equal to $5$, we get:
```{r}
2 + 2 != 5
```
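For completeness, here are the greater-than operators applied to the same kind of numbers used in the `<` and `<=` examples. This is only a small illustrative sketch, mirroring those earlier commands:

```{r}
100 > 99     # the left number is larger
100 > 100    # equal numbers don't count as "greater than"
100 >= 100   # but they do count as "greater than or equal to"
```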
We're not quite done yet. There are three more logical operations that are worth knowing about, listed in @tbl-logicals2.
```{r}
#| label: tbl-logicals2
#| echo: FALSE
knitr::kable(rbind(
c("not", "!", "!(1==1)", "`FALSE`"),
c("or", "|", "(1==1) | (2==3)", "`TRUE`"),
c("and", "&", "(1==1) & (2==3)", "`FALSE`")),
caption = 'Some more logical operators.',
col.names = c("operation", "operator", "example input", "answer"),
booktabs = TRUE
)
```
These are the **_not_** operator `!`, the **_and_** operator `&`, and the **_or_** operator `|`. Like the other logical operators, their behavior is more or less exactly what you'd expect given their names. For instance, if I ask you to assess the claim that either $2+2 = 4$ *or* $2+2 = 5$, then you'd say that claim is true. Since it's an "either-or" statement, all we need is for one of the two parts to be true. That's what the `|` operator does:
```{r}
(2+2 == 4) | (2+2 == 5)
```
On the other hand, if I ask you to assess the claim that both $2+2 = 4$ *and* $2+2 = 5$, then you'd say that claim is false. Since this is an *and* statement we need both parts to be true. And that's what the `&` operator does:
```{r}
(2+2 == 4) & (2+2 == 5)
```
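Since the parts being combined are themselves just logical values, the same operators can be applied to `TRUE` and `FALSE` directly. The following sketch simply runs through a few combinations:

```{r}
TRUE | FALSE    # one true part is enough for "or"
FALSE | FALSE   # neither part is true
TRUE & TRUE     # both parts are true, as "and" requires
TRUE & FALSE    # one false part is enough to sink "and"
```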
Finally, there's the *not* operator, which is simple but annoying to describe in English. If I ask you to assess my claim that "it is not true that $2+2 = 5$", then you would say that claim is true; because my claim is that "$2+2 = 5$ is false". And I'm right. If we write this as an R command we get this:
```{r}
! (2+2 == 5)
```
In other words, since `2+2 == 5` is a `FALSE` statement, it must be the case that `!(2+2 == 5)` is a `TRUE` one. Essentially, what we've really done is claim that "not false" is the same thing as "true". Obviously, this isn't really quite right in real life. But logical values encode a black and white world: any given logical statement is either true or false. No shades of gray are allowed. We can actually see this much more explicitly, like this:
```{r}
! FALSE
```
Of course, in our $2+2 = 5$ example, we didn't really need to use "not" `!` and "equal to" `==` as two separate operators. We could have just used the "not equal to" operator `!=` like this:
```{r}
2+2 != 5
```
But there are many situations where you really do need to use the `!` operator. We'll see some later on.
### Storing and using logical data
Up to this point, I've introduced *numeric data* (@sec-assign and @sec-vectors) and *character data* (@sec-text). So you might not be surprised to discover that these `TRUE` and `FALSE` values that R has been producing are actually a third kind of data, called *logical data*. That is, when I asked R if `2 + 2 == 5` and it said `[1] FALSE` in reply, it was actually producing information that we can store in variables. For instance, I could create a variable called `is.the.Party.correct`, which would store R's opinion:
```{r}
is.the.Party.correct <- 2 + 2 == 5
is.the.Party.correct
```
Alternatively, you can assign the value directly, by typing `TRUE` or `FALSE` in your command. Like this:
```{r}
is.the.Party.correct <- FALSE
is.the.Party.correct
```
Better yet, because it's kind of tedious to type `TRUE` or `FALSE` over and over again, R provides you with a shortcut: you can use `T` and `F` instead (but it's case sensitive: `t` and `f` won't work).
::: {.callout-note}
## TRUE and FALSE
`TRUE` and `FALSE` are reserved keywords in R, so you can trust that they always mean what they say they do. Unfortunately, the shortcut versions `T` and `F` do not have this property. It's even possible to create variables that set up the reverse meanings, by typing commands like `T <- FALSE` and `F <- TRUE`. This is kind of insane, and something that is generally thought to be a design flaw in R. Anyway, the long and short of it is that it's safer to use `TRUE` and `FALSE`.
:::
So this works:
```{r}
is.the.Party.correct <- F
is.the.Party.correct
```
but this doesn't:
```{r}
#| error: TRUE
is.the.Party.correct <- f
```
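To see why the shortcuts can't be trusted in quite the same way as `TRUE` and `FALSE`, here's a small demonstration. It deliberately creates a variable called `T`, and then deletes it again with the `rm()` function so that nothing later on is affected:

```{r}
T <- FALSE   # R allows this, because T is an ordinary variable name
T            # our variable now hides the built-in shortcut
rm(T)        # delete our variable; the built-in T is visible again
T
```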
### Vectors of logicals
The next thing to mention is that you can store vectors of logical values in exactly the same way that you can store vectors of numbers (@sec-vectors) and vectors of text data (@sec-text). Again, we can define them directly via the `c()` function, like this:
```{r}
x <- c(TRUE, TRUE, FALSE)
x
```
or you can produce a vector of logicals by applying a logical operator to a vector. This might not make a lot of sense to you, so let's unpack it slowly. First, let's suppose we have a vector of numbers (i.e., a "non-logical vector"). For instance, we could use the `sales.by.month` vector that we were using in @sec-vectors. Suppose I wanted R to tell me, for each month of the year, whether I actually sold a book in that month. I can do that by typing this:
```{r}
sales.by.month > 0
```
and again, I can store this in a variable if I want, as the example below illustrates:
```{r}
any.sales.this.month <- sales.by.month > 0
any.sales.this.month
```
In other words, `any.sales.this.month` is a logical vector whose elements are `TRUE` only if the corresponding element of `sales.by.month` is greater than zero. For instance, since I sold zero books in January, the first element is `FALSE`.
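The `!`, `&` and `|` operators from earlier work on logical vectors too, one element at a time. Here's a brief sketch with made-up vectors (`sold` and `in.stock` are hypothetical names covering four months):

```{r}
sold <- c(FALSE, TRUE, TRUE, FALSE)       # did I sell any books that month?
in.stock <- c(TRUE, TRUE, FALSE, FALSE)   # did the shop have copies?
!sold                                     # flips every element
sold & in.stock                           # TRUE only where both are TRUE
sold | in.stock                           # TRUE where at least one is TRUE
```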
### Applying logical operations to text {#sec-logictext}
In a moment (@sec-indexing) I'll show you why these logical operations and logical vectors are so handy, but before I do so I want to very briefly point out that you can apply them to text as well as to numbers. It's just that we need to be a bit more careful in understanding how R interprets the different operations. In this section I'll talk about how the equal to operator `==` applies to text, since this is the most important one. Obviously, the not equal to operator `!=` gives the exact opposite answers to `==` so I'm implicitly talking about that one too, but I won't give specific commands showing the use of `!=`. There are a variety of other operators, but those will do for now.
Okay, let's see how it works. In one sense, it's very simple. For instance, I can ask R if the word `"cat"` is the same as the word `"dog"`, like this:
```{r}
"cat" == "dog"
```
That's pretty obvious, and it's good to know that even R can figure that out. Similarly, R does recognize that a `"cat"` is a `"cat"`:
```{r}
"cat" == "cat"
```
Again, that's exactly what we'd expect. However, what you need to keep in mind is that R is not at all tolerant when it comes to grammar and spacing. If two strings differ in any way whatsoever, R will say that they're not equal to each other, as the following examples indicate:
```{r}
" cat" == "cat"
"cat" == "CAT"
"cat" == "c a t"
```
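If you do want a more forgiving comparison, one option is to tidy up the strings before comparing them. The sketch below uses two base R functions that haven't been covered in this chapter: `tolower()`, which converts text to lower case, and `trimws()`, which strips spaces from the start and end of a string:

```{r}
tolower("CAT") == "cat"   # ignore differences in case
trimws(" cat") == "cat"   # ignore leading and trailing spaces
```

Spaces in the middle of a string, as in `"c a t"`, are left alone by both functions.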
## Indexing vectors {#sec-indexing}
One last thing to add before finishing up this chapter. So far, whenever I've had to get information out of a vector, all I've done is type something like `months[4]`; and when I do this R prints out the fourth element of the `months` vector. In this section, I'll show you two additional tricks for getting information out of the vector.
### Extracting multiple elements
One very useful thing we can do is pull out more than one element at a time. Earlier, when we typed `sales.by.month[2]`, we only used a single number (i.e., `2`) to indicate which element we wanted. Alternatively, we can use a vector. So, suppose I wanted the data for February, March and April. What I could do is use the vector `c(2,3,4)` to indicate which elements I want R to pull out. That is, I'd type this:
```{r}
sales.by.month[ c(2,3,4) ]
```
Notice that the order matters here. If I asked for the data in the reverse order (i.e., April first, then March, then February) by using the vector `c(4,3,2)`, then R outputs the data in the reverse order:
```{r}
sales.by.month[ c(4,3,2) ]
```
A second thing to be aware of is that R provides you with handy shortcuts for very common situations. For instance, suppose that I wanted to extract everything from the 2nd month through to the 8th month. One way to do this is to do the same thing I did above, and use the vector `c(2,3,4,5,6,7,8)` to indicate the elements that I want. That works just fine
```{r}
sales.by.month[ c(2,3,4,5,6,7,8) ]
```
but it's kind of a lot of typing. To help make this easier, R lets you use `2:8` as shorthand for `c(2,3,4,5,6,7,8)`, which makes things a lot simpler. First, let's just check that this is true:
```{r}
2:8
```
Next, let's check that we can use the `2:8` shorthand as a way to pull out the 2nd through 8th elements of `sales.by.month`:
```{r}
sales.by.month[2:8]
```
So that's kind of neat.
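The `:` shorthand can also count downwards, and it can be mixed with `c()`. A short sketch, using a made-up vector called `scores`:

```{r}
scores <- c(10, 20, 30, 40, 50, 60)   # hypothetical data
4:2                                   # counts down from 4 to 2
scores[4:2]                           # the 4th, 3rd, then 2nd element
scores[c(1, 4:6)]                     # the 1st element plus the 4th to 6th
```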
### Logical indexing
At this point, I can introduce an extremely useful tool called **_logical indexing_**. In the last section, I created a logical vector `any.sales.this.month`, whose elements are `TRUE` for any month in which I sold at least one book, and `FALSE` for all the others. However, that big long list of `TRUE`s and `FALSE`s is a little bit hard to read, so what I'd like to do is to have R select the names of the `months` for which I sold any books. Earlier on, I created a vector `months` that contains the names of each of the months. This is where logical indexing is handy. What I need to do is this:
```{r}
months[ sales.by.month > 0 ]
```
To understand what's happening here, it's helpful to notice that `sales.by.month > 0` is the same logical expression that we used to create the `any.sales.this.month` vector in the last section. In fact, I could have just done this:
```{r}
months[ any.sales.this.month ]
```
and gotten exactly the same result. In order to figure out which elements of `months` to include in the output, what R does is look to see if the corresponding element in `any.sales.this.month` is `TRUE`. Thus, since element 1 of `any.sales.this.month` is `FALSE`, R does not include `"January"` as part of the output; but since element 2 of `any.sales.this.month` is `TRUE`, R does include `"February"` in the output. Note that there's no reason why I can't use the same trick to find the actual sales numbers for those months. The command to do that would just be this:
```{r}
sales.by.month [ sales.by.month > 0 ]
```
In fact, we can do the same thing with text. Here's an example. Suppose that -- to continue the saga of the textbook sales -- I later find out that the bookshop only had sufficient stocks for a few months of the year. They tell me that early in the year they had `"high"` stocks, which then dropped to `"low"` levels, and in fact for a couple of months they were `"out"` of copies of the book before they were able to replenish them. Thus I might have a variable called `stock.levels` which looks like this:
```{r}
stock.levels <- c("high", "high", "low", "out", "out", "high",
                  "high", "high", "high", "high", "high", "high")
stock.levels
```
Thus, if I want to know the months for which the bookshop was out of my book, I could apply the logical indexing trick, but with the character vector `stock.levels`, like this:
```{r}
months[stock.levels == "out"]
```
Alternatively, if I want to know when the bookshop was either low on copies or out of copies, I could do this:
```{r}
months[stock.levels == "out" | stock.levels == "low"]
```
or this
```{r}
months[stock.levels != "high" ]
```
Either way, I get the answer I want.
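The same idea extends to conditions that draw on more than one vector, provided the vectors line up element by element. As a sketch with made-up data (the names `month.names`, `units` and `stock` are hypothetical and cover only four months), this picks out the months in which books were sold even though stock wasn't high:

```{r}
month.names <- c("January", "February", "March", "April")
units <- c(0, 100, 200, 50)                # hypothetical sales
stock <- c("high", "high", "low", "out")   # hypothetical stock levels
month.names[units > 0 & stock != "high"]   # both conditions must hold
```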
At this point, I hope you can see why logical indexing is such a useful thing. It's a very basic, yet very powerful way to manipulate data. Subsequent chapters will talk a lot more about how to manipulate data, since it's a critical skill for real world research that is often overlooked in introductory statistics courses. It does take a bit of practice to become completely comfortable using logical indexing, so it's a good idea to play around with these sorts of commands. Try creating a few different variables of your own, and then ask yourself questions like "how do I get R to spit out all the elements that are [blah]". Practice makes perfect, and it's only by practicing logical indexing that you'll perfect the art of yelling frustrated insults at your computer.
## Exercises
1. Compute $42+17$
1. Compute $8-3$
1. Compute $(8-3)^2$
1. Compute $\frac{42+17}{(8-3)^2}$
1. Define a vector containing the numbers 29, 63, 7, 23, 84, 10 and 9.
1. Imagine this vector contains counts in units of months (29 months, 63 months, etc.). Compute a new vector that contains the same measurement but now in units of years. That is, divide all the entries in the previous vector by 12. Print these new measurements to the console.
1. Create two strings (character vectors). One should be `"R rules!"` and the other should be `"r rules!"`. Determine whether these two vectors are equal.
1. Create a vector that consists of 6 values: 3 even and 3 odd.
1. Modify the third value in this vector so that it is now double its original value.
1. Modify this vector so that all the odd numbers are removed.
1. Calculate the sum of the values currently in the vector.
1. Imagine a study in which participants must weigh less than 90 kg and be between 18 and 60 years of age. Define a vector of weights as `weight <- c(80, 75, 92, 105, 60)` and a vector of ages `age <- c(50, 17, 39, 27, 90)`. Now calculate a vector of logical values, each of which indicates whether the corresponding participant is eligible for the study.
1. Calculate the sum of 0.1 and 0.2.
1. Calculate 10 times the sum of 0.1 and 0.2.
1. Determine whether 10 times the sum of 0.1 and 0.2 is equal to 3.