R Tutorial

March 2013

Sebastian Campbell and José Padarian
Use the left and right arrow keys to navigate slides.

Welcome to R

Hopefully by now you've started up RStudio (or your IDE or text editor of choice) and are greeted by a command prompt:

Say Hello!

As is traditional in programming, we'll start off with a Hello world! program. Try the below:

print("Hello world!")

## [1] "Hello world!"

Congratulations! You've taken your first steps into R. You've used the function print to print the text "Hello world!" to the screen. Functions are basically like verbs in language, they tell R to do something. Notice the brackets () after the name of the function in which you can tell the function what to do.

Don't worry too much about the [1] at the beginning, we'll explain that soon. Note that you can also type "Hello world!" without the print command and R will assume you meant to print it.

Some basic maths

We can also use R as a calculator for addition and powers:

2 + 2

## [1] 4

2^3

## [1] 8

Note that there are other operators we can use, like * for multiplications, - for subtraction and / for division.

Quick quiz

We'll be having a few quizzes like this during this tutorial. No pressure, it's just to check your understanding. Click 'Submit' to check if your answer is right. If you give up 'Check answer' will give you an answer. 'Clear' will allow you to start that quiz from scratch.

What is the expected result from `6 / 2`?

It's 3, because $\frac{6}{2} = 3$.

Assigning variables

So now we can do some maths, but we don't want to have to remember the results every time. So let's store some as variable, so we can refer to them by name. To assign, we can use either the = or <- operator (though <- is recommended).

core_volume <- 100
core_mass <- 180
bulk_density <- core_mass/core_volume

bulk_density

## [1] 1.8

Note that assignment doesn't print the variable to the console.

Not just numbers

R doesn't just work with numbers, it can also work with "strings" or characters. Strings are denoted by the quotes ("") surrounding them. We've already used these when we made our "Hello world!" program. In fact, we can use the paste function to stick them together.

paste("The bulk density is:", 1.75, sep = " ")

## [1] "The bulk density is: 1.75"

The paste function sticks strings and numbers together into a single string. The sep argument says: "Put a space between all of the things I'm sticking together".

Making readable results

Let's combine all the things we know so far into something useful. Remember the bulk density we calculated earlier?

paste("The bulk density is:", bulk_density, sep = " ")

## [1] "The bulk density is: 1.8"

Here R evaluated our variable bulk_density and added it to the string we were trying to make.

Quiz 2

Which of these lines of code will produce the result: `"1:2:3:x:y:z"`

paste(1, 2, 3, "x", "y", "z", sep=":")
paste("123xyz", sep=":")
paste(1, 2, 3, x, y, z, sep=":")
paste(1, 2, 3, "x", "y", "z")

The first option. The second option won't work as there's only one string, so R can't place separators between strings. The third option is trying to call variables named x, y and z (but it's likely none exist). Finally, the fourth option is space-separated, not colon-separated.

Bigger data

So we've been dealing with individual strings and numbers, let's move onto combining those into vectors. A vector is basically just a 1-dimensional array of data. Like a set of numbers or a shopping list. It's really easy to make vectors of sequential integers:

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

5:-5

##  [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5

More vectors

You can also make vectors of strings or numbers using the combine (c) function:

c(5, 4, 3, 2, 8)

## [1] 5 4 3 2 8

c("sand", "silt", "clay")

## [1] "sand" "silt" "clay"

Limits of atomic vectors

But be careful. These kinds of vectors can only take one type of data. If you combine numbers and strings together:

c("SOILID001", 3, 5, "SOILID006", 7)

## [1] "SOILID001" "3"         "5"         "SOILID006" "7"

Then they all become characters.

If you think about it, you can treat numbers as letters, but it's a lot harder to treat letters as numbers.

Even bigger data

Now that we can make vectors we can make subsets of bigger datasets, matrices and dataframes. Just to start off, we'll use a simple dataset which comes with R: trees (Try ?trees for more information about the data). It is a data frame, a collection of vectors as columns. The vectors are allowed to be different types (character, integer, numeric), but they must all be the same length.

trees

##    Girth Height Volume
## 1    8.3     70   10.3
## 2    8.6     65   10.3
## 3    8.8     63   10.2
## 4   10.5     72   16.4
## 5   10.7     81   18.8
## 6   10.8     83   19.7
## 7   11.0     66   15.6
## 8   11.0     75   18.2
## 9   11.1     80   22.6
## 10  11.2     75   19.9
## 11  11.3     79   24.2
## 12  11.4     76   21.0
## 13  11.4     76   21.4
## 14  11.7     69   21.3
## 15  12.0     75   19.1
## 16  12.9     74   22.2
## 17  12.9     85   33.8
## 18  13.3     86   27.4
## 19  13.7     71   25.7
## 20  13.8     64   24.9
## 21  14.0     78   34.5
## 22  14.2     80   31.7
## 23  14.5     74   36.3
## 24  16.0     72   38.3
## 25  16.3     77   42.6
## 26  17.3     81   55.4
## 27  17.5     82   55.7
## 28  17.9     80   58.3
## 29  18.0     80   51.5
## 30  18.0     80   51.0
## 31  20.6     87   77.0

Subsetting

The basic form of a subset is object[whichrows, whichcolumns]. Leaving an element blank means select all. So for example, to get the third row of trees, or the second column:

trees[3, ]

##   Girth Height Volume
## 3   8.8     63   10.2

trees[, 2]

##  [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74
## [24] 72 77 81 82 80 80 80 87

You might have noticed the [24] in the second line of trees[, 2]. This indicates that the 72 next to it is the 24th element of that vector. There just isn't enough room on the screen to show it all in one line.

Finer subsetting

We can get individual values by setting both rows and columns

trees[3, 2]

## [1] 63

trees[1, 1]

## [1] 8.3

The last elements

And we can also use nrow and ncol (gets the number of rows or columns) to get the last elements:

nrow(trees)

## [1] 31

trees[nrow(trees), ]

##    Girth Height Volume
## 31  20.6     87     77

trees[, ncol(trees)]

##  [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0 21.4 21.3
## [15] 19.1 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3 42.6 55.4 55.7 58.3
## [29] 51.5 51.0 77.0

Why stop at one?

You can also subset collections of elements. To get the fourth to seventh row of the first and third column:

trees[4:7, c(1, 3)]

##   Girth Volume
## 4  10.5   16.4
## 5  10.7   18.8
## 6  10.8   19.7
## 7  11.0   15.6

Subsetting quiz

What is the number in the third last row of the 2nd column of `trees`?

80
87
65
10.3

Using the line trees[nrow(trees) - 2, 2] will get you the third last row of the second column. Third last is the last one minus 2 places.

Other ways of extracting columns

Data frames are collections of vectors, so you can use their names to extract them. You can use your original notation and also $ notation (which is convenient, but not as powerful).

trees[, "Height"]

##  [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74
## [24] 72 77 81 82 80 80 80 87

trees$Height

##  [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74
## [24] 72 77 81 82 80 80 80 87

Subsetting by a conditional

Often, we want all the rows that meet a certain condition. Say I want all the trees where the height is greater than 82 ft. First, I can produce a TRUE/FALSE variable, then take only the cases where it is true:

greater82 <- trees$Height > 82
greater82[1:5]

## [1] FALSE FALSE FALSE FALSE FALSE

trees[greater82, ]

##    Girth Height Volume
## 6   10.8     83   19.7
## 17  12.9     85   33.8
## 18  13.3     86   27.4
## 31  20.6     87   77.0

Logical subsetting quiz

How many trees are there with a girth less than 12.5 inches?

Using the code trees[trees$Girth < 12.5, ], we can see that there are 15 trees with a girth less than 12.5. Press 'p' to see the whole expected output.

trees[trees$Girth < 12.5, ]

##    Girth Height Volume
## 1    8.3     70   10.3
## 2    8.6     65   10.3
## 3    8.8     63   10.2
## 4   10.5     72   16.4
## 5   10.7     81   18.8
## 6   10.8     83   19.7
## 7   11.0     66   15.6
## 8   11.0     75   18.2
## 9   11.1     80   22.6
## 10  11.2     75   19.9
## 11  11.3     79   24.2
## 12  11.4     76   21.0
## 13  11.4     76   21.4
## 14  11.7     69   21.3
## 15  12.0     75   19.1

Getting help

But how do I find out about all these functions? There are a few ways of getting help in R:

The default help functions. Try ?print or help(print)
Other help functions: RSiteSearch("soil")
StackOverflow
The wonderful Google

At the moment, the help files will be fairly hard to get through, but they make more and more sense the more you use R. Some help files will be much simpler than others. If you find the help file impossible to get through, try testing the examples at the bottom of most help files or access them using the examples function.

Getting help quiz

What does the `sep` argument of `read.csv` do?

Separates the arguments with commas and puts them together
Determines the field separator character of a file which has been read in
Decides whether to use the Sepine Angulator algorithm
If TRUE, splits files into list elements

The read.csv function entry found using ?read.csv or help(read.csv) says sep is the 'field separator character'.

Dataframes

Now we're ready to deal with some real data. We have some soil temperature data (University of Alaska Fairbanks, 2011). The best function to read in CSVs is read.csv. Let's store that object and call it soiltempdata.

url <- "https://raw.github.com/sebastian-c/r-mar2013/gh-pages/soiltempdata.csv"
soiltempdata <- read.csv(url)

That looks like it went through nicely. No errors, no complaints.

Checking your data

It's a good idea to look at the first few rows of your data, to make sure it went through ok. We'll use head to get the first 4 rows of ours.

head(soiltempdata, 4)

##   COL.                                DATA.TYPE X X.1 X.2 X.3 X.4 X.5 X.6
## 1    1                          Date Time (AST)                          
## 2    2  0-cm Depth Soil Temperature (degrees C)                          
## 3    3  5-cm Depth Soil Temperature (degrees C)                          
## 4    4 10-cm Depth Soil Temperature (degrees C)                          
##   X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14
## 1                                     
## 2                                     
## 3                                     
## 4

Blech. That's not what we wanted at all. Most of those columns are empty.

Fixing the data

What went wrong? Let's download the data and open it up in Excel or Notepad and look at it (ideally, you should do this before trying to read it in). Click here to link to the data (if you're feeling adventurous, try using the download.file function to download it. Be sure to use the mode="wb" option). Save it to your hard drive (put it into My Documents). Then open the CSV file.

Well there's the problem, there's all this other stuff before we get to the actual data.

Getting to work

Set your working directory to where you stored the data. You can use the getwd function to work out what directory you're in and the setwd function to set the working directory. Remember to use forward slashes (/) as the path separator.

getwd()

## [1] "C:/Research/"

setwd("C:/Users/Sebastian/Documents")
getwd()

## [1] "C:/Users/Sebastian/Documents"

Skipping the first few lines

Ok, so we had a problem with the data before, let's see if we can read it in properly this time. So the first problem is that we have a few lines before we get to the data. We could remove these manually, but R has a neat trick to sort this problem out, the skip argument in read.csv.

How many lines (rows) do we have to skip?

There are 16 lines we have to skip (not 17, we want to keep the headers).

Reading in the data

Now after all these problems, we can finally read in the data and store it as an object. We can see it actually has columns we want.

soiltempdata <- read.csv("soiltempdata.csv", skip = 16)
head(soiltempdata, 3)

##   ID Year Month Day Hour Depth.0.cm Depth.5.cm Depth.10.cm Depth.15.cm
## 1  1 2011     1   1    0     -6.752     -4.940      -5.061      -4.401
## 2  2 2011     1   1    1     -6.727     -4.916      -5.098      -4.378
## 3  3 2011     1   1    2     -6.726     -4.916      -5.098      -4.377
##   Depth.20.cm Depth.40.cm Depth.60.cm Depth.80.cm Depth.100.cm
## 1      -3.698      -2.014       0.041      -0.215         7777
## 2      -3.733      -1.993       0.010      -0.195         7777
## 3      -3.733      -1.992       0.010      -0.195         7777
##   Depth.120.cm Depth.135.cm Depth.150.cm
## 1       -0.525       -0.577       -0.995
## 2       -0.505       -0.608       -1.028
## 3       -0.504       -0.608       -1.027

Structure of objects

Another way of examining objects is the str function.

str(soiltempdata)

## 'data.frame':    6552 obs. of  17 variables:
##  $ ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year        : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Hour        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Depth.0.cm  : num  -6.75 -6.73 -6.73 -6.65 -6.67 ...
##  $ Depth.5.cm  : num  -4.94 -4.92 -4.92 -4.91 -4.86 ...
##  $ Depth.10.cm : num  -5.06 -5.1 -5.1 -5.03 -5.04 ...
##  $ Depth.15.cm : num  -4.4 -4.38 -4.38 -4.37 -4.38 ...
##  $ Depth.20.cm : num  -3.7 -3.73 -3.73 -3.72 -3.68 ...
##  $ Depth.40.cm : num  -2.01 -1.99 -1.99 -2.04 -2.05 ...
##  $ Depth.60.cm : num  0.041 0.01 0.01 0.02 0.004 0.04 0.078 0.071 0.055 0.05 ...
##  $ Depth.80.cm : num  -0.215 -0.195 -0.195 -0.236 -0.2 -0.165 -0.177 -0.184 -0.201 -0.205 ...
##  $ Depth.100.cm: int  7777 7777 7777 7777 7777 7777 7777 7777 7777 7777 ...
##  $ Depth.120.cm: num  -0.525 -0.505 -0.504 -0.546 -0.51 -0.526 -0.487 -0.494 -0.511 -0.515 ...
##  $ Depth.135.cm: num  -0.577 -0.608 -0.608 -0.598 -0.614 -0.578 -0.591 -0.598 -0.615 -0.619 ...
##  $ Depth.150.cm: num  -0.995 -1.028 -1.027 -1.017 -0.98 ...

Yet another problem

Something still isn't right. Look at the 100cm depth soil temperature from the previous slide. Note that you can use the $ symbol to subset out individual columns.

str(soiltempdata$Depth.100.cm)

##  int [1:6552] 7777 7777 7777 7777 7777 7777 7777 7777 7777 7777 ...

There is no way that there is a soil with a temperature of 7777 degrees celsius. So why is that present? Well, if we reopen the data in Excel or Notepad, we'll see there were two important lines:

NOTE:,6999 DENOTES MISSING DATA OR INVALID DATA
,7777 DENOTES UNCOLLECTED DATA

So it would seem that anything with these values should be a missing value. Turns out R has a trick for that too. Let's reread in the data again (last time, I promise).

na.strings

If we look at the help file for read.csv (?read.csv), we see there's an argument called na.strings. We have two things we need to be NA, so we combine them like we did a few slides ago: c(6999, 7777).

soiltempdata <- read.csv("soiltempdata.csv", skip = 16, na.strings = c(6999, 
    7777))
str(soiltempdata$Depth.100.cm)

##  logi [1:6552] NA NA NA NA NA NA ...

And now it's full of NAs (missing values), as it should be.

Summaries

A summary of a dataframe's columns can easily be obtained with summary:

summary(soiltempdata)

##        ID            Year          Month           Day      
##  Min.   :   1   Min.   :2011   Min.   :1.00   Min.   : 1.0  
##  1st Qu.:1639   1st Qu.:2011   1st Qu.:3.00   1st Qu.: 8.0  
##  Median :3276   Median :2011   Median :5.00   Median :16.0  
##  Mean   :3276   Mean   :2011   Mean   :5.02   Mean   :15.7  
##  3rd Qu.:4914   3rd Qu.:2011   3rd Qu.:7.00   3rd Qu.:23.0  
##  Max.   :6552   Max.   :2011   Max.   :9.00   Max.   :31.0  
##       Hour         Depth.0.cm       Depth.5.cm     Depth.10.cm   
##  Min.   : 0.00   Min.   :-10.87   Min.   :-9.90   Min.   :-9.88  
##  1st Qu.: 5.75   1st Qu.: -8.70   1st Qu.:-8.24   1st Qu.:-8.32  
##  Median :11.50   Median : -3.79   Median :-2.79   Median :-2.92  
##  Mean   :11.50   Mean   :  0.44   Mean   : 0.17   Mean   :-1.07  
##  3rd Qu.:17.25   3rd Qu.:  7.33   3rd Qu.: 6.32   3rd Qu.: 4.49  
##  Max.   :23.00   Max.   : 37.21   Max.   :32.20   Max.   :22.02  
##   Depth.15.cm     Depth.20.cm     Depth.40.cm     Depth.60.cm    
##  Min.   :-9.52   Min.   :-9.33   Min.   :-8.81   Min.   :-7.787  
##  1st Qu.:-8.04   1st Qu.:-7.86   1st Qu.:-7.51   1st Qu.:-6.435  
##  Median :-2.58   Median :-2.28   Median :-1.82   Median :-1.444  
##  Mean   :-1.51   Mean   :-1.87   Mean   :-2.75   Mean   :-2.846  
##  3rd Qu.: 3.90   3rd Qu.: 3.31   3rd Qu.: 1.08   3rd Qu.: 0.199  
##  Max.   :14.76   Max.   :10.35   Max.   : 4.03   Max.   : 1.251  
##   Depth.80.cm     Depth.100.cm    Depth.120.cm     Depth.135.cm   
##  Min.   :-7.473   Mode:logical   Min.   :-6.935   Min.   :-6.709  
##  1st Qu.:-6.212   NA's:6552      1st Qu.:-5.713   1st Qu.:-5.495  
##  Median :-2.009                  Median :-2.482   Median :-2.577  
##  Mean   :-3.170                  Mean   :-3.230   Mean   :-3.205  
##  3rd Qu.:-0.449                  3rd Qu.:-0.977   3rd Qu.:-1.096  
##  Max.   :-0.122                  Max.   :-0.487   Max.   :-0.574  
##   Depth.150.cm  
##  Min.   :-6.90  
##  1st Qu.:-5.69  
##  Median :-3.04  
##  Mean   :-3.59  
##  3rd Qu.:-1.63  
##  Max.   :-0.98

Histogram

A simple exploratory analysis could start, for example, with a histogram of the data to observe the distribution of the data. Try using the function hist with the temperatures at depth 0:

hist(soiltempdata$Depth.0.cm)

plot of chunk unnamed-chunk-27

(Pretty skewed, huh?)

Plotting

We can now do some basic plotting. Let's look at the 0cm depth over time. This is easy using the plot function. We need specify the x variable and the y variable. We'll use the type="l" argument to make it a line plot (the default has points):

plot(soiltempdata$ID, soiltempdata$Depth.0.cm, type = "l")

plot of chunk unnamed-chunk-28

Plotting a subset

What if we have a special interest in temperature midday temperatures? Let's create a histogram of midday temperatures at depth 0:

hist(soiltempdata$Depth.0.cm[soiltempdata$Hour == 12])

plot of chunk unnamed-chunk-29

Boxplots

Another simple and useful plot is the boxplot. Again, using the soil temperature at depth 0 and the function boxplot. Note that we can fix up those messy axis labels with the xlab and ylab arguments:

boxplot(Depth.0.cm ~ Month, data = soiltempdata, xlab = "Month", ylab = "Temperature")

plot of chunk unnamed-chunk-30

Tidy data

At the moment the dataset is usable, but is it tidy (Wickham, 2011)? To be tidy, a dataset must have the following qualities:

Each variable forms a column
Each observation forms a row
Each table stores data about one class of experimental unit

This data has multiple columns for temperature variables. It would be easier to deal with if all temperatures were in the same column. Ideally, we would have a column for date, a column for depth and a column for temperature. This is often called converted from 'wide' format (many columns, few rows) to 'long' format (few columns, many rows).

Loading in packages

One of the best ways to rearrange your data is to use the reshape2 package (Wickham, 2007). You can use the library or the require function to load it.

You may get an error:

library(reshape2)

## Error: there is no package called 'reshape2'

If this happens to you, then reshape2 has not yet been installed. Use install.packages("reshape2") to install it and library(reshape2) to load it.

Reshaping data

Now we have the reshape2 package loaded in, we can use the melt function to melt out dataframe. See the ?melt.data.frame page for details. The id.vars are the variables we don't want to melt. The .name arguments control the column names of the result:

library(reshape2)
soiltemp_melt <- melt(soiltempdata, id.vars = c("ID", "Year", "Month", "Day", 
    "Hour"), variable.name = "Depth", value.name = "Temperature")

head(soiltemp_melt)

##   ID Year Month Day Hour      Depth Temperature
## 1  1 2011     1   1    0 Depth.0.cm      -6.752
## 2  2 2011     1   1    1 Depth.0.cm      -6.727
## 3  3 2011     1   1    2 Depth.0.cm      -6.726
## 4  4 2011     1   1    3 Depth.0.cm      -6.651
## 5  5 2011     1   1    4 Depth.0.cm      -6.671
## 6  6 2011     1   1    5 Depth.0.cm      -6.627

Poor depths

See now we have depth in one column and temperature in another. That's not really good enough though. Out Depth column is still not useful. Using the table function to get a summary of the values in it:

table(soiltemp_melt$Depth)

## 
##   Depth.0.cm   Depth.5.cm  Depth.10.cm  Depth.15.cm  Depth.20.cm 
##         6552         6552         6552         6552         6552 
##  Depth.40.cm  Depth.60.cm  Depth.80.cm Depth.100.cm Depth.120.cm 
##         6552         6552         6552         6552         6552 
## Depth.135.cm Depth.150.cm 
##         6552         6552

There are 6552 of each Depth measurement. But the depth measurement isn't a number, it's a character with a number in it. What if we could get that number out and use this column as a numeric (like Temperature is)?

Regex

We can use regular expressions to extract the numbers. First, we'll use the regexpr function to extract the locations of the numbers:

num_matches <- regexpr("[[:digit:]]+", soiltemp_melt$Depth)
str(num_matches)

##  atomic [1:78624] 7 7 7 7 7 7 7 7 7 7 ...
##  - attr(*, "match.length")= int [1:78624] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "useBytes")= logi TRUE

The string [[:digit:]]+ is a special string. [[:digit:]] means any number from 0-9 and the + on the end means 1 or more. So find one or more numbers. The results mean that the match starts at the 7th digit ("Depth.0") and is only 1 character long. If we were to scroll down to the "Depth.150.cm", we'd see that the match also starts at the 7th digit and is 3 characters long.

From Regex to reality

We can use the regmatches command to convert from these positions to the actual numbers they match, then assign that back to our Depth column.

depth_nums <- regmatches(soiltemp_melt$Depth, num_matches)
head(depth_nums)

## [1] "0" "0" "0" "0" "0" "0"

Hmm... characters (you can tell by the quotes). We had better convert those to numbers before we replace our depth column).

soiltemp_melt$Depth <- as.numeric(depth_nums)

Stop and check quiz

What is the 3rd element of the Depth column now?

-6.727
0
1
Depth.0.cm

Plots again

Now that we have stacked data, it's much easier to make plots. For example, it's easy to make a simple scatterplot of Temperature vs Depth.

plot(Temperature ~ Depth, data = soiltemp_melt)

plot of chunk unnamed-chunk-37

Depth boxplots

We can also make better boxplots:

boxplot(Temperature ~ Depth, data = soiltemp_melt, xlab = "Depth", ylab = "Temperature")

plot of chunk unnamed-chunk-38

plyr

Stacking out data has also made our data easier to summarise. We're going to use the plyr package. In particular, a function called ddply Let's start with something simple. Let's get the mean temperature for each Depth:

library(plyr)
depth_mean <- ddply(soiltemp_melt, c("Depth"), summarise, meandepth = mean(Temperature, 
    na.rm = TRUE))
depth_mean

##    Depth meandepth
## 1      0    0.4403
## 2      5    0.1725
## 3     10   -1.0736
## 4     15   -1.5077
## 5     20   -1.8686
## 6     40   -2.7486
## 7     60   -2.8459
## 8     80   -3.1703
## 9    100       NaN
## 10   120   -3.2299
## 11   135   -3.2049
## 12   150   -3.5913

Plotting summaries

We can now plot that easily:

plot(meandepth ~ Depth, data = depth_mean)

plot of chunk unnamed-chunk-40

More complex summaries

What if we want more? Say we want to know the temperature distribution during the day for each depth?

depthday_dist <- ddply(soiltemp_melt, c("Hour", "Depth"), summarise, meandepthday = mean(Temperature))
head(depthday_dist, 2)

##   Hour Depth meandepthday
## 1    0     0       -2.688
## 2    0     5       -2.249

That summary is long and hard to plot on a simple graph. How can we make a graph that will plot something like that?

ggplot2

ggplot2 is a package which allows you to build up complex plots from basic elements (Wickham, 2009). First we define the data of the plot in ggplot, then define the geoms.

library(ggplot2)
ggplot(depthday_dist, aes(x = Hour, y = meandepthday)) + geom_point()

## Warning: Removed 24 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-42

Adding a little bit of colour

But how do we tell the points apart? Maybe we could give them different colours and connect the dots by lines? We'll convert Depth to factor so it isn't treated like a number (see what happens if you don't convert it to factor).

ggplot(depthday_dist, aes(x = Hour, y = meandepthday, colour = as.factor(Depth))) + 
    geom_line() + labs(colour = "Depth")

plot of chunk unnamed-chunk-43

ggplot2

We can go back to our original melted data and look at all the depths over time if we facet by Depth.

ggplot(soiltemp_melt, aes(x = ID, y = Temperature)) + geom_line() + facet_wrap(~Depth, 
    ncol = 3)

plot of chunk unnamed-chunk-44

Linear models

If we want to check the relation between, for example, the temperature at depth 0 and depth 5, we can use the function lm

temp0 <- soiltemp_melt$Temperature[soiltemp_melt$Depth == 0]
temp5 <- soiltemp_melt$Temperature[soiltemp_melt$Depth == 5]

linear_model <- lm(temp0 ~ temp5)  ## temp0 and temp5 stored as subsets

Getting things out of a linear model

We can access some important information using summary:

summary(linear_model)

## 
## Call:
## lm(formula = temp0 ~ temp5)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.764 -0.869  0.085  0.410  6.490 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.24580    0.01453    16.9   <2e-16 ***
## temp5        1.12760    0.00153   739.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.18 on 6550 degrees of freedom
## Multiple R-squared: 0.988,   Adjusted R-squared: 0.988 
## F-statistic: 5.46e+05 on 1 and 6550 DF,  p-value: <2e-16

Or using $ in combination with summary:

summary(linear_model)$r.squared

## [1] 0.9881

SS tables

We can produce familiar tables containing sum of squares and pvalues using the anova function:

anova(linear_model)

## Analysis of Variance Table
## 
## Response: temp0
##             Df Sum Sq Mean Sq F value Pr(>F)    
## temp5        1 755095  755095  546054 <2e-16 ***
## Residuals 6550   9057       1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear models: Plot

Let's plot the previous data and fitted linear model:

plot(temp5, temp0)
abline(linear_model, col = "blue", lwd = 2)

plot of chunk unnamed-chunk-50

Fun to look forward to:

When you have a bit more R experience, you can make graphs like the following:

Dates
RGoogleVis (done)
This presentation
Soil graphs (from different data)
Map with points in R

googleVis

We can make your first plot a lot fancier if we want (hover your mouse over it):

Soil profiles

Plot soil profiles with the help of package aqp:

Georeferenced data

Plot soil profiles locations:

This presentation!

This presentation was entirely made in R using some useful packages such as slidify (Vaidyanathan, 2012). You can look at the source for it on GitHub.

Now your turn

Here is a new data set
- NRCS data of soil around Indonesia
Things to try:
- Read in the data
- Make some basic graphs
- Do some basic regression
- Ask questions!
- Make mistakes!

Bibliography

University of Alaska Fairbanks . (2011). South White Hills Met Station (DFM1). http://ine.uaf.edu/werc/projects/foothills/stations/dfm1/historical.html.

Vaidyanathan R (2012). slidify: Generate reproducible html5 slides from R markdown. R package version 0.3.3, http://ramnathv.github.com/slidify/.

Wickham H (2007). “Reshaping Data with the reshape Package.” Journal of Statistical Software, 21(12), pp. 1–20. http://www.jstatsoft.org/v21/i12/.

Wickham H (2009). ggplot2: elegant graphics for data analysis. Springer New York. ISBN 978-0-387-98140-6, http://had.co.nz/ggplot2/book.

Wickham H (2011). “Tidy data.” The American Statistician. http://vita.had.co.nz/papers/tidy-data.html.

R Tutorial

March 2013

Welcome to R

Say Hello!

Some basic maths

Quick quiz

What is the expected result from 6 / 2?

Assigning variables

Not just numbers

Making readable results

Quiz 2

Which of these lines of code will produce the result: "1:2:3:x:y:z"

Bigger data

More vectors

Limits of atomic vectors

Even bigger data

Subsetting

Finer subsetting

The last elements

Why stop at one?

Subsetting quiz

What is the number in the third last row of the 2nd column of trees?

Other ways of extracting columns

Subsetting by a conditional

Logical subsetting quiz

How many trees are there with a girth less than 12.5 inches?

Getting help

Getting help quiz

What does the sep argument of read.csv do?

Dataframes

Checking your data

Fixing the data

Getting to work

Skipping the first few lines

How many lines (rows) do we have to skip?

Reading in the data

Structure of objects

Yet another problem

na.strings

Summaries

Histogram

Plotting

Plotting a subset

Boxplots

Tidy data

Loading in packages

Reshaping data

Poor depths

Regex

From Regex to reality

Stop and check quiz

What is the 3rd element of the Depth column now?

Plots again

Depth boxplots

plyr

Plotting summaries

More complex summaries

ggplot2

Adding a little bit of colour

ggplot2

Linear models

Getting things out of a linear model

SS tables

Linear models: Plot

Fun to look forward to:

googleVis

Soil profiles

Georeferenced data

This presentation!

Now your turn

Bibliography

What is the expected result from `6 / 2`?

Which of these lines of code will produce the result: `"1:2:3:x:y:z"`

What is the number in the third last row of the 2nd column of `trees`?

What does the `sep` argument of `read.csv` do?