An Introduction to R
1 The Nuts and Bolts of R
1.1 Introduction to R
R is a statistical programming language that is very popular with statisticians and researchers. In fact, more researchers than I can count have told me that learning R is on their to-do list. There are many good reasons why learning R would end up on a to-do list. Firstly, it has an amazing array of features and customisation, and secondly, it takes a rather long time to become proficient. One of the first cool things about R is that it is free and open source, so if you haven’t already done so, you can install (or update) R for free and install or update RStudio to the newest version. You will also need administrator privileges on your computer.
Essentially, the R language itself is just a calculator that accepts command line inputs. To make it more friendly, we use the program RStudio, which runs on top of R and provides a graphical user interface. So why is a glorified calculator so cool? R has a very large community who write packages, which are collections of functions that streamline manipulating and analysing data. Down the track you can also learn to write your own functions to save time on tasks that you do frequently. With R you not only get all the functionality provided by a huge pool of researchers and data scientists, but can also go from being a “user” of statistical software to a developer.
# You might want to make notes in R! If you add a hashtag to the start of a line then R will know that this isn’t part of the code
You can download the datasets for today’s workshop here
A lot of first-time users get thrown by the new terminology that comes with learning R, and there are a lot of new terms: arguments, objects, classes, functions, packages and so on.
We will try to learn each of these terms as they become necessary, but don’t worry too much if you have to go and look them up at some stage over the coming days, weeks, and years!
There is also a handy glossary (section 1.12) that outlines each of these terms if you get stuck.
While it might not look like much, RStudio actually has a much more comprehensive graphical user interface (GUI) compared to R itself. RStudio runs ‘on top’ of R, allowing you to run R but with some extra interfaces on top. RStudio is set out in four panels (you might only be seeing three, don’t panic). The four panels don’t have official names, so I have just made up unofficial ones.
The top left panel is the script panel. This panel is hidden if there are no open scripts. To open it, go to File > New File > R Script. The script panel is the place where we type our code and run it whenever we want. You can also leave comments and notes in the script file to remind yourself what each line or section does. To do this, place a # at the beginning of the comment.
To run code, place the cursor somewhere in the section of code and hit Cmd + Enter on a Mac or Ctrl + Enter on a PC. Let’s give it a go. Type 3 + 7 in your script and then execute that section.
After executing your code you should see that the bottom left panel now shows the resulting output from the code you ran: 3 + 7 is equal to 10 (phew). This console panel prints the output from any commands or code you run in R. You can also type directly into the console and get an answer that way (go ahead and type 3 + 10 into the console). You should almost always type your code in the script panel, not the console panel. The console doesn’t keep a record of your code (or your resulting output), so it is lost to the ether once you have finished (this isn’t actually true, you can track down a record of what was typed into the console, but it is a pain to do).
The environment panel contains a record of all the stuff you’ve saved into the R environment. What kind of stuff, you ask? We’ll get to that in a second, but for now it’s worth noting that in R you could have dozens (or hundreds) of datasets, lists, variables and so on, all open in the same environment.
Plots (and more) Panel
This panel is somewhat multipurpose. You can use it to access your current directories, take a look at plots, look at the currently loaded packages (more on that later), and get help. Want to see an incredibly ugly and pointless plot? Type plot(10) into your script panel and execute.
Objects are just things that you have stored in your R environment. If you are used to working with SPSS, the idea of an environment can be confusing. In SPSS you have a dataset made up of variables. In R you have an environment made up of objects. The objects can be variables, but they can also be whole datasets, or even lists of hundreds of datasets. If this all sounds a bit meta, don’t worry too much: you will gradually get used to working with all different kinds of objects in R.
Each object has a name and a class. We use the object’s name to refer to it in our code and the class of the object determines what you can and can’t do with it.
Let’s make our first object, called x. To assign an object we use the <- symbol. It is the equivalent of an equals sign (you can even use an equals sign instead, but this can be problematic so I recommend sticking with the arrow).
x <- 7
You can see in the top right panel that the object x has been added to the environment. Now whenever we type the object’s name in our script, the number 7 will be returned in the console.
x
## [1] 7
Now seems like a good time to mention that R is case sensitive. If we run the code X, you will see the result is Error: object ‘X’ not found. Try not to spend 3 hours trying to fix your code when all that’s wrong is a capital letter in the wrong place!
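Here is a small sketch of case sensitivity in action. The object name score is made up for this illustration, and exists() is used so we can check for a name without triggering the error ourselves.

```r
# 'score' is a hypothetical object name used only for this illustration
score <- 7

exists("score")   # this is the name we assigned
exists("Score")   # a different name as far as R is concerned

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(Score, error = function(e) conditionMessage(e))
msg
```

The captured message is the same "not found" error you would see in the console if you typed Score directly.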
We can also make objects of all different varieties (classes). For example:
y <- c(3,3,6,17,17,17) # don't worry about the 'c', we'll cover that in a bit
y
## [1]  3  3  6 17 17 17
or a word
name <- "Kit"
sentence <- c("Kit", "loves", "R")
# or even easier....
sentence <- "Kit loves R"
We can use our objects in our code in flexible ways. For example
9 - x
## [1] 2
x <- 3
x
## [1] 3
y + x
## [1]  6  6  9 20 20 20
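The last result shows that R applies arithmetic to every element of a set of numbers at once. A small sketch of the same idea, using the made-up names a and b so that it doesn't overwrite x and y:

```r
# 'a' and 'b' are hypothetical names chosen for this illustration
a <- c(3, 3, 6, 17, 17, 17)
b <- 3

a + b     # adds 3 to every element of a
a * 2     # doubles every element of a
sum(a)    # adds all the elements together
mean(a)   # the average of the elements
```

Because b holds a single number, R reuses it for each element of a.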
As I said, each object has a class. To see the class of an object we use the class() function. The class() function simply tells you what class an object is (more on functions next).
class(x)
## [1] "numeric"
class(y)
## [1] "numeric"
class(sentence)
## [1] "character"
You can see that the x and y objects are both class numeric. This means they are numbers and you could use them in equations. The object sentence, however, is of class character, because it is a series of characters (letters). The major classes of R objects are:
Character: Letters (and maybe some numbers, e.g. ‘2a’). The equivalent of a string variable in SPSS
Numeric: Numbers and only numbers
Factor: A special class which basically means the variable has pre-defined categories, e.g. hot or cold
Matrix: A matrix of rows and columns (e.g. 12 x 13)
Data.frame: A dataset made up of rows and columns, but with special properties compared to a matrix
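To make the list above concrete, here is a sketch that builds one small object of each class and checks it. All of the object names are made up for this illustration. For the matrix I use is.matrix() rather than class(), because what class() prints for a matrix depends on your version of R.

```r
# All object names below are hypothetical, chosen for this illustration
word    <- "2a"                                # character
number  <- 12.5                                # numeric
temp    <- factor(c("hot", "cold", "hot"))     # factor with two categories
grid    <- matrix(1:6, nrow = 2, ncol = 3)     # matrix: 2 rows, 3 columns
dataset <- data.frame(id = 1:3, temp = temp)   # data frame with two columns

class(word)
class(number)
class(temp)
levels(temp)      # the pre-defined categories of the factor
is.matrix(grid)   # TRUE if the object is a matrix
class(dataset)
```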
Okay, so we used our first function, class(), but what are functions? Functions are, in an abstract sense, pieces of code that can be run on an object. In a more concrete sense they are the things that, well…add functionality in R. There are far more functions in R than anyone could count, so you probably shouldn’t set about trying to learn them all. Instead, consider that for almost everything you want to do in R there is a function for that, and if there isn’t you could make one!
The next function that we will learn is called hist(). The hist function makes a histogram. Applying a function always involves the use of parentheses after the function name.
hist(y)
Ugly histogram, right? We can make changes to the histogram by specifying the arguments passed to the function. Arguments are pieces of information the function can be given. One of the arguments in the hist() function is xlab, which specifies the label of the x-axis.
Let’s change the label of the x-axis using the xlab argument.
hist(y, xlab = "Number of Red Cars")
Alright, now you’re probably having something of a memory overload. Not only do you have to remember all these new words but also all kinds of functions, and all kinds of arguments! To help out we can use the tab key to see what arguments we can give a function. To do this type hist(y, ) then hit tab.
You should see a list of all the arguments we could possibly supply to the hist() function. Let’s use the main argument to change the title of the histogram.
hist(y, xlab = "Number of Red Cars", main = "Number of Red Cars During a Week")
You can either specify the argument directly, for example, by writing ‘xlab =’ or you can enter arguments in a particular order (and the function will know which piece of information corresponds with which argument). When you are learning it is good to specify the arguments for most functions (except perhaps when they only have one argument).
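As a quick sketch of the two ways of supplying arguments, here is the base R function seq(), which builds a sequence of numbers from its from, to and by arguments. seq() isn't used elsewhere in this workshop; I picked it only because its arguments are easy to follow.

```r
# Arguments given by position: from, to, by (in that order)
by_position <- seq(2, 10, 2)

# The same arguments given by name, so the order no longer matters
by_name <- seq(by = 2, to = 10, from = 2)

by_position
identical(by_position, by_name)   # do both calls give the same result?
```

Naming the arguments takes more typing, but it means you (and anyone reading your script) can see which number is which.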
So R has some handy functions that can be customised using arguments. But what makes it cool are the thousands of very useful packages that have even more functions! You can use these packages to do advanced analysis, make amazing plots, or automate data cleaning. So let’s install and load our first package - beepr!
To do this we use the install.packages() function.
install.packages("beepr")
Packages are basically a collection of functions written by someone in the R community (you could even make a package of your own custom functions).
You can see a list of installed packages by clicking on the Packages tab in the bottom right panel.
Once installed we can use the functions that are part of a particular package. But the functions that come in a package are not automatically available when you start R even if you have installed a package. Packages need to be loaded every time you open R.
To load a package we use the library function (we have to do this every time we start R for every package we are going to use).
library(beepr)
You might notice that when we installed the package we used quotation marks around the package name, but we didn’t when we loaded the package. That’s because R didn’t know what beepr was before we installed it. The quotation marks told R that this was a string of characters rather than an object in R. Compare the difference below:
y
"y"
s
"s"
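Another way to see the difference is to ask R for the class of each version. A minimal sketch, using the made-up object name colour:

```r
# 'colour' is a hypothetical object name used only for this illustration
colour <- c(1, 2, 3)

colour     # no quotes: R looks up the object and returns its contents
"colour"   # quotes: R treats this as a piece of text

class(colour)     # the class of the stored numbers
class("colour")   # the class of the text itself
```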
Note: If you are planning to save your script (syntax) file for later, should you leave the install.packages() function there? If you do, and then re-run your script, you will reinstall the package unnecessarily. On the other hand, if you leave it out, then if someone else (or you sitting at a different computer) tries to run the script, it will produce an error. The pacman package provides a function called p_load. p_load is cool: it checks whether you have a package (and installs it if you don’t already have it installed) AND loads it! You can use p_load instead of the library function.
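To show the idea behind "check, install if missing, then load", here is a rough sketch written with base R only. load_if_available() is a hypothetical helper I made up for this illustration; it is not part of pacman, and it is not how p_load is actually implemented.

```r
# load_if_available() is a hypothetical helper written for this sketch;
# it is not part of pacman or base R.
load_if_available <- function(pkg, install = FALSE) {
  # requireNamespace() returns TRUE if the package is already installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    if (!install) {
      return(FALSE)   # not installed, and we were asked not to install it
    }
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as text
  library(pkg, character.only = TRUE)
  TRUE
}

# stats ships with R, so this loads it without installing anything
load_if_available("stats")

# With pacman installed, the one-line equivalent for beepr would be:
# pacman::p_load(beepr)
```

Setting install = TRUE would let the helper call install.packages() for a missing package, which is the behaviour the note above describes.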
beepr has a function called beep(). See if you can work out how to get the function to play the Mario theme song. Turn on your sound for this demo to work!
beep(sound = 8)
In this case one of the arguments was sound, which tells beep which sound to play. Okay, beepr is pretty well useless, but now you know how to install a package and run one of its functions.
To see what a package is capable of you can view the help file.
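A short sketch of how you might look at help from your script. I use the stats package here because it ships with R; swap in "beepr" once you have it installed.

```r
# Open the help index for a whole package
help(package = "stats")

# Open the help page for one function; ?hist is shorthand for the same thing
help("hist")

# List the first few functions that a loaded package provides
head(ls("package:stats"))
```

In RStudio the help pages open in the bottom right panel.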
To refer to a particular row or column or even cell in R, you can always use indexing. Indexing involves specifying the ROW and the COLUMN.
To explore this idea let’s create a matrix. A matrix is similar to a dataset (which we will come to later), but all of its values must be of the same type and, by default, it doesn’t have column names.
mymatrix <- matrix(data = 1:56, nrow = 14, ncol = 4)
We can take a look at the matrix by clicking on it in our environment.
##      [,1] [,2] [,3] [,4]
## [1,]    1   15   29   43
## [2,]    2   16   30   44
## [3,]    3   17   31   45
## [4,]    4   18   32   46
## [5,]    5   19   33   47
## [6,]    6   20   34   48
To specify a particular cell we use indexing as follows: object[row, column]. The fact that row always comes before column is arbitrary but worth memorising.
Let’s do some examples:
mymatrix[3,3]
## [1] 31
mymatrix[4,1]
## [1] 4
Each will return the cell according to its specified row and column. You could use this to change a value
mymatrix[4,1] <- 11
If we wanted to specify the entire second column we would just leave the row argument blank.
mymatrix[,2]
## [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28
or the entire third row
mymatrix[3,]
## [1]  3 17 31 45
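The indexing patterns above can be tried out on a smaller matrix where the results are easy to check by eye. The matrix m is made up for this sketch so that mymatrix is left alone.

```r
# 'm' is a hypothetical 3 x 4 matrix used only for this illustration
m <- matrix(data = 1:12, nrow = 3, ncol = 4)

m[2, 3]     # row 2, column 3: a single cell
m[, 4]      # row left blank: the whole fourth column
m[1, ]      # column left blank: the whole first row
m[2:3, 1]   # rows 2 to 3 of column 1

m[2, 3] <- 0   # indexing on the left of <- changes that cell
m[2, 3]
```

Note that 2:3 is shorthand for "2 up to 3", in the same way that 1:56 filled mymatrix with the numbers 1 to 56.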
Concatenation means to join two things. In R this is something we often want to do. For example, we may want to join columns one and three in the mymatrix object.
To concatenate in R we use the c() function.
mymatrix[,c(1,3)]
##       [,1] [,2]
##  [1,]    1   29
##  [2,]    2   30
##  [3,]    3   31
##  [4,]   11   32
##  [5,]    5   33
##  [6,]    6   34
##  [7,]    7   35
##  [8,]    8   36
##  [9,]    9   37
## [10,]   10   38
## [11,]   11   39
## [12,]   12   40
## [13,]   13   41
## [14,]   14   42
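Here is a small sketch of c() in two roles: joining values into one longer set of numbers, and picking several columns at once inside the square brackets. The names first, second, combined and m are made up for this illustration.

```r
# Hypothetical objects used only for this illustration
first  <- c(1, 2, 3)
second <- c(10, 20)

combined <- c(first, second)   # joins the two into one longer set
combined
length(combined)

# c() also works inside the square brackets to pick several columns
m <- matrix(data = 1:12, nrow = 3, ncol = 4)
m[, c(1, 3)]        # columns one and three, side by side
dim(m[, c(1, 3)])   # number of rows, then number of columns
```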
1.11 Changing Classes
Many of the problems you encounter as you learn R will be because of the class of the object not being what you think it is. Let’s confirm the class of the object mymatrix
class(mymatrix)
## [1] "matrix"
(In R 4.0.0 and later, the output is "matrix" "array".)
We can also change the class of an object. For example:
mymatrix <- as.data.frame(mymatrix)
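A few more class changes, sketched on made-up objects so you can see what happens before and after. The as.numeric() example is my own addition; it covers a common case where numbers have been read in as text.

```r
# All object names below are hypothetical, chosen for this illustration
m  <- matrix(data = 1:6, nrow = 3, ncol = 2)
df <- as.data.frame(m)
class(df)
names(df)    # the data frame now has column names

digits <- c("1", "2", "3")   # numbers stored as text
class(digits)
nums <- as.numeric(digits)   # now R can do arithmetic with them
class(nums)
sum(nums)

# Text that isn't a number becomes NA (R normally warns about this)
suppressWarnings(as.numeric("two"))
```

If a function complains about an object, checking class() before and after a conversion like this is usually a good first step.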
1.12 Glossary
Argument: A piece of information that is fed into a function. This information can be optional or mandatory
Class: Each object in R has a class. The class tells R what kind of object this is and determines what R (thinks) it can and can’t do with the object
Concatenation: Concatenation means to join two things; for example, you might join two vectors of numbers to form a matrix.
Environment: The workspace in R consisting of all the loaded objects.
Functions: Functions are, in an abstract sense, pieces of code that can be run on an object. In a more concrete sense they are the things that, well…add functionality in R. All functions are used in the format function().
Indexing: Within objects with multiple dimensions (e.g. lists, data frames, matrices), you can use indexing to specify an element, e.g. a particular column within a data frame. Always in the format object[row, column]…if there are two dimensions.
Objects: The ‘stuff’ in your environment (can be pretty much anything).
Packages: Collections of functions made by the R community to make your life easier.