Chapter 2 Getting started with R
2.1 Installing and using R packages
R comes with some pre-installed packages. You can view them in the Packages tab in the lower-right pane of RStudio.
However, you may need functions from packages that are not pre-installed, and such packages must be installed first. For example, the epidemiological package epitools is not pre-installed with R. To install it, type install.packages("epitools") in the R console.
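A minimal sketch of the installation step. The requireNamespace() guard and the explicit repos argument are additions for this example, so that the code skips packages that are already installed and also runs outside RStudio; inside RStudio a CRAN mirror is normally already configured.

```r
# Install epitools from CRAN only if it is not already installed.
# The guard and the repos argument are assumptions added for this sketch.
if (!requireNamespace("epitools", quietly = TRUE)) {
  install.packages("epitools", repos = "https://cloud.r-project.org")
}
```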
To install more than one package at once, for example epitools together with the tidyverse (a collection of R packages for data science), pass a character vector of package names to install.packages().
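Installing several packages in one call might look like the sketch below. The object names pkgs and to_install are hypothetical, and the check against installed.packages() is an addition so that only missing packages are downloaded.

```r
# Names of the packages we want available.
pkgs <- c("epitools", "tidyverse")

# Keep only the packages that are not installed yet (an added convenience).
to_install <- setdiff(pkgs, rownames(installed.packages()))

# install.packages() accepts a character vector of package names.
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```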
To load the two packages you have installed into the R session, call library() once for each package:
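The loading step can be sketched as follows, assuming both packages were installed successfully. Attaching the tidyverse prints a start-up message listing its core packages and any function name conflicts.

```r
# Attach each package to the current session.
library(epitools)
library(tidyverse)
```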
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
To see which packages are currently attached to the R session, use the search() command:
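A short sketch of that check. The listing you see depends on what has been attached in your own session; the object name attached is hypothetical.

```r
# search() returns the search path: the global environment followed by
# every attached package, in the order R looks things up.
attached <- search()
attached

# Test whether one particular package is attached.
"package:stats" %in% attached
```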
## [1] ".GlobalEnv" "package:lubridate" "package:forcats"
## [4] "package:stringr" "package:dplyr" "package:purrr"
## [7] "package:readr" "package:tidyr" "package:tibble"
## [10] "package:ggplot2" "package:tidyverse" "package:epitools"
## [13] "package:stats" "package:graphics" "package:grDevices"
## [16] "package:utils" "package:datasets" "package:methods"
## [19] "Autoloads" "package:base"
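To learn more about a package, e.g. tidyverse, you can look up its reference manual using Rseek, or open its help index from within R with help(package = "tidyverse").
For information on a specific function in a package, use help() or the ? operator with the function name.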
The command provides a description of the function, its usage, author(s) etc, and (very helpful) examples on how to use the function.
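As an illustration, the calls below open help for the base R function mean(). The choice of mean() and sd() is an assumption made for this sketch; any function name can be used in the same way.

```r
# Open the help page for a function; typing ?mean at the console is equivalent.
help("mean")

# Name the package explicitly when you want the version from a given package.
help("sd", package = "stats")

# Run only the examples section of a help page.
example("mean")
```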
2.2 Creating objects in R
An object is anything that takes a value. You can create objects by telling R to read a file, for example a .csv file from the UCLA site, and assigning the result a name, e.g. dat1.
Object dat1 now exists in the workspace. To view the first few rows of the data, use head(dat1), which shows the first 6 rows by default:
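The text does not give the location of the UCLA file, so the sketch below writes the six rows printed in this section to a temporary .csv file and reads that instead. The path csv_path is hypothetical; read.csv() accepts a URL in exactly the same position.

```r
# Stand-in for the UCLA file: six rows written to a temporary .csv so the
# example runs offline.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "admit,gre,gpa,rank",
  "0,380,3.61,3",
  "1,660,3.67,3",
  "1,800,4.00,1",
  "1,640,3.19,4",
  "0,520,2.93,4",
  "1,760,3.00,2"
), csv_path)

# Read the file and assign the resulting data frame to dat1.
dat1 <- read.csv(csv_path)

# head() shows the first 6 rows by default; n changes how many are shown.
head(dat1)
head(dat1, n = 3)
```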
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
Suppose we want a vector holding the sequence of integers from 1 to 100, stored as an object called dat2:
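One way to build such a sequence is the colon operator, sketched below; the original code may have used a different function such as seq().

```r
# The colon operator builds the integer sequence 1, 2, ..., 100.
dat2 <- 1:100

# Typing the object's name prints its contents.
dat2

# length() confirms how many elements the vector holds.
length(dat2)
```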
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
To assign a vector of values, e.g. 2, 3, 6 and 15, to an object named dat3 (an array), you will need to use the c() function, which combines values into a vector:
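The sketch below is one set of commands that reproduces the four outputs shown; it is an assumption about what was run, inferred from the printed values.

```r
# Combine four numbers into a vector.
dat3 <- c(2, 3, 6, 15)

# A plain vector has no dimension attribute, so dim() returns NULL.
dim(dat3)

# Print the vector.
dat3

# Arithmetic is applied to every element in turn.
dat3 + 8
dat3 + 1.4
```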
## NULL
## [1] 2 3 6 15
## [1] 10 11 14 23
## [1] 3.4 4.4 7.4 16.4
To assign characters/names to an object:
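A minimal sketch, using the object name dat4 that the text refers to later:

```r
# Character values are written inside quotes.
dat4 <- c("two", "three", "six", "fifteen")
dat4
```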
## [1] "two" "three" "six" "fifteen"
You may want to form a data frame dat5 from objects dat3 and dat4
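This can be sketched with data.frame(), which binds vectors of equal length as columns. The two vectors are re-created here so that the example runs on its own.

```r
dat3 <- c(2, 3, 6, 15)
dat4 <- c("two", "three", "six", "fifteen")

# Each vector becomes a column, named after the object it came from.
dat5 <- data.frame(dat4, dat3)
dat5

# str() gives a compact overview of the columns and their types.
str(dat5)
```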
## dat4 dat3
## 1 two 2
## 2 three 3
## 3 six 6
## 4 fifteen 15
We will do much more of this during data import and manipulation.
2.3 Functions in R
So far we have created objects, viewed them, manipulated them and even destroyed/removed them from the workspace. More interest lies in using the objects for data analysis!
R functions are a special type of object with the power to change or manipulate other objects. If, say, we were interested in finding the mean of the array dat3 created previously, we would use the base R function of the same name, mean():
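The calls below are a plausible reconstruction that matches the printed results in order; which functions were actually used is an assumption inferred from those values.

```r
dat3 <- c(2, 3, 6, 15)

dat3           # print the vector
mean(dat3)     # arithmetic mean
sum(dat3)      # total of all elements
max(dat3)      # largest value
min(dat3)      # smallest value
length(dat3)   # number of elements
sqrt(dat3)     # square root of each element
```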
## [1] 2 3 6 15
## [1] 6.5
## [1] 26
## [1] 15
## [1] 2
## [1] 4
## [1] 1.414214 1.732051 2.449490 3.872983
We can find more summary measures of dat3 by using functions such as:
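For a numeric vector, summary() returns the six values shown in the output that follows. The object name s is hypothetical.

```r
dat3 <- c(2, 3, 6, 15)

# Minimum, quartiles, median, mean and maximum in one call.
s <- summary(dat3)
s
```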
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.75 4.50 6.50 8.25 15.00
We can do a little more with several of these functions, for example
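One such combination, sketched here as an assumption consistent with the printed result, computes the mean from sum() and length(). The object name manual_mean is hypothetical.

```r
dat3 <- c(2, 3, 6, 15)

# The mean is the total divided by the number of elements.
manual_mean <- sum(dat3) / length(dat3)
manual_mean
```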
## [1] 6.5
Even better, we can determine the standard deviation of dat3 without using the base R function sd(), by writing out the standard deviation formula in R code:
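A sketch of the sample standard deviation written out by hand: the square root of the sum of squared deviations from the mean, divided by n - 1. The names n and sd_manual are hypothetical; sd() is called afterwards only to confirm the result.

```r
dat3 <- c(2, 3, 6, 15)
n <- length(dat3)

# Sum of squared deviations from the mean, divided by n - 1, then rooted.
sd_manual <- sqrt(sum((dat3 - mean(dat3))^2) / (n - 1))
sd_manual

# The built-in function gives the same value.
sd(dat3)
```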
## [1] 5.91608
## [1] 5.91608
2.4 Data types and importation
Data types
A dataset may contain variables with different data types (e.g. dates, numeric values, categorical values, etc.), and these require different treatment during statistical analysis. R comfortably handles various data types, including these common ones:
- Doubles - these represent continuous variables, such as the weight, height, length or width of an object, animal or person. You can check whether a number is a double with is.double(), which answers TRUE or FALSE
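A short sketch with a made-up value; the variable name weight and the number are hypothetical.

```r
weight <- 62.5       # hypothetical weight in kg

is.double(weight)    # logical question: is this stored as a double?
typeof(weight)       # reports the storage type as text
```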
## [1] TRUE
## [1] "double"
- Integers - these are whole numbers, e.g. count variables such as the number of black rhinos in the Mt Kenya National Park
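In R a number is stored as an integer only when asked for explicitly, for instance with the L suffix. The sketch below uses a made-up count; the name rhinos and its value are hypothetical.

```r
rhinos <- 12L      # hypothetical count; the L suffix requests an integer

typeof(rhinos)     # "integer"
is.integer(12)     # FALSE: 12 without the L suffix is stored as a double
```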
## [1] "integer"
## [1] FALSE
- Logical - this data type takes the values FALSE or TRUE, and indicates if a condition is true or false (logical expressions)
## [1] FALSE
Logical expressions are built from comparison and logical operators:
> greater than
< less than
>= greater than or equal to
<= less than or equal to
== is equal to
!= is not equal to
& and
| or
! not
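The operators above can be tried out on a pair of made-up values; the names a and b are hypothetical.

```r
a <- 4
b <- 7

a > b              # FALSE
a <= b             # TRUE
a == 4             # TRUE
a != b             # TRUE
a < b & b < 10     # TRUE: both conditions hold
!(a > b)           # TRUE: negation of FALSE

# Comparisons work element by element on vectors.
c(1, 5, 9) >= 5    # FALSE TRUE TRUE
```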
- Characters - these represent text: a string of characters enclosed in quotes
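A minimal sketch using the four strings from the output that follows; the object name names_vec is hypothetical.

```r
names_vec <- c("Elk", "Dun", "Sar", "Mik")

names_vec
typeof(names_vec)   # storage type of the vector
nchar(names_vec)    # number of characters in each string
```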
## [1] "Elk" "Dun" "Sar" "Mik"
- Factor - represents categorical data whose values come from a fixed set of codes, e.g. sex (male/female), education (informal/formal), infection status (positive/negative), etc. Each individual code of a factor is called a level of the factor
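A sketch that builds a factor from the education values in the output that follows; the object name educ is hypothetical. By default the levels are the distinct values in alphabetical order.

```r
educ <- factor(c("primary", "secondary", "primary",
                 "university", "no education", "secondary"))

educ            # prints the values followed by the levels
levels(educ)    # the distinct categories
table(educ)     # how often each level occurs
```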
## [1] primary secondary primary university no education
## [6] secondary
## Levels: no education primary secondary university
## [1] "no education" "primary" "secondary" "university"
So far we have worked with vectors, arrays and data frames. Data frames have columns which may contain different data types, and are the most convenient data structure for analysis in R. The rest of the R course will mainly use data frames!
2.5 Importing and Data manipulation in R
Data may be imported into R in two ways:
- Accessing datasets "built in" to R, which come as part of the libraries loaded into the session
- Importing data from external data files (e.g. Excel spreadsheets, Access databases, MySQL, text files, SPSS files, SAS files, etc.)
2.5.1 In-built datasets
- These are datasets that come together with installed R libraries. To see a list of the available in-built datasets and a short description of each, use data()
- Bring in the WHONET data from the AMR package: a dataset with the exact structure of an export from WHONET (antimicrobial resistance, AMR, data). The first four rows are shown below.
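The two steps above can be sketched as follows. The object name available is hypothetical, and the requireNamespace() guard is an addition so that the code still runs when the AMR package is not installed.

```r
# List the datasets shipped with the currently attached packages.
# Calling data() without assignment opens the same list in a viewer.
available <- data()
head(available$results[, c("Package", "Item", "Title")])

# Load the AMR package, if installed, and show the first 4 rows of WHONET.
if (requireNamespace("AMR", quietly = TRUE)) {
  library(AMR)
  print(head(WHONET, 4))
}
```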
## Warning: package 'AMR' was built under R version 4.0.5
## Identification number Specimen number Organism Country
## 1 fe41d7bafa 1748 SPN Belgium
## 2 91f175ec37 1767 eco The Netherlands
## 3 cc4015056e 1343 eco The Netherlands
## 4 e864b692f5 1894 MAP Denmark
## Laboratory Last name First name Sex Age
## 1 National Laboratory of Belgium Abel B. F 68
## 2 National Laboratory of The Netherlands Delacroix F. M 89
## 3 National Laboratory of The Netherlands Steensen F. M 85
## 4 National Laboratory of Denmark Beyersdorf L. M 62
## Age category Date of admission Specimen date Specimen type
## 1 55-74 2005-01-12 2005-01-30 Urine
## 2 75+ 2006-07-30 2006-08-16 Urine
## 3 75+ 2014-03-05 2014-03-14 Urine
## 4 55-74 2014-10-22 2014-11-01 Urine
## Specimen type (Numeric) Reason Isolate number Organism type Serotype
## 1 2 Unknown 1748 Bacteria
## 2 2 Unknown 1767 Bacteria
## 3 2 Unknown 1343 Bacteria
## 4 2 Unknown 1894 Bacteria
## Beta-lactamase ESBL Carbapenemase MRSA screening test
## 1 FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE
## Inducible clindamycin resistance Comment Date of data entry AMP_ND10 AMC_ED20
## 1 FALSE 2005-01-30 S S
## 2 FALSE 2006-08-16 <NA> S
## 3 FALSE 2014-03-14 S S
## 4 FALSE 2014-11-01 R <NA>
## TZP_ED30 FEP_ED30 CTX_ED5 FOX_ED30 CAZ_ED10 CRO_ED30 CIP_ED5 AMK_ED30
## 1 S <NA> <NA> <NA> R <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> R <NA> <NA> <NA>
## 3 S <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> R <NA> <NA> <NA>
## GEN_ED10 TOB_ED10 SXT_ED1.2 IPM_ND10 PEN_ND1 AMP_ND2 AMC_ND2 CHL_ND30 VAN_ED5
## 1 R R S <NA> S S S <NA> S
## 2 S <NA> R <NA> R <NA> S <NA> S
## 3 <NA> <NA> <NA> <NA> S S S <NA> <NA>
## 4 R R <NA> <NA> R R <NA> <NA> <NA>
## OXA_ED1 ERY_ED15 CLI_ED2 TCY_ED30 RIF_ED5 PEN_EE AMP_EE CRO_EE CIP_EE
## 1 <NA> S <NA> S <NA> S S <NA> <NA>
## 2 <NA> S S S <NA> R <NA> <NA> <NA>
## 3 <NA> <NA> S <NA> <NA> S S <NA> <NA>
## 4 <NA> <NA> S <NA> <NA> R R <NA> <NA>
2.5.2 Import data from "external" files
These data could be in different formats, including:
- Comma separated values (.csv) files: among the easiest to deal with. Save your data as a .csv file and import it with read.csv(file.choose(), header = TRUE), where:
  - read.csv - tells R that the data are saved in .csv format
  - file.choose - opens a file browser window so that you can locate where the file is stored
  - header = TRUE - tells R that the first row contains the variable names
- Text files: imported in much the same way as above, using the function read.table; check its help page for the available options
- Excel files: you will need the package xlsReadWrite, with its functions read.xls and write.xls. This package is no longer available on CRAN; the readxl package, with its read_excel function, is a widely used alternative
- Databases: use the following packages (this may be done at a more advanced stage):
  - RODBC - provides an interface to ODBC-compliant databases (e.g. MS Access, MS SQL Server, Oracle)
  - RMySQL - provides an interface to a MySQL database
  - RSQLite - provides an interface to SQLite
- Other statistical software, such as S, SAS, Epi Info, Stata, SPSS, dBase, etc.: the foreign package, loaded with library(foreign), contains functions to read data saved in the formats used by the software above
## Warning: package 'foreign' was built under R version 4.0.5
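The import routes described in this section can be tried with the sketch below. It first writes a small made-up data frame to temporary files so that there is something to read; the data frame toy, its columns and all file paths are hypothetical. The foreign package is assumed to be present, as it usually ships with R.

```r
# Create small files to import; in practice these would already exist.
toy <- data.frame(id = 1:3, weight = c(61.2, 70.5, 58.9))
csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")
write.csv(toy, csv_file, row.names = FALSE)
write.table(toy, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Comma separated file. Interactively, file.choose() can replace the path:
#   from_csv <- read.csv(file.choose(), header = TRUE)
from_csv <- read.csv(csv_file, header = TRUE)
from_csv

# Tab-delimited text file: read.table() needs header and sep spelled out.
from_txt <- read.table(txt_file, header = TRUE, sep = "\t")
from_txt

# Stata file via the foreign package: write one, then read it back.
if (requireNamespace("foreign", quietly = TRUE)) {
  dta_file <- tempfile(fileext = ".dta")
  foreign::write.dta(toy, dta_file)
  from_dta <- foreign::read.dta(dta_file)
  print(from_dta)
}
```

The foreign package also provides read.spss() for SPSS files and read.epiinfo() for Epi Info files, which are called in the same way with a file path.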