Chapter 2 Getting started with R

2.1 Installing and using R packages

  • R comes with some pre-installed packages. You can see them in the Packages tab in the lower-right pane of RStudio.

  • However, you may need functions from packages that are not pre-installed. Such packages must be installed first. For example, the epidemiological package epitools is not pre-installed with R. To install it, type the following in the R console:

install.packages("epitools")

To install more than one package at once (here we also install the tidyverse, a collection of R packages for data science), use the code below:

install.packages(c("epitools","tidyverse"))
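
Running install.packages() again reinstalls packages that are already present. A minimal sketch of installing only what is missing is shown below; missing_packages is a hypothetical helper name chosen here, not a function that ships with R.

```r
# Hypothetical helper: return the names in `pkgs` that cannot be loaded,
# which we take to mean they are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("epitools", "tidyverse"))

# Install only when something is missing; the interactive() guard is our own
# choice so that scripts run in batch mode do not trigger downloads
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Calling missing_packages() on packages that are all installed returns an empty character vector, so nothing is installed.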

To load the two packages you have installed into the R session, use:

library(epitools)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
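# Alternatively, use require(); unlike library(), it returns FALSE with a warning (instead of stopping with an error) when a package is missing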
require(epitools)
require(tidyverse)
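
The practical difference between library() and require() shows up when a package is not installed. The sketch below uses a package name that is assumed not to exist, purely for illustration.

```r
# A package name assumed not to exist, used only for illustration
fake_pkg <- "notARealPackage12345"

# require() returns FALSE (with a warning, suppressed here) and the script continues
loaded <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)
loaded

# library() stops with an error instead; tryCatch() captures it so we can inspect it
result <- tryCatch(
  library(fake_pkg, character.only = TRUE),
  error = function(e) "library() raised an error"
)
result
```

Because a failed library() call stops the script immediately, it is usually the safer choice at the top of an analysis script.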

To see which packages have been loaded in the R session, use the command:

search()
##  [1] ".GlobalEnv"        "package:lubridate" "package:forcats"  
##  [4] "package:stringr"   "package:dplyr"     "package:purrr"    
##  [7] "package:readr"     "package:tidyr"     "package:tibble"   
## [10] "package:ggplot2"   "package:tidyverse" "package:epitools" 
## [13] "package:stats"     "package:graphics"  "package:grDevices"
## [16] "package:utils"     "package:datasets"  "package:methods"  
## [19] "Autoloads"         "package:base"
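
Since search() returns a character vector, it can also be queried from code. A small sketch, where is_attached is a hypothetical helper name introduced here:

```r
# TRUE if the stats package (attached by default) is on the search path
"package:stats" %in% search()

# Hypothetical helper wrapping the same check for any package name
is_attached <- function(pkg) paste0("package:", pkg) %in% search()

is_attached("base")
is_attached("notARealPackage12345")
```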

To learn more about a package, e.g. tidyverse, you can look up its reference manual using Rseek, or use the code:

help(package=tidyverse)

For information on a particular function in a loaded package, e.g. select, use:

?select

The command provides a description of the function, its usage, author(s), etc., and (very helpfully) examples of how to use the function.

2.2 Creating objects in R

An object is anything that takes a value. You can create an object by telling R to read a file, for example a .csv file from the UCLA site as shown below, and assigning the result a name, e.g. dat1:

dat1 <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

Object dat1 now exists in the workspace. To view the first few rows of the data (by default the first 6 rows), use:

head(dat1)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
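
Besides head(), a few other base R functions give a quick overview of a data frame. The sketch below rebuilds the six rows printed above by hand (under the name dat1_head, chosen here) so that it runs without a download.

```r
# The six rows printed above, rebuilt by hand so no download is needed
dat1_head <- data.frame(
  admit = c(0, 1, 1, 1, 0, 1),
  gre   = c(380, 660, 800, 640, 520, 760),
  gpa   = c(3.61, 3.67, 4.00, 3.19, 2.93, 3.00),
  rank  = c(3, 3, 1, 4, 4, 2)
)

dim(dat1_head)      # number of rows and columns
names(dat1_head)    # column names
str(dat1_head)      # type and first values of each column
head(dat1_head, 3)  # first 3 rows instead of the default 6
tail(dat1_head, 2)  # last 2 rows
```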

Suppose we want a vector holding the sequence of numbers from 1 to 100, stored as an object called dat2:

dat2 <- seq(1,100)
dat2
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100
# check more possibilities of the function "seq"
?seq
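
A few of the other possibilities of seq, sketched with arbitrary example values:

```r
by_ten <- seq(1, 100, by = 10)       # step of 10: 1, 11, 21, ..., 91
by_ten

fifths <- seq(0, 1, length.out = 5)  # 5 equally spaced values from 0 to 1
fifths

seq_len(5)                           # 1 2 3 4 5

# The colon operator gives the same sequence as seq(1, 100)
all(seq(1, 100) == 1:100)
```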

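To assign a vector of values, e.g. 2, 3, 6 and 15, to an object named dat3 (an array), you will need to use the combine function c(). Called with no arguments, c() returns NULL; called with values, it returns them as a vector: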

c()
## NULL
dat3 <- c(2,3,6,15)
dat3
## [1]  2  3  6 15
# You can add a constant to each member of the array
dat3 + 8
## [1] 10 11 14 23
dat3 + 7/5
## [1]  3.4  4.4  7.4 16.4
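
Note that in dat3 + 7/5 the division is carried out first, so 1.4 is added to every element. A short sketch of how parentheses change the result, and of arithmetic between two vectors (the second vector is an arbitrary example):

```r
dat3 <- c(2, 3, 6, 15)

# Division is evaluated before addition, so this adds 1.4 to each element
no_parens <- dat3 + 7/5
no_parens

# Parentheses force the addition first, then each result is divided by 5
with_parens <- (dat3 + 7) / 5
with_parens

# Two vectors of the same length are combined element by element
scaled <- dat3 * c(1, 10, 100, 1000)
scaled
```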

To assign characters/names to an object:

dat4 <- c("two","three","six","fifteen")
dat4
## [1] "two"     "three"   "six"     "fifteen"

You may want to form a data frame dat5 from the objects dat3 and dat4:

dat5 <- data.frame(dat4,dat3)
dat5
##      dat4 dat3
## 1     two    2
## 2   three    3
## 3     six    6
## 4 fifteen   15
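
The columns of a data frame can be given more descriptive names when it is created, and individual columns or rows can then be pulled out. A minimal sketch; the column names word and value are illustrative choices made here.

```r
dat3 <- c(2, 3, 6, 15)
dat4 <- c("two", "three", "six", "fifteen")

# `word` and `value` are illustrative column names
dat5 <- data.frame(word = dat4, value = dat3)

names(dat5)               # column names
dat5$value                # one column, returned as a vector
dat5[2, ]                 # second row, all columns
dat5[dat5$value > 5, ]    # only the rows where value exceeds 5
```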

We will do much more of this during data importation and manipulation.

2.3 Functions in R

So far we have created objects, viewed them and manipulated them; objects can also be destroyed/removed from the workspace with rm(). The real interest lies in using the objects for data analysis!
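
Removing objects has not been demonstrated so far. A minimal sketch using base R, with a throwaway object name chosen for illustration:

```r
scratch <- c(1, 2, 3)   # `scratch` is a throwaway name used for illustration
exists("scratch")       # the object is in the workspace
rm(scratch)             # remove it
exists("scratch")       # the object is gone
ls()                    # names of the objects that remain
```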

R functions are a special type of object with the power to change or manipulate other objects. If, say, we were interested in finding the mean of the array dat3 created previously, we would use the base R function of the same name, “mean”:

dat3
## [1]  2  3  6 15
mean(dat3)
## [1] 6.5
# try a few other R functions - most are self-explanatory
sum(dat3) 
## [1] 26
max(dat3)
## [1] 15
min(dat3)
## [1] 2
length(dat3)
## [1] 4
sqrt(dat3)
## [1] 1.414214 1.732051 2.449490 3.872983

We can find more summary measures of dat3 by using functions such as:

summary(dat3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.75    4.50    6.50    8.25   15.00
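
Each number reported by summary() can also be obtained from its own base R function, which is handy when only one of them is needed:

```r
dat3 <- c(2, 3, 6, 15)

median(dat3)                    # middle value: 4.5
quantile(dat3, c(0.25, 0.75))   # 1st and 3rd quartiles: 2.75 and 8.25
IQR(dat3)                       # interquartile range: 5.5
range(dat3)                     # minimum and maximum: 2 15
var(dat3)                       # sample variance: 35
```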

We can also combine several of these functions, for example to compute the mean by hand:

sum(dat3)/length(dat3)
## [1] 6.5

Even better, we can determine the standard deviation of dat3 without using the base R function “sd”, by writing out the standard deviation formula in R:

sqrt(sum((dat3 - mean(dat3))^ 2)/(length(dat3) - 1))
## [1] 5.91608
# same result while using the sd function in base R for standard deviation
sd(dat3)
## [1] 5.91608
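
The hand-written formula can be wrapped in a function of our own so that it can be reused on any numeric vector. A minimal sketch; my_sd is a hypothetical name chosen here.

```r
# `my_sd` is a hypothetical name; it wraps the sample standard deviation formula
my_sd <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - 1))
}

dat3 <- c(2, 3, 6, 15)
my_sd(dat3)

# Agrees with the base R function, up to floating-point tolerance
all.equal(my_sd(dat3), sd(dat3))
```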

2.4 Data types and importation

Data types

A dataset may contain variables of different data types (e.g. dates, numbers, categorical values, etc.), and these require different treatment during statistical analysis. R comfortably handles various data types, including these common ones:

  • Doubles - these represent continuous variables, such as the weight/height/length/width of an object, animal or person. You can check whether a number is a double using a logical test
x <- c(3.14,5.7)

is.double(x)
## [1] TRUE
typeof(x)
## [1] "double"
  • Integers - these are whole numbers, e.g. count variables such as the number of black rhinos in the Mt Kenya National Park
nrhinos <- as.integer(88)
typeof(nrhinos)
## [1] "integer"
# a number typed without as.integer() is stored as a double, even when its value is whole
nrhinos <- 88.0
is.integer(nrhinos)
## [1] FALSE
  • Logical - this data type takes the values FALSE or TRUE, and indicates if a condition is true or false (logical expressions)
g <- 99
h <- g < 88
h
## [1] FALSE

Logical expressions are built from comparison and logical operators:

  • > greater than

  • < less than

  • >= greater than or equal to

  • <= less than or equal to

  • == is equal to

  • != is not equal to

  • & and

  • ! not
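
These operators work element by element on vectors, and the resulting logical vector can be used to select values. The sketch below also uses | (or), which is a base R operator not listed above; the variable names are illustrative.

```r
x <- c(2, 3, 6, 15)

between <- x > 2 & x < 10   # FALSE TRUE TRUE FALSE
between

outside <- x < 3 | x > 10   # TRUE FALSE FALSE TRUE
outside

not_six <- !(x == 6)        # TRUE TRUE FALSE TRUE
not_six

x[x >= 6]                   # keep only the values that are at least 6: 6 15
sum(x != 3)                 # TRUE counts as 1, so this counts values not equal to 3: 3
```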

  • Characters - these represent text, written as a collection of characters between double quotes

rstudents <- c("Elk","Dun","Sar","Mik")
rstudents
## [1] "Elk" "Dun" "Sar" "Mik"
  • Factor - represents categorical data whose values come from a fixed set of codes, e.g. sex - male/female, education - informal/formal, infection status - positive/negative, etc. Each individual code is called a level of the factor
educ <- factor(c("primary","secondary","primary","university",
                 "no education","secondary"))
educ
## [1] primary      secondary    primary      university   no education
## [6] secondary   
## Levels: no education primary secondary university
levels(educ)
## [1] "no education" "primary"      "secondary"    "university"
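
By default the levels are sorted alphabetically. A short sketch of counting the levels and of declaring an explicit order; the ordering from "no education" up to "university" is an assumption made here for illustration.

```r
educ <- factor(c("primary", "secondary", "primary", "university",
                 "no education", "secondary"))

table(educ)        # number of observations per level
as.integer(educ)   # underlying integer codes: 2 3 2 4 1 3
nlevels(educ)      # 4

# Assumption: the levels have a natural order, lowest to highest
educ_ord <- factor(educ,
                   levels = c("no education", "primary",
                              "secondary", "university"),
                   ordered = TRUE)

# Ordered factors can be compared against a level
at_least_secondary <- educ_ord >= "secondary"
at_least_secondary   # FALSE TRUE FALSE TRUE FALSE TRUE
```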

So far we have worked with vectors, arrays and data frames. Data frames have columns which may contain different data types, and are a convenient data structure for analysis in R. The rest of the R course will mainly use data frames!
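
To illustrate, here is a sketch of a data frame whose columns each hold one of the data types described in this section. All values and column names are made up purely for illustration.

```r
# All values below are made up purely for illustration
students <- data.frame(
  name      = c("Elk", "Dun", "Sar", "Mik"),      # character
  height_cm = c(172.5, 160.2, 181.0, 168.4),      # double
  siblings  = c(2L, 0L, 3L, 1L),                  # integer (note the L suffix)
  passed    = c(TRUE, TRUE, FALSE, TRUE),         # logical
  educ      = factor(c("primary", "secondary",
                       "university", "secondary")), # factor
  stringsAsFactors = FALSE  # keeps `name` as character on R versions before 4.0
)

sapply(students, class)   # the class of each column
str(students)             # compact overview of the whole data frame

# Logical conditions on columns select rows
tall_and_passed <- students[students$passed & students$height_cm > 165, "name"]
tall_and_passed
```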

2.5 Importing and manipulating data in R

Data may be imported into R in two ways:

  1. Accessing datasets “built in” to R, which come as part of the packages loaded into the session
  2. Importing data from external data files (e.g. Excel spreadsheets, Access databases, MySQL, text files, SPSS files, SAS files, etc.)

2.5.1 In-built datasets

  • These are datasets that come together with installed R packages. To see a list of all available in-built datasets with a short description of each, use:
data()
  • Load the WHONET dataset, which has the exact structure of an export from WHONET (AMR data).
# first load the package
library(AMR) # if it is not installed yet, run install.packages("AMR") first
## Warning: package 'AMR' was built under R version 4.0.5
data(WHONET)
?WHONET
head(WHONET,4)
##   Identification number Specimen number Organism         Country
## 1            fe41d7bafa            1748      SPN         Belgium
## 2            91f175ec37            1767      eco The Netherlands
## 3            cc4015056e            1343      eco The Netherlands
## 4            e864b692f5            1894      MAP         Denmark
##                               Laboratory  Last name First name Sex Age
## 1         National Laboratory of Belgium       Abel         B.   F  68
## 2 National Laboratory of The Netherlands  Delacroix         F.   M  89
## 3 National Laboratory of The Netherlands   Steensen         F.   M  85
## 4         National Laboratory of Denmark Beyersdorf         L.   M  62
##   Age category Date of admission Specimen date Specimen type
## 1        55-74        2005-01-12    2005-01-30         Urine
## 2          75+        2006-07-30    2006-08-16         Urine
## 3          75+        2014-03-05    2014-03-14         Urine
## 4        55-74        2014-10-22    2014-11-01         Urine
##   Specimen type (Numeric)  Reason Isolate number Organism type Serotype
## 1                       2 Unknown           1748      Bacteria         
## 2                       2 Unknown           1767      Bacteria         
## 3                       2 Unknown           1343      Bacteria         
## 4                       2 Unknown           1894      Bacteria         
##   Beta-lactamase  ESBL Carbapenemase MRSA screening test
## 1          FALSE FALSE         FALSE               FALSE
## 2          FALSE FALSE         FALSE               FALSE
## 3          FALSE FALSE         FALSE               FALSE
## 4          FALSE FALSE         FALSE               FALSE
##   Inducible clindamycin resistance Comment Date of data entry AMP_ND10 AMC_ED20
## 1                            FALSE                 2005-01-30        S        S
## 2                            FALSE                 2006-08-16     <NA>        S
## 3                            FALSE                 2014-03-14        S        S
## 4                            FALSE                 2014-11-01        R     <NA>
##   TZP_ED30 FEP_ED30 CTX_ED5 FOX_ED30 CAZ_ED10 CRO_ED30 CIP_ED5 AMK_ED30
## 1        S     <NA>    <NA>     <NA>        R     <NA>    <NA>     <NA>
## 2     <NA>     <NA>    <NA>     <NA>        R     <NA>    <NA>     <NA>
## 3        S     <NA>    <NA>     <NA>     <NA>     <NA>    <NA>     <NA>
## 4     <NA>     <NA>    <NA>     <NA>        R     <NA>    <NA>     <NA>
##   GEN_ED10 TOB_ED10 SXT_ED1.2 IPM_ND10 PEN_ND1 AMP_ND2 AMC_ND2 CHL_ND30 VAN_ED5
## 1        R        R         S     <NA>       S       S       S     <NA>       S
## 2        S     <NA>         R     <NA>       R    <NA>       S     <NA>       S
## 3     <NA>     <NA>      <NA>     <NA>       S       S       S     <NA>    <NA>
## 4        R        R      <NA>     <NA>       R       R    <NA>     <NA>    <NA>
##   OXA_ED1 ERY_ED15 CLI_ED2 TCY_ED30 RIF_ED5 PEN_EE AMP_EE CRO_EE CIP_EE
## 1    <NA>        S    <NA>        S    <NA>      S      S   <NA>   <NA>
## 2    <NA>        S       S        S    <NA>      R   <NA>   <NA>   <NA>
## 3    <NA>     <NA>       S     <NA>    <NA>      S      S   <NA>   <NA>
## 4    <NA>     <NA>       S     <NA>    <NA>      R      R   <NA>   <NA>
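
If the AMR package is not available, the same steps can be tried on a dataset that ships with base R. The sketch below uses mtcars from the datasets package, which is attached by default:

```r
data(mtcars)      # load the dataset into the workspace
dim(mtcars)       # 32 rows and 11 columns
names(mtcars)     # variable names
head(mtcars, 4)   # first 4 rows
```

As with WHONET, typing ?mtcars in the console opens the description of the dataset.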

2.5.2 Import data from “external” files

  • These data could be in different formats including:

  • Comma-separated values (.csv) files: these are very easy to deal with. Save your data as a .csv file and import it as follows:

# mydata1 <- read.csv(file.choose(), header=TRUE)
  • read.csv - the function that tells R the data are saved in .csv format; in the line above it is used with:

  • file.choose - opens a file browser window so that you can locate the file

  • header=TRUE - instructs R to treat the first row as the variable names

  • Text files - imported in much the same way as above, using the function read.table; see

?read.table
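  • Excel files - older tutorials use the package xlsReadWrite with the functions read.xls and write.xls, but that package is no longer available on CRAN; the readxl package (function read_excel), which is installed along with the tidyverse, is a widely used alternative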

  • Databases - use the following packages:

  • RODBC - provides an interface to ODBC-compliant databases (e.g. MS Access, MS SQL Server, Oracle)

  • RMySQL - provides an interface to a MySQL database

  • RSQLite - provides an interface to SQLite

  • This may be done at a more advanced stage.

  • Importing from other statistical software such as S, SAS, Epi Info, Stata, SPSS, dBase, etc.

  • The foreign package contains functions to read data saved in the formats used by the software above

library(foreign)
## Warning: package 'foreign' was built under R version 4.0.5
help(package=foreign)
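
The read.csv and read.table functions described in this section can be tried without any external file by first writing a small data frame to a temporary file and then reading it back. This is only a sketch; the data frame and its values are made up for illustration.

```r
# A small made-up data frame to export and re-import
example_data <- data.frame(
  id   = 1:3,
  name = c("Elk", "Dun", "Sar"),
  stringsAsFactors = FALSE
)

# Round trip through a comma-separated file
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Round trip through a tab-separated text file
txt_path <- tempfile(fileext = ".txt")
write.table(example_data, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
from_txt <- read.table(txt_path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

from_csv
from_txt

# Remove the temporary files
unlink(c(csv_path, txt_path))
```

With your own data, replace the temporary path with the location of your file, or with file.choose() to pick it interactively.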