Chapter 3 Data Manipulation in R

3.1 Dereferencing

In R, dereferencing means getting a specific part of an object. This is perhaps one of the key operations you will frequently use when manipulating data

Lets start with an array:

a1 <- seq(from=11, to=53, by=3)
a1

##  [1] 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53

Supposing you want the 3rd and 6th element of the array a1

a1[c(3,6)]

## [1] 17 26

Form a matrix using the function matrix() on object a1

a2 <- matrix(a1, ncol=3, byrow=TRUE)
a2

##      [,1] [,2] [,3]
## [1,]   11   14   17
## [2,]   20   23   26
## [3,]   29   32   35
## [4,]   38   41   44
## [5,]   47   50   53

a3 <- matrix(a1, ncol=2, byrow=FALSE)

## Warning in matrix(a1, ncol = 2, byrow = FALSE): data length [15] is not a
## sub-multiple or multiple of the number of rows [8]

a3

##      [,1] [,2]
## [1,]   11   35
## [2,]   14   38
## [3,]   17   41
## [4,]   20   44
## [5,]   23   47
## [6,]   26   50
## [7,]   29   53
## [8,]   32   11

# selecting the 2nd column of the matrix
a2[,2]

## [1] 14 23 32 41 50

# selecting the 3rd row of the matrix
a2[3,]

## [1] 29 32 35

If we convert a2 into a dataframe named a4 - we can reference the columns (variables) as follows:

a4 <- data.frame(a2)
a4

##   X1 X2 X3
## 1 11 14 17
## 2 20 23 26
## 3 29 32 35
## 4 38 41 44
## 5 47 50 53

# selecting column 3
# variables (columns) can be addressed using the dollar sign $
a4$X3

## [1] 17 26 35 44 53

To get the names of the columns of dataframe, number of rows/columns, dimensions of the dataframe, length of a column use:

names(a4)

## [1] "X1" "X2" "X3"

nrow(a4)

## [1] 5

ncol(a4)

## [1] 3

dim(a4)

## [1] 5 3

length(a4$X2)

## [1] 5

To help learn on data manipulation in R, I have provided you with two .csv datasets; ideal1.csv and ideal2.csv. The two datasets come from a longitudinal cohort study (Infectious Diseases of East African Livestock - IDEAL study) investigating the infection experiences of 548 calves during their first year of life. The study was conducted in Western Kenya between 2007 and 2010. You can read about the study from this publication and this web page

The data provided to you is from a (30) of the 548 IDEAL study subjects, with only a select few of the variables collected.

Briefly:

Calves were recruited in the first 7 days following birth
Questionnaires capturing household characteristics, farm management practices etc collected during recruitment visit
Follow up was done every 5 weeks
Packed Cell Volume, live weight, infection status with different pathogens (among other parameters) collected every 5 weeks

If you have an internet connection - you could run the following codes on your console

ideal1 <- read.csv("https://raw.githubusercontent.com/cema-uonbi/R_course/main/ideal1.csv")
ideal2 <- read.csv("https://raw.githubusercontent.com/cema-uonbi/R_course/main/ideal2.csv")

Quick look through the attributes of the dataset

head(ideal1, 5) # first 5 data rows (default = 6)

##        CalfID CalfSex sublocation     CADOB ReasonsLoss           Education
## 1 CA020610151       1      Karisa 3/11/2007          NA    Secondary school
## 2 CA020610152       1      Karisa 4/10/2007          NA No formal education
## 3 CA010310064       1      Kokare  3/1/2008           7      Primary school
## 4 CA031310367       1   Yiro West  4/6/2008           7      Primary school
## 5 CA051910545       2    Bujwanga  7/4/2008          NA      Primary school
##   Distance_water RecruitWeight
## 1   At household          19.0
## 2          <1 km          23.0
## 3          <1 km          16.5
## 4         1-5 km          20.0
## 5          <1 km          20.0

tail(ideal1)    # last 6 data rows of data ideal2

##         CalfID CalfSex sublocation      CADOB ReasonsLoss        Education
## 25 CA051810529       2      Luanda 25/02/2009          NA   Primary school
## 26 CA041510443       2   Namboboto 25/07/2009          NA Secondary school
## 27 CA051810538       1      Luanda 25/09/2009          NA Secondary school
## 28 CA010210041       1      Kidera 27/06/2008          NA   Primary school
## 29 CA030810228       1       Igero 30/12/2008          NA   Primary school
## 30 CA010110028       1 East Siboti 31/07/2009           7   Primary school
##    Distance_water RecruitWeight
## 25          <1 km          16.5
## 26          <1 km          19.5
## 27          <1 km          18.0
## 28         1-5 km          16.5
## 29          <1 km          22.0
## 30          <1 km          19.0

# dimensions number of rows and columns
dim(ideal1)

## [1] 30  8

dim(ideal2)

## [1] 270   9

## Attributes of the dataset
attributes(ideal1)

## $names
## [1] "CalfID"         "CalfSex"        "sublocation"    "CADOB"         
## [5] "ReasonsLoss"    "Education"      "Distance_water" "RecruitWeight" 
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30

# summary of the dataset
summary(ideal1)

##     CalfID             CalfSex      sublocation           CADOB          
##  Length:30          Min.   :1.000   Length:30          Length:30         
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :1.000   Mode  :character   Mode  :character  
##                     Mean   :1.333                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##                                                                          
##   ReasonsLoss      Education         Distance_water     RecruitWeight  
##  Min.   : 6.000   Length:30          Length:30          Min.   :14.00  
##  1st Qu.: 7.000   Class :character   Class :character   1st Qu.:17.00  
##  Median : 7.000   Mode  :character   Mode  :character   Median :19.25  
##  Mean   : 8.667                                         Mean   :19.25  
##  3rd Qu.: 7.000                                         3rd Qu.:20.38  
##  Max.   :18.000                                         Max.   :29.00  
##  NA's   :21

# description of the variables
typeof(ideal1$CalfSex)

## [1] "integer"

typeof(ideal1$sublocation)

## [1] "character"

We now need to change CalfSex from integer to factor and recode the values from 1 to Male and 2 to Female

is.factor(ideal1$CalfSex)

## [1] FALSE

ideal1$CalfSex <- factor(ideal1$CalfSex)
is.factor(ideal1$CalfSex)

## [1] TRUE

# determmine levels of factor calfsex
# recode to ``Male" and ``Female"

levels(ideal1$CalfSex)

## [1] "1" "2"

# recode variable CalfSex
# first create a new variable called CalfSex1
# assign it values of CalfSex
ideal1$CalfSex1 <- ideal1$CalfSex
table(ideal1$CalfSex1)

## 
##  1  2 
## 20 10

We use the ``magical” ifelse statement to recode the values

ideal1$CalfSex1 <- ifelse(ideal1$CalfSex1==1,"Male","Female")
table(ideal1$CalfSex1)

## 
## Female   Male 
##     10     20

# check consistency of the two columns with CalfSex data

table(ideal1$CalfSex,ideal1$CalfSex1)

##    
##     Female Male
##   1      0   20
##   2     10    0

3.1.1 Merging data

Combining different tables with related data is a common need
It is easy in R, using the function merge
Here we will merge ideal1 which has 30 records, with ideal2 with 270 records
You need an identifier variable in each table that will be used to link the databases

head(ideal1,2)

##        CalfID CalfSex sublocation     CADOB ReasonsLoss           Education
## 1 CA020610151       1      Karisa 3/11/2007          NA    Secondary school
## 2 CA020610152       1      Karisa 4/10/2007          NA No formal education
##   Distance_water RecruitWeight CalfSex1
## 1   At household            19     Male
## 2          <1 km            23     Male

head(ideal2,2)

##     VisitID      CalfID  VisitDate Weight ManualPCV Theileria.spp. ELISA_mutans
## 1 VRC010151 CA020610151  9/11/2007     19      37.0              0            0
## 2 VRC010152 CA020610152 11/10/2007     23      35.8              0            0
##   ELISA_parva Q.Strongyle.eggs
## 1           0               NA
## 2           0               NA

##CalfID is the unique identifier we will use to combine the two datasets
ideal3 <- merge(ideal1,ideal2,by.x="CalfID",by.y="CalfID", all.x=T)

dim(ideal1)

## [1] 30  9

dim(ideal2)

## [1] 270   9

dim(ideal3)

## [1] 270  17

head(ideal3,4)

##        CalfID CalfSex sublocation      CADOB ReasonsLoss      Education
## 1 CA010110028       1 East Siboti 31/07/2009           7 Primary school
## 2 CA010210036       2      Kidera  12/3/2008           7 Primary school
## 3 CA010210036       2      Kidera  12/3/2008           7 Primary school
## 4 CA010210036       2      Kidera  12/3/2008           7 Primary school
##   Distance_water RecruitWeight CalfSex1   VisitID  VisitDate Weight ManualPCV
## 1          <1 km            19     Male VRC010028   3/8/2009     19        40
## 2         1-5 km            15   Female VRC010036 18/03/2008     15        32
## 3         1-5 km            15   Female VRC060036 23/04/2008     20        38
## 4         1-5 km            15   Female VRC160036   2/7/2008     34        26
##   Theileria.spp. ELISA_mutans ELISA_parva Q.Strongyle.eggs
## 1              0           NA          NA               NA
## 2              0            0           0               NA
## 3              0            0           0              250
## 4              0            0           0             2300

3.1.2 Subsetting data

# If we wanted data from only one sublocation - Kidera

idealkidera <- subset(ideal3, ideal3$sublocation=="Kidera")
table(idealkidera$sublocation)

## 
## Kidera 
##     24

# For data on only females from Kidera

idealkideraF <- subset(ideal3, ideal3$sublocation=="Kidera" & 
                         ideal3$CalfSex1=="female")

3.1.3 Substring

It is used to extract substrings in a character vector.
We will extract the week from the VisitID of each row using the substring function.

 # use the R help function (?substring) to try out how you can extract the 4th and 5th digits of the variable visitID