Chapter 3 Data Manipulation in R
3.1 Dereferencing
In R, dereferencing means getting a specific part of an object. This is perhaps one of the key operations you will frequently use when manipulating data
- Lets start with an array:
## [1] 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53
Supposing you want the 3rd and 6th element of the array a1
## [1] 17 26
Form a matrix using the function matrix() on object a1
## [,1] [,2] [,3]
## [1,] 11 14 17
## [2,] 20 23 26
## [3,] 29 32 35
## [4,] 38 41 44
## [5,] 47 50 53
## Warning in matrix(a1, ncol = 2, byrow = FALSE): data length [15] is not a
## sub-multiple or multiple of the number of rows [8]
## [,1] [,2]
## [1,] 11 35
## [2,] 14 38
## [3,] 17 41
## [4,] 20 44
## [5,] 23 47
## [6,] 26 50
## [7,] 29 53
## [8,] 32 11
## [1] 14 23 32 41 50
## [1] 29 32 35
If we convert a2 into a dataframe named a4 - we can reference the columns (variables) as follows:
## X1 X2 X3
## 1 11 14 17
## 2 20 23 26
## 3 29 32 35
## 4 38 41 44
## 5 47 50 53
## [1] 17 26 35 44 53
To get the names of the columns of dataframe, number of rows/columns, dimensions of the dataframe, length of a column use:
## [1] "X1" "X2" "X3"
## [1] 5
## [1] 3
## [1] 5 3
## [1] 5
To help learn on data manipulation in R, I have provided you with two .csv datasets; ideal1.csv and ideal2.csv. The two datasets come from a longitudinal cohort study (Infectious Diseases of East African Livestock - IDEAL study) investigating the infection experiences of 548 calves during their first year of life. The study was conducted in Western Kenya between 2007 and 2010. You can read about the study from this publication and this web page
The data provided to you is from a (30) of the 548 IDEAL study subjects, with only a select few of the variables collected.
Briefly:
Calves were recruited in the first 7 days following birth
Questionnaires capturing household characteristics, farm management practices etc collected during recruitment visit
Follow up was done every 5 weeks
Packed Cell Volume, live weight, infection status with different pathogens (among other parameters) collected every 5 weeks
If you have an internet connection - you could run the following codes on your console
ideal1 <- read.csv("https://raw.githubusercontent.com/cema-uonbi/R_course/main/ideal1.csv")
ideal2 <- read.csv("https://raw.githubusercontent.com/cema-uonbi/R_course/main/ideal2.csv")
Quick look through the attributes of the dataset
## CalfID CalfSex sublocation CADOB ReasonsLoss Education
## 1 CA020610151 1 Karisa 3/11/2007 NA Secondary school
## 2 CA020610152 1 Karisa 4/10/2007 NA No formal education
## 3 CA010310064 1 Kokare 3/1/2008 7 Primary school
## 4 CA031310367 1 Yiro West 4/6/2008 7 Primary school
## 5 CA051910545 2 Bujwanga 7/4/2008 NA Primary school
## Distance_water RecruitWeight
## 1 At household 19.0
## 2 <1 km 23.0
## 3 <1 km 16.5
## 4 1-5 km 20.0
## 5 <1 km 20.0
## CalfID CalfSex sublocation CADOB ReasonsLoss Education
## 25 CA051810529 2 Luanda 25/02/2009 NA Primary school
## 26 CA041510443 2 Namboboto 25/07/2009 NA Secondary school
## 27 CA051810538 1 Luanda 25/09/2009 NA Secondary school
## 28 CA010210041 1 Kidera 27/06/2008 NA Primary school
## 29 CA030810228 1 Igero 30/12/2008 NA Primary school
## 30 CA010110028 1 East Siboti 31/07/2009 7 Primary school
## Distance_water RecruitWeight
## 25 <1 km 16.5
## 26 <1 km 19.5
## 27 <1 km 18.0
## 28 1-5 km 16.5
## 29 <1 km 22.0
## 30 <1 km 19.0
## [1] 30 8
## [1] 270 9
## $names
## [1] "CalfID" "CalfSex" "sublocation" "CADOB"
## [5] "ReasonsLoss" "Education" "Distance_water" "RecruitWeight"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30
## CalfID CalfSex sublocation CADOB
## Length:30 Min. :1.000 Length:30 Length:30
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :1.000 Mode :character Mode :character
## Mean :1.333
## 3rd Qu.:2.000
## Max. :2.000
##
## ReasonsLoss Education Distance_water RecruitWeight
## Min. : 6.000 Length:30 Length:30 Min. :14.00
## 1st Qu.: 7.000 Class :character Class :character 1st Qu.:17.00
## Median : 7.000 Mode :character Mode :character Median :19.25
## Mean : 8.667 Mean :19.25
## 3rd Qu.: 7.000 3rd Qu.:20.38
## Max. :18.000 Max. :29.00
## NA's :21
## [1] "integer"
## [1] "character"
- We now need to change CalfSex from integer to factor and recode the values from 1 to Male and 2 to Female
## [1] FALSE
## [1] TRUE
## [1] "1" "2"
# recode variable CalfSex
# first create a new variable called CalfSex1
# assign it values of CalfSex
ideal1$CalfSex1 <- ideal1$CalfSex
table(ideal1$CalfSex1)
##
## 1 2
## 20 10
- We use the ``magical” ifelse statement to recode the values
##
## Female Male
## 10 20
##
## Female Male
## 1 0 20
## 2 10 0
3.1.1 Merging data
Combining different tables with related data is a common need
It is easy in R, using the function merge
Here we will merge ideal1 which has 30 records, with ideal2 with 270 records
You need an identifier variable in each table that will be used to link the databases
## CalfID CalfSex sublocation CADOB ReasonsLoss Education
## 1 CA020610151 1 Karisa 3/11/2007 NA Secondary school
## 2 CA020610152 1 Karisa 4/10/2007 NA No formal education
## Distance_water RecruitWeight CalfSex1
## 1 At household 19 Male
## 2 <1 km 23 Male
## VisitID CalfID VisitDate Weight ManualPCV Theileria.spp. ELISA_mutans
## 1 VRC010151 CA020610151 9/11/2007 19 37.0 0 0
## 2 VRC010152 CA020610152 11/10/2007 23 35.8 0 0
## ELISA_parva Q.Strongyle.eggs
## 1 0 NA
## 2 0 NA
##CalfID is the unique identifier we will use to combine the two datasets
ideal3 <- merge(ideal1,ideal2,by.x="CalfID",by.y="CalfID", all.x=T)
dim(ideal1)
## [1] 30 9
## [1] 270 9
## [1] 270 17
## CalfID CalfSex sublocation CADOB ReasonsLoss Education
## 1 CA010110028 1 East Siboti 31/07/2009 7 Primary school
## 2 CA010210036 2 Kidera 12/3/2008 7 Primary school
## 3 CA010210036 2 Kidera 12/3/2008 7 Primary school
## 4 CA010210036 2 Kidera 12/3/2008 7 Primary school
## Distance_water RecruitWeight CalfSex1 VisitID VisitDate Weight ManualPCV
## 1 <1 km 19 Male VRC010028 3/8/2009 19 40
## 2 1-5 km 15 Female VRC010036 18/03/2008 15 32
## 3 1-5 km 15 Female VRC060036 23/04/2008 20 38
## 4 1-5 km 15 Female VRC160036 2/7/2008 34 26
## Theileria.spp. ELISA_mutans ELISA_parva Q.Strongyle.eggs
## 1 0 NA NA NA
## 2 0 0 0 NA
## 3 0 0 0 250
## 4 0 0 0 2300