Chapter 4 Data wrangling

  • The goal of data wrangling is to get your data into a useful, effective and efficient form for visualization and analysis.

  • Preparation of data (collecting, organizing, cleaning, manipulating) before analysis is a step often undervalued, and yet it takes most of the time!

  • “There’s never time to do it right … but there’s always time to do it over!” John Bergman

The first step is to develop a data analysis plan where you:

  • Define a data vision/roadmap for your use
  • (A priori) identify which data to use and how to use it
  • Plan data storage, security, documentation and sharing

Next, you think through the data ownership and the data ecosystem:

  • Data owners: responsible for:
    • data access/granting others access
    • how the data is used
  • Data analysts/managers: responsible for:
    • data quality
    • the meaning of the data
  • Data custodians: responsible for:
    • data archiving and maintenance
    • data security

The process from data collection to data analysis and archiving should follow the FAIR principles, which stand for Findable, Accessible, Interoperable and Reusable.

  • Following the FAIR principles, several R packages have been developed to ease the data wrangling process while ensuring the code/process is reproducible. The tidyverse packages are among the easiest to use, and you can read more about them here [https://www.tidyverse.org].

  • We will be using these packages throughout this course. To install them, use the command:

install.packages("tidyverse")
# installs packages tidyr; dplyr; ggplot2; readr; purrr; tibble; stringr; forcats; lubridate; haven

4.1 Tools for manipulating data

library(tidyverse)
# loads packages tidyr; dplyr; ggplot2; readr; purrr; tibble; stringr; forcats; lubridate; haven
  • library(tidyr): It helps you create tidy data
    • pivot_longer() # transposes your data to long format
    • pivot_wider() # transposes your data to wide format
    • separate() # changes a single column into multiple columns
    • unite() # combines multiple columns into a single column
  • library(dplyr): It has functions for data manipulation and some of these include:
    • select() # picks columns based on their names
    • filter() # picks cases based on their values
    • group_by() # manipulates data by groups
    • summarise() #reduces multiple values to a single summary
    • arrange() #changes the ordering of rows
    • join() # combines multiple datasets into one
    • mutate() # adds new columns or manipulates existing columns
    • recode() # recode the data
  • library(ggplot2): Used for creating graphics (the gg stands for grammar of graphics)
    • ggplot() #supply a dataset for graphics
    • geom_histogram() #for plotting histograms
    • geom_line() # for plotting line graphs
    • geom_bar() # for plotting bar graphs
    • scale_colour_brewer() # for adding color to the graphs
    • facet_wrap() # giving faceting specifications for graphs
  • library(readr): Provides a fast and friendly way to import data into R
    • read_csv() # import comma separated files
    • read_csv2() # import semi-colon delimited files (common in countries where , is used as the decimal separator)
    • read_delim() # import files with any delimiter
    • read_fwf() # import fixed-width files
    • read_tsv() # import tab-delimited files
    • read_table() # import files where columns are separated by white-space
    • write_csv() # write data to your computer in csv format
    • write_tsv() # write data to your computer in tab delimited format
  • library(haven)
    • read_sas() # read SAS files
    • read_sav() # read SPSS files
    • read_dta() # read Stata files
  • library(stringr): clean data in columns, mostly character data
    • str_subset() # subset data based on an expression
    • str_replace() # replace the matches with new text
    • str_split() # split up a string into multiple pieces
    • str_to_lower() # change column data into lower case
    • str_to_sentence() # change column data into sentence case (capitalise the first letter)
    • str_to_title() # change column data into title case (capitalise every first letter of a word)
    • str_to_upper() # change column data into upper case (capitalise the data)
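
As a quick illustration, here is a minimal sketch of unite() and separate() on a made-up tibble (not one of the course datasets):

library(tidyverse)

toy <- tibble(animal = c("A1", "A2"),
              year   = c(2010, 2011),
              month  = c(3, 7))

toy_united <- unite(toy, "birth", year, month, sep = "-")                       # combine year and month into one column
toy_back   <- separate(toy_united, birth, into = c("year", "month"), sep = "-") # split it back into two columns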

4.2 Making datasets tidy

  • Variable: a measure (values) of the same attribute across observation units (e.g. weight, age, infection status)

    • Variable = column
  • Observation: a unit, e.g. a person, and all the variable values associated with that unit

    • Observation = row

    • Each type of observational unit forms a table
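
With the tidyverse loaded, a tidy dataset might look like the made-up sketch below: each column is one variable and each row is one observation (one calf):

tidy_example <- tibble(CalfID = c("CA01", "CA02", "CA03"),                       # one row per calf (observation)
                       weight = c(22, 18, 25),                                   # each column is one variable
                       age_weeks = c(4, 6, 5),
                       infection_status = c("positive", "negative", "negative"))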

4.3 Pipes in R %>%

  • The pipe, %>%, comes from the package “magrittr” and is loaded automatically with library(tidyverse)

  • It helps you write more readable code (see the short example after this list)

  • To learn about the tidyverse packages, we will use the ideal datasets introduced in Chapter 3.
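
Before we import anything, here is a minimal sketch of the pipe using a made-up numeric vector (not one of the course datasets):

x <- c(4, 9, 16)

round(mean(sqrt(x)), 1)                # without the pipe: nested calls, read from the inside out

x %>% sqrt() %>% mean() %>% round(1)   # with the pipe: the same steps, read from left to right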

4.3.1 Data importation

library(tidyverse)
ideal1 <- read_csv("https://raw.githubusercontent.com/ThumbiMwangi/R_sources/master/ideal1.csv")
ideal2 <- read_csv("https://raw.githubusercontent.com/ThumbiMwangi/R_sources/master/ideal2.csv")
  • When you import data using read_csv, it prints out the name and type of each column. From above, you can see all the columns of the two datasets, ideal1 and ideal2, and the data types of each. If you need a refresher on the data types, kindly go to chapter 2.4.

Now let us look at the dataset by executing the code below:

ideal1
## # A tibble: 30 × 8
##    CalfID      CalfSex sublocation  CADOB   ReasonsLoss Education Distance_water
##    <chr>         <dbl> <chr>        <chr>         <dbl> <chr>     <chr>         
##  1 CA020610151       1 Karisa       3/11/2…          NA Secondar… At household  
##  2 CA020610152       1 Karisa       4/10/2…          NA No forma… <1 km         
##  3 CA010310064       1 Kokare       3/1/20…           7 Primary … <1 km         
##  4 CA031310367       1 Yiro West    4/6/20…           7 Primary … 1-5 km        
##  5 CA051910545       2 Bujwanga     7/4/20…          NA Primary … <1 km         
##  6 CA051810516       1 Luanda       11/3/2…          NA Primary … <1 km         
##  7 CA010210036       2 Kidera       12/3/2…           7 Primary … 1-5 km        
##  8 CA052010596       2 Magombe East 2/9/20…          NA No forma… 1-5 km        
##  9 CA041610476       1 Ojwando B    4/9/20…          NA Primary … <1 km         
## 10 CA051910566       2 Bujwanga     5/9/20…          NA No forma… <1 km         
## # ℹ 20 more rows
## # ℹ 1 more variable: RecruitWeight <dbl>
  • You can see that at the top it says the object is a “tibble”. Tidyverse functions return tibbles rather than plain data frames. You may not notice much difference between a data frame and a tibble, but a tibble is sometimes described as a “lazy data frame”: it does not change the data types of your columns, and it handles large datasets well. By default, a tibble prints only the columns that fit in the window and the first 10 rows.

  • The 30 represents the number of rows while the 8 represents the number of columns in the dataset.

4.3.2 Exercise

Before you continue, I would like you to remind yourself about the data types in R. If you do not recall them, kindly go and read chapter 2.5

  • Change the CADOB column in ideal1 into a date

  • Change the VisitDate column in ideal2 into a date (a possible approach is sketched after this list)

  • To read data in other formats, you may use the following packages:

    • haven: imports Stata, SPSS and SAS files into R
    • readxl: imports Excel files into R (both .xls and .xlsx)
    • DBI: allows you to import data from an SQL database
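
For the date exercise above, one possible approach uses lubridate (loaded with the tidyverse). This sketch assumes the dates are recorded as day/month/year; if they are month/day/year, use mdy() instead of dmy():

ideal1 <- ideal1 %>%
  mutate(CADOB = dmy(CADOB))          # convert the character dates to Date objects

ideal2 <- ideal2 %>%
  mutate(VisitDate = dmy(VisitDate))  # same conversion for the second dataset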

4.4 recode()

  • As mentioned above, this function recodes the values in a column. We will recode the CalfSex column
ideal1 <- ideal1 %>% # the pipe passes ideal1 into the next step
  mutate(CalfSex = recode(CalfSex, "1" = "Male", "2" = "Female"))
# Kindly try it yourself by recoding ReasonsLoss to:
# 6  = Owner declined
# 7  = Death
# 12 = Relocation
# 18 = Given out
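
If you want to check your attempt, here is one possible solution sketch (it assumes ReasonsLoss is numeric, as the printout above suggests; any unmatched codes would become NA):

ideal1 <- ideal1 %>%
  mutate(ReasonsLoss = recode(ReasonsLoss,
                              `6`  = "Owner declined",
                              `7`  = "Death",
                              `12` = "Relocation",
                              `18` = "Given out"))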

4.5 pivot_longer()

  • This function gathers multiple columns into key-value pairs, making the dataset longer (more rows, fewer columns)

ideal2a<- ideal2%>%
  pivot_longer(cols=c(ManualPCV, Theileria.spp., ELISA_mutans, ELISA_parva,Q.Strongyle.eggs))
  • pivot_wider() - complements pivot_longer()

Try changing the data back to a wide format. Use ?pivot_wider to open the built-in R help (one possible answer is sketched below).
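
This sketch assumes the default name and value columns created by pivot_longer() above, and that the remaining columns uniquely identify each observation:

ideal2b <- ideal2a %>%
  pivot_wider(names_from = name, values_from = value)  # spread the name/value pairs back into columns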

  • arrange() orders rows of a data frame by the values of selected columns.
ideal1<- ideal1%>%
  arrange(CADOB)

4.6 Grouping data

ideal1a<- ideal1%>%
  group_by(CalfID)%>%
  mutate(minweight=min(RecruitWeight))
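
group_by() is more informative when each group contains several rows. The sketch below uses the sublocation and RecruitWeight columns shown earlier to add a group-level summary column:

ideal1b <- ideal1 %>%
  group_by(sublocation) %>%                      # one group per sublocation
  mutate(mean_weight = mean(RecruitWeight)) %>%  # mean recruit weight within each sublocation
  ungroup()                                      # remove the grouping once done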

4.7 A sample data set

  • We can also use a sample dog demography dataset from a previous study.
dogdemography <- read_csv("https://raw.githubusercontent.com/cema-uonbi/R_course/main/DogcohortDemographics.csv") # import the data
dim(dogdemography)
## [1] 1207   11
head(dogdemography)
## # A tibble: 6 × 11
##   IntDate HousehldID VillageID HhMmbrs OwnDogs DgsOwnd AdltDgs Puppies DogDied
##   <chr>   <chr>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1/30/17 13-172-2          13       4       0       0       0       0       0
## 2 1/30/17 2-87-7             2       6       0       0       0       0       0
## 3 1/30/17 49-1-1            49       1       0       0       0       0       0
## 4 1/30/17 13-224-3          13       9       1       1       1       0       0
## 5 1/30/17 2-109-5            2       3       0       0       0       0       0
## 6 1/30/17 67-9-2            67      11       0       0       0       0       0
## # ℹ 2 more variables: NumDd <dbl>, DogBite <dbl>
# rename the variables for clarity
dogdemography <- dogdemography%>%rename(interviewDate = IntDate # format as date
                , householdID = HousehldID # format as character
                , villageID = VillageID # format as character
                , householdMembers = HhMmbrs # format as integer
                , ownDogs = OwnDogs # format as logical
                , numDogsOwned = DgsOwnd # format as integer
                , adultDogsOwned = AdltDgs # format as integer
                , puppiesOwned = Puppies # format as integer
                , dogDiedPastMonth = DogDied # format as logical
                , numDogsDiedPastMonth = NumDd # format as integer
                , dogBitesPastMonth = DogBite # format as logical 
                )
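
The comments above indicate the intended type of each column, but rename() only changes the names. One way to apply the conversions is sketched below; it assumes interviewDate is recorded as month/day/year, as the value 1/30/17 in the printout suggests:

dogdemography <- dogdemography %>%
  mutate(interviewDate        = mdy(interviewDate),            # character -> Date
         householdID          = as.character(householdID),
         villageID            = as.character(villageID),
         householdMembers     = as.integer(householdMembers),
         ownDogs              = as.logical(ownDogs),            # 0/1 -> FALSE/TRUE
         numDogsOwned         = as.integer(numDogsOwned),
         adultDogsOwned       = as.integer(adultDogsOwned),
         puppiesOwned         = as.integer(puppiesOwned),
         dogDiedPastMonth     = as.logical(dogDiedPastMonth),
         numDogsDiedPastMonth = as.integer(numDogsDiedPastMonth),
         dogBitesPastMonth    = as.logical(dogBitesPastMonth))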

4.8 summarise()

  • It is used to summarise each group down to fewer rows
dogdemography1<-dogdemography %>%
  group_by(villageID) %>% 
  summarize(count = n()) 
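
summarise() can also compute several summaries at once. The sketch below builds on the renamed columns above:

dogdemography1b <- dogdemography %>%
  group_by(villageID) %>%
  summarise(households   = n(),                                  # households interviewed per village
            total_dogs   = sum(numDogsOwned, na.rm = TRUE),      # dogs owned per village
            mean_members = mean(householdMembers, na.rm = TRUE)) # average household size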

4.9 filter() rows/observations with matching conditions

dogdemography2<-dogdemography%>%
  filter(adultDogsOwned+puppiesOwned == numDogsOwned)%>%
  select(householdID,adultDogsOwned,puppiesOwned,numDogsOwned)

4.10 distinct()

  • It removes duplicate rows
dogdemography2a<-dogdemography%>%
  group_by(numDogsOwned,adultDogsOwned,puppiesOwned)%>%
  select(adultDogsOwned,puppiesOwned,numDogsOwned)%>%
  distinct() 
dogbites<- dogdemography%>%
  group_by(dogBitesPastMonth)%>%
  summarize(count = n()) 
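
A common follow-up is to turn such counts into proportions. A minimal sketch building on the dogbites summary above:

dogbites <- dogbites %>%
  mutate(proportion = count / sum(count))  # share of households in each dog-bite category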
  • Play with your data

    • Don’t touch the original data
    • Make scripts and make sure that they are replicable
    • Don’t be afraid to play, experiment, probe
    • You want to like your data-manipulation tools, and get your data as clean as possible
  • Try them out

    • tidyr
      • pivot_longer()
      • pivot_wider()
      • separate()
      • unite()
    • dplyr
      • select()
      • filter()
      • group_by()
      • summarise()
      • arrange()
      • join()
      • mutate()

Summary

  • Let the computer serve you
    • Input in the form that you want to input
    • Manage in the tidy form that is efficient and logical
    • Output in the way you want to output
  • Be aggressive about exploring, questioning and cleaning