dplyr	lubridate	forcats
1.0.4	1.7.9.2	0.5.1

The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.

Run the following line to load the FAO data file into your current R session.

dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)

The `dplyr` basics

The basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The package dplyr offers some nifty and simple querying functions as shown in the next subsections. Some of dplyr’s key data manipulation functions are summarized in the following table:

`dplyr` function	Description
`filter()`	Subset by row values
`arrange()`	Sort rows by column values
`select()`	Subset columns
`mutate()`	Add columns
`summarise()`	Summarize columns

Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes will be discussed later in this section). For example:

# When used alone, dataframe dat is inside the select function
dat2 <- select(dat, column1)

# When used in a pipe, dataframe dat is outside of the select function
dat2 <- dat %>% select(column1)

Make sure to load the dplyr package before proceeding with the following examples.

library(dplyr)

`filter`: Subset by rows

Tables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with Oats:

dat.query1 <- filter(dat, Crop == "Oats")
summary(dat.query1)

   Country              Crop           Information             Year     
 Length:208         Length:208         Length:208         Min.   :1961  
 Class :character   Class :character   Class :character   1st Qu.:1974  
 Mode  :character   Mode  :character   Mode  :character   Median :1986  
                                                          Mean   :1986  
                                                          3rd Qu.:1999  
                                                          Max.   :2012  
     Value            Source         
 Min.   :  12667   Length:208        
 1st Qu.:  20392   Class :character  
 Median : 204821   Mode  :character  
 Mean   :1350024                     
 3rd Qu.:1638250                     
 Max.   :9666550

Note that R is case sensitive, so make sure that you respect each letter’s case (i.e. upper or lower).

We can expand our query by including both Oats, Buckwheat and limiting the country to Canada.

dat.query2 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada")
summary(dat.query2)

   Country              Crop           Information             Year     
 Length:200         Length:200         Length:200         Min.   :1961  
 Class :character   Class :character   Class :character   1st Qu.:1973  
 Mode  :character   Mode  :character   Mode  :character   Median :1986  
                                                          Mean   :1986  
                                                          3rd Qu.:1998  
                                                          Max.   :2012  
     Value            Source         
 Min.   :      0   Length:200        
 1st Qu.:  11822   Class :character  
 Median :  22197   Mode  :character  
 Mean   : 504098                     
 3rd Qu.: 924675                     
 Max.   :4277000

The character | is the Boolean operator OR. So in our example, the query can be read as “… crop equals oats OR crop equals buckwheat”. Had we used the AND operator, &, instead as in Crop == "Oats" & Crop == "Buckwheat" the output would have returned zero rows since a Crop value cannot be both Oats AND Buckwheat.

We can expand this query by limiting our output to the years 2005 to 2010

library(dplyr)
dat.query3 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada", 
                          Year >= 2005 & Year <= 2010)
summary(dat.query3)

   Country              Crop           Information             Year     
 Length:18          Length:18          Length:18          Min.   :2005  
 Class :character   Class :character   Class :character   1st Qu.:2006  
 Mode  :character   Mode  :character   Mode  :character   Median :2007  
                                                          Mean   :2007  
                                                          3rd Qu.:2008  
                                                          Max.   :2010  
     Value            Source         
 Min.   :   2000   Length:18         
 1st Qu.:  11500   Class :character  
 Median :  26615   Mode  :character  
 Mean   : 453810                     
 3rd Qu.: 961825                     
 Max.   :1815700

Note the use of the AND Boolean operator (&) instead of the OR operator (|) for the Year query. We want the Year value to satisfy two criteria simultaneously: greater than or equal to 2005 AND less than or equal to 2010. Had we used the | operator, R would have returned all years since all year values satisfy at least one of the two criterion.

`arrange`: Sort rows by column value

You can sort a table based on a column’s values. For example, to sort dat by crop name type:

dat.sort1 <- arrange(dat, Crop)
head(dat.sort1)

                   Country   Crop         Information Year      Value
1                   Canada Barley Area harvested (Ha) 2012 2060000.00
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24
5                   Canada Barley Area harvested (Ha) 2011 2364800.00
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43
           Source
1   Official data
2 Calculated data
3   Official data
4 Calculated data
5   Official data
6 Calculated data

tail(dat.sort1)

     Country      Crop         Information Year    Value
1496  Canada Triticale Area harvested (Ha) 1991  1093.00
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91
1498  Canada Triticale Area harvested (Ha) 1990  1074.00
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65
1500  Canada Triticale Area harvested (Ha) 1989  1093.00
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91
                                       Source
1496                            Official data
1497                          Calculated data
1498 FAO data based on imputation methodology
1499                          Calculated data
1500                             FAO estimate
1501                          Calculated data

By default, arrange sorts by ascending order. To sort by descending order, wrap the column name with the function desc(). For example, to sort the table by Crop in ascending order then by Year in descending order, type:

dat.sort2 <- arrange(dat, Crop, desc(Year))
head(dat.sort2)

                   Country   Crop         Information Year      Value
1                   Canada Barley Area harvested (Ha) 2012 2060000.00
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24
5                   Canada Barley Area harvested (Ha) 2011 2364800.00
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43
           Source
1   Official data
2 Calculated data
3   Official data
4 Calculated data
5   Official data
6 Calculated data

tail(dat.sort2)

     Country      Crop         Information Year    Value
1496  Canada Triticale Area harvested (Ha) 1991  1093.00
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91
1498  Canada Triticale Area harvested (Ha) 1990  1074.00
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65
1500  Canada Triticale Area harvested (Ha) 1989  1093.00
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91
                                       Source
1496                            Official data
1497                          Calculated data
1498 FAO data based on imputation methodology
1499                          Calculated data
1500                             FAO estimate
1501                          Calculated data

`select`: Subset by column

You can grab a subset of the table by column(s). To extract the columns Crop, Year and Value, type:

dat.subcol <- select(dat, Crop, Year, Value)
head(dat.subcol, 2)

    Crop Year      Value
1 Barley 2012 2060000.00
2 Barley 2012   38894.66

If you want all columns other than Crop, Year and Value, add the negative - symbol before the column name:

dat.subcol <- select(dat, -Crop, -Year, -Value)
head(dat.subcol, 2)

  Country         Information          Source
1  Canada Area harvested (Ha)   Official data
2  Canada       Yield (Hg/Ha) Calculated data

`mutate`: Creating and/or calculating column values

You can add columns (and compute their values) using the mutate function. For example, to add a column Ctr_abbr and assign it the abbreviated values CAN for Canada and USA for the United States of America based on the values in column Country type:

dat.extended <- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.extended,3)

  Country      Crop         Information Year      Value          Source
1  Canada    Barley Area harvested (Ha) 2012 2060000.00   Official data
2  Canada    Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3  Canada Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate
  Ctr_abbr
1      CAN
2      CAN
3      CAN

tail(dat.extended,3)

                      Country    Crop         Information Year      Value
1499 United States of America     Rye       Yield (Hg/Ha) 1961   11121.79
1500 United States of America Sorghum Area harvested (Ha) 1961 4445000.00
1501 United States of America Sorghum       Yield (Hg/Ha) 1961   27442.07
              Source Ctr_abbr
1499 Calculated data      USA
1500   Official data      USA
1501 Calculated data      USA

Here, we make use of an embedded function, ifelse, which performs a conditional operation: if the Country value is Canada return CAN if not, return USA.

Note that if you wish to rename a column, you can use the rename() function instead of mutate.

You can also use mutate to recompute column values. For example, to replace the Country column values with CAN or USA type:

dat.overwrite <- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.overwrite,3)

  Country      Crop         Information Year      Value          Source
1     CAN    Barley Area harvested (Ha) 2012 2060000.00   Official data
2     CAN    Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3     CAN Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate

tail(dat.overwrite,3)

     Country    Crop         Information Year      Value          Source
1499     USA     Rye       Yield (Hg/Ha) 1961   11121.79 Calculated data
1500     USA Sorghum Area harvested (Ha) 1961 4445000.00   Official data
1501     USA Sorghum       Yield (Hg/Ha) 1961   27442.07 Calculated data

`mutate_at`: Creating and/or calculating values across muliple columns

You might find yourself wanting to apply a same set of mutate operations across multiple variables. For example, given the following sample dataset,

met <- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
                  Dir  = c(181, 220, 15,  15,  99 ),
                  Prec = c(99 , 0.5,  0,  99,  99))
met

  Wind Dir Prec
1  3.4 181 99.0
2  5.0 220  0.5
3 99.0  15  0.0
4  4.1  15 99.0
5  1.5  99 99.0

where the value 99 is a placeholder for a missing value for the variables Wind and Prec but a valid value for Dir, we want to replace all missing values with NA. We could either create two mutate operations as in,

met2 <- mutate( met, Wind = ifelse(Wind == 99, NA, Wind),
                     Prec = ifelse(Prec == 99, NA, Prec))
met2

  Wind Dir Prec
1  3.4 181   NA
2  5.0 220  0.5
3   NA  15  0.0
4  4.1  15   NA
5  1.5  99   NA

or, we could reduce the separate mutate operations into a single mutate_at operation,

met2 <- mutate_at( met, vars(Wind, Prec),
                        list(~ ifelse( . == 99, NA, .)))
met2

  Wind Dir Prec
1  3.4 181   NA
2  5.0 220  0.5
3   NA  15  0.0
4  4.1  15   NA
5  1.5  99   NA

The dot . is a placeholder for each column listed in the vars() function. Also, note the ~ operator in lieu of . =.

`summarise`: Summarize columns

You can summarize (or “collapse”) one or more columns using the summarise function. For instance, to get the minimum and maximum years from the Year column, type:

summarise(dat, yr_min = min(Year), yr_max=max(Year))

  yr_min yr_max
1   1961   2012

Combining table manipulation functions using `%>%`

In most cases, you will find yourself wanting to combine several of dplyr’s data manipulation functions. For example,

dat.yield  <- filter(dat, Information == "Yield (Hg/Ha)", 
                          Crop == "Oats",
                          Year == 2012)
dat.rename <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.final  <- select(dat.rename, Country, Value)

head(dat.final, 3)

  Country    Value
1     CAN 24954.79
2     USA 21974.70

The downside to this approach is the creation of several intermediate objects (e.g. dat.yield and dat.rename). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.

Another approach to combining dplyr operations is to use the piping operator ,%>%, which daisy chains dplyr operations. So our previous workflow could look like:

dat.final <- dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Crop == "Oats",
         Year == 2012)  %>% 
  mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
  select(Country, Value)

head(dat.final, 3)

  Country    Value
1     CAN 24954.79
2     USA 21974.70

The chunk of code can be read as “… with the dat table, filter by …, then mutate …., then select …” with the result from one operation being passed on to the next using the %>% operator. Note that the filter, mutate and select functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table dat appears just once at the beginning of the pipe.

Conditional statements

The base `ifelse`

Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of 1 if an input value is less than 23 and a value of 0 otherwise, you can make use of the ifelse function as follows:

x <- c(12,102, 43, 20, 90, 0, 12, 6)
ifelse(x < 23, 1, 0)

[1] 1 0 0 1 0 1 1 1

The base ifelse function works as expected when the input/output values are numeric or character, but does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.

x <- as.factor( c("apple", "banana", "banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)

[1] "1"     "2"     "2"     "apple" "1"

The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. apple =1, banana=2, etc…). Likewise, if you wish to replace an erroneous date with a missing value you will get:

library(lubridate)
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
ifelse( year(y) != 2016, NA, y)

[1] 16823 16862    NA 17128

Here, ifelse converts the date object to its internal numeric representation as number of days since 1970.

if_else does not preserve the attributes that might be present in a vector. In other words, it will strip away the vector’s class. If you want to ensure that the vector’s class is preserved, a safer alternative is to use dplyr’s if_else function.

`dplyr`’s `if_else`

The if_else function (note the underscore _), will preserve data type but does so strictly. For example, the following code will return an error:

if_else( year(y) != 2016, NA, y)

Error: `false` must be a logical vector, not a `Date` object.

The output data types (NA and y) are different. NA is a logical data type (see typeof(NA)) and y is not. The solution is to force NA as a date object by wrapping it with a date function like mdy(NA) or the base as.Date(NA) function, or by replacing the logical NA with NA_Date_ which is found in the lubridate package.

if_else( year(y) != 2016, mdy(NA), y)

[1] "2016-01-23" "2016-03-02" NA           "2016-11-23"

Likewise, if the input vector is of type character, you need to ensure that all output values are characters too.

y <- c("apple", "banana", "banana", "pear", "apple")
if_else( y == "banana", as.character(NA), y)

[1] "apple" NA      NA      "pear"  "apple"

Recall from the Week 02 lecture notes that R has several NA reserved words for different data types (e.g. NA_character_ and NA_integer_ to name a few). So the last chunk of code could have been written as,

if_else( y == "banana", NA_character_, y)

[1] "apple" NA      NA      "pear"  "apple"

Note that there is no reserved word for an NA date or Posix type in the base R environment. But they are available in the lubridate package (NA_Date_ and NA_POSIXct_).

Recoding factors using `recode`

When working with factors, however, if_else (as of dplyr version 0.7) will produce the following error:

if_else(x == "pear", "apple", x)

Error: `false` must be a character vector, not a `factor` object.

The first outcome listed in the function defines the data type or class. Here, "apple" is a character, so it expects the FALSE outcome to be a character as well (here, the FALSE outcome is a factor). A better option for recoding factors is to use dplyr’s recode function:

recode(x , "pear" = "apple")

[1] apple  banana banana apple  apple 
Levels: apple banana

You can recode more than one factor level. In fact, you can even introduce new levels in the recoding scheme:

recode(x , "pear" = "apple", "banana" = "pineapple" )

[1] apple     pineapple pineapple apple     apple    
Levels: apple pineapple

As with if_else and case_when, recode is strict about preserving data types and classes. So if you want to recode a factor level to NA make sure to use NA_character_ or as.character(NA). recode will automatcially convert the character value to a factor if the input vector (x in our example) is a factor.

Note that recode can also be used with numeric and character data types (i.e. vectors devoid of class attributes). For example,

# Example of recoding a character vector
recode(y , "pear" = "apple", "banana" = "pineapple" )

[1] "apple"     "pineapple" "pineapple" "apple"     "apple"

Note on replacing NA factor levels

One operation you cannot perform with recode is converting an NA level to another factor level. For example, the following will generate an error message:

x[2] <- NA
x

[1] apple  <NA>   banana pear   apple 
Levels: apple banana pear

recode(x, NA = "other")

Error: unexpected '=' in "recode(x, NA ="

The simplest solution is to make use of a specialized factor package called forcats and its function, fct_explicit_na.

library(forcats)
x <- fct_explicit_na(x, na_level = "Other")
x

[1] apple  Other  banana pear   apple 
Levels: apple banana pear Other

Changing values based on multiple conditions: `case_when`

ifelse and if_else work great when a single set of conditions is to be satisfied. But if multiple sets of conditions are to be tested, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.

unit <- c("F","F","C", "K")
if_else( unit == "C", "Celsius", if_else(unit == "F", "Fahrenheit", "Kelvin"))

[1] "Fahrenheit" "Fahrenheit" "Celsius"    "Kelvin"

A simpler solution is to use the recode function discussed in the previous section.

recode(unit, "C" = "Celsius",
             "F" = "Fahrenheit",
             "K" = "Kelvin")

[1] "Fahrenheit" "Fahrenheit" "Celsius"    "Kelvin"

recode is well suited for replacing values but it will not allow for more complex operations. For example, given two vectors, unit and temp, we would like to convert all temp values to Fahrenheit by applying a temperature conversion dependent on the unit value.

temp <- c(45.2, 56.3, 11.0, 285)

This operation is best performed using the case_when function.

case_when(unit == "F" ~ temp,
          unit == "C" ~ (temp * 9/5) + 32,
          TRUE ~ (temp - 273.15) * 9/5 + 32)

[1] 45.20 56.30 51.80 53.33

The last parameter, TRUE ~, applies to all conditions not satisfied by the previous two conditions (otherwise, not doing so would return NA values).

Note that the order in which these conditions are listed matters since evaluation stops at the first TRUE outcome encountered. So, had the last condition been moved to the top of the stack, all temp values would be assigned the first conversion option.

# What not to do ...
case_when(TRUE ~ (temp - 273.15) * 9/5 + 32,
          unit == "F" ~ temp,
          unit == "C" ~ (temp * 9/5) + 32)

[1] -378.31 -358.33 -439.87   53.33

As with the if_else function, case_when is strict about data type in that all output must be of the same data type.

Note that ifelse, if_else, recode and case_when can all be used inside of a mutate function. For example, to replace Canada and United States of America in variable Country with CAN and USA respectively and to create a new variable called Type which will take on the values of 1, 2 or 3 depending on the values in variable Source, type the following:

dat1 <- dat %>% 
  mutate(Country = recode(Country, "Canada" = "CAN",
                                   "United States of America" = "USA"),
         Type = case_when(Source == "Calculated data" ~ 1,
                          Source == "Official data" ~ 2,
                          TRUE ~ 3)) 
head(dat1)

  Country         Crop         Information Year      Value          Source Type
1     CAN       Barley Area harvested (Ha) 2012 2060000.00   Official data    2
2     CAN       Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data    1
3     CAN    Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate    3
4     CAN  Canary seed Area harvested (Ha) 2012  101900.00   Official data    2
5     CAN  Canary seed       Yield (Hg/Ha) 2012   12161.92 Calculated data    1
6     CAN Grain, mixed Area harvested (Ha) 2012   57900.00   Official data    2

Miscellaneous

Outputting a vector instead of a table using `pull`

Piping operations will output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a data table:

oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value))
oats

  Oats_sum
1  2169334

class(oats)

[1] "data.frame"

There may be times when you want to output as a vector element and not a data table. To output a vector, use the pull() function.

oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value)) %>% 
  pull()
oats

[1] 2169334

class(oats)

[1] "numeric"

The pull function can also be used to convert a data table column to a multi-element vector, e.g.:

# This outputs a one column table
yield <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  select(Value)

head(yield)

class(yield)

[1] "data.frame"

# This outputs a multi-element vector
yield <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  pull(Value)

head(yield)

[1] 24954.79 21974.70 29109.36 20492.37 27364.53 23056.62

class(yield)

[1] "numeric"

Manipulating data tables with dplyr

The dplyr basics

filter: Subset by rows

arrange: Sort rows by column value

select: Subset by column

mutate: Creating and/or calculating column values

mutate_at: Creating and/or calculating values across muliple columns

summarise: Summarize columns

Combining table manipulation functions using %>%