| dplyr | lubridate | forcats |
|---|---|---|
| 1.0.4 | 1.7.9.2 | 0.5.1 |
The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.
Run the following line to load the FAO data file into your current R session.
dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)dplyr basicsThe basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The package dplyr offers some nifty and simple querying functions as shown in the next subsections. Some of dplyr’s key data manipulation functions are summarized in the following table:
dplyr function |
Description |
|---|---|
filter() |
Subset by row values |
arrange() |
Sort rows by column values |
select() |
Subset columns |
mutate() |
Add columns |
summarise() |
Summarize columns |
Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes will be discussed later in this section). For example:
# When used alone, dataframe dat is inside the select function
dat2 <- select(dat, column1)
# When used in a pipe, dataframe dat is outside of the select function
dat2 <- dat %>% select(column1)Make sure to load the dplyr package before proceeding with the following examples.
library(dplyr)filter: Subset by rowsTables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with Oats:
dat.query1 <- filter(dat, Crop == "Oats")
summary(dat.query1) Country Crop Information Year
Length:208 Length:208 Length:208 Min. :1961
Class :character Class :character Class :character 1st Qu.:1974
Mode :character Mode :character Mode :character Median :1986
Mean :1986
3rd Qu.:1999
Max. :2012
Value Source
Min. : 12667 Length:208
1st Qu.: 20392 Class :character
Median : 204821 Mode :character
Mean :1350024
3rd Qu.:1638250
Max. :9666550
Note that R is case sensitive, so make sure that you respect each letter’s case (i.e. upper or lower).
We can expand our query by including both Oats, Buckwheat and limiting the country to Canada.
dat.query2 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
Country == "Canada")
summary(dat.query2) Country Crop Information Year
Length:200 Length:200 Length:200 Min. :1961
Class :character Class :character Class :character 1st Qu.:1973
Mode :character Mode :character Mode :character Median :1986
Mean :1986
3rd Qu.:1998
Max. :2012
Value Source
Min. : 0 Length:200
1st Qu.: 11822 Class :character
Median : 22197 Mode :character
Mean : 504098
3rd Qu.: 924675
Max. :4277000
The character | is the Boolean operator OR. So in our example, the query can be read as “… crop equals oats OR crop equals buckwheat”. Had we used the AND operator, &, instead as in Crop == "Oats" & Crop == "Buckwheat" the output would have returned zero rows since a Crop value cannot be both Oats AND Buckwheat.
We can expand this query by limiting our output to the years 2005 to 2010
library(dplyr)
dat.query3 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
Country == "Canada",
Year >= 2005 & Year <= 2010)
summary(dat.query3) Country Crop Information Year
Length:18 Length:18 Length:18 Min. :2005
Class :character Class :character Class :character 1st Qu.:2006
Mode :character Mode :character Mode :character Median :2007
Mean :2007
3rd Qu.:2008
Max. :2010
Value Source
Min. : 2000 Length:18
1st Qu.: 11500 Class :character
Median : 26615 Mode :character
Mean : 453810
3rd Qu.: 961825
Max. :1815700
Note the use of the AND Boolean operator (&) instead of the OR operator (|) for the Year query. We want the Year value to satisfy two criteria simultaneously: greater than or equal to 2005 AND less than or equal to 2010. Had we used the | operator, R would have returned all years since all year values satisfy at least one of the two criterion.
arrange: Sort rows by column valueYou can sort a table based on a column’s values. For example, to sort dat by crop name type:
dat.sort1 <- arrange(dat, Crop)
head(dat.sort1) Country Crop Information Year Value
1 Canada Barley Area harvested (Ha) 2012 2060000.00
2 Canada Barley Yield (Hg/Ha) 2012 38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley Yield (Hg/Ha) 2012 36533.24
5 Canada Barley Area harvested (Ha) 2011 2364800.00
6 Canada Barley Yield (Hg/Ha) 2011 32796.43
Source
1 Official data
2 Calculated data
3 Official data
4 Calculated data
5 Official data
6 Calculated data
tail(dat.sort1) Country Crop Information Year Value
1496 Canada Triticale Area harvested (Ha) 1991 1093.00
1497 Canada Triticale Yield (Hg/Ha) 1991 21957.91
1498 Canada Triticale Area harvested (Ha) 1990 1074.00
1499 Canada Triticale Yield (Hg/Ha) 1990 26396.65
1500 Canada Triticale Area harvested (Ha) 1989 1093.00
1501 Canada Triticale Yield (Hg/Ha) 1989 21957.91
Source
1496 Official data
1497 Calculated data
1498 FAO data based on imputation methodology
1499 Calculated data
1500 FAO estimate
1501 Calculated data
By default, arrange sorts by ascending order. To sort by descending order, wrap the column name with the function desc(). For example, to sort the table by Crop in ascending order then by Year in descending order, type:
dat.sort2 <- arrange(dat, Crop, desc(Year))
head(dat.sort2) Country Crop Information Year Value
1 Canada Barley Area harvested (Ha) 2012 2060000.00
2 Canada Barley Yield (Hg/Ha) 2012 38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley Yield (Hg/Ha) 2012 36533.24
5 Canada Barley Area harvested (Ha) 2011 2364800.00
6 Canada Barley Yield (Hg/Ha) 2011 32796.43
Source
1 Official data
2 Calculated data
3 Official data
4 Calculated data
5 Official data
6 Calculated data
tail(dat.sort2) Country Crop Information Year Value
1496 Canada Triticale Area harvested (Ha) 1991 1093.00
1497 Canada Triticale Yield (Hg/Ha) 1991 21957.91
1498 Canada Triticale Area harvested (Ha) 1990 1074.00
1499 Canada Triticale Yield (Hg/Ha) 1990 26396.65
1500 Canada Triticale Area harvested (Ha) 1989 1093.00
1501 Canada Triticale Yield (Hg/Ha) 1989 21957.91
Source
1496 Official data
1497 Calculated data
1498 FAO data based on imputation methodology
1499 Calculated data
1500 FAO estimate
1501 Calculated data
select: Subset by columnYou can grab a subset of the table by column(s). To extract the columns Crop, Year and Value, type:
dat.subcol <- select(dat, Crop, Year, Value)
head(dat.subcol, 2) Crop Year Value
1 Barley 2012 2060000.00
2 Barley 2012 38894.66
If you want all columns other than Crop, Year and Value, add the negative - symbol before the column name:
dat.subcol <- select(dat, -Crop, -Year, -Value)
head(dat.subcol, 2) Country Information Source
1 Canada Area harvested (Ha) Official data
2 Canada Yield (Hg/Ha) Calculated data
mutate: Creating and/or calculating column valuesYou can add columns (and compute their values) using the mutate function. For example, to add a column Ctr_abbr and assign it the abbreviated values CAN for Canada and USA for the United States of America based on the values in column Country type:
dat.extended <- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.extended,3) Country Crop Information Year Value Source
1 Canada Barley Area harvested (Ha) 2012 2060000.00 Official data
2 Canada Barley Yield (Hg/Ha) 2012 38894.66 Calculated data
3 Canada Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate
Ctr_abbr
1 CAN
2 CAN
3 CAN
tail(dat.extended,3) Country Crop Information Year Value
1499 United States of America Rye Yield (Hg/Ha) 1961 11121.79
1500 United States of America Sorghum Area harvested (Ha) 1961 4445000.00
1501 United States of America Sorghum Yield (Hg/Ha) 1961 27442.07
Source Ctr_abbr
1499 Calculated data USA
1500 Official data USA
1501 Calculated data USA
Here, we make use of an embedded function, ifelse, which performs a conditional operation: if the Country value is Canada return CAN if not, return USA.
Note that if you wish to rename a column, you can use the rename() function instead of mutate.
You can also use mutate to recompute column values. For example, to replace the Country column values with CAN or USA type:
dat.overwrite <- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.overwrite,3) Country Crop Information Year Value Source
1 CAN Barley Area harvested (Ha) 2012 2060000.00 Official data
2 CAN Barley Yield (Hg/Ha) 2012 38894.66 Calculated data
3 CAN Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate
tail(dat.overwrite,3) Country Crop Information Year Value Source
1499 USA Rye Yield (Hg/Ha) 1961 11121.79 Calculated data
1500 USA Sorghum Area harvested (Ha) 1961 4445000.00 Official data
1501 USA Sorghum Yield (Hg/Ha) 1961 27442.07 Calculated data
mutate_at: Creating and/or calculating values across muliple columnsYou might find yourself wanting to apply a same set of mutate operations across multiple variables. For example, given the following sample dataset,
met <- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
Dir = c(181, 220, 15, 15, 99 ),
Prec = c(99 , 0.5, 0, 99, 99))
met Wind Dir Prec
1 3.4 181 99.0
2 5.0 220 0.5
3 99.0 15 0.0
4 4.1 15 99.0
5 1.5 99 99.0
where the value 99 is a placeholder for a missing value for the variables Wind and Prec but a valid value for Dir, we want to replace all missing values with NA. We could either create two mutate operations as in,
met2 <- mutate( met, Wind = ifelse(Wind == 99, NA, Wind),
Prec = ifelse(Prec == 99, NA, Prec))
met2 Wind Dir Prec
1 3.4 181 NA
2 5.0 220 0.5
3 NA 15 0.0
4 4.1 15 NA
5 1.5 99 NA
or, we could reduce the separate mutate operations into a single mutate_at operation,
met2 <- mutate_at( met, vars(Wind, Prec),
list(~ ifelse( . == 99, NA, .)))
met2 Wind Dir Prec
1 3.4 181 NA
2 5.0 220 0.5
3 NA 15 0.0
4 4.1 15 NA
5 1.5 99 NA
The dot . is a placeholder for each column listed in the vars() function. Also, note the ~ operator in lieu of . =.
summarise: Summarize columnsYou can summarize (or “collapse”) one or more columns using the summarise function. For instance, to get the minimum and maximum years from the Year column, type:
summarise(dat, yr_min = min(Year), yr_max=max(Year)) yr_min yr_max
1 1961 2012
%>%In most cases, you will find yourself wanting to combine several of dplyr’s data manipulation functions. For example,
dat.yield <- filter(dat, Information == "Yield (Hg/Ha)",
Crop == "Oats",
Year == 2012)
dat.rename <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.final <- select(dat.rename, Country, Value)
head(dat.final, 3) Country Value
1 CAN 24954.79
2 USA 21974.70
The downside to this approach is the creation of several intermediate objects (e.g. dat.yield and dat.rename). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.
Another approach to combining dplyr operations is to use the piping operator ,%>%, which daisy chains dplyr operations. So our previous workflow could look like:
dat.final <- dat %>%
filter(Information == "Yield (Hg/Ha)",
Crop == "Oats",
Year == 2012) %>%
mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
select(Country, Value)
head(dat.final, 3) Country Value
1 CAN 24954.79
2 USA 21974.70
The chunk of code can be read as “… with the dat table, filter by …, then mutate …., then select …” with the result from one operation being passed on to the next using the %>% operator. Note that the filter, mutate and select functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table dat appears just once at the beginning of the pipe.
ifelseConditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of 1 if an input value is less than 23 and a value of 0 otherwise, you can make use of the ifelse function as follows:
x <- c(12,102, 43, 20, 90, 0, 12, 6)
ifelse(x < 23, 1, 0)[1] 1 0 0 1 0 1 1 1
The base ifelse function works as expected when the input/output values are numeric or character, but does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.
x <- as.factor( c("apple", "banana", "banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)[1] "1" "2" "2" "apple" "1"
The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. apple =1, banana=2, etc…). Likewise, if you wish to replace an erroneous date with a missing value you will get:
library(lubridate)
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
ifelse( year(y) != 2016, NA, y)[1] 16823 16862 NA 17128
Here, ifelse converts the date object to its internal numeric representation as number of days since 1970.
if_else does not preserve the attributes that might be present in a vector. In other words, it will strip away the vector’s class. If you want to ensure that the vector’s class is preserved, a safer alternative is to use dplyr’s if_else function.
dplyr’s if_elseThe if_else function (note the underscore _), will preserve data type but does so strictly. For example, the following code will return an error:
if_else( year(y) != 2016, NA, y)Error: `false` must be a logical vector, not a `Date` object.
The output data types (NA and y) are different. NA is a logical data type (see typeof(NA)) and y is not. The solution is to force NA as a date object by wrapping it with a date function like mdy(NA) or the base as.Date(NA) function, or by replacing the logical NA with NA_Date_ which is found in the lubridate package.
if_else( year(y) != 2016, mdy(NA), y)[1] "2016-01-23" "2016-03-02" NA "2016-11-23"
Likewise, if the input vector is of type character, you need to ensure that all output values are characters too.
y <- c("apple", "banana", "banana", "pear", "apple")
if_else( y == "banana", as.character(NA), y)[1] "apple" NA NA "pear" "apple"
Recall from the Week 02 lecture notes that R has several NA reserved words for different data types (e.g. NA_character_ and NA_integer_ to name a few). So the last chunk of code could have been written as,
if_else( y == "banana", NA_character_, y)[1] "apple" NA NA "pear" "apple"
Note that there is no reserved word for an NA date or Posix type in the base R environment. But they are available in the lubridate package (NA_Date_ and NA_POSIXct_).
recodeWhen working with factors, however, if_else (as of dplyr version 0.7) will produce the following error:
if_else(x == "pear", "apple", x)Error: `false` must be a character vector, not a `factor` object.
The first outcome listed in the function defines the data type or class. Here, "apple" is a character, so it expects the FALSE outcome to be a character as well (here, the FALSE outcome is a factor). A better option for recoding factors is to use dplyr’s recode function:
recode(x , "pear" = "apple")[1] apple banana banana apple apple
Levels: apple banana
You can recode more than one factor level. In fact, you can even introduce new levels in the recoding scheme:
recode(x , "pear" = "apple", "banana" = "pineapple" )[1] apple pineapple pineapple apple apple
Levels: apple pineapple
As with if_else and case_when, recode is strict about preserving data types and classes. So if you want to recode a factor level to NA make sure to use NA_character_ or as.character(NA). recode will automatcially convert the character value to a factor if the input vector (x in our example) is a factor.
Note that recode can also be used with numeric and character data types (i.e. vectors devoid of class attributes). For example,
# Example of recoding a character vector
recode(y , "pear" = "apple", "banana" = "pineapple" )[1] "apple" "pineapple" "pineapple" "apple" "apple"
One operation you cannot perform with recode is converting an NA level to another factor level. For example, the following will generate an error message:
x[2] <- NA
x[1] apple <NA> banana pear apple
Levels: apple banana pear
recode(x, NA = "other")Error: unexpected '=' in "recode(x, NA ="
The simplest solution is to make use of a specialized factor package called forcats and its function, fct_explicit_na.
library(forcats)
x <- fct_explicit_na(x, na_level = "Other")
x[1] apple Other banana pear apple
Levels: apple banana pear Other
case_whenifelse and if_else work great when a single set of conditions is to be satisfied. But if multiple sets of conditions are to be tested, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.
unit <- c("F","F","C", "K")
if_else( unit == "C", "Celsius", if_else(unit == "F", "Fahrenheit", "Kelvin"))[1] "Fahrenheit" "Fahrenheit" "Celsius" "Kelvin"
A simpler solution is to use the recode function discussed in the previous section.
recode(unit, "C" = "Celsius",
"F" = "Fahrenheit",
"K" = "Kelvin")[1] "Fahrenheit" "Fahrenheit" "Celsius" "Kelvin"
recode is well suited for replacing values but it will not allow for more complex operations. For example, given two vectors, unit and temp, we would like to convert all temp values to Fahrenheit by applying a temperature conversion dependent on the unit value.
temp <- c(45.2, 56.3, 11.0, 285)This operation is best performed using the case_when function.
case_when(unit == "F" ~ temp,
unit == "C" ~ (temp * 9/5) + 32,
TRUE ~ (temp - 273.15) * 9/5 + 32)[1] 45.20 56.30 51.80 53.33
The last parameter, TRUE ~, applies to all conditions not satisfied by the previous two conditions (otherwise, not doing so would return NA values).
Note that the order in which these conditions are listed matters since evaluation stops at the first TRUE outcome encountered. So, had the last condition been moved to the top of the stack, all temp values would be assigned the first conversion option.
# What not to do ...
case_when(TRUE ~ (temp - 273.15) * 9/5 + 32,
unit == "F" ~ temp,
unit == "C" ~ (temp * 9/5) + 32)[1] -378.31 -358.33 -439.87 53.33
As with the if_else function, case_when is strict about data type in that all output must be of the same data type.
Note that ifelse, if_else, recode and case_when can all be used inside of a mutate function. For example, to replace Canada and United States of America in variable Country with CAN and USA respectively and to create a new variable called Type which will take on the values of 1, 2 or 3 depending on the values in variable Source, type the following:
dat1 <- dat %>%
mutate(Country = recode(Country, "Canada" = "CAN",
"United States of America" = "USA"),
Type = case_when(Source == "Calculated data" ~ 1,
Source == "Official data" ~ 2,
TRUE ~ 3))
head(dat1) Country Crop Information Year Value Source Type
1 CAN Barley Area harvested (Ha) 2012 2060000.00 Official data 2
2 CAN Barley Yield (Hg/Ha) 2012 38894.66 Calculated data 1
3 CAN Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate 3
4 CAN Canary seed Area harvested (Ha) 2012 101900.00 Official data 2
5 CAN Canary seed Yield (Hg/Ha) 2012 12161.92 Calculated data 1
6 CAN Grain, mixed Area harvested (Ha) 2012 57900.00 Official data 2
pullPiping operations will output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a data table:
oats <- dat %>%
filter(Crop == "Oats",
Information == "Yield (Hg/Ha)") %>%
summarise(Oats_sum = sum(Value))
oats Oats_sum
1 2169334
class(oats)[1] "data.frame"
There may be times when you want to output as a vector element and not a data table. To output a vector, use the pull() function.
oats <- dat %>%
filter(Crop == "Oats",
Information == "Yield (Hg/Ha)") %>%
summarise(Oats_sum = sum(Value)) %>%
pull()
oats[1] 2169334
class(oats)[1] "numeric"
The pull function can also be used to convert a data table column to a multi-element vector, e.g.:
# This outputs a one column table
yield <- dat %>%
filter(Crop == "Oats",
Information == "Yield (Hg/Ha)") %>%
select(Value)
head(yield) Value
1 24954.79
2 21974.70
3 29109.36
4 20492.37
5 27364.53
6 23056.62
class(yield)[1] "data.frame"
# This outputs a multi-element vector
yield <- dat %>%
filter(Crop == "Oats",
Information == "Yield (Hg/Ha)") %>%
pull(Value)
head(yield)[1] 24954.79 21974.70 29109.36 20492.37 27364.53 23056.62
class(yield)[1] "numeric"