dplyr | lubridate | forcats |
---|---|---|
1.0.4 | 1.7.9.2 | 0.5.1 |
The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.
Run the following line to load the FAO data file into your current R session.
<- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE) dat
dplyr
basicsThe basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The package dplyr
offers some nifty and simple querying functions as shown in the next subsections. Some of dplyr
’s key data manipulation functions are summarized in the following table:
dplyr function |
Description |
---|---|
filter() |
Subset by row values |
arrange() |
Sort rows by column values |
select() |
Subset columns |
mutate() |
Add columns |
summarise() |
Summarize columns |
Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes will be discussed later in this section). For example:
# When used alone, dataframe dat is inside the select function
<- select(dat, column1)
dat2
# When used in a pipe, dataframe dat is outside of the select function
<- dat %>% select(column1) dat2
Make sure to load the dplyr
package before proceeding with the following examples.
library(dplyr)
filter
: Subset by rowsTables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with Oats
:
<- filter(dat, Crop == "Oats")
dat.query1 summary(dat.query1)
Country Crop Information Year
Length:208 Length:208 Length:208 Min. :1961
Class :character Class :character Class :character 1st Qu.:1974
Mode :character Mode :character Mode :character Median :1986
Mean :1986
3rd Qu.:1999
Max. :2012
Value Source
Min. : 12667 Length:208
1st Qu.: 20392 Class :character
Median : 204821 Mode :character
Mean :1350024
3rd Qu.:1638250
Max. :9666550
Note that R is case sensitive, so make sure that you respect each letter’s case (i.e. upper or lower).
We can expand our query by including both Oats
, Buckwheat
and limiting the country to Canada
.
<- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
dat.query2 == "Canada")
Country summary(dat.query2)
Country Crop Information Year
Length:200 Length:200 Length:200 Min. :1961
Class :character Class :character Class :character 1st Qu.:1973
Mode :character Mode :character Mode :character Median :1986
Mean :1986
3rd Qu.:1998
Max. :2012
Value Source
Min. : 0 Length:200
1st Qu.: 11822 Class :character
Median : 22197 Mode :character
Mean : 504098
3rd Qu.: 924675
Max. :4277000
The character |
is the Boolean operator OR. So in our example, the query can be read as “… crop equals oats OR crop equals buckwheat”. Had we used the AND operator, &
, instead as in Crop == "Oats" & Crop == "Buckwheat"
the output would have returned zero rows since a Crop
value cannot be both Oats
AND Buckwheat
.
We can expand this query by limiting our output to the years 2005 to 2010
library(dplyr)
<- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
dat.query3 == "Canada",
Country >= 2005 & Year <= 2010)
Year summary(dat.query3)
Country Crop Information Year
Length:18 Length:18 Length:18 Min. :2005
Class :character Class :character Class :character 1st Qu.:2006
Mode :character Mode :character Mode :character Median :2007
Mean :2007
3rd Qu.:2008
Max. :2010
Value Source
Min. : 2000 Length:18
1st Qu.: 11500 Class :character
Median : 26615 Mode :character
Mean : 453810
3rd Qu.: 961825
Max. :1815700
Note the use of the AND Boolean operator (&
) instead of the OR operator (|
) for the Year
query. We want the Year
value to satisfy two criteria simultaneously: greater than or equal to 2005 AND less than or equal to 2010. Had we used the |
operator, R would have returned all years since all year values satisfy at least one of the two criterion.
arrange
: Sort rows by column valueYou can sort a table based on a column’s values. For example, to sort dat
by crop name type:
<- arrange(dat, Crop)
dat.sort1 head(dat.sort1)
Country Crop Information Year Value
1 Canada Barley Area harvested (Ha) 2012 2060000.00
2 Canada Barley Yield (Hg/Ha) 2012 38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley Yield (Hg/Ha) 2012 36533.24
5 Canada Barley Area harvested (Ha) 2011 2364800.00
6 Canada Barley Yield (Hg/Ha) 2011 32796.43
Source
1 Official data
2 Calculated data
3 Official data
4 Calculated data
5 Official data
6 Calculated data
tail(dat.sort1)
Country Crop Information Year Value
1496 Canada Triticale Area harvested (Ha) 1991 1093.00
1497 Canada Triticale Yield (Hg/Ha) 1991 21957.91
1498 Canada Triticale Area harvested (Ha) 1990 1074.00
1499 Canada Triticale Yield (Hg/Ha) 1990 26396.65
1500 Canada Triticale Area harvested (Ha) 1989 1093.00
1501 Canada Triticale Yield (Hg/Ha) 1989 21957.91
Source
1496 Official data
1497 Calculated data
1498 FAO data based on imputation methodology
1499 Calculated data
1500 FAO estimate
1501 Calculated data
By default, arrange
sorts by ascending order. To sort by descending order, wrap the column name with the function desc()
. For example, to sort the table by Crop
in ascending order then by Year
in descending order, type:
<- arrange(dat, Crop, desc(Year))
dat.sort2 head(dat.sort2)
Country Crop Information Year Value
1 Canada Barley Area harvested (Ha) 2012 2060000.00
2 Canada Barley Yield (Hg/Ha) 2012 38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley Yield (Hg/Ha) 2012 36533.24
5 Canada Barley Area harvested (Ha) 2011 2364800.00
6 Canada Barley Yield (Hg/Ha) 2011 32796.43
Source
1 Official data
2 Calculated data
3 Official data
4 Calculated data
5 Official data
6 Calculated data
tail(dat.sort2)
Country Crop Information Year Value
1496 Canada Triticale Area harvested (Ha) 1991 1093.00
1497 Canada Triticale Yield (Hg/Ha) 1991 21957.91
1498 Canada Triticale Area harvested (Ha) 1990 1074.00
1499 Canada Triticale Yield (Hg/Ha) 1990 26396.65
1500 Canada Triticale Area harvested (Ha) 1989 1093.00
1501 Canada Triticale Yield (Hg/Ha) 1989 21957.91
Source
1496 Official data
1497 Calculated data
1498 FAO data based on imputation methodology
1499 Calculated data
1500 FAO estimate
1501 Calculated data
select
: Subset by columnYou can grab a subset of the table by column(s). To extract the columns Crop
, Year
and Value
, type:
<- select(dat, Crop, Year, Value)
dat.subcol head(dat.subcol, 2)
Crop Year Value
1 Barley 2012 2060000.00
2 Barley 2012 38894.66
If you want all columns other than Crop
, Year
and Value
, add the negative -
symbol before the column name:
<- select(dat, -Crop, -Year, -Value)
dat.subcol head(dat.subcol, 2)
Country Information Source
1 Canada Area harvested (Ha) Official data
2 Canada Yield (Hg/Ha) Calculated data
mutate
: Creating and/or calculating column valuesYou can add columns (and compute their values) using the mutate
function. For example, to add a column Ctr_abbr
and assign it the abbreviated values CAN
for Canada and USA
for the United States of America based on the values in column Country
type:
<- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
dat.extended head(dat.extended,3)
Country Crop Information Year Value Source
1 Canada Barley Area harvested (Ha) 2012 2060000.00 Official data
2 Canada Barley Yield (Hg/Ha) 2012 38894.66 Calculated data
3 Canada Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate
Ctr_abbr
1 CAN
2 CAN
3 CAN
tail(dat.extended,3)
Country Crop Information Year Value
1499 United States of America Rye Yield (Hg/Ha) 1961 11121.79
1500 United States of America Sorghum Area harvested (Ha) 1961 4445000.00
1501 United States of America Sorghum Yield (Hg/Ha) 1961 27442.07
Source Ctr_abbr
1499 Calculated data USA
1500 Official data USA
1501 Calculated data USA
Here, we make use of an embedded function, ifelse
, which performs a conditional operation: if the Country
value is Canada
return CAN
if not, return USA
.
Note that if you wish to rename a column, you can use the rename()
function instead of mutate
.
You can also use mutate
to recompute column values. For example, to replace the Country
column values with CAN
or USA
type:
<- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.overwrite head(dat.overwrite,3)
Country Crop Information Year Value Source
1 CAN Barley Area harvested (Ha) 2012 2060000.00 Official data
2 CAN Barley Yield (Hg/Ha) 2012 38894.66 Calculated data
3 CAN Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate
tail(dat.overwrite,3)
Country Crop Information Year Value Source
1499 USA Rye Yield (Hg/Ha) 1961 11121.79 Calculated data
1500 USA Sorghum Area harvested (Ha) 1961 4445000.00 Official data
1501 USA Sorghum Yield (Hg/Ha) 1961 27442.07 Calculated data
mutate_at
: Creating and/or calculating values across muliple columnsYou might find yourself wanting to apply a same set of mutate operations across multiple variables. For example, given the following sample dataset,
<- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
met Dir = c(181, 220, 15, 15, 99 ),
Prec = c(99 , 0.5, 0, 99, 99))
met
Wind Dir Prec
1 3.4 181 99.0
2 5.0 220 0.5
3 99.0 15 0.0
4 4.1 15 99.0
5 1.5 99 99.0
where the value 99
is a placeholder for a missing value for the variables Wind
and Prec
but a valid value for Dir
, we want to replace all missing values with NA
. We could either create two mutate operations as in,
<- mutate( met, Wind = ifelse(Wind == 99, NA, Wind),
met2 Prec = ifelse(Prec == 99, NA, Prec))
met2
Wind Dir Prec
1 3.4 181 NA
2 5.0 220 0.5
3 NA 15 0.0
4 4.1 15 NA
5 1.5 99 NA
or, we could reduce the separate mutate
operations into a single mutate_at
operation,
<- mutate_at( met, vars(Wind, Prec),
met2 list(~ ifelse( . == 99, NA, .)))
met2
Wind Dir Prec
1 3.4 181 NA
2 5.0 220 0.5
3 NA 15 0.0
4 4.1 15 NA
5 1.5 99 NA
The dot .
is a placeholder for each column listed in the vars()
function. Also, note the ~
operator in lieu of . =
.
summarise
: Summarize columnsYou can summarize (or “collapse”) one or more columns using the summarise
function. For instance, to get the minimum and maximum years from the Year
column, type:
summarise(dat, yr_min = min(Year), yr_max=max(Year))
yr_min yr_max
1 1961 2012
%>%
In most cases, you will find yourself wanting to combine several of dplyr
’s data manipulation functions. For example,
<- filter(dat, Information == "Yield (Hg/Ha)",
dat.yield == "Oats",
Crop == 2012)
Year <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.rename <- select(dat.rename, Country, Value)
dat.final
head(dat.final, 3)
Country Value
1 CAN 24954.79
2 USA 21974.70
The downside to this approach is the creation of several intermediate objects (e.g. dat.yield
and dat.rename
). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.
Another approach to combining dplyr
operations is to use the piping operator ,%>%
, which daisy chains dplyr
operations. So our previous workflow could look like:
<- dat %>%
dat.final filter(Information == "Yield (Hg/Ha)",
== "Oats",
Crop == 2012) %>%
Year mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
select(Country, Value)
head(dat.final, 3)
Country Value
1 CAN 24954.79
2 USA 21974.70
The chunk of code can be read as “… with the dat
table, filter
by …, then mutate
…., then select
…” with the result from one operation being passed on to the next using the %>%
operator. Note that the filter
, mutate
and select
functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table dat
appears just once at the beginning of the pipe.
ifelse
Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of 1
if an input value is less than 23
and a value of 0
otherwise, you can make use of the ifelse
function as follows:
<- c(12,102, 43, 20, 90, 0, 12, 6)
x ifelse(x < 23, 1, 0)
[1] 1 0 0 1 0 1 1 1
The base ifelse
function works as expected when the input/output values are numeric or character, but does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.
<- as.factor( c("apple", "banana", "banana", "pear", "apple"))
x ifelse(x == "pear", "apple", x)
[1] "1" "2" "2" "apple" "1"
The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. apple
=1, banana
=2, etc…). Likewise, if you wish to replace an erroneous date with a missing value you will get:
library(lubridate)
<- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
y ifelse( year(y) != 2016, NA, y)
[1] 16823 16862 NA 17128
Here, ifelse
converts the date object to its internal numeric representation as number of days since 1970.
if_else
does not preserve the attributes that might be present in a vector. In other words, it will strip away the vector’s class
. If you want to ensure that the vector’s class
is preserved, a safer alternative is to use dplyr
’s if_else
function.
dplyr
’s if_else
The if_else
function (note the underscore _
), will preserve data type but does so strictly. For example, the following code will return an error:
if_else( year(y) != 2016, NA, y)
Error: `false` must be a logical vector, not a `Date` object.
The output data types (NA
and y
) are different. NA
is a logical data type (see typeof(NA)
) and y
is not. The solution is to force NA
as a date object by wrapping it with a date function like mdy(NA)
or the base as.Date(NA)
function, or by replacing the logical NA
with NA_Date_
which is found in the lubridate
package.
if_else( year(y) != 2016, mdy(NA), y)
[1] "2016-01-23" "2016-03-02" NA "2016-11-23"
Likewise, if the input vector is of type character, you need to ensure that all output values are characters too.
<- c("apple", "banana", "banana", "pear", "apple")
y if_else( y == "banana", as.character(NA), y)
[1] "apple" NA NA "pear" "apple"
Recall from the Week 02 lecture notes that R has several NA
reserved words for different data types (e.g. NA_character_
and NA_integer_
to name a few). So the last chunk of code could have been written as,
if_else( y == "banana", NA_character_, y)
[1] "apple" NA NA "pear" "apple"
Note that there is no reserved word for an NA
date or Posix type in the base R environment. But they are available in the lubridate
package (NA_Date_
and NA_POSIXct_
).
recode
When working with factors, however, if_else
(as of dplyr
version 0.7
) will produce the following error:
if_else(x == "pear", "apple", x)
Error: `false` must be a character vector, not a `factor` object.
The first outcome listed in the function defines the data type or class. Here, "apple"
is a character, so it expects the FALSE outcome to be a character as well (here, the FALSE outcome is a factor). A better option for recoding factors is to use dplyr
’s recode
function:
recode(x , "pear" = "apple")
[1] apple banana banana apple apple
Levels: apple banana
You can recode more than one factor level. In fact, you can even introduce new levels in the recoding scheme:
recode(x , "pear" = "apple", "banana" = "pineapple" )
[1] apple pineapple pineapple apple apple
Levels: apple pineapple
As with if_else
and case_when
, recode
is strict about preserving data types and classes. So if you want to recode a factor level to NA
make sure to use NA_character_
or as.character(NA)
. recode
will automatcially convert the character value to a factor if the input vector (x
in our example) is a factor.
Note that recode
can also be used with numeric and character data types (i.e. vectors devoid of class attributes). For example,
# Example of recoding a character vector
recode(y , "pear" = "apple", "banana" = "pineapple" )
[1] "apple" "pineapple" "pineapple" "apple" "apple"
One operation you cannot perform with recode
is converting an NA
level to another factor level. For example, the following will generate an error message:
2] <- NA
x[ x
[1] apple <NA> banana pear apple
Levels: apple banana pear
recode(x, NA = "other")
Error: unexpected '=' in "recode(x, NA ="
The simplest solution is to make use of a specialized factor package called forcats
and its function, fct_explicit_na
.
library(forcats)
<- fct_explicit_na(x, na_level = "Other")
x x
[1] apple Other banana pear apple
Levels: apple banana pear Other
case_when
ifelse
and if_else
work great when a single set of conditions is to be satisfied. But if multiple sets of conditions are to be tested, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.
<- c("F","F","C", "K")
unit if_else( unit == "C", "Celsius", if_else(unit == "F", "Fahrenheit", "Kelvin"))
[1] "Fahrenheit" "Fahrenheit" "Celsius" "Kelvin"
A simpler solution is to use the recode
function discussed in the previous section.
recode(unit, "C" = "Celsius",
"F" = "Fahrenheit",
"K" = "Kelvin")
[1] "Fahrenheit" "Fahrenheit" "Celsius" "Kelvin"
recode
is well suited for replacing values but it will not allow for more complex operations. For example, given two vectors, unit
and temp
, we would like to convert all temp
values to Fahrenheit by applying a temperature conversion dependent on the unit
value.
<- c(45.2, 56.3, 11.0, 285) temp
This operation is best performed using the case_when
function.
case_when(unit == "F" ~ temp,
== "C" ~ (temp * 9/5) + 32,
unit TRUE ~ (temp - 273.15) * 9/5 + 32)
[1] 45.20 56.30 51.80 53.33
The last parameter, TRUE ~
, applies to all conditions not satisfied by the previous two conditions (otherwise, not doing so would return NA
values).
Note that the order in which these conditions are listed matters since evaluation stops at the first TRUE
outcome encountered. So, had the last condition been moved to the top of the stack, all temp
values would be assigned the first conversion option.
# What not to do ...
case_when(TRUE ~ (temp - 273.15) * 9/5 + 32,
== "F" ~ temp,
unit == "C" ~ (temp * 9/5) + 32) unit
[1] -378.31 -358.33 -439.87 53.33
As with the if_else
function, case_when
is strict about data type in that all output must be of the same data type.
Note that ifelse
, if_else
, recode
and case_when
can all be used inside of a mutate
function. For example, to replace Canada
and United States of America
in variable Country
with CAN
and USA
respectively and to create a new variable called Type
which will take on the values of 1
, 2
or 3
depending on the values in variable Source
, type the following:
<- dat %>%
dat1 mutate(Country = recode(Country, "Canada" = "CAN",
"United States of America" = "USA"),
Type = case_when(Source == "Calculated data" ~ 1,
== "Official data" ~ 2,
Source TRUE ~ 3))
head(dat1)
Country Crop Information Year Value Source Type
1 CAN Barley Area harvested (Ha) 2012 2060000.00 Official data 2
2 CAN Barley Yield (Hg/Ha) 2012 38894.66 Calculated data 1
3 CAN Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate 3
4 CAN Canary seed Area harvested (Ha) 2012 101900.00 Official data 2
5 CAN Canary seed Yield (Hg/Ha) 2012 12161.92 Calculated data 1
6 CAN Grain, mixed Area harvested (Ha) 2012 57900.00 Official data 2
pull
Piping operations will output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a data table:
<- dat %>%
oats filter(Crop == "Oats",
== "Yield (Hg/Ha)") %>%
Information summarise(Oats_sum = sum(Value))
oats
Oats_sum
1 2169334
class(oats)
[1] "data.frame"
There may be times when you want to output as a vector element and not a data table. To output a vector, use the pull()
function.
<- dat %>%
oats filter(Crop == "Oats",
== "Yield (Hg/Ha)") %>%
Information summarise(Oats_sum = sum(Value)) %>%
pull()
oats
[1] 2169334
class(oats)
[1] "numeric"
The pull
function can also be used to convert a data table column to a multi-element vector, e.g.:
# This outputs a one column table
<- dat %>%
yield filter(Crop == "Oats",
== "Yield (Hg/Ha)") %>%
Information select(Value)
head(yield)
Value
1 24954.79
2 21974.70
3 29109.36
4 20492.37
5 27364.53
6 23056.62
class(yield)
[1] "data.frame"
# This outputs a multi-element vector
<- dat %>%
yield filter(Crop == "Oats",
== "Yield (Hg/Ha)") %>%
Information pull(Value)
head(yield)
[1] 24954.79 21974.70 29109.36 20492.37 27364.53 23056.62
class(yield)
[1] "numeric"