stringr |
---|
1.4.0 |
Numbers and dates are not the only variable types we might be interested in exploring. We often find ourselves having to manipulate character (text) objects as well. In a programming environment, such operations are often referred to as string searches. String queries may involve assessing whether a variable matches or contains an exact set of characters; they can also involve extracting a certain set of characters given some pattern. R has a very capable set of string operations built into its environment; however, many find them difficult to master. This tutorial uses the stringr package, which simplifies the task. A write-up of its capabilities can be found here.
Checking for an exact match is the simplest string operation one can perform. It involves assessing whether a variable is equal (or not) to a complete text string.
We’ve already seen how conditional statements can be used to check whether a variable is equal to, less than or greater than a number. We can also use conditional statements to evaluate whether a variable matches an exact string. For example, the following chunk of code returns TRUE since the strings match exactly.
a <- "Abc def"
a == "Abc def"
[1] TRUE
However, note that R is case sensitive, so the following query returns FALSE since the first character does not match in case (i.e. upper case A vs. lower case a).
a == "abc def"
[1] FALSE
If you want R to ignore case in a string operation, simply force all variables to lower case and define the pattern being compared against in lower case. For example:
tolower(a) == "abc def"
[1] TRUE
To check if object a has the pattern "c d" (note the space between the letters) anywhere in its string, use stringr’s str_detect function as follows:
library(stringr)
str_detect(a, "c d")
[1] TRUE
The following example compares the string to "cd" (note the omission of the space):
str_detect(a, "cd")
[1] FALSE
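The lower-casing approach described earlier also carries over to pattern searches. As a minimal sketch, the chunk below shows two ways of ignoring case: lower-casing both sides of an exact comparison, and wrapping the pattern in stringr's regex() helper with ignore_case = TRUE. The upper-case patterns are made up for illustration.

```r
library(stringr)

a <- "Abc def"

# Exact comparison that ignores case: lower both sides before comparing
tolower(a) == tolower("ABC DEF")

# Pattern search that ignores case, using stringr's regex() modifier
str_detect(a, regex("C D", ignore_case = TRUE))
```

Both expressions should return TRUE. The regex() route leaves the original object untouched, which can be convenient when the matched text is needed later in its original case.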
To check if object a
starts with the pattern "c d"
add the carat character ^
in front of the pattern as in:
str_detect(a, "^c d")
[1] FALSE
To check if object a ends with the pattern "Abc", add the dollar character $ to the end of the pattern as in:
str_detect(a, "Abc$")
[1] FALSE
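Both anchored examples above return FALSE. For contrast, here is a short sketch of anchored patterns that do match the same string; the patterns are chosen for illustration only.

```r
library(stringr)

a <- "Abc def"

# The string starts with "Abc"
str_detect(a, "^Abc")

# The string ends with "def"
str_detect(a, "def$")

# Combining both anchors requires the whole string to match the pattern
str_detect(a, "^Abc def$")
```

All three calls should return TRUE.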
If you want to find where a particular pattern lies in a string, use the str_locate function. For example, to find where the pattern "c d" occurs in object a, type:
str_locate(a, "c d")
start end
[1,] 3 5
The function returns two values: the position in the string where the pattern starts (here, position 3) and the position where the pattern ends (here, position 5).
Note that if the pattern is not found, str_locate returns NAs:
str_locate(a, "cd")
start end
[1,] NA NA
Note too that the str_locate function only returns the position of the first occurrence. For example, the following chunk will only return the start/end positions of the first occurrence of Ab.
b <- "Abc def Abg"
str_locate(b, "Ab")
start end
[1,] 1 2
To find all occurrences, use the str_locate_all() function as in:
str_locate_all(b,"Ab")
[[1]]
start end
[1,] 1 2
[2,] 9 10
The function returns a list object. To extract the position values into a data frame, simply wrap the function in a call to as.data.frame, for example:
str.pos <- as.data.frame(str_locate_all(b,"Ab"))
str.pos
start end
1 1 2
2 9 10
The reason str_locate_all returns a list and not a matrix or a data frame can be understood in the following example:
# Create a 5 element string vector
d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
# Search for all instances of "Ab"
str_locate_all(d,"Ab")
[[1]]
start end
[1,] 1 2
[[2]]
start end
[[3]]
start end
[1,] 1 2
[2,] 9 10
[[4]]
start end
[[5]]
start end
Here, d is a five element string vector (so far we’ve worked with single element vectors). The str_locate_all function returns a result for each element of that vector, and since a pattern can be found multiple times in the same vector element, the output can only be conveniently stored in a list.
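Because each list element is a matrix with one row per match, the list can be queried with base R tools. The following is a small sketch, assuming the same vector d as above; the object name loc.list is introduced here for illustration.

```r
library(stringr)

d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
loc.list <- str_locate_all(d, "Ab")

# Number of matches per element: the row count of each matrix
sapply(loc.list, nrow)

# Start/end positions for the third element only
loc.list[[3]]
```

Elements with no match yield a matrix with zero rows, so the row counts double as a per-element match count.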
A natural extension to finding the positions of patterns in a text is to find the string’s total length. This can be accomplished with the str_length() function:
str_length(b)
[1] 11
For a multi-element vector, the output looks like this:
str_length(d)
[1] 3 4 10 4 3
To find out how often the pattern Ab occurs in each element of object d, use the str_count() function.
str_count(d, "Ab")
[1] 1 0 2 0 0
The str_pad() function can be used to pad numbers with leading zeros. Note that in doing so, you are creating a character object from a numeric object.
e <- c(12, 2, 503, 20, 0)
str_pad(e, width=3, side="left", pad = "0")
[1] "012" "002" "503" "020" "000"
You can append strings with custom text using the str_c() function. For example, to add the string length at the end of each vector element in d, type:
str_c(d, " has ", str_length(d), " characters" )
[1] "Abc has 3 characters" "Def has 4 characters" "Abc Def Ab has 10 characters"
[4] " bc has 4 characters" "ef has 3 characters"
You can remove leading or trailing (or both) white spaces from a string. For example, to remove leading white spaces from object d, type:
d.left.trim <- str_trim(d, side="left")
Now let’s compare the original to the left-trimmed version:
str_length(d)
[1] 3 4 10 4 3
str_length(d.left.trim)
[1] 3 4 10 3 3
To remove trailing spaces, set side = "right", and to remove both leading and trailing spaces, set side = "both".
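As a quick sketch of the side = "both" option, assuming the same vector d as above (the object name d.both.trim is introduced here for illustration):

```r
library(stringr)

d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

# Strip white spaces from both ends of every element
d.both.trim <- str_trim(d, side = "both")

# Compare lengths before and after trimming
str_length(d)
str_length(d.both.trim)
```

The trimmed lengths should be 3 3 10 2 2. Spaces inside an element, as in "Abc Def Ab", are left untouched.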
To replace all instances of a specified set of characters in a string with another set of characters, use the str_replace_all() function. For example, to replace all spaces in object b with dashes, type:
str_replace_all(b, " ", "-")
[1] "Abc-def-Abg"
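For comparison, stringr also provides str_replace(), which replaces only the first match in each string. A minimal sketch using the same string:

```r
library(stringr)

b <- "Abc def Abg"

# Only the first space is replaced
str_replace(b, " ", "-")

# Every space is replaced
str_replace_all(b, " ", "-")
```

The first call should return "Abc-def Abg" and the second "Abc-def-Abg".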
To extract the characters found at given positions in a string, use the str_sub() function. For example, to extract the characters between positions two and five (inclusive), type:
str_sub(b, start=2, end=5)
[1] "bc d"
If you don’t specify a start position, then all characters up to and including the end position will be returned. Likewise, if the end position is not specified, then all characters from the start position to the end of the string will be returned.
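The two defaults described above can be sketched as follows, using the same string as before. The last call also illustrates that str_sub() accepts negative positions, which count backwards from the end of the string.

```r
library(stringr)

b <- "Abc def Abg"

# No start position: everything up to and including position 5
str_sub(b, end = 5)

# No end position: everything from position 9 to the end of the string
str_sub(b, start = 9)

# Negative positions count from the end: the last three characters
str_sub(b, start = -3)
```

These calls should return "Abc d", "Abg" and "Abg" respectively.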
If you want to break a string up into individual components based on a character delimiter, use the str_split() function. For example, to split the following string into separate elements by comma, type the following:
g <- "Year:2000, Month:Jan, Day:23"
str_split(g, ",")
[[1]]
[1] "Year:2000" " Month:Jan" " Day:23"
The output is a one element list. If object g consists of more than one element, the output will be a list with as many elements as there are in g.
Depending on your workflow, you may need to convert the str_split output to an atomic vector. For example, if you want to find an element in the above str_split output that matches the string Year:2000, the following will return FALSE and not TRUE as one might expect:
"Year:2000" %in% str_split(g, ",")
[1] FALSE
The workaround is to convert the right-hand output to a single vector using the unlist function:
"Year:2000" %in% unlist(str_split(g, ","))
[1] TRUE
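Note that the split above keeps the space that follows each comma, so the second and third pieces start with a blank. The following sketch shows how this affects matching and one possible fix using str_trim(); the object name parts is introduced here for illustration.

```r
library(stringr)

g <- "Year:2000, Month:Jan, Day:23"
parts <- unlist(str_split(g, ","))

# The second piece is " Month:Jan" (leading space), so this is FALSE
"Month:Jan" %in% parts

# Trimming the pieces first makes the match succeed
"Month:Jan" %in% str_trim(parts)
```

Another option, not shown here, is to split on the comma and the space together by using ", " as the pattern.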
If you are applying the split function to a column of data from a data frame, you will want to use the function str_split_fixed instead. This function assumes that the number of components to be extracted via the split will be the same for each vector element. For example, the following vector, T1, has two time components that need to be extracted. The separator is a dash, -.
T1 <- c("9:30am-10:45am", "9:00am- 9:50am", "1:00pm- 2:15pm")
T1
[1] "9:30am-10:45am" "9:00am- 9:50am" "1:00pm- 2:15pm"
str_split_fixed(T1, "-", 2)
[,1] [,2]
[1,] "9:30am" "10:45am"
[2,] "9:00am" " 9:50am"
[3,] "1:00pm" " 2:15pm"
The third parameter in the str_split_fixed function is the number of pieces to return for each element. It also defines the dimensions of the output matrix (here, three rows, one per element of T1, and two columns). If you want to extract both times to separate vectors, reference the columns by index number:
T1.start <- str_split_fixed(T1, "-", 2)[ ,1]
T1.start
[1] "9:30am" "9:00am" "1:00pm"
T1.end <- str_split_fixed(T1, "-", 2)[ ,2]
T1.end
[1] "10:45am" " 9:50am" " 2:15pm"
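Two of the end times above carry a leading space inherited from the original strings. If that is unwanted, one option is to wrap the extraction in str_trim(), as sketched below.

```r
library(stringr)

T1 <- c("9:30am-10:45am", "9:00am- 9:50am", "1:00pm- 2:15pm")

# Extract the second column, then strip leading/trailing white spaces
T1.end <- str_trim(str_split_fixed(T1, "-", 2)[ , 2])
T1.end
```

This should return "10:45am" "9:50am" "2:15pm".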
You will want to use the column indexes if you are extracting strings in a data frame. For example:
dat <- data.frame( Time = c("9:30am-10:45am", "9:00am-9:50am", "1:00pm-2:15pm"))
dat$Start_time <- str_split_fixed(dat$Time, "-", 2)[ , 1]
dat$End_time <- str_split_fixed(dat$Time, "-", 2)[ , 2]
dat
Time Start_time End_time
1 9:30am-10:45am 9:30am 10:45am
2 9:00am-9:50am 9:00am 9:50am
3 1:00pm-2:15pm 1:00pm 2:15pm
To extract the three letter month from object g (defined in an earlier example), you can use a combination of stringr functions as in:
loc <- str_locate(g, "Month:")
str_sub(g, start = loc[,"end"] + 1, end = loc[,"end"]+3)
[1] "Jan"
The above chunk of code first identifies the position of the Month: string and passes its output to the object loc (a matrix). It then uses loc’s end position in the call to str_sub to extract the three characters making up the month abbreviation. The value 1 is added to the start parameter in str_sub so that the extraction begins just after the last character of Month: (recall that the str_locate positions are inclusive).
This can be extended to multi-element vectors as follows:
# Note the differences in spaces and string lengths between the vector
# elements.
gs <- c("Year:2000, Month:Jan, Day:23",
        "Year:345, Month:Mar, Day:30",
        "Year:1867 , Month:Nov, Day:5")
loc <- str_locate(gs, "Month:")
str_sub(gs, start = loc[,"end"] + 1, end = loc[,"end"]+3)
[1] "Jan" "Mar" "Nov"
Note the non-uniformity in each element’s length and in the position of Month:, which requires that we explicitly search for the Month: string position in each element. Had all elements been of equal length and format, we could have simply passed fixed position numbers to the str_sub function.
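To illustrate that last point, here is a minimal sketch using a hypothetical vector, gu, whose elements all share the same layout. The vector and its values are made up for this example. Since Month: always ends at position 17, the month abbreviation always occupies positions 18 through 20.

```r
library(stringr)

# Hypothetical vector: every element has the same length and format
gu <- c("Year:2000, Month:Jan, Day:23",
        "Year:1999, Month:Mar, Day:30")

# Fixed positions work because the layout is uniform
str_sub(gu, start = 18, end = 20)
```

This should return "Jan" "Mar". The approach breaks down as soon as one element deviates from the layout, which is why the str_locate based solution is the safer default.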