Data sets often include date and time variables. Handling dates and
times accurately can be complicated because of the variety of formats,
time-zone differences and leap years. In this section you will be
introduced to the basics of date manipulation (i.e., getting the current
date and time, converting strings to dates, extracting and modifying
date components, and date arithmetic) using Base R and lubridate
functions.
String/character manipulations are often overlooked in data analysis
because the focus typically remains on numeric values. However, the
growth in text mining has resulted in greater emphasis on handling,
cleaning and processing character strings. In the second part of this
module I will lay the foundation for working with characters by covering
string manipulation with Base R and stringr, and the set operations for
character strings.
In preparing this section, I relied heavily on our recommended textbooks (Boehmke (2016) and Wickham and Grolemund (2016)) and on the lubridate and stringr reference manuals.
The learning objectives of this module are as follows:
- Manipulate dates and times using Base R and lubridate functions.
- Manipulate character strings using Base R and stringr functions.

Base R has functions to get the current date and time. In addition, the
lubridate package offers fast and user-friendly parsing of date-time
data. In this section I will use both Base R and lubridate functions to
demonstrate date-time manipulations.
To get the current date and time information you can use the base R
functions Sys.timezone(), Sys.Date() and Sys.time():
# get time zone information
Sys.timezone()
## [1] "Australia/Sydney"
# get date information
Sys.Date()
## [1] "2024-05-01"
# get current time
Sys.time()
## [1] "2024-05-01 12:08:41 AEST"
You can also get the current date and time using lubridate functions:
#install.packages("lubridate")
library(lubridate)
# get current time using `lubridate`
now()
## [1] "2024-05-01 12:08:41 AEST"
When date and time data are imported into R, they will often default
to character strings (or to factors if you use the
stringsAsFactors = TRUE option). If this is the case, we need to
convert the strings to a proper date format.
To illustrate, let’s read in the candy production data (candy_production.csv):
candy <- read.csv("data/candy_production.csv")
head(candy)
## observation_date IPG3113N
## 1 1972-01-01 85.6945
## 2 1972-02-01 71.8200
## 3 1972-03-01 66.0229
## 4 1972-04-01 64.5645
## 5 1972-05-01 65.0100
## 6 1972-06-01 67.6467
str(candy$observation_date)
## chr [1:548] "1972-01-01" "1972-02-01" "1972-03-01" "1972-04-01" ...
The observation_date variable was read in as a character. To convert it
to a date format, we can use different strategies. The first is to
convert it using the as.Date() function in Base R.
candy$observation_date <- as.Date(candy$observation_date)
str(candy$observation_date)
## Date[1:548], format: "1972-01-01" "1972-02-01" "1972-03-01" "1972-04-01" "1972-05-01" ...
Note that the default date format is YYYY-MM-DD; therefore, if your
string is in a different format you must supply the format argument.
Dates can come in many formats; for a complete list of formatting code
options in R, type ?strftime in your console.
Have a look at these two examples:
x <- c("08/03/2018", "23/03/2016", "30/01/2018")
y <- c("08.03.2018", "23.03.2016", "30.01.2018")
This time the string format is DD/MM/YYYY for x and DD.MM.YYYY for y;
therefore, we need to specify the format argument explicitly.
x_date <- as.Date(x, format = "%d/%m/%Y")
x_date
## [1] "2018-03-08" "2016-03-23" "2018-01-30"
y_date <- as.Date(y, format = "%d.%m.%Y")
y_date
## [1] "2018-03-08" "2016-03-23" "2018-01-30"
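The format string has to describe the input exactly. As a small sketch (the variable names d and bad are illustrative and not part of the examples above), the snippet below shows that a mismatched format yields NA rather than an error, and that format() goes the other way, turning a Date back into a character string:

```r
# parse a DD/MM/YYYY string with the matching format
d <- as.Date("08/03/2018", format = "%d/%m/%Y")

# a format that does not match the string gives NA, not an error
bad <- as.Date("08/03/2018", format = "%Y-%m-%d")
is.na(bad)

# format() converts a Date back to a character string
format(d, "%d.%m.%Y")
format(d, "%Y")
```

Checking for NA after a conversion is a cheap way to catch rows whose dates were recorded in an unexpected format.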
The lubridate package, on the other hand, can automatically recognise
the common separators used when recording dates (-, /, ., and "", i.e.
no separator at all). As a result, you only need to focus on specifying
the order of the date elements to determine the parsing function
applied. Here is the list of lubridate functions used for this purpose:
Function | Order of elements in date-time |
---|---|
ymd() | year, month, day |
ydm() | year, day, month |
mdy() | month, day, year |
dmy() | day, month, year |
hm() | hour, minute |
hms() | hour, minute, second |
ymd_hms() | year, month, day, hour, minute, second |
If the strings are in different formats like the following, the lubridate functions can easily handle these.
z <- c("08.03.2018", "29062017", "23/03/2016", "30-01-2018")
z <- dmy(z)
z
## [1] "2018-03-08" "2017-06-29" "2016-03-23" "2018-01-30"
As seen above, even though we used different separators within the same
vector, the dmy() function was able to parse every date correctly.
Sometimes, instead of a single string, we will have the individual
components of the date-time spread across multiple columns. Remember the
flights data in the nycflights13 package.
library(nycflights13)
library(dplyr) # provides %>%, select() and mutate() used below
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
This data frame includes 19 variables. For date manipulations, we will
use only the year, month, day, hour and minute columns.
flights_new <- flights %>%
  dplyr::select(year, month, day, hour, minute)
head(flights_new)
## # A tibble: 6 × 5
## year month day hour minute
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 5 15
## 2 2013 1 1 5 29
## 3 2013 1 1 5 40
## 4 2013 1 1 5 45
## 5 2013 1 1 6 0
## 6 2013 1 1 5 58
As seen in the output, the components of the date information are
given in multiple columns. To create a date/time from this sort of
input, we can use make_date() for dates and make_datetime() for
date-times.
flights_new <- flights_new %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))
head(flights_new)
## # A tibble: 6 × 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5 15 2013-01-01 05:15:00
## 2 2013 1 1 5 29 2013-01-01 05:29:00
## 3 2013 1 1 5 40 2013-01-01 05:40:00
## 4 2013 1 1 5 45 2013-01-01 05:45:00
## 5 2013 1 1 6 0 2013-01-01 06:00:00
## 6 2013 1 1 5 58 2013-01-01 05:58:00
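make_date() is mentioned above but not demonstrated. A minimal sketch with literal values (the object names d and dt are illustrative) shows the difference between the two helpers: one returns a Date, the other a date-time whose time zone defaults to UTC.

```r
library(lubridate)

# build a Date from numeric components
d <- make_date(year = 2013, month = 1, day = 1)
class(d)

# build a date-time; the time zone defaults to UTC
dt <- make_datetime(year = 2013, month = 1, day = 1, hour = 5, min = 15)
dt
```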
Now, let’s explore functions that let us get and set individual
components of date and time.
We can extract individual parts of the date with the accessor
functions in lubridate. Here is the list of available functions:
Accessor Function | Extracts |
---|---|
year() | year |
month() | month |
mday() | day of the month |
yday() | day of the year |
wday() | day of the week |
hour() | hour |
minute() | minute |
second() | second |
For example, to extract the year information of the
flights_new$departure column we can use:
flights_new$departure %>% year() %>% head()
## [1] 2013 2013 2013 2013 2013 2013
For month() and wday() we can set the label = TRUE argument to return
the abbreviated name of the month or day of the week. We can also set
abbr = FALSE to return the full name:
flights_new$departure %>% month(label = TRUE, abbr = TRUE) %>% head()
## [1] Jan Jan Jan Jan Jan Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
flights_new$departure %>% month(label = TRUE, abbr = FALSE) %>% head()
## [1] January January January January January January
## 12 Levels: January < February < March < April < May < June < ... < December
We can also use each accessor function to set the components of a date/time:
# create a date
datetime <- ymd_hms("2016-07-08 12:34:56")
#replace the year component with 2020
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"
# replace the month component with Jan
month(datetime) <- 01
datetime
## [1] "2020-01-08 12:34:56 UTC"
# add one hour
hour(datetime) <- hour(datetime) + 1
datetime
## [1] "2020-01-08 13:34:56 UTC"
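The day-based accessors from the table are easy to confuse. The sketch below applies them to one fixed date (chosen here purely for illustration); week_start = 7 is passed explicitly so that Sunday counts as day 1 regardless of any local option setting.

```r
library(lubridate)

d <- ymd("2016-07-08")   # a Friday

mday(d)                  # day of the month
yday(d)                  # day of the year (2016 is a leap year)
wday(d, week_start = 7)  # day of the week, counting Sunday as day 1
```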
Often, we need to compute a new variable from date-time information. In this section, you will learn how to create a sequence of dates and how arithmetic with dates works (including subtraction, addition, and division).
For example, to create a sequence we can use the seq() function,
specifying some of its arguments from, to, by, and length.out.
# create a sequence of years from 1980 to 2018 by 2
even_years <- seq(from = 1980, to=2018, by = 2)
even_years
## [1] 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008
## [16] 2010 2012 2014 2016 2018
This can be applied for days, months, minutes, seconds, etc.
hour_list <- seq(ymd_hm("2018-1-1 9:00"), ymd_hm("2018-1-1 12:00"), by = "hour")
hour_list
## [1] "2018-01-01 09:00:00 UTC" "2018-01-01 10:00:00 UTC"
## [3] "2018-01-01 11:00:00 UTC" "2018-01-01 12:00:00 UTC"
month_list <- seq(ymd_hm("2018-1-1 9:00"), ymd_hm("2018-12-1 9:00"), by = "month")
month_list
## [1] "2018-01-01 09:00:00 UTC" "2018-02-01 09:00:00 UTC"
## [3] "2018-03-01 09:00:00 UTC" "2018-04-01 09:00:00 UTC"
## [5] "2018-05-01 09:00:00 UTC" "2018-06-01 09:00:00 UTC"
## [7] "2018-07-01 09:00:00 UTC" "2018-08-01 09:00:00 UTC"
## [9] "2018-09-01 09:00:00 UTC" "2018-10-01 09:00:00 UTC"
## [11] "2018-11-01 09:00:00 UTC" "2018-12-01 09:00:00 UTC"
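The length.out argument is listed above but not used in the examples. As a hedged sketch (week_list is an illustrative name), it can replace the end point when you know how many values you want rather than where the sequence stops:

```r
library(lubridate)

# four weekly dates starting on 1 January 2018
week_list <- seq(ymd("2018-01-01"), by = "week", length.out = 4)
week_list
```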
In R, when you subtract two dates, you get a time difference object
(a.k.a. difftime in R). To illustrate, let’s calculate my age using:
my_age <- today() - ymd(19810529)
my_age
## Time difference of 15678 days
Or, equivalently we can use:
difftime(today(), ymd(19810529))
## Time difference of 15678 days
As seen in the output, subtracting two date-time objects gives an
object of the time difference class. To express the time difference in
another unit we can use the units argument:
difftime(today(), ymd(19810529), units = "weeks")
## Time difference of 2239.714 weeks
Note that only seconds, minutes, hours, days, and weeks are supported. Units larger than weeks are not used due to their variability.
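Because difftime stops at weeks, an age in years has to be derived by hand. One possible sketch, assuming an average year length of 365.25 days (an approximation, not an exact calendar calculation) and using fixed dates so that the result does not depend on today():

```r
library(lubridate)

# fixed dates so that the result does not depend on today()
age_days <- difftime(ymd("2024-05-01"), ymd("1981-05-29"), units = "days")

# strip the difftime class, then convert days to approximate years
age_years <- as.numeric(age_days) / 365.25
floor(age_years)
```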
Logical comparisons are also available for date-time variables.
your_age <- today() - ymd(19890101)
your_age
## Time difference of 12904 days
your_age == my_age
## [1] FALSE
your_age < my_age
## [1] TRUE
We can also deal with time intervals/differences by using the
duration functions in lubridate. Durations simply measure the time span
between start and end dates. lubridate provides a simple syntax to
calculate durations in the desired unit of measurement (seconds,
minutes, hours, etc.).
It should be noted that the lubridate package uses seconds as the unit
of calculation. Therefore, durations always record the time span in
seconds. Larger units are created by converting minutes, hours, days,
weeks, and years to seconds at the standard rate (60 seconds in a
minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, and
365.25 days in a year, as the dyears(1) output below shows).
# create a new duration (represented in seconds)
duration(1)
## [1] "1s"
# create duration for minutes
dminutes(1)
## [1] "60s (~1 minutes)"
# create duration for hours
dhours(1)
## [1] "3600s (~1 hours)"
# create duration for years
dyears(1)
## [1] "31557600s (~1 years)"
# add/subtract duration from date/time object
x <- ymd_hms("2015-09-22 12:00:00")
x + dhours(10)
## [1] "2015-09-22 22:00:00 UTC"
x + dhours(10) + dminutes(33) + dseconds(54)
## [1] "2015-09-22 22:33:54 UTC"
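Since durations are stored in seconds, they can also be divided by one another to express a span in a chosen unit. A small sketch of that idea (span is an illustrative name):

```r
library(lubridate)

# dividing one duration by another gives a plain number
dminutes(90) / dhours(1)

# add durations, then express the total in hours
span <- ddays(1) + dhours(12)
span / dhours(1)
```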
This section includes how to create, convert and print character strings along with how to count the number of elements and characters in a string.
The most basic way to create a string is to enclose text in quotation marks and assign it to an object, just as you would assign a number:
a <- "MATH2349" # create string a
b <- "is awesome" # create string b
The paste() function in Base R is used for creating and building
strings. It takes one or more R objects, converts them to character, and
then concatenates (pastes) them to form one or several character
strings.
Here are some examples of the paste() function:
# paste together string a & b
paste(a, b)
## [1] "MATH2349 is awesome"
# paste character and number strings (converts numbers to character class)
paste("The life of", pi)
## [1] "The life of 3.14159265358979"
# paste multiple strings
paste("I", "love", "Data Preprocessing")
## [1] "I love Data Preprocessing"
# paste multiple strings with a separating character
paste("I", "love", "Data", "Preprocessing", sep = "-")
## [1] "I-love-Data-Preprocessing"
# use paste0() to paste without spaces between characters
paste0("I", "love", "Data", "Preprocessing")
## [1] "IloveDataPreprocessing"
# paste objects with different lengths
paste("R", 1:5, sep = " v1.")
## [1] "R v1.1" "R v1.2" "R v1.3" "R v1.4" "R v1.5"
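paste() also has a collapse argument, which is used later in this module without introduction. As a brief sketch (courses and course_list are illustrative names): sep works across the arguments and keeps a vector, while collapse glues the elements of the result into a single string.

```r
courses <- c("MATH2349", "MATH1324")

# sep joins the arguments element by element and keeps a vector
paste("Course:", courses, sep = " ")

# collapse glues the elements of the result into one string
course_list <- paste(courses, collapse = ", ")
course_list
```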
Sorting character strings is very simple using the sort() function:
a <- c("MATH2349", "MATH1324")
sort(a)
## [1] "MATH1324" "MATH2349"
Similar to numerics, strings and characters can be tested with
is.character(), and any other data format can be converted into
string/character with as.character() or with toString().
a <- "The life of"
b <- pi
is.character(a)
## [1] TRUE
is.character(b)
## [1] FALSE
c <- as.character(b)
is.character(c)
## [1] TRUE
toString(c("Jul", 25, 2017))
## [1] "Jul, 25, 2017"
Printing strings/characters can be done with the following
functions:
Function | Usage |
---|---|
print() | generic printing |
noquote() | print with no quotes |
cat() | concatenate and print with no quotes |
The primary printing function in R is print().
# basic printing
a <- "MATH2349 is awesome"
print(a)
## [1] "MATH2349 is awesome"
# print without quotes
print(a, quote = FALSE)
## [1] MATH2349 is awesome
# alternative to print without quotes
noquote(a)
## [1] MATH2349 is awesome
The cat() function allows us to concatenate objects and print them
either on screen or to a file. The output is very similar to that of
noquote(); however, cat() does not print the numeric line indicator. As
a result, cat() can be useful for printing nicely formatted responses to
users.
# basic printing (similar to noquote)
cat(a)
## MATH2349 is awesome
# combining character string
cat(a, "and I love R")
## MATH2349 is awesome and I love R
# basic printing of alphabet
cat(letters)
## a b c d e f g h i j k l m n o p q r s t u v w x y z
# specify a separator between the combined characters
cat(letters, sep = "-")
## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z
# collapse the space between the combined characters
cat(letters, sep = "")
## abcdefghijklmnopqrstuvwxyz
You can also format the line width for printing long strings using
the fill argument:
x <- "Today I am learning how to manipulate strings."
y <- "Tomorrow I plan to work on my assignment."
z <- "The day after I will take a break and drink a beer :)"
# No breaks between lines
cat(x, y, z, fill = FALSE)
## Today I am learning how to manipulate strings. Tomorrow I plan to work on my assignment. The day after I will take a break and drink a beer :)
# Breaks between lines
cat(x, y, z, fill = TRUE)
## Today I am learning how to manipulate strings.
## Tomorrow I plan to work on my assignment.
## The day after I will take a break and drink a beer :)
To count the number of elements in a character vector use length():
length("How many elements are in this string?")
## [1] 1
length(c("How", "many", "elements", "are", "in", "this", "string?"))
## [1] 7
To count the number of characters in a string use nchar():
nchar("How many characters are in this string?")
## [1] 39
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1] 3 4 10 3 2 4 7
Basic string manipulation typically includes case conversion, simple character replacement, abbreviating, substring replacement, adding/removing white space, and performing set operations to compare similarities and differences between two character vectors.
These operations can all be performed with base R functions; however,
some operations are greatly simplified with the stringr package.
Therefore, after illustrating base R string manipulation for case
conversion, simple character replacement, abbreviating, and substring
replacement, we will switch to the stringr package to cover many of the
other fundamental string manipulation tasks.
To convert all upper case characters to lower case we will use
tolower():
a <- "MATH2349 is AWesomE"
tolower(a)
## [1] "math2349 is awesome"
To convert all lower case characters to upper case we will use
toupper():
toupper(x)
## [1] "TODAY I AM LEARNING HOW TO MANIPULATE STRINGS."
To replace a character (or multiple characters) in a string we can
use chartr():
# replace 'A' with 'a'
x <- "This is A string."
chartr(old = "A", new = "a", x)
## [1] "This is a string."
# multiple character replacements
# replace any 'd' with 't' and any 'z' with 'a'
y <- "Tomorrow I plzn do lezrn zbout dexduzl znzlysis."
chartr(old = "dz", new = "ta", y)
## [1] "Tomorrow I plan to learn about textual analysis."
Note that chartr() replaces every occurrence of the specified
characters, so use it only when you are certain that you want to change
every occurrence of those letters.
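When only whole substrings should change, base R's sub() and gsub() are a common alternative. They are not covered in this module, so treat the following as a side-by-side sketch rather than part of the original material:

```r
x <- "cat and cat"

# chartr() maps single characters: every 'c' becomes 'b'
chartr(old = "c", new = "b", x)

# sub() replaces only the first match of a pattern
sub("cat", "dog", x)

# gsub() replaces every match
gsub("cat", "dog", x)
```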
To abbreviate strings we can use abbreviate():
streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")
# default abbreviations
abbreviate(streets)
## Victoria Yarra Russell Williams Swanston
## "Vctr" "Yarr" "Rssl" "Wllm" "Swns"
# set minimum length of abbreviation
abbreviate(streets, minlength = 2)
## Victoria Yarra Russell Williams Swanston
## "Vc" "Yr" "Rs" "Wl" "Sw"
To extract or replace substrings in a character vector there are two
primary base R functions to use: substr() and strsplit().
The purpose of substr() is to extract and replace substrings with
specified starting and stopping characters. Here are some examples of
substr() usage:
alphabet <- paste(LETTERS, collapse = "")
alphabet
## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# extract 18th character in alphabet
substr(alphabet, start = 18, stop = 18)
## [1] "R"
# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)
## [1] "RSTUVWX"
# replace 19-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet
## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
To split the elements of a character string we can use strsplit().
Here are some examples:
z <- "The day after I will take a break and drink a beer :)"
strsplit(z, split = " ")
## [[1]]
## [1] "The" "day" "after" "I" "will" "take" "a" "break" "and"
## [10] "drink" "a" "beer" ":)"
a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-")
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
Note that the output of strsplit() is a list. To convert the output to
a simple atomic vector, wrap it in unlist():
unlist(strsplit(a, split = "-"))
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
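Two details of strsplit() are worth a quick sketch: the split argument is interpreted as a regular expression unless fixed = TRUE is set, and a vector input gives one list element per string. The codes vector below is made up for illustration.

```r
codes <- c("MATH.2349", "MATH.1324")

# split is treated as a regular expression by default, where "." matches
# any character; fixed = TRUE makes it a literal full stop
parts <- strsplit(codes, split = ".", fixed = TRUE)

# one list element per input string; pick the second piece of each
numbers <- sapply(parts, `[`, 2)
numbers
```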
There are also base R functions that allow for assessing the set union, intersection, difference, equality, and membership of two vectors.
To obtain the elements of the union between two character vectors we
can use union():
set_1 <- c("lagunitas", "bells", "dogfish", "summit", "odell")
set_2 <- c("sierra", "bells", "harpoon", "lagunitas", "founders")
union(set_1, set_2)
## [1] "lagunitas" "bells" "dogfish" "summit" "odell" "sierra"
## [7] "harpoon" "founders"
To obtain the common elements of two character vectors we can use
intersect().
intersect(set_1, set_2)
## [1] "lagunitas" "bells"
To obtain the non-common elements, or the difference, of two
character vectors we can use setdiff().
# returns elements in set_1 not in set_2
setdiff(set_1, set_2)
## [1] "dogfish" "summit" "odell"
# returns elements in set_2 not in set_1
setdiff(set_2, set_1)
## [1] "sierra" "harpoon" "founders"
To test if two vectors contain the same elements regardless of order
we can use setequal():
set_3 <- c("VIC", "NSW", "TAS")
set_4 <- c("WA", "SA", "NSW")
set_5 <- c("NSW", "SA", "WA")
setequal(set_3, set_4)
## [1] FALSE
setequal(set_4, set_5)
## [1] TRUE
We can use identical() to test if two character vectors are equal in
content and order.
set_6 <- c("VIC", "NSW", "TAS")
set_7 <- c("NSW", "VIC", "TAS")
set_8 <- c("VIC", "NSW", "TAS")
identical(set_6, set_7)
## [1] FALSE
identical(set_6, set_8)
## [1] TRUE
To test if an element is contained within a character vector use
is.element() or %in%. Here are some examples:
set_6 <- c("VIC", "NSW", "TAS")
set_7 <- c("NSW", "VIC", "TAS")
set_8 <- c("VIC", "NSW", "TAS")
is.element("VIC", set_8)
## [1] TRUE
"VIC" %in% set_8
## [1] TRUE
"WA" %in% set_8
## [1] FALSE
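%in% is vectorised over its left-hand side, which makes it handy for filtering. A short sketch with illustrative vectors (states, query and found are not from the examples above):

```r
states <- c("VIC", "NSW", "TAS")
query  <- c("VIC", "WA", "NSW")

# one TRUE/FALSE per element of the left-hand vector
found <- query %in% states
found

# use the logical vector to keep only the matching elements
query[found]
```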
The stringr package was developed by Hadley Wickham to provide a
consistent and simple wrapper to common string operations. Before using
these functions, we need to install and load the stringr package.
#install.packages("stringr")
library(stringr)
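There are three string functions that are closely related to their base R equivalents, but with a few enhancements. They are:
- str_c()
- str_length()
- str_sub()

str_c() is equivalent to the paste() functions in Base R: by default it
joins without a separator, like paste0(), and with sep = " " it behaves
like paste().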
# same as paste0()
str_c("Learning", "to", "use", "the", "stringr", "package")
## [1] "Learningtousethestringrpackage"
# same as paste()
str_c("Learning", "to", "use", "the", "stringr", "package", sep = " ")
## [1] "Learning to use the stringr package"
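One of the enhancements concerns missing values. The sketch below (p and s are illustrative names) contrasts the two functions: paste() converts NA to the text "NA", whereas str_c() returns a missing value.

```r
library(stringr)

# paste() turns NA into the string "NA"
p <- paste("MATH", NA)

# str_c() propagates the missing value instead
s <- str_c("MATH", NA)

p
s
```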
str_length() is similar to the nchar() function in Base R:
# some text
text <- c("Learning", "to", NA, "use", "the", NA, "stringr", "package")
# `nchar()`
nchar(text)
## [1] 8 2 NA 3 3 NA 7 7
# `str_length()` same as above
str_length(text)
## [1] 8 2 NA 3 3 NA 7 7
As seen above, both the str_length() and nchar() functions count the
number of characters for non-missing values. For missing values, they
return NA.
You can access individual characters using str_sub(). The str_sub()
function is equivalent to the substr() function in Base R and takes
three arguments: a character vector, a start position and an end
position. Each position can be either a positive integer, which counts
from the left, or a negative integer, which counts from the right. Note
that the positions are inclusive and, if longer than the string, will be
silently truncated.
# Some text
y <- c("abcdef", "ghifjk")
# The 3rd letter
str_sub(y, 3, 3)
## [1] "c" "i"
# The 2nd to 2nd-to-last character
str_sub(y, 2, -2)
## [1] "bcde" "hifj"
You can also use str_sub() to modify strings:
# Change the third letter with X
str_sub(y, 3, 3) <- "X"
y
## [1] "abXdef" "ghXfjk"
stringr also provides str_dup() for duplicating character strings
(base R offers the similar strrep() function from R 3.3.0 onwards).
str_dup("apples", times = 4)
## [1] "applesapplesapplesapples"
str_dup("apples", times = 1:4)
## [1] "apples" "applesapples"
## [3] "applesapplesapples" "applesapplesapplesapples"
In string processing, a common task is parsing text into individual
words. Often, this results in words having blank spaces (white spaces)
on either end of the word. The str_trim() function can be used to remove
these spaces. Here are some examples:
text <- c("Text ", " with", " whitespace ", " on", "both ", " sides ")
text
## [1] "Text " " with" " whitespace " " on" "both "
## [6] " sides "
# remove white spaces on the left side
str_trim(text, side = "left")
## [1] "Text " "with" "whitespace " "on" "both "
## [6] "sides "
# remove white spaces on the right side
str_trim(text, side = "right")
## [1] "Text" " with" " whitespace" " on" "both"
## [6] " sides"
# remove white spaces on both sides
str_trim(text, side = "both")
## [1] "Text" "with" "whitespace" "on" "both"
## [6] "sides"
To add white space, or to pad a string, we will use str_pad(). We can
also use str_pad() to pad a string with specified characters. The width
argument gives the width of the padded strings and the pad argument
specifies the padding character. Here are some examples:
str_pad("apples", width = 10, side = "left")
## [1] "    apples"
str_pad("apples", width = 10, side = "both")
## [1] "  apples  "
str_pad("apples", width = 10, side = "right", pad = "!")
## [1] "apples!!!!"
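A typical use of the pad argument is bringing identifiers to a common width. A hedged sketch with made-up values (ids and padded are illustrative names):

```r
library(stringr)

ids <- c("7", "42", "123")

# pad on the left (the default side) with zeros to a common width of 5
padded <- str_pad(ids, width = 5, pad = "0")
padded
```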
Similar to the tolower() and toupper() functions in Base R, stringr
also has upper/lower case transformation functions:
# Some text
x <- "I like VeGGies."
# convert all upper-case characters to lower case
str_to_lower(x)
## [1] "i like veggies."
#convert all lower-case characters to upper case
str_to_upper(x)
## [1] "I LIKE VEGGIES."
# convert only first letters of words to upper case
str_to_title(x)
## [1] "I Like Veggies."
String ordering and sorting can be done using the str_order() and
str_sort() functions.
# some text
x <- c("y", "i", "k")
# return the index of ordered values
str_order(x)
## [1] 2 3 1
Note that str_order() returns the indices that would sort the vector,
not the sorted values themselves.
# sort characters
str_sort(x)
## [1] "i" "k" "y"
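Both functions accept a numeric argument that changes how digits inside strings are compared. The sketch below uses a made-up vector (files) to show the difference from plain alphabetical sorting and how str_order() relates to str_sort():

```r
library(stringr)

files <- c("x10", "x2", "x1")

# plain alphabetical sorting compares character by character
str_sort(files)

# numeric = TRUE compares runs of digits as numbers
str_sort(files, numeric = TRUE)

# indexing with str_order() gives the same result as str_sort()
files[str_order(files, numeric = TRUE)]
```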
Most string manipulations require pattern matching for a given text.
The good news is that the stringr package has many specialized pattern
matching functions to detect, subset, locate, count, extract, and
replace strings. Here, I will demonstrate the commonly used pattern
matching functions in stringr. For more information on other useful
pattern matching functions, please refer to the
[stringr vignette](https://stringr.tidyverse.org/articles/stringr.html).
Note that each pattern matching function described below has the same
first two arguments: a character vector of strings to process and a
single pattern to match, specified by the pattern = argument.
str_detect() detects the presence or absence of a pattern and returns
a logical vector. Here is an example of its usage:
# detects pattern "ea"
x <- c("apple", "banana", "pear")
str_detect(x, pattern ="ea")
## [1] FALSE FALSE TRUE
#same as above
str_detect(x, "ea")
## [1] FALSE FALSE TRUE
While matching patterns, one can also use regular expressions. Regular
expressions (a.k.a. regexes) are a language that allows you to describe
patterns in strings. They take a little while to get your head around,
but once you understand them, you’ll find them extremely useful.
# Same as above using regex
x <- c("apple", "banana", "pear")
str_detect(x, regex("ea"))
## [1] FALSE FALSE TRUE
You can also perform a case-insensitive match by setting
ignore_case = TRUE inside regex():
bananas <- c("banana", "Banana", "BANANA")
#case sensitive match
str_detect(bananas, "banana")
## [1] TRUE FALSE FALSE
#case insensitive match
str_detect(bananas, regex("banana", ignore_case = TRUE))
## [1] TRUE TRUE TRUE
With regex, you can create your own character classes using [ ]. For
example:
- [abc]: matches a, b, or c.
- [a-z]: matches every character between a and z (in Unicode code point order).
- [^abc]: matches anything except a, b, or c.
- [\^\-]: matches ^ or -.

There are a number of pre-built classes that you can use inside [ ]:
- [:punct:]: punctuation.
- [:alpha:]: letters.
- [:lower:]: lowercase letters.
- [:upper:]: uppercase letters.
- [:digit:]: digits.
- [:xdigit:]: hex digits.
- [:alnum:]: letters and numbers.
- [:cntrl:]: control characters.
- [:graph:]: letters, numbers, and punctuation.
- [:print:]: letters, numbers, punctuation, and white space.
- [:space:]: space characters (basically equivalent to \s).
- [:blank:]: space and tab.

Here are some examples:
emails <- c("s123546@student.rmit.edu.au", "sona.taheri@rmit.edu.au", "s2342565@rmit.edu.vn")
#detect the emails containing numbers
str_detect(emails, "[:digit:]")
## [1] TRUE FALSE TRUE
#detect the emails containing lowercase letters
str_detect(emails, "[:lower:]")
## [1] TRUE TRUE TRUE
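The hand-written classes from the first list can be tried in the same way. A small sketch on a made-up vector, showing a range and a negated class:

```r
library(stringr)

x <- c("abc", "abd", "123")

# [a-z]: does the string contain at least one lowercase letter?
has_lower <- str_detect(x, "[a-z]")
has_lower

# [^abc]: does the string contain any character other than a, b or c?
has_other <- str_detect(x, "[^abc]")
has_other
```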
For more information on the regex capabilities, please refer to the
[regular expressions vignette](https://stringr.tidyverse.org/articles/regular-expressions.html)
of the stringr package.
str_subset() returns the elements of a character vector that match a
regular expression.
# subset emails that contain numbers
str_subset(emails, "[:digit:]")
## [1] "s123546@student.rmit.edu.au" "s2342565@rmit.edu.vn"
str_extract() extracts text corresponding to the first match,
returning a character vector.
# extract the digits 1234
str_extract("Let's extract the digits 1234", pattern = "1234")
## [1] "1234"
# extract the digits 23 in emails
str_extract(emails, pattern = "23")
## [1] "23" NA "23"
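Extraction becomes more useful when combined with a character class. The sketch below assumes the + quantifier ("one or more"), which is not introduced above, and also uses str_extract_all(), the stringr variant that returns every match rather than just the first; the input strings are illustrative.

```r
library(stringr)

ids <- c("s123546@student.rmit.edu.au", "s2342565@rmit.edu.vn")

# "+" means "one or more", so this grabs the first whole run of digits
first_run <- str_extract(ids, "[:digit:]+")
first_run

# str_extract_all() returns every match, as a list with one element per string
all_runs <- str_extract_all("a1b22c333", "[:digit:]+")
all_runs[[1]]
```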
str_locate() locates the first position of a pattern and returns a
numeric matrix with columns start and end, whereas str_locate_all()
locates all positions of a pattern and returns a list of numeric
matrices.
# locate the first i in the string
str_locate("Locate the first i in this string", "i")
## start end
## [1,] 13 13
# locate all the i's in this string
str_locate_all("Locate all the i's in this string", "i")
## [[1]]
## start end
## [1,] 16 16
## [2,] 20 20
## [3,] 25 25
## [4,] 31 31
# locate all the full stops (\\.) in this string
str_locate_all("Full stop separates two sentences. Now I will locate all full stops.", "\\.")
## [[1]]
## start end
## [1,] 34 34
## [2,] 68 68
str_count() counts the number of matches of a pattern in a given
string.
# Counts the digits
str_count("90 Dollars", "[:digit:]")
## [1] 2
# Counts the letters
str_count("90 Dollars", "[:alpha:]")
## [1] 7
str_replace() replaces the first match of a pattern with another
string. The pattern argument gives the string that is going to be
replaced and the replacement argument specifies the replacement string.
# replace Dollars with AUD
str_replace("90 Dollars", pattern = "Dollars", replacement = "AUD")
## [1] "90 AUD"
str_replace_all() replaces all matches.
# replace all l's with "" (delete l's)
str_replace_all("Hello world", pattern = "l", replacement = "")
## [1] "Heo word"
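The difference between the two functions is easiest to see on the same input. A minimal sketch (first_only and every are illustrative names):

```r
library(stringr)

# str_replace() changes only the first match in each string
first_only <- str_replace("Hello world", pattern = "l", replacement = "L")
first_only

# str_replace_all() changes every match
every <- str_replace_all("Hello world", pattern = "l", replacement = "L")
every
```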
For more information on the lubridate and stringr packages and their
available functions, you can refer to the
[lubridate package manual](https://cran.r-project.org/web/packages/lubridate/lubridate.pdf),
the [stringr package manual](https://cran.r-project.org/web/packages/stringr/stringr.pdf)
and its vignette.
For regular expressions, refer to the [regular expressions vignette](https://stringr.tidyverse.org/articles/regular-expressions.html).
Our recommended textbooks (Boehmke (2016) and Wickham and Grolemund (2016)) are great resources for the basics of date and character manipulations. If you want to learn more on the high-level text manipulations and text mining, you may refer to “Automated Data Collection with R: A practical guide to web scraping and text mining” (by Munzert et al. (2014)).