• Overview
    • Summary
    • Learning Objectives
  • Getting current date and time
  • Converting strings to dates
  • Extract & manipulate parts of dates
  • Date arithmetic
  • Dealing with Characters/Strings
  • Character string basics
    • Creating Strings
    • Converting to Strings
    • Printing Strings
    • Concatenating strings
    • Counting string elements and characters
  • String manipulation with Base R
    • Upper/lower case conversion
    • Simple Character Replacement
    • String Abbreviations
    • Extract/Replace Substrings
    • Set operations for character strings
  • String manipulation with stringr
    • Basic operations
    • Duplicate Characters within a String
    • Remove Leading and Trailing White space
    • Pad a String with White space
    • Upper/lower case conversion
    • String ordering and sorting
    • Pattern matching (Pattern detect, subset, locate, count, extract, and replace)
  • Additional Resources and Further Reading
  • References

Overview

Summary

The data sets we deal with often include date and time variables. Handling these variables accurately can be complicated because of the variety of formats, time-zone differences and leap years. In this section you will be introduced to the basics of date manipulation (i.e., getting the current date and time, converting strings to dates, extracting and manipulating parts of dates, and date arithmetic) using Base R and lubridate functions.

String/character manipulations are often overlooked in data analysis because the focus typically remains on numeric values. However, the growth in text mining has placed greater emphasis on handling, cleaning and processing character strings. In the second part of this module I will lay the foundation for working with characters by covering string manipulation with Base R and stringr, and the set operations for character strings.

In preparing this section, I drew heavily on our recommended textbooks (Boehmke (2016) and Wickham and Grolemund (2016)) and on the lubridate and stringr reference manuals.

Learning Objectives


The learning objectives of this module are as follows:

  • Apply basic date-time manipulations using Base R functions.
  • Apply basic date-time manipulations using lubridate functions.
  • Learn basic string manipulations using Base R functions.
  • Learn basic string manipulations using stringr functions.

Getting current date and time

Base R has functions to get the current date and time. Also, the lubridate package offers fast and user-friendly parsing of date-time data. In this section I will use both Base R and lubridate functions to demonstrate date-time manipulations.

To get the current time zone, date and time, you can use the Base R functions Sys.timezone(), Sys.Date() and Sys.time():

# get time zone information

Sys.timezone()
## [1] "Australia/Sydney"
# get date information

Sys.Date()
## [1] "2024-05-01"
# get current time

Sys.time()
## [1] "2024-05-01 12:08:41 AEST"

You may also get the same information using the lubridate functions:

#install.packages("lubridate")
library(lubridate)
# get current time using `lubridate`

now()
## [1] "2024-05-01 12:08:41 AEST"

Converting strings to dates

When date and time data are imported into R, they will often default to character strings (or to factors if you use the stringsAsFactors = TRUE option). If this is the case, we need to convert the strings to a proper date format.

To illustrate, let’s read in the candy production data, which is available in candy_production.csv:

candy <- read.csv("data/candy_production.csv")
head(candy)
##   observation_date IPG3113N
## 1       1972-01-01  85.6945
## 2       1972-02-01  71.8200
## 3       1972-03-01  66.0229
## 4       1972-04-01  64.5645
## 5       1972-05-01  65.0100
## 6       1972-06-01  67.6467
str(candy$observation_date)
##  chr [1:548] "1972-01-01" "1972-02-01" "1972-03-01" "1972-04-01" ...

The observation_date variable was read in as a character vector. There are several strategies for converting it to a date format. The first is to use the as.Date() function from Base R.

candy$observation_date <- as.Date(candy$observation_date)
str(candy$observation_date)
##  Date[1:548], format: "1972-01-01" "1972-02-01" "1972-03-01" "1972-04-01" "1972-05-01" ...

Note that the default date format is YYYY-MM-DD; therefore, if your string is of different format you must incorporate the format argument. There are multiple formats that dates can be in; for a complete list of formatting code options in R type ?strftime in your console.

Have a look at these two examples:

x <- c("08/03/2018", "23/03/2016", "30/01/2018")
y <- c("08.03.2018", "23.03.2016", "30.01.2018")

This time the string format is DD/MM/YYYY for x and DD.MM.YYYY for y; therefore, we need to specify the format argument explicitly.

x_date <- as.Date(x, format = "%d/%m/%Y")
x_date
## [1] "2018-03-08" "2016-03-23" "2018-01-30"
y_date <- as.Date(y, format = "%d.%m.%Y")
y_date
## [1] "2018-03-08" "2016-03-23" "2018-01-30"
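
One caveat is worth a quick sketch: as.Date() does not raise an error when the format argument does not match the string. Assuming the same kind of DD/MM/YYYY input as above, reading it with a month-first format either returns NA or silently produces the wrong date (the object names right and wrong are only illustrative):

```r
x <- c("08/03/2018", "23/03/2016")

# Correct: day first, then month
right <- as.Date(x, format = "%d/%m/%Y")
right
## [1] "2018-03-08" "2016-03-23"

# Wrong: month first. "08/03" becomes 3 August, and month 23 does not exist
wrong <- as.Date(x, format = "%m/%d/%Y")
wrong
## [1] "2018-08-03" NA

# format() goes the other way, from a Date back to a string
format(right, "%d/%m/%Y")
## [1] "08/03/2018" "23/03/2016"
```

It is therefore a good habit to check for unexpected NA values, and to spot-check a few converted dates, after every conversion.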

The lubridate package, on the other hand, automatically recognises the common separators used when recording dates (-, /, ., and no separator at all). As a result, you only need to know the order of the date elements to choose the right parsing function. Here is the list of lubridate functions used for this purpose:

| Function | Order of elements in date-time |
|---|---|
| ymd() | year, month, day |
| ydm() | year, day, month |
| mdy() | month, day, year |
| dmy() | day, month, year |
| hm() | hour, minute |
| hms() | hour, minute, second |
| ymd_hms() | year, month, day, hour, minute, second |
If the strings are in different formats like the following, the lubridate functions can easily handle these.

z <- c("08.03.2018", "29062017", "23/03/2016", "30-01-2018")
z <- dmy(z)
z
## [1] "2018-03-08" "2017-06-29" "2016-03-23" "2018-01-30"

As seen above, even though the elements of the vector use different separators, dmy() parsed all of them correctly.
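
Because the function name encodes the order of the elements, the separator never has to be spelled out, but the order does. As a small sketch (the object names d1, d2 and d3 are illustrative), the same calendar day written in three different orders parses to the same Date as long as the matching function is chosen:

```r
library(lubridate)

# The same calendar day written in three different element orders
d1 <- dmy("08.03.2018")
d2 <- mdy("03/08/2018")
d3 <- ymd("20180308")
d1 == d2 & d2 == d3
## [1] TRUE

# Date plus time of day; the time zone defaults to UTC
ymd_hms("2018-03-08 14:30:00")
## [1] "2018-03-08 14:30:00 UTC"
```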

Extract & manipulate parts of dates

Sometimes, instead of a single string, we will have the individual components of the date-time spread across multiple columns. Remember the flights data which is in the nycflights13 package.

library(nycflights13)
head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

This data frame includes 19 variables. For date manipulations, we will use only the year, month, day, hour and minute columns.

library(dplyr)

flights_new <- flights %>%
  dplyr::select(year, month, day, hour, minute)
head(flights_new)
## # A tibble: 6 × 5
##    year month   day  hour minute
##   <int> <int> <int> <dbl>  <dbl>
## 1  2013     1     1     5     15
## 2  2013     1     1     5     29
## 3  2013     1     1     5     40
## 4  2013     1     1     5     45
## 5  2013     1     1     6      0
## 6  2013     1     1     5     58

As seen in the output, the components of the date information are given in multiple columns. To create a date/time from this sort of input, we can use make_date() for dates and make_datetime() for date-times.

flights_new <- flights_new %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))
head(flights_new)
## # A tibble: 6 × 6
##    year month   day  hour minute departure          
##   <int> <int> <int> <dbl>  <dbl> <dttm>             
## 1  2013     1     1     5     15 2013-01-01 05:15:00
## 2  2013     1     1     5     29 2013-01-01 05:29:00
## 3  2013     1     1     5     40 2013-01-01 05:40:00
## 4  2013     1     1     5     45 2013-01-01 05:45:00
## 5  2013     1     1     6      0 2013-01-01 06:00:00
## 6  2013     1     1     5     58 2013-01-01 05:58:00
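
The example above uses make_datetime(); make_date() works the same way when there is no time of day. A minimal sketch with made-up component values (first_of_month is an illustrative name):

```r
library(lubridate)

# make_date() builds Date objects from numeric components;
# shorter arguments such as day = 1 are recycled
first_of_month <- make_date(year = 2013, month = 1:3, day = 1)
first_of_month
## [1] "2013-01-01" "2013-02-01" "2013-03-01"

# make_datetime() adds the time of day (UTC unless tz is given)
make_datetime(2013, 1, 1, 5, 15)
## [1] "2013-01-01 05:15:00 UTC"
```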

Now, let’s explore functions that let us get and set individual components of date and time.

We can extract individual parts of the date with the accessor functions in lubridate. Here is the list of available functions:

| Accessor function | Extracts |
|---|---|
| year() | year |
| month() | month |
| mday() | day of the month |
| yday() | day of the year |
| wday() | day of the week |
| hour() | hour |
| minute() | minute |
| second() | second |

For example, to extract the year information of the flights_new$departure column we can use:

flights_new$departure %>% year() %>% head()
## [1] 2013 2013 2013 2013 2013 2013

For month() and wday() we can set label = TRUE argument to return the abbreviated name of the month or day of the week. We can also set abbr = FALSE to return the full name:

flights_new$departure %>% month(label = TRUE, abbr = TRUE) %>% head()
## [1] Jan Jan Jan Jan Jan Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
flights_new$departure %>% month(label = TRUE, abbr = FALSE) %>% head()
## [1] January January January January January January
## 12 Levels: January < February < March < April < May < June < ... < December
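
The day-related accessors are easy to mix up, so here is a short sketch on a single fixed date-time (chosen only for illustration). Passing week_start = 1 to wday() makes the numbering explicit, with Monday counted as day 1:

```r
library(lubridate)

d <- ymd_hms("2016-07-08 12:34:56")   # a Friday

mday(d)                   # day of the month
## [1] 8
yday(d)                   # day of the year (2016 is a leap year)
## [1] 190
wday(d, week_start = 1)   # day of the week, counting Monday as 1
## [1] 5
```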

We can also use each accessor function to set the components of a date/time:

# create a date

datetime <- ymd_hms("2016-07-08 12:34:56")

# replace the year component with 2020
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"
# replace the month component with Jan
month(datetime) <- 01
datetime
## [1] "2020-01-08 12:34:56 UTC"
# add one hour
hour(datetime) <- hour(datetime) + 1
datetime
## [1] "2020-01-08 13:34:56 UTC"
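
As an alternative to modifying one component at a time, lubridate also offers update(), which sets several components in one call and returns a new object. The sketch below starts again from the original date-time (changed and rolled are illustrative names):

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

# update() returns a modified copy; datetime itself is unchanged
changed <- update(datetime, year = 2020, month = 1, hour = 13)
changed
## [1] "2020-01-08 13:34:56 UTC"

# values that are too large roll over into the next unit
rolled <- update(ymd("2015-02-01"), mday = 30)
rolled
## [1] "2015-03-02"
```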

Date arithmetic

Often, we need to compute a new variable from date-time information. In this section, you will learn how to create a sequence of dates and how arithmetic with dates works (including subtraction, addition, and division).

For example, to create a sequence we can use the seq() function, which takes the arguments from, to, by and length.out.

# create a sequence of years from 1980 to 2018 by 2

even_years <- seq(from = 1980, to=2018, by = 2)
even_years
##  [1] 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008
## [16] 2010 2012 2014 2016 2018

The same approach works for date-time objects, with by given as a time unit such as days, months, minutes or seconds.

hour_list <- seq(ymd_hm("2018-1-1 9:00"), ymd_hm("2018-1-1 12:00"), by = "hour")
hour_list
## [1] "2018-01-01 09:00:00 UTC" "2018-01-01 10:00:00 UTC"
## [3] "2018-01-01 11:00:00 UTC" "2018-01-01 12:00:00 UTC"
month_list <- seq(ymd_hm("2018-1-1 9:00"), ymd_hm("2018-12-1 9:00"), by = "month")
month_list
##  [1] "2018-01-01 09:00:00 UTC" "2018-02-01 09:00:00 UTC"
##  [3] "2018-03-01 09:00:00 UTC" "2018-04-01 09:00:00 UTC"
##  [5] "2018-05-01 09:00:00 UTC" "2018-06-01 09:00:00 UTC"
##  [7] "2018-07-01 09:00:00 UTC" "2018-08-01 09:00:00 UTC"
##  [9] "2018-09-01 09:00:00 UTC" "2018-10-01 09:00:00 UTC"
## [11] "2018-11-01 09:00:00 UTC" "2018-12-01 09:00:00 UTC"
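
The examples above always give both end points. When the number of elements is known instead of the end point, length.out can replace to. A minimal sketch using Base R dates (weekly is an illustrative name):

```r
# four weekly dates starting on 1 January 2018
weekly <- seq(from = as.Date("2018-01-01"), by = "week", length.out = 4)
weekly
## [1] "2018-01-01" "2018-01-08" "2018-01-15" "2018-01-22"
```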

In R, subtracting one date from another gives a time difference object (class difftime). To illustrate, let’s calculate my age:

my_age <- today() - ymd(19810529)
my_age
## Time difference of 15678 days

Or, equivalently we can use:

difftime(today(), ymd(19810529))
## Time difference of 15678 days

As seen in the output, subtracting two date-time objects gives an object of class difftime. To express the time difference in another unit, we can use the units argument:

difftime(today(), ymd(19810529), units = "weeks")
## Time difference of 2239.714 weeks

Note that only seconds, minutes, hours, days, and weeks are supported. Units larger than weeks are not used due to their variability.
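
When a larger unit such as years is needed, one option is to strip the units and convert by hand. The sketch below fixes both dates so that the result is reproducible, and assumes an average year of 365.25 days, which is an approximation rather than an exact calendar calculation:

```r
age <- difftime(as.Date("2024-05-01"), as.Date("1981-05-29"), units = "days")
age
## Time difference of 15678 days

# strip the units to get a plain number, then convert by hand
age_days <- as.numeric(age, units = "days")
age_days / 365.25      # approximate age in years
## [1] 42.92402
```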

Logical comparisons are also available for date-time variables.

your_age <- today() - ymd(19890101)
your_age
## Time difference of 12904 days
your_age == my_age
## [1] FALSE
your_age < my_age
## [1] TRUE

We can also deal with time differences by using the duration functions in lubridate. A duration simply measures the time span between a start and an end date. lubridate provides a simple syntax for creating durations in the desired unit (seconds, minutes, hours, etc.).

It should be noted that the lubridate package uses seconds as the unit of calculation. Therefore, durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds at a standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, and 365.25 days in a year, which is why dyears(1) below returns 31557600 seconds).

# create a new duration (represented in seconds)

duration(1)
## [1] "1s"
# create duration for minutes
dminutes(1)
## [1] "60s (~1 minutes)"
# create duration for hours
dhours(1)
## [1] "3600s (~1 hours)"
# create duration for years
dyears(1)
## [1] "31557600s (~1 years)"
# add/subtract duration from date/time object
x <- ymd_hms("2015-09-22 12:00:00")
x + dhours(10)
## [1] "2015-09-22 22:00:00 UTC"
x + dhours(10) + dminutes(33) + dseconds(54)
## [1] "2015-09-22 22:33:54 UTC"
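
Because a duration is just a number of seconds, durations can also be divided by one another to change units. A small sketch with fixed dates (age is an illustrative name):

```r
library(lubridate)

# a duration is a number of seconds underneath
as.numeric(dhours(1))
## [1] 3600

# dividing one duration by another gives a plain number
dminutes(90) / dhours(1)
## [1] 1.5

# a difftime can be converted to a duration first
age <- as.duration(ymd("2024-05-01") - ymd("1981-05-29"))
age / ddays(1)     # length in days
## [1] 15678
```

Dividing by dyears(1) in the same way gives an approximate age in years, based on the standard year length described above.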

Dealing with Characters/Strings

String/character manipulations are often overlooked in data analysis because the focus typically remains on numeric values. However, the growth in text mining has placed greater emphasis on handling, cleaning and processing character strings. This second part of the module lays the foundation for working with characters by covering string manipulation with Base R and stringr, and the set operations for character strings.

Character string basics

This section includes how to create, convert and print character strings along with how to count the number of elements and characters in a string.

Creating Strings

The most basic way to create a string is to enclose text in quotation marks and assign it to an object, just as you would assign a number:

a <- "MATH2349"    # create string a
b <- "is awesome"     # create string b

The paste() function under Base R is used for creating and building strings. It takes one or more R objects, converts them to character, and then it concatenates (pastes) them to form one or several character strings.

Here are some examples of paste() function:

# paste together string a & b

paste(a, b)
## [1] "MATH2349 is awesome"
# paste character and number strings (converts numbers to character class)

paste("The life of", pi)           
## [1] "The life of 3.14159265358979"
# paste multiple strings

paste("I", "love", "Data Preprocessing")            
## [1] "I love Data Preprocessing"
# paste multiple strings with a separating character

paste("I", "love", "Data", "Preprocessing", sep = "-")  
## [1] "I-love-Data-Preprocessing"
# use paste0() to paste without spaces between characters

paste0("I", "love",  "Data", "Preprocessing")  
## [1] "IloveDataPreprocessing"
# paste objects with different lengths

paste("R", 1:5, sep = " v1.")       
## [1] "R v1.1" "R v1.2" "R v1.3" "R v1.4" "R v1.5"
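
The examples above all use sep, which works element by element. paste() also has a collapse argument that joins the elements of the result into a single string. A brief sketch of the difference, reusing the course codes from this module:

```r
courses <- c("MATH2349", "MATH1324")

# sep joins the arguments element by element; the result is still a vector
paste("Course", courses, sep = ": ")
## [1] "Course: MATH2349" "Course: MATH1324"

# collapse joins the elements of the result into one string
paste(courses, collapse = ", ")
## [1] "MATH2349, MATH1324"
```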

Sorting character strings is very simple using sort() function:

a <- c("MATH2349", "MATH1324")   
sort(a)
## [1] "MATH1324" "MATH2349"

Converting to Strings

Similar to the numerics, strings and characters can be tested with is.character() and any other data format can be converted into string/character with as.character() or with toString().

a <- "The life of"    
b <- pi

is.character(a)
## [1] TRUE
is.character(b)
## [1] FALSE
c <- as.character(b)
is.character(c)
## [1] TRUE
toString(c("Jul", 25, 2017))
## [1] "Jul, 25, 2017"

Printing Strings

Printing strings/characters can be done with the following functions:

| Function | Usage |
|---|---|
| print() | generic printing |
| noquote() | print with no quotes |
| cat() | concatenate and print with no quotes |

The primary printing function in R is print().

# basic printing

a <- "MATH2349 is awesome"    
print(a)
## [1] "MATH2349 is awesome"
# print without quotes

print(a, quote = FALSE)  
## [1] MATH2349 is awesome
# alternative to print without quotes

noquote(a)
## [1] MATH2349 is awesome

Concatenating strings

The cat() function allows us to concatenate objects and print them either on screen or to a file. The output result is very similar to noquote(); however, cat() does not print the numeric line indicator. As a result, cat() can be useful for printing nicely formatted responses to users.

# basic printing (similar to noquote)

cat(a)                   
## MATH2349 is awesome
# combining character string

cat(a, "and I love R")           
## MATH2349 is awesome and I love R
# basic printing of alphabet

cat(letters)             
## a b c d e f g h i j k l m n o p q r s t u v w x y z
# specify a separator between the combined characters

cat(letters, sep = "-")  
## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z
# collapse the space between the combined characters

cat(letters, sep = "")   
## abcdefghijklmnopqrstuvwxyz

You can also format the line width for printing long strings using the fill argument:

x <- "Today I am learning how to manipulate strings."
y <- "Tomorrow I plan to work on my assignment."
z <- "The day after I will take a break and drink a beer :)"

# No breaks between lines

cat(x, y, z, fill = FALSE)
## Today I am learning how to manipulate strings. Tomorrow I plan to work on my assignment. The day after I will take a break and drink a beer :)
# Breaks between lines

cat(x, y, z, fill = TRUE)
## Today I am learning how to manipulate strings. 
## Tomorrow I plan to work on my assignment. 
## The day after I will take a break and drink a beer :)

Counting string elements and characters

To count the number of elements in a character vector use length(). Note that a single string, however long, is one element:

length("How many elements are in this string?")
## [1] 1
length(c("How", "many", "elements", "are", "in", "this", "string?"))
## [1] 7

To count the number of characters in a string use nchar():

nchar("How many characters are in this string?")
## [1] 39
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1]  3  4 10  3  2  4  7

String manipulation with Base R

Basic string manipulation typically includes case conversion, simple character replacement, abbreviating, substring replacement, adding/removing white space, and performing set operations to compare similarities and differences between two character vectors.

These operations can all be performed with base R functions; however, some operations are greatly simplified with the stringr package. Therefore, after illustrating base R string manipulation for case conversion, simple character replacement, abbreviating, and substring replacement, we will switch to stringr package to cover many of the other fundamental string manipulation tasks.

Upper/lower case conversion

To convert all upper case characters to lower case we will use tolower():

a <- "MATH2349 is AWesomE"
tolower(a)
## [1] "math2349 is awesome"

To convert all lower case characters to upper case we will use toupper():

toupper(x)
## [1] "TODAY I AM LEARNING HOW TO MANIPULATE STRINGS."

Simple Character Replacement

To replace a character (or multiple characters) in a string we can use chartr():

# replace 'A' with 'a'

x <- "This is A string."
chartr(old = "A", new = "a", x)
## [1] "This is a string."
# multiple character replacements
# replace any 'd' with 't' and any 'z' with 'a'

y <- "Tomorrow I plzn do lezrn zbout dexduzl znzlysis."
chartr(old = "dz", new = "ta", y)
## [1] "Tomorrow I plan to learn about textual analysis."

Note that chartr() replaces every occurrence of the specified character(s), so use it only when you are certain that you want to change every occurrence.
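
When only the first occurrence should change, or when whole words rather than single characters need replacing, the Base R functions sub() and gsub() are an alternative to chartr(). A minimal sketch with an illustrative sentence; fixed = TRUE treats the pattern as plain text:

```r
x <- "This is A string. A is the first letter."

# chartr() translates single characters, everywhere they occur
chartr(old = "A", new = "a", x)
## [1] "This is a string. a is the first letter."

# sub() changes only the first match
sub("A", "a", x, fixed = TRUE)
## [1] "This is a string. A is the first letter."

# gsub() changes every match and can swap whole words
gsub("string", "sentence", x, fixed = TRUE)
## [1] "This is A sentence. A is the first letter."
```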

String Abbreviations

To abbreviate strings we can use abbreviate():

streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")

# default abbreviations
abbreviate(streets)
## Victoria    Yarra  Russell Williams Swanston 
##   "Vctr"   "Yarr"   "Rssl"   "Wllm"   "Swns"
# set minimum length of abbreviation
abbreviate(streets, minlength = 2)
## Victoria    Yarra  Russell Williams Swanston 
##     "Vc"     "Yr"     "Rs"     "Wl"     "Sw"

Extract/Replace Substrings

To extract or replace substrings in a character vector there are two primary Base R functions to use: substr() and strsplit().

The purpose of substr() is to extract and replace substrings with specified starting and stopping characters. Here are some examples on substr() usage:

alphabet <- paste(LETTERS, collapse = "")
alphabet
## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# extract 18th character in alphabet
substr(alphabet, start = 18, stop = 18)
## [1] "R"
# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)
## [1] "RSTUVWX"
# replace 19-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet
## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
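
One detail of the replacement form is easy to miss: substr() never changes the length of the string. A short sketch with an illustrative string shows what happens when the replacement is shorter or longer than the span being replaced:

```r
s <- "abcdef"

# replacement shorter than the span: only as many characters as supplied change
substr(s, start = 1, stop = 3) <- "Z"
s
## [1] "Zbcdef"

# replacement longer than the span: the extra characters are dropped
substr(s, start = 5, stop = 6) <- "WXYZ"
s
## [1] "ZbcdWX"
```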

To split the elements of a character string we can use strsplit(). Here are some examples:

z <- "The day after I will take a break and drink a beer :)"
strsplit(z, split = " ")
## [[1]]
##  [1] "The"   "day"   "after" "I"     "will"  "take"  "a"     "break" "and"  
## [10] "drink" "a"     "beer"  ":)"
a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-") 
## [[1]]
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"

Note that the output of strsplit() is a list. To convert the output to a simple atomic vector simply wrap in unlist():

unlist(strsplit(a, split = "-"))
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"
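
strsplit() returns a list because it is vectorised: each input string produces its own vector of pieces. The sketch below (dates and parts are illustrative names) picks the first piece of every string, and shows why fixed = TRUE is needed when splitting on a full stop, since split is otherwise read as a regular expression:

```r
dates <- c("2018-03-08", "2016-03-23")

# one list element per input string
parts <- strsplit(dates, split = "-", fixed = TRUE)
length(parts)
## [1] 2

# first piece of every string
sapply(parts, `[`, 1)
## [1] "2018" "2016"

# "." is a regex wildcard, so treat it as plain text with fixed = TRUE
strsplit("rmit.edu.au", split = ".", fixed = TRUE)[[1]]
## [1] "rmit" "edu"  "au"
```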

Set operations for character strings

There are also base R functions that allow for assessing the set union, intersection, difference, equality, and membership of two vectors.

To obtain the elements of the union between two character vectors we can use union():

set_1 <- c("lagunitas", "bells", "dogfish", "summit", "odell")
set_2 <- c("sierra", "bells", "harpoon", "lagunitas", "founders")
union(set_1, set_2)
## [1] "lagunitas" "bells"     "dogfish"   "summit"    "odell"     "sierra"   
## [7] "harpoon"   "founders"

To obtain the common elements of two character vectors we can use intersect().

intersect(set_1, set_2)
## [1] "lagunitas" "bells"

In order to obtain the non-common elements, or the difference, of two character vectors we can use setdiff().

# returns elements in set_1 not in set_2

setdiff(set_1, set_2)
## [1] "dogfish" "summit"  "odell"
# returns elements in set_2 not in set_1
setdiff(set_2, set_1)
## [1] "sierra"   "harpoon"  "founders"

In order to test if two vectors contain the same elements regardless of order we can use setequal():

set_3 <- c("VIC", "NSW", "TAS")
set_4 <- c("WA", "SA", "NSW")
set_5 <- c("NSW", "SA", "WA")

setequal(set_3, set_4)
## [1] FALSE
setequal(set_4, set_5)
## [1] TRUE

We can use identical() to test if two character vectors are equal in content and order.

set_6 <- c("VIC", "NSW", "TAS")
set_7 <- c("NSW", "VIC", "TAS")
set_8 <- c("VIC", "NSW", "TAS")

identical(set_6, set_7)
## [1] FALSE
identical(set_6, set_8)
## [1] TRUE

In order to test if an element is contained within a character vector use is.element() or %in%. Here are some examples:

set_6 <- c("VIC", "NSW", "TAS")
set_7 <- c("NSW", "VIC", "TAS")
set_8 <- c("VIC", "NSW", "TAS")

is.element("VIC", set_8)
## [1] TRUE
"VIC" %in% set_8
## [1] TRUE
"WA" %in% set_8
## [1] FALSE
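
The membership test is vectorised over its left-hand side, which makes it handy for filtering. A small sketch with illustrative vectors:

```r
states <- c("VIC", "NSW", "TAS")
query  <- c("VIC", "WA", "NSW")

# one TRUE/FALSE per element on the left-hand side
query %in% states
## [1]  TRUE FALSE  TRUE

# use the result to keep only the matching elements
query[query %in% states]
## [1] "VIC" "NSW"

# set functions drop duplicates from their result
union(c("VIC", "VIC", "NSW"), "NSW")
## [1] "VIC" "NSW"
```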

String manipulation with stringr

The stringr package was developed by Hadley Wickham to provide a consistent and simple wrapper to common string operations. Before using these functions, we need to install and load the stringr package.

#install.packages("stringr")
library(stringr)

Basic operations

There are three string functions that are closely related to their base R equivalents, but with a few enhancements. They are:

  • Concatenate with str_c()
  • Number of characters with str_length()
  • Substring with str_sub()

str_c() plays the role of the paste() function in Base R. Its default separator is the empty string, so without a sep argument it behaves like paste0().

# same as paste0()

str_c("Learning", "to", "use", "the", "stringr", "package")
## [1] "Learningtousethestringrpackage"
# same as paste()

str_c("Learning", "to", "use", "the", "stringr", "package", sep = " ")
## [1] "Learning to use the stringr package"
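
Two further points can be sketched briefly: str_c() treats missing values differently from paste(), and it accepts a collapse argument in the same way (the input strings here are illustrative):

```r
library(stringr)

# paste() turns NA into the text "NA"; str_c() keeps it missing
paste("Grade:", NA)
## [1] "Grade: NA"
str_c("Grade:", NA, sep = " ")
## [1] NA

# collapse turns many elements into one string
str_c(c("a", "b", "c"), collapse = ", ")
## [1] "a, b, c"
```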

str_length() is similar to the nchar() function in Base R:

# some text

text <- c("Learning", "to", NA, "use", "the", NA, "stringr", "package")

# `nchar()`

nchar(text)
## [1]  8  2 NA  3  3 NA  7  7
# `str_length()` same as above

str_length(text)
## [1]  8  2 NA  3  3 NA  7  7

As seen above, both str_length() and nchar() functions count the number of characters for non-missing values. For the missing values, they return NA as a value.

You can access individual characters using str_sub(). The str_sub() function is the counterpart of the substr() function in Base R and takes three arguments: a character vector, a start position and an end position. Each position can be either a positive integer, which counts from the left, or a negative integer, which counts from the right. Note that the positions are inclusive and, if they extend beyond the string, are silently truncated.

# Some text

y <- c("abcdef", "ghifjk")

# The 3rd letter

str_sub(y, 3, 3)
## [1] "c" "i"
# The 2nd to 2nd-to-last character

str_sub(y, 2, -2)
## [1] "bcde" "hifj"

You can also use str_sub() to modify strings:

# Change the third letter with X

str_sub(y, 3, 3) <- "X"
y
## [1] "abXdef" "ghXfjk"
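
Negative positions and truncation can be illustrated with a short sketch on an example string:

```r
library(stringr)

s <- "abcdef"

str_sub(s, -3, -1)   # last three characters
## [1] "def"
str_sub(s, 4, 100)   # an end beyond the string is truncated, no error
## [1] "def"
```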

Duplicate Characters within a String

stringr provides str_dup() to duplicate the characters within a string. (Base R gained a comparable function, strrep(), in version 3.3.0.)

str_dup("apples", times = 4)
## [1] "applesapplesapplesapples"
str_dup("apples", times = 1:4)
## [1] "apples"                   "applesapples"            
## [3] "applesapplesapples"       "applesapplesapplesapples"

Remove Leading and Trailing White space

In string processing, a common task is parsing text into individual words. Often, this results in words having blank spaces (white space) on either end. str_trim() can be used to remove these spaces. Here are some examples:

text <- c("Text ", "  with", " whitespace ", " on", "both ", " sides ")
text
## [1] "Text "        "  with"       " whitespace " " on"          "both "       
## [6] " sides "
# remove white spaces on the left side
str_trim(text, side = "left")
## [1] "Text "       "with"        "whitespace " "on"          "both "      
## [6] "sides "
# remove white spaces on the right side
str_trim(text, side = "right")
## [1] "Text"        "  with"      " whitespace" " on"         "both"       
## [6] " sides"
# remove white spaces on both sides
str_trim(text, side = "both")
## [1] "Text"       "with"       "whitespace" "on"         "both"      
## [6] "sides"
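
str_trim() only touches the ends of a string. For repeated spaces inside a string, stringr also has str_squish(), and Base R offers trimws() as a counterpart of str_trim(). A minimal sketch with an illustrative string:

```r
library(stringr)

messy <- "  too   many    spaces  "

str_trim(messy)     # ends only; inner runs of spaces stay
## [1] "too   many    spaces"
str_squish(messy)   # ends trimmed and inner runs reduced to one space
## [1] "too many spaces"
trimws(messy)       # Base R counterpart of str_trim()
## [1] "too   many    spaces"
```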

Pad a String with White space

To add white space, or to pad a string, we will use str_pad(). We can also use str_pad() to pad a string with specified characters. The width argument will give width of padded strings and the pad argument will specify the padding characters. Here are some examples:

str_pad("apples", width = 10, side = "left")
## [1] "    apples"
str_pad("apples", width = 10, side = "both")
## [1] "  apples  "
str_pad("apples", width = 10, side = "right", pad = "!")
## [1] "apples!!!!"
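
A common practical use of padding is giving identifiers a fixed width with leading zeros. The sketch below uses made-up ID values:

```r
library(stringr)

ids <- c("7", "42", "365")
str_pad(ids, width = 5, side = "left", pad = "0")
## [1] "00007" "00042" "00365"

# strings already at least `width` characters long are returned unchanged
str_pad("apples", width = 3)
## [1] "apples"
```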

Upper/lower case conversion

Similar to the tolower() and toupper() functions in Base R, stringr also has upper/lower case transformation functions:

# Some text

x <- "I like VeGGies."

# convert all upper-case characters to lower case

str_to_lower(x)
## [1] "i like veggies."
#convert all lower-case characters to upper case

str_to_upper(x)
## [1] "I LIKE VEGGIES."
# convert only first letters of words to upper case

str_to_title(x)
## [1] "I Like Veggies."

String ordering and sorting

String ordering and sorting can be done using the str_order() and str_sort() functions.

# some text

x <- c("y", "i", "k")

# return the index of ordered values

str_order(x)
## [1] 2 3 1

Note that str_order() returns the positions of the elements in sorted order (here the 2nd, then the 3rd, then the 1st element), not the sorted values themselves.

# sort characters

str_sort(x)
## [1] "i" "k" "y"
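
The relationship between the two functions can be sketched as follows: indexing a vector by its str_order() result gives the same answer as str_sort(). The decreasing and numeric arguments are shown with illustrative file names:

```r
library(stringr)

x <- c("y", "i", "k")

# indexing with the result of str_order() gives the sorted vector
x[str_order(x)]
## [1] "i" "k" "y"

str_sort(x, decreasing = TRUE)
## [1] "y" "k" "i"

# numeric = TRUE sorts embedded numbers by value rather than digit by digit
str_sort(c("file10", "file2", "file1"), numeric = TRUE)
## [1] "file1"  "file2"  "file10"
```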

Pattern matching (Pattern detect, subset, locate, count, extract, and replace)

Most string manipulations require pattern matching for a given text. The good news is that the stringr package has many specialised pattern matching functions to detect, subset, locate, count, extract, and replace strings. Here, I will demonstrate the commonly used pattern matching functions in stringr. For more information on other useful pattern matching functions, please refer to the [stringr vignette](https://stringr.tidyverse.org/articles/stringr.html).

Note that each pattern matching function described below has the same first two arguments: a character vector of strings to process and a single pattern to match, specified by the pattern argument.

  • str_detect() detects the presence or absence of a pattern and returns a logical vector. Here is an example of its usage:
# detects pattern "ea"

x <- c("apple", "banana", "pear")
str_detect(x, pattern ="ea")
## [1] FALSE FALSE  TRUE
#same as above
str_detect(x, "ea")
## [1] FALSE FALSE  TRUE

While matching patterns, one can also use regular expressions. Regular expressions (a.k.a. regexes) are a language that allows you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.

# Same as above using regex

x <- c("apple", "banana", "pear")
str_detect(x, regex("ea"))
## [1] FALSE FALSE  TRUE

You can also perform a case-insensitive match by wrapping the pattern in regex() with ignore_case = TRUE:

bananas <- c("banana", "Banana", "BANANA")

#case sensitive match

str_detect(bananas, "banana") 
## [1]  TRUE FALSE FALSE
#case insensitive match

str_detect(bananas, regex("banana", ignore_case = TRUE)) 
## [1] TRUE TRUE TRUE

With regex, you can create your own character classes using [ ]. For example:

  • [abc]: matches a, b, or c.
  • [a-z]: matches every character between a and z (in Unicode code point order).
  • [^abc]: matches anything except a, b, or c.
  • [\^\-]: matches ^ or -.

There are a number of pre-built classes that you can use inside [ ]:

  • [:punct:]: punctuation.
  • [:alpha:]: letters.
  • [:lower:]: lowercase letters.
  • [:upper:]: uppercase letters.
  • [:digit:]: digits.
  • [:xdigit:]: hex digits.
  • [:alnum:]: letters and numbers.
  • [:cntrl:]: control characters.
  • [:graph:]: letters, numbers, and punctuation.
  • [:print:]: letters, numbers, punctuation, and white space.
  • [:space:]: space characters (basically equivalent to \s).
  • [:blank:]: space and tab.

Here are some examples:

emails <- c("s123546@student.rmit.edu.au", "sona.taheri@rmit.edu.au", "s2342565@rmit.edu.vn")

#detect the emails containing numbers

str_detect(emails, "[:digit:]")
## [1]  TRUE FALSE  TRUE
#detect the emails containing lowercase letters

str_detect(emails, "[:lower:]")
## [1] TRUE TRUE TRUE
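
The hand-written character classes listed earlier can be tried out in the same way. A short sketch with illustrative strings; note that backslashes have to be doubled inside an R string:

```r
library(stringr)

codes <- c("abc", "a1c", "A-C")

str_detect(codes, "[0-9]")      # contains a digit
## [1] FALSE  TRUE FALSE
str_detect(codes, "[^a-z]")     # contains something other than a-z
## [1] FALSE  TRUE  TRUE
str_detect(codes, "[\\^\\-]")   # contains ^ or -
## [1] FALSE FALSE  TRUE
```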

For more information on the regex capabilities, please refer to the [regular expressions vignette](https://stringr.tidyverse.org/articles/regular-expressions.html) of the stringr package.

  • str_subset() returns the elements of a character vector that match a regular expression.
#subset emails that contains numbers

str_subset(emails, "[:digit:]")
## [1] "s123546@student.rmit.edu.au" "s2342565@rmit.edu.vn"
  • str_extract() extracts text corresponding to the first match, returning a character vector.
# extract the digits 1234

str_extract("Let's extract the digits 1234", pattern = "1234")
## [1] "1234"
# extract the digits 23 in emails

str_extract(emails, pattern = "23")
## [1] "23" NA   "23"
  • str_locate() locates the first position of a pattern and returns a numeric matrix with columns start and end, whereas str_locate_all() locates all positions of the pattern and returns a list of numeric matrices.
# locate the first i in the string

str_locate("Locate the first i in this string", "i")
##      start end
## [1,]    13  13
# locate all the i's in this string

str_locate_all("Locate all the i's in this string", "i")
## [[1]]
##      start end
## [1,]    16  16
## [2,]    20  20
## [3,]    25  25
## [4,]    31  31
# locate all the full stops (\\.) in this string

str_locate_all("Full stop separates two sentences. Now I will locate all full stops.", "\\.")
## [[1]]
##      start end
## [1,]    34  34
## [2,]    68  68
  • str_count() counts the number of matches for a given string.
# Counts the digits

str_count("90 Dollars", "[:digit:]") 
## [1] 2
# Counts the letters

str_count("90 Dollars", "[:alpha:]") 
## [1] 7
  • str_replace() replaces the first match of a pattern with another string. The pattern argument gives the string to be replaced and the replacement argument specifies the replacement string.
# replace Dollars with AUD

str_replace("90 Dollars", pattern = "Dollars", replacement = "AUD")
## [1] "90 AUD"
  • str_replace_all() replaces all matches.
# replace all l's with "" (delete l's)

str_replace_all("Hello world", pattern = "l", replacement = "")
## [1] "Heo word"
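
The pattern matching functions above are often used together on the same vector. The following sketch reuses the emails vector from the earlier examples and combines extraction, location, counting and replacement; the regular expressions \\d+ (one or more digits) and \\. (a literal full stop) are choices made for this illustration:

```r
library(stringr)

emails <- c("s123546@student.rmit.edu.au", "sona.taheri@rmit.edu.au", "s2342565@rmit.edu.vn")

# extract: the first run of digits in each address
str_extract(emails, "\\d+")
## [1] "123546"  NA        "2342565"

# locate + str_sub: keep everything before the "@"
at <- str_locate(emails, "@")[, "start"]
str_sub(emails, 1, at - 1)
## [1] "s123546"     "sona.taheri" "s2342565"

# count: number of full stops in each address
str_count(emails, "\\.")
## [1] 3 3 2

# replace: first match only versus every match
str_replace(emails[1], "\\.", "_")
## [1] "s123546@student_rmit.edu.au"
str_replace_all(emails[1], "\\.", "_")
## [1] "s123546@student_rmit_edu_au"
```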

Additional Resources and Further Reading

For more information on the lubridate and stringr packages and their functions, you can refer to the [lubridate package manual](https://cran.r-project.org/web/packages/lubridate/lubridate.pdf), the [stringr package manual](https://cran.r-project.org/web/packages/stringr/stringr.pdf) and its vignette. For regular expressions, refer to the [regular expressions vignette](https://stringr.tidyverse.org/articles/regular-expressions.html).

Our recommended textbooks (Boehmke (2016) and Wickham and Grolemund (2016)) are great resources for the basics of date and character manipulations. If you want to learn more about advanced text manipulation and text mining, you may refer to “Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining” (Munzert et al. (2014)).

References

Boehmke, Bradley C. 2016. Data Wrangling with R. Springer.
Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. John Wiley & Sons.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.