Module 8-2 Demonstration

Special Operations: Dealing with date/time and character variables

Creating Strings

  • The most basic way to create strings is to use quotation marks and assign a string to an object.
quote <- "The most valuable thing you can have as a leader is clear data"
author <- "Ruth Porat"
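  • The paste() function in base R builds strings by joining its arguments, separated by a space by default. The stringr function str_c() does the same job, but its default separator is the empty string, so it behaves like paste0() unless sep is set.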
paste(quote, "by", author)
## [1] "The most valuable thing you can have as a leader is clear data by Ruth Porat"
  • Use paste0() to paste without spaces between characters.
paste0("I", "love", "Data", "Preprocessing")
## [1] "IloveDataPreprocessing"
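The sep and collapse arguments are easy to confuse, so here is a minimal sketch of how they differ. The vector parts and the result names are made up for illustration, and str_c() assumes the stringr package is installed.

```r
library(stringr)

# hypothetical example vector (not part of the original demonstration)
parts <- c("data", "pre", "processing")

# sep joins the arguments element-wise
joined_sep <- paste("step", 1:3, sep = "_")       # "step_1" "step_2" "step_3"

# collapse merges a whole vector into a single string
joined_collapse <- paste(parts, collapse = "-")   # "data-pre-processing"

# paste0() and str_c() both default to an empty separator
joined_p0  <- paste0("step", 1:3)                 # "step1" "step2" "step3"
joined_str <- str_c("step", 1:3)                  # "step1" "step2" "step3"
```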
Converting to Strings

  • Use is.character() to test whether an object is a character vector; any other data type can be converted to character with as.character().
is.character(quote)
## [1] TRUE
as.character(pi)
## [1] "3.14159265358979"
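A short sketch of testing and converting types, using made-up values (num and flag are assumptions for illustration only):

```r
# hypothetical values of different types
num  <- 42.5
flag <- TRUE

is_chr_before <- is.character(num)      # FALSE: num is numeric
num_chr  <- as.character(num)           # "42.5"
flag_chr <- as.character(flag)          # "TRUE"
is_chr_after <- is.character(num_chr)   # TRUE

# converting back works when the text looks like a number
back <- as.numeric(num_chr)             # 42.5
```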
Printing Strings

Printing strings/characters can be done with the following:

Function    Usage
print()     generic printing
noquote()   print with no quotes
cat()       concatenate and print with no quotes and no [1] index prefix
# print without quotes
print( paste(quote,author) , quote = FALSE)
## [1] The most valuable thing you can have as a leader is clear data Ruth Porat
# same as above, except that cat() does not print the [1] index prefix
cat( paste(quote,author) )
## The most valuable thing you can have as a leader is clear data Ruth Porat
Printing Strings Cont.

# basic printing of alphabet
cat(letters)
## a b c d e f g h i j k l m n o p q r s t u v w x y z
# specify a separator between the combined characters
cat(letters, sep = "-")
## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z
Printing Strings Cont.

  • To format the line width for printing long strings use the fill argument.
# No breaks between lines
cat(quote, author, fill = FALSE)
## The most valuable thing you can have as a leader is clear data Ruth Porat
# Breaks between lines
cat(letters, letters, letters, fill = TRUE)
## a b c d e f g h i j k l m n o p q r s t u v w x y z a b c d e f g h i j k l m n
## o p q r s t u v w x y z a b c d e f g h i j k l m n o p q r s t u v w x y z
Counting string elements and characters

  • To count the number of elements in a character vector use length(). A single string, however long, counts as one element.
length("How many elements are in this string?")
## [1] 1
length( c("How", "many", "elements", "are", "in", "this", "string?") )
## [1] 7
  • To count the number of characters in a string use nchar().
nchar("How many characters are in this string?")
## [1] 39
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1] 3 4 10 3 2 4 7
String manipulation with Base R

  • Basic string manipulation typically includes:

    • case conversion;
    • simple character replacement;
    • pattern replacement;
    • abbreviating;
    • substring replacement;
    • adding/removing white space;
    • set operations.
  • These operations can all be performed with base R functions; however, some operations are greatly simplified with the stringr package.

Upper/lower case conversion

  • To convert all upper case characters to lower case use tolower().
  • To convert all lower case characters to upper case use toupper().
a <- "MATH2349 is AWesomE"
tolower(a)
## [1] "math2349 is awesome"
toupper(a)
## [1] "MATH2349 IS AWESOME"
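Case conversion is often used to standardise text before comparing it. A minimal sketch, assuming a made-up vector of survey answers:

```r
# hypothetical survey responses with inconsistent capitalisation
answers <- c("Yes", "YES", "yes", "No")

# convert everything to lower case before comparing
clean <- tolower(answers)
n_yes <- sum(clean == "yes")   # 3
```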
Simple Character Replacement

  • To replace a character (or multiple characters) in a string use chartr().
# replace 'z' with 's'
american <- "This is how we analyze."
chartr(old = "z", new = "s", american)
## [1] "This is how we analyse."
# replace 'i' with 'w', 'X' with 'h' and 's' with 'y'
x <- "MiXeD cAsE 123"
chartr(old ="iXs", new ="why", x)
## [1] "MwheD cAyE 123"
Pattern Replacement

  • To replace every occurrence of a pattern in a string use gsub(); sub() replaces only the first occurrence.
# replace "ot" pattern with "ut"
x <- "R Totorial"
gsub(pattern = "ot", replacement="ut", x)
## [1] "R Tutorial"
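To see how gsub() differs from its sibling sub(), and how fixed = TRUE switches off regular expression matching, consider this small sketch (the input strings are made up):

```r
fruit_name <- "banana"

# sub() replaces only the first match; gsub() replaces every match
first_only <- sub(pattern = "an", replacement = "AN", fruit_name)    # "bANana"
all_hits   <- gsub(pattern = "an", replacement = "AN", fruit_name)   # "bANANa"

# "." is a regex wildcard; fixed = TRUE treats the pattern as literal text
decimal_comma <- gsub(".", ",", "3.14", fixed = TRUE)                # "3,14"
```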
String Abbreviations

  • To abbreviate strings we can use abbreviate().
streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")
# default abbreviations
abbreviate(streets)
## Victoria    Yarra  Russell Williams Swanston
##   "Vctr"   "Yarr"   "Rssl"   "Wllm"   "Swns"
# set minimum length of abbreviation
abbreviate(streets, minlength = 2)
## Victoria    Yarra  Russell Williams Swanston
##     "Vc"     "Yr"     "Rs"     "Wl"     "Sw"
Extract/Replace Substrings

  • The purpose of substr() is to extract and replace substrings with specified starting and stopping characters.
alphabet <- paste(LETTERS, collapse = "")
# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)
## [1] "RSTUVWX"
# replace 19-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
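The replacement form of substr() modifies the object in place, so nothing is printed. The sketch below shows the result, plus what happens when the replacement is shorter than the target range (the object short is made up for illustration):

```r
alphabet <- paste(LETTERS, collapse = "")

# overwrite characters 19 to 24 in place
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet   # "ABCDEFGHIJKLMNOPQRRRRRRRYZ"

# a shorter replacement only overwrites as many characters as it supplies
short <- "abcdef"
substr(short, start = 1, stop = 3) <- "Z"
short      # "Zbcdef"
```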
Extract/Replace Substrings

  • To split the elements of a character string use strsplit().
z <- "Victoria Yarra Russell Williams Swanston"
strsplit(z, split = " ")
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-")
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
  • Note that the output of strsplit() is a list. To convert it to an atomic vector use unlist().
unlist(strsplit(a, split = "-"))
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
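strsplit() returns a list because it is vectorised: each input string gets its own vector of pieces. A minimal sketch with made-up date strings, which also shows that split is treated as a regular expression unless fixed = TRUE is set:

```r
# hypothetical input with two strings
dates <- c("2023-01-15", "2024-12-31")

# one list element per input string
pieces <- strsplit(dates, split = "-")

# pick the first piece of every element
years <- sapply(pieces, `[`, 1)          # "2023" "2024"

# "." is a regex wildcard, so split on a literal dot with fixed = TRUE
dotted <- strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]   # "a" "b" "c"
```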
Set operations for character strings

Function      Usage
union()       obtain the union of two character vectors
intersect()   obtain the elements common to two character vectors
setdiff()     obtain the elements of the first vector that are not in the second
setequal()    test whether two vectors contain the same elements, regardless of order
identical()   test whether two character vectors are equal in content and order
Set operations for character strings Cont.

set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")
union(set_1, set_2)
## [1] "VIC" "NSW" "WA" "TAS" "QLD" "SA"
intersect(set_1, set_2)
## [1] "NSW" "TAS"
setdiff(set_1, set_2)
## [1] "VIC" "WA"
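The remaining functions, setequal() and identical(), can be sketched the same way. The vector set_3 is made up for illustration; it holds the elements of set_1 in a different order. The last line shows that setdiff() depends on argument order.

```r
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")
set_3 <- c("TAS", "WA", "NSW", "VIC")   # hypothetical: set_1 reordered

same_elements <- setequal(set_1, set_3)    # TRUE: order is ignored
same_order    <- identical(set_1, set_3)   # FALSE: order matters

# elements of set_2 that are not in set_1
rev_diff <- setdiff(set_2, set_1)          # "QLD" "SA"
```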
String manipulation with stringr

  • The stringr package was developed by Hadley Wickham to provide consistent, simple wrappers around common string operations.

  • These functions are closely related to their base R equivalents:

    • Concatenate with str_c() (base R: paste() and paste0()).

    • Count characters with str_length() (base R: nchar()).

    • Extract substrings with str_sub() (base R: substr()).

Duplicate Characters within a String

  • stringr provides str_dup() to duplicate and concatenate strings. Base R has had an equivalent, strrep(), since version 3.3.0.
str_dup("apples", times = 4)
## [1] "applesapplesapplesapples"
str_dup("apples", times = 1:4)
## [1] "apples" "applesapples"
## [3] "applesapplesapples" "applesapplesapplesapples"
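For comparison, base R (3.3.0 or later) offers strrep(), which behaves similarly for this task. A minimal sketch with a made-up input:

```r
library(stringr)

# duplicate "ab" one, two, and three times
dup_stringr <- str_dup("ab", times = 1:3)   # "ab" "abab" "ababab"

# base R counterpart
dup_base <- strrep("ab", times = 1:3)       # "ab" "abab" "ababab"
```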
Remove Leading and Trailing White space

  • In string processing, a common task is parsing text into individual words.

  • Often, this leaves words with blank spaces (white space) on either end. str_trim() can be used to remove these spaces.

text <- c("Text ", " with", " whitespace ", " on", "both ", " sides ")
text
## [1] "Text " " with" " whitespace " " on" "both "
## [6] " sides "
str_trim(text, side = "left")
## [1] "Text " "with" "whitespace " "on" "both "
## [6] "sides "
str_trim(text, side = "right")
## [1] "Text" " with" " whitespace" " on" "both"
## [6] " sides"
str_trim(text, side = "both")
## [1] "Text" "with" "whitespace" "on" "both"
## [6] "sides"
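Base R provides trimws() for the same purpose; its which argument plays the role of side. A short sketch on a made-up vector:

```r
library(stringr)

# hypothetical strings with leading and/or trailing spaces
messy <- c("  left", "right  ", "  both  ")

trim_stringr <- str_trim(messy, side = "both")   # "left" "right" "both"
trim_base    <- trimws(messy, which = "both")    # "left" "right" "both"
```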
Pad a String with White space

  • Conversely, to add whitespace, or to pad a string, we can use str_pad().
str_pad("apples", width = 10, side = "left")
## [1] "    apples"
str_pad("apples", width = 10, side = "both")
## [1] "  apples  "
  • str_pad() can also pad a string with a specified character. The width argument sets the minimum width of the padded string and the pad argument specifies the single padding character.
str_pad("apples", width = 10, side = "right", pad = "!")
## [1] "apples!!!!"
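A common use of padding is to give identifiers a fixed width. A minimal sketch, assuming a made-up vector of ID strings:

```r
library(stringr)

# hypothetical ID values of varying length
ids <- c("7", "42", "365")

# left-pad with zeros to a common width of 5
padded <- str_pad(ids, width = 5, side = "left", pad = "0")   # "00007" "00042" "00365"
```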
Pattern matching

  • The vast majority of string manipulations require pattern matching for a given text.

  • The good news is that the stringr package has pattern matching functions to detect, subset, locate, count, extract, and replace strings.

Pattern detection with str_detect()

  • str_detect() detects the presence or absence of a pattern and returns a logical vector.
# detects pattern "ea"
x <- c("apple", "banana", "pear")
str_detect(x, pattern ="ea")
# same as above
str_detect(x, "ea")
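The logical vector returned by str_detect() is typically used to filter or count elements. A short sketch using the same x, with grepl() shown as the base R counterpart:

```r
library(stringr)

x <- c("apple", "banana", "pear")

has_ea  <- str_detect(x, "ea")   # FALSE FALSE TRUE
matches <- x[has_ea]             # "pear"
n_match <- sum(has_ea)           # 1

# base R counterpart (note the reversed argument order)
has_ea_base <- grepl("ea", x)
```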
Remark: Regular expressions (Regex)

  • When matching patterns, you can also use regular expressions.

  • Regular expressions (a.k.a. regexes) are a language that allows you to describe patterns in strings.

# Same as above using regex
x <- c("apple", "banana", "pear")
str_detect(x, regex("ea"))
  • You can perform a case-insensitive match using ignore_case = TRUE.
bananas <- c("banana", "Banana", "BANANA")
# case-insensitive match
str_detect(bananas, regex("banana",ignore_case = TRUE))
Remark: Regular expressions (Regex) Cont.

  • With regex, you can create your own character classes using [ ]. For example:
  • [abc]: matches a, b, or c.
  • [a-z]: matches every character between a and z (in Unicode code point order).
  • [^abc]: matches anything except a, b, or c.
  • [\^\-]: matches ^ or -.
  • They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.

  • For more information on the regex capabilities, please refer to the regular expressions vignette in the stringr package.
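The character classes above can be tried out with str_detect(). The vector codes below is made up for illustration; note that inside an R string each backslash must be doubled.

```r
library(stringr)

# hypothetical test strings
codes <- c("abc", "xyz", "a-b", "A^2")

has_abc      <- str_detect(codes, "[abc]")       # TRUE FALSE TRUE FALSE
has_nonlower <- str_detect(codes, "[^a-z]")      # FALSE FALSE TRUE TRUE
has_symbol   <- str_detect(codes, "[\\^\\-]")    # FALSE FALSE TRUE TRUE
```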

Remark: Regular expressions (Regex) Cont.

  • There are a number of pre-built classes that you can use inside [ ]:
  • [:punct:]: punctuation.
  • [:alpha:]: letters.
  • [:lower:]: lowercase letters.
  • [:upper:]: uppercase letters.
  • [:digit:]: digits.
  • [:xdigit:]: hex digits.
  • [:alnum:]: letters and numbers.
  • [:cntrl:]: control characters.
  • [:graph:]: letters, numbers, and punctuation.
  • [:print:]: letters, numbers, punctuation, and white space.
  • [:space:]: space characters (basically equivalent to \s).
  • [:blank:]: space and tab.
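A few of these pre-built classes in action, sketched on a made-up vector (mixed is an assumption for illustration):

```r
library(stringr)

# hypothetical test strings
mixed <- c("R2-D2", "hello world", "ABC", "123")

has_digit <- str_detect(mixed, "[:digit:]")   # TRUE FALSE FALSE TRUE
has_upper <- str_detect(mixed, "[:upper:]")   # TRUE FALSE TRUE FALSE
has_space <- str_detect(mixed, "[:space:]")   # FALSE TRUE FALSE FALSE
has_punct <- str_detect(mixed, "[:punct:]")   # TRUE FALSE FALSE FALSE
```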
Your turn!

  • Use the data set of commonly used English words (words) that ships with stringr.
head(words)
## [1] "a" "able" "about" "absolute" "accept" "account"
length(words)
## [1] 980
# Task 1: count the words that contain "ing"
str_detect(words, pattern = regex("ing")) %>% sum()
## [1] 10
# Same as above:
str_detect(words, "ing") %>% sum()
## [1] 10
# Task 2: count the words that end in "ing"
str_detect(words, "ing$") %>% sum()
## [1] 9
# Task 3: list the words that end in "ing"
words[str_detect(words, "ing$")]
## [1] "bring" "during" "evening" "king" "meaning" "morning" "ring"
## [8] "sing" "thing"

String subsetting with str_subset()

  • str_subset() returns the elements of a character vector that match a regular expression.

  • Using the starwars data set (from the dplyr package), let's subset the character names that contain any punctuation.
head(starwars$name)
## [1] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader"
## [5] "Leia Organa" "Owen Lars"
str_subset(starwars$name, "[:punct:]")
## [1] "C-3PO" "R2-D2" "R5-D4" "Obi-Wan Kenobi"
## [5] "IG-88" "Qui-Gon Jinn" "Ki-Adi-Mundi" "R4-P17"
String extract using str_extract()

  • str_extract() extracts text corresponding to the first match, returning a character vector.
str_extract(starwars$name, "[:punct:]")
## [1] NA "-" "-" NA NA NA NA "-" NA "-" NA NA NA NA NA NA NA NA NA
Finding pattern locations using str_locate()

  • str_locate() locates the first position of a pattern in each string and returns a numeric matrix with columns start and end. str_locate_all() locates all positions of the pattern and returns a list of such matrices.
str_locate(starwars$name, "[:punct:]") %>% head()
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA
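To illustrate the difference between the two functions, here is a minimal sketch on a single name containing two hyphens (the object names are made up):

```r
library(stringr)

name <- "Ki-Adi-Mundi"

# first match only: a matrix with one row per input string
first_hit <- str_locate(name, "-")

# all matches: a list with one matrix per input string
all_hits <- str_locate_all(name, "-")[[1]]

first_hit[1, "start"]   # 3
all_hits[, "start"]     # 3 7
```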
Pattern counting using str_count()

  • str_count() counts the number of matches of a pattern in each string.
str_count(starwars$name, "[:punct:]")
## [1] 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [77] 0 0 0 0 0 0 0 0 0 0 0
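On a smaller input the counts are easier to verify by eye. A sketch with a made-up vector:

```r
library(stringr)

# hypothetical short vector
fruit_small <- c("apple", "banana", "pear")

n_a      <- str_count(fruit_small, "a")          # 1 3 1
n_vowels <- str_count(fruit_small, "[aeiou]")    # 2 3 2
```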
String replacing with str_replace()

  • str_replace() replaces the first match of a pattern in each string with another string.

  • The pattern argument gives the pattern to be replaced and the replacement argument specifies the replacement string.

head(fruit)
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
# Replace berry with berries
str_replace(fruit, pattern = "berry", replacement = "berries")
## [1] "apple" "apricot" "avocado"
## [4] "banana" "bell pepper" "bilberries"
## [7] "blackberries" "blackcurrant" "blood orange"
## [10] "blueberries" "boysenberries" "breadfruit"
## [13] "canary melon" "cantaloupe" "cherimoya"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberries" "coconut" "cranberries"
## [22] "cucumber" "currant" "damson"
## [25] "date" "dragonfruit" "durian"
## [28] "eggplant" "elderberries" "feijoa"
## [31] "fig" "goji berries" "gooseberries"
## [34] "grape" "grapefruit" "guava"
## [37] "honeydew" "huckleberries" "jackfruit"
## [40] "jambul" "jujube" "kiwi fruit"
## [43] "kumquat" "lemon" "lime"
## [46] "loquat" "lychee" "mandarine"
## [49] "mango" "mulberries" "nectarine"
## [52] "nut" "olive" "orange"
## [55] "pamelo" "papaya" "passionfruit"
## [58] "peach" "pear" "persimmon"
## [61] "physalis" "pineapple" "plum"
## [64] "pomegranate" "pomelo" "purple mangosteen"
## [67] "quince" "raisin" "rambutan"
## [70] "raspberries" "redcurrant" "rock melon"
## [73] "salal berries" "satsuma" "star fruit"
## [76] "strawberries" "tamarillo" "tangerine"
## [79] "ugli fruit" "watermelon"
String replacing with str_replace() Cont.

# replace the first l with "" (delete the first l)
str_replace("Hello world", pattern = "l", replacement = "")
## [1] "Helo world"
# replace all l's with "" (delete l's)
str_replace_all("Hello world", pattern = "l", replacement = "")
## [1] "Heo word"
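str_replace_all() also accepts regular expressions and, as a convenience, a named vector of several pattern = replacement pairs. A minimal sketch with made-up inputs:

```r
library(stringr)

# hypothetical string with irregular spacing
messy <- "Data   pre-processing   is  fun"

# collapse every run of white space into a single space
single_spaced <- str_replace_all(messy, "\\s+", " ")   # "Data pre-processing is fun"

# several replacements at once via a named vector
swapped <- str_replace_all("one two", c("one" = "1", "two" = "2"))   # "1 2"
```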
Functions to Remember for Week 11

  • String manipulations using base R and stringr.

  • Usage of regular expressions.

  • Pattern matching functions.

  • Practice!

Your turn! Class Worksheet

  • Working in small groups, complete the following worksheet:

