Module 8-2 Demonstration

Special Operations: Dealing with date/time and character variables

1 / 34

Creating Strings

  • The most basic way to create strings is to use quotation marks and assign a string to an object.
quote <- "The most valuable thing you can have as a leader is clear data"
author <- "Ruth Porat"
  • The paste() function in base R creates and builds strings, separating its arguments with a space by default. stringr's str_c() does the same job, but its default separator is the empty string, like paste0().
paste(quote, "by", author)
## [1] "The most valuable thing you can have as a leader is clear data by Ruth Porat"
  • Use paste0() to paste strings together with no separator.
paste0("I", "love", "Data", "Preprocessing")
## [1] "IloveDataPreprocessing"
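Both functions also take sep and collapse arguments. A minimal sketch of the difference between them (the object names below are only for illustration):

```r
parts <- c("Data", "Preprocessing", "R")

# sep is placed between the separate arguments of one call
with_sep <- paste("Module", "8-2", sep = "_")

# collapse joins the elements of a single vector into one string
collapsed <- paste(parts, collapse = ", ")

# paste0() is shorthand for paste(..., sep = "")
no_sep <- paste0("Week", 11)

print(with_sep)   # "Module_8-2"
print(collapsed)  # "Data, Preprocessing, R"
print(no_sep)     # "Week11"
```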
2 / 34

Converting to Strings

  • Use is.character() to test whether an object is a character vector, and as.character() to convert other data types (numeric, logical, factor, and so on) to character.
is.character(quote)
## [1] TRUE
as.character(pi)
## [1] "3.14159265358979"
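The conversion is vectorised, so it applies to every element at once. A small sketch with made-up vectors:

```r
values <- c(1.5, 2, 10)
flags <- c(TRUE, FALSE)

value_text <- as.character(values)  # "1.5" "2" "10"
flag_text <- as.character(flags)    # "TRUE" "FALSE"

was_character <- is.character(values)     # FALSE: still numeric
is_character <- is.character(value_text)  # TRUE after conversion

# Converting back works only when the text looks like a number
round_trip <- as.numeric(value_text)
```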
3 / 34

Printing Strings

Printing strings/characters can be done with the following:

Function    Usage
print()     generic printing
noquote()   print with no quotes
cat()       concatenate and print with no quotes (and no [1] index prefix)
# print without quotes
print( paste(quote,author) , quote = FALSE)
## [1] The most valuable thing you can have as a leader is clear data Ruth Porat
# same as above, except that `cat()` does not print the [1] index prefix
cat( paste(quote,author) )
## The most valuable thing you can have as a leader is clear data Ruth Porat
4 / 34

Printing Strings Cont.

# basic printing of alphabet
cat(letters)
## a b c d e f g h i j k l m n o p q r s t u v w x y z
# specify a separator between the combined characters
cat(letters, sep = "-")
## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z
5 / 34

Printing Strings Cont.

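  • To wrap long output across lines, use the fill argument of cat(): fill = TRUE wraps at the current print width (getOption("width")), and a positive number sets the width explicitly.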
# No breaks between lines
cat(quote, author, fill = FALSE)
## The most valuable thing you can have as a leader is clear data Ruth Porat
# Breaks between lines
cat(letters, letters, letters, fill = TRUE)
## a b c d e f g h i j k l m n o p q r s t u v w x y z a b c d e f g h i j k l m n
## o p q r s t u v w x y z a b c d e f g h i j k l m n o p q r s t u v w x y z
6 / 34

Counting string elements and characters

  • To count the number of elements in a character vector use length(). A single string is a vector of length 1, however long it is.
length("How many elements are in this string?")
## [1] 1
length( c("How", "many", "elements", "are", "in", "this", "string?") )
## [1] 7
  • To count the number of characters in each string use nchar().
nchar("How many characters are in this string?")
## [1] 39
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1] 3 4 10 3 2 4 7
7 / 34

String manipulation with Base R

  • Basic string manipulation typically includes:

    • case conversion;
    • simple character replacement;
    • pattern replacement;
    • abbreviating;
    • substring replacement;
    • adding/removing white space;
    • set operations.
  • These operations can all be performed with base R functions; however, some operations are greatly simplified with the stringr package.

8 / 34

Upper/lower case conversion

  • To convert all upper case characters to lower case use tolower().
  • To convert all lower case characters to upper case use toupper().
a <- "MATH2349 is AWesomE"
tolower(a)
## [1] "math2349 is awesome"
toupper(a)
## [1] "MATH2349 IS AWESOME"
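A common use of case conversion is comparing text that differs only in capitalisation. A short sketch with hypothetical values:

```r
typed <- "Melbourne"
stored <- "MELBOURNE"

# == is case-sensitive, so the raw strings do not match
raw_match <- typed == stored

# Normalising both sides to one case makes the comparison succeed
normalised_match <- tolower(typed) == tolower(stored)

print(raw_match)         # FALSE
print(normalised_match)  # TRUE
```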
9 / 34

Simple Character Replacement

  • To replace individual characters in a string use chartr(). Each character in old is replaced by the character at the same position in new.
# replace 'z' with 's'
american <- "This is how we analyze."
chartr(old = "z", new = "s", american)
## [1] "This is how we analyse."
# replace 'i' with 'w', 'X' with 'h' and 's' with 'y'
x <- "MiXeD cAsE 123"
chartr(old ="iXs", new ="why", x)
## [1] "MwheD cAyE 123"
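Because chartr() translates all characters in a single pass, it can swap characters without one replacement undoing another. A sketch using an invented DNA string:

```r
dna <- "ATGCCGTA"

# A <-> T and G <-> C are swapped simultaneously, not one after another
complement <- chartr(old = "ATGC", new = "TACG", dna)

print(complement)  # "TACGGCAT"
```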
10 / 34

Pattern Replacement

  • To replace every occurrence of a pattern in a string use gsub(); sub() replaces only the first occurrence.
# replace "ot" pattern with "ut"
x <- "R Totorial"
gsub(pattern = "ot", replacement="ut", x)
## [1] "R Tutorial"
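The difference between the two base R functions shows up when a pattern occurs more than once. A brief sketch (the example strings are arbitrary):

```r
fruit_name <- "banana"

first_only <- sub(pattern = "an", replacement = "AN", fruit_name)
every_match <- gsub(pattern = "an", replacement = "AN", fruit_name)

# fixed = TRUE treats the pattern as literal text, so "." is not a wildcard
dashed <- gsub(".", "-", "a.b.c", fixed = TRUE)

print(first_only)   # "bANana"
print(every_match)  # "bANANa"
print(dashed)       # "a-b-c"
```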
11 / 34

String Abbreviations

  • To abbreviate strings we can use abbreviate().
streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")
# default abbreviations
abbreviate(streets)
## Victoria    Yarra  Russell Williams Swanston
##   "Vctr"   "Yarr"   "Rssl"   "Wllm"   "Swns"
# set minimum length of abbreviation
abbreviate(streets, minlength = 2)
## Victoria    Yarra  Russell Williams Swanston
##     "Vc"     "Yr"     "Rs"     "Wl"     "Sw"
12 / 34

Extract/Replace Substrings

  • substr() extracts or replaces the part of a string between given start and stop positions.
alphabet <- paste(LETTERS, collapse = "")
alphabet
## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)
## [1] "RSTUVWX"
# replace 19-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet
## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
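Replacement through substr() never changes the length of the string: a replacement that is longer than the target span is cut to fit. A small sketch of that behaviour, and of one way around it:

```r
s <- "abcdef"

# Only the first two characters of the replacement are used
substr(s, start = 1, stop = 2) <- "ZZZZ"
print(s)  # "ZZcdef"

# To change the length, rebuild the string from its pieces instead
longer <- paste0("ZZZZ", substr("abcdef", start = 3, stop = 6))
print(longer)  # "ZZZZcdef"
```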
13 / 34

Splitting Strings

  • To split the elements of a character string use strsplit().
z <- "Victoria Yarra Russell Williams Swanston"
strsplit(z, split = " ")
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-")
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
  • Note that the output of strsplit() is a list. To convert the output to a simple atomic vector simply use unlist().
unlist(strsplit(a, split = "-"))
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
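strsplit() returns a list because it is vectorised: each input string gets its own list element. The split argument is a regular expression by default, so literal characters such as "." need fixed = TRUE. A sketch with made-up version strings:

```r
versions <- c("4.3.1", "4.2.0")

# fixed = TRUE splits on a literal dot rather than the regex wildcard
pieces <- strsplit(versions, split = ".", fixed = TRUE)

first_version <- pieces[[1]]

# Take the first piece of every element
majors <- sapply(pieces, `[`, 1)

print(first_version)  # "4" "3" "1"
print(majors)         # "4" "4"
```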
14 / 34

Set operations for character strings

Function      Usage
union()       obtain the union of two character vectors
intersect()   obtain the elements common to both character vectors
setdiff()     obtain the elements of the first vector that are not in the second
setequal()    test whether two vectors contain the same elements, regardless of order
identical()   test whether two character vectors are equal in content and order
15 / 34

Set operations for character strings Cont.

set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")
union(set_1, set_2)
## [1] "VIC" "NSW" "WA" "TAS" "QLD" "SA"
intersect(set_1, set_2)
## [1] "NSW" "TAS"
setdiff(set_1, set_2)
## [1] "VIC" "WA"
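Two details are worth checking by hand: setdiff() depends on argument order, and setequal() ignores order while identical() does not. A short sketch reusing the same state codes:

```r
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")

# Elements of set_2 that are not in set_1 (the reverse of the call above)
only_in_2 <- setdiff(set_2, set_1)

a <- c("VIC", "NSW")
b <- c("NSW", "VIC")
same_elements <- setequal(a, b)   # order is ignored
same_vector <- identical(a, b)    # order matters

print(only_in_2)      # "QLD" "SA"
print(same_elements)  # TRUE
print(same_vector)    # FALSE
```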
16 / 34

String manipulation with stringr

  • The stringr package was developed by Hadley Wickham to provide consistent, simple wrappers around common string operations.

  • These functions are closely related to their base R equivalents:

    • Concatenate with str_c() (base R: paste() and paste0()).

    • Number of characters with str_length() (base R: nchar()).

    • Substring with str_sub() (base R: substr()).

17 / 34

Duplicate Characters within a String

  • stringr provides str_dup() to duplicate and concatenate strings. Recent versions of base R offer strrep() for the same task.
str_dup("apples", times = 4)
## [1] "applesapplesapplesapples"
str_dup("apples", times = 1:4)
## [1] "apples" "applesapples"
## [3] "applesapplesapples" "applesapplesapplesapples"
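For comparison, a minimal base R sketch of the same duplication using strrep(), assuming a version of R recent enough to include it:

```r
# Repeat one string a fixed number of times
repeated <- strrep("apples", times = 2)

# times is vectorised, giving one result per value
ruler <- strrep("-", times = 1:3)

print(repeated)  # "applesapples"
print(ruler)     # "-" "--" "---"
```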
18 / 34

Remove Leading and Trailing White space

  • In string processing, a common task is parsing text into individual words.

  • Often, this leaves words with blank spaces (white space) at either end. str_trim() removes these spaces.

text <- c("Text ", " with", " whitespace ", " on", "both ", " sides ")
text
## [1] "Text " " with" " whitespace " " on" "both "
## [6] " sides "
str_trim(text, side = "left")
## [1] "Text " "with" "whitespace " "on" "both "
## [6] "sides "
str_trim(text, side = "right")
## [1] "Text" " with" " whitespace" " on" "both"
## [6] " sides"
str_trim(text, side = "both")
## [1] "Text" "with" "whitespace" "on" "both"
## [6] "sides"
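Base R has a comparable function, trimws(), whose which argument plays the role of side. A sketch with a shortened version of the vector above:

```r
padded <- c("Text ", " with", " whitespace ")

left_trimmed <- trimws(padded, which = "left")
both_trimmed <- trimws(padded, which = "both")

# Runs of white space inside a string can be collapsed with gsub()
squeezed <- gsub("\\s+", " ", "too   many    spaces")

print(left_trimmed)  # "Text " "with" "whitespace "
print(both_trimmed)  # "Text" "with" "whitespace"
print(squeezed)      # "too many spaces"
```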
19 / 34

Pad a String with White space

  • Conversely, to add whitespace, or to pad a string, we can use str_pad().
str_pad("apples", width = 10, side = "left")
## [1] "    apples"
str_pad("apples", width = 10, side = "both")
## [1] "  apples  "
  • str_pad() can also pad with a character other than a space: width sets the minimum total width of the result and pad sets the (single) padding character.
str_pad("apples", width = 10, side = "right", pad = "!")
## [1] "apples!!!!"
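Padding can also be done in base R with sprintf() format strings. This is a sketch of an alternative, not the approach used in the slides:

```r
# A width of 10 right-justifies the text, padding on the left
left_padded <- sprintf("%10s", "apples")

# A minus sign left-justifies, padding on the right
right_padded <- sprintf("%-10s", "apples")

# Zero-padding an integer to three digits
zero_padded <- sprintf("%03d", 7L)

print(left_padded)   # "    apples"
print(right_padded)  # "apples    "
print(zero_padded)   # "007"
```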
20 / 34

Pattern matching

  • The vast majority of string manipulations require pattern matching for a given text.

  • The stringr package has pattern matching functions to detect, subset, locate, count, extract, and replace strings.

21 / 34

Pattern detection with str_detect()

  • str_detect() detects the presence or absence of a pattern and returns a logical vector.
# detects pattern "ea"
x <- c("apple", "banana", "pear")
str_detect(x, pattern ="ea")
## [1] FALSE FALSE TRUE
#same as above
str_detect(x, "ea")
## [1] FALSE FALSE TRUE
22 / 34

Remark: Regular expressions (Regex)

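  • Patterns can also be written as regular expressions.

  • Regular expressions (regexes) are a language for describing patterns in strings. stringr treats a pattern as a regular expression by default, so wrapping it in regex() is only needed to pass options such as ignore_case.

# Same as above, with the pattern wrapped explicitly in regex()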
x <- c("apple", "banana", "pear")
str_detect(x, regex("ea"))
## [1] FALSE FALSE TRUE
  • You can perform a case-insensitive match using ignore_case = TRUE.
bananas <- c("banana", "Banana", "BANANA")
#case insensitive match
str_detect(bananas, regex("banana",ignore_case = TRUE))
## [1] TRUE TRUE TRUE
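The base R counterpart of str_detect() is grepl(), which takes the pattern first and the strings second. A short sketch repeating the two examples:

```r
x <- c("apple", "banana", "pear")
has_ea <- grepl("ea", x)

bananas <- c("banana", "Banana", "BANANA")

# ignore.case = TRUE gives a case-insensitive match
any_case <- grepl("banana", bananas, ignore.case = TRUE)

# fixed = TRUE matches the text literally (and case-sensitively)
exact_case <- grepl("banana", bananas, fixed = TRUE)

print(has_ea)      # FALSE FALSE TRUE
print(any_case)    # TRUE TRUE TRUE
print(exact_case)  # TRUE FALSE FALSE
```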
23 / 34

Remark: Regular expressions (Regex) Cont.

  • With regex, you can create your own character classes using [ ]. For example:
  • [abc]: matches a, b, or c.
  • [a-z]: matches every character between a and z (in Unicode code point order).
  • [^abc]: matches anything except a, b, or c.
  • [\^\-]: matches ^ or -.
  • They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.

  • For more information on regex capabilities, refer to the regular expressions vignette in the stringr package.
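The character classes listed above can be tried out with base R's grepl(). The vector below is made up for illustration; the last pattern places - first inside the brackets, which avoids the need for escapes:

```r
samples <- c("cab", "dog", "r2d2", "b-c", "a^b")

has_abc <- grepl("[abc]", samples)        # contains a, b, or c
has_digit <- grepl("[0-9]", samples)      # contains a character from 0 to 9
has_other <- grepl("[^a-z]", samples)     # contains something that is not a-z
has_symbol <- grepl("[-^]", samples)      # contains - or ^

print(has_abc)     # TRUE FALSE FALSE TRUE TRUE
print(has_digit)   # FALSE FALSE TRUE FALSE FALSE
print(has_other)   # FALSE FALSE TRUE TRUE TRUE
print(has_symbol)  # FALSE FALSE FALSE TRUE TRUE
```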

24 / 34

Remark: Regular expressions (Regex) Cont.

  • There are a number of pre-built classes that you can use inside [ ]:
  • [:punct:]: punctuation.
  • [:alpha:]: letters.
  • [:lower:]: lowercase letters.
  • [:upper:]: uppercase letters.
  • [:digit:]: digits.
  • [:xdigit:]: hex digits.
  • [:alnum:]: letters and numbers.
  • [:cntrl:]: control characters.
  • [:graph:]: letters, numbers, and punctuation.
  • [:print:]: letters, numbers, punctuation, and white space.
  • [:space:]: space characters (basically equivalent to \s).
  • [:blank:]: space and tab.
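In base R functions such as grepl() and gsub(), a pre-built class must itself sit inside a bracket expression, which is why the brackets are doubled. A sketch with a few invented names:

```r
character_names <- c("C-3PO", "Luke Skywalker", "R2-D2")

has_punct <- grepl("[[:punct:]]", character_names)
has_space <- grepl("[[:space:]]", character_names)

# Several classes can share one bracket expression
letters_only <- gsub("[[:digit:][:punct:]]", "", character_names)

print(has_punct)     # TRUE FALSE TRUE
print(has_space)     # FALSE TRUE FALSE
print(letters_only)  # "CPO" "Luke Skywalker" "RD"
```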
25 / 34

Your turn!

  • Use the words data set in stringr (a vector of common English words) for the following tasks.
library(stringr)
head(words)
## [1] "a" "able" "about" "absolute" "accept" "account"
length(words)
## [1] 980
26 / 34
# Task 1: count the words that contain "ing"
str_detect(words, pattern = regex("ing")) %>% sum()
## [1] 10
# Same as above:
str_detect(words, "ing") %>% sum()
## [1] 10
# Task 2: count the words that end with "ing"
str_detect(words, "ing$") %>% sum()
## [1] 9
# Task 3: list the words that end with "ing"
words[str_detect(words, "ing$")]
## [1] "bring" "during" "evening" "king" "meaning" "morning" "ring"
## [8] "sing" "thing"
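The $ in "ing$" is an anchor that ties the match to the end of the string; ^ does the same for the start. A small base R sketch on a hypothetical vector:

```r
sample_words <- c("ring", "ringing", "finger", "thing")

contains_ing <- grepl("ing", sample_words)      # anywhere in the word
ends_with_ing <- grepl("ing$", sample_words)    # only at the end
starts_with_ring <- grepl("^ring", sample_words)  # only at the start

# Summing a logical vector counts the TRUE values
n_ending <- sum(ends_with_ing)

print(contains_ing)      # TRUE TRUE TRUE TRUE
print(ends_with_ing)     # TRUE TRUE FALSE TRUE
print(starts_with_ring)  # TRUE TRUE FALSE FALSE
print(n_ending)          # 3
```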

String subsetting with str_subset()

  • str_subset() returns the elements of a character vector that match a regular expression.

  • Using the starwars data set from the dplyr package, let's subset the character names that contain any punctuation.

library(dplyr) # provides the starwars data set
head(starwars$name)
## [1] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader"
## [5] "Leia Organa" "Owen Lars"
str_subset(starwars$name, "[:punct:]")
## [1] "C-3PO" "R2-D2" "R5-D4" "Obi-Wan Kenobi"
## [5] "IG-88" "Qui-Gon Jinn" "Ki-Adi-Mundi" "R4-P17"
27 / 34

String extract using str_extract()

  • str_extract() extracts the text of the first match in each string and returns a character vector, with NA where there is no match.
str_extract(starwars$name, "[:punct:]")
## [1] NA "-" "-" NA NA NA NA "-" NA "-" NA NA NA NA NA NA NA NA NA
## [20] NA NA "-" NA NA NA NA NA NA NA NA "-" NA NA NA NA NA NA NA
## [39] NA NA NA NA NA NA NA NA NA NA NA NA "-" NA NA NA NA NA NA
## [58] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA "-" NA NA
## [77] NA NA NA NA NA NA NA NA NA NA NA
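A rough base R counterpart combines regexpr() or gregexpr() with regmatches(). Unlike str_extract(), regmatches() with regexpr() drops the strings that have no match instead of returning NA. A sketch on three invented names:

```r
character_names <- c("C-3PO", "Luke Skywalker", "Ki-Adi-Mundi")

# First match per string; non-matching strings are dropped
first_matches <- regmatches(character_names,
                            regexpr("[[:punct:]]", character_names))

# All matches per string, returned as a list with one element per input
all_matches <- regmatches(character_names,
                          gregexpr("[[:punct:]]", character_names))

print(first_matches)     # "-" "-"
print(all_matches[[3]])  # "-" "-"
```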
28 / 34

Finding pattern locations using str_locate()

  • str_locate() locates the first match of a pattern in each string and returns an integer matrix with columns start and end. str_locate_all() locates every match.
str_locate(starwars$name, "[:punct:]") %>% head()
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA
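In base R, regexpr() reports similar information: the start position of the first match, with the match length stored as an attribute. It uses -1 rather than NA for strings without a match. A sketch with made-up names:

```r
character_names <- c("C-3PO", "Luke Skywalker", "Ki-Adi-Mundi")

positions <- regexpr("[[:punct:]]", character_names)

# Strip the attributes to get plain integers
starts <- as.integer(positions)
match_lengths <- attr(positions, "match.length")

print(starts)         # 2 -1 3
print(match_lengths)  # 1 -1 1
```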
29 / 34

Pattern counting using str_count()

  • str_count() counts the number of matches of a pattern in each string.
str_count(starwars$name, "[:punct:]")
## [1] 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [77] 0 0 0 0 0 0 0 0 0 0 0
30 / 34

String replacing with str_replace()

  • str_replace() replaces the first match of a pattern in each string; str_replace_all() replaces every match.

  • The pattern argument gives the pattern to be replaced and the replacement argument gives the replacement string.

head(fruit)
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
# Replace berry with berries
str_replace(fruit, pattern = "berry", replacement = "berries")
## [1] "apple" "apricot" "avocado"
## [4] "banana" "bell pepper" "bilberries"
## [7] "blackberries" "blackcurrant" "blood orange"
## [10] "blueberries" "boysenberries" "breadfruit"
## [13] "canary melon" "cantaloupe" "cherimoya"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberries" "coconut" "cranberries"
## [22] "cucumber" "currant" "damson"
## [25] "date" "dragonfruit" "durian"
## [28] "eggplant" "elderberries" "feijoa"
## [31] "fig" "goji berries" "gooseberries"
## [34] "grape" "grapefruit" "guava"
## [37] "honeydew" "huckleberries" "jackfruit"
## [40] "jambul" "jujube" "kiwi fruit"
## [43] "kumquat" "lemon" "lime"
## [46] "loquat" "lychee" "mandarine"
## [49] "mango" "mulberries" "nectarine"
## [52] "nut" "olive" "orange"
## [55] "pamelo" "papaya" "passionfruit"
## [58] "peach" "pear" "persimmon"
## [61] "physalis" "pineapple" "plum"
## [64] "pomegranate" "pomelo" "purple mangosteen"
## [67] "quince" "raisin" "rambutan"
## [70] "raspberries" "redcurrant" "rock melon"
## [73] "salal berries" "satsuma" "star fruit"
## [76] "strawberries" "tamarillo" "tangerine"
## [79] "ugli fruit" "watermelon"
31 / 34

String replacing with str_replace() Cont.

#replace first l with "" (delete first l)
str_replace("Hello world", pattern = "l", replacement = "")
## [1] "Helo world"
# replace all l's with "" (delete l's)
str_replace_all("Hello world", pattern = "l", replacement = "")
## [1] "Heo word"
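Replacements can also reuse parts of the match. Parentheses in the pattern capture groups, and \\1, \\2, ... in the replacement refer back to them. A base R sketch with gsub(), using an invented date string:

```r
# Swap two words by capturing each one
swapped <- gsub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 \\1", "Hello world")

# Reorder a year-month-day date into day/month/year
iso_date <- "2024-03-15"
reordered <- gsub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3/\\2/\\1", iso_date)

print(swapped)    # "world Hello"
print(reordered)  # "15/03/2024"
```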
32 / 34

Functions to Remember for Week 11

  • String manipulations using base R and stringr.

  • Usage of regular expressions.

  • Pattern matching functions.

  • Practice!

33 / 34

Your turn! Class Worksheet

  • Working in small groups, complete the following worksheet:

Module 8-2 Worksheet

  • Once completed, feel free to work on your Assessments.





34 / 34
