Module 8-2 Demonstration
Special Operations: Dealing with character variables
1 / 34

Creating Strings

The most basic way to create strings is to use quotation marks and assign a string to an object.

quote <- "Most valuable thing you have as a leader is clear data"
author <- "Ruth Porat"

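The paste() function in base R creates and builds strings by joining its arguments, separated by a space by default. The stringr function str_c() does the same job, but its default separator is the empty string, so it behaves like paste0() unless sep is set.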

paste(quote, "by", author)

## [1] "Most valuable thing you have as a leader is clear data by Ruth Porat"

Use paste0() to paste with no separator between the pieces.

paste0("I", "love",  "Data", "Wrangling")

## [1] "IloveDataWrangling"
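
The sep and collapse arguments of paste() are easy to confuse, so here is a minimal sketch of the difference. The vector parts and the labels are made-up illustration values, not taken from the slides.

```r
# Hypothetical example vector (not from the slides)
parts <- c("Data", "Wrangling", "Module")

# sep controls what goes between the separate arguments being pasted
with_sep <- paste("Module", "8", "2", sep = "-")

# collapse joins the elements of ONE vector into a single string
collapsed <- paste(parts, collapse = " ")

# paste0() is vectorised: the shorter argument is recycled
labels <- paste0("slide_", 1:3)

print(with_sep)   # "Module-8-2"
print(collapsed)  # "Data Wrangling Module"
print(labels)     # "slide_1" "slide_2" "slide_3"
```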

2 / 34

Converting to Strings

Use is.character() to test whether an object is a character string, and as.character() to convert other data types (numeric, logical, factor, and so on) to character.

is.character(quote)

## [1] TRUE

as.character(3.54)

## [1] "3.54"
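
Both functions also work on whole vectors. The sketch below uses a made-up numeric vector nums to show a round trip from numeric to character and back.

```r
# Hypothetical numeric vector (not from the slides)
nums <- c(1.5, 20, 300)

# Convert every element to a string
chars <- as.character(nums)

print(chars)                # "1.5" "20" "300"
print(is.character(nums))   # FALSE
print(is.character(chars))  # TRUE

# Converting back recovers the original numbers
back <- as.numeric(chars)
print(identical(back, nums))  # TRUE
```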

3 / 34

Printing Strings

Printing strings/characters can be done with the following:

Function	Usage
`print()`	generic printing
`noquote()`	print with no quotes
`cat()`	concatenate and print with no quotes and no [1] index prefix

print( paste(quote,author) , quote = FALSE)

## [1] Most valuable thing you have as a leader is clear data Ruth Porat

noquote( paste(quote,author) )

## [1] Most valuable thing you have as a leader is clear data Ruth Porat

cat( paste(quote,author) )

## Most valuable thing you have as a leader is clear data Ruth Porat

4 / 34

Printing Strings Cont.

# basic printing of alphabet
cat(letters)

## a b c d e f g h i j k l m n o p q r s t u v w x y z

# specify a separator between the combined characters
cat(letters, sep = "-")

## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z

5 / 34

Printing Strings Cont.

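To wrap long output across lines, use the fill argument of cat(). With fill = TRUE the output is broken at the console width given by getOption("width"); a positive number sets the line width explicitly.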

# No breaks between lines
cat(quote, author, quote, author, fill = FALSE)

## Most valuable thing you have as a leader is clear data Ruth Porat Most valuable thing you have as a leader is clear data Ruth Porat

# Breaks between lines
cat(quote, author, quote, author, fill = TRUE)

## Most valuable thing you have as a leader is clear data Ruth Porat 
## Most valuable thing you have as a leader is clear data Ruth Porat

6 / 34

Counting string elements and characters

To count the number of elements in a character vector use length(). A single string is a vector of length one, however many characters it contains.

length("How many elements are in this string?")

## [1] 1

length( c("How", "many", "elements", "are", "in", "this", "string?") )

## [1] 7

To count the number of characters in a string use nchar().

nchar("How many characters are in this string?")

## [1] 39

nchar(c("How", "many", "characters", "are", "in", "this", "string?"))

## [1]  3  4 10  3  2  4  7
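
A small sketch that puts length() and nchar() side by side on the same made-up vector (words_vec is a hypothetical name, chosen to avoid clashing with the stringr words data set used later):

```r
# Hypothetical vector of three strings
words_vec <- c("How", "many", "characters")

n_elements <- length(words_vec)       # number of strings in the vector
n_chars    <- nchar(words_vec)        # characters in each string
total      <- sum(nchar(words_vec))   # characters across all strings

# Collapsing into one sentence adds the two separating spaces
sentence <- paste(words_vec, collapse = " ")

print(n_elements)       # 3
print(n_chars)          # 3 4 10
print(total)            # 17
print(nchar(sentence))  # 19
```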

7 / 34

String manipulation with Base R

Basic string manipulation typically includes:
- case conversion;
- simple character replacement;
- pattern replacement;
- abbreviating;
- substring replacement;
- adding/removing white space;
- set operations.
These operations can all be performed with base R functions; however, some operations are greatly simplified with the stringr package.

8 / 34

Upper/lower case conversion

To convert all upper case characters to lower case use tolower().

To convert all lower case characters to upper case use toupper().

a <- "MATH2349 is AWesomE"
tolower(a)

## [1] "math2349 is awesome"

toupper(a)

## [1] "MATH2349 IS AWESOME"
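
One common use of case conversion is comparing strings while ignoring case. A minimal sketch, with made-up course codes:

```r
# Hypothetical course codes that differ only in case
code_1 <- "MATH2349"
code_2 <- "math2349"

# A direct comparison is case-sensitive
strict <- code_1 == code_2

# Converting both sides to the same case first ignores the difference
relaxed <- tolower(code_1) == tolower(code_2)

print(strict)   # FALSE
print(relaxed)  # TRUE
```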

9 / 34

Simple Character Replacement

To replace a character (or multiple characters) in a string use chartr(). Each character in old is replaced by the character at the same position in new.

# replace 'z' with 's'
american <- "This is how we analyze."
chartr(old = "z", new = "s", american)

## [1] "This is how we analyse."

# replace 'i' with 'w', 'X' with 'h' and 's' with 'y'
x <- "MiXeD cAsE 123"
chartr(old ="iXs", new ="why", x)

## [1] "MwheD cAyE 123"

10 / 34

Pattern Replacement

To replace every occurrence of a pattern in a string use gsub(); sub() replaces only the first occurrence.

# replace "ot" pattern with "ut"
x <- "R Totorial"
gsub(pattern = "ot", replacement="ut", x)

## [1] "R Tutorial"
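
The sketch below contrasts sub() with gsub() and shows the fixed = TRUE argument, which treats the pattern as literal text rather than a regular expression. The input strings are made up for illustration.

```r
fruit_name <- "banana"

# sub() replaces only the first match, gsub() replaces all matches
first_only <- sub("an", "AN", fruit_name)
all_hits   <- gsub("an", "AN", fruit_name)

print(first_only)  # "bANana"
print(all_hits)    # "bANANa"

# "." is a regex wildcard, so without fixed = TRUE every character matches
dotted  <- "a.b.c"
regex_v <- gsub(".", "-", dotted)
fixed_v <- gsub(".", "-", dotted, fixed = TRUE)

print(regex_v)  # "-----"
print(fixed_v)  # "a-b-c"
```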

11 / 34

String Abbreviations

To abbreviate strings we can use abbreviate().

streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")
# default abbreviations
abbreviate(streets)

## Victoria    Yarra  Russell Williams Swanston 
##   "Vctr"   "Yarr"   "Rssl"   "Wllm"   "Swns"

# set minimum length of abbreviation
abbreviate(streets, minlength = 5)

## Victoria    Yarra  Russell Williams Swanston 
##  "Victr"  "Yarra"  "Rssll"  "Wllms"  "Swnst"

12 / 34

Extract/Replace Substrings

substr() extracts or replaces the substring between specified start and stop positions.

alphabet <- paste(LETTERS, collapse = "")
alphabet

## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)

## [1] "RSTUVWX"

# replace 19-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet

## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
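
substr() is vectorised, so it can slice every element of a character vector at once. A minimal sketch, assuming made-up codes in which the first three characters are a state and the last four a postcode:

```r
# Hypothetical fixed-width codes (state + postcode)
codes <- c("VIC3000", "NSW2000", "TAS7000")

# Characters 1-3 of every element
states <- substr(codes, start = 1, stop = 3)

# Characters 4-7 of every element
postcodes <- substr(codes, start = 4, stop = 7)

print(states)     # "VIC" "NSW" "TAS"
print(postcodes)  # "3000" "2000" "7000"
```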

13 / 34

Extract/Replace Substrings

To split a character string into pieces at a given separator use strsplit().

z <- "Victoria Yarra Russell Williams Swanston"
strsplit(z, split = " ")

## [[1]]
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"

a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-")

## [[1]]
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"

Note that the output of strsplit() is a list. To convert the output to an atomic vector, use unlist().

unlist(strsplit(a, split = "-"))

## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"
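
strsplit() returns a list because each input string can split into a different number of pieces. The sketch below, on a made-up vector routes, shows this and also uses fixed = TRUE for a separator that is a regex special character.

```r
# Hypothetical vector with two strings to split
routes <- c("Victoria-Yarra", "Russell-Williams-Swanston")

pieces <- strsplit(routes, split = "-")

print(length(pieces))   # 2 (one list element per input string)
print(lengths(pieces))  # 2 3 (pieces per input string)

# Take the first piece of every element
first_piece <- sapply(pieces, `[`, 1)
print(first_piece)      # "Victoria" "Russell"

# "." is a regex wildcard, so split on it literally with fixed = TRUE
dot_parts <- strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]
print(dot_parts)        # "a" "b" "c"
```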

14 / 34

Set operations for character strings

Function	Usage
`union()`	obtain union between two character vectors
`intersect()`	obtain the common elements of two character vectors
`setdiff()`	obtain the non-common elements, or the difference
`setequal()`	tests if two vectors contain the same elements regardless of order
`identical()`	tests if two character vectors are equal in content and order
15 / 34

Set operations for character strings Cont.

set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")
union(set_1, set_2)

## [1] "VIC" "NSW" "WA"  "TAS" "QLD" "SA"

intersect(set_1, set_2)

## [1] "NSW" "TAS"

setdiff(set_1, set_2)

## [1] "VIC" "WA"

setdiff(set_2, set_1)

## [1] "QLD" "SA"
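
setequal() and identical() are listed in the table but not demonstrated. A minimal sketch of how they differ, where set_3 is a hypothetical reordering of set_1:

```r
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_3 <- c("TAS", "WA", "NSW", "VIC")  # same elements, reversed order

same_elements <- setequal(set_1, set_3)          # ignores order
same_vector   <- identical(set_1, set_3)         # order matters
same_reversed <- identical(set_1, rev(set_3))    # reversing restores the order

print(same_elements)  # TRUE
print(same_vector)    # FALSE
print(same_reversed)  # TRUE

# %in% tests membership of individual elements
print("WA" %in% set_1)   # TRUE
print("QLD" %in% set_1)  # FALSE
```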

16 / 34

String manipulation with stringr

The stringr package was developed by Hadley Wickham to provide consistent, simple wrappers for common string operations.
These functions are closely related to their base R equivalents:
- Concatenate with str_c() ( $\sim$ paste() and paste0()).
- Number of characters with str_length() ( $\sim$ nchar()).
- Substring with str_sub() ( $\sim$ substr() ).
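
The correspondences above can be checked directly. This is a small sketch on two made-up strings. Note that str_c() joins with no separator unless sep is given, and that str_sub() accepts negative positions counted from the end of the string.

```r
library(stringr)

first  <- "Data"
second <- "Wrangling"

# str_c() defaults to no separator, like paste0()
joined        <- str_c(first, second)
joined_spaced <- str_c(first, second, sep = " ")

print(joined == paste0(first, second))       # TRUE
print(joined_spaced == paste(first, second)) # TRUE

# str_length() mirrors nchar()
print(str_length(second))                    # 9

# str_sub() mirrors substr() and also accepts negative positions
print(str_sub(second, 1, 4))                 # "Wran"
last_three <- str_sub(second, -3)
print(last_three)                            # "ing"
```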

17 / 34

Duplicate Characters within a String

In addition, stringr provides str_dup(), which repeats a string a given number of times and concatenates the copies (base R offers strrep() for the same purpose).

str_dup("Data", times = 4)

## [1] "DataDataDataData"

str_dup("Data", times = 1:4)

## [1] "Data"             "DataData"         "DataDataData"     "DataDataDataData"

18 / 34

Remove Leading and Trailing White space

In string processing, a common task is parsing text into individual words.
Often, this results in words having blank spaces (white space) on either end of the word. The str_trim() function can be used to remove these spaces.

text <- c("Text ", "  with", " whitespace ")
text

## [1] "Text "        "  with"       " whitespace "

str_trim(text, side = "left")

## [1] "Text "       "with"        "whitespace "

str_trim(text, side = "right")

## [1] "Text"        "  with"      " whitespace"

str_trim(text, side = "both")

## [1] "Text"       "with"       "whitespace"

19 / 34

Pad a String with White space

Conversely, to add whitespace, or to pad a string, we can use str_pad().

str_pad("Data", width = 10, side = "left")

## [1] "      Data"

str_pad("Data", width = 10, side = "both")

## [1] "   Data   "

Use str_pad() to pad a string with a specified character. The width argument gives the total width of the padded string and the pad argument specifies the single padding character.

str_pad("Data", width = 10, side = "right", pad = "!")

## [1] "Data!!!!!!"
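
A typical use of str_pad() is zero-padding identifiers to a fixed width. The sketch below uses made-up ID numbers. Strings already at least as wide as width are returned unchanged.

```r
library(stringr)

# Hypothetical numeric IDs, converted to character before padding
ids <- as.character(c(7, 42, 123))

padded <- str_pad(ids, width = 5, side = "left", pad = "0")
print(padded)     # "00007" "00042" "00123"

# A string longer than width is not truncated
too_long <- str_pad("123456", width = 5, side = "left", pad = "0")
print(too_long)   # "123456"
```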

20 / 34

Pattern matching

The vast majority of string manipulations require matching a pattern in a given text.
The good news is that the stringr package has pattern matching functions to detect, subset, locate, count, extract, and replace strings.

21 / 34

Pattern detection with str_detect()

str_detect() detects the presence or absence of a pattern and returns a logical vector.

# detects pattern "ea"
x <- c("apple", "banana", "pear","pEAr")
str_detect(x, pattern ="ea")

## [1] FALSE FALSE  TRUE FALSE

#same as above
str_detect(x, "ea")

## [1] FALSE FALSE  TRUE FALSE

22 / 34

Remark: Regular expressions (Regex)

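When matching patterns, you can also use regular expressions.
Regular expressions (a.k.a. regexes) are a language that allows you to describe patterns in strings. stringr interprets a pattern as a regular expression by default, so wrapping it in regex() is only needed to pass options.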

# Same as above using regex
x <- c("apple", "banana", "pear","pEAr")
str_detect(x, regex("ea"))

## [1] FALSE FALSE  TRUE FALSE

You can perform a case-insensitive match using ignore_case = TRUE.

str_detect(x, regex("ea",ignore_case = TRUE))

## [1] FALSE FALSE  TRUE  TRUE

23 / 34

Remark: Regular expressions (Regex) Cont.

With regex, you can create your own character classes using [ ]. For example:

[abc]: matches a, b, or c.
[a-z]: matches every character between a and z (in Unicode code point order).
[^abc]: matches anything except a, b, or c.
[\^\-]: matches ^ or -.
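
A short sketch of these character classes in use with str_detect(). The vector samples is made up for illustration. Inside an R string the backslash itself must be escaped, so the regex [\-] is written "[\\-]".

```r
library(stringr)

# Hypothetical test strings
samples <- c("apple", "b4nana", "PEAR", "kiwi-fruit")

has_digit   <- str_detect(samples, "[0-9]")       # any digit
has_upper   <- str_detect(samples, "[A-Z]")       # any uppercase letter
no_vowel_1  <- str_detect(samples, "^[^aeiou]")   # first character is not a lowercase vowel
has_hyphen  <- str_detect(samples, "[\\-]")       # an escaped hyphen inside a class

print(has_digit)   # FALSE  TRUE FALSE FALSE
print(has_upper)   # FALSE FALSE  TRUE FALSE
print(no_vowel_1)  # FALSE  TRUE  TRUE  TRUE
print(has_hyphen)  # FALSE FALSE FALSE  TRUE
```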

They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.
For more information on the regex capabilities, please refer to the regular expressions vignette of the stringr package.

24 / 34

Remark: Regular expressions (Regex) Cont.

There are a number of pre-built classes that you can use inside [ ]:
[:punct:]: punctuation.
[:alpha:]: letters.
[:lower:]: lowercase letters.
[:upper:]: uppercase letters.
[:digit:]: digits.
[:xdigit:]: hex digits.
[:alnum:]: letters and numbers.
[:cntrl:]: control characters.
[:graph:]: letters, numbers, and punctuation.
[:print:]: letters, numbers, punctuation, and white space.
[:space:]: space characters (basically equivalent to \s).
[:blank:]: space and tab.
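
A minimal sketch of the pre-built classes, written in the double-bracket form (the class name placed inside [ ], as described above). The strings in messy are made up for illustration.

```r
library(stringr)

# Hypothetical messy strings
messy <- c("R2-D2", "Hello, world!", "tab\tand  spaces")

# Count the digits in each string
n_digits <- str_count(messy, "[[:digit:]]")
print(n_digits)   # 2 0 0

# Remove all punctuation
no_punct <- str_replace_all(messy, "[[:punct:]]", "")
print(no_punct)   # "R2D2" "Hello world" "tab\tand  spaces"

# Squeeze every run of white space into a single space
squeezed <- str_replace_all(messy, "[[:space:]]+", " ")
print(squeezed)   # "R2-D2" "Hello, world!" "tab and spaces"
```
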
25 / 34

Your turn!

Use the words data set (a vector of commonly used English words) that ships with stringr.

library(stringr)
head(words)

## [1] "a"        "able"     "about"    "absolute" "accept"   "account"

length(words)

## [1] 980

Task 1. How many words contain the pattern "ing"?
Task 2. How many words end in "ing"? Hint: use anchors (https://stringr.tidyverse.org/articles/regular-expressions.html#anchors).
Task 3. Which words end in "ing"?

26 / 34

#Task 1:
str_detect(words, pattern = regex("ing")) %>% sum()

## [1] 10

# Same as above:
str_detect(words, "ing") %>% sum()

## [1] 10

# Task 2:
str_detect(words, "ing$") %>% sum()

## [1] 9

# Task 3:
words[str_detect(words, "ing$")]

## [1] "bring"   "during"  "evening" "king"    "meaning" "morning" "ring"   
## [8] "sing"    "thing"

String subsetting with str_subset()

str_subset() returns the elements of a character vector that match a regular expression.
Using the starwars data set from the dplyr package, let's subset the character names that contain any punctuation.

head(starwars$name)

## [1] "Luke Skywalker" "C-3PO"          "R2-D2"          "Darth Vader"   
## [5] "Leia Organa"    "Owen Lars"

str_subset(starwars$name, "[:punct:]")

## [1] "C-3PO"          "R2-D2"          "R5-D4"          "Obi-Wan Kenobi"
## [5] "IG-88"          "Qui-Gon Jinn"   "Ki-Adi-Mundi"   "R4-P17"

27 / 34

String extract using str_extract()

str_extract() extracts the text of the first match in each string, returning a character vector with NA where there is no match.

str_extract(starwars$name, "[:punct:]")

##  [1] NA  "-" "-" NA  NA  NA  NA  "-" NA  "-" NA  NA  NA  NA  NA  NA  NA  NA  NA 
## [20] NA  NA  "-" NA  NA  NA  NA  NA  NA  NA  NA  "-" NA  NA  NA  NA  NA  NA  NA 
## [39] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  "-" NA  NA  NA  NA  NA  NA 
## [58] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  "-" NA  NA 
## [77] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

28 / 34

Finding pattern locations using str_locate()

str_locate() locates the first occurrence of a pattern in each string and returns an integer matrix with columns start and end, whereas str_locate_all() locates every occurrence and returns a list of such matrices.

str_locate(starwars$name, "[:punct:]") %>% head()

##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA
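
To see the difference between the two functions, the sketch below applies both to a single name from the output shown earlier, "Ki-Adi-Mundi", which contains two hyphens.

```r
library(stringr)

name <- "Ki-Adi-Mundi"

# First hyphen only: a one-row matrix
first_hit <- str_locate(name, "-")
print(first_hit[1, "start"])   # 3

# Every hyphen: a list with one matrix per input string
all_hits <- str_locate_all(name, "-")[[1]]
print(nrow(all_hits))          # 2
print(all_hits[, "start"])     # 3 7
```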

29 / 34

Pattern counting using str_count()

str_count() counts the number of matches of a pattern in each string.

str_count(starwars$name, "[:punct:]")

##  [1] 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [77] 0 0 0 0 0 0 0 0 0 0 0

30 / 34

String replacing with str_replace()

str_replace() replaces the first match of a pattern in each string with another string; str_replace_all() replaces every match.
The pattern argument gives the pattern to be replaced and the replacement argument specifies the replacement string.

head(fruit)

## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"

# Replace berry with berries
str_replace(fruit, pattern = "berry", replacement = "berries")

##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "bell pepper"       "bilberries"       
##  [7] "blackberries"      "blackcurrant"      "blood orange"     
## [10] "blueberries"       "boysenberries"     "breadfruit"       
## [13] "canary melon"      "cantaloupe"        "cherimoya"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberries"      "coconut"           "cranberries"      
## [22] "cucumber"          "currant"           "damson"           
## [25] "date"              "dragonfruit"       "durian"           
## [28] "eggplant"          "elderberries"      "feijoa"           
## [31] "fig"               "goji berries"      "gooseberries"     
## [34] "grape"             "grapefruit"        "guava"            
## [37] "honeydew"          "huckleberries"     "jackfruit"        
## [40] "jambul"            "jujube"            "kiwi fruit"       
## [43] "kumquat"           "lemon"             "lime"             
## [46] "loquat"            "lychee"            "mandarine"        
## [49] "mango"             "mulberries"        "nectarine"        
## [52] "nut"               "olive"             "orange"           
## [55] "pamelo"            "papaya"            "passionfruit"     
## [58] "peach"             "pear"              "persimmon"        
## [61] "physalis"          "pineapple"         "plum"             
## [64] "pomegranate"       "pomelo"            "purple mangosteen"
## [67] "quince"            "raisin"            "rambutan"         
## [70] "raspberries"       "redcurrant"        "rock melon"       
## [73] "salal berries"     "satsuma"           "star fruit"       
## [76] "strawberries"      "tamarillo"         "tangerine"        
## [79] "ugli fruit"        "watermelon"

31 / 34

String replacing with str_replace() Cont.

#replace first l with "" (delete first l)
str_replace("Hello world", pattern = "l", replacement = "")

## [1] "Helo world"

# replace all l's with "" (delete l's)
str_replace_all("Hello world", pattern = "l", replacement = "")

## [1] "Heo word"

32 / 34

Functions to Remember for Week 11

String manipulations using base R and stringr.
Usage of regular expressions.
Pattern matching functions.
Practice!

33 / 34

Your turn! Class Worksheet

Working in small groups, complete the following worksheet:

Module 8-2 Worksheet

Once completed, feel free to work on your Assessments.

34 / 34
