Generating Synthetic Data

Setting a Seed

Computers cannot generate truly random data on their own (services such as Random.org offer true random data). Instead, they use a deterministic algorithm to generate pseudo-random numbers. The initial value used by this algorithm can be set in order to make the results reproducible. This is called setting the seed, and in R it is done with the set.seed() function. According to the help file on random number generation, ?RNG, if you do not set a seed, “a new one is created from the current time and the process ID when one is required. Hence different sessions will give different simulation results, by default.”

To demonstrate, we can set a seed and then generate 5 numbers from the normal distribution, with a mean of 0 and a standard deviation of 1:

set.seed(12345)
rnorm(n = 5, mean = 0, sd = 1)

## [1]  0.5855288  0.7094660 -0.1093033 -0.4534972  0.6058875

How do we know that these results are reproducible, based on the seed? We can generate them again, with the same seed, and output the same numbers:

set.seed(12345)
rnorm(n = 5, mean = 0, sd = 1)

## [1]  0.5855288  0.7094660 -0.1093033 -0.4534972  0.6058875

A different seed will result in different values being generated:

set.seed(42)
rnorm(n = 5, mean = 0, sd = 1)

## [1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683
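The same check can be made programmatically rather than by eye. The following is a minimal sketch; draw_with_seed is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: set a seed, then draw n standard-normal values.
draw_with_seed <- function(seed, n = 5) {
  set.seed(seed)
  rnorm(n = n, mean = 0, sd = 1)
}

same_a <- draw_with_seed(12345)
same_b <- draw_with_seed(12345)
other  <- draw_with_seed(42)

identical(same_a, same_b)  # TRUE: same seed, same values
identical(same_a, other)   # FALSE: different seed, different values
```

Comparing with identical() avoids relying on the rounded values that are printed to the console.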

Drawing from the Uniform Distribution

One of the simplest ways to generate synthetic data is to draw from the continuous uniform distribution. When drawing from this distribution, every value between the minimum and the maximum is equally likely to be drawn. In R, this is done with the runif function, which takes the arguments n (number of values to generate), min (lower limit of the distribution), and max (upper limit of the distribution). By default, the minimum and maximum values are 0 and 1.

# Generate 10 values from the continuous uniform distribution, using defaults 
set.seed(123) 
runif(10)

##  [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
##  [8] 0.8924190 0.5514350 0.4566147

If we want the distribution to cover a different range of values, we can change the min and max arguments.

set.seed(321)
runif(10, min = 10, max = 100)

##  [1] 96.03044 94.35570 31.43984 32.95663 45.14608 40.70619 50.71425 36.09395
##  [9] 50.56059 82.59361

We can check that every value between the minimum and maximum is equally likely to be selected by generating a large number of observations and plotting the results in a histogram. There will be some small deviations because of randomness, but each bin should contain approximately the same number of observations.

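# Load magrittr for the %>% and %<>% pipes, and dplyr for mutate, pull, 
# count and case_when (used in later examples)
library(magrittr)
library(dplyr)
set.seed(54321) 
runif(n = 10000) %>% 
  hist(main = "Histogram of Data Drawn from the Uniform Distribution", 
       xlab = "Random Value")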
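As a rough numerical companion to the histogram, the draws can be counted per bin. This is only a sketch: the bin width of 0.1 is an arbitrary choice made for illustration, and the names draws, bins and bin_counts are not from the original text.

```r
set.seed(54321)
draws <- runif(n = 10000)

# Split the 0-1 range into ten equal-width bins and count the draws in each
bins <- cut(draws, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
bin_counts <- table(bins)

bin_counts         # each bin should hold roughly 1000 of the 10000 draws
range(bin_counts)  # smallest and largest bin counts
```

With 10,000 draws and ten bins, each count is expected to be near 1,000, with small deviations due to randomness.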

Sampling a Categorical Variable (with Replacement)

Sometimes, instead of a numeric variable, we might wish to generate a categorical variable. In this case, instead of using the runif function, we can sample from a vector of possible values. First we create the vector, and then we sample from it. For example, suppose we are generating synthetic data relating to a car dealership and need to specify the colours of ten cars at the dealership. The cars could be black, white, grey, blue, green, or red. Because two cars could both be red, we need to sample with replacement, so we set the replace argument to TRUE.

# Specify the colours 
colours <- c("black", "white", "grey", "blue", "green", "red")

# Generate the sample 
set.seed(987)
car_colours <- sample(colours, size = 10, replace = TRUE) 
car_colours

##  [1] "black" "green" "black" "red"   "green" "blue"  "white" "black" "black"
## [10] "grey"

We can now create our data frame of cars at the dealership:

dealership_cars <- data.frame(car_num = 1:10, 
                              car_colour = car_colours)

dealership_cars

##    car_num car_colour
## 1        1      black
## 2        2      green
## 3        3      black
## 4        4        red
## 5        5      green
## 6        6       blue
## 7        7      white
## 8        8      black
## 9        9      black
## 10      10       grey

Alternatively, we could include the sample while creating the data frame:

set.seed(789)
dealership_cars <- data.frame(car_num = 1:10, 
                              car_colour = sample(colours, size = 10, replace = TRUE))

dealership_cars

##    car_num car_colour
## 1        1      green
## 2        2       blue
## 3        3       blue
## 4        4        red
## 5        5      white
## 6        6      white
## 7        7       grey
## 8        8      green
## 9        9       blue
## 10      10       grey

Or we can add this column with mutate:

set.seed(123)
dealership_cars <- data.frame(car_num = 1:10) 

dealership_cars %<>% mutate(car_colour = sample(colours, size = 10, replace = TRUE))

dealership_cars

##    car_num car_colour
## 1        1       grey
## 2        2        red
## 3        3       grey
## 4        4      white
## 5        5      white
## 6        6        red
## 7        7       grey
## 8        8      green
## 9        9       blue
## 10      10        red
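To see the effect of sampling with replacement, we can tabulate how often each colour was drawn. This is a small sketch using base R only; colour_counts is a name introduced here for illustration.

```r
colours <- c("black", "white", "grey", "blue", "green", "red")

set.seed(987)
car_colours <- sample(colours, size = 10, replace = TRUE)

# Count the draws per colour, keeping colours that were never drawn (count 0)
colour_counts <- table(factor(car_colours, levels = colours))
colour_counts

# Ten draws from six colours must contain at least one repeated colour
any(duplicated(car_colours))  # TRUE
```

Because ten cars are drawn from only six colours, at least one colour is always repeated, which is only possible when replace = TRUE.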

Sampling a Categorical Variable (without Replacement)

We saw how to sample with replacement, to reflect that duplicates sometimes exist. We can also sample without replacement, to generate a sample in which duplicates cannot exist. For example, a regular deck of playing cards contains four suits (Spades, Clubs, Hearts, Diamonds), each with 13 cards (Ace, 2-10, Jack, Queen, King). If I deal three cards from a regular deck of cards, I cannot deal myself the Ace of Spades twice. This type of sampling is sampling without replacement.

First let’s generate a deck of cards:

deck_of_cards <- data.frame(value = rep(c("Two", "Three", "Four", "Five", "Six", 
                                        "Seven", "Eight", "Nine", "Ten", "Jack", 
                                        "Queen", "King", "Ace"), 4), 
                            suit = rep(c("Spades", "Clubs", "Diamonds", "Hearts"), 13)) %>% 
  mutate(card = paste(value, "of", suit)) %>% 
  pull(card) 
# rep(..., 4) repeats the thirteen values four times, and rep(..., 13) cycles 
# through the four suits thirteen times. Because 13 and 4 share no common 
# factor, every value is paired with every suit exactly once. We then paste the 
# two together and extract just the card column as a vector. Don't worry if you 
# don't follow everything that this is doing; it's just to create our deck of cards.
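An alternative way to build the deck, shown here only as a sketch, is base R's expand.grid, which creates one row for every combination of value and suit. The names values, suits, deck_grid, deck_alt and deck_rep are introduced for this illustration.

```r
values <- c("Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
            "Nine", "Ten", "Jack", "Queen", "King", "Ace")
suits <- c("Spades", "Clubs", "Diamonds", "Hearts")

# One row per value/suit combination: 13 x 4 = 52 rows
deck_grid <- expand.grid(value = values, suit = suits, stringsAsFactors = FALSE)
deck_alt <- paste(deck_grid$value, "of", deck_grid$suit)

length(deck_alt)         # 52
anyDuplicated(deck_alt)  # 0, meaning no card appears twice

# The rep()-based construction produces the same set of cards, in a different order
deck_rep <- paste(rep(values, 4), "of", rep(suits, 13))
setequal(deck_alt, deck_rep)  # TRUE
```

The order of the cards differs between the two constructions, which does not matter because the deck is sampled at random afterwards.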

Now we can deal ourselves five cards:

set.seed(112358)
hand_of_cards <- sample(deck_of_cards, size = 5, replace = FALSE) 
hand_of_cards

## [1] "Ace of Clubs"   "King of Hearts" "Seven of Clubs" "Jack of Clubs" 
## [5] "Five of Spades"

We can also use this technique to put all the objects in a vector into a random order, by setting the size to the number of objects in the vector. For example, if we need to determine the order in which eight students are to present to the class, we can draw a sample of eight without replacement:

students <- c("Alice", "Bob", "Cameron", "Deborah", "Elizabeth", "Fred", "Gareth", "Hubert")

set.seed(123)
sample(students, size = 8, replace = FALSE)

## [1] "Gareth"    "Hubert"    "Cameron"   "Fred"      "Bob"       "Deborah"  
## [7] "Elizabeth" "Alice"
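Two properties of sampling without replacement can be checked directly: a full-size sample is a reordering of the original vector, and asking for more values than exist fails. The sketch below relies on the defaults of sample (size equal to the length of the vector, replace = FALSE); presentation_order and too_many are names introduced for illustration.

```r
students <- c("Alice", "Bob", "Cameron", "Deborah", "Elizabeth", "Fred", "Gareth", "Hubert")

set.seed(123)
# With no size given, sample() returns all elements in a random order
presentation_order <- sample(students)

setequal(presentation_order, students)  # TRUE: same students, new order
anyDuplicated(presentation_order)       # 0: nobody presents twice

# Requesting nine of eight students without replacement raises an error,
# which is caught here and returned as a message
too_many <- tryCatch(
  sample(students, size = 9, replace = FALSE),
  error = function(e) conditionMessage(e)
)
too_many
```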

Sampling a Categorical Variable with Weights

Sometimes particular categories are more prevalent than others. We could adapt the sampling approach above by repeating values in the vector in proportion to their prevalence, but this is inefficient. Fortunately, the sample function accepts weights through its prob argument, so that each value is drawn with a probability that reflects its prevalence.

Revisiting our earlier car dealership example, perhaps some car colours are more prevalent than other car colours:

# Specify the colours 
colours <- data.frame(colour = c("black", "white", "grey", "blue", "green", "red"), 
                      prevalence = c(0.4, 0.25, 0.15, 0.12, 0.03, 0.05))

set.seed(456)
dealership_cars <- data.frame(car_num = 1:100, 
                              car_colour = sample(colours$colour, 
                                                  size = 100, 
                                                  replace = TRUE, 
                                                  prob = colours$prevalence))

dealership_cars %>% count(car_colour)

##   car_colour  n
## 1      black 34
## 2       blue 11
## 3      green  4
## 4       grey 15
## 5        red  8
## 6      white 28
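With only 100 cars, the observed counts differ noticeably from the specified prevalence. As a sketch of how the weights behave in the long run, we can draw a much larger sample and compare proportions. The sample size of 100,000 and the names drawn, observed and comparison are assumptions made for this illustration.

```r
colours <- data.frame(colour = c("black", "white", "grey", "blue", "green", "red"),
                      prevalence = c(0.4, 0.25, 0.15, 0.12, 0.03, 0.05),
                      stringsAsFactors = FALSE)

set.seed(456)
drawn <- sample(colours$colour, size = 100000, replace = TRUE,
                prob = colours$prevalence)

# Observed proportion of each colour, in the same order as the colours data frame
observed <- prop.table(table(factor(drawn, levels = colours$colour)))

comparison <- data.frame(colour = colours$colour,
                         specified = colours$prevalence,
                         observed = as.numeric(observed))
comparison
```

The observed proportions should sit close to the specified prevalence, and the gap shrinks as the sample size grows.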

Drawing from the Normal Distribution

R isn’t limited to generating data from the uniform distribution; many other distributions are available, including the normal distribution. Here we draw 10,000 values from a normal distribution with a mean of 200 and a standard deviation of 30:

set.seed(789)
normal <- rnorm(n = 10000, mean = 200, sd = 30)
hist(normal, breaks = 50)
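Alongside the histogram, a few summary statistics give a quick check that the draws look as intended. This is a rough sketch; within_one_sd is a name introduced for illustration.

```r
set.seed(789)
normal <- rnorm(n = 10000, mean = 200, sd = 30)

mean(normal)  # should be close to 200
sd(normal)    # should be close to 30

# Share of values within one standard deviation of the mean;
# for a normal distribution this is roughly 0.68
within_one_sd <- mean(abs(normal - 200) <= 30)
within_one_sd
```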

Correlated Random Variables

We can use the normal distribution to generate correlated random data. This is data where there is a relationship between two variables, plus some random variability. For example, we could create data for a hypothetical scenario in which a university surveyed students about time spent studying and the marks they received. The hours spent studying were found to be distributed uniformly, and marks were positively correlated with study time. First, let’s create a data frame containing the hours each student spent studying:

set.seed(2468)

students <- data.frame(student_id = 1:5000, 
                       hours = runif(n = 5000, min = 24, max = 240))

The study found that a student’s marks were expected to increase with time spent studying, according to the following relationship:

$$\text{marks} = 45 + 0.2 \times \text{hours} + \text{error}$$

A student’s marks don’t perfectly relate to the hours studied, however; there is some random variability from student to student, captured by the error term. This error term was normally distributed with a mean of 0 and a standard deviation of 2 marks.

set.seed(250)
students %<>% 
  mutate(error = rnorm(n = 5000, mean = 0, sd = 2))

Now that the data frame contains the hours each student spent studying, and the normally distributed error term, it is a simple matter to create the synthetic data regarding the marks each student received.

students %<>% 
  mutate(marks = (hours * 0.2) + 45 + error)

How does this correlated synthetic data appear?

plot(x = students$hours, 
     y = students$marks, 
     main = "Hours Spent Studying and Marks", 
     xlab = "Hours Spent Studying", 
     ylab = "Marks")
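Beyond the scatter plot, we can check that the synthetic data carries the relationship we built into it. The sketch below regenerates the data with base R (rather than mutate) and uses cor and lm, neither of which appears in the original text, to recover the intercept of 45 and slope of 0.2.

```r
set.seed(2468)
hours <- runif(n = 5000, min = 24, max = 240)

set.seed(250)
error <- rnorm(n = 5000, mean = 0, sd = 2)

marks <- (hours * 0.2) + 45 + error

# Strength of the linear relationship between hours and marks
hours_marks_cor <- cor(hours, marks)
hours_marks_cor

# A simple linear regression should recover values near 45 and 0.2
fit <- lm(marks ~ hours)
coef(fit)
```

The estimates will not match 45 and 0.2 exactly because of the random error term, but they should be close with 5,000 observations.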

Creating Missing Values

When using the case_when function, any observation that does not meet any of the criteria becomes missing. We can use this to our advantage to create synthetic data with missing values.

Let’s use our previous hypothetical scenario and assume that some students didn’t report their hours spent studying, although their marks were known by the university. We begin by allocating a random value between 0 and 1 to each record:

set.seed(98765)
students %<>% 
  mutate(rand = runif(5000, min = 0, max = 1))

Now we choose to delete approximately 2% of the hours from the dataset, selected at random. We can do this by keeping the hours for records whose value in this new rand column is greater than or equal to 0.02 (and converting to missing the hours for all records where the rand value is less than 0.02). Because the case_when function will convert to missing any values that don’t meet the criteria, we only need to specify the criteria for values to be kept:

students %<>% 
  mutate(hours = case_when(rand >= 0.02 ~ hours))
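To confirm how many values were removed, we can count the missing entries. The sketch below is self-contained and swaps in base R's ifelse for case_when so that it runs without dplyr; the logic (keep hours where rand is at least 0.02, otherwise NA) is the same.

```r
set.seed(2468)
students <- data.frame(student_id = 1:5000,
                       hours = runif(n = 5000, min = 24, max = 240))

set.seed(98765)
students$rand <- runif(5000, min = 0, max = 1)

# Keep hours where rand >= 0.02; everything else becomes missing
students$hours <- ifelse(students$rand >= 0.02, students$hours, NA_real_)

sum(is.na(students$hours))   # number of missing values
mean(is.na(students$hours))  # proportion missing, roughly 0.02
```

The proportion will rarely be exactly 2%, because each record is removed independently with probability 0.02.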

Although we set 2% of observations to be missing hours, other proportions could be chosen and applied in a similar manner.

Note - if deleting data from multiple columns, you will need to use a different rand column for each, each generated with a different seed (or all generated together within a single mutate), so that the same observations don’t have all of their data deleted.
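The reason for using separate random columns can be seen in a short sketch. Here two hypothetical columns, hours and marks, each lose about 2% of their values; the seed and the names rand_hours, rand_marks, overlap_independent and overlap_shared are assumptions made for this illustration.

```r
set.seed(13579)
n_students <- 5000

# Two independent random columns, generated together after a single seed
rand_hours <- runif(n_students)
rand_marks <- runif(n_students)

# Independent columns: few records lose both hours and marks
overlap_independent <- sum(rand_hours < 0.02 & rand_marks < 0.02)

# Shared column: every record that loses hours also loses marks
overlap_shared <- sum(rand_hours < 0.02 & rand_hours < 0.02)

overlap_independent
overlap_shared
```

With independent columns, only about 0.02 x 0.02 of the records (around 2 in 5,000) are expected to lose both values, whereas reusing one column removes both values from the same records every time.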

Generating Variables with Outliers

A large dataset generated from the normal distribution is likely to contain some outliers. However, heavier-tailed distributions, such as Gosset’s t-distribution (also known as Student’s t-distribution), will generate a larger number of outliers. Here, we shall generate 200 observations from the t-distribution with three degrees of freedom:

set.seed(1000)
t_example <- rt(n = 200, df = 3) 
boxplot(t_example)

We will learn more about outliers later; all you need to know for now is that the circles on the box plot represent outliers.
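To compare the two distributions numerically, we can count the points that a box plot would flag. The sketch below uses boxplot.stats, which returns the flagged points in its out element, and a larger sample of 10,000 (an assumption made here so that the comparison is stable) rather than the 200 used above.

```r
set.seed(1000)
t_draws      <- rt(n = 10000, df = 3)
normal_draws <- rnorm(n = 10000, mean = 0, sd = 1)

# Points beyond the box plot whiskers in each sample
t_outliers      <- length(boxplot.stats(t_draws)$out)
normal_outliers <- length(boxplot.stats(normal_draws)$out)

t_outliers       # considerably larger ...
normal_outliers  # ... than the count for the normal draws
```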

Drawing from the Beta Distribution

Not all data is uniform or normal; some is skewed, such as salaries or house prices. The beta distribution can be quite useful for generating skewed data. It can also be used to generate other shapes, such as triangular or mound-shaped distributions. The choice of shape parameters determines the shape of the distribution. Values drawn from the beta distribution always lie between 0 and 1, so it may be necessary to multiply them by a constant, and add an offset, to get the desired range of values.

For example, here we generate some skewed data to represent salary.

set.seed(100)
salary <- rbeta(1000, shape1 = 2, shape2 = 15) * 200000 + 20000
hist(salary, breaks = 50)
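The effect of the rescaling can be checked numerically. In this sketch, the raw beta draws lie between 0 and 1, so multiplying by 200000 and adding 20000 places every salary between 20,000 and 220,000. The mean of a beta distribution is shape1 / (shape1 + shape2), which gives an expected salary of 20000 + 200000 * 2 / 17. The names raw_beta and expected_mean are introduced for illustration.

```r
set.seed(100)
raw_beta <- rbeta(1000, shape1 = 2, shape2 = 15)  # values between 0 and 1
salary <- raw_beta * 200000 + 20000               # rescaled to 20,000-220,000

range(salary)  # stays inside the 20,000 to 220,000 range

# Theoretical mean after rescaling
expected_mean <- 20000 + 200000 * 2 / (2 + 15)
expected_mean
mean(salary)   # should be close to expected_mean

# For right-skewed data the mean sits above the median
mean(salary) > median(salary)
```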

These shape parameters result in a more mound-shaped distribution:

set.seed(100)
beta_2 <- rbeta(1000, shape1 = 2, shape2 = 2) 
hist(beta_2, breaks = 20)

Experiment with some other shape parameter values to see the effect:

set.seed(100)
beta_3 <- rbeta(1000, shape1 = 15, shape2 = 1) 
hist(beta_3, breaks = 50)

set.seed(100)
beta_4 <- rbeta(1000, shape1 = 0.5, shape2 = 0.5) 
hist(beta_4, breaks = 50)

Other Distributions

The statistical roots of R are demonstrated in the plethora of statistical distributions available. More information is available through the help file, ?Distributions.
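The distribution functions in base R follow a common naming pattern: a prefix of r for random generation, d for density, p for cumulative probability, and q for quantiles, followed by a short name for the distribution. The sketch below shows a few examples; the seed and parameter values are arbitrary choices made for illustration.

```r
set.seed(2024)

# Random generation from three further distributions
binom_draws <- rbinom(n = 5, size = 10, prob = 0.5)  # binomial: successes in 10 trials
pois_draws  <- rpois(n = 5, lambda = 3)              # Poisson: counts with mean 3
exp_draws   <- rexp(n = 5, rate = 2)                 # exponential: waiting times

binom_draws
pois_draws
exp_draws

# The d, p and q functions for the normal distribution
dnorm(0)                    # density at 0
cum_prob <- pnorm(1.96)     # approximately 0.975
quantile_975 <- qnorm(0.975)  # approximately 1.96
cum_prob
quantile_975
```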