As you move forward in your master's program, you will learn the details of these specific transformations in different subjects.
We won't cover the technical details of these transformations.
Our focus will be on the most common and useful ones that can be easily implemented in R.
Below are the situations where we might need transformations:
To change the scale of a variable or standardise the values of a variable for better understanding.
To transform complex non-linear relationships into a linear one (i.e. to improve linearity).
To better satisfy the assumptions of normality and homogeneity of variance (i.e., to reduce skewness and/or heterogeneity of variances).
Have a look at three data examples with visualizations.
Specify the reason(s) for transforming data for each case.
##     name prev_salary increase
## 1   Mark      100000     5000
## 2    Sue       50000     3000
## 3 Jayden       80000     2000
## 4    Joe      120000     4000
df %>% mutate(percentage_increase = (increase/prev_salary*100))
##     name prev_salary increase percentage_increase
## 1   Mark      100000     5000            5.000000
## 2    Sue       50000     3000            6.000000
## 3 Jayden       80000     2000            2.500000
## 4    Joe      120000     4000            3.333333
To better understand who got the highest/lowest increase, we applied a simple transformation to the increase amount.
We calculated the percentage increase as percentage_increase = increase / prev_salary × 100.
Transformation | Power | R function |
---|---|---|
logarithm base 10 | NA | log10(y) |
logarithm base e | NA | log(y) |
reciprocal square | -2 | y^(-2) |
reciprocal | -1 | y^(-1) |
cube root | 1/3 | y^(1/3) |
square root | 1/2 | y^(1/2) or sqrt(y) |
square | 2 | y^2 |
cube | 3 | y^3 |
fourth power | 4 | y^4 |
Here are some recommendations on mathematical transformations:
To reduce right skewness in the distribution, taking roots, logarithms, or reciprocals works well.
To reduce left skewness, taking squares, cubes, or higher powers works well.
These are general recommendations and may not work for every data set.
The best strategy is to apply different transformations on the same data and select the one that works best.
The log transformation is commonly used for reducing right skewness. It cannot be applied to zero or negative values directly, but you can add a positive constant to all observations and then take the logarithm.
The square root transformation also reduces right skewness and has the advantage that it can be applied to zero values.
The reciprocal transformation is a very strong transformation with a drastic effect on the distribution shape; it compresses large values into smaller ones.
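These are judgment calls, so a quick visual comparison of several candidates helps. Below is a minimal sketch of that strategy; the rexp() data is just an artificial right-skewed example.

# Compare candidate transformations for a right-skewed variable
set.seed(1)
y <- rexp(500, rate = 0.5)   # artificially right-skewed data

par(mfrow = c(2, 2))
hist(y, main = "Original")
hist(sqrt(y), main = "Square root")
hist(log(y), main = "Logarithm")
hist(1/y, main = "Reciprocal")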
Let y denote the variable at the original scale and y′ the transformed variable. The Box-Cox transformation is defined as:
y′ = (y^λ − 1) / λ,  if λ ≠ 0
y′ = log(y),         if λ = 0
The forecast package will be used to apply the Box-Cox transformation and find the best λ parameter:
BoxCox_x <- BoxCox(x, lambda = "auto")
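A minimal sketch of this in use, assuming the forecast package is installed; the rexp() data below is again an artificial right-skewed example.

library(forecast)

set.seed(1)
x <- rexp(500, rate = 0.5)              # artificially right-skewed data
BoxCox.lambda(x)                        # estimate of the best lambda
BoxCox_x <- BoxCox(x, lambda = "auto")  # transform with the automatically chosen lambda

par(mfrow = c(1, 2))
hist(x, main = "Original")
hist(BoxCox_x, main = "Box-Cox transformed")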
We will use the Cars.csv data set from the data repository. This data set contains data on over 400 vehicles from 2003.
## # A tibble: 6 × 19
##   Vehicle_name         Sports Sport_utility Wagon Minivan Pickup All_wheel_drive
##   <chr>                 <dbl>         <dbl> <dbl>   <dbl>  <dbl>           <dbl>
## 1 Chevrolet Aveo 4dr        0             0     0       0      0               0
## 2 Chevrolet Aveo LS 4…      0             0     0       0      0               0
## 3 Chevrolet Cavalier …      0             0     0       0      0               0
## 4 Chevrolet Cavalier …      0             0     0       0      0               0
## 5 Chevrolet Cavalier …      0             0     0       0      0               0
## 6 Dodge Neon SE 4dr         0             0     0       0      0               0
## # ℹ 12 more variables: Rear_wheel_drive <dbl>, Retail_price <dbl>,
## #   Dealer_cost <dbl>, Engine_size <dbl>, Cylinders <dbl>, Kilowatts <dbl>,
## #   Economy_city <dbl>, Economy_highway <dbl>, Weight <dbl>, Wheel_base <dbl>,
## #   Length <dbl>, Width <dbl>
We will focus on two variables: Economy_highway, kilometres per litre for highway driving, and Weight, the weight of the car (kg). Here are univariate and bivariate visualisations.
The log() transformation is a great way to correct for skewness in this data.
par(mfrow = c(2, 2))
Cars$Weight %>% hist(main = "Weight")
log(Cars$Weight) %>% hist(main = "log(Weight)")
Cars$Economy_highway %>% hist(main = "Economy")
log(Cars$Economy_highway) %>% hist(main = "log(Economy)")
plot(log(Cars$Economy_highway), log(Cars$Weight))
There are different normalization techniques used in machine learning:
Centering (using mean)
Scaling (using standard deviation)
z-score transformation (i.e., centering and scaling using both mean and standard deviation)
Min-max (a.k.a. range or 0-1) transformation
Normalisation technique | Formula | R Function |
---|---|---|
Centering | y* = y − ȳ | scale(y, center = TRUE, scale = FALSE) |
Scaling (using RMS) | y* = y / RMS(y) | scale(y, center = FALSE, scale = TRUE) |
Scaling (using SD) | y* = y / sd(y) | scale(y, center = FALSE, scale = sd(y)) |
z-score transformation | z = (y − ȳ) / sd(y) | scale(y, center = TRUE, scale = TRUE) |
Min-max transformation | y* = (y − min(y)) / (max(y) − min(y)) | (y - min(y)) / (max(y) - min(y)) |
Use the Cars.csv data set:
Task 1: Check the distribution of the Economy_highway variable using boxplot().
Task 2: Apply the z-score transformation to the Economy_highway variable and check the distribution again. Did the shape of the distribution change?
Task 3: Check the distribution of the Weight variable using boxplot().
Task 4: Apply the min-max transformation to the Weight variable and check the distribution again. Did the shape of the distribution change?
# Task 1:
Cars <- read.csv("../data/Cars.csv")
boxplot(Cars$Economy_highway)
# Task 2:
z <- Cars$Economy_highway %>% scale(center = TRUE, scale = TRUE)
boxplot(z)

# To compare easily:
par(mfrow = c(1, 2))
boxplot(Cars$Economy_highway)
boxplot(z)

# Task 3:
boxplot(Cars$Weight)

# Task 4:
minmaxnormalise <- function(x){
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
min_max <- Cars$Weight %>% minmaxnormalise()
boxplot(min_max)

# To compare easily:
par(mfrow = c(1, 2))
boxplot(Cars$Weight)
boxplot(min_max)
Main difference between transformation and normalisation:
With transformation, both the scale of the variable and the shape of its distribution change; the variable is completely transformed.
With normalisation, the scale of the variable changes but the shape of its distribution does not; the variable is mapped to a different scale while its distributional properties are preserved.
This is why we apply transformations to change the distributional properties of a variable (i.e., to reduce skewness and improve normality and linearity).
Note that the name "transformation" is sometimes also used for the z-score and min-max transformations, but they are actually normalisation techniques.
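One quick way to see this difference, reusing the Cars data loaded in the practice above, is a sketch like the following:

par(mfrow = c(1, 3))
hist(Cars$Weight, main = "Original Weight")                  # right-skewed
hist(scale(Cars$Weight), main = "z-score: shape unchanged")  # normalisation
hist(log(Cars$Weight), main = "log: shape changed")          # transformation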
Sometimes we may need to discretise numeric values, as some analysis methods require discrete values as input or output variables.
Binning, or discretisation, methods transform numerical variables into categorical counterparts using strategies such as:
Binning strategy | Function (from infotheo) |
---|---|
Equal width (distance) binning | discretize(y, disc = "equalwidth") |
Equal depth (frequency) binning | discretize(y, disc = "equalfreq") |
In equal-width binning, the variable is divided into n intervals of equal size.
In equal-depth binning, the variable is divided into n intervals, each containing approximately the same number of observations (frequencies).
As mentioned in Module 6 Scan: Outliers, binning is also useful to deal with possible outliers.
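A minimal sketch of both strategies, assuming the infotheo package is installed and using the Cars data from earlier; nbins = 4 is an arbitrary illustrative choice.

library(infotheo)

w <- na.omit(Cars$Weight)   # drop missing values for the illustration

# Equal-width binning: intervals of equal size
width_bins <- discretize(w, disc = "equalwidth", nbins = 4)
table(width_bins)           # bin counts can be very unbalanced

# Equal-depth binning: roughly the same number of observations per bin
freq_bins <- discretize(w, disc = "equalfreq", nbins = 4)
table(freq_bins)            # bin counts are approximately equal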
Our focus will be on gaining brief information on each method, i.e. the differences between the methods and why and when they are used.
We won't cover the technical details or their implementation in R, as they will be taught in detail in the Machine Learning course. You may refer to the “Optional Reading” and “Additional Resources and Further Reading” sections to find out more on the topic.
There are different strategies to select features depending on the problem that you are dealing with.
Feature extraction is different from feature selection.
Both methods seek to reduce the number of attributes in the data set:
feature extraction methods do so by creating new combinations of attributes;
whereas feature selection methods include and exclude attributes present in the data without changing them.
The new extracted features are orthogonal, which means that they are uncorrelated.
The extracted components are ranked in order of their "explained variance". For example, the first principal component (PC1) explains the most variance in the data, PC2 explains the second-most variance, and so on.
Then you can decide to keep only as many principal components as needed to reach a cumulative explained variance of 90%.
This technique is fast and simple to implement, and works well in practice.
However, the new principal components are not interpretable, because they are linear combinations of the original features.
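Implementation is left to the Machine Learning course, but as a rough sketch of the explained-variance idea using base R's prcomp() (assuming the Cars data from earlier; the column selection is only illustrative):

# Principal component analysis on the numeric columns of Cars
num_vars <- na.omit(Cars[ , sapply(Cars, is.numeric)])
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Number of components needed to reach 90% cumulative explained variance
cum_var <- cumsum(prop_var)
which(cum_var >= 0.90)[1]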
Mathematical functions
BoxCox() from forecast
scale()
discretize() from infotheo
Data (dimension) reduction
Feature selection vs. Feature extraction
Feature filtering vs. ranking
Practice!