class: center, middle, inverse, title-slide

.title[
# Module 1 Demonstration
]
.subtitle[
## Data Preprocessing: From Raw Data to Ready to Analyse
]

---
class: inverse, center, middle

<a data-flickr-embed="true" data-header="true" data-footer="true" href="https://www.flickr.com/photos/rmit/19424075296/in/photolist-KQNDhr-KN2LZy-KN2LNb-KN2LAC-KJ1vUK-KN2M1A-KJ1vwR-KN2LDo-KJ1v9B-KpG9n9-KJ1vZz-KJ1vra-KFw2vu-KpG62G-KJ1w8R-KFw5mu-p2PuqW-vArsRW-ReGy6h-KN7sgb-KJ5Ute-ee3jDm-KQTfX6-KJ5UQM-KQTfPa-KN7skE-KQTfvz-KQTfp2-KQTfoa-KQTeip-KQTeQB-JUke2Q-KQTf8R-KQTeZp-KQTeVr-KN7ppu-JUyXD6-KQTeH2-KN7rao-JUke7j-JUyXxV-JUyXpZ-KN7rqo-JUyXrc-JUyXuZ-KN7qWs-JUyVi4-KN7r65-KQTebF-JUyWuT" title="Midyear Orientation Business VE Welcome Day"><img src="https://farm1.staticflickr.com/544/19424075296_3bfe238d98_z.jpg" width="640" height="427" alt="Midyear Orientation Business VE Welcome Day"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

Image credits: RMIT University, https://flic.kr/p/vArsRW

---
class: inverse, center, middle

# Get Started

---

# Course Orientation

<!-- * This course is designated as a first year course in MC004, MC242. -->

* This course assumes you have a working knowledge of basic mathematics and familiarity with computers.

<!--* [**Course Information Pack**](https://rmit.instructure.com/courses/141018/pages/course-information-pack?module_item_id=6934413): Please read this document for orientation.-->

* [**Course Website**](https://data-wrangling-rmit.netlify.app/) contains all the learning content for students to work through in their own time and space.

---

# Course Orientation Cont.

* **Class time**:
    - Announcements and Questions (~ 5-30 mins)
    - Demonstration (~ 1 hr - 1.5 hrs)
    - In-class activities (~ 1 hr - 1.5 hrs, hands-on exercises)

<!-- - Supervised self-directed learning (~ 30 mins, work on module exercises and/or assignments) -->

<!-- * **Online sessions**: -->

<!-- - There will be two online practical sessions per week.
-->

* **Before Class**:
    - Watch the pre-recorded lectures.
    - Read through the module notes.
    - Work on Module 1 worksheet questions.

* **During Class**:
    - Actively engage in demonstrations, learning activities and supervised self-study.

* **After Class**:
    - Module-based assessments (online tests on Canvas).
    - DataCamp modules (for extra study).

---

# Course Orientation Cont.

<!-- * **Course Schedule** see Course Information Pack. -->

* **Flexible Learning**: Classes are recorded, allowing you to watch them at your convenience via Canvas in EchoCenter.

* **Teamwork** is encouraged (worksheet activities and group assignments). This closely mirrors how learning happens in the real-world workforce: through on-the-job interactions with peers and teammates.

---

# DataCamp Online Courses

<center>
<p><img src="../images/DataCamp.png" width = "20%"></p>
</center>

The Data Wrangling course is supported by the [DataCamp for Classroom](https://app.datacamp.com/learn) initiative.

- During this semester, you will have free access to DataCamp learning modules.

<!-- - I have selected specific modules that you will need to complete as a skill builder [Here](https://rmit.instructure.com/courses/141018/pages/r-and-datacamp-in-this-course?module_item_id=6934421). -->

- Note that you first need to sign up to [DataCamp](https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.datacamp.com%2Fgroups%2Fshared_links%2F4aaf62cab3a6e8b86c00923a82b58b7b04ea95975bbf5fbeacc1edbe5eb35d6b&data=05%7C02%7Csona.taheri%40rmit.edu.au%7C70d6a625606c4b173f5c08de5fa21b48%7Cd1323671cdbe4417b4d4bdb24b51316b%7C0%7C0%7C639053347027713952%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=DSfcfuHzI%2F63rsnOQyEgzrSRuu7q8%2FMWWrbz1xa8Ydg%3D&reserved=0).
- You will have 6 months of FREE access to the full DataCamp course curriculum (>250 hours).
- Access to premium courses (i.e., R, Python and SQL courses).
- You can participate in leaderboards and private discussion forums with your fellow classmates.
- You may also complete other online courses that you are interested in, as they will help you with your other studies.

---

# Assessment

* Course assessment comprises the following:
    - **Practical (project) assessments 1 & 2** (Weighting 35% and 45%)
    <!-- - **Mid-term assessment** (Weighting 35%) -->
    - **Module-based assessments 1-4** (Weighting each 5%)

<!-- - **Assessment Task1** (Weighting 20%): There will be 1 practical assessment and 2 module-based assessments (15% and 5%). -->

<!-- - **Assessment Task2** (Weighting 35%): There will be 1 formative assessment and 2 module-based assessments (30% and 5%). -->

<!-- - **Assessment Task3** (Weighting 45%): There will be 1 practical assessment and 1 formative assessment (30% and 15%). -->

* Self-study:
    - **Worksheet activities (not graded)**: Each module will be accompanied by worksheet activities.
    - **DataCamp assignments (not graded)**: Students can complete them as a skill builder or for extra study.

<!-- * Assignment 1 details are available. -->

---
class: inverse, center, middle

# Module 1 Basics: What is Data Preprocessing?

---

# What is Data Preprocessing?

- **Data Preprocessing** is the process, and the collection of operations, needed to prepare all forms of untidy data (incomplete, noisy and inconsistent data) for statistical analysis. We will define 5 major tasks for the data preprocessing framework, namely: **Get**, **Understand**, **Tidy & Manipulate**, **Scan** and **Transform**.

<br>
<br>
<center>
<p><img src="../images/Dataprepver_son.png" width = "110%"></p>
</center>

???

Most statistical theory concentrates on data modelling, prediction, and statistical inference, while it is usually assumed that data are in the correct state for the analysis. However, in practice, a data analyst spends most of their time (usually 50%-80% of an analyst's time) getting the data ready before doing any statistical operation.
Despite the amount of time it takes, there has been surprisingly little emphasis on how to preprocess data well. Real-world data are commonly incomplete, noisy and inconsistent, and often lack the correct labels and codes that are required for the analysis.

***Data Preprocessing, which is also commonly referred to as data wrangling, data manipulation, data cleaning, etc., is the process, and the collection of operations, needed to prepare all forms of untidy data (incomplete, noisy and inconsistent data) for statistical analysis***. We will define 5 major tasks for the data preprocessing framework, namely: **Get**, **Understand**, **Tidy & Manipulate**, **Scan** and **Transform**. In the following modules of this course, we will unwrap each of these preprocessing tasks by providing details of the operations related to that task.

---

# What will you learn in this course?

<center>
<p><img src="../images/Dataprepver_son.png" width = "110%"></p>
</center>

.pull-left[
* Module 1: Basics: What is DP?
* Module 2: Get
* Module 3: Understand
* Module 4: Tidy & Manipulate
* Module 5: Scan: Missing Values
* Module 6: Scan: Outliers
* Module 7: Transform
* Module 8: Special Operations
]

.pull-right[
* **Technology**: Open Source R/RStudio<sup>*</sup>
* **Practical experience:**
    + Class worksheets
    + Data challenges
    + Assignments
    + DataCamp
]

.footnote[* Including Base R functions, `readr`, `tidyr`, `dplyr`, `mlr`, `stringr`, `lubridate`, `RMarkdown` packages and many others.]

???

We will cover eight different modules in this course. In each module, we will unwrap these data preprocessing tasks by providing details of the operations related to that task.

By completion of these modules you should be able to: Apply data integration techniques to import and combine different sources of data. Critically reflect upon different data sources, types, formats and structures.
Apply different data manipulation techniques to recode, filter, select, split, aggregate, and reshape the data into a format suitable for statistical analysis. Justify data by detecting and handling missing values, outliers, inconsistencies and errors. By completing class worksheets, module exercises, DataCamp modules and assignments, you will demonstrate practical experience by having been exposed to real data problems. Effectively use leading open source software for reproducible, automated data preprocessing.

---

# R and RStudio Quick Overview

- R is a free *programming language and environment* for statistical computing - [https://www.r-project.org/](https://www.r-project.org/)

- Why learn R?
    + Recognised across industries
    + Promotes coding and computational skills
    + Provides access to the world’s largest and most comprehensive library of statistical functions
    + Powerful and grows with you
    + Works on all major operating systems
    + R and RStudio can be used in combination to create new functions and statistical programs, build dynamic and interactive reports, dashboards, websites, slideshows, statistical web applications and all for ... **FREE!**

- RStudio is a free *integrated development environment* for R and makes using R a lot easier and more efficient - [https://www.rstudio.com/](https://www.rstudio.com/).

- RStudio requires R to be installed.

???

Speaking of software, we will use R and RStudio in this course. You won't learn anything about Excel, SPSS, SQL, SAS, Python, Julia, or any other statistical package/programming language useful for data preprocessing. This isn't because I think that these tools are bad or redundant. They are not. In practice, most data analytics teams use a mixture of these tools and programming languages. I strongly believe that R is a great place to start your data analysis journey as it is a comprehensive language for data analysis.
You can use R effectively in almost every step of data analysis, from data collection to reporting. You can collect, preprocess, visualise and analyse your data using R functions, and report and publish your findings using RMarkdown.

---

# R and RStudio Quick Overview Cont.1

<center>  </center>

???

The RStudio interface consists of four main windows: the source window (or source editor), the console, the environment window, and the Files, Plots, Help and Viewer window.

---

# R and RStudio Quick Overview Cont.2

<center>  </center>

???

Source window: The Source window is the place where you can open or create an R script file, and add, edit, save and share your R code to reproduce your analysis. Code sitting in your script is not executed until you select it and run it by clicking the Run button or pressing CTRL+R on the keyboard.

---

# Installing and Loading Packages

- Packages are collections of related functions.
- The [Comprehensive R Archive Network](https://cran.r-project.org) (CRAN) lists over 10,000 available packages!
- Packages are the reason why R is so powerful.
- Packages need to be installed first.

``` r
# Install dplyr from CRAN, together with the packages it depends on
install.packages("dplyr", dependencies = TRUE)
```

- Include the `dependencies = TRUE` option as many packages require other packages to run. This option checks and installs dependent packages where required.

---

# Installing Packages Overview

<center>  </center>

---

# Installing and Loading Packages Cont.

- Once a package is installed, it needs to be loaded into an R session in order to make its functions available.

``` r
# Attach dplyr so that its functions are available in this session
library(dplyr)
```

- You will **need to reload packages each time you start a new R session**. Always start your scripts, notebooks or markdown files by loading all the packages you will need.

---

# Loading Packages Overview

<center>  </center>

---

# What do you need to know by Week 1

<!-- - Read through the [Course Information Pack](https://rmit.instructure.com/courses/141018/pages/course-information-pack?module_item_id=6934413).
-->

<!-- - How to access the course Canvas shell through myRMIT. -->

<!-- - How to access our [Course website](https://data-wrangling-rmit.netlify.app/). -->

<!-- - Learn how to [install R and RStudio](https://rmit.instructure.com/courses/141018/pages/course-resources-2?module_item_id=6934415.) -->

- Learn how to install and load R packages (see [Module 1 notes](https://data-wrangling-rmit.netlify.app/module_01#Introduction_to_R_and_RStudio_IDE)).
- Know how to get further help for the R statistical programming language (refer to [Module 1 notes](https://data-wrangling-rmit.netlify.app/module_01#Additional_Resources_and_Further_Help_in_R)).
- Don’t panic. R has a steep learning curve, but you will get heaps of practice in this course!

---

# Worksheet questions

<center><img src="../images/giphy.gif" width="300px" /></center>

- Complete the following worksheet: [Module 1 Worksheet](https://data-wrangling-rmit.netlify.app/worksheets/week_01_worksheet)

<!-- - Once completed, work on your Module Exercises. -->

<br>
<br>
<br>

[Return to Course Website](../index.html)
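
---

# Installing and Loading Packages: A Sketch

The two steps covered in this module (install a package once, then load it in every new session) can be combined at the top of a script. The block below is only a minimal sketch, not the course's prescribed method. `missing_packages()` is a hypothetical helper written for this illustration, and `stats` and `utils` are used only because they ship with R, so the example runs without an internet connection. It relies on base R's `requireNamespace()`, `install.packages()` and `library()`.

``` r
# Hypothetical helper (not part of the course materials or of any package):
# return the names of the requested packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumption for illustration: "stats" and "utils" come with every R
# installation, so nothing needs to be downloaded here.
needed <- c("stats", "utils")
to_install <- missing_packages(needed)

# Install only what is missing (this step needs an internet connection).
if (length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}

# Load each package for the current session.
for (pkg in needed) {
  library(pkg, character.only = TRUE)
}
```

In practice you would replace `needed` with the packages your analysis uses, for example `c("dplyr", "tidyr")`. Checking first avoids reinstalling packages every time the script runs.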