Required Packages

The following packages will be required or may come in handy:

library(readr)
library(dplyr)
library(lubridate)
library(stringr)

Exercises

NYC Jobs Data

Exercises 1-6 are based on the NYC Jobs data set NYC_jobs.csv from https://data.cityofnewyork.us. You can find the data description at https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t. The variables are self-explanatory; however, you are expected to check the data type of each variable and apply suitable transformations where necessary. Here is a quick look at a few variables of the NYC Jobs data:

| Job ID | Agency | Posting Type | # Of Positions | Business Title | Civil Service Title | Title Code No | Level |
|---|---|---|---|---|---|---|---|
| 86699 | DEPT OF CITYWIDE ADMIN SVCS | External | 2 | Graphic Artist | GRAPHIC ARTIST | 91415 | 1 |
| 86699 | DEPT OF CITYWIDE ADMIN SVCS | External | 2 | Graphic Artist | GRAPHIC ARTIST | 91415 | 1 |
| 87990 | DEPARTMENT OF BUSINESS SERV. | Internal | 1 | Account Manager | CONTRACT REVIEWER (OFFICE OF L | 40563 | 1 |
| 97899 | DEPARTMENT OF BUSINESS SERV. | Internal | 1 | EXECUTIVE DIRECTOR, BUSINESS DEVELOPMENT | ADMINISTRATIVE BUSINESS PROMOT | 10009 | M3 |
| 102221 | DEPT OF ENVIRONMENT PROTECTION | External | 1 | Project Specialist | ENVIRONMENTAL ENGINEERING INTE | 20616 | 0 |
| 102221 | DEPT OF ENVIRONMENT PROTECTION | Internal | 1 | Project Specialist | ENVIRONMENTAL ENGINEERING INTE | 20616 | 0 |

  1. Check the structure of the variables Posting Date, Posting Updated and Process Date. If any of these variables are stored as character, convert them to date format. Hint: for Posting Date you will first need to use chartr() to replace the T with a blank.

  2. Paste Job ID and Title Code No together with the separator “-” and name the new column new_id. Do this with both paste() and str_c().

  3. Convert Agency, Business Title and Civil Service Title to lower case. If you are looking for a challenge, use the mutate_at() function.

  4. Trim the white space from both ends of the Job Description variable and find its string length.

  5. Pad the Job ID variable on the left with the character 1, using a width of 7.

  6. Extract the first character of the Salary Frequency variable.
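
The functions named in exercises 1-6 can be tried out on a small toy data frame before working with the full file. The sketch below is only an illustration, not a model answer: the Job ID, Agency, Business Title and Title Code No values come from the preview above, while the Job Description, Salary Frequency and Posting Date values (and all the new column names) are made up for this example and are assumptions about what the real columns look like.

```r
library(dplyr)
library(lubridate)
library(stringr)

# Toy stand-in for NYC_jobs.csv. The last three columns hold invented values.
jobs <- data.frame(
  "Job ID"           = c(86699L, 102221L),
  "Agency"           = c("DEPT OF CITYWIDE ADMIN SVCS", "DEPT OF ENVIRONMENT PROTECTION"),
  "Business Title"   = c("Graphic Artist", "Project Specialist"),
  "Title Code No"    = c("91415", "20616"),
  "Job Description"  = c("  Designs print material. ", "Supports project teams.  "),
  "Salary Frequency" = c("Annual", "Hourly"),
  "Posting Date"     = c("2011-06-24T00:00:00", "2012-09-07T00:00:00"),
  check.names = FALSE,        # keep the spaces in the column names
  stringsAsFactors = FALSE
)

jobs_clean <- jobs %>%
  mutate(
    # 1. Replace the "T" with a blank, parse as date-time, keep the date part
    posting_date   = as_date(ymd_hms(chartr("T", " ", `Posting Date`))),
    # 2. Same id built twice, once with base R and once with stringr
    new_id_paste   = paste(`Job ID`, `Title Code No`, sep = "-"),
    new_id_str_c   = str_c(`Job ID`, `Title Code No`, sep = "-"),
    # 4. Trim both ends, then measure the trimmed string
    desc_trimmed   = str_trim(`Job Description`, side = "both"),
    desc_length    = str_length(desc_trimmed),
    # 5. Left-pad with the character "1" up to a width of 7
    job_id_padded  = str_pad(as.character(`Job ID`), width = 7, side = "left", pad = "1"),
    # 6. First character only
    salary_initial = str_sub(`Salary Frequency`, 1, 1)
  ) %>%
  # 3. Lower-case several columns in one call
  mutate_at(vars(Agency, `Business Title`), str_to_lower)

str(jobs_clean)
```

Note that str_pad() only pads strings shorter than the requested width, so a six-digit Job ID gains one leading 1 and a five-digit one gains two.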

Accreditation Data

The following four exercises are based on the accredited universities data set Accreditation.csv from https://www.kaggle.com/ghalebdweikat/accredited-universities-in-the-usa. The variables are self-explanatory; however, you are expected to check the data type of each variable and apply suitable transformations where necessary. Here is a quick look at a few variables of the Accreditation data:

| Institution_ID | Institution_Name | Institution_Address | Institution_City | Institution_State | Institution_Zip | Institution_Phone |
|---|---|---|---|---|---|---|
| 100016 | Community College of the Air Force | 130 W Maxwell Blvd | Montgomery | AL | 36112-6613 | 334-953-6436 |
| 100016 | Community College of the Air Force | 130 W Maxwell Blvd | Montgomery | AL | 36112-6613 | 334-953-6436 |
| 100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |
| 100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |
| 100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |
| 100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |

  1. Check the structure of the variables. Convert Institution_ID to a string using a suitable function (do this part using mutate() and pipes).

  2. Remove the “-” from Institution_Zip and Institution_Phone using a suitable function from the stringr package (do this part using mutate() and pipes).

  3. Create a new column combining address information using a suitable function (Do this part using mutate and pipes).

  4. Bonus exercise: Use one of the data sets in this worksheet to either create a new variable or clean up an existing one, using one or more of the functions we learned this week. Post your answer on the discussion board.
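
The first three Accreditation exercises can be sketched on two rows taken from the preview above. This is one possible approach under stated assumptions rather than the expected answer: str_remove_all() and str_c() are just one suitable choice of functions, and the full_address column name and the ", " separator are hypothetical.

```r
library(dplyr)
library(stringr)

# Two rows copied from the preview of Accreditation.csv
accred <- data.frame(
  Institution_ID      = c(100016L, 100025L),
  Institution_Address = c("130 W Maxwell Blvd", "4900 Meridian St"),
  Institution_City    = c("Montgomery", "Normal"),
  Institution_State   = c("AL", "AL"),
  Institution_Zip     = c("36112-6613", "35762"),
  Institution_Phone   = c("334-953-6436", "256-372-5000"),
  stringsAsFactors = FALSE
)

accred_clean <- accred %>%
  mutate(
    # Store the id as a character string instead of a number
    Institution_ID    = as.character(Institution_ID),
    # Drop every "-" from the zip code and the phone number
    Institution_Zip   = str_remove_all(Institution_Zip, "-"),
    Institution_Phone = str_remove_all(Institution_Phone, "-"),
    # Combine the address parts; this uses the already cleaned zip code,
    # because mutate() evaluates its arguments in order
    full_address      = str_c(Institution_Address, Institution_City,
                              Institution_State, Institution_Zip, sep = ", ")
  )

str(accred_clean)
```

If the original zip code format should appear in the combined address, build full_address in an earlier mutate() call, before the “-” is removed.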

Finished?

If you have finished the above tasks, work through the weekly list of tasks posted on the Canvas announcement page.