The following packages will be required or may come in handy:
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
Exercises 1-6 are based on the NYC Jobs data set NYC_jobs.csv from https://data.cityofnewyork.us. You can find the data description at https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t. The variables are self-explanatory; however, you are expected to check the type of each variable and apply suitable transformations where necessary. Here is a quick look at a few variables of the NYC Jobs data:
Job ID | Agency | Posting Type | # Of Positions | Business Title | Civil Service Title | Title Code No | Level |
---|---|---|---|---|---|---|---|
86699 | DEPT OF CITYWIDE ADMIN SVCS | External | 2 | Graphic Artist | GRAPHIC ARTIST | 91415 | 1 |
86699 | DEPT OF CITYWIDE ADMIN SVCS | External | 2 | Graphic Artist | GRAPHIC ARTIST | 91415 | 1 |
87990 | DEPARTMENT OF BUSINESS SERV. | Internal | 1 | Account Manager | CONTRACT REVIEWER (OFFICE OF L | 40563 | 1 |
97899 | DEPARTMENT OF BUSINESS SERV. | Internal | 1 | EXECUTIVE DIRECTOR, BUSINESS DEVELOPMENT | ADMINISTRATIVE BUSINESS PROMOT | 10009 | M3 |
102221 | DEPT OF ENVIRONMENT PROTECTION | External | 1 | Project Specialist | ENVIRONMENTAL ENGINEERING INTE | 20616 | 0 |
102221 | DEPT OF ENVIRONMENT PROTECTION | Internal | 1 | Project Specialist | ENVIRONMENTAL ENGINEERING INTE | 20616 | 0 |
Check the structure of the variables `Posting Date`, `Posting Updated` and `Process Date`. If any of the variables are in character format, convert them to date format. Hint: for `Posting Date` you will first need to use `chartr()` to replace "T" with a blank space.
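As a rough illustration of the hint, the sketch below works on two made-up strings rather than the real column. The raw format shown (`YYYY-MM-DDTHH:MM:SS`) is an assumption about how `Posting Date` looks in NYC_jobs.csv, so check your own data with `str()` first.

```r
library(lubridate)

# Toy values; the exact raw format in NYC_jobs.csv is an assumption.
posting_date <- c("2011-06-24T00:00:00", "2012-01-26T00:00:00")

# chartr() translates single characters: every "T" becomes a space.
posting_clean <- chartr("T", " ", posting_date)

# ymd_hms() parses the cleaned strings into date-times (UTC by default).
posting_parsed <- ymd_hms(posting_clean)

# as_date() drops the time part when only the calendar date is needed.
posting_day <- as_date(posting_parsed)
str(posting_day)
```

Note that `chartr()` replaces every "T" in the string, which is harmless here only because the toy values contain no other letters.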
Paste `Job ID` and `Title Code No` together with the separator “-” and name the new column `new_id`. Use both `paste()` and `str_c()`.
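A minimal sketch of both approaches on two rows copied from the preview table. The data frame `jobs` and the second column name `new_id_strc` are hypothetical names used only for this illustration.

```r
library(dplyr)
library(stringr)

# Toy rows taken from the preview table; `jobs` is a hypothetical name.
jobs <- tibble(
  `Job ID` = c(86699L, 87990L),
  `Title Code No` = c("91415", "40563")
)

jobs <- jobs %>%
  mutate(
    # Base R: paste() with an explicit separator.
    new_id = paste(`Job ID`, `Title Code No`, sep = "-"),
    # stringr: str_c() takes the same sep argument.
    new_id_strc = str_c(`Job ID`, `Title Code No`, sep = "-")
  )

jobs
```

One difference worth knowing: `paste()` turns a missing value into the text "NA", whereas `str_c()` returns `NA` for the whole result.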
Convert `Agency`, `Business Title` and `Civil Service Title` to lower case. If you are looking for a challenge, use the `mutate_at()` function.
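One possible shape for the `mutate_at()` challenge, sketched on toy rows from the preview table. The names `jobs`, `jobs_lower` and `jobs_lower2` are hypothetical. The second version uses `across()`, which current dplyr releases recommend in place of `mutate_at()`.

```r
library(dplyr)
library(stringr)

# Toy rows taken from the preview table; `jobs` is a hypothetical name.
jobs <- tibble(
  Agency = c("DEPT OF CITYWIDE ADMIN SVCS", "DEPARTMENT OF BUSINESS SERV."),
  `Business Title` = c("Graphic Artist", "Account Manager"),
  `Civil Service Title` = c("GRAPHIC ARTIST", "CONTRACT REVIEWER (OFFICE OF L")
)

# mutate_at(): apply one function to the columns listed in vars().
jobs_lower <- jobs %>%
  mutate_at(vars(Agency, `Business Title`, `Civil Service Title`), str_to_lower)

# Equivalent form with across().
jobs_lower2 <- jobs %>%
  mutate(across(c(Agency, `Business Title`, `Civil Service Title`), str_to_lower))

jobs_lower
```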
Trim white space from both ends of the `Job Description` variable and find its string length.
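The two steps can be sketched on made-up strings as follows. The values are invented for illustration and are not taken from the data set.

```r
library(stringr)

# Invented values with stray white space at both ends.
job_description <- c("  Design graphics for city agencies. ", "Manage accounts  ")

# str_trim() removes leading and trailing white space (side = "both" is the default).
trimmed <- str_trim(job_description, side = "both")

# str_length() counts the characters in each trimmed string.
desc_length <- str_length(trimmed)
desc_length
```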
Pad the `Job ID` variable with a leading `1`, using a width of 7.
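A small sketch of the padding step using two IDs from the preview table. Converting to character first is a choice made for this illustration, so that the padding clearly operates on text.

```r
library(stringr)

# Two IDs from the preview table, stored as text for padding.
job_id <- as.character(c(86699L, 102221L))

# Pad on the left with "1" until each string is 7 characters wide.
job_id_padded <- str_pad(job_id, width = 7, side = "left", pad = "1")
job_id_padded
```

Shorter IDs receive more than one pad character, because `str_pad()` fills up to the requested width.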
Extract the first character of `Salary Frequency`.
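A minimal sketch with `str_sub()`. The three frequency labels are assumed example values, so confirm the actual levels in your data.

```r
library(stringr)

# Assumed example values for the Salary Frequency column.
salary_frequency <- c("Annual", "Hourly", "Daily")

# str_sub() extracts characters from position 1 to position 1.
first_char <- str_sub(salary_frequency, start = 1, end = 1)
first_char
```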
Exercises 7-10 are based on the accredited universities data set Accreditation.csv from https://www.kaggle.com/ghalebdweikat/accredited-universities-in-the-usa. The variables are self-explanatory; however, you are expected to check the type of each variable and apply suitable transformations where necessary. Here is a quick look at a few variables of the Accreditation data:
Institution_ID | Institution_Name | Institution_Address | Institution_City | Institution_State | Institution_Zip | Institution_Phone |
---|---|---|---|---|---|---|
100016 | Community College of the Air Force | 130 W Maxwell Blvd | Montgomery | AL | 36112-6613 | 334-953-6436 |
100016 | Community College of the Air Force | 130 W Maxwell Blvd | Montgomery | AL | 36112-6613 | 334-953-6436 |
100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |
100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |
100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |
100025 | Alabama A & M University | 4900 Meridian St | Normal | AL | 35762 | 256-372-5000 |
Check the structure of the variables. Convert `Institution_ID` to a string using a suitable function (do this part using `mutate()` and pipes).
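As one possible approach, `as.character()` can be applied inside `mutate()`. The data frame name `accreditation` is hypothetical, and the rows are copied from the preview table.

```r
library(dplyr)

# Toy rows from the preview table; `accreditation` is a hypothetical name.
accreditation <- tibble(
  Institution_ID = c(100016L, 100025L),
  Institution_Name = c("Community College of the Air Force", "Alabama A & M University")
)

# Inspect the types, then convert the ID column to character.
str(accreditation)

accreditation <- accreditation %>%
  mutate(Institution_ID = as.character(Institution_ID))

str(accreditation)
```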
Remove the “-” from `Institution_Zip` and `Institution_Phone` using a suitable function from the `stringr` package (do this part using `mutate()` and pipes).
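A sketch using `str_replace_all()` on rows from the preview table. `str_remove_all()` would work equally well, and `accreditation` is again a hypothetical name.

```r
library(dplyr)
library(stringr)

# Toy rows from the preview table; `accreditation` is a hypothetical name.
accreditation <- tibble(
  Institution_Zip = c("36112-6613", "35762"),
  Institution_Phone = c("334-953-6436", "256-372-5000")
)

accreditation <- accreditation %>%
  mutate(
    # fixed("-") matches a literal hyphen; every occurrence is replaced by "".
    Institution_Zip = str_replace_all(Institution_Zip, fixed("-"), ""),
    Institution_Phone = str_replace_all(Institution_Phone, fixed("-"), "")
  )

accreditation
```

`str_replace()` without `_all` would remove only the first hyphen, which is not enough for the phone numbers.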
Create a new column that combines the address information using a suitable function (do this part using `mutate()` and pipes).
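One way this could look with `str_c()`. The new column name `Full_Address`, the data frame name `accreditation` and the ", " separator are all choices made for this illustration.

```r
library(dplyr)
library(stringr)

# Toy row from the preview table; `accreditation` is a hypothetical name.
accreditation <- tibble(
  Institution_Address = "130 W Maxwell Blvd",
  Institution_City = "Montgomery",
  Institution_State = "AL",
  Institution_Zip = "36112-6613"
)

# Full_Address is a hypothetical column name; ", " is an arbitrary separator.
accreditation <- accreditation %>%
  mutate(Full_Address = str_c(Institution_Address, Institution_City,
                              Institution_State, Institution_Zip, sep = ", "))

accreditation$Full_Address
```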
Bonus exercise: Use one of the data sets in the worksheet to either create a new variable or clean up an existing variable using one (or more) of the functions we learned this week. Post your answer on the discussion board.
If you have finished the above tasks, work through the weekly list of tasks posted on the Canvas announcement page.