Module 4.2 - Worksheet

Exercises

Family Incidents data set

The following exercise is based on family incidents family_incidents.csv data set. This data set is taken from Crime Statistics Agency https://www.crimestatistics.vic.gov.au/family-violence-data-portal/download-data-tables containing 84 observation from Local government areas about family incident rate per 100,000 population by police region from 2012 to 2017.

The variables for family_incidents.csv data set consists of:

Police Region [Character]: Police regions (levels: North West Metro, Eastern, Southern Metro, Western and where Total2 is for unknown geographical location).

Local Government Area [Character]: 79 Local government areas of Victoria there are totals per region.

2012-2013, ..., 2016-2017: The year that data was collected.

From family_incidents data set, select only 2015-2016 and 2016-2017. Then group by police region and show a summary of mean and standard deviation of 2015-2016 and 2016-2017.

Influenza data set

The following exercises (exercise 2-4) are based on Influenza data set Influenza.xlsx taken from Department of Health, located at http://www9.health.gov.au/cda/source/pub_influ.cfm.

This data set has 335,544 observations of infection with influenza viruses around Australia between 2008 to 2016 containing variables:

Week Ending (Friday) [POSIXct]: Represents the date the data of the diagnosis with time zone.

State [Character]: The state or territory of residence of the notified case.

Age Group [Character]: Age category with 19 levels.

Sex [Character]: Male, Female, X, Unknown

Indigenous Status [Character]: Indigenous, non-Indigenous, not available, unknown

Type/Subtype [Character]: Type/Subtype of the influenza virus.

For more information please check the Data Caveats sheet of the data set.

Using influenza data set, use filter(), group_by() and summarize() to find the counts of Indigenous people, group them with Age group, Sex and State then use arrange() in ascending order. Don’t forget to use factor() with levels(), labels and ordered arguments where appropriate for Age Group, State variables.
The table down below is created from influenza data which is filtered out to show only Indigenous people. Create the same table using filter(), mutate(), group_by(), summarize() functions from dplyr package and spread() function from tidyr package, name it df1. The challenge is creating a new variable year using mutate(): You need to use substr() function from base package to select the first 4 digits of the Week Ending (Friday) column (for year). You can always go creative and find another way to create the same table as an exercise! (Hint: To select the first 4 digits of a string you can use substr(x, 1,4)).

year	NT	QLD	WA
2008	105	72	58
2009	1209	2365	632
2010	258	266	71
2011	439	395	175
2012	199	586	422
2013	271	201	123
2014	496	824	464
2015	274	885	311
2016	249	987	415
2017	1	1	NA

Repeat exercise 3, this time don’t use any filter, name it df2, then join df1 and df2 using suffix argument to differentiate the columns of df1.

SA & Victorian pet ownership data sets

The following exercises (exercise 5-9) are based on subsets of the pet ownership data which are Registrations_Master_Vic.csv, VIC_pet.csv, SA_pet.csv, pet1.csv, pet2.csv and pet3.csv. These data sets are taken from Kaggle, located at https://www.kaggle.com/puppygogo/sa-dog-ownership-sample/data. Variables are self explanatory.

Use bind_rows() and union() to bind vic_pet and sa_pet data sets. Compare these two data sets you binded with intersect() and use setdiff() to prove that two data sets has no difference.
Read in pet1, pet2 and pet3 data sets. First apply a left_join() to pet1 and pet2, then join this new data set with pet3 using left_join. Repeat the same action, this time use right_join(), name it pet_join. Explain shortly why the results are different. Then use setdiff() and/or anti_join() to find out the different records in the data sets.
Use a suitable join function to join pet2 and pet3 data sets, only keep the rows that exists in the both data sets.
Use a suitable join function to join pet2 and pet3 data sets, only keep the rows that exists in pet2 data set.

Module 4.2 - Worksheet

Tidy and Manipulate: Part II - Manipulate

Required Packages

Exercises

Family Incidents data set

Influenza data set

SA & Victorian pet ownership data sets

Finished?