5 Lab 5: Data import

5.1 Objectives

In this lab, you will learn to:

Import datasets in various formats into R, including:
- CSV files with read_csv()
- Excel files with read_excel()
- Google Sheets with read_sheet()
Deal with common data import issues including:
- column names
- column types
- missing values
Enter data in R with tibble() and tribble()

5.2 Overview

A brief overview of the steps for this lab:

Complete the steps in Getting Started below
Read chapters from R for Data Science (2e)
- Chapters 7 Data Import
- Chapter 20 Spreadsheets
Read the entire Assignment section before beginning, then complete the requested assignment
Submit your assignment

5.3 Getting Started

5.3.1 Claim your repository from GitHub Classroom

Your instructor has created a blank repository for you using GitHub Classroom.

Go to the course D2L page
Navigate to the module for this lab
Follow the link to claim your GitHub Classroom repository. After claiming your repository, copy its URL

5.3.2 Clone the repository to your machine

In RStudio, create a New Project
Select the Version Control option, then GitHub
Paste the URL you copied, and press tab to auto-complete the directory name
Select a parent directory in which to place your new project folder, probably either ~ or ~/R.
Click OK.

5.3.3 Prepare your working environment

Once RStudio has started, double-check your Project Options as described in Lab 2. (set the options to not save .RData or .Rhistory files, and not auto-load them on startup)
Make your first commit:
1. In the Git tab, check the box next to the *.Rproj file to stage it for adding them the repository
2. Click the “Commit” button
3. Enter a commit message such as “Create a new RStudio project”
4. Click Commit to submit the commit
5. Click Push to push the commit to GitHub
  1. If RStudio prompts you for credentials, follow the method you learned in Lab 2 (use your GitHub username and a Personal Access Token for the password)
Create a new R script. This is what you will use to develop the code for your assignment

5.4 Assignment

5.4.1 Read the Muskox CSV

5.4.1.1 Download the Muskox data

Go to Data.gov, a comprehensive online platform managed by the U.S. government, offering access to a vast array of datasets generated by federal agencies. It serves as a centralized hub for users to discover, access, and utilize open data for research, analysis, and innovation purposes.
Search for “muskox”
Open the resource titled “Northeast muskox population sex and age composition summary for all areas (26A+B and 26C (Arctic NWR) in Alaska and northern Yukon), 1972-2011”
Download the file 1972_2011_sex_age_composition_dwc.csv to the project folder for Lab 5

You will know it is saved in the correct folder when you can see it in the Files tab in RStudio (bottom right pane)

5.4.1.2 Read the Muskox CSV data into R

Open your R script
Load the tidyverse package, which will load the readr package for you
Use the read_csv() function to read the data file

IMPORTANT: do not use the similar base R function read.csv().
Have R treat values of -999 in the organismQuantity column as missing values (NAs)

5.4.2 Read the water quality data

5.4.2.1 Download the water quality data

Go to Data.gov
Search for “cyanobacteria”
Open the resource titled “1987-2018 cyanobacteria and water quality data for 20 reservoirs”
Download the file Means_of_reservoir_nutrients.xlsx to the project folder for Lab 5
Open the file to view it
1. note the names and contents of the three tabs, “Read_Me”, “Means_of_variables”, and “Spearman_correlations”
2. On the “Means_of_variables” sheet, look for missing values. Take note of the value used to denote them.

5.4.2.2 Read the water quality XLS data into R

Open your R script
Load the tidyverse package, which will load the readr package for you
Use the read_excel() function to read the “Means_of_variables” sheet
- note that this function is in the readxl package, so you will have to load that first
- note that the Means_of_variables sheet is not the first sheet, so you will have to use the sheets argument to read_excel()
Have R treat missing values appropriately.

5.4.3 Read Google Sheets data into R

See the crab data collected last week in Animal Behavior:

https://docs.google.com/spreadsheets/d/15fozF3WTnH-PIL3Qm6RIyOFM6yk0RlxO3lBPF1ndu9s/edit?usp=sharing

Import the ‘crabs’ sheet into R using read_sheet() from the googlesheets4 package.

Note: the default behavior of googlesheets4 is to attempt to obtain a token for authorization. In our case, this is unnecessary because the sheet in question is viewable by the public. To prevent googlesheets4 from asking for a token, run the following code before you attempt to read the sheet:

library(googlesheets4)
gs4_deauth()

Note: when you read the sheet, the first argument to read_sheet() is the URL to the sheet, or the sheet ID, which is the long alpha-numeric string in the middle of the URL, in this case 15fozF3WTnH-PIL3Qm6RIyOFM6yk0RlxO3lBPF1ndu9s . The sheet ID is shorter than the entire URL, so using it is preferable because it makes your code more readable.

5.4.4 Enter data directly in R

Sometimes it is helpful to enter data directly into R, for example when you have a small table and using a CSV, Excel, or Google Sheet is overkill. In these cases you can us the tibble() and tribble() functions from the tibble package, part of the tidyverse.

See Section 7.6 Data entry in R for Data Science (2e) for more details.

For this part of the assignment use the tribble() function to recreate the following table in R.

The table shows the three leading causes of death in the United States in 2021, according to the National Center for Health Statistics. The variables include:

cause. CHARACTER. the cause of death.
deaths. NUMERIC. the estimated number of deaths in 2021 in the US.
is_infectious. LOGICAL. a binary variable representing whether the cause of death is infectious or not; possible values include TRUE and FALSE. Note these values should be in all caps and not quoted (surrounded by quotation marks).

5.5 Lab Report Submission

When you have completed the assignment listed above, save your files, commit your changes, and push them to GitHub.

5.5.1 Create your lab report

Next, you will create a lab report:

Create a new Quarto Document by going to File > New File > Quarto Document…
In the dialog window, click the “Create Empty Document” button
Save the document as lab-report.qmd
Replace the default YAML header with the one described on under Lab Reports.
Make sure you are in the Visual Mode for the next steps. Click the “Visual” button on the left of the tool bar near the save button.
Add a code chunk after the YAML; in it, load the tidyverse, readxl, and googlesheets4 packages
Create three level 2 headings (two hash tags followed by a space and the heading name), like this:
```
## Muskox CSV

## Water quality XLSX

## Animal Behavior Google sheets

## Mortality causes
```
After each heading, create a code chunk and copy and paste your (already working) code from your R script to the report, one output per question. Each question should produce a table output.
Render your document

BIOL 275 Biostatistics at MSUM