5 Lab 5: Data import

5.1 Objectives

In this lab, you will learn to:

  • Import datasets in various formats into R, including:
  • Deal with common data import issues including:
    • column names
    • column types
    • missing values
  • Enter data in R with tibble() and tribble()

5.2 Overview

A brief overview of the steps for this lab:

  1. Complete the steps in Getting Started below
  2. Read chapters from R for Data Science (2e)
  3. Read the entire Assignment section before beginning, then complete the requested assignment
  4. Submit your assignment

5.3 Getting Started

5.3.1 Claim your repository from GitHub Classroom

Your instructor has created a blank repository for you using GitHub Classroom.

  1. Go to the course D2L page

  2. Navigate to the module for this lab

  3. Follow the link to claim your GitHub Classroom repository. After claiming your repository, copy its URL

5.3.2 Clone the repository to your machine

  1. In RStudio, create a New Project

  2. Select the Version Control option, then GitHub

  3. Paste the URL you copied, and press tab to auto-complete the directory name

  4. Select a parent directory in which to place your new project folder, probably either ~ or ~/R.

  5. Click OK.

5.3.3 Prepare your working environment

  1. Once RStudio has started, double-check your Project Options as described in Lab 2. (set the options to not save .RData or .Rhistory files, and not auto-load them on startup)

  2. Make your first commit:

    1. In the Git tab, check the box next to the *.Rproj file to stage it for adding them the repository

    2. Click the “Commit” button

    3. Enter a commit message such as “Create a new RStudio project”

    4. Click Commit to submit the commit

    5. Click Push to push the commit to GitHub

      1. If RStudio prompts you for credentials, follow the method you learned in Lab 2 (use your GitHub username and a Personal Access Token for the password)
  3. Create a new R script. This is what you will use to develop the code for your assignment

5.4 Assignment

5.4.1 Read the Muskox CSV

5.4.1.1 Download the Muskox data

  1. Go to Data.gov, a comprehensive online platform managed by the U.S. government, offering access to a vast array of datasets generated by federal agencies. It serves as a centralized hub for users to discover, access, and utilize open data for research, analysis, and innovation purposes.

  2. Search for “muskox”

  3. Open the resource titled “Northeast muskox population sex and age composition summary for all areas (26A+B and 26C (Arctic NWR) in Alaska and northern Yukon), 1972-2011”

  4. Download the file 1972_2011_sex_age_composition_dwc.csv to the project folder for Lab 5

    You will know it is saved in the correct folder when you can see it in the Files tab in RStudio (bottom right pane)

5.4.1.2 Read the Muskox CSV data into R

  1. Open your R script

  2. Load the tidyverse package, which will load the readr package for you

  3. Use the read_csv() function to read the data file

    IMPORTANT: do not use the similar base R function read.csv().

  4. Have R treat values of -999 in the organismQuantity column as missing values (NAs)

5.4.2 Read the water quality data

5.4.2.1 Download the water quality data

  1. Go to Data.gov
  2. Search for “cyanobacteria”
  3. Open the resource titled “1987-2018 cyanobacteria and water quality data for 20 reservoirs”
  4. Download the file Means_of_reservoir_nutrients.xlsx to the project folder for Lab 5
  5. Open the file to view it
    1. note the names and contents of the three tabs, “Read_Me”, “Means_of_variables”, and “Spearman_correlations”
    2. On the “Means_of_variables” sheet, look for missing values. Take note of the value used to denote them.

5.4.2.2 Read the water quality XLS data into R

  1. Open your R script
  2. Load the tidyverse package, which will load the readr package for you
  3. Use the read_excel() function to read the “Means_of_variables” sheet
    • note that this function is in the readxl package, so you will have to load that first
    • note that the Means_of_variables sheet is not the first sheet, so you will have to use the sheets argument to read_excel()
  4. Have R treat missing values appropriately.

5.4.3 Read Google Sheets data into R

See the crab data collected last week in Animal Behavior:

https://docs.google.com/spreadsheets/d/15fozF3WTnH-PIL3Qm6RIyOFM6yk0RlxO3lBPF1ndu9s/edit?usp=sharing

Import the ‘crabs’ sheet into R using read_sheet() from the googlesheets4 package.

Note: the default behavior of googlesheets4 is to attempt to obtain a token for authorization. In our case, this is unnecessary because the sheet in question is viewable by the public. To prevent googlesheets4 from asking for a token, run the following code before you attempt to read the sheet:

Note: when you read the sheet, the first argument to read_sheet() is the URL to the sheet, or the sheet ID, which is the long alpha-numeric string in the middle of the URL, in this case 15fozF3WTnH-PIL3Qm6RIyOFM6yk0RlxO3lBPF1ndu9s . The sheet ID is shorter than the entire URL, so using it is preferable because it makes your code more readable.

5.4.4 Enter data directly in R

Sometimes it is helpful to enter data directly into R, for example when you have a small table and using a CSV, Excel, or Google Sheet is overkill. In these cases you can us the tibble() and tribble() functions from the tibble package, part of the tidyverse.

See Section 7.6 Data entry in R for Data Science (2e) for more details.

For this part of the assignment use the tribble() function to recreate the following table in R.

The table shows the three leading causes of death in the United States in 2021, according to the National Center for Health Statistics. The variables include:

  • cause. CHARACTER. the cause of death.

  • deaths. NUMERIC. the estimated number of deaths in 2021 in the US.

  • is_infectious. LOGICAL. a binary variable representing whether the cause of death is infectious or not; possible values include TRUE and FALSE. Note these values should be in all caps and not quoted (surrounded by quotation marks).

5.5 Lab Report Submission

When you have completed the assignment listed above, save your files, commit your changes, and push them to GitHub.

5.5.1 Create your lab report

Next, you will create a lab report:

  1. Create a new Quarto Document by going to File > New File > Quarto Document…

  2. In the dialog window, click the “Create Empty Document” button

  3. Save the document as lab-report.qmd

  4. Replace the default YAML header with the one described on under Lab Reports.

  5. Make sure you are in the Visual Mode for the next steps. Click the “Visual” button on the left of the tool bar near the save button.

  6. Add a code chunk after the YAML; in it, load the tidyverse, readxl, and googlesheets4 packages

  7. Create three level 2 headings (two hash tags followed by a space and the heading name), like this:

    ## Muskox CSV
    
    ## Water quality XLSX
    
    ## Animal Behavior Google sheets
    
    ## Mortality causes
  8. After each heading, create a code chunk and copy and paste your (already working) code from your R script to the report, one output per question. Each question should produce a table output.

  9. Render your document

5.5.2 Submit your lab report

Finally, submit your lab report to D2L for grading.

  1. Again, save your files, commit your changes, and push them to GitHub.

  2. Copy the URL to your lab-report.md file (not the qmd) on GitHub.

  3. Submit that to the Assignment on D2L.