9 Data sets

9.1 Your own dataset

If you are involved in an ongoing research project, for example with a faculty member at MSUM, you may be able to use the data collected as part of that project for your BIOL 275 lab project.

To use your own dataset, you must first get it approved by the instructor. In general, it needs to contain enough data to answer an interesting question.

9.2 Biological trait data

9.3 Species occurrence data

  • eBird data on bird observations. This is a huge dataset with many possible questions to explore. Challenge level: Difficult

  • The Botanical Information and Ecology Network (BIEN) brings together data on plant distribution, abundance, and traits, with the goal of predicting and mitigating the effects of climate change on plant species and communities. You can download geolocated observations and trait data, but you would probably need to combine them with other earth observation data, such as the geospatial data listed below. Challenge level: Moderate

  • iNaturalist. Challenge level: Easy

  • GBIF. Global Biodiversity Information Facility. Geolocated occurrence data for all species worldwide, aggregated from many other data sources. Challenge level: Moderate

    • rgbif package on GitHub - read the intro on the README for more links to vignettes, reference, articles, and a published paper
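As a rough illustration of what a GBIF occurrence query looks like underneath a client such as rgbif, the sketch below builds a search URL in Python. It assumes the public GBIF occurrence search endpoint and its `scientificName`, `country`, `hasCoordinate`, and `limit` parameters; the function name and the example species are hypothetical choices for illustration. The code only builds the URL and does not download anything, so check the GBIF API documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumed public endpoint for GBIF occurrence searches.
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def gbif_search_url(scientific_name, country=None, limit=20):
    """Build a GBIF occurrence search URL (hypothetical helper, no network access)."""
    params = {
        "scientificName": scientific_name,
        "hasCoordinate": "true",  # keep only georeferenced records
        "limit": limit,           # number of records per page
    }
    if country:
        params["country"] = country  # two-letter ISO country code
    return GBIF_SEARCH + "?" + urlencode(params)


# Example: georeferenced monarch butterfly records from the USA.
print(gbif_search_url("Danaus plexippus", country="US"))
```

Pasting the printed URL into a browser should return a page of records as JSON. For a lab project, the rgbif package is likely the more convenient route because it handles paging and returns data frames.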

9.4 Environmental data

  • NEON. The National Science Foundation’s National Ecological Observatory Network (NEON) is a continental-scale observation facility operated by Battelle and designed to collect long-term open access ecological data to better understand how U.S. ecosystems are changing. The comprehensive data, spatial extent and remote sensing technology provided by NEON will enable a large and diverse user community to tackle new questions at scales not accessible to previous generations of ecologists. Challenge level: Difficult

    • Users can browse data products and associated documentation and then select time frames and field sites to download the data

    • The neonUtilities R package allows you to access and download NEON data as well as to work with NEON data downloaded from the portal.

9.5 Public health data sets

Many data sets for the USA can be found at:

  • National Center for Health Statistics (NCHS). Includes data sets, documentation, and questionnaires from NCHS data collection systems. Some of these are included in the table below, but many more are available.

Quite a bit of health data may be downloaded at:

  • CDC WONDER. You choose the dataset, which variables to include, and download it.

Dataset | Description | Spatial Coverage | Spatial Resolution | Temporal Coverage | Temporal Resolution
Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data | Prevalence data based on telephone surveys | USA | State | 2011-present | Yearly
County Health Rankings & Roadmaps | You can download data by state and year (see Minnesota, for example) | USA | State, County | | Yearly
KIDS COUNT | A source of data on children and families and a project of the Annie E. Casey Foundation. You choose and download the variables necessary to answer your question. | USA | State, County | | Yearly
National Comorbidity Survey (NCS) Series | Prevalence, risk factors, and consequences of psychiatric morbidity and comorbidity | USA | Individual | 1990-2004 | Baseline, reinterview, replication

Other pages that provide lists of available data sets:

9.6 Epidemiological data

9.7 Geospatial data

9.7.1 AρρEEARS

The Application for Extracting and Exploring Analysis Ready Samples (AρρEEARS) offers a simple and efficient way to access and transform geospatial data from a variety of federal data archives.

AρρEEARS enables users to subset geospatial datasets using spatial, temporal, and band/layer parameters.

Two types of sample requests are available:

  • point samples for geographic coordinates and
  • area samples for spatial areas via vector polygons.

Sample requests submitted to AρρEEARS provide users not only with data values, but also associated quality data values. Interactive visualizations with summary statistics are provided for each sample within the application, which allow users to preview and interact with their samples before downloading their data. Visit the Help page to learn more.

There are handy videos on how to use the system to get data.

  • Some datasets include:

    • Land Surface Temperature (min, max, mean)
    • Sea Surface Temperature
    • Precipitation
    • Snow cover
    • Land cover
    • Soil moisture, soil temperature
    • Vegetation indices (e.g. NDVI)
    • Gridded population data
  • The temporal and spatial range and resolution of these data sets vary.

You could explore geospatial data by itself, or, if you have GPS coordinates for other types of data (e.g. georeferenced specimen or observation data), you could use AρρEEARS to extract environmental data associated with those points.
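Preparing a set of georeferenced points for a point sample request can be sketched as follows. This is a minimal Python example, assuming the points are uploaded as a CSV file with ID, Category, Latitude, and Longitude columns; that column layout is an assumption to confirm on the Help page. The observation records, function names, and values are made up for illustration.

```python
import csv
import io

# Hypothetical observation records (made-up values, for illustration only).
observations = [
    {"id": "obs1", "category": "prairie", "latitude": 46.87, "longitude": -96.77},
    {"id": "obs2", "category": "forest", "latitude": 47.20, "longitude": -95.10},
    {"id": "obs3", "category": "forest", "latitude": 147.20, "longitude": -95.10},  # bad latitude
]


def valid_point(latitude, longitude):
    """Return True when the coordinates fall inside valid decimal-degree ranges."""
    return -90 <= latitude <= 90 and -180 <= longitude <= 180


def points_to_csv(records):
    """Write valid points as CSV text; the column layout is an assumption."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "Category", "Latitude", "Longitude"])
    for rec in records:
        if valid_point(rec["latitude"], rec["longitude"]):
            writer.writerow(
                [rec["id"], rec["category"], rec["latitude"], rec["longitude"]]
            )
    return buffer.getvalue()


csv_text = points_to_csv(observations)
print(csv_text)
```

Dropping records with impossible coordinates before upload avoids a common cause of failed or misleading samples. Coordinates should be in decimal degrees, with negative values for western longitudes and southern latitudes.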

9.7.2 Other geospatial data

9.8 Cellular and molecular biology and biochemistry

  • The Actinobacteriophage Database at PhagesDB.org, a website that collects and shares data, pictures, protocols, and analysis tools associated with the discovery, sequencing, and characterization of mycobacteriophages—viruses that infect the Mycobacteria and also other bacterial hosts in the phylum Actinobacteria. It was developed at—and is maintained from—the Pittsburgh Bacteriophage Institute, a joint venture of Dr. Graham Hatfull and Dr. Roger Hendrix, both of the Department of Biological Sciences at the University of Pittsburgh.

9.9 Online data repositories

  • Dryad. A curated, general purpose data repository. You can search through it to find an interesting dataset. Here are two examples (but you should find your own):

  • Awesome Public Datasets. A topic-centric list of high-quality open datasets in public domains. By everyone, for everyone!

  • ATLANTIC: Data Papers from a biodiversity hotspot. Datasets include: Mammals, Mammal traits, Bats, Nonvolant mammals, Small mammals, Primates, Birds, Bird traits, Amphibians, Butterflies, Epiphytes, Frugivory, Camera traps

9.10 Datasets in R

Many R packages have datasets included.