2  Importing and cleaning data

2.1 Creating R data objects

In R, we work with data as objects. You define an object by choosing a name and assigning it some sort of value. We’ll illustrate this with a few different kinds of data; later, we’ll work with reading in data from files.

We assign an object a value using the <- operator. Let’s try it. After assigning your new object a value, you can find it in the object viewer of your IDE (RStudio or Visual Studio Code). To display values in this guide, we’ll use print().

myobject <- 2

print(myobject)
[1] 2

We can store other kinds of data, like text, which is called a string. Strings are always "wrapped in quotes" to be handled properly.

mysecondobject <- "this is a string"

print(mysecondobject)
[1] "this is a string"

And of course, we can store more complex data than single values. We can create a vector with the special combine function c(). You can think of a vector as a row of data:

mythirdobject <- c(1, 2, 3)

print(mythirdobject)
[1] 1 2 3

Our R object explorer might be starting to get a little cluttered with these example objects. You can remove them in the GUI or you can run the rm() command to delete them:

rm(myobject)
rm(mysecondobject)
rm(mythirdobject)
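If you have several objects to remove, a quick sketch of two shortcuts (use the second with care, since it clears everything):

```r
# remove several objects in one call
rm(myobject, mysecondobject, mythirdobject)

# or clear every object in the current environment
rm(list = ls())
```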

2.2 Importing data

Now that we’ve learned how to work with R data objects, let’s discuss working with actual data. First, of course, we’ll need to get the data into R. How we do that depends on the format the data is stored in and where it resides. Let’s consider each:

By file format

Most data files are stored in one of two common formats: comma-separated values (.csv) and Microsoft Excel files (.xlsx). There are great packages in R to handle both kinds of imports.

First, readr includes functions to import data from .csv files, primarily read_csv(). You’ll need to load the readr library, then pass the function a filename or URL.

Warning

Remember, you’ll need to run install.packages() to install the package before you can load it. See Chapter 1 for details on installing packages.

library("readr")

mydata <- read_csv("mydatafile.csv")

For modern Excel files (.xlsx), we use read_xlsx() from the readxl package.

library("readxl")
mydata <- read_xlsx("mydatafile.xlsx")

By file location

The other important difference is where your data files reside. If you have the data locally, you can move or copy the files into your project directory. When they are in your project directory, you can simply pass the filename to the appropriate function (e.g. read_csv() or read_xlsx()).

If the file exists on the internet, you may be able to pass the URL (in "" quotes) to a function like read_csv(). You can also use the curl package to download the file using curl_download() and set a filename for the location on your computer, then follow the instructions for a local file above.
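For example, downloading a remote .csv to a local file and then reading it like any other local file (the URL and filename here are placeholders):

```r
library("curl")
library("readr")

# download the remote file into the working directory
curl_download(
  "https://example.com/mydatafile.csv",  # placeholder URL
  destfile = "mydatafile.csv"
)

# then read it as a local file
mydata <- read_csv("mydatafile.csv")
```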

The curl package can also be used to work with FTP/SFTP servers. Since you’ll need to pass credentials to the server in code, you’ll want to look into the R environment file as detailed in Section 1.6.6.

library("curl")

From a database

If your data is stored in a database, it’s likely that you’ll be able to connect to your database using R. The DBI package details how to connect to your database, either by using ODBC or by specifying a driver. The DBI package includes such functions as dbReadTable() to read in a data table and dbGetQuery() to run a query and grab the result.
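As a sketch of what that might look like (the data source name and table and query below are placeholders; your connection details will differ):

```r
library("DBI")

# connect via ODBC; the arguments depend on your database and driver
con <- dbConnect(
  odbc::odbc(),
  dsn = "my_warehouse"   # placeholder data source name
)

# read an entire table, or run a query and grab the result
enrollments <- dbReadTable(con, "enrollments")
graduates   <- dbGetQuery(con, "SELECT * FROM degrees WHERE year = 2022")

# close the connection when finished
dbDisconnect(con)
```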

From an Application Programming Interface (API)

If the data can be provided from a webservice, it’s likely that you can use R to engage with the data. This is beyond the scope of this guide, but packages exist for many popular public data sources that have APIs - for example, you can use the tidycensus package to access the Census Bureau data APIs.
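As a small taste of what that looks like (assuming you’ve requested a free Census API key; the variable code below is the ACS total population estimate):

```r
library("tidycensus")

# register your Census API key for the session
census_api_key("YOUR_KEY_HERE")  # placeholder key

# pull state-level population from the American Community Survey
state_pop <- get_acs(
  geography = "state",
  variables = "B01003_001",  # total population
  year = 2022
)
```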

2.3 Tidy data

For much of our data cleaning, manipulation, and analysis in R we’re going to use a collection of packages known as tidyverse (Wickham et al. 2019). The readr package described above is actually part of the collection! This set of packages is designed for data science, with a common design philosophy and shared underlying data structures. We can load the entire collection with just one command:

library("tidyverse")

tidyverse is named after a concept of what it means to have tidy data, which Wickham (2014) defines as:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

As institutional researchers, we do come across data that do not meet those requirements from time to time, especially data collected by humans, such as data coded with one column per semester (semester 1, semester 2, etc.). tidyverse includes functions to reshape such data, like pivot_longer() and pivot_wider(). Wickham, Çetinkaya-Rundel, and Grolemund (2023) has a great overview of these concepts and functions.
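A minimal sketch of reshaping semester-per-column data into tidy form (the data here is made up for illustration):

```r
library("tidyverse")

# untidy: one column per semester
enrollment <- tibble(
  student    = c("A", "B"),
  semester_1 = c(15, 12),
  semester_2 = c(14, 16)
)

# tidy: one row per student per semester
enrollment |>
  pivot_longer(
    cols = starts_with("semester_"),
    names_to = "semester",
    values_to = "credit_hours"
  )
```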

For this guide, we’ll work with data that comes from administrative data sources and already meets the tidy data standards, as it was collected and processed in machine-readable formats. That doesn’t mean it won’t need some cleaning, of course.

2.4 In Practice: Dataset #1 – IPEDS data

For working with our first dataset, we’ll of course begin with data from the Integrated Postsecondary Education Data System (IPEDS). Though many IR professionals are used to working with the web interface of the IPEDS data center, we also have the opportunity to interact with the complete data files, which are available as comma-separated values files (.csv) in a compressed format (.zip), or as Access databases. We’ll be working with the .csv files for now.1

If we go to the IPEDS Data Center and click on “Complete Data Files”, we’ll reach this page: https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx. We can hover over the various file links and note that they share a similar stem and file format. Let’s set two variables to start: one to store this URL stem and one to store the latest year available:

IPEDS_url <- "https://nces.ed.gov/ipeds/datacenter/data/"
IPEDS_year <- 2022

Next we’re going to load libraries we’ll need to use. tidyverse should be loaded already but we’ll load it again. The curl package will assist us in downloading files.

# load libraries
library("tidyverse")
library("curl")

Let’s use the simple curl_download() function from the curl package. We need to provide it two parameters: a URL to download and a filename for the downloaded file. We can build both of those by concatenating the variables we stored above to match the available links. We’ll grab the IPEDS Directory Information data, which comes from the Institutional Characteristics Header component (“HD”), and the Completions component data (“C”). In Excel we would concatenate text with CONCATENATE() or the special concatenation character (&); using tidyverse, we use str_c():

curl_download(
  str_c(IPEDS_url, "HD", IPEDS_year, ".zip"),
  destfile = str_c("HD", IPEDS_year, ".zip")
)

curl_download(
  str_c(IPEDS_url, "C", IPEDS_year, "_A", ".zip"),
  destfile = str_c("C", IPEDS_year, "_A", ".zip")
)

Next we need to import the downloaded files into R and begin cleaning the data. We’ll start with the directory information data, using read_csv(), which can read the .csv file directly from inside the .zip file. We’ll store the data as institutions, then take a peek at the imported data using glimpse():

institutions <- read_csv(
    str_c("HD", IPEDS_year, ".zip")
)
Rows: 6256 Columns: 73
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): INSTNM, IALIAS, ADDR, CITY, STABBR, ZIP, CHFNM, CHFTITLE, EIN, UEI...
dbl (50): UNITID, FIPS, OBEREG, GENTELE, OPEFLAG, SECTOR, ICLEVEL, CONTROL, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(institutions)
Rows: 6,256
Columns: 73
$ UNITID   <dbl> 100654, 100663, 100690, 100706, 100724, 100733, 100751, 10076…
$ INSTNM   <chr> "Alabama A & M University", "University of Alabama at Birming…
$ IALIAS   <chr> "AAMU", "UAB", "Southern Christian University  Regions Univer…
$ ADDR     <chr> "4900 Meridian Street", "Administration Bldg Suite 1070", "12…
$ CITY     <chr> "Normal", "Birmingham", "Montgomery", "Huntsville", "Montgome…
$ STABBR   <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ ZIP      <chr> "35762", "35294-0110", "36117-3553", "35899", "36104-0271", "…
$ FIPS     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ OBEREG   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
$ CHFNM    <chr> "Dr. Daniel K. Wims", "Ray L. Watts", "Michael C.Turner", "Ch…
$ CHFTITLE <chr> "President", "President", "President", "President", "Presiden…
$ GENTELE  <dbl> 2.563725e+09, 2.059344e+09, 3.343874e+13, 2.568246e+09, 3.342…
$ EIN      <chr> "636001109", "636005396", "237034324", "630520830", "63600110…
$ UEIS     <chr> "JDVGS67MSLH7", "YND4PLMC9AN7", "RB27R4GLDKE7", "HB6KNGVNJRU1…
$ OPEID    <chr> "00100200", "00105200", "02503400", "00105500", "00100500", "…
$ OPEFLAG  <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ WEBADDR  <chr> "www.aamu.edu/", "https://www.uab.edu/", "https://www.amridge…
$ ADMINURL <chr> "https://www.aamu.edu/admissions-aid/index.html", "https://ww…
$ FAIDURL  <chr> "https://www.aamu.edu/admissions-aid/financial-aid/", "https:…
$ APPLURL  <chr> "https://www.aamu.edu/admissions-aid/undergraduate-admissions…
$ NPRICURL <chr> "www.aamu.edu/admissions-aid/tuition-fees/net-price-calculato…
$ VETURL   <chr> NA, "https://www.uab.edu/students/veterans", "https://www.amr…
$ ATHURL   <chr> NA, "https://www.uab.edu/registrar/students", NA, "www.uah.ed…
$ DISAURL  <chr> "https://www.aamu.edu/administrativeoffices/VADS/Pages/Disabi…
$ SECTOR   <dbl> 1, 1, 2, 1, 1, 0, 1, 4, 1, 1, 1, 2, 4, 3, 4, 4, 2, 4, 9, 4, 4…
$ ICLEVEL  <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 3, 2, 2…
$ CONTROL  <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 2, 1, 3, 1, 1…
$ HLOFFER  <dbl> 9, 9, 9, 9, 9, 9, 9, 3, 7, 9, 9, 5, 3, 9, 3, 3, 9, 3, 2, 3, 3…
$ UGOFFER  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ GROFFER  <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2…
$ HDEGOFR1 <dbl> 12, 11, 12, 11, 11, 11, 11, 40, 20, 11, 11, 30, 40, 13, 40, 4…
$ DEGGRANT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
$ HBCU     <dbl> 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2…
$ HOSPITAL <dbl> 2, 1, 2, 2, 2, -2, 2, -2, 2, 2, 2, 2, -2, 2, -2, -2, 2, -2, -…
$ MEDICAL  <dbl> 2, 1, 2, 2, 2, -2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ TRIBAL   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ LOCALE   <dbl> 12, 12, 12, 12, 12, 13, 13, 32, 31, 12, 13, 12, 41, 12, 32, 3…
$ OPENPUBL <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ACT      <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
$ NEWID    <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -…
$ DEATHYR  <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -…
$ CLOSEDAT <chr> "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "…
$ CYACTIVE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ POSTSEC  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ PSEFLAG  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ PSET4FLG <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ RPTMTH   <dbl> 1, 1, 1, 1, 1, -2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, …
$ INSTCAT  <dbl> 2, 2, 2, 2, 2, -2, 2, 4, 2, 2, 2, 2, 4, 2, 4, 4, 2, 4, 6, 4, …
$ C21BASIC <dbl> 18, 15, 20, 15, 17, -2, 15, 5, 22, 18, 15, 21, 4, 22, 2, 4, 1…
$ C21IPUG  <dbl> 16, 14, 20, 17, 13, -2, 17, 2, 16, 16, 17, 9, 2, 20, 1, 2, 19…
$ C21IPGRD <dbl> 18, 14, 18, 17, 18, -2, 15, 0, 6, 4, 14, 0, 0, 7, 0, 0, 16, 0…
$ C21UGPRF <dbl> 10, 9, 5, 15, 10, -2, 12, 1, 5, 7, 14, 12, 2, 7, 2, 2, 7, 2, …
$ C21ENPRF <dbl> 4, 5, 6, 4, 3, -2, 4, 1, 3, 4, 4, 2, 1, 4, 1, 1, 5, 1, -2, 1,…
$ C21SZSET <dbl> 14, 15, 6, 13, 14, -2, 16, 2, 9, 13, 15, 11, 2, 6, 2, 3, 10, …
$ C18BASIC <dbl> 18, 15, 20, 16, 19, -2, 15, 2, 22, 18, 15, 21, 1, 20, 5, 5, 1…
$ C15BASIC <dbl> 18, 15, 20, 16, 19, -2, 16, 1, 22, 18, 16, 21, 1, 22, 8, 1, 2…
$ CCBASIC  <dbl> 18, 15, 21, 15, 18, -2, 16, 2, 22, 18, 16, 21, 2, 23, 2, 5, 2…
$ CARNEGIE <dbl> 16, 15, 51, 16, 21, -2, 15, 40, 32, 21, 15, 31, 40, 40, 40, 4…
$ LANDGRNT <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ INSTSIZE <dbl> 3, 5, 1, 3, 2, -2, 5, 2, 2, 3, 5, 1, 2, 1, 2, 3, 2, 2, 1, 2, …
$ F1SYSTYP <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1…
$ F1SYSNAM <chr> "-2", "The University of Alabama System", "-2", "The Universi…
$ F1SYSCOD <dbl> -2, 101050, -2, 101050, -2, 101050, 101050, 101030, -2, 10104…
$ CBSA     <dbl> 26620, 13820, 33860, 26620, 33860, 46220, 46220, 10760, 26620…
$ CBSATYPE <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2…
$ CSA      <dbl> 290, 142, 388, 290, 388, -2, -2, 388, 290, 388, 194, 142, 194…
$ COUNTYCD <dbl> 1089, 1073, 1101, 1089, 1101, 1125, 1125, 1123, 1083, 1101, 1…
$ COUNTYNM <chr> "Madison County", "Jefferson County", "Montgomery County", "M…
$ CNGDSTCD <dbl> 105, 107, 102, 105, 107, 107, 107, 103, 105, 102, 103, 107, 1…
$ LONGITUD <dbl> -86.56850, -86.79935, -86.17401, -86.64045, -86.29568, -87.52…
$ LATITUDE <dbl> 34.78337, 33.50570, 32.36261, 34.72456, 32.36432, 33.20701, 3…
$ DFRCGID  <dbl> 107, 92, 125, 92, 98, -2, 91, 73, 134, 107, 91, 132, 70, 199,…
$ DFRCUSCG <dbl> 1, 1, 2, 2, 1, -2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 2, …

That’s a lot of data! Since we’re not using all of the variables, let’s pull out a subset of them with the select() function.

Important

This is an important place to teach an R technique that makes code much easier and cleaner to write. If you look at the documentation for different functions, you’ll note that typically, the first parameter of a function is a data object to use.

If we have several things we need to do with a data object, we could end up writing the name of the data object over, and over, and over, cluttering up our code.

R uses a special character sequence |>, called the pipe, that takes the output of one command and uses it as the first input to the following command. (Some older documentation uses the sequence %>%, which also works, but the native |> is now preferred.)
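A tiny illustration of the equivalence: these two lines produce the same result, but the piped version reads left to right instead of inside out:

```r
# without the pipe: nested calls
round(sqrt(10), 2)

# with the pipe: each result flows into the next function
10 |> sqrt() |> round(2)
```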

institutions <- institutions |>
    select(UNITID, INSTNM, OPEID, STABBR, SECTOR, HLOFFER)

glimpse(institutions)
Rows: 6,256
Columns: 6
$ UNITID  <dbl> 100654, 100663, 100690, 100706, 100724, 100733, 100751, 100760…
$ INSTNM  <chr> "Alabama A & M University", "University of Alabama at Birmingh…
$ OPEID   <chr> "00100200", "00105200", "02503400", "00105500", "00100500", "0…
$ STABBR  <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
$ SECTOR  <dbl> 1, 1, 2, 1, 1, 0, 1, 4, 1, 1, 1, 2, 4, 3, 4, 4, 2, 4, 9, 4, 4,…
$ HLOFFER <dbl> 9, 9, 9, 9, 9, 9, 9, 3, 7, 9, 9, 5, 3, 9, 3, 3, 9, 3, 2, 3, 3,…

Let’s move on to importing and cleaning the IPEDS Completions data. We’ll again use read_csv() to read in the data and store it in an object called completions. Since we’re familiar with the completions data, we know that we really only need major #1 and can filter out secondary majors. We will ask filter() to keep only those rows equal to 1 (== 1). Finally, similar to above, we only care about a few of the variables for now.

completions <- read_csv(
    str_c("C", IPEDS_year, "_A.zip")
    ) |>
    filter(MAJORNUM == 1) |>
    select(UNITID, CIPCODE, AWLEVEL, CTOTALT)
Rows: 300877 Columns: 64
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (32): CIPCODE, AWLEVEL, XCTOTALT, XCTOTALM, XCTOTALW, XCAIANT, XCAIANM, ...
dbl (32): UNITID, MAJORNUM, CTOTALT, CTOTALM, CTOTALW, CAIANT, CAIANM, CAIAN...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(completions)
Rows: 279,506
Columns: 4
$ UNITID  <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 100654, 100654…
$ CIPCODE <chr> "01.0999", "01.1001", "01.1001", "01.1001", "01.9999", "01.999…
$ AWLEVEL <chr> "05", "05", "07", "17", "05", "07", "17", "05", "05", "07", "0…
$ CTOTALT <dbl> 9, 7, 7, 3, 1, 8, 3, 8, 3, 12, 11, 17, 41, 16, 0, 1, 3, 1, 8, …

Now that we have clean directory info (institutions) and clean completions data (completions), let’s combine them. We’ll use the *_join() functions from the tidyverse dplyr package. This is like joining tables in SQL or, for those of you who use Excel like a database, like using VLOOKUP().

We’ll store this combined dataset as combined for now. We need to provide the two data objects we’re joining, and the variable(s) we want to join on:

combined <- left_join(
    institutions,
    completions,
    by = "UNITID",
    keep = TRUE
)

glimpse(combined)
Rows: 279,720
Columns: 10
$ UNITID.x <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 100654, 10065…
$ INSTNM   <chr> "Alabama A & M University", "Alabama A & M University", "Alab…
$ OPEID    <chr> "00100200", "00100200", "00100200", "00100200", "00100200", "…
$ STABBR   <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ SECTOR   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HLOFFER  <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
$ UNITID.y <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 100654, 10065…
$ CIPCODE  <chr> "01.0999", "01.1001", "01.1001", "01.1001", "01.9999", "01.99…
$ AWLEVEL  <chr> "05", "05", "07", "17", "05", "07", "17", "05", "05", "07", "…
$ CTOTALT  <dbl> 9, 7, 7, 3, 1, 8, 3, 8, 3, 12, 11, 17, 41, 16, 0, 1, 3, 1, 8,…
Tip

Note that when we get to more complex functions that would create really long lines of code or nested functions, it’s best to split your code across multiple lines to keep it readable. R ignores these line breaks just as it ignores extra spaces, as long as each line leaves the expression incomplete (for example, ending with a comma, an open parenthesis, or a pipe).

The dataset is combined and clean, but since it has a row for every credential (or at least every unique combination of level and CIP code) for every institution, it’s pretty large. We’ll be working with Ohio data, so we can filter to Ohio institutions:

combined <- combined |>
    filter(STABBR == "OH")

glimpse(combined)
Rows: 12,521
Columns: 10
$ UNITID.x <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ INSTNM   <chr> "ETI Technical College", "ETI Technical College", "ETI Techni…
$ OPEID    <chr> "03079000", "03079000", "03079000", "03079000", "03079000", "…
$ STABBR   <chr> "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "…
$ SECTOR   <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HLOFFER  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
$ UNITID.y <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ CIPCODE  <chr> "22.0301", "46.0401", "48.0508", "51.0801", "51.0801", "51.38…
$ AWLEVEL  <chr> "03", "02", "02", "02", "03", "03", "02", "02", "03", "02", "…
$ CTOTALT  <dbl> 2, 7, 10, 11, 0, 15, 9, 0, 2, 37, 19, 9, 2, 10, 2, 39, 10, 10…

Much better. But we have two more steps we’d like to do. Note that several of the variables take a limited set of repeated values. We can use a special data type, factors, to turn these into categorical variables. tidyverse includes a package, forcats, for working with categorical data using the factor data type. Let’s do this for SECTOR, HLOFFER, and AWLEVEL:

combined <- combined |>
    mutate(
        SECTOR = as_factor(SECTOR),
        HLOFFER = as_factor(HLOFFER),
        AWLEVEL = as_factor(AWLEVEL)
    )

glimpse(combined)
Rows: 12,521
Columns: 10
$ UNITID.x <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ INSTNM   <chr> "ETI Technical College", "ETI Technical College", "ETI Techni…
$ OPEID    <chr> "03079000", "03079000", "03079000", "03079000", "03079000", "…
$ STABBR   <chr> "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "…
$ SECTOR   <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HLOFFER  <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
$ UNITID.y <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ CIPCODE  <chr> "22.0301", "46.0401", "48.0508", "51.0801", "51.0801", "51.38…
$ AWLEVEL  <fct> 03, 02, 02, 02, 03, 03, 02, 02, 03, 02, 03, 07, 17, 07, 06, 0…
$ CTOTALT  <dbl> 2, 7, 10, 11, 0, 15, 9, 0, 2, 37, 19, 9, 2, 10, 2, 39, 10, 10…

Lastly, since we want to use the SECTOR variable in particular later, let’s recode it with the descriptions that IPEDS uses:2

combined <- combined |>
    mutate(
        SECTOR = fct_recode(
            SECTOR,
            "Administrative Unit" = "0",
            "Public 4-year or above" = "1",
            "Private nonprofit 4-year or above" = "2",
            "Private for-profit 4-year or above" = "3",
            "Public 2-year" = "4",
            "Private nonprofit 2-year" = "5",
            "Private for-profit 2-year" = "6",
            "Public less-than-2-year" = "7",
            "Private nonprofit less-than-2-year" = "8",
            "Private for-profit less-than-2-year" = "9"
        )
    )
Note

Once you’re comfortable with many of these cleaning steps, you can revisit the readr function options. You can actually change column types, select variables, etc. all with optional arguments.
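For instance, part of the select() and factor conversion above could be folded into the import itself. A sketch using readr’s col_select and col_types arguments:

```r
library("tidyverse")

IPEDS_year <- 2022

# select columns and set a factor type during import
institutions <- read_csv(
    str_c("HD", IPEDS_year, ".zip"),
    col_select = c(UNITID, INSTNM, OPEID, STABBR, SECTOR, HLOFFER),
    col_types = cols(SECTOR = col_factor())
)
```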

Dataset #1 complete code

# load libraries
library("tidyverse")
library("curl")

# store IPEDS url & data year for use
IPEDS_url <- "https://nces.ed.gov/ipeds/datacenter/data/"
IPEDS_year <- 2022

# download data
curl_download(
  str_c(IPEDS_url, "HD", IPEDS_year, ".zip"),
  destfile = str_c("HD", IPEDS_year, ".zip")
)

curl_download(
  str_c(IPEDS_url, "C", IPEDS_year, "_A", ".zip"),
  destfile = str_c("C", IPEDS_year, "_A", ".zip")
)

# process institution directory information
institutions <- read_csv(
    str_c("HD", IPEDS_year, ".zip")
    ) |>
    select(UNITID, INSTNM, OPEID, STABBR, SECTOR, HLOFFER)

# process completions data
completions <- read_csv(
    str_c("C", IPEDS_year, "_A.zip")
    ) |>
    filter(MAJORNUM == 1) |>
    select(UNITID, CIPCODE, AWLEVEL, CTOTALT)

# combine datasets, additional cleaning
combined <- left_join(
    institutions,
    completions,
    by = "UNITID",
    keep = TRUE
    ) |>
    filter(STABBR == "OH") |> #filter to Ohio
    mutate(
        SECTOR = as_factor(SECTOR),
        HLOFFER = as_factor(HLOFFER),
        AWLEVEL = as_factor(AWLEVEL)
    ) |>
    mutate(
        SECTOR = fct_recode(
            SECTOR,
            "Administrative Unit" = "0",
            "Public 4-year or above" = "1",
            "Private nonprofit 4-year or above" = "2",
            "Private for-profit 4-year or above" = "3",
            "Public 2-year" = "4",
            "Private nonprofit 2-year" = "5",
            "Private for-profit 2-year" = "6",
            "Public less-than-2-year" = "7",
            "Private nonprofit less-than-2-year" = "8",
            "Private for-profit less-than-2-year" = "9"
        )
    )

2.5 In Practice: Dataset #2 – Occupation Projections data

For our second example dataset, we’ll work with Occupational Projections data from the Ohio Department of Jobs and Family Services, Bureau of Labor Market Information. The Long-Term Projections data by JobsOhio region and Metropolitan Statistical Area (MSA) can be found at https://ohiolmi.com/Home/Projections/ProjectionsHome_08.16.23. As interest in post-graduate outcomes increases throughout higher education, institutional researchers are increasingly working with labor market data.

We’ll begin with the Central Ohio Excel file. If you copy the URL, you’ll get https://ohiolmi.com/_docs/PROJ/JobsOhio/Central.xlsx.

Note

If you hover over the file links, you’ll notice that they share a common URL stem (https://ohiolmi.com/_docs/PROJ/JobsOhio/), which will be helpful to use in parameterized reports, covered in Chapter 4.

Let’s store that URL to make it easier to use:

projections_url <- "https://ohiolmi.com/_docs/PROJ/JobsOhio/Central.xlsx"

We’ll pull down the file with curl:

curl_download(projections_url, "projectionsdata.xlsx")

Next we’ll try to import the file as a data object, and then check the results with summary() and glimpse():

projections_data <- read_xlsx("projectionsdata.xlsx")
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
glimpse(projections_data)
Rows: 501
Columns: 12
$ `JobsOhio Regional Network - Central Region` <chr> "Occupational* Employment…
$ ...2                                         <chr> NA, NA, NA, NA, "Occupati…
$ ...3                                         <chr> NA, NA, "Employment*", "2…
$ ...4                                         <chr> NA, NA, NA, "2030", "Proj…
$ ...5                                         <chr> NA, NA, "Change in", "Emp…
$ ...6                                         <chr> NA, NA, NA, NA, "Percent"…
$ ...7                                         <chr> NA, NA, NA, "Annual Openi…
$ ...8                                         <chr> NA, NA, NA, NA, "Exits", …
$ ...9                                         <chr> NA, NA, NA, NA, "Transfer…
$ ...10                                        <chr> NA, NA, NA, NA, "Total", …
$ ...11                                        <chr> NA, NA, NA, "Median Wage"…
$ ...12                                        <chr> NA, NA, NA, NA, NA, NA, N…
summary(projections_data)
 JobsOhio Regional Network - Central Region     ...2          
 Length:501                                 Length:501        
 Class :character                           Class :character  
 Mode  :character                           Mode  :character  
     ...3               ...4               ...5               ...6          
 Length:501         Length:501         Length:501         Length:501        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
     ...7               ...8               ...9              ...10          
 Length:501         Length:501         Length:501         Length:501        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
    ...11              ...12          
 Length:501         Length:501        
 Class :character   Class :character  
 Mode  :character   Mode  :character  

That returned a bunch of junk because there are multiple header rows in the file. Luckily, there’s an optional parameter we can pass to read_xlsx() to skip rows until the one we want (row 6 holds the headers), which we can read about in the readxl documentation.

projections_data <- read_xlsx("projectionsdata.xlsx", skip = 5)
New names:
• `` -> `...12`
glimpse(projections_data)
Rows: 496
Columns: 12
$ Code                 <chr> "00-0000", "11-0000", "11-1011", "11-1021", "11-1…
$ `Occupational Title` <chr> "Total, All Occupations", "Management Occupations…
$ Annual               <chr> "1168921", "70397", "1830", "14405", "336", "1777…
$ Projected            <dbl> 1261966, 77059, 1671, 15779, 372, 1869, 2668, 413…
$ `2020-2030`          <dbl> 93045, 6662, -159, 1374, 36, 92, 131, 37, 252, 10…
$ Percent              <dbl> 0.0796, 0.0946, -0.0869, 0.0954, 0.1071, 0.0518, …
$ Growth               <dbl> 9304, 666, -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, …
$ Exits                <dbl> 50873, 2143, 48, 301, 8, 36, 52, 10, 86, 54, 122,…
$ Transfers            <dbl> 79184, 3818, 71, 936, 17, 115, 164, 22, 151, 188,…
$ Total                <dbl> 139361, 6627, 103, 1374, 29, 160, 229, 36, 262, 2…
$ `May 2021`           <chr> "22.4", "48.2", "80.739999999999995", "47.63", "3…
$ ...12                <chr> NA, NA, NA, NA, "**", NA, NA, "^", "^", NA, NA, N…
summary(projections_data)
     Code           Occupational Title    Annual            Projected      
 Length:496         Length:496         Length:496         Min.   :    104  
 Class :character   Class :character   Class :character   1st Qu.:    400  
 Mode  :character   Mode  :character   Mode  :character   Median :    976  
                                                          Mean   :   7667  
                                                          3rd Qu.:   2903  
                                                          Max.   :1261966  
                                                          NA's   :8        
   2020-2030          Percent             Growth            Exits        
 Min.   :-1975.0   Min.   :-0.35860   Min.   :-198.00   Min.   :    2.0  
 1st Qu.:    6.0   1st Qu.: 0.01740   1st Qu.:   1.00   1st Qu.:   12.0  
 Median :   53.0   Median : 0.06970   Median :   5.00   Median :   34.5  
 Mean   :  566.1   Mean   : 0.07336   Mean   :  56.61   Mean   :  309.3  
 3rd Qu.:  216.2   3rd Qu.: 0.12575   3rd Qu.:  22.00   3rd Qu.:  101.2  
 Max.   :93045.0   Max.   : 0.58880   Max.   :9304.00   Max.   :50873.0  
 NA's   :8         NA's   :8          NA's   :8         NA's   :8        
   Transfers           Total            May 2021            ...12          
 Min.   :    1.0   Min.   :     5.0   Length:496         Length:496        
 1st Qu.:   21.0   1st Qu.:    37.0   Class :character   Class :character  
 Median :   57.5   Median :    95.5   Mode  :character   Mode  :character  
 Mean   :  481.4   Mean   :   847.3                                        
 3rd Qu.:  179.8   3rd Qu.:   301.8                                        
 Max.   :79184.0   Max.   :139361.0                                        
 NA's   :8         NA's   :8                                               

It seems we have a little more data cleaning to do. We need to change the types of a few columns3, rename a few columns, and do some filtering to remove the summary occupations. Let’s use additional parameters in read_xlsx() to define the types and column names. Since we’re providing the column names ourselves instead of importing them, we also need to skip the old header row, changing the skip option to 6 instead of 5.

Note

There’s often more than one way to do something. Instead of providing column names on the import (and changing the skip parameter), we could rename the columns with the rename() function after import.
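That alternative would look something like this, keeping skip = 5 so the original header row is used and then renaming a few columns after import (only a few renames are shown):

```r
library("tidyverse")
library("readxl")

# import with the original headers, then rename selected columns
projections_data <- read_xlsx("projectionsdata.xlsx", skip = 5) |>
    rename(
        SOC = Code,
        Occupation = `Occupational Title`,
        Wage = `May 2021`
    )
```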

projections_data <- read_xlsx(
    "projectionsdata.xlsx",
    skip = 6,
    col_names = c(
        "SOC",
        "Occupation",
        "Current_jobs",
        "Projected_jobs",
        "Change_num",
        "Change_pct",
        "Growth",
        "Exits",
        "Transfers",
        "Tot_openings",
        "Wage",
        "Notes"
    ),
    col_types = c(
        "text",
        "text",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "text"
    )
)

glimpse(projections_data)
Rows: 496
Columns: 12
$ SOC            <chr> "00-0000", "11-0000", "11-1011", "11-1021", "11-1031", …
$ Occupation     <chr> "Total, All Occupations", "Management Occupations", "Ch…
$ Current_jobs   <dbl> 1168921, 70397, 1830, 14405, 336, 1777, 2537, 376, 2900…
$ Projected_jobs <dbl> 1261966, 77059, 1671, 15779, 372, 1869, 2668, 413, 3152…
$ Change_num     <dbl> 93045, 6662, -159, 1374, 36, 92, 131, 37, 252, 103, 965…
$ Change_pct     <dbl> 0.0796, 0.0946, -0.0869, 0.0954, 0.1071, 0.0518, 0.0516…
$ Growth         <dbl> 9304, 666, -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, 3, 23,…
$ Exits          <dbl> 50873, 2143, 48, 301, 8, 36, 52, 10, 86, 54, 122, 27, 1…
$ Transfers      <dbl> 79184, 3818, 71, 936, 17, 115, 164, 22, 151, 188, 298, …
$ Tot_openings   <dbl> 139361, 6627, 103, 1374, 29, 160, 229, 36, 262, 252, 51…
$ Wage           <dbl> 22.40, 48.20, 80.74, 47.63, 37738.00, 62.05, 61.39, 57.…
$ Notes          <chr> NA, NA, NA, NA, "**", NA, NA, "^", "^", NA, NA, NA, NA,…
summary(projections_data)
     SOC             Occupation         Current_jobs       Projected_jobs   
 Length:496         Length:496         Min.   :    102.0   Min.   :    104  
 Class :character   Class :character   1st Qu.:    397.5   1st Qu.:    400  
 Mode  :character   Mode  :character   Median :    937.5   Median :    976  
                                       Mean   :   7101.0   Mean   :   7667  
                                       3rd Qu.:   2729.0   3rd Qu.:   2903  
                                       Max.   :1168921.0   Max.   :1261966  
                                       NA's   :8           NA's   :8        
   Change_num        Change_pct           Growth            Exits        
 Min.   :-1975.0   Min.   :-0.35860   Min.   :-198.00   Min.   :    2.0  
 1st Qu.:    6.0   1st Qu.: 0.01740   1st Qu.:   1.00   1st Qu.:   12.0  
 Median :   53.0   Median : 0.06970   Median :   5.00   Median :   34.5  
 Mean   :  566.1   Mean   : 0.07336   Mean   :  56.61   Mean   :  309.3  
 3rd Qu.:  216.2   3rd Qu.: 0.12575   3rd Qu.:  22.00   3rd Qu.:  101.2  
 Max.   :93045.0   Max.   : 0.58880   Max.   :9304.00   Max.   :50873.0  
 NA's   :8         NA's   :8          NA's   :8         NA's   :8        
   Transfers        Tot_openings           Wage              Notes          
 Min.   :    1.0   Min.   :     5.0   Min.   :    10.04   Length:496        
 1st Qu.:   21.0   1st Qu.:    37.0   1st Qu.:    18.13   Class :character  
 Median :   57.5   Median :    95.5   Median :    23.53   Mode  :character  
 Mean   :  481.4   Mean   :   847.3   Mean   :  4994.70                     
 3rd Qu.:  179.8   3rd Qu.:   301.8   3rd Qu.:    37.24                     
 Max.   :79184.0   Max.   :139361.0   Max.   :129451.00                     
 NA's   :8         NA's   :8          NA's   :11                            

Next, let’s filter the dataset to remove those summary occupations and any notes at the end.

Summary occupations are coded with SOC codes ending in “0000”, so we can quickly identify them. In Excel, we’d likely use the MID() or RIGHT() functions to pull out that sequence. In R, the tidyverse’s stringr package provides str_sub(), which works very similarly, extracting part of a string based on character position. Negative positions count from the end of the string.

Any row that holds something other than a SOC code in the SOC column should be dropped. Since real SOC codes are exactly 7 characters long, we can filter on length: where Excel would use LEN(), here we’ll use str_length(), also from stringr. Note that R’s equality operators (== and !=) differ from Excel’s (= and <>).
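As a quick sanity check before filtering the real dataset, here’s how the two functions behave on sample strings. The footnote-style third value is invented for illustration; only the well-formed, non-summary SOC code survives the filter.

```r
library(stringr)

# Sample values like those found in the SOC column
socs <- c("00-0000", "11-1011", "** Wage presented as annual")

str_sub(socs, -4, -1)   # last four characters of each string
str_length(socs)        # real SOC codes are exactly 7 characters long

# Keep only non-summary rows that contain a well-formed SOC code
socs[str_sub(socs, -4, -1) != "0000" & str_length(socs) == 7]
```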

projections_data <- projections_data |>
    filter(
        str_sub(SOC, -4, -1) != "0000",
        str_length(SOC) == 7
        )

glimpse(projections_data)
Rows: 465
Columns: 12
$ SOC            <chr> "11-1011", "11-1021", "11-1031", "11-2021", "11-2022", …
$ Occupation     <chr> "Chief Executives", "General and Operations Managers", …
$ Current_jobs   <dbl> 1830, 14405, 336, 1777, 2537, 376, 2900, 3250, 5457, 13…
$ Projected_jobs <dbl> 1671, 15779, 372, 1869, 2668, 413, 3152, 3353, 6422, 14…
$ Change_num     <dbl> -159, 1374, 36, 92, 131, 37, 252, 103, 965, 87, 29, 228…
$ Change_pct     <dbl> -0.0869, 0.0954, 0.1071, 0.0518, 0.0516, 0.0984, 0.0869…
$ Growth         <dbl> -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, 3, 23, 0, 8, 3, 5…
$ Exits          <dbl> 48, 301, 8, 36, 52, 10, 86, 54, 122, 27, 10, 30, 3, 24,…
$ Transfers      <dbl> 71, 936, 17, 115, 164, 22, 151, 188, 298, 67, 24, 86, 8…
$ Tot_openings   <dbl> 103, 1374, 29, 160, 229, 36, 262, 252, 516, 103, 37, 13…
$ Wage           <dbl> 80.74, 47.63, 37738.00, 62.05, 61.39, 57.62, 47.73, 63.…
$ Notes          <chr> NA, NA, "**", NA, NA, "^", "^", NA, NA, NA, NA, NA, NA,…
summary(projections_data)
     SOC             Occupation         Current_jobs   Projected_jobs 
 Length:465         Length:465         Min.   :  102   Min.   :  104  
 Class :character   Class :character   1st Qu.:  376   1st Qu.:  395  
 Mode  :character   Mode  :character   Median :  841   Median :  902  
                                       Mean   : 2425   Mean   : 2618  
                                       3rd Qu.: 2296   3rd Qu.: 2583  
                                       Max.   :34936   Max.   :39662  
                                                                      
   Change_num        Change_pct           Growth            Exits       
 Min.   :-1948.0   Min.   :-0.35860   Min.   :-195.00   Min.   :   2.0  
 1st Qu.:    5.0   1st Qu.: 0.01650   1st Qu.:   0.00   1st Qu.:  12.0  
 Median :   44.0   Median : 0.06850   Median :   4.00   Median :  30.0  
 Mean   :  193.9   Mean   : 0.07258   Mean   :  19.39   Mean   : 105.8  
 3rd Qu.:  185.0   3rd Qu.: 0.12710   3rd Qu.:  18.00   3rd Qu.:  82.0  
 Max.   : 8608.0   Max.   : 0.58880   Max.   : 861.00   Max.   :2775.0  
                                                                        
   Transfers       Tot_openings         Wage              Notes          
 Min.   :   1.0   Min.   :   5.0   Min.   :    10.04   Length:465        
 1st Qu.:  21.0   1st Qu.:  36.0   1st Qu.:    18.13   Class :character  
 Median :  52.0   Median :  87.0   Median :    23.60   Mode  :character  
 Mean   : 164.6   Mean   : 289.8   Mean   :  5242.13                     
 3rd Qu.: 143.0   3rd Qu.: 231.0   3rd Qu.:    37.30                     
 Max.   :3245.0   Max.   :6071.0   Max.   :129451.00                     
                                   NA's   :3                             

And finally, we need to adjust the wage column. It seems there is a mix of hourly and annual wage figures in this column. Let’s convert all of them to annual wages in a new variable, by multiplying any value below $200/hr by 2,080 hours/yr. We’ll use mutate() to create the new variable and define it with case_when().4

projections_data <- projections_data |>
    mutate(
        annual_wage = case_when(
            Wage < 200 ~ Wage * 2080,
            .default = Wage
        )
    )

glimpse(projections_data)
Rows: 465
Columns: 13
$ SOC            <chr> "11-1011", "11-1021", "11-1031", "11-2021", "11-2022", …
$ Occupation     <chr> "Chief Executives", "General and Operations Managers", …
$ Current_jobs   <dbl> 1830, 14405, 336, 1777, 2537, 376, 2900, 3250, 5457, 13…
$ Projected_jobs <dbl> 1671, 15779, 372, 1869, 2668, 413, 3152, 3353, 6422, 14…
$ Change_num     <dbl> -159, 1374, 36, 92, 131, 37, 252, 103, 965, 87, 29, 228…
$ Change_pct     <dbl> -0.0869, 0.0954, 0.1071, 0.0518, 0.0516, 0.0984, 0.0869…
$ Growth         <dbl> -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, 3, 23, 0, 8, 3, 5…
$ Exits          <dbl> 48, 301, 8, 36, 52, 10, 86, 54, 122, 27, 10, 30, 3, 24,…
$ Transfers      <dbl> 71, 936, 17, 115, 164, 22, 151, 188, 298, 67, 24, 86, 8…
$ Tot_openings   <dbl> 103, 1374, 29, 160, 229, 36, 262, 252, 516, 103, 37, 13…
$ Wage           <dbl> 80.74, 47.63, 37738.00, 62.05, 61.39, 57.62, 47.73, 63.…
$ Notes          <chr> NA, NA, "**", NA, NA, "^", "^", NA, NA, NA, NA, NA, NA,…
$ annual_wage    <dbl> 167939.2, 99070.4, 37738.0, 129064.0, 127691.2, 119849.…
summary(projections_data)
     SOC             Occupation         Current_jobs   Projected_jobs 
 Length:465         Length:465         Min.   :  102   Min.   :  104  
 Class :character   Class :character   1st Qu.:  376   1st Qu.:  395  
 Mode  :character   Mode  :character   Median :  841   Median :  902  
                                       Mean   : 2425   Mean   : 2618  
                                       3rd Qu.: 2296   3rd Qu.: 2583  
                                       Max.   :34936   Max.   :39662  
                                                                      
   Change_num        Change_pct           Growth            Exits       
 Min.   :-1948.0   Min.   :-0.35860   Min.   :-195.00   Min.   :   2.0  
 1st Qu.:    5.0   1st Qu.: 0.01650   1st Qu.:   0.00   1st Qu.:  12.0  
 Median :   44.0   Median : 0.06850   Median :   4.00   Median :  30.0  
 Mean   :  193.9   Mean   : 0.07258   Mean   :  19.39   Mean   : 105.8  
 3rd Qu.:  185.0   3rd Qu.: 0.12710   3rd Qu.:  18.00   3rd Qu.:  82.0  
 Max.   : 8608.0   Max.   : 0.58880   Max.   : 861.00   Max.   :2775.0  
                                                                        
   Transfers       Tot_openings         Wage              Notes          
 Min.   :   1.0   Min.   :   5.0   Min.   :    10.04   Length:465        
 1st Qu.:  21.0   1st Qu.:  36.0   1st Qu.:    18.13   Class :character  
 Median :  52.0   Median :  87.0   Median :    23.60   Mode  :character  
 Mean   : 164.6   Mean   : 289.8   Mean   :  5242.13                     
 3rd Qu.: 143.0   3rd Qu.: 231.0   3rd Qu.:    37.30                     
 Max.   :3245.0   Max.   :6071.0   Max.   :129451.00                     
                                   NA's   :3                             
  annual_wage    
 Min.   : 20883  
 1st Qu.: 37710  
 Median : 48776  
 Mean   : 57792  
 3rd Qu.: 75712  
 Max.   :179150  
 NA's   :3       

And now we have a clean dataset! We’ll use this further in Chapter 3 and Chapter 4, so let’s show what it looks like all together.

Dataset #2 complete code

# load libraries
library("readxl")
library("curl")
library("tidyverse")

# define file url
projections_url <- "https://ohiolmi.com/_docs/PROJ/JobsOhio/Central.xlsx"

# download file
curl_download(projections_url, "projectionsdata.xlsx")

# read file with appropriate settings
projections_data <- read_xlsx(
    "projectionsdata.xlsx",
    skip = 6,
    col_names = c(
        "SOC",
        "Occupation",
        "Current_jobs",
        "Projected_jobs",
        "Change_num",
        "Change_pct",
        "Growth",
        "Exits",
        "Transfers",
        "Tot_openings",
        "Wage",
        "Notes"
    ),
    col_types = c(
        "text",
        "text",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "numeric",
        "text"
    )
    ) |>

    # remove summary occupations and notes/blank rows
    filter(
        str_sub(SOC, -4, -1) != "0000",
        str_length(SOC) == 7
        ) |>
        
    # create annual wage column so values are consistent
    mutate(
        annual_wage = case_when(
            Wage < 200 ~ Wage * 2080,
            .default = Wage
        )
    )

Exercises

Exercise 1

Think about the kinds of data that you work with regularly. Is it tidy? How is it stored? What functions will you need to use to work with it in R?

Exercise 2

We used functions from several tidyverse packages. Especially when learning, it’s nice to have quick references. Tidyverse has a series of official cheat sheets that you’ll likely find useful. Take a look:

package    what this package is about       links
readxl     data import (esp. Excel files)   site · cheat sheet
dplyr      data transformation              site · cheat sheet
stringr    working with text                site · cheat sheet
forcats    categorical data (factors)       site · cheat sheet

Extra: Exporting data

At the start of this chapter we covered importing data, but not exporting it. While this guide will show you how far you can take R, R can also be just one piece of your toolchain if you wish. You could use R for cleaning, processing, and some analysis, then export the data for use in another tool, like a dashboard product.

The tidyverse package readr includes a set of write_ functions to export data quickly. write_csv() takes a data object and a filename (e.g. mydata.csv) and writes a CSV file that Excel and other tools can read. There’s also write_excel_csv(), which adds a UTF-8 byte-order mark so Excel displays non-ASCII characters correctly.
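A short sketch of both functions on a small example data frame (the filenames here are arbitrary):

```r
library(readr)

# A small example data frame to export
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Standard CSV export
write_csv(df, "mydata.csv")

# Same data, but with a UTF-8 byte-order mark so Excel
# displays accented and other non-ASCII characters correctly
write_excel_csv(df, "mydata_excel.csv")
```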

We briefly reviewed accessing a database earlier in this chapter. The DBI package provides dbWriteTable() to write a data object to a database table. You provide the database connection object, a table name, and the data object you’re storing; set append = TRUE if you simply want to add rows to an existing table.
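Here is a minimal sketch of dbWriteTable(), assuming the RSQLite package is installed; an in-memory SQLite database keeps the example self-contained, and the table name and data are made up for illustration.

```r
library(DBI)

# In-memory SQLite database for illustration (requires the RSQLite package)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A small made-up data object to store
df <- data.frame(SOC = c("11-1011", "11-1021"), Wage = c(80.74, 47.63))

# First call creates the table; append = TRUE adds rows to it later
dbWriteTable(con, "projections", df)
dbWriteTable(con, "projections", df, append = TRUE)

res <- dbReadTable(con, "projections")   # now holds both batches (4 rows)
dbDisconnect(con)
```

With a real database you would swap the SQLite connection for the connection object you created earlier in the chapter.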


  1. Yes, there are a couple of packages designed for accessing IPEDS data, including this one in CRAN and this one on GitHub. But we’re using the files directly from the IPEDS data center because we are focused on learning how to import and clean data.↩︎

  2. And included in the dictionary file or the web interface↩︎

  3. By default, the tidyverse read_ functions guess at column types by examining the fields, but it is imperfect.↩︎

  4. case_when() is inspired by the SQL CASE statement, and is more elegant than nested IF() functions you may be used to using in Excel.↩︎