We’re able to work with data in R as objects. You define an object by selecting a name and assigning it some sort of value. We’ll illustrate this with a few different kinds of data; later we’ll move on to reading in data from files.
We assign an object a value using the <- operator. Let’s try. After assigning your new object a value, you can navigate to it in the R object viewer in your IDE (RStudio or Visual Studio Code). To display values in this guide, we’ll use print().
myobject <- 2
print(myobject)
[1] 2
We can store other kinds of data, like text, which is called a string. Strings are always "wrapped in quotes"
to be handled properly.
mysecondobject <- "this is a string"
print(mysecondobject)
[1] "this is a string"
And of course, we can store more complex data than single values. We can create a vector with the combine function, c(). You can think of a vector as a row of data:
mythirdobject <- c(1, 2, 3)
print(mythirdobject)
[1] 1 2 3
Our R object explorer might be starting to get a little cluttered with these example objects. You can remove them in the GUI or you can run the rm()
command to delete them:
rm(myobject)
rm(mysecondobject)
rm(mythirdobject)
Now that we’ve learned how to work with R data objects, let’s discuss working with actual data. First, of course, we’ll need to get the data into R. How we get the data into R depends on how the data is stored and where the data is stored. Let’s consider each:
Most data files are stored in one of two common formats: Comma-separated values (.csv
) and Microsoft Excel files (.xlsx
). There are great packages in R to handle these kinds of imports.
First, readr includes functions to import data from those .csv files, primarily read_csv(). You’ll need to load the readr library, then pass the function a filename or URL.
Remember, you’ll need to run install.packages()
to install the package before you can load it. See Chapter 1 for details on installing packages.
library("readr")
mydata <- read_csv("mydatafile.csv")
For modern Excel files (.xlsx
), we use read_xlsx()
from the readxl
package.
library("readxl")
mydata <- read_xlsx("mydatafile.xlsx")
The other important difference is where your data files reside. If you have the data locally, you can move or copy the files into your project directory. When they are in your project directory, you can simply pass the filename to the appropriate function (e.g. read_csv()
or read_xlsx()
).
If the file exists on the internet, you may be able to pass the URL (in ""
quotes) to a function like read_csv()
. You can also use the curl
package to download the file using curl_download()
and set a filename for the location on your computer, then follow the instructions for a local file above.
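For instance, a sketch assuming a placeholder URL (not a real dataset):

```r
library("readr")

# read_csv() accepts a URL in place of a local filename;
# the file is fetched and parsed in one step
mydata <- read_csv("https://example.com/mydatafile.csv")
```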
The curl
package can also be used to work with FTP/SFTP servers. Since you’ll need to pass credentials to the server in code, you’ll want to look into the R environment file as detailed in Section 1.6.6.
library("curl")
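As a sketch, assuming hypothetical environment variable names (SFTP_USER, SFTP_PASS) set in your R environment file, and a placeholder server address:

```r
library("curl")

# Sys.getenv() reads values set in your .Renviron file,
# keeping credentials out of your code
h <- new_handle(
  userpwd = paste0(Sys.getenv("SFTP_USER"), ":", Sys.getenv("SFTP_PASS"))
)

# download a file from the (hypothetical) SFTP server using that handle
curl_download(
  "sftp://sftp.example.edu/exports/mydatafile.csv",
  destfile = "mydatafile.csv",
  handle = h
)
```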
If your data is stored in a database, it’s likely that you’ll be able to connect to it from R. The DBI package provides a common interface for connecting to databases, either through ODBC or a database-specific driver. It includes functions such as dbReadTable() to read in a data table and dbGetQuery() to run a query and grab the result.
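A minimal sketch using the RSQLite driver so the example is self-contained (an in-memory database with made-up data stands in for a real server; your own connection would use your database’s driver or odbc::odbc() with real connection details):

```r
library("DBI")

# connect; an in-memory SQLite database stands in for a real server here
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# write a small demo table so there is something to read
dbWriteTable(
  con,
  "enrollments",
  data.frame(term = c("FA22", "SP23"), headcount = c(1200, 1100))
)

# read a whole table, or run a query and grab the result
all_rows <- dbReadTable(con, "enrollments")
fall_only <- dbGetQuery(con, "SELECT headcount FROM enrollments WHERE term = 'FA22'")

dbDisconnect(con)
```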
If the data can be provided from a webservice, it’s likely that you can use R to engage with the data. This is beyond the scope of this guide, but packages exist for many popular public data sources that have APIs - for example, you can use the tidycensus
package to access the Census Bureau data APIs.
For much of our data cleaning, manipulation, and analysis in R, we’re going to use a collection of packages known as the tidyverse
(Wickham et al. 2019). The readr
package described above is actually part of the collection! This set of packages is designed for data science, with a common design philosophy and underlying data structures. We can load the entire collection with just one command:
library("tidyverse")
tidyverse
is named after the concept of tidy data, which Wickham (2014) defines by three rules: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
As institutional researchers, we do come across data that don’t meet those requirements from time to time, especially data collected by humans, such as data coded as semester 1
, semester 2
, etc. The tidyverse
includes functions to reshape such data, like pivot_longer()
and pivot_wider()
. Wickham, Çetinkaya-Rundel, and Grolemund (2023) has a great overview of these concepts and functions.
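For example, a small sketch (with made-up enrollment counts) of using pivot_longer() to reshape one-column-per-semester data into tidy rows:

```r
library("tidyverse")

# hypothetical untidy data: one column per semester
enrollment <- tibble(
  program = c("Biology", "History"),
  semester1 = c(120, 45),
  semester2 = c(110, 50)
)

# pivot_longer() gathers the semester columns into name/value pairs,
# so each row becomes one program-semester observation
tidy_enrollment <- enrollment |>
  pivot_longer(
    cols = starts_with("semester"),
    names_to = "semester",
    values_to = "students"
  )
```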
For this guide, we’ll work with data that comes from administrative data sources and already meets the tidy data standards, as it was collected and processed in machine-readable formats. That doesn’t mean it won’t need some cleaning, of course.
For working with our first dataset, we’ll of course begin with data from the Integrated Postsecondary Education Data System (IPEDS). Though many IR professionals are used to working with the web interface of the IPEDS data center, we also have the opportunity to interact with the complete data files, which are available as comma-separated values files (.csv
) in a compressed format (.zip
), or as Access databases. We’ll be working with the .csv
files for now.1
If we go to the IPEDS Data Center and click on “Complete Data Files”, we’ll reach this page: https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx. We can hover over the various file links and note that they share a similar stem and file format. Let’s set two variables to start: one to store this URL stem and one to store the latest year available:
IPEDS_url <- "https://nces.ed.gov/ipeds/datacenter/data/"
IPEDS_year <- 2022
Next we’re going to load the libraries we’ll need. tidyverse
should be loaded already, but it’s safe to load it again. The curl
package will assist us in downloading files.
# load libraries
library("tidyverse")
library("curl")
Let’s use the simple curl_download()
function from the curl
package. We need to provide it two parameters: a URL to download and a filename for the downloaded file. We can build both by concatenating the variables we stored above with text to match the available links. We’ll grab the IPEDS Directory Information data, which comes from the Institutional Characteristics Header component (“HD”), and the Completions component data (“C”). In Excel we would use CONCATENATE()
or the special concatenation character (&
), but in the tidyverse we use str_c()
:
curl_download(
str_c(IPEDS_url, "HD", IPEDS_year, ".zip"),
destfile = str_c("HD", IPEDS_year, ".zip")
)
curl_download(
str_c(IPEDS_url, "C", IPEDS_year, "_A", ".zip"),
destfile = str_c("C", IPEDS_year, "_A", ".zip")
)
Next we need to import the downloaded files into R and begin cleaning the data. We’ll start with the directory information data, using read_csv()
, which can read the .csv
file directly from inside the .zip
file. We’ll store the data as institutions
, then take a peek at the imported data using glimpse()
:
institutions <- read_csv(
  str_c("HD", IPEDS_year, ".zip")
)
Rows: 6256 Columns: 73
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): INSTNM, IALIAS, ADDR, CITY, STABBR, ZIP, CHFNM, CHFTITLE, EIN, UEI...
dbl (50): UNITID, FIPS, OBEREG, GENTELE, OPEFLAG, SECTOR, ICLEVEL, CONTROL, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(institutions)
Rows: 6,256
Columns: 73
$ UNITID <dbl> 100654, 100663, 100690, 100706, 100724, 100733, 100751, 10076…
$ INSTNM <chr> "Alabama A & M University", "University of Alabama at Birming…
$ IALIAS <chr> "AAMU", "UAB", "Southern Christian University Regions Univer…
$ ADDR <chr> "4900 Meridian Street", "Administration Bldg Suite 1070", "12…
$ CITY <chr> "Normal", "Birmingham", "Montgomery", "Huntsville", "Montgome…
$ STABBR <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ ZIP <chr> "35762", "35294-0110", "36117-3553", "35899", "36104-0271", "…
$ FIPS <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ OBEREG <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
$ CHFNM <chr> "Dr. Daniel K. Wims", "Ray L. Watts", "Michael C.Turner", "Ch…
$ CHFTITLE <chr> "President", "President", "President", "President", "Presiden…
$ GENTELE <dbl> 2.563725e+09, 2.059344e+09, 3.343874e+13, 2.568246e+09, 3.342…
$ EIN <chr> "636001109", "636005396", "237034324", "630520830", "63600110…
$ UEIS <chr> "JDVGS67MSLH7", "YND4PLMC9AN7", "RB27R4GLDKE7", "HB6KNGVNJRU1…
$ OPEID <chr> "00100200", "00105200", "02503400", "00105500", "00100500", "…
$ OPEFLAG <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ WEBADDR <chr> "www.aamu.edu/", "https://www.uab.edu/", "https://www.amridge…
$ ADMINURL <chr> "https://www.aamu.edu/admissions-aid/index.html", "https://ww…
$ FAIDURL <chr> "https://www.aamu.edu/admissions-aid/financial-aid/", "https:…
$ APPLURL <chr> "https://www.aamu.edu/admissions-aid/undergraduate-admissions…
$ NPRICURL <chr> "www.aamu.edu/admissions-aid/tuition-fees/net-price-calculato…
$ VETURL <chr> NA, "https://www.uab.edu/students/veterans", "https://www.amr…
$ ATHURL <chr> NA, "https://www.uab.edu/registrar/students", NA, "www.uah.ed…
$ DISAURL <chr> "https://www.aamu.edu/administrativeoffices/VADS/Pages/Disabi…
$ SECTOR <dbl> 1, 1, 2, 1, 1, 0, 1, 4, 1, 1, 1, 2, 4, 3, 4, 4, 2, 4, 9, 4, 4…
$ ICLEVEL <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 3, 2, 2…
$ CONTROL <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 2, 1, 3, 1, 1…
$ HLOFFER <dbl> 9, 9, 9, 9, 9, 9, 9, 3, 7, 9, 9, 5, 3, 9, 3, 3, 9, 3, 2, 3, 3…
$ UGOFFER <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ GROFFER <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2…
$ HDEGOFR1 <dbl> 12, 11, 12, 11, 11, 11, 11, 40, 20, 11, 11, 30, 40, 13, 40, 4…
$ DEGGRANT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
$ HBCU <dbl> 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2…
$ HOSPITAL <dbl> 2, 1, 2, 2, 2, -2, 2, -2, 2, 2, 2, 2, -2, 2, -2, -2, 2, -2, -…
$ MEDICAL <dbl> 2, 1, 2, 2, 2, -2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ TRIBAL <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ LOCALE <dbl> 12, 12, 12, 12, 12, 13, 13, 32, 31, 12, 13, 12, 41, 12, 32, 3…
$ OPENPUBL <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ACT <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
$ NEWID <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -…
$ DEATHYR <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -…
$ CLOSEDAT <chr> "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "…
$ CYACTIVE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ POSTSEC <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ PSEFLAG <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ PSET4FLG <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ RPTMTH <dbl> 1, 1, 1, 1, 1, -2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, …
$ INSTCAT <dbl> 2, 2, 2, 2, 2, -2, 2, 4, 2, 2, 2, 2, 4, 2, 4, 4, 2, 4, 6, 4, …
$ C21BASIC <dbl> 18, 15, 20, 15, 17, -2, 15, 5, 22, 18, 15, 21, 4, 22, 2, 4, 1…
$ C21IPUG <dbl> 16, 14, 20, 17, 13, -2, 17, 2, 16, 16, 17, 9, 2, 20, 1, 2, 19…
$ C21IPGRD <dbl> 18, 14, 18, 17, 18, -2, 15, 0, 6, 4, 14, 0, 0, 7, 0, 0, 16, 0…
$ C21UGPRF <dbl> 10, 9, 5, 15, 10, -2, 12, 1, 5, 7, 14, 12, 2, 7, 2, 2, 7, 2, …
$ C21ENPRF <dbl> 4, 5, 6, 4, 3, -2, 4, 1, 3, 4, 4, 2, 1, 4, 1, 1, 5, 1, -2, 1,…
$ C21SZSET <dbl> 14, 15, 6, 13, 14, -2, 16, 2, 9, 13, 15, 11, 2, 6, 2, 3, 10, …
$ C18BASIC <dbl> 18, 15, 20, 16, 19, -2, 15, 2, 22, 18, 15, 21, 1, 20, 5, 5, 1…
$ C15BASIC <dbl> 18, 15, 20, 16, 19, -2, 16, 1, 22, 18, 16, 21, 1, 22, 8, 1, 2…
$ CCBASIC <dbl> 18, 15, 21, 15, 18, -2, 16, 2, 22, 18, 16, 21, 2, 23, 2, 5, 2…
$ CARNEGIE <dbl> 16, 15, 51, 16, 21, -2, 15, 40, 32, 21, 15, 31, 40, 40, 40, 4…
$ LANDGRNT <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ INSTSIZE <dbl> 3, 5, 1, 3, 2, -2, 5, 2, 2, 3, 5, 1, 2, 1, 2, 3, 2, 2, 1, 2, …
$ F1SYSTYP <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1…
$ F1SYSNAM <chr> "-2", "The University of Alabama System", "-2", "The Universi…
$ F1SYSCOD <dbl> -2, 101050, -2, 101050, -2, 101050, 101050, 101030, -2, 10104…
$ CBSA <dbl> 26620, 13820, 33860, 26620, 33860, 46220, 46220, 10760, 26620…
$ CBSATYPE <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2…
$ CSA <dbl> 290, 142, 388, 290, 388, -2, -2, 388, 290, 388, 194, 142, 194…
$ COUNTYCD <dbl> 1089, 1073, 1101, 1089, 1101, 1125, 1125, 1123, 1083, 1101, 1…
$ COUNTYNM <chr> "Madison County", "Jefferson County", "Montgomery County", "M…
$ CNGDSTCD <dbl> 105, 107, 102, 105, 107, 107, 107, 103, 105, 102, 103, 107, 1…
$ LONGITUD <dbl> -86.56850, -86.79935, -86.17401, -86.64045, -86.29568, -87.52…
$ LATITUDE <dbl> 34.78337, 33.50570, 32.36261, 34.72456, 32.36432, 33.20701, 3…
$ DFRCGID <dbl> 107, 92, 125, 92, 98, -2, 91, 73, 134, 107, 91, 132, 70, 199,…
$ DFRCUSCG <dbl> 1, 1, 2, 2, 1, -2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 2, …
That’s a lot of data! Since we’re not using all of the variables, let’s use the select()
function to pull out a subset of them.
This is an important place to teach an R technique that makes code much easier and cleaner to write. If you look at the documentation for different functions, you’ll note that typically, the first parameter of a function is a data object to use.
If we have several things we need to do with a data object, we could end up writing the name of the data object over, and over, and over, cluttering up our code.
R uses a special character sequence |>
, called the pipe, that takes the output of one command and passes it as the first argument of the following command. (Some older documentation uses the sequence %>%
, which also works, but |>
is now preferred.)
institutions <- institutions |>
  select(UNITID, INSTNM, OPEID, STABBR, SECTOR, HLOFFER)
glimpse(institutions)
Rows: 6,256
Columns: 6
$ UNITID <dbl> 100654, 100663, 100690, 100706, 100724, 100733, 100751, 100760…
$ INSTNM <chr> "Alabama A & M University", "University of Alabama at Birmingh…
$ OPEID <chr> "00100200", "00105200", "02503400", "00105500", "00100500", "0…
$ STABBR <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
$ SECTOR <dbl> 1, 1, 2, 1, 1, 0, 1, 4, 1, 1, 1, 2, 4, 3, 4, 4, 2, 4, 9, 4, 4,…
$ HLOFFER <dbl> 9, 9, 9, 9, 9, 9, 9, 3, 7, 9, 9, 5, 3, 9, 3, 3, 9, 3, 2, 3, 3,…
Let’s move on to importing and cleaning the IPEDS Completions data. We’ll again use read_csv()
to read in the data and store it in an object called completions
. Since we’re familiar with the completions data, we know that we only need major #1 and can filter out secondary majors. We’ll ask filter()
to keep only the rows where MAJORNUM is equal to 1 (== 1
). Finally, similar to above, we only care about a few of the variables for now.
completions <- read_csv(
  str_c("C", IPEDS_year, "_A.zip")
) |>
  filter(MAJORNUM == 1) |>
  select(UNITID, CIPCODE, AWLEVEL, CTOTALT)
Rows: 300877 Columns: 64
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (32): CIPCODE, AWLEVEL, XCTOTALT, XCTOTALM, XCTOTALW, XCAIANT, XCAIANM, ...
dbl (32): UNITID, MAJORNUM, CTOTALT, CTOTALM, CTOTALW, CAIANT, CAIANM, CAIAN...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(completions)
Rows: 279,506
Columns: 4
$ UNITID <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 100654, 100654…
$ CIPCODE <chr> "01.0999", "01.1001", "01.1001", "01.1001", "01.9999", "01.999…
$ AWLEVEL <chr> "05", "05", "07", "17", "05", "07", "17", "05", "05", "07", "0…
$ CTOTALT <dbl> 9, 7, 7, 3, 1, 8, 3, 8, 3, 12, 11, 17, 41, 16, 0, 1, 3, 1, 8, …
Now that we have clean directory info (institutions
) and clean completions data (completions
), let’s combine them. We’ll use the *_join()
functions from the tidyverse’s dplyr
package. This is like joining tables in SQL or, for those of you who use Excel like a database, like using VLOOKUP()
.
We’ll store this combined dataset as combined
for now. We need to provide the two data objects we’re joining, and the variable(s) we want to join on:
combined <- left_join(
  institutions,
  completions,
  by = "UNITID",
  keep = TRUE
)
glimpse(combined)
Rows: 279,720
Columns: 10
$ UNITID.x <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 100654, 10065…
$ INSTNM <chr> "Alabama A & M University", "Alabama A & M University", "Alab…
$ OPEID <chr> "00100200", "00100200", "00100200", "00100200", "00100200", "…
$ STABBR <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ SECTOR <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HLOFFER <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
$ UNITID.y <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 100654, 10065…
$ CIPCODE <chr> "01.0999", "01.1001", "01.1001", "01.1001", "01.9999", "01.99…
$ AWLEVEL <chr> "05", "05", "07", "17", "05", "07", "17", "05", "05", "07", "…
$ CTOTALT <dbl> 9, 7, 7, 3, 1, 8, 3, 8, 3, 12, 11, 17, 41, 16, 0, 1, 3, 1, 8,…
Note that when we get to more complex functions that would create really long lines of code or nested functions, it’s best to separate your code into multiple lines to keep it readable. R will ignore the line breaks that you create just like it ignores spaces.
The data set is combined and clean, but since it has a row for every credential (or at least unique combinations of level and CIP code) for every institution, it’s pretty large. We’ll work with Ohio data so we can filter again to Ohio institutions:
combined <- combined |>
  filter(STABBR == "OH")
glimpse(combined)
Rows: 12,521
Columns: 10
$ UNITID.x <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ INSTNM <chr> "ETI Technical College", "ETI Technical College", "ETI Techni…
$ OPEID <chr> "03079000", "03079000", "03079000", "03079000", "03079000", "…
$ STABBR <chr> "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "…
$ SECTOR <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HLOFFER <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
$ UNITID.y <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ CIPCODE <chr> "22.0301", "46.0401", "48.0508", "51.0801", "51.0801", "51.38…
$ AWLEVEL <chr> "03", "02", "02", "02", "03", "03", "02", "02", "03", "02", "…
$ CTOTALT <dbl> 2, 7, 10, 11, 0, 15, 9, 0, 2, 37, 19, 9, 2, 10, 2, 39, 10, 10…
Much better. But we have two more steps we’d like to do. Note that many of the values are repeated in the dataset. We can use a special data type, the factor, to treat these as categorical variables. The tidyverse
includes a package, forcats
, to help work with categorical data using the factor data type. Let’s do this for SECTOR
, HLOFFER
, and AWLEVEL
:
combined <- combined |>
  mutate(
    SECTOR = as_factor(SECTOR),
    HLOFFER = as_factor(HLOFFER),
    AWLEVEL = as_factor(AWLEVEL)
  )
glimpse(combined)
Rows: 12,521
Columns: 10
$ UNITID.x <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ INSTNM <chr> "ETI Technical College", "ETI Technical College", "ETI Techni…
$ OPEID <chr> "03079000", "03079000", "03079000", "03079000", "03079000", "…
$ STABBR <chr> "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "…
$ SECTOR <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HLOFFER <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
$ UNITID.y <dbl> 200590, 200590, 200590, 200590, 200590, 200590, 200590, 20059…
$ CIPCODE <chr> "22.0301", "46.0401", "48.0508", "51.0801", "51.0801", "51.38…
$ AWLEVEL <fct> 03, 02, 02, 02, 03, 03, 02, 02, 03, 02, 03, 07, 17, 07, 06, 0…
$ CTOTALT <dbl> 2, 7, 10, 11, 0, 15, 9, 0, 2, 37, 19, 9, 2, 10, 2, 39, 10, 10…
Lastly, since we want to use the SECTOR
variable in particular later, let’s provide it with the descriptions that are used in IPEDS:2
combined <- combined |>
  mutate(
    SECTOR = fct_recode(
      SECTOR,
      "Administrative Unit" = "0",
      "Public 4-year or above" = "1",
      "Private nonprofit 4-year or above" = "2",
      "Private for-profit 4-year or above" = "3",
      "Public 2-year" = "4",
      "Private nonprofit 2-year" = "5",
      "Private for-profit 2-year" = "6",
      "Public less-than-2-year" = "7",
      "Private nonprofit less-than-2-year" = "8",
      "Private for-profit less-than-2-year" = "9"
    )
  )
Once you’re comfortable with many of these cleaning steps, you can revisit the readr
function options. You can change column types, select variables, and more, all with optional arguments on import.
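For instance, a sketch of importing the directory file with the column selection and types declared up front (the col_select and col_types arguments shown here are illustrative, not a complete specification):

```r
library("tidyverse")

IPEDS_year <- 2022

# col_select limits which columns are read in;
# col_types declares their types during the import itself
institutions <- read_csv(
  str_c("HD", IPEDS_year, ".zip"),
  col_select = c(UNITID, INSTNM, STABBR, SECTOR),
  col_types = cols(
    UNITID = col_double(),
    INSTNM = col_character(),
    STABBR = col_character(),
    SECTOR = col_factor()
  )
)
```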
# load libraries
library("tidyverse")
library("curl")
# store IPEDS url & data year for use
IPEDS_url <- "https://nces.ed.gov/ipeds/datacenter/data/"
IPEDS_year <- 2022
# download data
curl_download(
str_c(IPEDS_url, "HD", IPEDS_year, ".zip"),
destfile = str_c("HD", IPEDS_year, ".zip")
)
curl_download(
str_c(IPEDS_url, "C", IPEDS_year, "_A", ".zip"),
destfile = str_c("C", IPEDS_year, "_A", ".zip")
)
# process institution directory information
institutions <- read_csv(
  str_c("HD", IPEDS_year, ".zip")
) |>
  select(UNITID, INSTNM, OPEID, STABBR, SECTOR, HLOFFER)
# process completions data
completions <- read_csv(
  str_c("C", IPEDS_year, "_A.zip")
) |>
  filter(MAJORNUM == 1) |>
  select(UNITID, CIPCODE, AWLEVEL, CTOTALT)
# combine datasets, additional cleaning
combined <- left_join(
  institutions,
  completions,
  by = "UNITID",
  keep = TRUE
) |>
  filter(STABBR == "OH") |> # filter to Ohio
  mutate(
    SECTOR = as_factor(SECTOR),
    HLOFFER = as_factor(HLOFFER),
    AWLEVEL = as_factor(AWLEVEL)
  ) |>
  mutate(
    SECTOR = fct_recode(
      SECTOR,
      "Administrative Unit" = "0",
      "Public 4-year or above" = "1",
      "Private nonprofit 4-year or above" = "2",
      "Private for-profit 4-year or above" = "3",
      "Public 2-year" = "4",
      "Private nonprofit 2-year" = "5",
      "Private for-profit 2-year" = "6",
      "Public less-than-2-year" = "7",
      "Private nonprofit less-than-2-year" = "8",
      "Private for-profit less-than-2-year" = "9"
    )
  )
For our second example dataset, we’ll work with Occupational Projections data from the Ohio Department of Jobs and Family Services, Bureau of Labor Market Information. The Long-Term Projections data by JobsOhio region and Metropolitan Statistical Area (MSA) can be found at https://ohiolmi.com/Home/Projections/ProjectionsHome_08.16.23. As interest in post-graduate outcomes increases throughout higher education, institutional researchers are increasingly working with labor market data.
We’ll begin with the Central Ohio Excel file. If you copy the URL, you’ll get https://ohiolmi.com/_docs/PROJ/JobsOhio/Central.xlsx.
If you hover over the file links, you’ll notice that they share a common URL stem (https://ohiolmi.com/_docs/PROJ/JobsOhio/
), which will be helpful to use in parameterized reports, covered in Chapter 4.
Let’s store that URL to make it easier to use:
projections_url <- "https://ohiolmi.com/_docs/PROJ/JobsOhio/Central.xlsx"
We’ll pull down the file with curl:
curl_download(projections_url, "projectionsdata.xlsx")
Next we’ll try to import the file as a data object, and then check the results with summary()
and glimpse()
:
projections_data <- read_xlsx("projectionsdata.xlsx")
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
glimpse(projections_data)
Rows: 501
Columns: 12
$ `JobsOhio Regional Network - Central Region` <chr> "Occupational* Employment…
$ ...2 <chr> NA, NA, NA, NA, "Occupati…
$ ...3 <chr> NA, NA, "Employment*", "2…
$ ...4 <chr> NA, NA, NA, "2030", "Proj…
$ ...5 <chr> NA, NA, "Change in", "Emp…
$ ...6 <chr> NA, NA, NA, NA, "Percent"…
$ ...7 <chr> NA, NA, NA, "Annual Openi…
$ ...8 <chr> NA, NA, NA, NA, "Exits", …
$ ...9 <chr> NA, NA, NA, NA, "Transfer…
$ ...10 <chr> NA, NA, NA, NA, "Total", …
$ ...11 <chr> NA, NA, NA, "Median Wage"…
$ ...12 <chr> NA, NA, NA, NA, NA, NA, N…
summary(projections_data)
JobsOhio Regional Network - Central Region ...2
Length:501 Length:501
Class :character Class :character
Mode :character Mode :character
...3 ...4 ...5 ...6
Length:501 Length:501 Length:501 Length:501
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
...7 ...8 ...9 ...10
Length:501 Length:501 Length:501 Length:501
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
...11 ...12
Length:501 Length:501
Class :character Class :character
Mode :character Mode :character
That returned a bunch of junk because there are multiple header rows in the file. Luckily, there’s an optional parameter we can add to read_xlsx
to skip rows (the headers we want are on line 6, so we skip 5), which we can read about in the readxl
documentation.
projections_data <- read_xlsx("projectionsdata.xlsx", skip = 5)
New names:
• `` -> `...12`
glimpse(projections_data)
Rows: 496
Columns: 12
$ Code <chr> "00-0000", "11-0000", "11-1011", "11-1021", "11-1…
$ `Occupational Title` <chr> "Total, All Occupations", "Management Occupations…
$ Annual <chr> "1168921", "70397", "1830", "14405", "336", "1777…
$ Projected <dbl> 1261966, 77059, 1671, 15779, 372, 1869, 2668, 413…
$ `2020-2030` <dbl> 93045, 6662, -159, 1374, 36, 92, 131, 37, 252, 10…
$ Percent <dbl> 0.0796, 0.0946, -0.0869, 0.0954, 0.1071, 0.0518, …
$ Growth <dbl> 9304, 666, -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, …
$ Exits <dbl> 50873, 2143, 48, 301, 8, 36, 52, 10, 86, 54, 122,…
$ Transfers <dbl> 79184, 3818, 71, 936, 17, 115, 164, 22, 151, 188,…
$ Total <dbl> 139361, 6627, 103, 1374, 29, 160, 229, 36, 262, 2…
$ `May 2021` <chr> "22.4", "48.2", "80.739999999999995", "47.63", "3…
$ ...12 <chr> NA, NA, NA, NA, "**", NA, NA, "^", "^", NA, NA, N…
summary(projections_data)
Code Occupational Title Annual Projected
Length:496 Length:496 Length:496 Min. : 104
Class :character Class :character Class :character 1st Qu.: 400
Mode :character Mode :character Mode :character Median : 976
Mean : 7667
3rd Qu.: 2903
Max. :1261966
NA's :8
2020-2030 Percent Growth Exits
Min. :-1975.0 Min. :-0.35860 Min. :-198.00 Min. : 2.0
1st Qu.: 6.0 1st Qu.: 0.01740 1st Qu.: 1.00 1st Qu.: 12.0
Median : 53.0 Median : 0.06970 Median : 5.00 Median : 34.5
Mean : 566.1 Mean : 0.07336 Mean : 56.61 Mean : 309.3
3rd Qu.: 216.2 3rd Qu.: 0.12575 3rd Qu.: 22.00 3rd Qu.: 101.2
Max. :93045.0 Max. : 0.58880 Max. :9304.00 Max. :50873.0
NA's :8 NA's :8 NA's :8 NA's :8
Transfers Total May 2021 ...12
Min. : 1.0 Min. : 5.0 Length:496 Length:496
1st Qu.: 21.0 1st Qu.: 37.0 Class :character Class :character
Median : 57.5 Median : 95.5 Mode :character Mode :character
Mean : 481.4 Mean : 847.3
3rd Qu.: 179.8 3rd Qu.: 301.8
Max. :79184.0 Max. :139361.0
NA's :8 NA's :8
It seems we have a little more data cleaning to do. We need to change the types of a few columns3, rename a few columns, and do some filtering to remove the summary occupations. Let’s use additional parameters in read_xlsx
to define the column types and names. Since we’re providing the column names instead of importing them, we need to change the skip =
option to 6 instead of 5.
There’s often more than one way to do something. Instead of providing column names on import (and changing the skip parameter), we could rename the columns with the rename()
function after import.
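A sketch of that alternative: keep skip = 5 so the file’s own header row is imported, then rename columns afterward with new_name = old_name pairs (only a few columns shown for illustration; the original names come from the glimpse() output above):

```r
library("tidyverse")
library("readxl")

# import with the file's own headers, then rename selected columns
projections_data <- read_xlsx("projectionsdata.xlsx", skip = 5) |>
  rename(
    SOC = Code,
    Occupation = `Occupational Title`,
    Wage = `May 2021`
  )
```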
projections_data <- read_xlsx(
  "projectionsdata.xlsx",
  skip = 6,
  col_names = c(
    "SOC",
    "Occupation",
    "Current_jobs",
    "Projected_jobs",
    "Change_num",
    "Change_pct",
    "Growth",
    "Exits",
    "Transfers",
    "Tot_openings",
    "Wage",
    "Notes"
  ),
  col_types = c(
    "text",
    "text",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "text"
  )
)
glimpse(projections_data)
Rows: 496
Columns: 12
$ SOC <chr> "00-0000", "11-0000", "11-1011", "11-1021", "11-1031", …
$ Occupation <chr> "Total, All Occupations", "Management Occupations", "Ch…
$ Current_jobs <dbl> 1168921, 70397, 1830, 14405, 336, 1777, 2537, 376, 2900…
$ Projected_jobs <dbl> 1261966, 77059, 1671, 15779, 372, 1869, 2668, 413, 3152…
$ Change_num <dbl> 93045, 6662, -159, 1374, 36, 92, 131, 37, 252, 103, 965…
$ Change_pct <dbl> 0.0796, 0.0946, -0.0869, 0.0954, 0.1071, 0.0518, 0.0516…
$ Growth <dbl> 9304, 666, -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, 3, 23,…
$ Exits <dbl> 50873, 2143, 48, 301, 8, 36, 52, 10, 86, 54, 122, 27, 1…
$ Transfers <dbl> 79184, 3818, 71, 936, 17, 115, 164, 22, 151, 188, 298, …
$ Tot_openings <dbl> 139361, 6627, 103, 1374, 29, 160, 229, 36, 262, 252, 51…
$ Wage <dbl> 22.40, 48.20, 80.74, 47.63, 37738.00, 62.05, 61.39, 57.…
$ Notes <chr> NA, NA, NA, NA, "**", NA, NA, "^", "^", NA, NA, NA, NA,…
summary(projections_data)
SOC Occupation Current_jobs Projected_jobs
Length:496 Length:496 Min. : 102.0 Min. : 104
Class :character Class :character 1st Qu.: 397.5 1st Qu.: 400
Mode :character Mode :character Median : 937.5 Median : 976
Mean : 7101.0 Mean : 7667
3rd Qu.: 2729.0 3rd Qu.: 2903
Max. :1168921.0 Max. :1261966
NA's :8 NA's :8
Change_num Change_pct Growth Exits
Min. :-1975.0 Min. :-0.35860 Min. :-198.00 Min. : 2.0
1st Qu.: 6.0 1st Qu.: 0.01740 1st Qu.: 1.00 1st Qu.: 12.0
Median : 53.0 Median : 0.06970 Median : 5.00 Median : 34.5
Mean : 566.1 Mean : 0.07336 Mean : 56.61 Mean : 309.3
3rd Qu.: 216.2 3rd Qu.: 0.12575 3rd Qu.: 22.00 3rd Qu.: 101.2
Max. :93045.0 Max. : 0.58880 Max. :9304.00 Max. :50873.0
NA's :8 NA's :8 NA's :8 NA's :8
Transfers Tot_openings Wage Notes
Min. : 1.0 Min. : 5.0 Min. : 10.04 Length:496
1st Qu.: 21.0 1st Qu.: 37.0 1st Qu.: 18.13 Class :character
Median : 57.5 Median : 95.5 Median : 23.53 Mode :character
Mean : 481.4 Mean : 847.3 Mean : 4994.70
3rd Qu.: 179.8 3rd Qu.: 301.8 3rd Qu.: 37.24
Max. :79184.0 Max. :139361.0 Max. :129451.00
NA's :8 NA's :8 NA's :11
Next, let’s filter the dataset to remove those summary occupations and any notes at the end.
Summary occupations are coded with SOC codes ending in “0000”, so we can quickly identify them. In Excel, we’d likely use the MID()
or RIGHT()
functions to pull out that sequence. In the tidyverse, we can use str_sub()
, which works very similarly, extracting part of a string based on character position. Negative positions count from the end of the string.
Any rows with anything other than an SOC code in the SOC column should be ignored. Since SOC codes are 7 characters long, we’ll use that. In Excel, we’d use LEN()
to get the length, while here we’ll use str_length()
from the tidyverse. Note that in R, the equals and not-equals operators (==
and !=
) are different than in Excel (=
and <>
).
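A quick illustration of those two functions on a single SOC code:

```r
library("tidyverse")

# str_sub() with negative positions counts from the end of the string
last_four <- str_sub("11-1011", -4, -1)  # "1011"

# str_length() returns the number of characters
soc_len <- str_length("11-1011")  # 7
```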
projections_data <- projections_data |>
  filter(
    str_sub(SOC, -4, -1) != "0000",
    str_length(SOC) == 7
  )
glimpse(projections_data)
Rows: 465
Columns: 12
$ SOC <chr> "11-1011", "11-1021", "11-1031", "11-2021", "11-2022", …
$ Occupation <chr> "Chief Executives", "General and Operations Managers", …
$ Current_jobs <dbl> 1830, 14405, 336, 1777, 2537, 376, 2900, 3250, 5457, 13…
$ Projected_jobs <dbl> 1671, 15779, 372, 1869, 2668, 413, 3152, 3353, 6422, 14…
$ Change_num <dbl> -159, 1374, 36, 92, 131, 37, 252, 103, 965, 87, 29, 228…
$ Change_pct <dbl> -0.0869, 0.0954, 0.1071, 0.0518, 0.0516, 0.0984, 0.0869…
$ Growth <dbl> -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, 3, 23, 0, 8, 3, 5…
$ Exits <dbl> 48, 301, 8, 36, 52, 10, 86, 54, 122, 27, 10, 30, 3, 24,…
$ Transfers <dbl> 71, 936, 17, 115, 164, 22, 151, 188, 298, 67, 24, 86, 8…
$ Tot_openings <dbl> 103, 1374, 29, 160, 229, 36, 262, 252, 516, 103, 37, 13…
$ Wage <dbl> 80.74, 47.63, 37738.00, 62.05, 61.39, 57.62, 47.73, 63.…
$ Notes <chr> NA, NA, "**", NA, NA, "^", "^", NA, NA, NA, NA, NA, NA,…
summary(projections_data)
SOC Occupation Current_jobs Projected_jobs
Length:465 Length:465 Min. : 102 Min. : 104
Class :character Class :character 1st Qu.: 376 1st Qu.: 395
Mode :character Mode :character Median : 841 Median : 902
Mean : 2425 Mean : 2618
3rd Qu.: 2296 3rd Qu.: 2583
Max. :34936 Max. :39662
Change_num Change_pct Growth Exits
Min. :-1948.0 Min. :-0.35860 Min. :-195.00 Min. : 2.0
1st Qu.: 5.0 1st Qu.: 0.01650 1st Qu.: 0.00 1st Qu.: 12.0
Median : 44.0 Median : 0.06850 Median : 4.00 Median : 30.0
Mean : 193.9 Mean : 0.07258 Mean : 19.39 Mean : 105.8
3rd Qu.: 185.0 3rd Qu.: 0.12710 3rd Qu.: 18.00 3rd Qu.: 82.0
Max. : 8608.0 Max. : 0.58880 Max. : 861.00 Max. :2775.0
Transfers Tot_openings Wage Notes
Min. : 1.0 Min. : 5.0 Min. : 10.04 Length:465
1st Qu.: 21.0 1st Qu.: 36.0 1st Qu.: 18.13 Class :character
Median : 52.0 Median : 87.0 Median : 23.60 Mode :character
Mean : 164.6 Mean : 289.8 Mean : 5242.13
3rd Qu.: 143.0 3rd Qu.: 231.0 3rd Qu.: 37.30
Max. :3245.0 Max. :6071.0 Max. :129451.00
NA's :3
And finally, we need to adjust the wage column, which contains a mix of hourly and annual wage figures. Let’s convert all of them to annual wages in a new variable by multiplying any value below $200/hr by 2,080 hours/yr. We’ll use mutate() to create the new variable, and define it using case_when().4
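As a quick illustration of the case_when() logic on its own (the wage values here are made up):

```r
library(dplyr)

# made-up wages: two hourly figures and one that is already annual
wages <- c(25.50, 80.74, 37738)

case_when(
  wages < 200 ~ wages * 2080,   # treat small values as hourly and annualize
  .default = wages              # leave everything else unchanged
)
#> [1]  53040.0 167939.2  37738.0
```

The two hourly values are scaled up to annual figures, while the value that is already annual passes through untouched.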
projections_data <- projections_data |>
  mutate(
    annual_wage = case_when(
      Wage < 200 ~ Wage * 2080,
      .default = Wage
    )
  )
glimpse(projections_data)
Rows: 465
Columns: 13
$ SOC <chr> "11-1011", "11-1021", "11-1031", "11-2021", "11-2022", …
$ Occupation <chr> "Chief Executives", "General and Operations Managers", …
$ Current_jobs <dbl> 1830, 14405, 336, 1777, 2537, 376, 2900, 3250, 5457, 13…
$ Projected_jobs <dbl> 1671, 15779, 372, 1869, 2668, 413, 3152, 3353, 6422, 14…
$ Change_num <dbl> -159, 1374, 36, 92, 131, 37, 252, 103, 965, 87, 29, 228…
$ Change_pct <dbl> -0.0869, 0.0954, 0.1071, 0.0518, 0.0516, 0.0984, 0.0869…
$ Growth <dbl> -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, 3, 23, 0, 8, 3, 5…
$ Exits <dbl> 48, 301, 8, 36, 52, 10, 86, 54, 122, 27, 10, 30, 3, 24,…
$ Transfers <dbl> 71, 936, 17, 115, 164, 22, 151, 188, 298, 67, 24, 86, 8…
$ Tot_openings <dbl> 103, 1374, 29, 160, 229, 36, 262, 252, 516, 103, 37, 13…
$ Wage <dbl> 80.74, 47.63, 37738.00, 62.05, 61.39, 57.62, 47.73, 63.…
$ Notes <chr> NA, NA, "**", NA, NA, "^", "^", NA, NA, NA, NA, NA, NA,…
$ annual_wage <dbl> 167939.2, 99070.4, 37738.0, 129064.0, 127691.2, 119849.…
summary(projections_data)
SOC Occupation Current_jobs Projected_jobs
Length:465 Length:465 Min. : 102 Min. : 104
Class :character Class :character 1st Qu.: 376 1st Qu.: 395
Mode :character Mode :character Median : 841 Median : 902
Mean : 2425 Mean : 2618
3rd Qu.: 2296 3rd Qu.: 2583
Max. :34936 Max. :39662
Change_num Change_pct Growth Exits
Min. :-1948.0 Min. :-0.35860 Min. :-195.00 Min. : 2.0
1st Qu.: 5.0 1st Qu.: 0.01650 1st Qu.: 0.00 1st Qu.: 12.0
Median : 44.0 Median : 0.06850 Median : 4.00 Median : 30.0
Mean : 193.9 Mean : 0.07258 Mean : 19.39 Mean : 105.8
3rd Qu.: 185.0 3rd Qu.: 0.12710 3rd Qu.: 18.00 3rd Qu.: 82.0
Max. : 8608.0 Max. : 0.58880 Max. : 861.00 Max. :2775.0
Transfers Tot_openings Wage Notes
Min. : 1.0 Min. : 5.0 Min. : 10.04 Length:465
1st Qu.: 21.0 1st Qu.: 36.0 1st Qu.: 18.13 Class :character
Median : 52.0 Median : 87.0 Median : 23.60 Mode :character
Mean : 164.6 Mean : 289.8 Mean : 5242.13
3rd Qu.: 143.0 3rd Qu.: 231.0 3rd Qu.: 37.30
Max. :3245.0 Max. :6071.0 Max. :129451.00
NA's :3
annual_wage
Min. : 20883
1st Qu.: 37710
Median : 48776
Mean : 57792
3rd Qu.: 75712
Max. :179150
NA's :3
And now we have a clean dataset! We’ll use this further in Chapter 3 and Chapter 4, so let’s show what it looks like all together.
# load libraries
library("readxl")
library("curl")
library("tidyverse")

# define file url
projections_url <- "https://ohiolmi.com/_docs/PROJ/JobsOhio/Central.xlsx"

# download file
curl_download(projections_url, "projectionsdata.xlsx")

# read file with appropriate settings
projections_data <- read_xlsx(
  "projectionsdata.xlsx",
  skip = 6,
  col_names = c(
    "SOC",
    "Occupation",
    "Current_jobs",
    "Projected_jobs",
    "Change_num",
    "Change_pct",
    "Growth",
    "Exits",
    "Transfers",
    "Tot_openings",
    "Wage",
    "Notes"
  ),
  col_types = c(
    "text",
    "text",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "numeric",
    "text"
  )
) |>
  # remove summary occupations and notes/blank rows
  filter(
    str_sub(SOC, -4, -1) != "0000",
    str_length(SOC) == 7
  ) |>
  # create annual wage column so values are consistent
  mutate(
    annual_wage = case_when(
      Wage < 200 ~ Wage * 2080,
      .default = Wage
    )
  )
Think about the kinds of data that you work with regularly. Is it tidy? How is it stored? What functions will you need to use to work with it in R?
We used functions from several tidyverse packages. Especially when learning, it’s nice to have quick references. Tidyverse has a series of official cheat sheets that you’ll likely find useful. Take a look:
| package | what this package is about | site | cheat sheet |
|---|---|---|---|
| readxl | data import (esp. Excel files) | site | cheat sheet |
| dplyr | data transformation | site | cheat sheet |
| stringr | working with text | site | cheat sheet |
| forcats | categorical data (factors) | site | cheat sheet |
At the start of this chapter we covered importing data, but we didn’t cover exporting data. While this guide will show you how far you can take R, it can also be just one small piece of your toolchain if you wish. You could use R to do some cleaning, processing, and analysis, and then export the data for use in another tool like a dashboard product.
The tidyverse package readr includes a set of write_ functions to export data quickly. write_excel_csv() takes a data object and a filename (e.g. mydata.csv) and builds out a CSV file that can be read in Excel or other tools.
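For example, to export the cleaned projections dataset from this chapter (the output filename here is our own choice):

```r
library(readr)

# write a CSV with a byte-order mark so Excel displays special characters correctly
write_excel_csv(projections_data, "projections_clean.csv")
```

For tools other than Excel, the plain write_csv() function works the same way without the byte-order mark.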
We briefly reviewed accessing a database earlier in this chapter. DBI provides dbWriteTable() to write a data object as a database table. You must provide the database connection object, a table name, and the data object you’re storing. It can take an append argument if you want to simply add rows to an existing table.
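A minimal sketch, assuming a local SQLite database via the RSQLite package (the database file and table names here are our own, for illustration):

```r
library(DBI)
library(RSQLite)

# open (or create) a local SQLite database file
con <- dbConnect(RSQLite::SQLite(), "projections.sqlite")

# write the data object as a table named "projections"
dbWriteTable(con, "projections", projections_data)

# later, add rows to the existing table instead of overwriting it
dbWriteTable(con, "projections", projections_data, append = TRUE)

dbDisconnect(con)
```

The same dbWriteTable() call works against other DBI backends (Postgres, MySQL, and so on); only the dbConnect() line changes.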
Yes, there are a couple of packages designed for accessing IPEDS data, including this one on CRAN and this one on GitHub. But we’re using the files directly from the IPEDS data center because we are focused on learning how to import and clean data.↩︎
And included in the dictionary file or the web interface↩︎
By default, the tidyverse read_ functions guess at column types by examining the fields, but this guessing is imperfect.↩︎
case_when() is inspired by the SQL CASE statement, and is more elegant than the nested IF() functions you may be used to using in Excel.↩︎