3  Statistical analysis & data visualizations

3.1 Statistical analysis

R was created as a statistics-focused programming language, so it’s important to at least briefly showcase how R can be used to run statistical methods and quickly produce results that can then be visualized, used in reports, and stored for future use. However, institutional researchers come from many different backgrounds, and we don’t want to assume that all IR professionals reading this guide have the same level of knowledge of statistical methods. We’ll keep this section brief, but know that if a method exists, it likely has an implementation in R.

In this section we’ll build on the occupational projections data that we worked with in Section 2.5, which is stored as the R data object projections_data.

Descriptive statistics

During data cleaning, we used the summary() function in conjunction with glimpse() to quickly examine a dataset. The summary() function displays basic descriptive statistics for every numeric variable in the dataset (and the length, class, and mode of character variables):

summary(projections_data)
     SOC             Occupation         Current_jobs   Projected_jobs 
 Length:465         Length:465         Min.   :  102   Min.   :  104  
 Class :character   Class :character   1st Qu.:  376   1st Qu.:  395  
 Mode  :character   Mode  :character   Median :  841   Median :  902  
                                       Mean   : 2425   Mean   : 2618  
                                       3rd Qu.: 2296   3rd Qu.: 2583  
                                       Max.   :34936   Max.   :39662  
                                                                      
   Change_num        Change_pct           Growth            Exits       
 Min.   :-1948.0   Min.   :-0.35860   Min.   :-195.00   Min.   :   2.0  
 1st Qu.:    5.0   1st Qu.: 0.01650   1st Qu.:   0.00   1st Qu.:  12.0  
 Median :   44.0   Median : 0.06850   Median :   4.00   Median :  30.0  
 Mean   :  193.9   Mean   : 0.07258   Mean   :  19.39   Mean   : 105.8  
 3rd Qu.:  185.0   3rd Qu.: 0.12710   3rd Qu.:  18.00   3rd Qu.:  82.0  
 Max.   : 8608.0   Max.   : 0.58880   Max.   : 861.00   Max.   :2775.0  
                                                                        
   Transfers       Tot_openings         Wage              Notes          
 Min.   :   1.0   Min.   :   5.0   Min.   :    10.04   Length:465        
 1st Qu.:  21.0   1st Qu.:  36.0   1st Qu.:    18.13   Class :character  
 Median :  52.0   Median :  87.0   Median :    23.60   Mode  :character  
 Mean   : 164.6   Mean   : 289.8   Mean   :  5242.13                     
 3rd Qu.: 143.0   3rd Qu.: 231.0   3rd Qu.:    37.30                     
 Max.   :3245.0   Max.   :6071.0   Max.   :129451.00                     
                                   NA's   :3                             
  annual_wage    
 Min.   : 20883  
 1st Qu.: 37710  
 Median : 48776  
 Mean   : 57792  
 3rd Qu.: 75712  
 Max.   :179150  
 NA's   :3       

Here we can quickly identify some important information about the dataset:

  • it includes 465 occupations, each currently employing between 102 and 34,936 persons,1
  • the typical occupation is projected to grow 7% over the next ten years,
  • the median occupation will have 87 openings each year,2 and
  • the median occupation pays $48,776.
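
If you need any one of these statistics on its own (for a report, say), you can also compute it directly with base R functions; a minimal sketch using the columns above:

# median annual openings across all occupations
median(projections_data$Tot_openings)

# median annual wage, skipping the 3 missing values noted in the summary
median(projections_data$annual_wage, na.rm = TRUE)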

Inferential statistics: linear regression

Now let’s use linear regression to build a simple model that tests whether annual wage predicts the rate at which persons leave an occupation.

We’ll first need to create an additional variable to calculate the occupational turnover rate, which we’ll use as our outcome variable.

Then we’ll use the lm() function provided in base R. We’ll store the regression model as an R object so we can work with it later. We describe the model with a formula in the format outcome_variable ~ predictor1 (+ predictor2 ...). We also provide the optional parameter na.action = na.exclude, which excludes missing values from the fit but keeps placeholders for them, so that functions like residuals() return values aligned with the original rows.

projections_data <- projections_data |>
    mutate(
        turnover_rate = (Exits + Transfers) / Current_jobs
        )

projections_model <- lm(turnover_rate ~ annual_wage, 
                            data = projections_data,
                            na.action = na.exclude)

summary(projections_model)

Call:
lm(formula = turnover_rate ~ annual_wage, data = projections_data, 
    na.action = na.exclude)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.076254 -0.014932 -0.004293  0.010591  0.168017 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.464e-01  2.962e-03   49.44   <2e-16 ***
annual_wage -7.687e-07  4.637e-08  -16.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02709 on 460 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.374, Adjusted R-squared:  0.3726 
F-statistic: 274.8 on 1 and 460 DF,  p-value: < 2.2e-16

You can see that annual wage does appear to be a strong predictor of occupational turnover rate: its coefficient is statistically significant, and the model explains a considerable proportion of the variance (R-squared of about 0.37).
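
Since we stored the model as an R object, we can also pull individual values out of it programmatically, which is handy for reports; a quick sketch:

# coefficients and R-squared, extracted from the stored model object
coef(projections_model)
summary(projections_model)$r.squared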

Keen-eyed readers will note that wages may not be sufficiently normal to use directly in a linear regression and likely need a transformation. Right-skewed quantities like wages can often be made approximately normal with a log transform. We can make this adjustment right in our model formula, without even needing a new variable:

projections_model <- lm(turnover_rate ~ log(annual_wage), 
                            data = projections_data,
                            na.action = na.exclude)

summary(projections_model)

Call:
lm(formula = turnover_rate ~ log(annual_wage), data = projections_data, 
    na.action = na.exclude)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.070564 -0.013939 -0.001799  0.011336  0.151656 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.688846   0.029031   23.73   <2e-16 ***
log(annual_wage) -0.054000   0.002669  -20.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02491 on 460 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.4708,    Adjusted R-squared:  0.4697 
F-statistic: 409.3 on 1 and 460 DF,  p-value: < 2.2e-16

This transformation seems to have improved our model considerably: it now explains an even larger share of the variance in occupational turnover rate (adjusted R-squared up from about 0.37 to 0.47).
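
If you’d like to inspect the fit visually, base R’s plot() method for lm objects produces the standard diagnostic plots (residuals vs. fitted values, a normal Q-Q plot, and so on); a minimal sketch:

# arrange the four standard diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(projections_model)
par(mfrow = c(1, 1)) # reset the plotting layout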

The residuals may be of interest,3 so let’s pull them back into our projections_data:

projections_data <- projections_data |>
    mutate(residuals = residuals(projections_model))

glimpse(projections_data)
Rows: 465
Columns: 15
$ SOC            <chr> "11-1011", "11-1021", "11-1031", "11-2021", "11-2022", …
$ Occupation     <chr> "Chief Executives", "General and Operations Managers", …
$ Current_jobs   <dbl> 1830, 14405, 336, 1777, 2537, 376, 2900, 3250, 5457, 13…
$ Projected_jobs <dbl> 1671, 15779, 372, 1869, 2668, 413, 3152, 3353, 6422, 14…
$ Change_num     <dbl> -159, 1374, 36, 92, 131, 37, 252, 103, 965, 87, 29, 228…
$ Change_pct     <dbl> -0.0869, 0.0954, 0.1071, 0.0518, 0.0516, 0.0984, 0.0869…
$ Growth         <dbl> -16, 137, 4, 9, 13, 4, 25, 10, 96, 9, 3, 23, 0, 8, 3, 5…
$ Exits          <dbl> 48, 301, 8, 36, 52, 10, 86, 54, 122, 27, 10, 30, 3, 24,…
$ Transfers      <dbl> 71, 936, 17, 115, 164, 22, 151, 188, 298, 67, 24, 86, 8…
$ Tot_openings   <dbl> 103, 1374, 29, 160, 229, 36, 262, 252, 516, 103, 37, 13…
$ Wage           <dbl> 80.74, 47.63, 37738.00, 62.05, 61.39, 57.62, 47.73, 63.…
$ Notes          <chr> NA, NA, "**", NA, NA, "^", "^", NA, NA, NA, NA, NA, NA,…
$ annual_wage    <dbl> 167939.2, 99070.4, 37738.0, 129064.0, 127691.2, 119849.…
$ turnover_rate  <dbl> 0.06502732, 0.08587296, 0.07440476, 0.08497468, 0.08513…
$ residuals      <dbl> 0.0258700295, 0.0182162354, -0.0453703761, 0.0315996376…
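
With the residuals stored as a column, we can put them to use right away, for example to see which occupations deviate most from the model; a minimal sketch:

# five occupations with the largest deviation from the wage-based model
projections_data |>
    slice_max(order_by = abs(residuals), n = 5) |>
    select(Occupation, annual_wage, turnover_rate, residuals)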

3.2 Data visualizations

R can be used to create all kinds of graphs with the ggplot2 package from the tidyverse. The ggplot2 package is based on (and named after) a book, The Grammar of Graphics (Wilkinson 2012). It is arguably one of the most complex tidyverse packages, so you’ll want to have the package website and cheat sheet available for review while you’re learning it.

In this section we’ll work with the IPEDS data we cleaned in Chapter 2.

Visualization #1

For our first visualization, we’ll keep the data simple. We just want to create a bar chart summarizing the number of postsecondary institutions in Ohio by sector.

We first need to aggregate the data a bit. We’ll use summarize() and group_by(). These functions act like the steps in an Excel pivot table, with group_by() selecting which variables define the rows, and summarize() applying aggregation functions like mean() or n_distinct() to calculate values.

sector_counts <- combined |>
    group_by(SECTOR) |>
    summarize(
        unique_inst = n_distinct(UNITID.x)
    )

A plot in ggplot2 includes three basic parts: the dataset you’re using, a definition of the coordinate system and aesthetics (the aes() mapping), and the geometry (shapes) that you want to display. We use + to add on the geometry layer (and any further optional layers).

sector_counts |> # start with the summarized data
    # use an x,y coordinate system with SECTOR and unique_inst
    ggplot(aes(x = SECTOR, y = unique_inst)) +
    # apply bar shapes by using values in the data
    geom_col()

This isn’t very pretty, but we can make some adjustments. First, we should consider the order of our chart, which we can rearrange using reorder(), providing the variable to sort by (with a leading - for descending order):

sector_counts |> # start with the summarized data
    # use an x,y coordinate system with SECTOR and unique_inst
    ggplot(aes(x = reorder(SECTOR, -unique_inst), y = unique_inst)) +
    # apply bar shapes by using values in the data
    geom_col()

Now let’s start to label it. We would like both data labels and axis labels, as well as a title:

sector_counts |> # start with the summarized data
    # use an x,y coordinate system with SECTOR and unique_inst
    ggplot(aes(x = reorder(SECTOR, -unique_inst), y = unique_inst)) +
    # apply bar shapes by using values in the data
    geom_col() +
    # add data labels
    geom_text(aes(label = unique_inst), vjust = -0.5, size = 3) +
    # label axes and provide title
    labs(
        title = "Distribution of Ohio Institutions by Sector",
        x = "Sector",
        y = "Number of institutions"
    )

Now let’s pretty it up. We need to adjust the x-axis labels and change the colors. We can change the bar color with an optional fill parameter to geom_col(), and adjust the labels and background with a theme() layer:

sector_counts |> # start with the summarized data
    # use an x,y coordinate system with SECTOR and unique_inst
    ggplot(aes(x = reorder(SECTOR, -unique_inst), y = unique_inst)) +
    # apply bar shapes by using values in the data, use blue color
    geom_col(fill = "blue") +
    # add data labels
    geom_text(aes(label = unique_inst), vjust = -0.5, size = 3) +
    # label axes and provide title
    labs(
        title = "Distribution of Ohio Institutions by Sector",
        x = "Sector",
        y = "Number of institutions"
    ) +
    theme(
        axis.text.x = element_text(angle = 40, hjust = 1, size = 6),
        panel.background = element_blank()
    )

That’s much nicer! Let’s make a few more adjustments. We can specify a particular shade of blue with an exact hex code: Columbus State Community College uses a blue of R:0 G:114 B:152, which is the hex code #007298. Then we’ll add a caption (in the labels layer labs()), a box around the chart (a new geom_rect() layer), and additional theme() options:

sector_counts |> # start with the summarized data
    # use an x,y coordinate system with SECTOR and unique_inst
    ggplot(aes(x = reorder(SECTOR, -unique_inst), y = unique_inst)) +
    # apply bar shapes by using values in the data, use blue color
    geom_col(fill = "#007298", color = "black") +
    # add data labels
    geom_text(aes(label = unique_inst), vjust = -0.5, size = 3) +
    # label axes and provide title
    labs(
        title = "Distribution of Ohio Institutions by Sector",
        x = "Sector",
        y = "Number of institutions",
        # new caption
        caption = "Data Source: IPEDS Data Center"
    ) +
    theme(
        axis.text.x = element_text(angle = 40, hjust = 1, size = 6),
        panel.background = element_blank(),
        axis.title.x = element_text(margin = margin(t = 10), color = "black"),
        axis.title.y = element_text(margin = margin(r = 10), color = "black"),
        axis.text = element_text(color = "black")
    ) +
    # add box around chart
    geom_rect(
        # constant values belong outside aes(), since they aren't mapped to data
        xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf,
        color = "black",
        fill = NA
        )
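
Once you’re happy with a chart, ggplot2’s ggsave() function will save the most recent plot to disk for use in a report; a minimal sketch (the filename and dimensions are just examples):

# save the last plot rendered as a PNG
ggsave("sector_counts.png", width = 6, height = 4, dpi = 300)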

Visualization #2

Next, we’ll illustrate how to create a different style of chart that may be helpful with more complex data. We’ll continue to use the IPEDS dataset, but now we’ll work with the completions data.

First we’ll again want to summarize the data. Let’s get the sum of awards (CTOTALT) by CIPCODE within SECTOR, focusing only on the public sector. We also need to filter out CIPCODE 99, a special code representing the total across all programs.

sector_awards <- combined |>
    filter(
        CIPCODE != 99, # remove special code for totals at award level
        str_sub(SECTOR, 1, 6) == "Public"  # focus only on public sector for now
        ) |>
    group_by(SECTOR, CIPCODE) |>
    summarize(TotalAwards = sum(CTOTALT))

Since there are a lot of CIP codes, we likely want to look at just the top CIP codes by number of awards. We’ll use slice_max() to pick the top 5 CIP codes; since the data is still grouped by SECTOR, this returns the top 5 within each sector:

top_awards <- sector_awards |>
    slice_max(order_by = TotalAwards, n = 5)

top_awards
# A tibble: 15 × 3
# Groups:   SECTOR [3]
   SECTOR                  CIPCODE TotalAwards
   <fct>                   <chr>         <dbl>
 1 Public 4-year or above  51.3801        5998
 2 Public 4-year or above  52.0411        4565
 3 Public 4-year or above  52.0201        3798
 4 Public 4-year or above  24.0199        2947
 5 Public 4-year or above  24.0101        2496
 6 Public 2-year           24.0101        3422
 7 Public 2-year           52.0201        3237
 8 Public 2-year           43.0107        1441
 9 Public 2-year           51.3801        1249
10 Public 2-year           24.0102        1231
11 Public less-than-2-year 51.3901         858
12 Public less-than-2-year 51.3902         775
13 Public less-than-2-year 51.0904         512
14 Public less-than-2-year 43.0203         402
15 Public less-than-2-year 49.0205         402

Now we’ll create a more complex visual, still using the bar chart. We’ll work off of the top_awards data we just created. We’ll start with the bar chart mostly as before in Visualization #1, except that the bars now run horizontally (CIP codes on the y-axis) and are colored by sector.

Next we’ll add data labels to the bars, much like in Visualization #1.

Then we’ll add a new feature: facets. By adding a facet_wrap() layer, we’ll create a multiple-plot arrangement with one panel per sector. Since each sector has its own top CIP codes, we’ll let the y-axes vary independently.

Next, we’ll label the plot with a title and axis labels.

We’ll use a theme() layer again to modify the look, including applying a classic theme this time. We’ll also specify a custom color palette.

top_awards |>
    ggplot(aes(x = TotalAwards, y = reorder(CIPCODE, -TotalAwards))) +
    # apply bar geometry but allow colors to vary by sector
    geom_col(aes(fill = SECTOR)) +
    # add data labels
    geom_text(aes(label = TotalAwards), hjust = -0.25, size = 3, color = "black") +
    # add facet layer
    facet_wrap(
        ~SECTOR,
        scales = "free_y"
    ) +
    # axis labels and title
    labs(
        title = "Top 5 CIP codes by awards conferred, public institutions",
        x = "Awards conferred",
        y = "CIP code",
        caption = "Data Source: IPEDS Data Center"
    ) +
    # apply theming
    theme_classic() +
    theme(
        axis.text = element_text(size = 8, color = "black"),
        legend.position = "none"
    ) +
    # apply color palette
    scale_fill_manual(values = c("#003E52", "#99DAEA", "#646569"))

Now you have a taste of the complexity that you can add with ggplot2!

Exercises

Exercise 1

Describe how you would quickly examine and summarize a dataset.

# examine a dataset: glimpse()
# summary descriptive statistics: summary()

Exercise 2

Open the ggplot2 cheat sheet. Take a look in particular at the “Geoms” section. What types of graphs do you think will be most useful in your work?

Extra: logistic regression

When working with student data in institutional research, we often create binary variables like retention, persistence, transfer, and graduation. Outcome variables like these call for methods designed for binary outcomes, such as logistic regression.

Luckily, base R provides the glm() function for fitting generalized linear models, a family of models that includes logistic regression. glm() takes a family parameter; setting family = "binomial" specifies logistic regression.4
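
As a brief illustration: a minimal sketch assuming a hypothetical student-level data frame student_data with a 0/1 outcome retained and illustrative predictors gpa and credit_hours (these names are ours, not from any dataset in this guide):

# hypothetical example: model retention as a function of GPA and credit hours
retention_model <- glm(retained ~ gpa + credit_hours,
                       data = student_data, # hypothetical data frame
                       family = "binomial")

summary(retention_model)

# exponentiate the coefficients to interpret them as odds ratios
exp(coef(retention_model))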

Extra: propensity score matching

Another increasingly important technique for the IR toolbox is propensity score matching, which can be used to create comparison groups for impact evaluation, taking into account factors that may be associated with the likelihood of participating in a program. The MatchIt package provides the functions to carry out this approach, as well as vignettes that provide a good summary (Ho et al. 2023).
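
To give a flavor of the workflow, here is a minimal sketch using MatchIt’s matchit() and match.data() functions, again with hypothetical variable names (participated as a 0/1 program-participation indicator):

library(MatchIt)

# hypothetical example: match participants to similar non-participants
match_out <- matchit(participated ~ gpa + age + credit_hours,
                     data = student_data, # hypothetical data frame
                     method = "nearest")  # nearest-neighbor matching

summary(match_out) # check covariate balance between the matched groups
matched_data <- match.data(match_out) # extract the matched sample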


  1. Though the summary occupation lines likely cover persons in all occupations, it seems an occupation has to have 100 or more workers to be included at a detail level in these reports.↩︎

  2. From growth ~4/yr, exits ~30/yr, and transfers ~52/yr. Exits are people in the occupation leaving the labor force (mainly retirements), while transfers are people moving into a different occupation.↩︎

  3. So that we can use them in Chapter 4!↩︎

  4. By default, this uses the logit link function, but using family = binomial(link = "probit") will change it to probit regression, etc.↩︎