Analyzing Disaggregated Data

via {marginaleffects} & {clarify} in

Sakeef M. Karim
New York University


CONSORTIUM ON ANALYTICS FOR DATA-DRIVEN DECISION-MAKING

What We’ll Do in the Next 15-20 Minutes

  • Load a few important packages in .
library(tidyverse)
library(marginaleffects)
library(clarify)
library(modelsummary)
library(ggdist)
  • Fit a basic logit model with a multiplicative interaction term.

  • Generate adjusted predictions, average marginal effects and marginalized comparisons using the {marginaleffects}   package.

  • Use simulation-based inference to simplify the heterogeneity we’ve identified — via {marginaleffects} and {clarify} .

  • Visualize our results using different functions or libraries.

Our Model

Predicting the Sex of Penguins

Once again, let’s use the wonderful {palmerpenguins}   library.

Code
library(palmerpenguins)

penguins_data <- penguins %>% mutate(year = as_factor(year))

penguins_data
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <fct>


model <- penguins_data %>% 
                          glm(fct_rev(sex) ~
                              flipper_length_mm * species +
                              year,
                              family = binomial,
                              data = .)
1
Reversing our factor order so that “male” is the omitted category — and so that our model predicts whether a penguin is female.
2
Allowing multiplicative interaction between flipper length and species on the right-hand side.
3
Accounting for time trend via year fixed effects.
4
Using default logistic link function for binomial generalized linear models.


Here, we’re fitting a crude model to map how the association between flipper length (our focal predictor) and sex (our outcome of interest) is moderated by the species a penguin belongs to. We account for time trends, too.

Code
# For more information, see: https://popagingdataviz.com/day3.html

var_labels <- rev(c("flipper_length_mm" = "Flipper Length (mm)", 
                    "flipper_length_mm:speciesGentoo" = "Flipper Length (mm) x Gentoo",
                    "flipper_length_mm:speciesChinstrap" = "Flipper Length (mm) x Chinstrap"))

modelplot(model,
          # Variable names:
          coef_map = var_labels,
          # 99% confidence interval:
          conf_level = 0.99) + 
# Marking significance; line:
geom_vline(xintercept = 0, lty = "dotted") +
# Marking significance; colour scheme:
aes(colour = ifelse(p.value < 0.01, "Significant", "Not significant")) +
scale_colour_manual(values = c("grey", "#3f1f69")) +
labs(colour = "", x = "Log Odds That Penguin Is Female\n (with 99% Confidence Intervals)") +
theme_bw(base_family = "IBM Plex Sans",
         base_size = 13)

Leveraging {marginaleffects}

The Marginal Effects Zoo

A brilliant library developed by Vincent Arel-Bundock (2024).

www.marginaleffects.com

Adjusted Predictions

perc_flipper <- penguins_data %>% summarise(flipper_25 = 
                                            quantile(flipper_length_mm, 0.25, na.rm = TRUE),
                                            flipper_75 = 
                                            quantile(flipper_length_mm, 0.75, na.rm = TRUE))  

avg_predictions(model,
                variables = list(species = unique,
                                 flipper_length_mm = c(perc_flipper$flipper_25, 
                                                       perc_flipper$flipper_75)))  

 flipper_length_mm   species Estimate Pr(>|z|)    S    2.5 % 97.5 %
               190 Adelie     0.49812    0.967  0.0 0.410239  0.586
               190 Gentoo     0.99998   <0.001 27.0 0.999235  1.000
               190 Chinstrap  0.86182   <0.001 10.9 0.689597  0.946
               213 Adelie     0.02553   <0.001 17.4 0.005405  0.112
               213 Gentoo     0.80988   <0.001 14.7 0.681198  0.895
               213 Chinstrap  0.00707   <0.001 14.5 0.000664  0.071

Columns: species, flipper_length_mm, estimate, p.value, s.value, conf.low, conf.high 
Type:  invlink(link) 
Code
plot_predictions(model,
                 # Generating conditional predictions along two axes:
                 condition = c("flipper_length_mm", "species")) +
labs(x = "Flipper Length (mm)",
     y = "Probability That Penguin Is Female") +
theme_bw(base_family = "IBM Plex Sans",
         base_size = 13) 

Average Marginal Effects

avg_slopes(model,
           variables = "flipper_length_mm",
           by = "species",
           newdata = datagrid(species = unique,
                              grid_type = "counterfactual"))

              Term    Contrast   species Estimate Std. Error      z Pr(>|z|)
 flipper_length_mm mean(dY/dX) Adelie     -0.0212   0.002295  -9.24   <0.001
 flipper_length_mm mean(dY/dX) Chinstrap  -0.0254   0.001618 -15.67   <0.001
 flipper_length_mm mean(dY/dX) Gentoo     -0.0184   0.000704 -26.17   <0.001
     S   2.5 %  97.5 %
  65.2 -0.0257 -0.0167
 181.5 -0.0285 -0.0222
 499.0 -0.0198 -0.0170

Columns: term, contrast, species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
Type:  response 
Code
plot_slopes(model,
            variables = "flipper_length_mm",
            by = "species",
            newdata = datagrid(species = unique,
                               grid_type = "counterfactual")) +
geom_hline(yintercept = 0, lty = "dotted") +
theme_bw(base_family = "IBM Plex Sans",
         base_size = 13) +
labs(y = "Average Marginal Effect of Flipper Length on\n Probability of Penguin Being Female") +
aes(colour = species) +
theme(legend.position = "none") +
coord_flip()

Average Comparisons

avg_comparisons(model, 
                variables = "flipper_length_mm",
                by = "species", 
                hypothesis = "pairwise",
                newdata = datagrid(species = unique,
                                   grid_type = "counterfactual"))

               Term Estimate Std. Error     z Pr(>|z|)    S     2.5 %   97.5 %
 Adelie - Chinstrap  0.00483    0.00283  1.71   0.0877  3.5 -0.000714  0.01038
 Adelie - Gentoo    -0.00245    0.00237 -1.03   0.3021  1.7 -0.007101  0.00220
 Chinstrap - Gentoo -0.00728    0.00183 -3.98   <0.001 13.8 -0.010868 -0.00369

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 
Code
plot_comparisons(model,
                 variables = "species",
                 condition = "flipper_length_mm",
                 newdata = datagrid(species = unique,
                                    grid_type = "counterfactual")) +
geom_hline(yintercept = 0, lty = "dotted") +
labs(x = "Flipper Length (mm)") +
theme_bw(base_family = "IBM Plex Sans",
         base_size = 13)

Simulation-Based Inference

Simulated Predictions via {clarify}

A library inspired by King, Tomz, and Wittenberg’s (2000) seminal article.

iqss.github.io/clarify

Code
sim_estimates <- sim(model, n = 10000)

sim_prediction <- sim_setx(sim_estimates, 
                           x = list(species = c("Adelie", "Chinstrap"),
                                    flipper_length_mm = c(perc_flipper$flipper_25,
                                                          perc_flipper$flipper_75)),
                           verbose = FALSE)

sim_prediction %>% plot() + 
                   theme_bw(base_family = "IBM Plex Sans",
                            base_size = 13) +
                   theme(legend.position = "top")

Simulated AMEs via {marginaleffects}

Code
 avg_slopes(model,
             variables = "flipper_length_mm",
             by = "species",
             newdata = datagrid(species = unique,
                                grid_type = "counterfactual")) %>% 
  inferences(method = "simulation") %>% 
  posterior_draws("rvar")  %>% 
  ggplot(aes(xdist = rvar, 
             y = species, 
             colour = species,
             fill = species)) +
  stat_slabinterval(alpha = 0.4) +
  theme_bw(base_family = "IBM Plex Sans",
           base_size = 17) +
  labs(x = "Average Marginal Effect of Flipper Length on\n Probability of Penguin Being Female",
       y = "")  +
  theme(legend.position = "none") +
  geom_vline(xintercept = 0, lty = "dotdash")

Additional Resources

More Things To Read

The End

sakeefkarim
sakeefkarim.com
code

References

Arel-Bundock, Vincent. 2024. “Marginal Effects Zoo.” https://marginaleffects.com/.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44 (2): 347–61. https://doi.org/10.2307/2669316.
Long, J. Scott, and Sarah A. Mustillo. 2021. “Using Predictions and Marginal Effects to Compare Groups in Regression Models for Binary Outcomes.” Sociological Methods & Research 50 (3): 1284–1320. https://doi.org/10.1177/0049124118799374.
Mize, Trenton D. 2019. “Best Practices for Estimating, Interpreting, and Presenting Nonlinear Interaction Effects.” Sociological Science 6 (February): 81–117. https://doi.org/10.15195/v6.a4.
Mize, Trenton D., Long Doan, and J. Scott Long. 2019. “A General Framework for Comparing Predictions and Marginal Effects Across Models.” Sociological Methodology 49 (1): 152–89. https://doi.org/10.1177/0081175019852763.