Building reproducible
workflows with {targets}

Welcome!

Plan for today

  • Why care about workflows?

  • How {targets} works

  • Hands-on {targets} practice!

About me

Andrew Heiss

  • Assistant professor of public policy, Georgia State University

  • Data visualization, statistics, and causal inference

Andrew's headshot

Follow along

All the materials for today are accessible at

andhs.co/northwestern

Why care about workflows?

Statistical research
is a complicated,
messy process!

Itty bitty pieces

  • Data
  • Statistical results
  • Code
  • Fieldwork
  • Interviews
  • Analysis
  • Figures
  • Images
  • Tables
  • Citations
  • Your actual words

Each of these comes
from different places!

 

Each of these can be
in a different state!

Approaches for handling all the itty bitty pieces

The Office model

Put everything in one document

  • Everything lives in one .docx file

The Engineering model

Embrace the bittiness and compile it all at the end

  • Everything lives separately and is combined in the end
  • Quarto!

Approaches for handling different states

YOLO workflow

Try to remember to re-run all the scripts whenever the data changes, replace old figures/tables/values with new ones, and manually run everything in the right order.

Procedural workflow

Carefully document the precise order that your scripts run, maybe even with a master script that runs everything for you. Run the master script when data changes and rebuild the whole thing every time. Maybe get fancy with things like Quarto caching/freezing.

Functional workflow

Divide workflow into separate objects and let software keep track of which things are out of date and orchestrate which things need to re-run. Run one command to rebuild the whole project, skipping dependencies that don’t need to build again.

My own workflow journey

YOLO workflow

01_clean.R + 02_analysis.R + 03_plots.R

Procedural workflow

R Markdown/Quarto websites (example)

01_clean.Rmd + 02_analysis.Rmd + 03_plots.Rmd + caching

Functional workflow

Makefiles (example) → {targets} pipelines (example)

How {targets} works

{targets} documentation

General workflow

  • Create functions that make things (or “targets”; distinct objects that you can do stuff with)
  • Build these targets with tar_make()
    • {targets} keeps track of upstream and downstream dependencies and skips targets if nothing has changed
  • Load a target into an R session with tar_load(target_name) or blah <- tar_read(target_name)

Anatomy of _targets.R

_targets.R
library(targets)

# General pipeline settings
# ---------------------------
tar_option_set(
  packages = c("tibble") # Packages that your targets need for their tasks.
)

# Load functions
# ----------------
# Run the R scripts in the R/ folder with your custom functions:
tar_source()

# Actual pipeline
# -----------------
list(
  tar_target(
    name = data,  # Conceptually the same as saying `data <- tibble(...)`
    command = tibble(x = rnorm(100), y = rnorm(100))
  ),
  tar_target(
    name = model,  # Conceptually the same as saying `model <- coefficients(...)`
  
    command = coefficients(lm(y ~ x, data = data))
  )
)
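Because tar_source() loads helper scripts from the R/ folder, the targets above are usually written as calls to your own functions. A minimal sketch of what such a file might contain (the file name and function names are hypothetical, and base data.frame stands in for tibble so the sketch is self-contained):

```r
# R/functions.R -- custom functions picked up by tar_source()
# (illustrative sketch; these names are not from the slides)

# Simulate a small dataset (base data.frame used here for self-containment)
make_data <- function(n = 100) {
  data.frame(x = rnorm(n), y = rnorm(n))
}

# Fit a simple linear model and return its coefficients
fit_model <- function(df) {
  coefficients(lm(y ~ x, data = df))
}
```

With helpers like these in place, the pipeline targets reduce to `tar_target(data, make_data())` and `tar_target(model, fit_model(data))`.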

Viewing the pipeline

tar_glimpse()

tar_visnetwork()

Building the pipeline

Build the whole pipeline:

tar_make()
#> + data dispatched                           
#> ✔ data completed [5ms, 1.75 kB]
#> + model dispatched
#> ✔ model completed [2ms, 113 B]
#> ✔ ended pipeline [132ms, 2 completed, 0 skipped]

Build it again and everything gets skipped:

tar_make()
#> ✔ skipped pipeline [61ms, 2 skipped]

Change something in model, then re-run:

tar_make()
#> + model dispatched                          
#> ✔ model completed [1ms, 108 B]
#> ✔ ended pipeline [88ms, 1 completed, 1 skipped]

Build specific targets:

tar_make(model)

Build multiple targets:

tar_make(c(data, model))

Use tidyselect selectors:

tar_make(starts_with("model_"))
tar_make(contains("tbl"))

Using targets

In a different R script or Quarto file:

library(targets)

# This loads the target as its name
tar_load(data)

# Do stuff with it
plot(data)

If you don’t want to use the target’s actual name, use tar_read():

library(targets)

# This lets you assign the target to a new object
my_neat_data <- tar_read(data)

# Do stuff with it
plot(my_neat_data)

Behind the scenes

{targets} stores each target as an extension-less file (RDS format by default) in _targets/objects/.

You can access a full data frame of all the target metadata if you really want:

tar_meta() |> View()

Neat advanced stuff

  • Automatic parallel processing
  • Automatic remote HPC processing
  • Store targets in the cloud
  • Programmatically generate targets

{targets} and elections

That’s all really abstract—
let’s practice {targets} together!

andhs.co/northwestern