First steps with {targets}

How to develop reproducible pipelines with R

Cervin Guyomar

Who am I?

  • First time at R Toulouse 🫣
  • Bioinformatician at INRAE GenPhySE, REGLISS group: genomic regulation in farm animals
  • Member of the Sigenae platform
  • I mostly work with Python, R, Bash and Nextflow
  • I do my best to be FAIR

I (we?) love Quarto

  • Binds together all of my R (and more) analyses
  • Interactive first, then wrap it up as a reproducible report
  • Integrates code, results and interpretation
  • Great communication tool

How I (we?) work sometimes

Start with a single Quarto notebook for exploration

  • Did I change the input data?
  • Did I change the code at some point?

➔ I should rerun it just to be sure

[Diagram: notebook.qmd takes data.csv as input and produces a report and results]

How I (we?) work sometimes

Add some parameters

  • How much of the code is redundant between my runs?
  • What parameter set did I use?

➔ I should rerun it just to be sure

[Diagram: notebook.qmd with parameters (param1: "A", param2: "B") takes data.csv as input and produces a report and results]

How I (we?) work sometimes

Make the pipeline more complex with (skippable) tasks

  • Has anything changed in the code or intermediate files?
  • Can I trust the Quarto cache: true option, or implement a sketchy skipping process myself?

➔ I should rerun it just to be sure

[Diagram: preprocess.qmd (rerun: "True", skip: "False") and notebook.qmd (param1: "A", param2: "B") take data.csv as input and produce a report and results]

Quarto caching

Quarto implements a caching system to avoid re-running expensive chunks when compiling the report

#| cache: true 

time_consuming_function(data)

It seems great, but:

  • Works at the chunk level
  • Cache is invalidated only if the code of that particular chunk changes (there is a cumbersome dependson chunk option to declare cross-chunk dependencies)
  • I never quite understand how it behaves
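
A minimal sketch of the dependson workaround (a knitr chunk option; the "load-data" chunk label here is purely illustrative):

```r
#| label: fit-model
#| cache: true
#| dependson: "load-data"  # invalidate this cache when the "load-data" chunk changes

time_consuming_function(data)
```

Every dependency has to be declared by hand, chunk by chunk, which is exactly what gets cumbersome.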

My conclusion:

  • I spend a lot of time “re-running things just to be sure”
  • Quarto can handle moderately complex analyses, but it becomes cumbersome when some steps are expensive or when you need incremental re-runs.

A joke

There are 2 truly difficult problems in Computer Science

  • 0: Naming things
  • 1: Cache invalidation
  • 2: Off by one errors

{targets} helps solve cache invalidation to smooth complex R analyses

My use-case

RNA-seq: experimental observation of the level of activity of a gene in a given biological sample

A common approach is to compare activity levels gene-wise between multiple sets of samples under different conditions. For that we need:

  • Count normalization between samples (using a variety of statistical tools)
  • Statistical testing of expression levels (a.k.a. differential expression) (using a variety of statistical tools and model designs)
  • Visualization / exploration / characterization of the significant genes

With Sylvain Foissac (INRAE, GenPhySE) and Nathalie Vialaneix (INRAE, MIAT), we developed a set of notebooks to perform these tasks routinely

➔ I started to develop tagadiff from this code base

What is {targets}

A pipeline toolkit for R

  • 100% R based
  • Follows the make philosophy: define target objects and let the package determine what needs to be computed
  • Encourages clean, function-based code
  • Manages a lot of things for you

{targets} philosophy

Everything that happens is a function

read_data <- function(input_file){
  read_csv(...) %>% filter(...)
}

process_data <- function(data){
  ...
}

plot_results  <- function(results, data){
  ggplot(data) +
    ...
}

{targets} philosophy

Functions are bound together by defining targets

  • tar_target(downstream_target, f(upstream_target)):

    A target named downstream_target which depends on a target upstream_target and a function f()

tar_target(
  name,            # Name used by high level {targets} functions
  command,         # Function call
  pattern = NULL,  # Branching
  ...
)
  • {tarchetypes} extends {targets} with extra target factories (mainly for branching) and the convenient tar_plan() function
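
As a sketch, tar_plan() from {tarchetypes} lets you write name = command pairs instead of full tar_target() calls (target and function names here are illustrative):

```r
library(targets)
library(tarchetypes)

tar_plan(
  data = simulate_data(),                  # shorthand for tar_target(data, simulate_data())
  model = fit_model(data),
  tar_target(plot, plot_fit(data, model))  # plain tar_target() calls still work alongside
)
```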

Initialize a pipeline

library(targets)
use_targets()
.
├── R
│   └── functions.R
├── run.R
├── run.sh
└── _targets.R

First targets

_targets.R

library(targets)
library(ggplot2)
# Run the R scripts in the R/ folder with your custom functions:
tar_source()
# source("other_functions.R") # Source other scripts as needed.

list(
  tar_target(
    name = data,
    command = simulate_data()
    # format = "feather" # efficient storage for large data frames
  ),
  tar_target(
    name = model,
    command = fit_model(data)
  ),
  tar_target(
    name = plot,
    command = plot_fit(data, model)
  )
)

R/functions.R

simulate_data <- function(){
  tibble(x = rnorm(100), y = rnorm(100))
}

fit_model <- function(data){
  coefficients(lm(y ~ x, data = data))
}

plot_fit <- function(data, model){

  intercept <- model[1]
  slope <- model[2]

  ggplot(data, aes(x = x, y = y)) +
      geom_point() +
      geom_abline(intercept = intercept, slope = slope, color = "red") +
      labs(title = "Linear regression of y ~ x")
}

Check it

tar_manifest() lists the targets of the pipeline and reports them in topological (dependency) order

> tar_manifest()
# A tibble: 3 × 2
  name  command              
  <chr> <chr>                
1 data  simulate_data()      
2 model fit_model(data)      
3 plot  plot_fit(data, model)

Visualize the DAG using tar_visnetwork() or tar_mermaid()

%%{init: {'themeVariables': { 'fontSize': '15px'}}}%%
graph LR
  style Legend fill:#FFFFFF00,stroke:#000000;
  style Graph fill:#FFFFFF00,stroke:#000000;
  subgraph Legend
    direction LR
    x0a52b03877696646([""Outdated""]):::outdated --- xbf4603d6c2c2ad6b([""Stem""]):::none
    xbf4603d6c2c2ad6b([""Stem""]):::none --- xf0bce276fe2b9d3e>""Function""]:::none
  end
  subgraph Graph
    direction LR
    xb7119b48552d1da3(["data"]):::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    xe1eeca7af8e0b529(["model"]):::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    xc866e2bbd9b270da>"plot_fit"]:::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    x8d3d6615a04319b4>"simulate_data"]:::outdated --> xb7119b48552d1da3(["data"]):::outdated
    xb7119b48552d1da3(["data"]):::outdated --> xe1eeca7af8e0b529(["model"]):::outdated
    x9c2a6d6bf64731cc>"fit_model"]:::outdated --> xe1eeca7af8e0b529(["model"]):::outdated
  end
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:0px;
  linkStyle 1 stroke-width:0px;

Run it

> tar_make()
▶ dispatched target data
● completed target data [0.011 seconds]
▶ dispatched target model
● completed target model [0.002 seconds]
▶ completed pipeline [0.068 seconds]

Change one function and run it again

graph LR
  style Legend fill:#FFFFFF00,stroke:#000000;
  style Graph fill:#FFFFFF00,stroke:#000000;
  subgraph Legend
    direction LR
    x7420bd9270f8d27d([""Up to date""]):::uptodate --- x0a52b03877696646([""Outdated""]):::outdated
    x0a52b03877696646([""Outdated""]):::outdated --- xbf4603d6c2c2ad6b([""Stem""]):::none
    xbf4603d6c2c2ad6b([""Stem""]):::none --- xf0bce276fe2b9d3e>""Function""]:::none
  end
  subgraph Graph
    direction LR
    xb7119b48552d1da3(["data"]):::uptodate --> xaf95534ce5e3f59e(["plot"]):::outdated
    xe1eeca7af8e0b529(["model"]):::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    xc866e2bbd9b270da>"plot_fit"]:::uptodate --> xaf95534ce5e3f59e(["plot"]):::outdated
    x8d3d6615a04319b4>"simulate_data"]:::uptodate --> xb7119b48552d1da3(["data"]):::uptodate
    xb7119b48552d1da3(["data"]):::uptodate --> xe1eeca7af8e0b529(["model"]):::outdated
    x9c2a6d6bf64731cc>"fit_model"]:::outdated --> xe1eeca7af8e0b529(["model"]):::outdated
  end
  classDef uptodate stroke:#000000,color:#ffffff,fill:#354823;
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:0px;
  linkStyle 1 stroke-width:0px;
  linkStyle 2 stroke-width:0px;

✔ skipped target data
▶ dispatched target model
● completed target model [0.002 seconds]
▶ dispatched target plot
● completed target plot [0.009 seconds]
▶ completed pipeline [0.099 seconds]

Under the hood

  • All results are stored in _targets/objects/ (RDS by default, but other formats are available)
  • tar_read() is a high-level function that lazily loads a target from the cache
  • Metadata on targets is stored in _targets/meta (hashes of data/code, timestamps…)
_targets
├── meta
│   ├── meta
│   ├── process
│   └── progress
├── objects
│   ├── data
│   ├── model
│   └── plot
└── user
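
Once the pipeline has run, targets can be inspected from any R session opened in the project directory; a sketch using the target names defined above:

```r
library(targets)

fit <- tar_read(model)      # reads _targets/objects/model without running anything
tar_load(c(data, plot))     # loads these targets into the global environment by name
tar_meta(fields = seconds)  # per-target runtimes from the metadata store
```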

Branching

Branching = using the same code on multiple values

{targets} implements two types of branching:

  • Dynamic branching creates targets at execution time
  • Static branching requires defining them in advance in the code

| Dynamic | Static |
| --- | --- |
| Pipeline creates new targets at runtime. | All targets defined in advance. |
| Cryptic target names. | Friendly target names. |
| Scales to hundreds of thousands of branches. | Does not scale as easily, especially with tar_visnetwork() graphs. |
| No metaprogramming required. | Familiarity with metaprogramming is helpful. |

Dynamic branching

Specific operators allow iterating over R objects to create batches of targets.

Typical use-cases:

  • Split data for parallel processing
  • Arbitrary number of input files
  • Repeat simulations

Dynamic branching

  • map(): one branch per element
  • cross(): one branch per combination of elements
  • slice(), head(), tail() and sample()
list(
  tar_target(file, list.files("data/", full.names = TRUE), format = "file"),
  tar_target(dataset, read.csv(file), pattern = map(file)),
  tar_target(result, analyze(dataset), pattern = map(dataset))
)

graph LR
  x7c436c5117336a58["dataset"]:::uptodate --> x40ad95db433ebf41["result"]:::outdated
  x6d51284275156668(["file"]):::uptodate --> x7c436c5117336a58["dataset"]:::uptodate
  classDef uptodate stroke:#000000,color:#ffffff,fill:#354823;
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:1px;
  linkStyle 1 stroke-width:1px;

Same graph, no matter how many input files there are

Static branching

tar_map() is a target factory from the {tarchetypes} package.

It uses metaprogramming to create copies of a target.

values <- tibble(
  method = rlang::syms(c("method1", "method2")),
)

targets <- tar_map(
  values = values,
  tar_target(analysis, method_function(data)),
  tar_target(summary, summarize_analysis(analysis, data))
)
list(targets)

graph LR
  direction LR
  x78320671e355790a(["analysis_method1"]):::outdated --> x5f83de7ed26e174d(["summary_method1"]):::outdated
  x605088230554e47d(["analysis_method2"]):::outdated --> x65f515d3845bc095(["summary_method2"]):::outdated
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:1px;
  linkStyle 1 stroke-width:1px;

Branching in tagadiff

  • Two normalization methods defined in advance
  • Filtering based on user parameters

Read parameters before the DAG is built

params <- yaml::read_yaml("config.yml")
normalization_methods <- tibble::tibble(method = c("loess", "TMM")) |>
  dplyr::filter(method %in% params$norm)

Create targets with branching

mapped_targets <- function(selected_normalization_method){
  tar_map(
    values = selected_normalization_method,
    tar_target(normalization_results,
               normalize_counts(params, design, peak_desc, raw_counts, method = method)),
    tar_target(normalized_dge, normalization_results$dge),
    tar_target(normalized_pseudo_counts, normalization_results$pseudo_counts),
    tar_target(normalized_pseudo_counts_long, normalization_results$pseudo_counts_long),
    tar_target(norm_counts_pca, plot_counts_PCA(params, normalized_pseudo_counts, design))
  )
}

Parallelization

Integration with {crew} for local or distributed parallelization

Local

library(targets)
library(tarchetypes)
library(crew)

tar_option_set(
  controller = crew_controller_local(workers = 2)
)

Slurm integration

library(targets)
library(tarchetypes)
library(crew.cluster)

tar_option_set(
  controller = crew_controller_slurm(
    workers = 3,
    options_cluster = crew_options_slurm(
      script_lines = "module load statistics/R/4.4.3",
      log_output = "log_folder/"
    )
  )
)

Caching options for parallelization

To define with tar_option_set():

  • Memory: “transient” or “persistent”
  • Storage: “main” or “worker”
  • Retrieval: “main” or “worker”
  • Deployment: “main” or “worker”
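
For example, with tar_option_set() (a sketch; the values shown are a common choice for distributed runs, not universal defaults):

```r
library(targets)

tar_option_set(
  memory = "transient",   # free each target from RAM once it is no longer needed
  storage = "worker",     # each worker writes its own results to the data store
  retrieval = "worker",   # each worker reads its own dependencies from the store
  deployment = "worker"   # run targets on workers rather than the main process
)
```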

Caution

Choose a fast I/O file system to run the pipeline

Quarto integration

Quarto rendering can be a target of the pipeline

  • tar_read() and tar_load() give the document access to targets

  • tar_meta() can be used to access metadata (e.g. runtime)

  • tar_quarto() is a target factory from {tarchetypes} that makes this easier

    tar_quarto(report, "report.qmd")
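
tar_quarto() scans the document for tar_read()/tar_load() calls to infer its dependencies, so the report is re-rendered only when the document or those targets change. A sketch (target names illustrative):

```r
# _targets.R
library(targets)
library(tarchetypes)

list(
  tar_target(data, simulate_data()),
  tar_target(model, fit_model(data)),
  tar_quarto(report, "report.qmd")  # report.qmd calls tar_read(model), creating the edge
)
```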

Using {renv}

  • Use renv like in any other project (commit renv.lock)

  • Load required packages

    tar_option_set(
      packages = c("dplyr", "ggplot2") # automatically attached when the pipeline runs
    )
  • Alternatively, use tar_renv() so that renv's dependency detection picks up the packages used by the pipeline

  • When distributing the code, users need to run renv::restore() prior to running the pipeline
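
The renv side of the workflow, as a sketch:

```r
renv::init()      # once: create the project library and renv.lock
renv::snapshot()  # after adding packages: record versions in renv.lock
# A collaborator who clones the project then runs:
renv::restore()   # reinstall the recorded versions
targets::tar_make()
```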

A real DAG from the tagadiff normalization subworkflow

Lessons from tagadiff development

  • Straightforward to port linear chunks from a notebook

  • One can also define a pipeline from a notebook using special {targets} chunks (Target Markdown, see the documentation) (I did not try that)

  • Branching is not very expressive:

    • Nested branching is difficult
    • Running complex branches (multiple steps, interconnected, with conditions) is not easy
  • I tried to write modular code

    • First using a single pipeline and distinct R scripts for subworkflows
    • I ended up defining several interconnected workflows (see the documentation). Less efficient I/O-wise but simpler to maintain
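
Several pipelines can share one repository through _targets.yaml; a sketch, with subworkflow names and paths purely illustrative:

```r
# Register each subworkflow once (writes _targets.yaml):
targets::tar_config_set(script = "norm/_targets.R", store = "norm/_targets",
                        project = "normalization")
targets::tar_config_set(script = "dge/_targets.R", store = "dge/_targets",
                        project = "dge")

# Select and run one subworkflow:
Sys.setenv(TAR_PROJECT = "normalization")
targets::tar_make()
```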

Limits

  • In my opinion, writing complex pipelines quickly becomes overwhelming
  • Error messages are cryptic and debugging is hard
  • Writing data to RDS files is probably a bottleneck
  • R-centric, difficult to work with other languages or external tools

Thank you :)

🚧 tagadiff is a WIP at : https://forge.inrae.fr/cervin.guyomar/tagadiff 🚧