First steps with {targets}

How to develop reproducible pipelines with R

Cervin Guyomar

Who am I?

  • First time at R Toulouse 🫣
  • Bioinformatician at INRAE GenPhySE, REGLISS group: genomic regulation in farm animals
  • Member of the Sigenae platform
  • I mostly work with Python, R, Bash and Nextflow
  • I do my best to be FAIR

I (we?) love Quarto

  • Binds together all of my R (and more) analyses
  • Interactive first, then wrap it up as a reproducible report
  • Integrates code, results and interpretation
  • Great communication tool

How I (we?) work sometimes

Start with a single Quarto notebook for exploration

  • Did I change the input data?
  • Did I change the code at some point?

➔ I should rerun it just to be sure

[Diagram: notebook.qmd takes data.csv as input and produces a report and results]

How I (we?) work sometimes

Add some parameters

  • How much of the code is redundant between my runs?
  • What parameter set did I use?

➔ I should rerun it just to be sure

[Diagram: notebook.qmd with parameters (param1: "A", param2: "B") takes data.csv as input and produces a report and results]

How I (we?) work sometimes

Make the pipeline more complex with (skippable) tasks

  • Has anything changed in the code or intermediate files?
  • Can I trust the Quarto cache: true option, or implement a sketchy skipping process myself?

➔ I should rerun it just to be sure

[Diagram: preprocess.qmd (rerun: "True", skip: "False") and notebook.qmd (param1: "A", param2: "B") take data.csv as input and produce a report and results]

Quarto caching

Quarto implements a caching system to avoid re-running expensive chunks when compiling the report

#| cache: true 

time_consuming_function(data)

It seems great, but:

  • Works at the chunk level
  • Cache is invalidated only if the code of that particular chunk changes (there is a cumbersome dependson chunk option to declare cross-chunk dependencies)
  • I never quite understand how it behaves
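
A minimal sketch of the dependson workaround (a knitr chunk option; the "load-data" chunk label here is purely illustrative):

```r
#| label: fit-model
#| cache: true
#| dependson: "load-data"  # invalidate this cache when the "load-data" chunk changes

time_consuming_function(data)
```

Every dependency has to be declared by hand, chunk by chunk, which is exactly what gets cumbersome.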

My conclusion:

  • I spend a lot of time “re-running things just to be sure”
  • Quarto can handle moderately complex analyses, but it becomes cumbersome when some steps are expensive or when you need incremental re-runs.

A joke

There are 2 truly difficult problems in Computer Science

  • 0: Naming things
  • 1: Cache invalidation
  • 2: Off by one errors

{targets} helps solve cache invalidation to smooth complex R analyses

My use-case

RNA-seq: experimental observation of the level of activity of a gene in a given biological sample

A common approach is to compare activity levels gene-wise between multiple sets of samples under different conditions. For that we need:

  • Count normalization between samples (using a variety of statistical tools)
  • Statistical testing of expression levels (a.k.a. differential expression) (using a variety of statistical tools and model designs)
  • Visualization / exploration / characterization of the significant genes

With Sylvain Foissac (INRAE, GenPhySE) and Nathalie Vialaneix (INRAE, MIAT), we developed a set of notebooks to perform these tasks routinely

➔ I started to develop tagadiff from this code base

What is {targets}

A pipeline toolkit for R

  • 100% R based
  • Follows the make philosophy: define target objects and let the package determine what needs to be computed
  • Encourages clean, function-based code
  • Manages a lot of things for you

{targets} philosophy

Everything that happens is a function

read_data <- function(input_file){
  read_csv(...) %>% filter(...)
}

process_data <- function(data){
  ...
}

plot_results  <- function(results, data){
  ggplot(data) +
    ...
}

{targets} philosophy

Functions are bound together by defining targets

  • tar_target(downstream_target, f(upstream_target)):

    A target named downstream_target which depends on a target upstream_target and a function f()

tar_target(
  name,            # Name used by high level {targets} functions
  command,         # Function call
  pattern = NULL,  # Branching
  ...
)
  • {tarchetypes} extends {targets} with extra target factories (mainly for branching) and the convenient tar_plan() function
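
As a sketch, tar_plan() from {tarchetypes} lets you write name = command pairs instead of full tar_target() calls (target and function names here are illustrative):

```r
library(targets)
library(tarchetypes)

tar_plan(
  data = simulate_data(),                  # shorthand for tar_target(data, simulate_data())
  model = fit_model(data),
  tar_target(plot, plot_fit(data, model))  # plain tar_target() calls still work alongside
)
```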

Initialize a pipeline

library(targets)
use_targets()
.
├── R
│   └── functions.R
├── run.R
├── run.sh
└── _targets.R

First targets

_targets.R

library(targets)
library(ggplot2)
# Run the R scripts in the R/ folder with your custom functions:
tar_source()
# source("other_functions.R") # Source other scripts as needed.

list(
  tar_target(
    name = data,
    command = simulate_data()
    # format = "feather" # efficient storage for large data frames
  ),
  tar_target(
    name = model,
    command = fit_model(data)
  ),
  tar_target(
    name = plot,
    command = plot_fit(data, model)
  )
)

R/functions.R

simulate_data <- function(){
  tibble(x = rnorm(100), y = rnorm(100))
}

fit_model <- function(data){
  coefficients(lm(y ~ x, data = data))
}

plot_fit <- function(data, model){

  intercept <- model[1]
  slope <- model[2]

  ggplot(data, aes(x = x, y = y)) +
      geom_point() +
      geom_abline(intercept = intercept, slope = slope, color = "red") +
      labs(title = "Linear regression of y ~ x")
}

Check it

tar_manifest() lists the targets of the pipeline and reports them in topological (dependency) order

> tar_manifest()
# A tibble: 3 × 2
  name  command              
  <chr> <chr>                
1 data  simulate_data()      
2 model fit_model(data)      
3 plot  plot_fit(data, model)

Visualize the DAG using tar_visnetwork() or tar_mermaid()

%%{init: {'themeVariables': { 'fontSize': '15px'}}}%%
graph LR
  style Legend fill:#FFFFFF00,stroke:#000000;
  style Graph fill:#FFFFFF00,stroke:#000000;
  subgraph Legend
    direction LR
    x0a52b03877696646([""Outdated""]):::outdated --- xbf4603d6c2c2ad6b([""Stem""]):::none
    xbf4603d6c2c2ad6b([""Stem""]):::none --- xf0bce276fe2b9d3e>""Function""]:::none
  end
  subgraph Graph
    direction LR
    xb7119b48552d1da3(["data"]):::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    xe1eeca7af8e0b529(["model"]):::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    xc866e2bbd9b270da>"plot_fit"]:::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    x8d3d6615a04319b4>"simulate_data"]:::outdated --> xb7119b48552d1da3(["data"]):::outdated
    xb7119b48552d1da3(["data"]):::outdated --> xe1eeca7af8e0b529(["model"]):::outdated
    x9c2a6d6bf64731cc>"fit_model"]:::outdated --> xe1eeca7af8e0b529(["model"]):::outdated
  end
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:0px;
  linkStyle 1 stroke-width:0px;

Run it

> tar_make()
▶ dispatched target data
● completed target data [0.011 seconds]
▶ dispatched target model
● completed target model [0.002 seconds]
▶ completed pipeline [0.068 seconds]

Change one function and run it again

graph LR
  style Legend fill:#FFFFFF00,stroke:#000000;
  style Graph fill:#FFFFFF00,stroke:#000000;
  subgraph Legend
    direction LR
    x7420bd9270f8d27d([""Up to date""]):::uptodate --- x0a52b03877696646([""Outdated""]):::outdated
    x0a52b03877696646([""Outdated""]):::outdated --- xbf4603d6c2c2ad6b([""Stem""]):::none
    xbf4603d6c2c2ad6b([""Stem""]):::none --- xf0bce276fe2b9d3e>""Function""]:::none
  end
  subgraph Graph
    direction LR
    xb7119b48552d1da3(["data"]):::uptodate --> xaf95534ce5e3f59e(["plot"]):::outdated
    xe1eeca7af8e0b529(["model"]):::outdated --> xaf95534ce5e3f59e(["plot"]):::outdated
    xc866e2bbd9b270da>"plot_fit"]:::uptodate --> xaf95534ce5e3f59e(["plot"]):::outdated
    x8d3d6615a04319b4>"simulate_data"]:::uptodate --> xb7119b48552d1da3(["data"]):::uptodate
    xb7119b48552d1da3(["data"]):::uptodate --> xe1eeca7af8e0b529(["model"]):::outdated
    x9c2a6d6bf64731cc>"fit_model"]:::outdated --> xe1eeca7af8e0b529(["model"]):::outdated
  end
  classDef uptodate stroke:#000000,color:#ffffff,fill:#354823;
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:0px;
  linkStyle 1 stroke-width:0px;
  linkStyle 2 stroke-width:0px;

✔ skipped target data
▶ dispatched target model
● completed target model [0.002 seconds]
▶ dispatched target plot
● completed target plot [0.009 seconds]
▶ completed pipeline [0.099 seconds]

Under the hood

  • All results are stored in _targets/objects/ (RDS by default, but other formats are available)
  • tar_read() is a high-level function that lazily loads a target from the cache
  • Metadata on targets is stored in _targets/meta (hashes of data/code, timestamps…)
_targets
├── meta
│   ├── meta
│   ├── process
│   └── progress
├── objects
│   ├── data
│   ├── model
│   └── plot
└── user
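
Once the pipeline has run, targets can be inspected from any R session opened in the project directory; a sketch using the target names defined above:

```r
library(targets)

fit <- tar_read(model)      # reads _targets/objects/model without running anything
tar_load(c(data, plot))     # loads these targets into the global environment by name
tar_meta(fields = seconds)  # per-target runtimes from the metadata store
```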

Branching

Branching = using the same code on multiple values

{targets} implements two types of branching:

  • Dynamic branching creates targets at execution time
  • Static branching requires defining them in advance in the code

| Dynamic | Static |
| --- | --- |
| Pipeline creates new targets at runtime. | All targets defined in advance. |
| Cryptic target names. | Friendly target names. |
| Scales to hundreds of thousands of branches. | Does not scale as easily, especially with tar_visnetwork() graphs. |
| No metaprogramming required. | Familiarity with metaprogramming is helpful. |

Dynamic branching

Specific operators allow iterating over R objects to create batches of targets.

Typical use-cases:

  • Split data for parallel processing
  • Arbitrary number of input files
  • Repeat simulations

Dynamic branching

  • map(): one branch per element
  • cross(): one branch per combination of elements
  • slice(), head(), tail() and sample()
list(
  tar_target(file, list.files("data/", full.names = TRUE), format = "file"),
  tar_target(dataset, read.csv(file), pattern = map(file)),
  tar_target(result, analyze(dataset), pattern = map(dataset))
)

graph LR
  x7c436c5117336a58["dataset"]:::uptodate --> x40ad95db433ebf41["result"]:::outdated
  x6d51284275156668(["file"]):::uptodate --> x7c436c5117336a58["dataset"]:::uptodate
  classDef uptodate stroke:#000000,color:#ffffff,fill:#354823;
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:1px;
  linkStyle 1 stroke-width:1px;

Same graph, no matter how many input files there are

Static branching

tar_map() is a target factory from the {tarchetypes} package.

It uses metaprogramming to create copies of a target.

values <- tibble(
  method = rlang::syms(c("method1", "method2")),
)

targets <- tar_map(
  values = values,
  tar_target(analysis, method_function(data)),
  tar_target(summary, summarize_analysis(analysis, data))
)
list(targets)

graph LR
  direction LR
  x78320671e355790a(["analysis_method1"]):::outdated --> x5f83de7ed26e174d(["summary_method1"]):::outdated
  x605088230554e47d(["analysis_method2"]):::outdated --> x65f515d3845bc095(["summary_method2"]):::outdated
  classDef outdated stroke:#000000,color:#000000,fill:#78B7C5;
  classDef none stroke:#000000,color:#000000,fill:#94a4ac;
  linkStyle 0 stroke-width:1px;
  linkStyle 1 stroke-width:1px;

Branching in tagadiff

  • Two normalization methods defined in advance
  • Filtering based on user parameters

Read parameters before the DAG is built

params <- yaml::read_yaml("config.yml")
normalization_methods <- tibble::tibble(method = c("loess", "TMM")) |>
  dplyr::filter(method %in% params$norm)

Create targets with branching

mapped_targets <- function(selected_normalization_method){
  tar_map(
    values = selected_normalization_method,
    tar_target(normalization_results,
               normalize_counts(params, design, peak_desc, raw_counts, method = method)),
    tar_target(normalized_dge, normalization_results$dge),
    tar_target(normalized_pseudo_counts, normalization_results$pseudo_counts),
    tar_target(normalized_pseudo_counts_long, normalization_results$pseudo_counts_long),
    tar_target(norm_counts_pca, plot_counts_PCA(params, normalized_pseudo_counts, design))
  )
}

Parallelization

Integration with {crew} for local or distributed parallelization

Local

library(targets)
library(tarchetypes)
library(crew)

tar_option_set(
  controller = crew_controller_local(workers = 2)
)

Slurm integration

library(targets)
library(tarchetypes)
library(crew.cluster)

tar_option_set(
  controller = crew_controller_slurm(
    workers = 3,
    options_cluster = crew_options_slurm(
      script_lines = "module load statistics/R/4.4.3",
      log_output = "log_folder/"
    )
  )
)

Caching options for parallelization

To define with tar_option_set():

  • Memory: “transient” or “persistent”
  • Storage: “main” or “worker”
  • Retrieval: “main” or “worker”
  • Deployment: “main” or “worker”
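
For example, with tar_option_set() (a sketch; the values shown are a common choice for distributed runs, not universal defaults):

```r
library(targets)

tar_option_set(
  memory = "transient",   # free each target from RAM once it is no longer needed
  storage = "worker",     # each worker writes its own results to the data store
  retrieval = "worker",   # each worker reads its own dependencies from the store
  deployment = "worker"   # run targets on workers rather than the main process
)
```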

Caution

Choose a fast I/O file system to run the pipeline

Quarto integration

Quarto rendering can be a target of the pipeline

  • tar_read() and tar_load() give the document access to targets

  • tar_meta() can be used to access metadata (e.g. runtime)

  • tar_quarto() is a target factory from {tarchetypes} that makes this easier

    tar_quarto(report, "report.qmd")
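
tar_quarto() scans the document for tar_read()/tar_load() calls to infer its dependencies, so the report is re-rendered only when the document or those targets change. A sketch (target names illustrative):

```r
# _targets.R
library(targets)
library(tarchetypes)

list(
  tar_target(data, simulate_data()),
  tar_target(model, fit_model(data)),
  tar_quarto(report, "report.qmd")  # report.qmd calls tar_read(model), creating the edge
)
```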

Using {renv}

  • Use renv like in any other project (commit renv.lock)

  • Load required packages

    tar_option_set(
      packages = c("dplyr", "ggplot2") # automatically attached when the pipeline runs
    )
  • Alternatively, use tar_renv() so that renv's dependency detection picks up the packages used by the pipeline

  • When distributing the code, users need to run renv::restore() prior to running the pipeline
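
The renv side of the workflow, as a sketch:

```r
renv::init()      # once: create the project library and renv.lock
renv::snapshot()  # after adding packages: record versions in renv.lock
# A collaborator who clones the project then runs:
renv::restore()   # reinstall the recorded versions
targets::tar_make()
```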

A real DAG from the tagadiff normalization subworkflow

Lessons from tagadiff development

  • Straightforward to port linear chunks from a notebook

  • One can also define a pipeline from a notebook using special {targets} chunks (Target Markdown, see the documentation) (I did not try that)

  • Branching is not very expressive:

    • Nested branching is difficult
    • Running complex branches (multiple steps, interconnected, with conditions) is not easy
  • I tried to write modular code

    • First using a single pipeline and distinct R scripts for subworkflows
    • I ended up defining several interconnected workflows (see the documentation). Less efficient I/O-wise but simpler to maintain
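
Several pipelines can share one repository through _targets.yaml; a sketch, with subworkflow names and paths purely illustrative:

```r
# Register each subworkflow once (writes _targets.yaml):
targets::tar_config_set(script = "norm/_targets.R", store = "norm/_targets",
                        project = "normalization")
targets::tar_config_set(script = "dge/_targets.R", store = "dge/_targets",
                        project = "dge")

# Select and run one subworkflow:
Sys.setenv(TAR_PROJECT = "normalization")
targets::tar_make()
```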

Limits

  • In my opinion, writing complex pipelines quickly becomes overwhelming
  • Error messages are cryptic and debugging is hard
  • Writing data to RDS files is probably a bottleneck
  • R-centric, difficult to work with other languages or external tools

Thank you :)

🚧 tagadiff is a WIP at : https://forge.inrae.fr/cervin.guyomar/tagadiff 🚧