by Phelelani Mpangase
Sydney Brenner Institute for Molecular Bioscience | Biomedical Informatics and Translational Science
Faculty of Health Sciences
University of the Witwatersrand
https://ars.els-cdn.com/content/image/1-s2.0-S2590262822000090-gr2_lrg.jpg
Central to Modern Science
○ Data generation and storage exceed the capacity of traditional data processing systems
○ Varied data types require varied (new) approaches to handling and analysis
○ Growing need for advanced analytic techniques to extract meaningful insights
Central to Modern Science
Workflow Management
Use dedicated systems (e.g., Nextflow, WDL, Galaxy) to automate, connect, and manage multi-step analyses
Containerisation
Package software and dependencies into a single, isolated unit (e.g., Docker, Singularity/Apptainer)
Scaling & Sharing
Run workflows on HPC or cloud environments; use version control (Git) to track and share code
A workflow management system must support the following to promote reproducible science:
Able to define individual computational tasks (commands, inputs, outputs, resources)
Automatic construction and execution of a directed acyclic graph (DAG) based on inputs/outputs or explicit dependencies
Able to discover and run tasks in parallel (dataflow, rule-based, etc.)
Able to run the same workflow on different platforms (local machine, HPC, cloud [AWS/GCP/Azure], Kubernetes, etc.)
Support containers (Docker/Singularity), Conda, or other package managers to ensure reproducibility
Support per-task CPU, memory, time, and GPU requests, and integration with schedulers/backends
Support checkpointing, caching, and resume/retry features
Support sub-workflows/modules, imports/includes, or reusable workflow components
Support wildcards, streaming, or channel/file globbing
Support execution logs, metadata, provenance tracking, and call-level reporting
Community pipelines, registries (e.g., nf-core), GUI/portal integrations, templates
How easy is it to write, read, and maintain workflows (syntax & language)?
Dry-run, DAG visualization, per-step logs, easy debug tools
Support for secrets, data locality, controlled backends (important for clinical use)
Support for standards (CWL, GA4GH, TRS, Task/Workflow APIs) or converters
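The first requirements above (tasks defined by their inputs and outputs, automatic DAG construction, and discovery of tasks that can run in parallel) can be illustrated with a minimal Python sketch. This is not how Nextflow, WDL, or Galaxy are implemented; the task names and file names are hypothetical, and the tasks are only ordered, not executed. It uses the standard-library graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical task table: each task lists the files it reads and writes.
TASKS = {
    "trim":   {"inputs": ["reads.fastq"], "outputs": ["trimmed.fastq"]},
    "align":  {"inputs": ["trimmed.fastq", "ref.fa"], "outputs": ["aligned.bam"]},
    "qc":     {"inputs": ["trimmed.fastq"], "outputs": ["qc.html"]},
    "report": {"inputs": ["aligned.bam", "qc.html"], "outputs": ["report.html"]},
}


def build_dag(tasks):
    """Map each task to the set of tasks that produce its inputs."""
    producer = {out: name for name, t in tasks.items() for out in t["outputs"]}
    return {
        name: {producer[i] for i in t["inputs"] if i in producer}
        for name, t in tasks.items()
    }


def parallel_batches(dag):
    """Yield lists of tasks whose dependencies have all finished.

    Tasks within one batch do not depend on each other, so a real
    engine could submit them to a scheduler at the same time.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


dag = build_dag(TASKS)
batches = list(parallel_batches(dag))
for batch in batches:
    print(batch)
```

Here "align" and "qc" land in the same batch because both depend only on "trim". Files with no producer (such as the raw reads) are treated as external inputs.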
https://k21academy.com/wp-content/uploads/2020/11/Virtual_Machine_Architecture_result-1.webp
○ Run a full guest operating system on virtualised hardware
○ Heavy with high resource costs and large file sizes
https://k21academy.com/wp-content/uploads/2020/11/output-onlinepngtools-16_result-1.webp
○ Share the host machine's kernel
○ Lightweight and fast to start, with less overhead than virtual machines
○ Run only the necessary applications
Docker & Singularity/Apptainer
Ensures consistent results across different compute environments
Captures exact states of code, parameters, and environments
Ensures others can run the exact same workflow
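One way to picture "capturing exact states of code, parameters, and environments" is a small provenance record with a fingerprint. The sketch below is an illustration only, not a feature of Docker or Singularity/Apptainer; the function name, record fields, and the image tag "example/aligner:1.0" are assumptions made for the example.

```python
import hashlib
import json


def run_fingerprint(code_text, params, container_image):
    """Return a provenance record and a SHA-256 digest that identifies it."""
    record = {
        "code_sha256": hashlib.sha256(code_text.encode("utf-8")).hexdigest(),
        "params": params,
        "container_image": container_image,
    }
    # Sorted keys and fixed separators make the serialisation deterministic,
    # so equal records always produce equal digests.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record, digest


# Hypothetical workflow step, parameters, and container image tag.
code = "bwa mem ref.fa reads.fastq > aligned.sam"
_, first = run_fingerprint(code, {"threads": 4}, "example/aligner:1.0")
_, again = run_fingerprint(code, {"threads": 4}, "example/aligner:1.0")
_, changed = run_fingerprint(code, {"threads": 8}, "example/aligner:1.0")

print(first == again)    # same code, parameters, and image
print(first == changed)  # one parameter differs
```

Identical code, parameters, and image give the same fingerprint, while changing any one of them gives a different one. In practice, pinning the image by its content digest rather than a mutable tag would make the environment part of the record stricter.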
Questions?