Skip to main content

Reproducible workflows

This page outlines OET's recommended approach for setting up reproducible research projects using PyPSA-X repositories. Rather than using a traditional submodule-based structure, we've implemented a more streamlined approach that facilitates easier maintenance and upstream compatibility. We recommend investing in reproducibility from the beginning of your project, as experiences have shown that it reduces the amount of work required to deliver reproducible deliverables.

You can also check out our internal workshop presentation on reproducible workflows.

Motivation

Many studies in the scientific community lack full transparency and are difficult to reproduce in outcome, analysis and interpretation. At OET, we believe open source and reproducibility are essential, and we aim to set a new standard. The benefits of fully reproducible research are plenty. It saves time and reduces manual effort while enabling easier validation and fostering collaboration. Additionally, it allows others to build upon our work, which leads to acceleration in innovations. Lastly, it encourages best practices and the use of standards, which we aim for at OET.

There are some excellent examples of reproducible research setups in the open-source community, including work by Iegor Riepin, Leon Schumm, Fabian Neumann, Tim Tröndle.

Our workflow is designed to achieve three levels of scientific reproducibility:

  • Outcome reproducibility: The experiment produces identical results to the original, enabling the same analysis, interpretation, and conclusions.
  • Analysis reproducibility: Despite potentially different raw results, the same analytical methods can be applied, producing consistent interpretations and conclusions.
  • Interpretation reproducibility: Even with different outcomes and analyses, the experimental findings lead to the same fundamental conclusions and validate the original hypothesis.

For more theoretical background on these different types of reproducibility feel free to check out Gundersen (2021): The fundamental principles of reproducibility and for a deeper understanding of the benefits of open energy system modelling specifically, feel free to explore OET's policy brief on the matter.

General process

The reproducible workflow structure is in line with our Soft-Fork Strategy, allowing teams to develop projects independently while maintaining the ability to incorporate upstream changes and contribute improvements back to the community.

Repository structure

We recommend the following structure for each new project:

  1. Fork the OET/PyPSA-X repository: Create a project-specific fork that maintains the exact same structure as our OET upstream repository.

  2. Add reporting capabilities: Each OET PyPSA-X fork includes:

    • A new report folder containing project-specific documentation,
    • A new Snakemake rule report.smk for generating a report.pdf file with project results.
  3. Minimal CI: Set up a minimal example CI that runs regularly to guarantee full reproducibility at any point in time (during and after the development).

  4. Final repository release: When completing the report or reaching a significant milestone, publish a repository release that includes:

    • The specific commit hash used to produce the study results,
    • A pinned environment.yaml file with exact package versions (provided for multiple operating systems),
    • Permanently archived and DOI-referenced input data bundles using repositories such as Zenodo to facilitate continuous reproducibility with data retrieval (alternative: OET internal server to store data).

Benefits

This approach offers several advantages over traditional submodule-based structures:

  • 🧹✨ Simplified workflow: Eliminates the complexity of managing submodules
  • 🗂️🛠️ Easier maintenance: Single repository structure with clear organization
  • 🔄🌐 Upstream compatibility: Maintains synchronization with the original PyPSA-X repositories
  • 🤝🚀 Contribution-friendly: Makes it easier to contribute new features back to upstream repositories
  • 🧪🔁 Reproducibility: Preserves scientific reproducibility without additional complexity or overhead

Workflow

Our workflow leverages a powerful workflow management tool called Snakemake. Snakemake allows you to execute the entire pipeline — from data processing to report generation — with a single command. Full documentation can be found in the official Snakemake documentation.

The workflow consists of six sequential stages that ensure complete reproducibility:

  1. Raw data processing: Source data is automatically fetched, validated, and transformed into a standardized format suitable for our PyPSA-X model.

  2. Model composition: The model is constructed using the processed input data, with components and parameters configured according to scenario specifications.

  3. Model solving: The optimization problem is solved using high-performance solvers to determine the optimal system.

  4. Generic post-processing: Results are processed into standardized outputs including summary tables and visualizations, achieving outcome reproducibility.

  5. Project-specific analysis: Custom key performance indicators (KPIs) and visualizations are generated to address specific project deliverables or research questions. We recommend implementing each KPI calculation as a separate rule for modularity and clarity. This stage ensures analysis reproducibility.

  6. Report generation: A comprehensive report is automatically compiled, incorporating generated figures, tables, and interpretive text. This final stage delivers interpretation reproducibility.

Other useful recommendations

  • Add the project to Zenodo if the project does not include any other publication to create a citable DOI.
  • It could be useful to contain sensible data in an extra layer with open alternative data sources if the original input data is sensitive.
  • To be able to reproduce “throwaway” ideas, open a PR that you later close without merging it.