Reproducibility in the Age of Agents

GCC 2026
John Chilton, Marius van den Beek, Dannon Baker, David López, Ahmed Hamid Awan, Anton Nekrutenko
Designing for human reproducibility, accelerating it for everyone else.

AI is an ethical & environmental disaster that requires brilliant leadership across every level of society 😔

This slide is just John, not my co-authors. I am here because I need to eat

Why, when it is easier than ever to write well-designed code, are we instead getting inundated with slop¹?

The answer is incentives.

¹ see @jmchilton’s PRs

Bioinformatics is following suit: the field is about to automate the reproducibility crisis itself.

Galaxy MUST incentivize reproducibility for agents.

An Optimistic Thesis

Designing for human reproducibility supercharges agents.

Marius made the point on Monday: user-defined tools (UDTs) were in development before agentic use of Galaxy.

Built for humans, supercharging agents.

3 WIP Papers Support this Implicit Thesis

Paper 1
Galaxy Notebooks: Reproducible Communication and Narrative-Driven Workflow Extraction
Paper 2
Format 2 and gxwf: Schema-Aware Authoring and Validation of Galaxy Workflows
Paper 3
Galaxy Workflow Foundry: Compiling Curated Workflow Knowledge into Provenanced Agent Skills
Designing for human reproducibility supercharges agents.
Paper 1 of 3

Galaxy Notebooks: Reproducible Communication and Narrative-Driven Workflow Extraction

jmchilton.github.io/galaxy-brain/papers/galaxy-notebooks/

Example 1: Mobile Resistome

Rendered Galaxy Notebook for the mobile resistome analysis
Extracted fourteen-step workflow graph

Backward extraction from a four-isolate Staphylococcus aureus notebook recovered a 14-step, sample-agnostic workflow.

  • 8 collection map-over steps
  • 9 workflow outputs
  • 0 dangling inputs
  • byte-identical re-run across all four isolates
BioProject PRJDB8599 · analysis through heatmaps

History → Notebook → Workflow

History
computational record
datasets, tools, parameters, provenance
Notebook
communicative record
why it mattered, which outputs count, what to report
Workflow
reusable graph
backward closure from referenced artifacts
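One way to picture "backward closure from referenced artifacts" is a graph walk that starts at the outputs the notebook names and follows producers upstream. This is a minimal sketch, not Galaxy's extraction code: the dataset names, job names, and the two lookup tables are all hypothetical stand-ins for a history's provenance.

```python
from collections import deque

# Hypothetical provenance for a small history (illustrative names only):
# which job produced each dataset, and which datasets each job consumed.
PRODUCED_BY = {
    "contigs": "assemble",
    "amr_hits": "amr_scan",
    "heatmap": "plot_heatmap",
    "scratch_qc": "fastqc",
}
JOB_INPUTS = {
    "assemble": ["reads"],
    "amr_scan": ["contigs"],
    "plot_heatmap": ["amr_hits"],
    "fastqc": ["reads"],
}


def backward_closure(referenced):
    """Return (jobs, workflow_inputs) reachable upstream of the referenced datasets."""
    jobs, workflow_inputs, seen = set(), set(), set()
    queue = deque(referenced)
    while queue:
        dataset = queue.popleft()
        if dataset in seen:
            continue
        seen.add(dataset)
        job = PRODUCED_BY.get(dataset)
        if job is None:
            # No producing job: treat the dataset as a workflow input.
            workflow_inputs.add(dataset)
            continue
        jobs.add(job)
        queue.extend(JOB_INPUTS[job])
    return jobs, workflow_inputs


# The notebook references only the heatmap; the scratch FastQC job is left out.
jobs, workflow_inputs = backward_closure(["heatmap"])
print(sorted(jobs), sorted(workflow_inputs))
```

Jobs that no referenced output depends on never enter the closure, which is why scratch steps drop out without anyone pruning them by hand.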

Example 2: Differential ATAC-seq

Rendered Galaxy Notebook showing DESeq2 PCA for erythroblast versus B-cell ATAC-seq
Rendered Galaxy Notebook showing differential accessibility volcano plot
Rendered Galaxy Notebook showing significant and top condition-gained ATAC-seq tables
Rendered notebook bits: PDF outputs and tabular outputs stay on-graph.
Extracted nine-step differential ATAC-seq workflow

Count matrix + sample sheet → DESeq2 → NA-filter → volcano ∥ significance filter → sort → top gained / lost peaks.

  • 9 extracted steps from a 590,650-peak universe
  • 6 workflow outputs, 0 dangling inputs, 0 report repairs
  • 45,620 significant peaks reproduced
  • 34,873 B-cell-gained / 10,747 erythroblast-gained
Corces 2016 ATAC-seq atlas · erythroblast vs B-cell · hg19

Extraction Is Easier

Before
A human reverse-engineers a busy history: jobs, outputs, branches, and connections.
Now
The notebook already names the outputs that matter; Galaxy walks the graph backward.
After
The extracted workflow starts with the story attached: the report is seeded from the notebook.
Don’t just make the analysis reproducible, make the communication of the analysis reproducible.

Skill: reproduciblify

Don’t write the notebook and then extract the workflow by hand. Let an agent rebuild the history so extraction works.

start
messy real history
Manual uploads, pasted figures, scratch steps, and one-sample-at-a-time structure.
agent work
rebuild inside Galaxy
Find existing tools first, create tools only as fallback, and restructure with collections.
finish
notebook extracts cleanly
Only real Galaxy outputs become notebook anchors, so extraction can recover the reusable workflow.
galaxyproject/galaxy-skills#29

Skill: workflow reports

After a workflow runs, turn the invocation into a report with the same reproducibility guarantees.

input
workflow definition
Read a local .ga file or a Galaxy workflow download with labels and marked outputs.
agent work
draft report markdown
Use Galaxy directives for inputs, outputs, images, tables, workflow diagrams, and history links.
result
run-specific report
The same template resolves against each invocation, keeping analysis communication reproducible.
galaxyproject/galaxy-skills#14
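The idea that one template "resolves against each invocation" can be sketched as label substitution: the report refers to outputs by workflow label, and each run supplies its own dataset IDs. This is a toy resolver under stated assumptions, not Galaxy's implementation; the output labels and dataset IDs below are invented for illustration.

```python
import re

# Report template that refers to outputs by workflow label (labels are hypothetical).
TEMPLATE = (
    'history_dataset_as_image(output="volcano_plot")\n'
    'history_dataset_as_table(output="significant_peaks")\n'
)


def resolve(template, invocation_outputs):
    """Swap each output label for the dataset ID recorded by one invocation."""

    def swap(match):
        label = match.group(1)
        # A label missing from the invocation raises KeyError instead of passing silently.
        return f"history_dataset_id={invocation_outputs[label]}"

    return re.sub(r'output="([^"]+)"', swap, template)


# Two runs of the same workflow, each with its own (made-up) dataset IDs.
run_a = {"volcano_plot": "a1", "significant_peaks": "a2"}
run_b = {"volcano_plot": "b1", "significant_peaks": "b2"}

report_a = resolve(TEMPLATE, run_a)
report_b = resolve(TEMPLATE, run_b)
print(report_a)
```

The template itself never changes between runs; only the label-to-dataset mapping does.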
Paper 2 of 3

Format 2 and gxwf: Schema-Aware Authoring and Validation of Galaxy Workflows

jmchilton.github.io/galaxy-brain/papers/gxwf/

nf-core/rnaseq

nf-core rnaseq workflow diagram with arrows highlighting configuration and data flow
Galaxy RNA-seq workflow graph with arrows highlighting parameters and connections

Format 2 + gxwf

Every workflow system validates something. Galaxy can validate the scientific tool invocation itself.

native .ga
machine serialization
"tool_state": "{
  \"reference_source\": {
    \"reference_source_selector\": \"cached\",
    \"ref_file\": \"hg38\"
  },
  \"output_sort\": \"coordsorted\"
}"
Format 2
human + agent authoring
state:
  reference_source:
    reference_source_selector: cached
    ref_file: hg38
  output_sort: coordsorted
10,000+ ToolShed-served typed parameter schemas
Names, types, select options, conditionals, collections
gxwf validates offline in milliseconds
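The gap between the two serializations is easy to see in code: in the native form, tool_state is a JSON document stored inside a string, so it must be decoded before anything can inspect a parameter. A minimal sketch using only the values shown on the slide; the helper names are hypothetical, and real .ga files may nest further encoding that this sketch ignores.

```python
import json

# Native-style step: tool_state is a JSON string, as on the slide.
native_step = {
    "tool_state": json.dumps(
        {
            "reference_source": {
                "reference_source_selector": "cached",
                "ref_file": "hg38",
            },
            "output_sort": "coordsorted",
        }
    )
}


def nested_state(step):
    """Decode the string-encoded tool_state into a nested dict (the Format 2 'state' shape)."""
    return json.loads(step["tool_state"])


def leaf_paths(state, prefix=""):
    """Yield (dotted_path, value) pairs, the kind of path a diagnostic can point at."""
    for key, value in state.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path, value


state = nested_state(native_step)
for path, value in leaf_paths(state):
    print(path, "=", value)
```

Once the state is a real nested structure, every parameter has an addressable path, which is what makes typed validation and targeted repair possible.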

gxwf: one contract, two runtimes

A shared workflow-state specification and fixture suite keep the Python and TypeScript implementations converging.

TypeScript / npm
published now
@galaxy-tool-util/* packages feed the CLI, web UI, reports, and VS Code extension.
Python / Galaxy
coming soon
Pydantic report models and Galaxy-side workflow-state tooling are staged for integration.
Shared truth
spec + tests
OpenAPI contracts, report-model JSON shapes, and declarative YAML fixtures keep the implementations honest.
Same reports, same validation concepts, different hosts: CLI, web operations, Galaxy, and VS Code.

Published CLI + docs

galaxy-tool-util Getting Started documentation showing npm install -g @galaxy-tool-util/cli
npm install -g @galaxy-tool-util/cli
galaxy-tool-util Workflow Operations documentation showing gxwf validation, cleaning, linting, conversion, and roundtrip commands
workflow validation, cleanup, linting, conversion, roundtrip

Tool state, validated

VS Code editing a Format 2 workflow with an invalid MACS2 format value flagged in the Problems panel
An illegal select value, caught before a single job runs — with the legal options in the message:
state:
  format: BAMX   # typo

$ gxwf validate wf.gxwf.yml
[0] call_peaks  FAIL
  format:
    expected "BAM" | "BAMPE" | "BED",
    actual "BAMX"
Names, types, select options, conditionals — same schema the Galaxy UI uses. Diagnostics are structured (path + category), so an agent fixes in a tight loop instead of waiting on a failed job.
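A structured diagnostic of this kind can be sketched in a few lines. This is an illustration of the idea, not gxwf's code: the schema fragment, the Diagnostic class, and the category names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Diagnostic:
    path: str       # where in the state the problem is
    category: str   # machine-readable kind of problem (names here are hypothetical)
    message: str    # human-readable detail, including the legal options


# Hypothetical schema fragment for one select parameter.
SCHEMA = {"format": {"type": "select", "options": ["BAM", "BAMPE", "BED"]}}


def validate_state(state, schema):
    """Check flat state against the schema and return a list of diagnostics."""
    diagnostics = []
    for name, value in state.items():
        spec = schema.get(name)
        if spec is None:
            diagnostics.append(
                Diagnostic(name, "unknown_parameter", f"no parameter named {name!r}")
            )
        elif spec["type"] == "select" and value not in spec["options"]:
            expected = " | ".join(f'"{option}"' for option in spec["options"])
            diagnostics.append(
                Diagnostic(name, "invalid_select", f'expected {expected}, actual "{value}"')
            )
    return diagnostics


problems = validate_state({"format": "BAMX"}, SCHEMA)
for problem in problems:
    print(problem.path, problem.category, problem.message)
```

Because the path and category are fields rather than prose, an agent can branch on them directly instead of parsing a log.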

Connections, validated

Connections aren’t just producer→consumer links — they carry Galaxy’s collection algebra.

$ gxwf validate pe-artic-variation.ga \
      --connections

Tool state: 25 validated, 0 skipped
Connections: OK — 46 ok, 0 invalid, 0 skip
A list wired into a single-dataset input implies map-over; an incompatible depth (e.g. list:pairedpaired with no flatten) is rejected statically.
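The map-over rule can be approximated with a suffix check on collection types. This is a deliberately simplified sketch and an assumption on my part, not the rule set gxwf or Galaxy implements; real connection checking covers more cases. The function name and the "dataset" marker for single-dataset ports are hypothetical.

```python
def map_over(output_type, input_type):
    """Return the collection type mapped over, "" for a direct match, or None if incompatible.

    Types use Galaxy's colon notation ("list", "list:paired");
    "dataset" stands for a single-dataset port.
    """
    out = [] if output_type == "dataset" else output_type.split(":")
    inp = [] if input_type == "dataset" else input_type.split(":")
    if len(inp) > len(out):
        return None  # the input wants more structure than the output provides
    split_at = len(out) - len(inp)
    if out[split_at:] != inp:
        return None  # the inner structure does not match what the input consumes
    return ":".join(out[:split_at])


print(map_over("list", "dataset"))    # list: the tool runs once per element
print(map_over("list:list", "list"))  # list: the outer list is mapped over
print(map_over("list", "list"))       # empty string: direct connection
print(map_over("paired", "list"))     # None: rejected before anything runs
```

Whatever the outer levels are after the input's type is matched becomes the implied map-over, so the check doubles as a prediction of the output's collection shape.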

galaxy-workflows-vscode

An IDE for Galaxy workflows

Full .ga + gxformat2 coverage
Native Galaxy workflows and Format 2 workflows both get a real editor experience.
Schema-aware
Validation, hover docs, IntelliSense, formatting, outline, diagrams.
Thank you, David
Huge, repeated thanks to David López for building the extension that makes these IDE demos real.

github.com/davelopez/galaxy-workflows-vscode

IDE work: find the tool

VS Code ToolShed search results for FastQC tools
ToolShed search inside VS Code resolves a human query to a versioned Galaxy tool.

IDE work: complete state keys

VS Code completing valid Format 2 state keys from a Galaxy tool schema
The extension completes Format 2 parameter names from the selected tool schema, including nested state.

IDE work: complete legal values

VS Code completing legal Format 2 select values for a Galaxy tool parameter
Select parameters expose their enum values in-place, before validation or execution.

Format 2 is good for agents because it is good for humans

More intuitive
YAML names, nested state, inputs, outputs, and steps read like the workflow humans already discuss.
Less context
The schema carries tool IDs, legal values, connection shape, and validation categories so prompts do not need to.
Robust tooling
CLI, IDE, browser, and Galaxy validation all report against the same typed workflow surface.
designed for humans, supercharging agents
Paper 3 of 3

Galaxy Workflow Foundry: Compiling Curated Workflow Knowledge into Provenanced Agent Skills

jmchilton.github.io/galaxy-brain/papers/foundry/

Even when we ask agents to build workflows,

designing for humans supercharges agents.

Skills are the wrong source of truth

A hand-authored conversion skill works until the ecosystem moves.
Context flooding
Every run drags in conditionals, collections, tests, wrappers, and caveats.
Brittle composition
Paper, Nextflow, CWL, and interview paths duplicate the same workflow moves.
Prose caveats
“Remember to validate” is weaker than a schema and a command that must pass.
Compressed evidence
Corpus examples and design rationale get summarized until they stop being auditable.
Runtime captivity
A skill written for one agent surface does not become portable by hoping.
No human-scrutable source
When the skill is wrong, there is no richer upstream artifact to inspect and fix.
The skill can run; the maintainer still needs to audit why it says what it says.

Galaxy Workflow Foundry

Make the human-centered knowledge base agent executable.
Foundry diagram showing knowledge base inputs passing through casting molds into cast skills and actionable pipelines
galaxyproject.github.io/foundry

Pipelines are journeys

Foundry pipelines page showing ordered Mold sequences

A pipeline is not a giant prompt. It is an ordered Mold sequence with visible handoffs.

  • source-specific summary
  • target-specific design briefs
  • corpus comparison
  • draft implementation loop
  • tests, validation, execution, debug
Pipelines make the journey browseable for humans and executable for harnesses.
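"Visible handoffs" can be made concrete by declaring what each Mold consumes and produces, then checking the order. A minimal sketch only: the Mold fields, Mold names, and artifact names are hypothetical and do not reflect Foundry's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mold:
    name: str
    consumes: tuple  # artifact names this Mold needs
    produces: tuple  # artifact names this Mold hands off


# Illustrative pipeline loosely following the bullet list above.
PIPELINE = [
    Mold("summarize-source", ("source",), ("summary",)),
    Mold("design-brief", ("summary",), ("design_brief",)),
    Mold("compare-corpus", ("design_brief",), ("corpus_notes",)),
    Mold("draft-implementation", ("design_brief", "corpus_notes"), ("workflow",)),
    Mold("validate-and-test", ("workflow",), ("test_report",)),
]


def check_handoffs(pipeline, initial=("source",)):
    """Return (mold_name, missing_artifacts) for every Mold whose inputs are not yet produced."""
    available = set(initial)
    problems = []
    for mold in pipeline:
        missing = [artifact for artifact in mold.consumes if artifact not in available]
        if missing:
            problems.append((mold.name, missing))
        available.update(mold.produces)
    return problems


print(check_handoffs(PIPELINE))                  # [] when every handoff is satisfied
print(check_handoffs(list(reversed(PIPELINE))))  # a scrambled order surfaces the gaps
```

An ordered list with declared handoffs can be checked mechanically and browsed by a person, which a single large prompt cannot offer.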

Interview → Galaxy

A conversation becomes a typed workflow draft, then a validated workflow.
Normalize an interview into a shared freeform summary.
Design Galaxy interface and data flow, then compare to IWC exemplars.
Loop over advance-galaxy-draft-step until no drafty step remains.
Validate, test, execute, and debug with deterministic tooling in the loop.
Foundry interview to Galaxy pipeline page showing phases, loop, and branch

The same analysis, two directions

Example 2 extracted differential ATAC-seq from a completed run. A Foundry pipeline constructed the same analysis from the same initial prompt — no history, no execution.

%%{init: {'theme':'base','themeVariables':{'fontFamily':'Atkinson Hyperlegible','primaryColor':'#25537b','primaryTextColor':'#ffffff','primaryBorderColor':'#2c3143','lineColor':'#58585a','fontSize':'15px'}}}%%
graph LR
  input_0>"ATAC counts"]
  input_1>"sample metadata"]
  step_0["DESeq2 differential test"]
  step_1["Clean table (NA filter)"]
  step_2["Volcano plot"]
  step_3["Filter significant peaks"]
  step_4["Sort by log2FC"]
  step_5["Top gained peaks"]
  step_6["Top lost peaks"]
  input_0 --> step_0
  input_1 --> step_0
  step_0 --> step_1
  step_1 --> step_2
  step_1 --> step_3
  step_3 --> step_4
  step_4 --> step_5
  step_4 --> step_6
  classDef input fill:#edf4fa,stroke:#25537b,color:#2c3143;
  classDef core fill:#25537b,stroke:#2c3143,color:#ffffff;
  class input_0,input_1 input;
  class step_0 core;
  • count matrix + sample sheet → DESeq2 → NA-clean → volcano ∥ filter → sort → top gained / lost
  • every step a real, version-pinned Galaxy tool
  • tool_state schema-validated offline by gxwf

Diagram emitted by gxwf mermaid from the pipeline’s output workflow.

Extraction needs a completed run; construction needs only the intent. Both land on the same kind of validated, reproducible workflow.

Patterns are the reusable moves

1. Patterns MOC
Foundry patterns index with pattern maps
Start from corpus-grounded maps, not a flat pile of recipes.
2. Collections MOC
Foundry Galaxy collection patterns map of content
A map-of-content routes the agent to the right collection operation.
3. Concrete recipe
Foundry relabel via rules and find replace pattern page
Leaf patterns preserve when-to-use guidance, pitfalls, and exemplar links.
Patterns stay human-readable; casts can package the same evidence as runtime references.

Structured Drafting of Workflows

Structured workflow draft diagram with concrete and deferred steps
Extracted workflow spine diagram showing concrete executable workflow steps

Because reproducibility is more important than ever and designing reproducibility infrastructure for humans supercharges agents,

Galaxy must meet this moment by doubling down on our values.

Thanks

Galaxy community
Nekrutenko lab
IWC
ToolShed contributors

Questions?

Extra Slides

Galaxy Notebooks

Galaxy Notebooks: Galaxy-flavored markdown attached to histories.

  • Embed datasets, collections, and interactive visualizations in prose (including drag and drop).
  • Referenced outputs seed workflow extraction (with reports).
  • AI assistant can read history context and draft sections (including MCP support).
  • Every revision attributed: user / agent / restore.
Builds on what used to be called “Pages”. New in 26.1.

Make the knowledge base executable

Build actionable skills from inspectable, typed, synchronized source material.
Schemas
typed workflow artifacts
Drafts, summaries, tool state, tests, provenance.
Upstream specs
strictly synchronized
gxformat2, Galaxy collection semantics, CWL, tool XML.
CLI manuals
commands as contracts
gxwf, Planemo, validator outputs, sidecar metadata.
Research + patterns
corpus-grounded moves
IWC examples, design rationale, when-to-use guidance.
Knowledge stays inspectable for humans.
Casts become executable for agents.
That is Foundry.

Workflow draft format

Concrete now
inputs, outputs, step set, producer → consumer edges, branches, when: guards
Deferred explicitly
TODO_* sentinels for tool IDs, versions, and wrapper-defined ports
Intent carried forward
_plan_* fields tell the implementation Mold what the source evidence supports.
class: GalaxyWorkflowDraft
inputs:
  reads:
    type: collection
    collection_type: list:paired
steps:
  align_reads:
    tool_id: TODO_mapper
    tool_version: TODO
    _plan_context: "map paired reads to the reference"
    _plan_in:
      reads: "paired input collection"
      reference: "selected genome"
    in:
      reads: reads
      reference: TODO_reference_port
outputs:
  aligned_bam:
    outputSource: align_reads/TODO_bam_output
A fully resolved draft promotes to ordinary gxformat2 with no translation layer.
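Because deferrals are explicit sentinels rather than omissions, finding what is still unresolved is a plain tree walk. The sketch below is an assumption about how such a check could work, not gxwf's algorithm; the second step, its tool ID, and its version are hypothetical.

```python
# A trimmed draft in the shape shown above, plus one hypothetical resolved step.
DRAFT = {
    "steps": {
        "align_reads": {
            "tool_id": "TODO_mapper",
            "tool_version": "TODO",
            "_plan_context": "map paired reads to the reference",
            "in": {"reads": "reads", "reference": "TODO_reference_port"},
        },
        "sort_bam": {
            "tool_id": "example_sorter",  # hypothetical resolved tool
            "tool_version": "1.0",
            "in": {"input": "align_reads/bam"},
        },
    }
}


def unresolved(value, path=""):
    """Yield dotted paths of TODO sentinels and leftover _plan_* planning fields."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            if key.startswith("_plan_"):
                yield child_path
            else:
                yield from unresolved(child, child_path)
    elif isinstance(value, str) and value.split("/")[-1].startswith("TODO"):
        # Also catches step/port references such as "align_reads/TODO_bam_output".
        yield path


def next_draft_step(draft):
    """Return the first step that still has unresolved fields, or None if the draft is concrete."""
    for name, step in draft["steps"].items():
        if next(unresolved(step), None) is not None:
            return name
    return None


print(next_draft_step(DRAFT))
print(list(unresolved(DRAFT["steps"]["align_reads"])))
```

A draft is concrete exactly when the walk yields nothing, which matches the claim that a fully resolved draft needs no translation layer.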

Draft tooling: validate in the loop

pick
gxwf draft-next-step
Deterministically identifies the next unresolved step.
resolve
discover or author
Find a ToolShed wrapper first; author a wrapper only on fallthrough.
implement
advance draft step
Fill tool_state, ports, IDs, versions, and remove planning fields.
validate
draft-validate --concrete
Schema errors route back to the responsible authoring phase.
while gxwf draft-next-step workflow.gxwf.yml says draft:
  invoke advance-galaxy-draft-step
gxwf draft-validate --concrete workflow.gxwf.yml
author → validate → fix, one workflow step at a time
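The control flow of that loop can be sketched with the three phases passed in as functions. The driver and the toy stand-ins below are hypothetical: they mimic the shape of the loop in memory and do not call gxwf or any agent.

```python
def run_draft_loop(draft, next_step, advance, validate, max_rounds=50):
    """Advance one drafty step per round, then validate once nothing is left to resolve."""
    for _ in range(max_rounds):
        step = next_step(draft)
        if step is None:
            return validate(draft)
        advance(draft, step)
    # Guard against an 'advance' that never actually resolves anything.
    raise RuntimeError("draft did not converge")


# Toy stand-ins for the pick, implement, and validate phases.
def next_step(draft):
    return next((name for name, step in draft.items() if step["tool_id"].startswith("TODO")), None)


def advance(draft, name):
    draft[name]["tool_id"] = f"resolved_{name}"


def validate(draft):
    return [name for name, step in draft.items() if step["tool_id"].startswith("TODO")]


draft = {
    "align_reads": {"tool_id": "TODO_mapper"},
    "call_peaks": {"tool_id": "TODO_caller"},
}
errors = run_draft_loop(draft, next_step, advance, validate)
print(errors, draft)
```

The round limit is a design choice worth keeping in any real harness: a deterministic picker plus a bounded loop means a stuck agent fails loudly rather than spinning.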

Agents need a tight workflow authoring loop

Format 2 turns workflow construction into small, checkable edits instead of one giant serialized Galaxy JSON guess.
The agent can draft a step, validate tool state, validate connections, repair the exact path, and continue.
gxwf validate workflow.gxwf.yml
gxwf validate workflow.gxwf.yml --connections
gxwf draft-validate --concrete workflow.gxwf.yml

# error path -> targeted repair -> repeat
designed for humans, supercharging agents