Reproducibility in the Age of Agents

GCC 2026

John Chilton, Marius van den Beek, Dannon Baker, David López, Ahmed Hamid Awan, Anton Nekrutenko

Designing for human reproducibility, accelerating it for everyone else.

AI is an ethical & environmental disaster that requires brilliant leadership across every level of society 😔

This slide is just John, not my co-authors. I am here because I need to eat.

Why, when it is easier than ever to write well-designed code, are we instead getting inundated with slop¹?

The answer is incentives.

¹ see @jmchilton’s PRs

Bioinformatics is following suit: the field is about to automate its own reproducibility crisis.

Galaxy MUST incentivize reproducibility for agents.

An Optimistic Thesis

Designing for human reproducibility supercharges agents.

Marius made the point on Monday: user-defined tools (UDTs) were in development before agentic use of Galaxy.

Built for humans, supercharging agents.

3 WIP Papers Support this Implicit Thesis

Paper 1

Galaxy Notebooks: Reproducible Communication and Narrative-Driven Workflow Extraction

Paper 2

Format 2 and gxwf: Schema-Aware Authoring and Validation of Galaxy Workflows

Paper 3

Galaxy Workflow Foundry: Compiling Curated Workflow Knowledge into Provenanced Agent Skills

Designing for human reproducibility supercharges agents.

Paper 1 of 3

Galaxy Notebooks: Reproducible Communication and Narrative-Driven Workflow Extraction

jmchilton.github.io/galaxy-brain/papers/galaxy-notebooks/

Example 1: Mobile Resistome

Backward extraction from a four-isolate Staphylococcus aureus notebook recovered a 14-step, sample-agnostic workflow.

8 collection map-over steps
9 workflow outputs
0 dangling inputs
byte-identical re-run across all four isolates

BioProject PRJDB8599 · analysis through heatmaps

History → Notebook → Workflow

History

computational record

datasets, tools, parameters, provenance

→

Notebook

communicative record

why it mattered, which outputs count, what to report

→

Workflow

reusable graph

backward closure from referenced artifacts
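The "backward closure" idea above can be sketched as a graph walk. This is a minimal illustration, not Galaxy's extraction code: the history, dataset names, and job names below are all hypothetical, and the history is reduced to two lookup tables.

```python
from collections import deque

# Hypothetical history: each dataset maps to the job that produced it,
# and each job lists the datasets it consumed. All names are illustrative.
PRODUCED_BY = {
    "trimmed_reads": "trim",
    "alignments": "align",
    "scratch_plot": "quick_look",
    "amr_table": "annotate",
    "heatmap": "plot_heatmap",
}
JOB_INPUTS = {
    "trim": ["raw_reads"],
    "align": ["trimmed_reads", "reference"],
    "quick_look": ["alignments"],
    "annotate": ["alignments"],
    "plot_heatmap": ["amr_table"],
}


def backward_closure(referenced):
    """Return (jobs, inputs) needed to regenerate the referenced datasets."""
    jobs, inputs, seen = set(), set(), set()
    queue = deque(referenced)
    while queue:
        dataset = queue.popleft()
        if dataset in seen:
            continue
        seen.add(dataset)
        job = PRODUCED_BY.get(dataset)
        if job is None:
            # No producing job: treat the dataset as a workflow input.
            inputs.add(dataset)
            continue
        jobs.add(job)
        queue.extend(JOB_INPUTS[job])
    return jobs, inputs


# The notebook references only the heatmap, so the scratch step is left out.
jobs, inputs = backward_closure(["heatmap"])
print(sorted(jobs))    # ['align', 'annotate', 'plot_heatmap', 'trim']
print(sorted(inputs))  # ['raw_reads', 'reference']
```

Starting from the artifacts the notebook names, rather than from everything in the history, is what keeps scratch steps such as `quick_look` out of the extracted workflow.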

Example 2: Differential ATAC-seq

Rendered notebook bits: PDF outputs and tabular outputs stay on-graph.

Count matrix + sample sheet → DESeq2 → NA-filter → volcano ∥ significance filter → sort → top gained / lost peaks.

9 extracted steps from a 590,650-peak universe
6 workflow outputs, 0 dangling inputs, 0 report repairs
45,620 significant peaks reproduced
34,873 B-cell-gained / 10,747 erythroblast-gained

Corces 2016 ATAC-seq atlas · erythroblast vs B-cell · hg19
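As a quick consistency check on the numbers above, the two directional counts should add up to the significant-peak total. The variable names are ours; the values are the ones quoted on the slide.

```python
# Values quoted on the slide.
total_peaks = 590_650
significant = 45_620
b_cell_gained = 34_873
erythroblast_gained = 10_747

# The two directions partition the significant set.
assert b_cell_gained + erythroblast_gained == significant

print(f"significant fraction: {significant / total_peaks:.1%}")  # 7.7%
print(f"B-cell-gained share: {b_cell_gained / significant:.1%}")  # 76.4%
```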

Extraction Is Easier

Before

A human reverse-engineers a busy history: jobs, outputs, branches, and connections.

Now

The notebook already names the outputs that matter; Galaxy walks the graph backward.

After

The extracted workflow starts with the story attached: the report is seeded from the notebook.

Don’t just make the analysis reproducible, make the communication of the analysis reproducible.

Skill: reproduciblify

Don’t write the notebook and then extract the workflow by hand. Let an agent rebuild the history so extraction works.

start

messy real history

Manual uploads, pasted figures, scratch steps, and one-sample-at-a-time structure.

agent work

rebuild inside Galaxy

Find existing tools first, create tools only as fallback, and restructure with collections.

finish

notebook extracts cleanly

Only real Galaxy outputs become notebook anchors, so extraction can recover the reusable workflow.

galaxyproject/galaxy-skills#29

Skill: workflow reports

After a workflow runs, turn the invocation into a report with the same reproducibility guarantees.

input

workflow definition

Read a local .ga file or a Galaxy workflow download with labels and marked outputs.

agent work

draft report markdown

Use Galaxy directives for inputs, outputs, images, tables, workflow diagrams, and history links.

result

run-specific report

The same template resolves against each invocation, keeping analysis communication reproducible.

galaxyproject/galaxy-skills#14
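The claim that one template resolves against each invocation can be illustrated with a toy resolver. The `{{output:LABEL}}` placeholder syntax, the invocation dictionaries, and the dataset ids here are invented for the sketch; they are not Galaxy's report directive syntax or API.

```python
import re

# Hypothetical placeholder syntax, NOT Galaxy's directive syntax.
TEMPLATE = "Volcano plot: {{output:volcano}}\nTop peaks: {{output:top_gained}}"


def resolve(template, invocation):
    """Replace each {{output:LABEL}} with this invocation's dataset id for LABEL."""

    def lookup(match):
        # A label missing from the invocation raises KeyError, so a dangling
        # reference fails loudly instead of rendering an empty report.
        return invocation["outputs"][match.group(1)]

    return re.sub(r"\{\{output:(\w+)\}\}", lookup, template)


# Two hypothetical runs of the same workflow.
run_a = {"outputs": {"volcano": "dataset-a1", "top_gained": "dataset-a2"}}
run_b = {"outputs": {"volcano": "dataset-b1", "top_gained": "dataset-b2"}}

print(resolve(TEMPLATE, run_a))
print(resolve(TEMPLATE, run_b))
```

The template refers to outputs by workflow label rather than by dataset id, which is why the same text can be reused for every run.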

Paper 2 of 3

Format 2 and gxwf: Schema-Aware Authoring and Validation of Galaxy Workflows

jmchilton.github.io/galaxy-brain/papers/gxwf/

nf-core/rnaseq

Galaxy RNA-seq workflow graph with arrows highlighting parameters and connections

Format 2 + gxwf

Every workflow system validates something. Galaxy can validate the scientific tool invocation itself.

native .ga

machine serialization

"tool_state": "{
  \"reference_source\": {
    \"reference_source_selector\": \"cached\",
    \"ref_file\": \"hg38\"
  },
  \"output_sort\": \"coordsorted\"
}"

Format 2

human + agent authoring

state:
  reference_source:
    reference_source_selector: cached
    ref_file: hg38
  output_sort: coordsorted

10,000+ ToolShed-served typed parameter schemas

Names, types, select options, conditionals, collections

gxwf validates offline in milliseconds
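The difference between the two encodings shown above can be made concrete with the standard `json` module. This sketch assumes the single level of string encoding shown on the slide; real `.ga` files may nest encodings differently, and the surrounding step dictionaries are simplified stand-ins.

```python
import json

# Native .ga style: the tool state is a JSON document stored inside a string.
native_step = {
    "tool_state": json.dumps(
        {
            "reference_source": {
                "reference_source_selector": "cached",
                "ref_file": "hg38",
            },
            "output_sort": "coordsorted",
        }
    )
}

# Format 2 style: the same state as plain nested data.
format2_step = {
    "state": {
        "reference_source": {
            "reference_source_selector": "cached",
            "ref_file": "hg38",
        },
        "output_sort": "coordsorted",
    }
}

# The native form must be decoded before any key can be inspected or edited.
decoded = json.loads(native_step["tool_state"])
print(type(native_step["tool_state"]).__name__)  # str
print(decoded == format2_step["state"])  # True
print(format2_step["state"]["reference_source"]["ref_file"])  # hg38
```

An author, human or agent, editing the native form has to get the escaping right on every change; the nested form removes that failure mode.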

gxwf: one contract, two runtimes

A shared workflow-state specification and fixture suite keep the Python and TypeScript implementations converging.

TypeScript / npm

published now

@galaxy-tool-util/* packages feed the CLI, web UI, reports, and VS Code extension.

Python / Galaxy

coming soon

Pydantic report models and Galaxy-side workflow-state tooling are staged for integration.

Galaxy PR #22996

Shared truth

spec + tests

OpenAPI contracts, report-model JSON shapes, and declarative YAML fixtures keep the implementations honest.

Same reports, same validation concepts, different hosts: CLI, web operations, Galaxy, and VS Code.

Published CLI + docs

Screenshot: galaxy-tool-util Getting Started documentation showing `npm install -g @galaxy-tool-util/cli`

Screenshot: galaxy-tool-util Workflow Operations documentation showing the gxwf validation, cleaning, linting, conversion, and roundtrip commands

Tool state, validated

VS Code editing a Format 2 workflow with an invalid MACS2 format value flagged in the Problems panel

An illegal select value, caught before a single job runs — with the legal options in the message:

state:
  format: BAMX   # typo

$ gxwf validate wf.gxwf.yml
[0] call_peaks  FAIL
  format:
    expected "BAM" | "BAMPE" | "BED",
    actual "BAMX"

Names, types, select options, conditionals — same schema the Galaxy UI uses. Diagnostics are structured (path + category), so an agent fixes in a tight loop instead of waiting on a failed job.
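A structured diagnostic of the kind described above might look like the following. This is a sketch under stated assumptions, not gxwf's implementation: the `Diagnostic` class, the `invalid_select_value` category name, and the flat `SELECT_OPTIONS` table are hypothetical; only the option list is taken from the error message on the slide.

```python
from dataclasses import dataclass

# Hypothetical schema fragment: parameter path -> legal select options.
# The option list mirrors the error message on the slide.
SELECT_OPTIONS = {"format": ("BAM", "BAMPE", "BED")}


@dataclass(frozen=True)
class Diagnostic:
    path: str        # which parameter is wrong
    category: str    # what kind of problem it is (hypothetical category name)
    expected: tuple  # legal values, so the fix needs no extra lookup
    actual: str


def validate_selects(state, options=SELECT_OPTIONS):
    """Return one Diagnostic per select parameter holding an illegal value."""
    diagnostics = []
    for path, legal in options.items():
        value = state.get(path)
        if value is not None and value not in legal:
            diagnostics.append(Diagnostic(path, "invalid_select_value", legal, value))
    return diagnostics


problems = validate_selects({"format": "BAMX"})
for p in problems:
    print(f"{p.path}: expected {' | '.join(p.expected)}, actual {p.actual}")
# format: expected BAM | BAMPE | BED, actual BAMX
```

Because the diagnostic carries the path and the legal options as data rather than prose, a caller can act on it without parsing an error string.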

Connections, validated

Connections aren’t just producer→consumer links — they carry Galaxy’s collection algebra.

$ gxwf validate pe-artic-variation.ga \
      --connections

Tool state: 25 validated, 0 skipped
Connections: OK — 46 ok, 0 invalid, 0 skip

A list wired into a single-dataset input implies map-over; an incompatible depth (e.g. list:paired → paired with no flatten) is rejected statically.

galaxy-workflows-vscode

An IDE for Galaxy workflows

Full .ga + gxformat2 coverage

Native Galaxy workflows and Format 2 workflows both get a real editor experience.

Schema-aware

Validation, hover docs, IntelliSense, formatting, outline, diagrams.

Thank you, David

Huge, repeated thanks to David López for building the extension that makes these IDE demos real.

github.com/davelopez/galaxy-workflows-vscode

IDE work: find the tool

VS Code ToolShed search results for FastQC tools

ToolShed search inside VS Code resolves a human query to a versioned Galaxy tool.

IDE work: complete state keys

VS Code completing valid Format 2 state keys from a Galaxy tool schema

The extension completes Format 2 parameter names from the selected tool schema, including nested state.

IDE work: complete legal values

VS Code completing legal Format 2 select values for a Galaxy tool parameter

Select parameters expose their enum values in-place, before validation or execution.

Format 2 is good for agents because it is good for humans

More intuitive

YAML names, nested state, inputs, outputs, and steps read like the workflow humans already discuss.

Less context

The schema carries tool IDs, legal values, connection shape, and validation categories so prompts do not need to.

Robust tooling

CLI, IDE, browser, and Galaxy validation all report against the same typed workflow surface.

designed for humans, supercharging agents

Paper 3 of 3

Galaxy Workflow Foundry: Compiling Curated Workflow Knowledge into Provenanced Agent Skills

jmchilton.github.io/galaxy-brain/papers/foundry/

Even when we ask agents to build workflows,

designing for humans supercharges agents.

Skills are the wrong source of truth

A hand-authored conversion skill works until the ecosystem moves.

Context flooding

Every run drags in conditionals, collections, tests, wrappers, and caveats.

Brittle composition

Paper, Nextflow, CWL, and interview paths duplicate the same workflow moves.

Prose caveats

“Remember to validate” is weaker than a schema and a command that must pass.

Compressed evidence

Corpus examples and design rationale get summarized until they stop being auditable.

Runtime captivity

A skill written for one agent surface does not become portable by hoping.

No human-scrutable source

When the skill is wrong, there is no richer upstream artifact to inspect and fix.

The skill can run; the maintainer still needs to audit why it says what it says.

Galaxy Workflow Foundry

Make the human-centered knowledge base agent executable.

galaxyproject.github.io/foundry

Pipelines are journeys

A pipeline is not a giant prompt. It is an ordered Mold sequence with visible handoffs.

source-specific summary
target-specific design briefs
corpus comparison
draft implementation loop
tests, validation, execution, debug

Pipelines make the journey browseable for humans and executable for harnesses.

Interview → Galaxy

A conversation becomes a typed workflow draft, then a validated workflow.

Normalize an interview into a shared freeform summary.

Design Galaxy interface and data flow, then compare to IWC exemplars.

Loop over advance-galaxy-draft-step until no draft step remains.

Validate, test, execute, and debug with deterministic tooling in the loop.

The same analysis, two directions

Example 2 extracted differential ATAC-seq from a completed run. A Foundry pipeline constructed the same analysis from the same initial prompt — no history, no execution.

%%{init: {'theme':'base','themeVariables':{'fontFamily':'Atkinson Hyperlegible','primaryColor':'#25537b','primaryTextColor':'#ffffff','primaryBorderColor':'#2c3143','lineColor':'#58585a','fontSize':'15px'}}}%%
graph LR
  input_0>"ATAC counts"]
  input_1>"sample metadata"]
  step_0["DESeq2 differential test"]
  step_1["Clean table (NA filter)"]
  step_2["Volcano plot"]
  step_3["Filter significant peaks"]
  step_4["Sort by log2FC"]
  step_5["Top gained peaks"]
  step_6["Top lost peaks"]
  input_0 --> step_0
  input_1 --> step_0
  step_0 --> step_1
  step_1 --> step_2
  step_1 --> step_3
  step_3 --> step_4
  step_4 --> step_5
  step_4 --> step_6
  classDef input fill:#edf4fa,stroke:#25537b,color:#2c3143;
  classDef core fill:#25537b,stroke:#2c3143,color:#ffffff;
  class input_0,input_1 input;
  class step_0 core;

count matrix + sample sheet → DESeq2 → NA-clean → volcano ∥ filter → sort → top gained / lost
every step a real, version-pinned Galaxy tool
tool_state schema-validated offline by gxwf

Diagram emitted by gxwf mermaid from the pipeline’s output workflow.

Extraction needs a completed run; construction needs only the intent. Both land on the same kind of validated, reproducible workflow.

Patterns are the reusable moves

1. Patterns MOC

Start from corpus-grounded maps, not a flat pile of recipes.

2. Collections MOC

A map-of-content routes the agent to the right collection operation.

3. Concrete recipe

Leaf patterns preserve when-to-use guidance, pitfalls, and exemplar links.

Patterns stay human-readable; casts can package the same evidence as runtime references.

Structured Drafting of Workflows

Structured workflow draft diagram with concrete and deferred steps

Extracted workflow spine diagram showing concrete executable workflow steps

Because reproducibility is more important than ever and designing reproducibility infrastructure for humans supercharges agents,

Galaxy must meet this moment by doubling down on our values.

Thanks

Galaxy community
Nekrutenko lab
IWC
ToolShed contributors

Questions?

Extra Slides

Galaxy Notebooks

Galaxy Notebooks: Galaxy-flavored markdown attached to histories.

Embed datasets, collections, and interactive visualizations in prose (including drag and drop).
Referenced outputs seed workflow extraction (with reports).
AI assistant can read history context and draft sections (including MCP support).
Every revision attributed: user / agent / restore.

Builds on what used to be called “Pages”. New in 26.1.

Make the knowledge base executable

Build actionable skills from inspectable, typed, synchronized source material.

Schemas

typed workflow artifacts

Drafts, summaries, tool state, tests, provenance.

Upstream specs

strictly synchronized

gxformat2, Galaxy collection semantics, CWL, tool XML.

CLI manuals

commands as contracts

gxwf, Planemo, validator outputs, sidecar metadata.

Research + patterns

corpus-grounded moves

IWC examples, design rationale, when-to-use guidance.

Knowledge stays inspectable for humans.

→

Casts become executable for agents.

That is Foundry.

Workflow draft format

Concrete now

inputs, outputs, step set, producer → consumer edges, branches, when: guards

Deferred explicitly

TODO_* sentinels for tool IDs, versions, and wrapper-defined ports

Intent carried forward

_plan_* fields tell the implementation Mold what the source evidence supports.

class: GalaxyWorkflowDraft
inputs:
  reads:
    type: collection
    collection_type: list:paired
steps:
  align_reads:
    tool_id: TODO_mapper
    tool_version: TODO
    _plan_context: "map paired reads to the reference"
    _plan_in:
      reads: "paired input collection"
      reference: "selected genome"
    in:
      reads: reads
      reference: TODO_reference_port
outputs:
  aligned_bam:
    outputSource: align_reads/TODO_bam_output

A fully resolved draft promotes to ordinary gxformat2 with no translation layer.
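To show what "deferred explicitly" buys, the sketch below walks the draft from this slide and lists every path that still holds a sentinel or a planning field. It is an illustration only: the draft is written as a plain dict to avoid a YAML dependency, the `unresolved` helper is hypothetical, and treating any string containing `TODO` as deferred is our assumption (the slide uses both `TODO_*` and a bare `TODO`).

```python
# The draft from the slide, written as a plain dict so no YAML parser is needed.
draft = {
    "class": "GalaxyWorkflowDraft",
    "inputs": {"reads": {"type": "collection", "collection_type": "list:paired"}},
    "steps": {
        "align_reads": {
            "tool_id": "TODO_mapper",
            "tool_version": "TODO",
            "_plan_context": "map paired reads to the reference",
            "_plan_in": {
                "reads": "paired input collection",
                "reference": "selected genome",
            },
            "in": {"reads": "reads", "reference": "TODO_reference_port"},
        }
    },
    "outputs": {"aligned_bam": {"outputSource": "align_reads/TODO_bam_output"}},
}


def unresolved(node, path=""):
    """Yield dotted paths that hold a TODO sentinel or a _plan_* planning field."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key.startswith("_plan_"):
                yield child  # planning fields must be removed before promotion
            else:
                yield from unresolved(value, child)
    elif isinstance(node, str) and "TODO" in node:
        yield path  # assumption: any value containing TODO is deferred


for p in unresolved(draft):
    print(p)
```

An empty result is the condition under which, in this sketch, a draft would count as fully resolved.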

Draft tooling: validate in the loop

pick

gxwf draft-next-step

Deterministically identifies the next unresolved step.

resolve

discover or author

Find a Tool Shed wrapper first; author a wrapper only on fallthrough.

implement

advance draft step

Fill tool_state, ports, IDs, versions, and remove planning fields.

validate

draft-validate --concrete

Schema errors route back to the responsible authoring phase.

while gxwf draft-next-step workflow.gxwf.yml says draft:
invoke advance-galaxy-draft-step
gxwf draft-validate --concrete workflow.gxwf.yml

author → validate → fix, one workflow step at a time
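The control flow of that loop can be sketched without calling any real tooling. Everything here is a stand-in: `next_draft_step`, `make_advance`, `validate`, and the `example_*` tool ids are hypothetical, and the simulated agent deliberately gets one value wrong on its first try so the repair path is exercised.

```python
LEGAL_FORMATS = ("BAM", "BAMPE", "BED")  # options from the validation slide


def next_draft_step(steps):
    """Return the first step whose tool_id is still a TODO sentinel, else None."""
    for name, step in steps.items():  # dicts keep insertion order: deterministic
        if step["tool_id"].startswith("TODO"):
            return name
    return None


def make_advance():
    """Stand-in for the agent; its first try at call_peaks is deliberately wrong."""
    attempts = {}

    def advance(name, step):
        attempts[name] = attempts.get(name, 0) + 1
        step["tool_id"] = f"example_{name}"  # hypothetical tool id
        if name == "call_peaks":
            step["format"] = "BAMX" if attempts[name] == 1 else "BAM"

    return advance


def validate(step):
    """Return a list of error strings; an empty list means the step is valid."""
    if step.get("format", "BAM") not in LEGAL_FORMATS:
        return ["format: illegal select value"]
    return []


def run_loop(steps, advance, max_rounds=10):
    history = []
    rounds = 0
    while (name := next_draft_step(steps)) is not None:
        if rounds == max_rounds:
            raise RuntimeError("draft did not converge")
        rounds += 1
        advance(name, steps[name])
        errors = validate(steps[name])
        history.append((name, errors))
        if errors:
            steps[name]["tool_id"] = "TODO_retry"  # send the step back for repair
    return history


steps = {
    "align_reads": {"tool_id": "TODO_mapper"},
    "call_peaks": {"tool_id": "TODO_caller"},
}
history = run_loop(steps, make_advance())
for name, errors in history:
    print(name, errors or "ok")
```

The round limit is a design choice of the sketch: a loop driven by an agent needs some bound so that a step that never validates ends in an error rather than an endless retry.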

Agents need a tight workflow authoring loop

Format 2 turns workflow construction into small, checkable edits instead of one giant serialized Galaxy JSON guess.

The agent can draft a step, validate tool state, validate connections, repair the exact path, and continue.

gxwf validate workflow.gxwf.yml
gxwf validate workflow.gxwf.yml --connections
gxwf draft-validate --concrete workflow.gxwf.yml

# error path -> targeted repair -> repeat

designed for humans, supercharging agents
