Multi-agent orchestration · the flagship system

The orchestration layer that keeps autonomous agents safe, observable, and cheap.

Any autonomous system is really two loops. The work loop does the actual job — designing experiments, running a business, generating a product. The orchestration loop keeps a fleet of agents and tools safe, observable, auditable, and cost-bounded around the clock. I've built a production-grade orchestration loop. It already runs unattended, and it generalizes to wherever your work loop lives.

How it works

Design → govern → ship → repeat.

Every unit of work flows through the same governed spine. Agents do the work, a hard gate stands between execution and production, and a nightly self-improvement loop feeds what it learns back to the start.

[Pipeline diagram: INTAKE (work arrives) → CONTRACT (DoD + acceptance) → EXECUTE (agent does work) → EXPERT PANEL (4-lens QA) → PRODUCTION (shipped, gated). Hard gate: kill-switch + rollback. Dream Cycle: nightly self-improvement.]
One pipeline, every task, fully auditable: the governance spine.

The flagship

Finnick / Hermes — a self-improving multi-agent OS.

A production agent platform on dedicated infrastructure: roughly 30 engines, 5 concurrent learning loops, a curated model-routing gateway, a self-healing escalation bridge, and a nightly "Dream Cycle" that reviews the day and proposes its own improvements. It's designed so a human works about 10 hours a week while the agents carry the load.

Governed at scale: 259 automation surfaces with 99.6% kill-switch coverage, an enforced $250/mo cost cap across 49 model routes, and a real platform migration (OpenClaw → Hermes) executed with a SHA-verified rollback and post-migration soak validation. This is SRE-grade change management, not a notebook.

E19 · SPC

Drift & regression detection

CUSUM and a windowed z-test from industrial process control, with auto-rollback — the right tool for instruments and models that drift over long campaigns.
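As an illustration only, and not the E19 engine itself, here is a minimal two-sided tabular CUSUM in Python. The class name `Cusum`, the helper `first_alarm`, and the defaults `k=0.5` and `h=5.0` (common textbook starting points, in sigma units) are assumptions for this sketch. The baseline mean and standard deviation are assumed to be known from an in-control window. The windowed z-test and the rollback action are left out; the caller decides what to do when an alarm fires.

```python
from dataclasses import dataclass


@dataclass
class Cusum:
    """Two-sided tabular CUSUM on standardized observations (sketch)."""
    target: float      # in-control mean, assumed known from a baseline window
    sigma: float       # in-control standard deviation, assumed known
    k: float = 0.5     # slack per sample, in sigma units (assumed default)
    h: float = 5.0     # alarm threshold, in sigma units (assumed default)
    hi: float = 0.0    # accumulated upward drift
    lo: float = 0.0    # accumulated downward drift

    def update(self, x: float) -> bool:
        """Fold in one observation; return True when drift is signalled."""
        z = (x - self.target) / self.sigma
        self.hi = max(0.0, self.hi + z - self.k)
        self.lo = max(0.0, self.lo - z - self.k)
        return self.hi > self.h or self.lo > self.h


def first_alarm(values, target, sigma):
    """Index of the first sample that raises an alarm, or None."""
    detector = Cusum(target, sigma)
    for i, x in enumerate(values):
        if detector.update(x):
            return i
    return None
```

With these defaults, a sustained two-sigma shift accumulates 1.5 per sample, so it is flagged on the fourth shifted sample, while a series that sits on target never alarms.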

ESCALATION

Self-healing recovery

Failures become teach-back lessons, so the same failure class stops recurring. Unattended recovery, not just alerting.

E26

Stability & hallucination

12-dimension behavioral stability index; content-type-specific hallucination thresholds.

CONDUCTOR / E20

Bandit model routing

Thompson sampling with a Wilson lower-bound confidence gate before any model can dominate.
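One way to read that combination, sketched in Python under stated assumptions: each route keeps a success/failure count, Thompson sampling draws from a Beta posterior per route, and the winner is only exploited outright once its Wilson lower bound clears a floor. The names `choose_route`, `floor`, and `explore`, the Beta(1, 1) prior, and the fallback to uniform exploration are all hypothetical choices for this sketch, not the Conductor's actual policy.

```python
import math
import random


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a success rate."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def choose_route(stats, rng, floor=0.6, explore=0.2):
    """Pick a route from stats = {route: (successes, failures)}.

    Thompson step: sample each route's Beta posterior and take the best draw.
    Gate step (assumed): if the winner's Wilson lower bound is below `floor`,
    divert a fraction `explore` of picks to a uniformly random route so an
    under-sampled route cannot take over on a lucky streak.
    """
    draws = {r: rng.betavariate(s + 1, f + 1) for r, (s, f) in stats.items()}
    best = max(draws, key=draws.get)
    s, f = stats[best]
    if wilson_lower_bound(s, s + f) >= floor:
        return best
    if rng.random() < explore:
        return rng.choice(sorted(stats))
    return best
```

The gate matters most early on: a route with one success in one trial has a raw success rate of 100% but a Wilson lower bound of roughly 0.21, so it stays below a 0.6 floor until more evidence arrives.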

MEMORY

Tiered memory + provenance

TTL-promoted tiers, write-time summarization, archive-not-delete, and a no-silent-drop guarantee on close.

DREAM CYCLE

Nightly self-improvement

Eight scheduled phases that consolidate, recombine, score, and propose improvements. Each phase is a failure-isolated unit, so one failing phase does not take down the rest.

E28

Cron-time orchestrator

Spreads fire times to prevent top-of-hour contention across the agent fleet.
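A common way to spread fire times, shown here as a sketch rather than as E28's actual scheduler, is to derive a stable offset from a hash of the job's identifier. The function names and the job identifier format below are assumptions, and the example covers hourly jobs only.

```python
import hashlib


def spread_minute(job_id: str, window_minutes: int = 60) -> int:
    """Deterministic minute offset for a job, derived from its identifier."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_minutes


def hourly_cron(job_id: str) -> str:
    """Cron expression that fires once an hour at the job's own minute."""
    return f"{spread_minute(job_id)} * * * *"
```

Because the offset depends only on the identifier, a job keeps the same slot across restarts, and a large fleet lands roughly evenly across the hour instead of piling up at minute zero.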

ROI

Cost accountability

Every task carries its AI cost vs. a role-matched human baseline, Bayesian-calibrated per job type.
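To make the idea concrete, here is one possible shape for that calculation, written as a hedged sketch. It assumes the human baseline for a job type is an estimate of hours, updated with a simple normal-normal Bayesian model as real tasks are observed. The class `HumanBaseline`, the function `task_roi`, the variance defaults, and the hourly rate are all hypothetical; the production calibration may differ.

```python
from dataclasses import dataclass, field


@dataclass
class HumanBaseline:
    """Normal-normal estimate of human hours for one job type (sketch)."""
    prior_mean_hours: float
    prior_var: float = 1.0   # assumed uncertainty of the initial guess
    obs_var: float = 1.0     # assumed noise of each observed task
    observed_hours: list = field(default_factory=list)

    def posterior_mean_hours(self) -> float:
        """Precision-weighted blend of the prior and the observed hours."""
        n = len(self.observed_hours)
        precision = 1.0 / self.prior_var + n / self.obs_var
        weighted = (self.prior_mean_hours / self.prior_var
                    + sum(self.observed_hours) / self.obs_var)
        return weighted / precision


def task_roi(ai_cost_usd: float, baseline: HumanBaseline,
             hourly_rate_usd: float) -> float:
    """Human-equivalent cost divided by the AI cost of one task."""
    if ai_cost_usd <= 0:
        raise ValueError("ai_cost_usd must be positive")
    return baseline.posterior_mean_hours() * hourly_rate_usd / ai_cost_usd
```

With a prior of 2 hours and two observed tasks of 4 hours each, the calibrated baseline moves to 10/3 hours, so a single early guess does not fix the ratio for that job type.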

Where this orchestration layer applies

One backbone, three places it's already proven.

The orchestration loop doesn't care what the work loop does. It's running today across three very different work loops.

A self-driving lab

Autonomous experimentation

Closed-loop experiment design needs scheduling, drift detection, provenance, and cost governance wrapped around it. That scaffolding is exactly what this OS provides — mapped in detail below.

A running business

The NPE AI Gateway

The same backbone applied to a working company: routing client interactions, wiring tools together, and gating every action with a hard cost cap.

Your system

Wherever agents run unattended

Anywhere you have agents that must run around the clock and stay trustworthy, this is the layer that makes that safe.

Use case — the solid-state chemistry lab

The orchestration layer your science loop will need.

Keep the robotics running, schedule around blocked steps, catch drift over long campaigns, and keep every autonomous action auditable and cost-bounded. That layer is what this OS already runs in production — the scaffolding that wraps an experiment-design module, not a replacement for it.

Autonomous-lab need | Pattern already in production
Silent drift in instruments or models over long campaigns | CUSUM / SPC regression detection + 12-dimension stability index + auto-rollback
Robotics uptime and recovering from failures unattended | Self-healing teach-back + failure-class circuit breaker
Experiments blocked on reagents, furnace time, prior results | Deferred lane with automated revisit, built for exactly this
Provenance and auditability for a regulated environment | Operating contracts + machine-readable drift audit trail + archive-not-delete memory
Cost accountability for autonomous compute and instrument time | Per-task ROI economics, Bayesian-calibrated per job type
Collecting and routing data across instruments and agents, 24/7 | Multi-agent orchestration + three-surface event routing + slim per-job toolsets
Trusting autonomous analysis, with no fabricated conclusions | Confidence-clamping (proven in OSINT) + type-specific hallucination thresholds
Scaffolding around experiment design: scheduling, gating, provenance | Dream Cycle + learning loops wrap the design module; they don't replace it
Keeping a self-modifying loop safe and human-supervisable | Compliance gates + tamper-evident self-modification + "one more expert before human"

Straight talk

What I own, and what I'd partner on

I own the orchestration and reliability layer: multi-agent scheduling, drift/SPC monitoring, unattended recovery, provenance and audit trails, cost governance, and the safety-gating of software actions. I'd partner on the science: closed-loop Bayesian / active-learning experiment design (e.g. BoTorch / Ax, DFT-informed priors), instrument integration (SiLA 2, OPC-UA, lab drivers), and physical safety interlocks — thermal, atmospheric, collision — which are a hardware-authority domain I'd integrate with, never replace. Knowing that edge precisely is the point.

The pattern register

33 production patterns, benchmarked against the literature.

Each pattern below runs in production, and each is set against the closest published work with the specific distinction noted. Search it, filter it, open any card to see the prior art and where it lives.

Pattern 05 · Reliability

CUSUM / SPC regression detection

Distinction: industrial process-control SPC, calibrated for agentic cadence, with auto-rollback — the natural tool for instruments and models that drift over long campaigns.

Closest prior art: CUSUM is mature in manufacturing QC; rarely applied to LLM-agent performance.

Pattern 10 · Governance

Escalation teach-back — a downward KPI

Distinction: escalation count is an explicit success metric that must trend down; every resolved failure teaches the system so the class stops recurring. The goal is "never fires again."

Closest prior art: alert systems resolve alerts; few target alert-class elimination as a KPI.

Pattern 28 · Economics

Per-task ROI, calibrated per job type

Distinction: every task's AI cost is compared to a role-matched human baseline and Bayesian-calibrated per job type, so high-variance work doesn't hide inside an average.

Closest prior art: cost trackers and t-shirt sizing exist; rarely combined into a live per-task metric.

How I gate production AI builders

The agents are disciplined because the system makes them.

No agent touches production on vibes. Every unit of work passes through a contract, a quality panel, and a hard gate — enforced at the data layer, not by reminder.

01 · CONTRACT

Operating spec

Role / constraints / tools / workflow / hard-stops, re-read verbatim before every run.

02 · SPEC GATE

Definition-of-Done

No task is claimable until its done, acceptance, and validation fields are populated. The claim query itself enforces this, so an incomplete task never shows up as available work.
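A minimal sketch of what query-level enforcement can look like, using Python's built-in `sqlite3`. The table name `tasks`, its columns, and the `status` values are assumptions made for illustration; the real schema is not shown on this page.

```python
import sqlite3

# Only open tasks with all three spec fields filled in are returned.
# COALESCE + TRIM treats NULL and whitespace-only values as missing.
CLAIMABLE_SQL = """
SELECT id FROM tasks
WHERE status = 'open'
  AND TRIM(COALESCE(done, '')) <> ''
  AND TRIM(COALESCE(acceptance, '')) <> ''
  AND TRIM(COALESCE(validation, '')) <> ''
ORDER BY id
"""


def claimable_ids(conn: sqlite3.Connection) -> list:
    """IDs an agent is allowed to claim right now."""
    return [row[0] for row in conn.execute(CLAIMABLE_SQL)]


def demo() -> list:
    """Build a throwaway in-memory table and run the gate query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
        "done TEXT, acceptance TEXT, validation TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?, ?, ?)",
        [
            (1, "open", "report merged", "reviewer signs off", "tests pass"),
            (2, "open", "report merged", None, "tests pass"),   # no acceptance
            (3, "open", "   ", "reviewer signs off", "tests pass"),  # blank done
            (4, "claimed", "report merged", "reviewer signs off", "tests pass"),
        ],
    )
    return claimable_ids(conn)
```

In this demo only task 1 is claimable. Because the filter lives in the query, an agent cannot skip the check by forgetting a step: under-specified work is simply never returned.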

03 · REVIEW

Expert brief → verdict

A structured brief, then independent expert-lens verdicts before any production write.

04 · QA

Review panel

A fresh expert panel re-verifies work before it is ever surfaced to a human.

05 · SAFETY

Kill-switch + rollback

Software-write gate, kill-switch manifest, and a literal rollback command on every change.

Behavioral discipline as doctrine

14 named traits (among them disciplined, precise, honest, and follow-through), inherited by every agent and checked in QA. Consistency without fine-tuning.

Reliability is the product

Escalation count is tracked as a number that must trend down; every resolved failure teaches the system so the class stops recurring.

Let's pair my orchestration layer with your science.

I'd welcome the chance to compare notes — where the orchestration, reliability, and observability layers are headed, and where these patterns could accelerate the build.