Claude Code Skill

Forge

Design multi-agent systems that sustain quality over hours of autonomous work. Generator-evaluator loops, sprint contracts, grading criteria, and structured handoffs — distilled from Anthropic's frontier research into a single, opinionated skill.

3 Architecture patterns · 9 Design workflow steps · 6 Research sources
# Install the skill
$ git clone https://github.com/omegalens/forge.git ~/.claude/skills/forge
Every component in a harness encodes an assumption about what the model can't do on its own. Those assumptions deserve stress-testing — they may be wrong, and they go stale as models improve.
The foundational principle of Forge, derived from Anthropic's harness design research
Core Patterns

Three architectures, one decision framework

Your failure modes determine your architecture. Forge provides concrete patterns for each, with decision guides, example prompts, and simplification paths as models improve.

Two-Agent
Initializer + Worker
Structured handoffs between context windows. The initializer expands the prompt into a spec and feature list; the worker implements one feature per session, leaving clean artifacts for the next.
Use when: Context degradation is the bottleneck. Quality within a single session is fine, but work degrades across boundaries.
Generator-Evaluator
GAN-Inspired Feedback Loop
Separating creation from judgment. The generator produces output; a calibrated evaluator grades it against concrete criteria. 5–15 iterations drive convergence toward quality the generator can't self-assess.
Use when: Output quality is subjective or self-evaluation is unreliable. Design, writing, creative work.
Three-Agent
Planner + Generator + Evaluator
Full-stack autonomous builds. The planner expands a one-sentence prompt into a product spec. Generator and evaluator negotiate sprint contracts before each implementation cycle.
Use when: Both scope management and quality assurance are needed. Complex multi-hour builds.
The Design Workflow

From naive implementation to refined system

Don't build a three-agent system for a problem a two-agent system solves. Forge guides you through a methodical process: observe failures first, then architect precisely.

Identify failure modes

Run a naive implementation. Read its traces. Your harness should address failures you actually observe, not hypothetical ones.

Choose architecture

Match the pattern to your bottleneck. Context management, quality, or both. Start simple.

Design grading criteria

The highest-leverage step. Turn "is this good?" into concrete, gradable dimensions weighted toward weakness.

Design artifacts

Agents communicate through files, not conversation. Specs, sprint contracts, evaluation reports — all persisted.

Configure context

Compaction, resets, or neither. Test each new model — you may not need resets anymore.

Wire up testing

Give the evaluator interactive tools. Playwright for web apps. Test as a user would, not just compile checks.

Tune the evaluator

Read its logs, find judgment divergences, update its prompt. 3–5 rounds until it grades like you would.

Optimize post-completion

Hand off to autoresearch. The grading criteria become the metric, the feature list becomes the guard.

Simplify ruthlessly

Remove one component at a time. If quality holds, it wasn't load-bearing. The interesting design space moves, not shrinks.

Proven Results

From the research that built this skill

These numbers come directly from Anthropic's published experiments — the same research that Forge distills into an actionable design workflow.

Solo agent
$9 / 20 min
Broken game play. Features half-implemented. Entity wiring disconnected from runtime.
Three-agent harness
$200 / 6 hr
16-feature spec across 10 sprints. Working game play, AI-assisted sprite generation, export with shareable links.
Simplified harness (Opus 4.6)
$125 / 4 hr
Full DAW in browser. 2+ hours of coherent generation without sprint decomposition. Working arrangement, mixer, and AI-driven composition.
Generator coherence
2+ hours
Opus 4.6 sustained coherent work without context resets — a capability the harness was originally built to compensate for.
Evolution

How Forge was built

Not designed in a single pass. Forge evolved through published research, internal experiments, external academic work, and iterative simplification across model generations.

November 2025
Effective Harnesses for Long-Running Agents
Anthropic publishes the two-agent pattern: initializer + worker with structured handoffs. The foundation — discrete sessions bridged by persistent artifacts. Demonstrated with Claude Agent SDK.
March 2026
Harness Design for Long-Running Application Development
Prithvi Rajasekaran introduces the generator-evaluator loop inspired by GANs. Grading criteria turn subjective quality into gradable dimensions. Three-agent architecture produces a 16-feature retro game maker autonomously.
March 2026
Meta-Harness: End-to-End Optimization
Lee et al. demonstrate that optimizing the harness itself — not just the output — yields larger quality gains. Counterfactual diagnosis across 10M tokens of execution traces. Integrated into Forge as the automated iteration protocol.
March 2026
Forge skill created
All research synthesized into a single Claude Code skill. Architecture patterns, grading criteria guides, optimization handoff protocol, and trace archive convention. Renamed from "harness-design" to "forge" — April 2026.
Research Sources

Standing on shoulders

Every design decision in Forge traces back to published research, documented experiments, or hard-won lessons from production agent systems.