Claude Code Skill

Forge

Design multi-agent systems that sustain quality over hours of autonomous work. Generator-evaluator loops, sprint contracts, grading criteria, and structured handoffs — distilled from Anthropic's frontier research into a single, opinionated skill.

3 Architecture patterns · 9 Design workflow steps · 6 Research sources
# Install the skill
$ git clone https://github.com/omegalens/forge.git ~/.claude/skills/forge
Every component in a harness encodes an assumption about what the model can't do on its own. Those assumptions deserve stress-testing — they may be wrong, and they go stale as models improve.
The foundational principle of Forge, derived from Anthropic's harness design research
Core Patterns

Three architectures, one decision framework

Your failure modes determine your architecture. Forge provides concrete patterns for each, with decision guides, example prompts, and simplification paths as models improve.

Two-Agent
Initializer + Worker
Structured handoffs between context windows. The initializer expands the prompt into a spec and feature list; the worker implements one feature per session, leaving clean artifacts for the next.
Use when: Context degradation is the bottleneck. Quality within a single session is fine, but work degrades across boundaries.
Generator-Evaluator
GAN-Inspired Feedback Loop
Separating creation from judgment. The generator produces output; a calibrated evaluator grades it against concrete criteria. 5–15 iterations drive convergence toward quality the generator can't self-assess.
Use when: Output quality is subjective or self-evaluation is unreliable. Design, writing, creative work.
Three-Agent
Planner + Generator + Evaluator
Full-stack autonomous builds. The planner expands a one-sentence prompt into a product spec. Generator and evaluator negotiate sprint contracts before each implementation cycle.
Use when: Both scope management and quality assurance are needed. Complex multi-hour builds.
The Design Workflow

From naive implementation to refined system

Don't build a three-agent system for a problem a two-agent system solves. Forge guides you through a methodical process: observe failures first, then architect precisely.

Identify failure modes

Run a naive implementation. Read its traces. Your harness should address failures you actually observe, not hypothetical ones.

Choose architecture

Match the pattern to your bottleneck. Context management, quality, or both. Start simple.

Design grading criteria

The highest-leverage step. Turn "is this good?" into concrete, gradable dimensions weighted toward weakness.

Design artifacts

Agents communicate through files, not conversation. Specs, sprint contracts, evaluation reports — all persisted.

Configure context

Compaction, resets, or neither. Test each new model — you may not need resets anymore.

Wire up testing

Give the evaluator interactive tools. Playwright for web apps. Test as a user would, not just compile checks.

Tune the evaluator

Read its logs, find judgment divergences, update its prompt. 3–5 rounds until it grades like you would.

Optimize post-completion

Hand off to autoresearch. The grading criteria become the metric, the feature list becomes the guard.

Simplify ruthlessly

Remove one component at a time. If quality holds, it wasn't load-bearing. The interesting design space moves, not shrinks.

Proven Results

From the research that built this skill

These numbers come directly from Anthropic's published experiments — the same research that Forge distills into an actionable design workflow.

Solo agent
$9 / 20 min
Broken game play. Features half-implemented. Entity wiring disconnected from runtime.
Three-agent harness
$200 / 6 hr
16-feature spec across 10 sprints. Working game play, AI-assisted sprite generation, export with shareable links.
Simplified harness (Opus 4.6)
$125 / 4 hr
Full DAW in browser. 2+ hours of coherent generation without sprint decomposition. Working arrangement, mixer, and AI-driven composition.
Generator coherence
2+ hours
Opus 4.6 sustained coherent work without context resets — a capability the harness was originally built to compensate for.
Evolution

How Forge was built

Not designed in a single pass. Forge evolved through published research, internal experiments, external academic work, and iterative simplification across model generations.

November 2025
Effective Harnesses for Long-Running Agents
Anthropic publishes the two-agent pattern: initializer + worker with structured handoffs. The foundation — discrete sessions bridged by persistent artifacts. Demonstrated with Claude Agent SDK.
March 2026
Harness Design for Long-Running Application Development
Prithvi Rajasekaran introduces the generator-evaluator loop inspired by GANs. Grading criteria turn subjective quality into gradable dimensions. Three-agent architecture produces a 16-feature retro game maker autonomously.
March 2026
Meta-Harness: End-to-End Optimization
Lee et al. demonstrate that optimizing the harness itself — not just the output — yields larger quality gains. Counterfactual diagnosis across 10M tokens of execution traces. Integrated into Forge as the automated iteration protocol.
March 2026
Forge skill created
All research synthesized into a single Claude Code skill. Architecture patterns, grading criteria guides, optimization handoff protocol, and trace archive convention. Renamed from "harness-design" to "forge" — April 2026.
Research Sources

Standing on shoulders

Every design decision in Forge traces back to published research, documented experiments, or hard-won lessons from production agent systems.