Design multi-agent systems that sustain quality over hours of autonomous work. Generator-evaluator loops, sprint contracts, grading criteria, and structured handoffs — distilled from Anthropic's frontier research into a single, opinionated skill.
Every component in a harness encodes an assumption about what the model can't do on its own. Those assumptions deserve stress-testing — they may be wrong, and they go stale as models improve.
Your failure modes determine your architecture. Forge provides concrete patterns for each, with decision guides, example prompts, and simplification paths as models improve.
Don't build a three-agent system for a problem a two-agent system solves. Forge guides you through a methodical process: observe failures first, then architect precisely.
Run a naive implementation. Read its traces. Your harness should address failures you actually observe, not hypothetical ones.
Match the pattern to your bottleneck. Context management, quality, or both. Start simple.
The highest-leverage step. Turn "is this good?" into concrete, gradable dimensions weighted toward weakness.
Agents communicate through files, not conversation. Specs, sprint contracts, evaluation reports — all persisted.
Compaction, resets, or neither. Test each new model — you may not need resets anymore.
Give the evaluator interactive tools. Playwright for web apps. Test as a user would, not just compile checks.
Read its logs, find judgment divergences, update its prompt. 3–5 rounds until it grades like you would.
Hand off to autoresearch. The grading criteria become the metric, the feature list becomes the guard.
Remove one component at a time. If quality holds, it wasn't load-bearing. The interesting design space moves, not shrinks.
These numbers come directly from Anthropic's published experiments — the same research that Forge distills into an actionable design workflow.
Not designed in a single pass. Forge evolved through published research, internal experiments, external academic work, and iterative simplification across model generations.
Every design decision in Forge traces back to published research, documented experiments, or hard-won lessons from production agent systems.