
Build a Durable Multi-Step Workflow with Claude Opus 4.8
Opus 4.8 shipped with a dynamic workflow capability. Here's how to build a crash-safe, multi-step workflow that an LLM drives end to end.
Key takeaways
- Record workflow progress explicitly in a completed list and write it to disk as soon as each step finishes, so a crash means resuming from where the run stopped rather than restarting.
- Write state to a temp file and rename it, because rename is atomic on POSIX and a crash mid-write can never leave a half-written, unparseable state file.
- Retry only transient errors like rate limits and 529 overloads with backoff, and let code bugs fail loudly instead of retrying three times.
- Make external side effects idempotent using the runId as an idempotency key, or resume can re-run a step and post the same ticket twice.
On May 28, 2026, Anthropic shipped Claude Opus 4.8, just 41 days after 4.7. TechCrunch covered it the same day, and the headline feature was not really the model. It was a research-preview capability called dynamic workflows, where Claude plans a large task, writes a JavaScript script that orchestrates hundreds of parallel subagents, runs them in the background, and verifies its own output before reporting back. Anthropic's own example was a codebase-scale migration that produced around 750,000 lines of Rust and ran for eleven days from first commit to merge.
The detail I keep coming back to is the eleven days. A workflow that runs for eleven days will, at some point, hit something that kills the process. A machine reboots. A deploy goes out. A rate limit turns into a retry storm. If the whole thing lives in memory and the process dies on day nine, you do not want to start over.
That durability is the part dynamic workflows quietly get right, and it is the part I want to teach here. The feature itself is gated to Claude Code on Enterprise, Team, and Max plans, and the orchestration script is something Claude writes for you rather than something you hand-author. So this is not a tutorial on the dynamic workflow tool specifically. It is a tutorial on the idea underneath it: building a multi-step workflow that an LLM drives, that survives a crash, and that you can resume from wherever it stopped. We will build that with the regular Anthropic SDK and a plain state file, so it runs anywhere you can run Node.
A note on what maps to what, because I want to be honest about the boundary. The "Claude drives the steps" part is what you control directly through the messages API. The "runs hundreds of subagents in the background" part is what the dynamic workflow runtime does for you and what we are not rebuilding. The durability pattern (persist state between steps, resume after a crash, retry the flaky bits) is general workflow engineering. It is good practice whether or not Opus 4.8's new feature is in the picture, and it is the foundation that makes the new feature trustworthy.
What you'll need
- Node.js 20 or newer, and TypeScript.
- An Anthropic API key in the ANTHROPIC_API_KEY environment variable.
- One package:
npm install @anthropic-ai/sdk
We use the Anthropic SDK directly. The state store is going to be a JSON file on disk, which is deliberately the least clever option available. A file is enough to demonstrate every durability property that matters, and swapping it for Redis, Postgres, or a key-value store later is a one-function change. Starting with the dramatic version (a real workflow engine) would hide the mechanics, and the mechanics are the whole point.
Every model call uses claude-opus-4-8 with adaptive thinking left on. Opus is heavier than some of these steps strictly need, but keeping the model uniform makes the example easier to read, and 4.8's habit of flagging uncertainty in its own outputs (the thing Bridgewater singled out in the TechCrunch piece) is genuinely useful when an LLM is making decisions you will act on without a human in the loop.
The shape of the problem
Let's give ourselves a concrete task: take a raw customer support transcript and turn it into a structured ticket. That breaks into three steps that have to happen in order.
- Extract. Pull the key facts from the transcript: what the customer wants, what product, how urgent.
- Classify. Decide which team the ticket goes to and set a priority.
- Draft. Write a short internal summary for whoever picks it up.
Each step is a Claude call. Each one depends on the output of the one before it. And each one is a place the process can die. The naive version is three calls in a row inside one function, and it works perfectly until step three fails and you have to re-run steps one and two for no reason, paying for those tokens again.
The durable version treats the workflow as data, not as a call stack. The current state of the run lives in a file. Each step reads the state, does its work, and writes the result back before the next step starts. If the process dies, the file is still there, and resuming means reading the file and skipping whatever is already done.
Step 1: model the workflow as state
First, describe what a run looks like. This is the whiteboard everything reads and writes.
type StepName = "extract" | "classify" | "draft";

interface ExtractResult {
  request: string;
  product: string;
  urgency: "low" | "medium" | "high";
}

interface ClassifyResult {
  team: string;
  priority: "P0" | "P1" | "P2" | "P3";
}

interface WorkflowState {
  runId: string;
  transcript: string;
  // Which steps have finished, and what they produced.
  completed: StepName[];
  extract?: ExtractResult;
  classify?: ClassifyResult;
  draft?: string;
  // Bookkeeping for resume and debugging.
  status: "running" | "done" | "failed";
  lastError?: string;
}
The important field is completed. It is the source of truth for what has already happened. When we resume, we do not look at the clock or guess. We look at this list. If "extract" is in it, extraction is done, full stop, and we move on. Everything else (the extract, classify, draft payloads) is the actual work product, kept so later steps can read it and so a crash never loses a finished step.
This is the single decision that makes the whole thing durable: the workflow's progress is recorded explicitly, separately from the program's control flow. The control flow can restart from zero. The progress cannot.
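One way to see the completed list acting as the source of truth is a tiny helper that derives the next step from that list and nothing else. This is a sketch for illustration: STEP_ORDER and nextStep are hypothetical names, not part of the code in this post.

```typescript
type StepName = "extract" | "classify" | "draft";

// Hypothetical constant: the fixed order the steps must run in.
const STEP_ORDER: readonly StepName[] = ["extract", "classify", "draft"];

// Returns the first step not yet recorded as completed, or null when the run
// is finished. It consults only the persisted list, never in-memory flags.
function nextStep(completed: readonly StepName[]): StepName | null {
  for (const step of STEP_ORDER) {
    if (!completed.includes(step)) return step;
  }
  return null;
}

console.log(nextStep([]));                                // extract
console.log(nextStep(["extract"]));                       // classify
console.log(nextStep(["extract", "classify", "draft"]));  // null
```

Because the function takes only the list, a freshly restarted process gets the same answer as the one that crashed.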
Step 2: persist state between steps
Now the storage. Two functions, read and write, both pointed at a file named after the run.
import { mkdir, readFile, rename, writeFile } from "node:fs/promises";
import { join } from "node:path";

const STATE_DIR = "./runs";

async function saveState(state: WorkflowState): Promise<void> {
  await mkdir(STATE_DIR, { recursive: true });
  const path = join(STATE_DIR, `${state.runId}.json`);
  // Write to a temp file, then rename. Rename is atomic on POSIX, so a
  // crash mid-write can never leave a half-written, unparseable state file.
  const tmp = `${path}.tmp`;
  await writeFile(tmp, JSON.stringify(state, null, 2), "utf8");
  await rename(tmp, path);
}

async function loadState(runId: string): Promise<WorkflowState | null> {
  const path = join(STATE_DIR, `${runId}.json`);
  try {
    return JSON.parse(await readFile(path, "utf8")) as WorkflowState;
  } catch (err) {
    // A missing file means "no such run". Anything else (permissions, corrupt
    // JSON) is a real problem and should surface rather than look like a miss.
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return null;
    throw err;
  }
}
The temp-file-then-rename dance is worth pausing on. If you write directly to the real file and the process dies halfway through writeFile, you are left with a truncated JSON file that throws on the next loadState, and now your durable workflow cannot even read its own progress. Writing to a temp file and renaming sidesteps that, because the rename is atomic. Either the old state is there or the new state is there, never a mangled in-between. This is the kind of detail a real workflow engine handles for you, and the kind of detail that bites you when you roll your own and skip it.
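Rename protects against a torn file, but on its own it does not force the new bytes onto the disk before a power loss. A slightly more defensive variant, offered as a sketch rather than a requirement, flushes the temp file with fsync before the rename. The helper name writeFileAtomic is hypothetical; open, FileHandle.writeFile, FileHandle.sync, and rename come from node:fs/promises.

```typescript
import { mkdir, open, rename } from "node:fs/promises";
import { dirname } from "node:path";

// Hypothetical helper: write `data` to `path` so that readers see either the
// old contents or the new contents, with the new bytes flushed before the swap.
async function writeFileAtomic(path: string, data: string): Promise<void> {
  await mkdir(dirname(path), { recursive: true });
  const tmp = `${path}.tmp`;
  const handle = await open(tmp, "w");
  try {
    await handle.writeFile(data, "utf8");
    await handle.sync(); // fsync: ask the OS to flush the temp file to disk
  } finally {
    await handle.close();
  }
  await rename(tmp, path); // atomic swap on POSIX filesystems
}
```

With this in place, saveState could call writeFileAtomic(path, JSON.stringify(state, null, 2)) instead of doing the write and rename itself. Whether the extra fsync is worth the latency depends on how costly a lost checkpoint is for your workflow.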
Step 3: drive a step with Opus 4.8
Here is one step, the extract step. It reads the transcript from state, asks Claude to pull out the structured fields, and returns them.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function runExtract(state: WorkflowState): Promise<ExtractResult> {
  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    thinking: { type: "adaptive" },
    output_config: {
      format: {
        type: "json_schema",
        schema: {
          type: "object",
          properties: {
            request: { type: "string" },
            product: { type: "string" },
            urgency: { type: "string", enum: ["low", "medium", "high"] },
          },
          required: ["request", "product", "urgency"],
          additionalProperties: false,
        },
      },
    },
    messages: [
      {
        role: "user",
        content:
          "Extract the customer's request, the product involved, and the " +
          "urgency from this support transcript. If something is genuinely " +
          "unclear, pick the most defensible value rather than inventing " +
          `detail.\n\nTranscript:\n${state.transcript}`,
      },
    ],
  });

  // With thinking enabled the response can hold several blocks; we want the text one.
  const text = response.content.find((b) => b.type === "text");
  if (!text || text.type !== "text") {
    throw new Error("extract: no text block in response");
  }
  return JSON.parse(text.text) as ExtractResult;
}
A few things are doing real work here. The output_config.format with a JSON schema constrains Claude to return exactly the shape we declared, so the JSON.parse downstream is safe rather than hopeful. On Opus 4.8 you cannot prefill an assistant turn to force JSON the old way (it returns a 400), so structured outputs is the right tool anyway. Adaptive thinking stays on because the extraction involves a small judgment call about urgency, and I would rather the model reason about that than snap to the first answer.
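Even with a schema on the request, a cheap runtime check at the boundary can catch a truncated or otherwise unexpected response before it is written into state. The guard below is a sketch: isExtractResult and parseExtract are hypothetical helpers, not part of the original step.

```typescript
interface ExtractResult {
  request: string;
  product: string;
  urgency: "low" | "medium" | "high";
}

const URGENCIES: readonly string[] = ["low", "medium", "high"];

// Hypothetical type guard: checks the parsed value field by field.
function isExtractResult(value: unknown): value is ExtractResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const urgency = v.urgency;
  return (
    typeof v.request === "string" &&
    typeof v.product === "string" &&
    typeof urgency === "string" &&
    URGENCIES.includes(urgency)
  );
}

// Parses model output and throws a descriptive error instead of letting a
// malformed object flow into later steps.
function parseExtract(raw: string): ExtractResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("extract: response was not valid JSON");
  }
  if (!isExtractResult(parsed)) {
    throw new Error("extract: response did not match the expected shape");
  }
  return parsed;
}
```

Swapping the final JSON.parse in runExtract for parseExtract(text.text) would turn a bad response into a loud step failure rather than a quietly wrong state file.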
The classify and draft steps follow the same template. Classify reads state.extract and returns a team plus a priority. Draft reads both prior results and returns prose. I will not print all three in full, since they are structurally identical, but here is classify so you can see how a later step consumes an earlier one's output:
async function runClassify(state: WorkflowState): Promise<ClassifyResult> {
  const extract = state.extract;
  if (!extract) throw new Error("classify: extract step has not run");

  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 512,
    thinking: { type: "adaptive" },
    output_config: {
      format: {
        type: "json_schema",
        schema: {
          type: "object",
          properties: {
            team: { type: "string" },
            priority: { type: "string", enum: ["P0", "P1", "P2", "P3"] },
          },
          required: ["team", "priority"],
          additionalProperties: false,
        },
      },
    },
    messages: [
      {
        role: "user",
        content:
          "Route this support ticket to a team and assign a priority.\n\n" +
          `Request: ${extract.request}\n` +
          `Product: ${extract.product}\n` +
          `Urgency: ${extract.urgency}`,
      },
    ],
  });

  const text = response.content.find((b) => b.type === "text");
  if (!text || text.type !== "text") {
    throw new Error("classify: no text block in response");
  }
  return JSON.parse(text.text) as ClassifyResult;
}
Notice that classify never sees the raw transcript. It works from the structured output of the extract step. That is not just tidiness. It means each step has a small, predictable input, which keeps token costs down and makes each step independently testable. You can feed runClassify a hand-written extract object and check the routing without ever calling the extract model.
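The draft step is not printed in this post. As a rough sketch of how it might look, and of the testability point, the version below takes the model call as a parameter so it can be exercised with a stub. The complete parameter, the DraftInput type, buildDraftPrompt, and the prompt wording are all assumptions made for illustration; in the real step, a client.messages.create call like the ones above would sit where complete is invoked.

```typescript
interface ExtractResult {
  request: string;
  product: string;
  urgency: "low" | "medium" | "high";
}
interface ClassifyResult {
  team: string;
  priority: "P0" | "P1" | "P2" | "P3";
}
// Hypothetical: just the slice of workflow state the draft step reads.
interface DraftInput {
  extract?: ExtractResult;
  classify?: ClassifyResult;
}
// Hypothetical signature for "send a prompt, get text back".
type Complete = (prompt: string) => Promise<string>;

// Builds the prompt from structured results only, never the raw transcript.
function buildDraftPrompt(extract: ExtractResult, classify: ClassifyResult): string {
  return (
    "Write a short internal summary for whoever picks up this ticket.\n\n" +
    `Request: ${extract.request}\n` +
    `Product: ${extract.product}\n` +
    `Urgency: ${extract.urgency}\n` +
    `Team: ${classify.team}\n` +
    `Priority: ${classify.priority}`
  );
}

async function runDraft(state: DraftInput, complete: Complete): Promise<string> {
  const { extract, classify } = state;
  if (!extract) throw new Error("draft: extract step has not run");
  if (!classify) throw new Error("draft: classify step has not run");
  const summary = (await complete(buildDraftPrompt(extract, classify))).trim();
  if (summary.length === 0) throw new Error("draft: empty summary");
  return summary;
}
```

In a test, complete can be a one-line async function returning canned text, which checks the prompt construction and the guard clauses without spending a token.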
Step 4: the driver, with retries
Now the loop that ties the steps together. This is where durability shows up, because the driver reads state, runs only the steps that have not finished, and saves after each one.
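async function withRetry<T>(
  label: string,
  fn: () => Promise<T>,
  attempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Only retry the errors that are actually transient: rate limits (429),
      // server-side errors including 529 overloads, and dropped connections.
      // Anthropic.APIError is the base class of every API error, 400s
      // included, so testing for it alone would retry bad requests too.
      const transient =
        err instanceof Anthropic.RateLimitError ||
        err instanceof Anthropic.InternalServerError ||
        err instanceof Anthropic.APIConnectionError;
      if (!transient) throw err; // a bug in our code should fail loudly, not retry
      if (i === attempts - 1) break; // out of attempts, no point sleeping
      const wait = 2 ** i * 1000; // 1s, then 2s
      console.warn(`${label} failed (attempt ${i + 1}), retrying in ${wait}ms`);
      await new Promise((r) => setTimeout(r, wait));
    }
  }
  throw lastErr;
}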
const STEPS: Array<{
  name: StepName;
  run: (s: WorkflowState) => Promise<void>;
}> = [
  {
    name: "extract",
    run: async (s) => { s.extract = await runExtract(s); },
  },
  {
    name: "classify",
    run: async (s) => { s.classify = await runClassify(s); },
  },
  {
    name: "draft",
    run: async (s) => { s.draft = await runDraft(s); },
  },
];

async function runWorkflow(state: WorkflowState): Promise<WorkflowState> {
  state.status = "running";
  for (const step of STEPS) {
    if (state.completed.includes(step.name)) {
      console.log(`skipping ${step.name} (already done)`);
      continue;
    }
    try {
      await withRetry(step.name, () => step.run(state));
      state.completed.push(step.name);
      await saveState(state); // checkpoint: the step is now durably done
      console.log(`completed ${step.name}`);
    } catch (err) {
      state.status = "failed";
      state.lastError = err instanceof Error ? err.message : String(err);
      await saveState(state);
      throw err;
    }
  }
  state.status = "done";
  await saveState(state);
  return state;
}
The saveState calls are the whole ballgame. The one inside the try block runs right after a step succeeds and before the next step begins, so a finished step's result reaches disk before anything else happens. The one in the catch block records the failure, and the last one persists the final status. If the machine dies between the extract checkpoint and the classify checkpoint, the file on disk says extraction is done and nothing else is. The only window where work is finished but unrecorded is the brief gap between a step returning and its saveState completing, and in that case resume simply runs the step again.
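To watch the checkpoint do its job without calling a model, here is a small simulation. It is a sketch under obvious assumptions: an in-memory Map stands in for the state file, the steps are fake functions, and one of them throws on its first call to imitate a crash. None of these names appear in the real driver.

```typescript
type StepName = "extract" | "classify" | "draft";
interface MiniState {
  runId: string;
  completed: StepName[];
}

const disk = new Map<string, string>(); // stand-in for the JSON file on disk
const calls: StepName[] = [];           // records every step execution
let crashOnce = true;

function save(state: MiniState): void {
  disk.set(state.runId, JSON.stringify(state));
}

function load(runId: string): MiniState {
  const raw = disk.get(runId);
  if (raw === undefined) throw new Error(`no run found for ${runId}`);
  return JSON.parse(raw) as MiniState;
}

// Fake step: classify blows up the first time it is called.
async function fakeStep(name: StepName): Promise<void> {
  calls.push(name);
  if (name === "classify" && crashOnce) {
    crashOnce = false;
    throw new Error("simulated crash during classify");
  }
}

async function run(state: MiniState): Promise<void> {
  for (const name of ["extract", "classify", "draft"] as const) {
    if (state.completed.includes(name)) continue; // already checkpointed
    await fakeStep(name);
    state.completed.push(name);
    save(state); // checkpoint after every finished step
  }
}

async function demo(): Promise<StepName[]> {
  const state: MiniState = { runId: "run-1", completed: [] };
  save(state);
  await run(state).catch(() => undefined); // first attempt dies in classify
  await run(load("run-1"));                // resume from the stored state
  return calls;
}
```

Calling demo() once resolves to extract, classify, classify, draft: the interrupted step runs again, while extract, which was checkpointed before the crash, runs exactly once.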
The retry wrapper distinguishes between errors worth retrying and errors that are bugs. A rate limit or a 529 overload is transient, so we back off and try again. A TypeError from our own code is not transient, and retrying it three times just delays the failure and muddies the logs. Retrying everything is a tempting shortcut and a bad one, because it hides the failures you actually need to see.
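The transient-versus-bug split can also be expressed against HTTP status codes, which makes it easy to unit test without the SDK. This is a sketch: isTransientStatus and backoffMs are hypothetical helpers, and the jitter is an addition that the wrapper above does not have.

```typescript
// Hypothetical helper: which HTTP statuses are worth retrying.
// 429 is a rate limit; 5xx covers server-side failures, including 529 overloads.
function isTransientStatus(status: number | undefined): boolean {
  // No status at all (a thrown TypeError, say) is not retried here. Connection
  // errors also lack a status and would need their own check by the caller.
  if (status === undefined) return false;
  return status === 429 || status >= 500;
}

// Exponential backoff with full jitter: a random delay in
// [0, min(capMs, baseMs * 2^attempt)). `random` is injectable so the function
// is deterministic under test.
function backoffMs(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

console.log(isTransientStatus(529)); // true
console.log(isTransientStatus(400)); // false
console.log(backoffMs(2, 1000, 30_000, () => 0.5)); // 2000
```

Jitter matters once many runs share one API key: without it, every run that hit the same rate limit wakes up at the same instant and collides again.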
Step 5: starting and resuming
Two entry points. One starts a fresh run, the other picks up an existing one. They both end up calling the same runWorkflow, which is the point.
import { randomUUID } from "node:crypto";

async function startRun(transcript: string): Promise<WorkflowState> {
  const state: WorkflowState = {
    runId: randomUUID(),
    transcript,
    completed: [],
    status: "running",
  };
  await saveState(state); // persist before doing any work
  return runWorkflow(state);
}

async function resumeRun(runId: string): Promise<WorkflowState> {
  const state = await loadState(runId);
  if (!state) throw new Error(`no run found for ${runId}`);
  if (state.status === "done") {
    console.log(`run ${runId} already finished`);
    return state;
  }
  console.log(`resuming ${runId}, completed: [${state.completed.join(", ")}]`);
  return runWorkflow(state);
}
startRun saves the initial state before it runs a single step. That matters more than it looks. If the process dies during the very first model call, you still have a run on disk with a known runId, and resumeRun can pick it up. Without that first save, a crash early in the first step leaves no trace at all, and resume has nothing to resume.
resumeRun does the obvious thing: load the file, and if the run already finished, do nothing. The "already finished" check is your defense against double-processing, which leads straight into the gotchas.
Gotchas
Idempotency, or what happens when a step runs twice. Resume is built on the promise that running runWorkflow again is safe. That promise holds only if each step is idempotent, meaning running it a second time produces the same result and no extra side effects. Our steps are pure-ish: they call a model and write to state, both safe to repeat. The danger appears the moment a step does something to the outside world. If your draft step does not just write prose to state but also posts the ticket to a real ticketing system, and the process dies after the post but before saveState, then resume re-runs the step and you have posted twice. The fix is to make the external action idempotent (use the runId as an idempotency key the downstream system recognizes) or to record the side effect in state before you trust it. Checkpoint placement is correctness, not just convenience.
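A minimal sketch of the idempotency-key fix, using an in-memory stand-in for the ticketing system. FakeTicketSystem and createTicket are hypothetical; a real downstream API has to support an idempotency key of its own for this to carry over.

```typescript
interface Ticket {
  id: number;
  summary: string;
}

// Hypothetical downstream system that remembers which keys it has seen.
class FakeTicketSystem {
  private byKey = new Map<string, Ticket>();
  private nextId = 1;

  // Creating with a key already seen returns the original ticket instead of
  // making a duplicate, so a re-run step is harmless.
  createTicket(idempotencyKey: string, summary: string): Ticket {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return existing;
    const ticket: Ticket = { id: this.nextId++, summary };
    this.byKey.set(idempotencyKey, ticket);
    return ticket;
  }

  get count(): number {
    return this.byKey.size;
  }
}

const tickets = new FakeTicketSystem();
const runId = "run-demo"; // stands in for state.runId

const first = tickets.createTicket(`${runId}:draft`, "Customer cannot log in");
// Simulate a crash after the post but before saveState: resume posts again.
const second = tickets.createTicket(`${runId}:draft`, "Customer cannot log in");

console.log(first.id === second.id, tickets.count); // true 1
```

Suffixing the key with the step name is a design choice in this sketch, so that a workflow with several side-effecting steps can reuse one runId without the keys colliding.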
Partial failures inside a step. Our checkpoints are step-sized. A step either fully completes and gets recorded, or it does not and gets retried from the top. That is fine when a step is one model call. It gets expensive if a single step does a lot of independent work, like the dynamic workflow case of fanning out across hundreds of subagents. If subagent number 400 fails, you do not want to re-run the first 399. The answer is finer-grained checkpoints: treat each subagent's result as its own recorded unit, the same way we treat each step. The principle scales down; the granularity is a judgment call about how much work you are willing to lose.
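A sketch of what finer-grained checkpoints could look like: results are recorded per item, and a re-run skips any item that already has one. FanOutState, runFanOut, and the checkpoint callback are hypothetical, and the loop is sequential to keep the example short; a parallel version would follow the same rule of recording each result as its own unit.

```typescript
interface FanOutState {
  // One entry per finished unit of work, keyed by a stable item id.
  results: Record<string, string>;
}

// Hypothetical helper: processes only items with no recorded result, and calls
// `checkpoint` after each one so a crash loses at most one item's work.
async function runFanOut(
  state: FanOutState,
  itemIds: string[],
  work: (id: string) => Promise<string>,
  checkpoint: (s: FanOutState) => Promise<void>,
): Promise<void> {
  for (const id of itemIds) {
    const alreadyDone = Object.prototype.hasOwnProperty.call(state.results, id);
    if (alreadyDone) continue; // finished in an earlier attempt
    state.results[id] = await work(id);
    await checkpoint(state);
  }
}
```

Passing saveState as the checkpoint callback would reuse the same atomic write, at the cost of rewriting the whole file per item. Past a few hundred items, one file or row per result is probably the better trade.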
The cost of long workflows. A workflow that runs for days is a workflow that spends money for days, and a resumed run that re-does completed steps spends it twice. The completed list is what stops that, so it earns its keep directly in dollars. Two other levers help. Keep each step's input small (classify reading the structured extract, not the raw transcript, is a deliberate cost decision). And on Opus 4.8 specifically, look at the effort parameter: output_config: { effort: "low" } on the simple steps and a higher setting only where the reasoning actually matters. Paying Opus-tier rates with maximum effort on a step that just maps three fields to a team name is waste you will not notice until the bill arrives.
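One way to keep the effort decision in a single place is a small per-step table. This is a sketch: the only value this post confirms is "low", so the "medium" entry and the three-value Effort type are assumptions to check against the current API reference, and the particular assignments are illustrative rather than measured.

```typescript
type StepName = "extract" | "classify" | "draft";
// Assumed value set; only "low" appears in the text above.
type Effort = "low" | "medium" | "high";

// Hypothetical table: spend reasoning where the judgment call is.
const EFFORT_BY_STEP: Record<StepName, Effort> = {
  extract: "medium", // urgency is a small judgment call
  classify: "low",   // maps three fields to a team and a priority
  draft: "low",      // short prose from structured input
};

// Returns the fragment to merge into a request's output_config.
function effortConfigFor(step: StepName): { effort: Effort } {
  return { effort: EFFORT_BY_STEP[step] };
}

console.log(effortConfigFor("classify")); // { effort: 'low' }
```

In a step function this would be spread next to the existing format entry, along the lines of output_config: { ...effortConfigFor("classify"), format: { ... } }.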
When a real workflow engine beats rolling your own. Everything above is maybe 150 lines, and for a three-step linear flow that is the right amount of code. The honest moment to stop hand-rolling is when you start reaching for features a real engine already has: scheduled and delayed steps, fan-out and fan-in across many parallel branches, visibility into runs that are stuck, automatic backoff policies, versioning so a deploy does not break runs that are mid-flight. Anthropic did not have Claude hand-author a bespoke state file for the eleven-day Rust migration; there is a runtime doing the heavy lifting. Temporal, Inngest, and the cloud-provider workflow services exist because durable execution at scale is a genuinely hard problem with a lot of edge cases. The pattern in this post is the right way to understand what those tools do, and the right way to handle a handful of linear flows. It is the wrong way to run a thousand of them.
Wrapping up
The dynamic workflow feature in Opus 4.8 is impressive because of the scale (hundreds of subagents, eleven days, three-quarters of a million lines of Rust), but the reason it can attempt that scale at all is the unglamorous part: state lives on disk, progress is recorded explicitly, and a crash means resume rather than restart. We rebuilt that core in a file and a for loop.
If you take one thing from this, make it the completed list. The instinct is to track progress implicitly, in where the code happens to be when it dies. Track it explicitly instead, as data you write down before you move on, and "resume after a crash" stops being a feature you have to engineer and becomes something that just falls out of the design. Start with the file. Move to a real engine when the flows stop being linear. And let Opus 4.8 flag its own uncertainty along the way, because a workflow you are not watching is exactly the one you want second-guessing itself.