I built a side project a few months ago: a small agent that processes articles from my RSS feeds. The idea was simple — when a new article lands, the agent fetches it, summarises it, scores it by relevance to my interests, and either saves it to my Obsidian vault or discards it. If the score is high enough, it also drafts a short tweet.
It worked beautifully in testing. Then I left it running over a weekend and came back to find my vault had 200 entries, most of them duplicates, several of them partially processed with no summary, and a few tweets drafted for articles that had been discarded. The agent had no bugs in the traditional sense. It just had no guaranteed path through its own logic.
The Loop Pattern
The standard way to build an agent like this is a tool-calling loop. You give the LLM a task and a set of tools, and let it decide what to call next until it decides it's done:
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function processArticle(url: string) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: `Process this article for my reading vault: ${url}` },
  ];
  // Cap the loop so a confused model cannot spin forever.
  for (let i = 0; i < 20; i++) {
    const response = await anthropic.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 1024,
      tools: [fetchTool, summariseTool, scoreTool, saveTool, discardTool, draftTweetTool],
      messages,
    });
    if (response.stop_reason === "end_turn") break;
    const toolUse = response.content.find(
      (b): b is Anthropic.ToolUseBlock => b.type === "tool_use"
    );
    if (!toolUse) break;
    // Echo the assistant turn back, then attach the tool result.
    messages.push({ role: "assistant", content: response.content });
    const result = await executeTool(toolUse.name, toolUse.input);
    messages.push({
      role: "user",
      content: [{ type: "tool_result", tool_use_id: toolUse.id, content: result }],
    });
  }
}
This is fine for exploratory tasks where the path is genuinely unpredictable. But an article processing pipeline is not exploratory — it has a fixed process. Fetch, then summarise, then score, then branch. The loop makes every step optional and every ordering possible.
Here is what the actual execution paths looked like:
The agent would sometimes score before summarising, sometimes summarise twice, sometimes draft a tweet for a discarded article. Each individual step looked reasonable given the context the LLM had at that moment, but the process as a whole was unpredictable.
The Graph Version
Flow engineering replaces the loop with an explicit state machine. The process becomes a directed graph: nodes are the steps, edges are the only permitted transitions. The LLM still does all the reasoning inside each node — it just cannot invent its own edges.
Here is the implementation:
type State = "FETCH" | "SUMMARISE" | "SCORE" | "SAVE" | "DISCARD" | "TWEET" | "DONE";

type ArticleContext = {
  url: string;
  content?: string;
  summary?: string;
  score?: number;
};

// Every permitted edge in the graph. Anything not listed here is a bug.
const transitions: Record<State, State[]> = {
  FETCH: ["SUMMARISE"],
  SUMMARISE: ["SCORE"],
  SCORE: ["SAVE", "DISCARD"],
  SAVE: ["TWEET"],
  TWEET: ["DONE"],
  DISCARD: ["DONE"],
  DONE: [],
};
async function processArticle(url: string) {
  let state: State = "FETCH";
  const ctx: ArticleContext = { url };

  while (state !== "DONE") {
    const next = await executeState(state, ctx);
    if (!transitions[state].includes(next)) {
      throw new Error(`Illegal transition: ${state} → ${next}`);
    }
    state = next;
  }
}
Each state is a focused step, at most one narrowly scoped LLM call, that does one thing and returns the name of the next state and nothing else:
async function executeState(state: State, ctx: ArticleContext): Promise<State> {
  switch (state) {
    case "FETCH": {
      ctx.content = await fetchArticleText(ctx.url);
      return "SUMMARISE";
    }
    case "SUMMARISE": {
      const res = await anthropic.messages.create({
        model: "claude-opus-4-6",
        max_tokens: 512,
        system: "Summarise the article in 3-5 sentences. Return only the summary.",
        messages: [{ role: "user", content: ctx.content! }],
      });
      ctx.summary = extractText(res);
      return "SCORE";
    }
    case "SCORE": {
      const res = await anthropic.messages.create({
        model: "claude-opus-4-6",
        max_tokens: 128,
        system: `Score this summary 1-10 for relevance to: software engineering, AI, distributed systems.
Reply with JSON: { "score": number, "reason": string }`,
        messages: [{ role: "user", content: ctx.summary! }],
      });
      const { score } = JSON.parse(extractText(res));
      ctx.score = score;
      // The branch decision lives in code, not in the model.
      return score >= 7 ? "SAVE" : "DISCARD";
    }
    case "SAVE": {
      await saveToVault({ url: ctx.url, summary: ctx.summary!, score: ctx.score! });
      return "TWEET";
    }
    case "TWEET": {
      const res = await anthropic.messages.create({
        model: "claude-opus-4-6",
        max_tokens: 128,
        system: "Draft a tweet (max 240 chars) sharing this article. Return only the tweet text.",
        messages: [{ role: "user", content: ctx.summary! }],
      });
      await saveDraftTweet(extractText(res));
      return "DONE";
    }
    case "DISCARD": {
      console.log(`Discarded: ${ctx.url} (score: ${ctx.score})`);
      return "DONE";
    }
    case "DONE":
      // Terminal: the driver loop exits before this can be reached, but
      // handling it keeps the switch exhaustive for the compiler.
      throw new Error("executeState called with DONE");
  }
}
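One piece I haven't shown is extractText, which just pulls the text blocks out of a model response. A minimal sketch, typed structurally so it doesn't depend on the SDK (the real response type is Anthropic.Message):

```typescript
// Structural subset of an Anthropic message response: an array of content
// blocks, where text blocks carry a `text` field.
type ContentBlock = { type: "text"; text: string } | { type: string };

function extractText(res: { content: ContentBlock[] }): string {
  return res.content
    .filter((b): b is { type: "text"; text: string } => b.type === "text")
    .map((b) => b.text)
    .join("\n")
    .trim();
}
```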
The throw on an illegal transition is the key line. In the loop version, a bug where the LLM drafts a tweet after discarding an article will silently succeed. In the graph version, it throws immediately and you fix it before it ever ships.
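Because the transition table is plain data, you can also sanity-check the graph itself at startup. A small sketch, restating the State type and table so it's self-contained; reachable and validateGraph are my additions, not part of the pipeline above:

```typescript
type State = "FETCH" | "SUMMARISE" | "SCORE" | "SAVE" | "DISCARD" | "TWEET" | "DONE";

const transitions: Record<State, State[]> = {
  FETCH: ["SUMMARISE"],
  SUMMARISE: ["SCORE"],
  SCORE: ["SAVE", "DISCARD"],
  SAVE: ["TWEET"],
  TWEET: ["DONE"],
  DISCARD: ["DONE"],
  DONE: [],
};

// Collect every state reachable from `start` by following permitted edges.
function reachable(start: State): Set<State> {
  const seen = new Set<State>([start]);
  const stack: State[] = [start];
  while (stack.length > 0) {
    const s = stack.pop()!;
    for (const next of transitions[s]) {
      if (!seen.has(next)) {
        seen.add(next);
        stack.push(next);
      }
    }
  }
  return seen;
}

// Startup check: every state is reachable from FETCH,
// and every state can reach DONE.
function validateGraph(): void {
  const fromStart = reachable("FETCH");
  for (const s of Object.keys(transitions) as State[]) {
    if (!fromStart.has(s)) throw new Error(`Unreachable state: ${s}`);
    if (s !== "DONE" && !reachable(s).has("DONE")) {
      throw new Error(`State cannot reach DONE: ${s}`);
    }
  }
}
```

Run validateGraph() once at boot and a typo in the table fails fast instead of surfacing as a stuck article days later.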
What You Get
Observability without infrastructure. Every state → next transition is a natural log line. You can reconstruct the exact path any article took. With the loop, you get a wall of tool calls with no structure.
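Tracing costs one extra line in the driver. A sketch where the step function is stubbed with a fixed happy path; in the real pipeline it would be executeState:

```typescript
type State = "FETCH" | "SUMMARISE" | "SCORE" | "SAVE" | "DISCARD" | "TWEET" | "DONE";

// Run the machine and record every transition. `step` stands in for
// executeState; the trace doubles as a structured log of the run.
async function runWithTrace(step: (s: State) => Promise<State>): Promise<State[]> {
  const trace: State[] = ["FETCH"];
  let state: State = "FETCH";
  while (state !== "DONE") {
    state = await step(state);
    trace.push(state); // one log line per transition
  }
  return trace;
}

// Stubbed happy path: a high-scoring article.
const happyPath: Record<State, State> = {
  FETCH: "SUMMARISE",
  SUMMARISE: "SCORE",
  SCORE: "SAVE",
  SAVE: "TWEET",
  TWEET: "DONE",
  DISCARD: "DONE",
  DONE: "DONE",
};
```

runWithTrace(async (s) => happyPath[s]) yields the full path FETCH → SUMMARISE → SCORE → SAVE → TWEET → DONE, ready to print or persist.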
Testable states. You can call executeState("SCORE", mockCtx) in isolation. You can assert that a low-scoring article transitions to DISCARD and a high-scoring one to SAVE. With the loop, the only thing you can test is the full end-to-end run.
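In practice I'd go one step further and pull the branch decision out of the LLM call entirely, so the interesting logic needs no mocking at all. A sketch; scoreToState and its threshold parameter are mine, not part of the pipeline above:

```typescript
type Branch = "SAVE" | "DISCARD";

// Pure branch decision, separated from the LLM call that produces the score.
// The default threshold matches the SCORE state in the pipeline.
function scoreToState(score: number, threshold = 7): Branch {
  return score >= threshold ? "SAVE" : "DISCARD";
}
```

The SCORE state then becomes "call the model, parse the score, return scoreToState(score)", and the branch is a three-line unit test.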
Predictable cost. The graph has a known maximum LLM call count equal to the longest path: three calls for a saved article (summarise, score, tweet) and two for a discarded one. The loop's call count drifts with model mood and input complexity.
Resumability. Because all state lives in ctx, you can persist it after each transition and resume from any point. If the tweet API is down, you can retry from TWEET without fetching or scoring again. The loop keeps state in the message array, so pausing mid-execution means losing everything.
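Checkpointing is a few lines. A sketch with an in-memory store standing in for whatever you'd actually persist to (a JSON file, SQLite, Redis); the Checkpoint shape and helper names are mine:

```typescript
type State = "FETCH" | "SUMMARISE" | "SCORE" | "SAVE" | "DISCARD" | "TWEET" | "DONE";
type ArticleContext = { url: string; content?: string; summary?: string; score?: number };
type Checkpoint = { state: State; ctx: ArticleContext };

// In-memory store keyed by article URL; swap for a file or database in practice.
const store = new Map<string, Checkpoint>();

function saveCheckpoint(cp: Checkpoint): void {
  store.set(cp.ctx.url, structuredClone(cp)); // snapshot, not a live reference
}

// Resume from the last persisted transition, or start fresh at FETCH.
function loadCheckpoint(url: string): Checkpoint {
  return store.get(url) ?? { state: "FETCH", ctx: { url } };
}
```

In the driver loop, call saveCheckpoint({ state, ctx }) right after the transition check; if the tweet API dies, the next run picks up at TWEET with the summary already in ctx.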
When the Loop Is Right
Not every agent should be a graph. The loop is the right model when the shape of the work is genuinely unknown upfront: open-ended research, code generation across unfamiliar territory, anything where the agent needs to discover the path, not follow one.
The useful distinction: if your agent is a process (ordered steps, side effects, compliance requirements), model it as a graph. If it is a solver (find the best answer by whatever path), the loop is fine.
An article pipeline is a process. A "debug this error" agent is a solver. The architecture should match.