Agentic Patterns: Tool Invocation Timeout

An agent's tool call is a network operation pretending to be a function call. The model emits a JSON payload, the framework executes it, the result lands back in the context window.

In the demo, the call returns in two seconds. In production it hangs, it 429s, it 504s, it disappears, and the agent has no good idea what to do about any of it.

The previous post argued that the orchestrator is the part that knows what work has been done. That is true at the level of which agent runs next.

It says nothing about the layer below: the single tool call, the HTTP request, the part of the system that lives between "the model decided" and "the model got a result". That layer is what this post is about.

The model is not where retry logic lives

The default loop in most agent frameworks is roughly this: the model emits a tool call, the framework executes it, the result goes back into the context window, and the model decides what to do next.

When the call fails, the model sees the error and tries again. Sometimes with slightly different arguments, sometimes with the same ones. The retry is a reasoning step.

That is wrong in two ways.

The first is correctness. The model cannot safely retry a non-idempotent operation, because it does not know whether the previous call actually executed.

A POST /opportunities that returned a 504 may have already created the record. A retry on the model's authority creates two.

The model cannot distinguish "no response yet" from "no response ever", because that distinction is a property of the network, not of the conversation.
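
The duplicate is easy to reproduce without a network. A minimal sketch, assuming a hypothetical in-memory `FakeCrm` whose first response is lost after the write has already committed; none of these names come from a real SDK.

```typescript
// Hypothetical stand-in for a remote CRM: the first write commits,
// but its response is lost, so the caller only ever sees a 504.
class GatewayTimeout extends Error {
  readonly status = 504;
}

class FakeCrm {
  readonly records: { accountId: string; amount: number }[] = [];
  private responsesToDrop = 1;

  createOpportunity(accountId: string, amount: number): number {
    this.records.push({ accountId, amount }); // the write happens first
    if (this.responsesToDrop > 0) {
      this.responsesToDrop -= 1;
      throw new GatewayTimeout("504 Gateway Timeout"); // response never arrives
    }
    return this.records.length; // id of the new record
  }
}

// What a model-driven retry amounts to: see an error, call again.
function naiveRetry(crm: FakeCrm, accountId: string, amount: number): number {
  try {
    return crm.createOpportunity(accountId, amount);
  } catch {
    return crm.createOpportunity(accountId, amount);
  }
}

const crm = new FakeCrm();
naiveRetry(crm, "acct-1", 5_000);
console.log(crm.records.length); // 2: one logical request, two records
```

From the caller's side the two attempts look identical. Only the server knows the first one landed.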

The second is cost. A retry inside the model loop pays for the full prompt again.

A loop in which the model stares at a 429, decides to retry, and is billed for the entire context each time adds up quickly. No infrastructure is on fire. The agent is simply doing its job.

The right place for retries is the layer that has memory of attempts, knowledge of network state, and a wall clock. That layer is not the LLM.
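
A minimal sketch of what that layer can look like in plain TypeScript, independent of any framework. The names `retryWithBackoff`, `RetryOptions`, and `isTransient` are hypothetical; the point is only that the attempt count, the backoff, and the deadline live in ordinary code rather than in the prompt.

```typescript
interface RetryOptions {
  maxAttempts: number;
  initialIntervalMs: number;
  backoffFactor: number;
  deadlineMs: number; // wall-clock budget for all attempts combined
  isTransient: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  now?: () => number; // injectable for tests
}

async function retryWithBackoff<T>(
  call: (attempt: number) => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const now = opts.now ?? (() => Date.now());
  const startedAt = now();
  let delay = opts.initialIntervalMs;

  for (let attempt = 1; ; attempt++) {
    try {
      return await call(attempt);
    } catch (err) {
      const outOfAttempts = attempt >= opts.maxAttempts;
      // Do not start a sleep that would end after the deadline.
      const outOfTime = now() - startedAt + delay > opts.deadlineMs;
      // Permanent errors, exhausted attempts, and a blown deadline all
      // surface to the caller; only then does the model see a failure.
      if (!opts.isTransient(err) || outOfAttempts || outOfTime) throw err;
      await sleep(delay);
      delay *= opts.backoffFactor;
    }
  }
}
```

The model sees either a result or one final error. A wrapper like this is only appropriate for calls that are safe to repeat; it does nothing to resolve the duplicate-write problem described above.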

Durable tools

The pattern is to stop thinking of a tool as a function the model calls and start thinking of it as a node in a graph the system can resume. Two things change.

The intent of the call is recorded before execution starts, so a worker crash is recoverable rather than catastrophic. And the execution is wrapped in a runtime that owns retries, backoff, and resume-after-crash on the agent's behalf.
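
The first change can be sketched with an in-memory journal. This is an illustration under stated assumptions, not how LangGraph stores checkpoints: `IntentJournal` and `runDurably` are hypothetical names, and a real journal would live in a database so that it survives the crash it is meant to recover from.

```typescript
type Entry =
  | { status: "pending"; args: unknown }
  | { status: "done"; args: unknown; result: unknown };

class IntentJournal {
  private entries = new Map<string, Entry>();

  get(callId: string): Entry | undefined {
    return this.entries.get(callId);
  }

  recordIntent(callId: string, args: unknown): void {
    this.entries.set(callId, { status: "pending", args });
  }

  recordResult(callId: string, result: unknown): void {
    const entry = this.entries.get(callId);
    if (!entry) throw new Error(`no intent recorded for ${callId}`);
    this.entries.set(callId, { status: "done", args: entry.args, result });
  }
}

async function runDurably<A, R>(
  journal: IntentJournal,
  callId: string,
  args: A,
  execute: (args: A) => Promise<R>,
): Promise<R> {
  const existing = journal.get(callId);
  // Already completed in an earlier run: replay the stored result.
  if (existing?.status === "done") return existing.result as R;
  // Write the intent before touching the network, so a crash between here
  // and recordResult leaves a visible "pending" entry to reconcile.
  if (!existing) journal.recordIntent(callId, args);
  const result = await execute(args);
  journal.recordResult(callId, result);
  return result;
}
```

A `pending` entry found on resume is the ambiguous case from the previous section: the call may or may not have executed. This sketch simply re-executes, which is only safe for idempotent tools.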

LangGraph is the path of least resistance in the LangChain ecosystem. It gives you per-node retry policies, checkpointed state, and graph-level cancellation. Enough for most agents.

import { StateGraph, MessagesAnnotation } from "@langchain/langgraph";
import { ToolNode } from "@langchain/langgraph/prebuilt";
import { SqliteSaver } from "@langchain/langgraph-checkpoint-sqlite";
import { tool } from "@langchain/core/tools";
import { z } from "zod";

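// `salesforce` is the application's own API client (not shown); it is assumed
// to accept an AbortSignal so an aborted run cancels the in-flight request.
const createOpportunity = tool(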
  async ({ accountId, amount }, config) =>
    salesforce.createOpportunity(
      { accountId, amount },
      { signal: config?.signal },
    ),
  {
    name: "create_opportunity",
    description: "Create a Salesforce opportunity",
    schema: z.object({
      accountId: z.string(),
      amount: z.number(),
    }),
  },
);

const graph = new StateGraph(MessagesAnnotation)
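  // handleToolErrors: false lets tool exceptions propagate out of the node;
  // by default ToolNode catches them and returns an error ToolMessage, so the
  // retryPolicy below would never see a failure.
  .addNode("tools", new ToolNode([createOpportunity], { handleToolErrors: false }), {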
    retryPolicy: {
      maxAttempts: 5,
      initialInterval: 1_000,
      backoffFactor: 2,
      maxInterval: 30_000,
      jitter: true,
      retryOn: (err) =>
        !["AuthError", "PermissionError"].includes(err.name),
    },
  })
  .addEdge("__start__", "tools")
  .compile({
    checkpointer: SqliteSaver.fromConnString("./agent-state.db"),
  });

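// `state` is the graph input (not shown): a messages array whose last entry
// is an AI message containing the tool call to execute.
await graph.invoke(state, {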
  configurable: { thread_id: "run-42" },
  signal: AbortSignal.timeout(120_000),
});

The retryPolicy on the tool node owns the question of "how do we recover from a transient failure". The checkpointer owns the question of "what was already done before the worker died".

Invoking the graph again with the same thread_id and a null input resumes from the last saved checkpoint rather than starting over. The AbortSignal puts a wall-clock ceiling on the whole run, retries included.

The model is no longer in the loop for any of that. It is free to be wrong about which tool to call. It is no longer in a position to be wrong about how a failed call is retried.

The four timeouts, and the ones most frameworks skip

Durable runtimes converge on four timeout knobs. LangGraph implements one and a half. The other two and a half are where you eventually discover, slowly, that you have outgrown the runtime.

Schedule-to-start is how long a call may sit in the queue before a worker picks it up. If every worker is busy with its own retries, this is the timeout that says "we are already on fire, do not add to it" and fails fast.

LangGraph does not model this, because nodes execute inline within the graph's process, not across a queue of remote workers.
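
For contrast, a sketch of what a queue-based runtime checks before a worker starts a task. Everything here is hypothetical (`QueuedCall`, `takeNext`, the 5-second limit); it is not LangGraph code, since LangGraph has no queue to apply it to.

```typescript
interface QueuedCall {
  id: string;
  enqueuedAt: number; // ms timestamp
  scheduleToStartMs: number; // max time the call may wait for a worker
}

type Dispatch =
  | { kind: "run"; call: QueuedCall }
  | { kind: "expired"; call: QueuedCall; waitedMs: number };

// Called by a worker when it becomes free. Calls that waited too long are
// failed immediately instead of adding load to a system that is already behind.
function takeNext(queue: QueuedCall[], now: number): Dispatch | undefined {
  const call = queue.shift();
  if (!call) return undefined;
  const waitedMs = now - call.enqueuedAt;
  return waitedMs > call.scheduleToStartMs
    ? { kind: "expired", call, waitedMs }
    : { kind: "run", call };
}

const queue: QueuedCall[] = [
  { id: "a", enqueuedAt: 0, scheduleToStartMs: 5_000 },
  { id: "b", enqueuedAt: 4_000, scheduleToStartMs: 5_000 },
];
console.log(takeNext(queue, 6_000)?.kind); // "expired": waited 6000 ms
console.log(takeNext(queue, 6_000)?.kind); // "run": waited 2000 ms
```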

Start-to-close is the per-attempt budget. The model expects an answer in 30 seconds. The network does not get to take longer than that for any single attempt.

LangGraph leaves this to you: pass the AbortSignal from the config argument into your tool's HTTP client, or wrap the call in Promise.race against a timer. Easy to forget, easy to set inconsistently across tools.
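
A sketch of the second option in plain TypeScript. `withAttemptTimeout` is a hypothetical helper, not a LangGraph API. It gives each attempt its own AbortController, so the timer both rejects the promise and cancels the underlying request, and it forwards an outer signal such as the one from the config argument if one is supplied.

```typescript
class AttemptTimeoutError extends Error {
  constructor(ms: number) {
    super(`attempt exceeded ${ms} ms`);
    this.name = "AttemptTimeoutError";
  }
}

async function withAttemptTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
  outer?: AbortSignal,
): Promise<T> {
  const controller = new AbortController();
  const onOuterAbort = () => controller.abort();
  if (outer?.aborted) controller.abort();
  outer?.addEventListener("abort", onOuterAbort, { once: true });

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // cancel the in-flight request, not just the wait
      reject(new AttemptTimeoutError(ms));
    }, ms);
  });

  try {
    return await Promise.race([run(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
    outer?.removeEventListener("abort", onOuterAbort);
  }
}

// Usage inside a tool body (the URL is a placeholder):
// await withAttemptTimeout(
//   (signal) => fetch("https://example.invalid/api", { signal }),
//   30_000,
//   config?.signal,
// );
```

Routing every tool through one helper like this is one way to keep the per-attempt budget consistent instead of re-deciding it in each tool.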

Schedule-to-close is the end-to-end budget across all retries. The one everyone forgets. Without it, exponential backoff with maxAttempts: 5 and a 30-second per-attempt budget can legitimately run for close to three minutes per tool call: five 30-second attempts plus 1 + 2 + 4 + 8 seconds of backoff, before jitter.

AbortSignal.timeout() on graph.invoke covers this at the graph level, but it is graph-wide, not per-node, so every node shares the same budget.
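
The arithmetic is worth doing once for your own policy. A small sketch, using the retry settings from the example above and the 30-second per-attempt budget from the text, and ignoring jitter; `worstCaseMs` is a hypothetical helper, not part of LangGraph.

```typescript
interface Policy {
  maxAttempts: number;
  initialInterval: number; // ms
  backoffFactor: number;
  maxInterval: number; // ms, cap on any single backoff sleep
}

// Upper bound on one tool call: every attempt uses its full budget and fails,
// with a capped exponential sleep between consecutive attempts.
function worstCaseMs(policy: Policy, perAttemptMs: number): number {
  let total = policy.maxAttempts * perAttemptMs;
  let sleep = policy.initialInterval;
  for (let gap = 1; gap < policy.maxAttempts; gap++) {
    total += Math.min(sleep, policy.maxInterval);
    sleep *= policy.backoffFactor;
  }
  return total;
}

const total = worstCaseMs(
  { maxAttempts: 5, initialInterval: 1_000, backoffFactor: 2, maxInterval: 30_000 },
  30_000,
);
console.log(total); // 165000 ms: 150 s of attempts plus 1 + 2 + 4 + 8 s of backoff
```

That is already longer than the 120-second AbortSignal in the example, so for a run of back-to-back timeouts it is the outer ceiling, not maxAttempts, that ends the call.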

Heartbeat is liveness. A long-running tool, such as a web scrape, a data export, or a slow call to another model, periodically reports "I am alive". If the heartbeat stops, the system assumes the worker is dead and reschedules.

LangGraph has no concept of this. A stuck node stays stuck until the outer AbortSignal fires.
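
In runtimes that do support heartbeats, the bookkeeping looks roughly like this. A sketch with hypothetical names (`HeartbeatMonitor`, `beat`, `stale`); the clock is passed in explicitly so the logic is easy to test.

```typescript
class HeartbeatMonitor {
  private lastSeen = new Map<string, number>();
  private readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Called by the tool itself, e.g. once per scraped page or exported batch.
  beat(taskId: string, now: number): void {
    this.lastSeen.set(taskId, now);
  }

  // Called when the tool finishes, so it is no longer tracked.
  complete(taskId: string): void {
    this.lastSeen.delete(taskId);
  }

  // Called periodically by the runtime: tasks returned here are presumed
  // dead and are candidates for rescheduling on another worker.
  stale(now: number): string[] {
    const dead: string[] = [];
    this.lastSeen.forEach((seenAt, taskId) => {
      if (now - seenAt > this.timeoutMs) dead.push(taskId);
    });
    return dead;
  }
}

const monitor = new HeartbeatMonitor(10_000);
monitor.beat("export-1", 0);
monitor.beat("scrape-7", 0);
monitor.beat("scrape-7", 9_000); // still making progress
console.log(monitor.stale(12_000)); // only "export-1" is stale
```

The heartbeat timeout can be much shorter than the total budget for the tool, which is what lets a dead worker be detected in seconds while a healthy one runs for an hour.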

For most agents, what LangGraph provides is enough. The moment you need queues of remote workers, per-attempt budgets that differ meaningfully from the end-to-end one, or heartbeats for genuinely long operations, you have outgrown the in-process model.

At that point you need a distributed durable execution runtime instead. Recognising which side of that line you are on is more useful than picking one camp.

What this leaves the model

A durable tool node is opinionated about what the model is not responsible for. Not retries. Not backoff. Not distinguishing a network blip from an authorization failure.

The circuit breaker post covered the policy layer above this: when to stop a run entirely. Durable tools cover the layer below: how to make each call survive the kinds of failure a single retry can fix.

There is one thing the model is still uniquely positioned to decide. Once the tool returns, successfully, after retries, with a result, should the run continue at all?

The result might be empty, low-confidence, or wrong in a way that more tool calls will not fix. That is a question the infrastructure cannot answer for you.

That is the next pattern: the confidence threshold gate. Durable tools give you a reliable answer. The gate decides whether to act on it.