SLA Engineering for LLM Agents: Defining Failure, Recovery, and Cost
A practical, testable approach to service-level agreements for LLM agents: defining failure semantics, observability, replanning budgets, and economic commitments.
Agent orchestration fails less like a web request and more like a supply chain. You cannot patch that with more retries or a bigger model.
As teams wire large language models to calendars, knowledge bases, payment rails, and brittle internal APIs, the true dependency graph shows itself. Tool calls become first-class dependencies, each with its own semantics, quirks, and failure modes that cascade through a plan the agent is trying to execute. Recent public benchmarks and bake-offs have shown that agents struggle when tool outputs drift or when a planned step times out and needs repair. The instinct to measure only uptime and p95 latency is a relic. For agents, the hard work sits in specifying the boundaries of acceptable degradation, recovery procedures, and who pays for what when the plan starts to wobble.
This piece argues for SLA engineering tailored to agents: a concrete, testable contract that enumerates failure classes, ships observability hooks an agent can actually read, sets replanning budgets as first-class constraints, and encodes economic responsibilities when upstream dependencies flake. Then it shows how to wire that into CI, deployment, and procurement so the contract lives in code and in your vendor relationships.
Failure Has Semantics, Not Just Status Codes
Traditional SLAs obsess over availability and basic latency. That’s necessary but not sufficient for LLM agent SLA design. An agent does not simply await a single response; it composes a series of tool interactions, each feeding the next step. A 200 OK can be a failure if the schema drifted and the agent silently misinterprets a field. A 500 can be recoverable if a parallel tool can fill in.
An SLA for agentic systems should specify failure semantics explicitly—classes your orchestration layer can detect, with defined responses that the agent or controller can enact without guessing. To make this concrete, define a minimal failure ontology across your tools, modeled in a way the controller and the tools both implement.
A pragmatic, non-exhaustive set:
- Tool Unavailable (Hard Fail): Transport failure or endpoint down. No response body.
- Latency Breach (Soft Fail): Response exceeded a declared threshold but arrived. Agent may choose to proceed if time budget allows.
- Schema Mismatch (Semantic Fail): Response structure deviates from contract, even if syntactically valid.
- Partial Result (Degradation): Subset of fields populated with explicit absent markers. Declared upfront as loss-tolerant.
- Stale Index (Freshness Fail): Data older than a declared freshness window. Often appears in embeddings and search indexes.
- Access Denied (Policy Fail): Permission boundary or consent requirement. Distinguish between permanent vs. remediable (e.g., re-auth).
- Quota Exhausted (Capacity Fail): Rate limit or token budget breach with backoff hints.
- Ambiguous Response (Confidence Fail): Tool declares low confidence or multi-candidate ambiguity that requires disambiguation.
- Non-Deterministic Divergence (Drift): Same query yields divergent results across short windows beyond declared variance bounds.
- Side-Effect Failure (Actuation Fail): Downstream system accepted instruction but failed to complete external effect, with compensating action required.
Why so much taxonomy? Because agents need to plan. A controller can only make rational decisions—whether to retry, to route to an alternative, to escalate—if it can distinguish a stale index from a hard outage. Lumping everything into a generic error code pushes the planning burden back into the model, which is brittle and costly.
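As a minimal sketch, the ten classes above could be shared between controller and tools as an enum with stable wire values. The identifier names and string codes here are illustrative assumptions, not part of any existing standard.

```python
from enum import Enum


class FailureClass(Enum):
    """The ten failure classes listed above, with hypothetical wire codes."""
    TOOL_UNAVAILABLE = "tool_unavailable"        # hard fail
    LATENCY_BREACH = "latency_breach"            # soft fail
    SCHEMA_MISMATCH = "schema_mismatch"          # semantic fail
    PARTIAL_RESULT = "partial_result"            # degradation
    STALE_INDEX = "stale_index"                  # freshness fail
    ACCESS_DENIED = "access_denied"              # policy fail
    QUOTA_EXHAUSTED = "quota_exhausted"          # capacity fail
    AMBIGUOUS_RESPONSE = "ambiguous_response"    # confidence fail
    NONDETERMINISTIC_DIVERGENCE = "nondeterministic_divergence"  # drift
    SIDE_EFFECT_FAILURE = "side_effect_failure"  # actuation fail


def parse_failure_class(raw: str) -> FailureClass:
    """Map a tool-reported code to a class; unknown codes raise instead of guessing."""
    try:
        return FailureClass(raw)
    except ValueError:
        raise ValueError(f"unmapped failure code: {raw!r}") from None
```

Raising on an unmapped code is a deliberate choice in this sketch: a code the controller does not recognize is itself a contract violation, and it is better surfaced than silently folded into a generic error.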
These classes must come with detectable signals. That means tool providers—internal teams or vendors—commit to emitting structured, machine-checkable metadata with every response. If the search index is stale relative to its freshness SLA, it sets a flag and a timestamp. If the email tool cannot send to a certain domain due to policy, the response carries a reason code distinguishable from other failures. If the vector search returns a confidence distribution, the distribution is part of the payload, not hidden in text.
Concrete example: a knowledge retrieval tool that serves embeddings and passages adds to every response a declared freshness time, the embedding model version, and a confidence histogram across top-k results. The agent can then enforce a policy like “accept if freshness < 2 hours and top-1 confidence > threshold; otherwise attempt a targeted refresh, up to 500 ms additional budget.” This is tooling reliability for agents expressed as a contract, not a hunch.
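The acceptance policy in that example can be written down in a few lines. This is a sketch under stated assumptions: the field names, the 0.7 confidence threshold, and the three outcome labels are hypothetical, and the confidence histogram is simplified to a list of top-k scores with the best result first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RetrievalMeta:
    indexed_at: datetime            # declared freshness time (timezone-aware)
    embedding_model: str            # embedding model version
    top_k_confidence: list[float]   # simplified histogram, best result first


def decide(meta: RetrievalMeta, now: datetime, *,
           min_confidence: float = 0.7,              # assumed threshold
           max_age: timedelta = timedelta(hours=2),
           remaining_refresh_ms: int = 500) -> str:
    """Return 'accept', 'refresh', or 'degrade' for one retrieval response."""
    fresh = (now - meta.indexed_at) < max_age
    confident = bool(meta.top_k_confidence) and meta.top_k_confidence[0] > min_confidence
    if fresh and confident:
        return "accept"
    if remaining_refresh_ms > 0:
        return "refresh"   # attempt a targeted refresh within the extra budget
    return "degrade"       # no budget left: surface the degradation
```

The controller would decrement `remaining_refresh_ms` after each refresh attempt, so a response that stays stale eventually resolves to "degrade" rather than looping.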
A word on partial results. Agents can often do useful work with incomplete data, but only if the absence is explicit. “Empty string” is not a partial result; a typed null with a reason code is. This is not pedantry. It prevents the model from hallucinating missing fields into being.
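One way to make absence explicit is a small wrapper type that holds either a value or a reason code, never both and never neither. The type name and the reason codes are hypothetical; this is a sketch of the idea rather than a prescribed schema.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Field(Generic[T]):
    """A field that is either present with a value or absent with a reason."""
    value: Optional[T] = None
    absent_reason: Optional[str] = None   # e.g. "not_indexed", "redacted"

    def __post_init__(self) -> None:
        # Exactly one of the two must be set; an empty Field is rejected.
        if (self.value is None) == (self.absent_reason is None):
            raise ValueError("set exactly one of value or absent_reason")

    @property
    def present(self) -> bool:
        return self.absent_reason is None
```

Under this scheme an empty string is a legitimate value, while a missing phone number arrives as `Field(absent_reason="redacted")`, which the controller can pass to the model as an explicit gap.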
Observability That Agents Can Read
Observability for agentic AI is not only for humans scanning dashboards. The agent and its controller must consume traces and metrics at runtime to adapt plans. That requires standard fields, stable semantics, and provenance that survives hops across vendors.
A minimal event model for each tool interaction should include:
- Correlation identifiers that bind together plan, step, tool call, and any spawned child calls.
- Declared contract version and tool version. The agent cannot reason about drift without version stamps.
- Failure class, severity, and detection signal. Machines should parse it without a regex.
- Timing breakdown: queue time, service time, round-trip, and any server-reported wait time due to rate limiting.
- Cost attribution: tokens, billable units, and estimated currency impact for that step.
- Confidence or quality signals exposed by the tool, with units.
- Data provenance and freshness windows.
- Privacy and tenant markers, if operating in a multi-entity environment.
These are not wish-list fields; they enable control. A replanning policy might say: if the expected value of recovery is positive within the remaining cost and time budget, proceed; otherwise, surface a principled degradation to the user or trigger human-in-the-loop. Without structured timing and cost, that decision becomes guesswork.
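The expected-value rule can be sketched as follows. The success probability and the value of success are assumed to come from the controller's own estimates; the names are illustrative, and a real policy would likely also weigh risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    success_probability: float   # estimated, between 0 and 1
    value_if_success: float      # in currency units
    cost: float                  # in currency units
    duration_ms: int             # expected wall-clock time


def should_attempt(option: RecoveryOption,
                   remaining_cost: float,
                   remaining_ms: int) -> bool:
    """Proceed only if recovery fits the budget and has positive expected value."""
    if option.cost > remaining_cost or option.duration_ms > remaining_ms:
        return False
    expected_value = option.success_probability * option.value_if_success - option.cost
    return expected_value > 0
```

When this returns False, the controller falls through to the other branch of the policy: a principled degradation or a human handoff.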
Crucially, observability must survive hops intact. When an agent calls a vendor API that in turn calls another service, the vendor must forward correlation IDs and pass through declared failure classes. If a provider cannot propagate them, your SLA should require it to normalize nested errors and re-emit them in the shared ontology locally rather than burying them in free text.
There is also the question of tests. The contract should ship with synthetic probes that assert the observability guarantees. If a provider’s future version drops the confidence histogram or changes the name of a critical field, your CI should fail long before production does.
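A synthetic probe of this kind can be as simple as a field-and-type check run against a sample response. The field names below are hypothetical stand-ins for whatever your contract declares.

```python
# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "correlation_id": str,
    "contract_version": str,
    "tool_version": str,
    "failure_class": (str, type(None)),   # None means no failure
    "timing_ms": dict,
    "cost": dict,
    "confidence": dict,
    "freshness": dict,
}


def contract_problems(event: dict) -> list[str]:
    """List every required field that is missing or has the wrong type."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing: {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"wrong type: {name}")
    return problems
```

In CI, the probe would call the provider's staging endpoint and fail the build whenever `contract_problems` returns a non-empty list.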
Replanning Is A Budget, Not A Mood
Most agent demos show plucky improvisation. The agent tries, fails, thinks again, tries a different tool, learns a lesson. In production, that is a cost center and a risk surface. An LLM will happily replan itself into a loop if you let it. A serious LLM agent SLA needs replanning SLAs: firm budgets and declared search strategies that keep the system legible and affordable.
Think about replanning along four axes:
- Time. A wall-clock ceiling for total replanning per user decision, not per tool call. Include queue time and backoff waits.
- Money. A hard currency cap per decision, with internal allocation between inference and tools. When the cap is near, the controller degrades gracefully.
- Breadth. A limit on the number of alternative tools or branches the agent may explore before escalating or stopping.
- Interruptibility. Rules for preemption when a user changes context or when new, higher-value work arrives.
These budgets should be part of the SLA, not tucked away in code. They express an explicit trade-off between success probability and spend. In practice, you set a base budget by scenario (e.g., “refund approval,” “meeting scheduling,” “document classification”) and adjust using policy. A payment-related actuation might forbid any replanning beyond a single fallback due to compliance. A search task over an internal corpus may allow a bit more latitude due to low risk and cheap retries.
Budgets are not just limits; they are signals to the agent. The planner must see the remaining time and money and shape its strategy accordingly. That means surfacing budgets into the prompt or controller loop as structured variables, not as a sidecar comment. If the agent cannot read its constraints, it will ignore them.
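A sketch of a budget the planner can actually read: a small object the controller charges after each step and serializes into the prompt or control loop. The class and field names are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReplanBudget:
    time_ms: int       # wall-clock budget left for this decision
    currency: float    # money left for this decision
    branches: int      # alternative branches still allowed

    def charge(self, *, elapsed_ms: int = 0, cost: float = 0.0,
               branch: bool = False) -> None:
        """Deduct what a step consumed from the remaining budget."""
        self.time_ms -= elapsed_ms
        self.currency -= cost
        if branch:
            self.branches -= 1

    @property
    def exhausted(self) -> bool:
        return self.time_ms <= 0 or self.currency <= 0 or self.branches <= 0

    def to_prompt_vars(self) -> str:
        """Serialize the remaining budget as structured variables for the planner."""
        return json.dumps(asdict(self), sort_keys=True)
```

The controller would call `charge` after every tool call and replanning step, inject `to_prompt_vars()` into the next planner turn, and stop exploring as soon as `exhausted` is true.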
Agents also need a defined search strategy. Randomized exploration looks clever in demos but is unpredictable under load. The SLA can name an algorithmic stance—beam search with fixed width, greedy replanning with cached tool choice priors, or bounded best-first search guided by value estimates—so operators know what the system will do under stress. You are not freezing research forever; you are declaring the current behavior so it can be tested and audited.
What about backoff? Retries are fine when the failure class indicates transient capacity pressure and the tool provides backoff hints. They are wasteful when the class is Semantic Fail or Drift. The SLA should couple failure classes to recovery actions: retry with exponential backoff on Quota Exhausted; immediate alternative route on Schema Mismatch; abort and escalate on Side-Effect Failure with compensation required.
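The coupling described above fits in a lookup table plus a backoff helper. The three mappings come from the text; escalating on any unmapped class is an assumption of this sketch, as are the base and cap values and the use of full jitter.

```python
import random
from typing import Callable, Optional

# Failure class -> recovery action, as described in the text.
RECOVERY = {
    "quota_exhausted": "retry_with_backoff",
    "schema_mismatch": "route_alternative",
    "side_effect_failure": "abort_and_escalate",
}


def recovery_action(failure_class: str) -> str:
    # Assumption: anything not explicitly mapped escalates rather than retries.
    return RECOVERY.get(failure_class, "escalate")


def backoff_delay_ms(attempt: int, *, base_ms: int = 100, cap_ms: int = 5000,
                     hint_ms: Optional[int] = None,
                     rng: Callable[[], float] = random.random) -> int:
    """Delay before retry number `attempt` (0-based).

    A provider-supplied hint wins; otherwise use capped exponential backoff
    with full jitter, i.e. a random delay up to the exponential ceiling.
    """
    if hint_ms is not None:
        return hint_ms
    ceiling = min(cap_ms, base_ms * 2 ** attempt)
    return int(rng() * ceiling)
```

The caller is expected to compare the returned delay against the remaining time budget before sleeping; a hint longer than the budget is a reason to reroute or escalate, not to wait.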
One practical consideration: cancellation semantics. Agents need to stop work cleanly when budgets exhaust or when the user cancels. Your tools should declare whether they support idempotent cancellation and whether partial side effects are possible. Without this, you accumulate invisible liabilities—half-finished workflows, orphaned tickets, dangling holds on cards.
A word on embeddings. Many agent plans rely on similarity search, and the hidden failure class is Stale Index. If the retrieval corpus lags behind updates, the agent reasons from a shadow world. Replanning cannot fix that. Your SLA should define freshness windows for the index and declare what happens when they are exceeded: block the action, attempt an on-demand refresh within a small time budget, or proceed with an “uncertain” flag that triggers a post-hoc audit.
Recovery And Economics: Who Pays When The Plan Breaks
When an agent acts, it spends. Model tokens, tool usage, and real-world costs add up. When things fail, the spend may rise as the system tries to recover. Without an economic contract, failure becomes a tax on the operator rather than the provider responsible for the failure.
An LLM agent SLA should connect failure classes to economic responsibilities. The shape of that contract is simple:
- If a provider declares Tool Unavailable or Quota Exhausted while you are within the capacity envelope they committed to, they do not bill for the failed call and, depending on your leverage, credit a nominal amount toward the replanning costs you incur.
- If the provider returns a Schema Mismatch while claiming backward compatibility, they should bear the cost of retries and any compensating actions tied to that step.
- If you exceed a declared budget because of your own orchestration bug, that is on you. The SLA should not be a shield against incompetence.
This is not about punitive clauses. It is about predictability. When finance asks for the cost of failure, you should be able to show how it is bounded and where credits come from when suppliers underperform.
Recovery procedures should be explicit and tiered. For each failure class, agree on the action ladder:
- Local recovery: retry, reformat, switch to a known-good alternative tool, or fall back to a cached answer if the task allows it.
- Plan repair: re-synthesize a plan fragment with tighter constraints, often with a smaller model or a fixed template to limit variance.
- Human-in-the-loop: route to an operator with a crisp, structured summary, inputs, and proposed next steps. Do not dump a chat log.
- Compensation: if a Side-Effect Failure left a system in an inconsistent state, define the compensating transaction clearly, and who executes and pays for it.
Do not skip the human escalation path. There are classes of error where the ethical move is to stop. For instance, a drafting agent tasked with a regulatory letter that cannot fetch a mandatory clause due to a stale corpus should not improvise. The SLA can mark such tasks as non-degradable: if a required tool is out of spec, the plan halts. In regulated environments—common across Gulf enterprises operating multi-entity structures—these guardrails are non-negotiable.
The SLA should also clarify surge behavior. When an upstream dependency enters a brownout, does the system shed load by deferring non-urgent tasks, or does it queue and burn user patience? Which cohorts get priority? These are product decisions encoded as operations policy. They belong in the contract so customers and internal stakeholders know what to expect.
One more economic detail: attribution. An agent may use multiple foundation models, third-party APIs, and internal tools in one decision. Your observability should trace the decision_cost across those legs so you can assign credits or debits fairly. That means normalizing cost units into currency at the time of decision, not at the end of the month. It also opens the door to hedging (occasionally running the same step across two vendors to detect drift or silent data quality issues) without losing track of who owes whom what when one underperforms.
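A minimal sketch of per-decision attribution: each leg records its native units and the unit price in force at decision time, and the total is grouped by provider. The `CostLeg` type and the prices in the usage note are made up for illustration; `Decimal` is used so currency sums do not pick up floating-point noise.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class CostLeg:
    provider: str
    units: int            # native billable units (tokens, calls, ...)
    unit_price: Decimal   # currency per unit, captured at decision time


def decision_cost(legs: list[CostLeg]) -> dict[str, Decimal]:
    """Sum the currency cost of one decision, grouped by provider."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for leg in legs:
        totals[leg.provider] += leg.units * leg.unit_price
    return dict(totals)
```

For example, two legs on a hypothetical `model_a` at 0.000002 per token (1200 and 300 tokens) and three calls to a `search` tool at 0.001 each both come to 0.003 for the decision, and a credit for a failed leg can be computed against the same records.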
Shipping It: CI, Deployment, and Procurement That Respect Agents
A contract is only as good as the machinery that enforces it. Agent orchestration lives or dies on whether the SLA is expressed in code, tested in CI, guarded in deployment, and reflected in procurement.
Start with CI. Write contract tests that spin up a fake tool server exercising each failure class and asserting the controller’s behavior: what is retried, what is escalated, what is logged. These are not mock happy paths; they are adversarial fixtures. Include probes for observability fields. A golden trace for a canonical task, with all metrics and fields, should be part of the repository. When a provider ships a new version, run the fixture against it in a staging environment. If the version drops a field or changes a failure code, the test fails. Treat these as hard gates, not best-effort.
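A contract test in this spirit might look like the sketch below: a scripted fake tool replays failure classes, and the test asserts what the controller retries and what it escalates. The `run_step` controller is a deliberately tiny stand-in, not a real orchestration layer, and the response shape is assumed.

```python
def run_step(tool, max_retries: int = 2):
    """Toy controller: retry only on quota_exhausted, escalate on anything else."""
    seen = []
    for _ in range(max_retries + 1):
        failure = tool().get("failure_class")
        seen.append(failure)
        if failure is None:
            return "ok", seen
        if failure != "quota_exhausted":
            return "escalate", seen      # non-transient: do not retry
    return "escalate", seen              # retries exhausted


def scripted_tool(responses):
    """Fake tool that returns the given responses in order, one per call."""
    iterator = iter(responses)
    return lambda: next(iterator)


def test_transient_failure_is_retried():
    tool = scripted_tool([{"failure_class": "quota_exhausted"},
                          {"failure_class": None}])
    assert run_step(tool) == ("ok", ["quota_exhausted", None])


def test_schema_mismatch_is_never_retried():
    tool = scripted_tool([{"failure_class": "schema_mismatch"},
                          {"failure_class": None}])
    assert run_step(tool) == ("escalate", ["schema_mismatch"])


def test_retry_budget_is_bounded():
    tool = scripted_tool([{"failure_class": "quota_exhausted"}] * 5)
    assert run_step(tool) == ("escalate", ["quota_exhausted"] * 3)
```

The functions follow the `test_*` naming that pytest collects by default; the same scripted fixtures can be pointed at the real controller once it exposes a comparable entry point.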
Because models and embeddings drift, add chaos to the mix. Inject schema mismatches, randomize latency tails, return partial results with holes. Watch whether your replanning budgets hold. The controller should keep you within bounds automatically, or loudly refuse to proceed.
Promote with canaries. When you ship a change to the planner policy or to the tool selector, route a small percentage of traffic and compare not just success rates but cost per decision, replanning depth, and time-to-stability. The canary should fail closed if budgets are violated, even if raw success nudges up a few points. That is not pedantry; it guards you from winning today and losing the quarter.
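The fail-closed rule can be sketched as a gate function. The metric names, the caps, and the allowed success-rate drop are assumptions; time-to-stability is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortStats:
    success_rate: float        # fraction of decisions that succeeded
    cost_per_decision: float   # mean currency cost
    mean_replan_depth: float   # mean number of replanning rounds


def canary_passes(baseline: CohortStats, canary: CohortStats, *,
                  cost_cap: float, depth_cap: float,
                  max_success_drop: float = 0.01) -> bool:
    """Gate a canary on budget adherence first, success rate second."""
    # Fail closed: a budget violation blocks promotion even if success improved.
    if canary.cost_per_decision > cost_cap or canary.mean_replan_depth > depth_cap:
        return False
    return canary.success_rate >= baseline.success_rate - max_success_drop
```

Checking budgets before success is the point of the ordering: a canary that wins on success rate by overspending never reaches the second comparison.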
Versioning is your friend. Version your tool contracts, your failure ontology, and your planner policies. Agents should pass these versions along in requests so downstream components can adapt or at least refuse incompatible calls cleanly. A “latest” free-for-all invites silent breakage.
Gate deployments on backpressure behavior. Under surge, does your system drop low-priority requests before it melts, or does it spawn replanning storms that grind everything to a halt? Test it. Bake backpressure signals into the SLA—providers must return specific headers or fields telling you to cool off, and your controller must honor them.
On the procurement side, bake the contract into your RFPs and vendor questionnaires. Ask simple, concrete questions:
- Which of these failure classes do you emit natively? Which can you map reliably?
- Do you include declared freshness timestamps and model or index versions in responses?
- Do you provide confidence distributions or quality scores, and how stable are their semantics across versions?
- What is your stance on idempotent cancellation and compensating actions?
- Can you forward correlation IDs across internal hops and expose a trace export for audit?
- What credits do you offer for brownouts and semantic failures that violate backward compatibility?
Do not let vendors wave this off with generic SLAs about uptime. Tooling reliability for agents is about semantics and observability. A 99.9% uptime figure means little if 1% of nominally successful responses carry a hidden schema drift that silently corrupts your plan.
For internal tools, use the same template. It may feel heavy-handed to make a sibling team sign a contract, but without shared semantics you end up in an argument after every incident. The SLA creates an engineering handshake: here is what we promise, here is how you can detect it, here is the cost-sharing when we fail and you spend to recover.
Regionally, public-sector programs in places like the UAE have already conditioned vendors to deliver against clear digital transformation targets with strong audit trails. Extending that muscle to LLM agent SLA language is natural: observable contracts, explicit recovery, and cost bounds that pass procurement review without hand-waving. The same patterns help large Gulf conglomerates orchestrate agents across multiple entities with nuanced permissions.
Finally, document the human fallback. When the agent escalates, operations teams need a crisp package: the plan state, inputs and outputs so far, failure classes hit, remaining budget, and a suggested next step. This is not an apologetic chatbot transcript. It is a surgical handoff that allows a human to fix the problem or bless a conservative alternative.
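The handoff package can be given a fixed shape so operators always receive the same fields. The structure below mirrors the items named above; the field names and the JSON rendering are illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Handoff:
    plan_state: str                   # where the plan stopped
    inputs: dict                      # what the agent was given
    outputs_so_far: dict              # results of completed steps
    failure_classes_hit: list[str]    # ontology codes seen on the way
    remaining_budget: dict            # e.g. time_ms, currency, branches
    suggested_next_step: str          # the agent's proposed action


def render_handoff(handoff: Handoff) -> str:
    """Serialize the package as stable, human-readable JSON for the operator queue."""
    return json.dumps(asdict(handoff), indent=2, sort_keys=True)
```

Because every field is required at construction time, an escalation that lacks a suggested next step or a remaining budget fails in the controller rather than arriving half-empty at an operator's desk.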
A Counter-Argument: Isn’t This Premature Bureaucracy?
There is a fair pushback. Agents are early. Models change monthly. Tool ecosystems are a churn of plugins and point solutions. Why ossify with SLAs when the frontier is moving?
A few responses.
First, this is not ossification; it is scaffolding. A testable SLA does not freeze your design any more than unit tests freeze your code. It gives you a way to change it safely. Version the ontology. Expand failure classes as you learn. Update replanning budgets as you collect evidence. The contract lets you tell what changed and whether you are still within the bounds you promised to users and finance.
Second, the costs are already here. If you are running agents at even modest volume, you are paying for replanning churn, bad tool assumptions, and untracked brownouts. Formalizing the contract surfaces where the money leaks and lets you plug it. It often increases velocity because you stop arguing about what happened and start fixing it.
Third, “better models will fix it” is not enough. Even with stronger reasoning, tool ecosystems will remain heterogeneous. Indexes will go stale. Permissions will bite. Downstream systems will have seasons of load. Your foundation model cannot conjure a fresh inventory catalog or bypass a failing payment gateway. It can only reason with what it is given. Better prompts do not create observability fields.
Fourth, there is a risk argument. Sloppy, open-ended replanning creates unpredictable behavior under stress. It can inflate costs at the worst possible moment and create opaque incidents you cannot audit. A minimal, enforced contract turns that chaos into a manageable set of states and responses. This is not paperwork; it is resilience.
Finally, bureaucracy is not the point. Automation is. A contract that an agent can read is not a PDF to appease legal. It is a live interface that turns planning into engineering. When the tool says “stale index,” the planner does not vibe-check; it pivots. When the budget is at 90%, the planner does not dither; it compresses or escalates. That is how you scale without heroics.
There is a weaker counter-argument too: we already have SLOs and error budgets. True, and reuse as much as you can. But agent failure modes are not just distributions of latency and availability. They are semantics that alter plans. Traditional SLOs do not tell a decision-maker inside the loop what to do when the index is stale or when a tool admits low confidence. You need that guidance inline.
The Spec, In Words You Can Ship
Pulling the threads together, here is a compact blueprint you can implement without waiting for a standards body.
Define a shared failure ontology with at least the ten classes listed earlier, mapped by every tool you call. Require structured responses that carry:
- Contract version and tool version.
- Failure class, severity, and detection signal.
- Freshness timestamps and data provenance.
- Confidence metrics with defined semantics.
- Timing breakdown and rate limit hints.
- Cost attribution in both native units and an estimated currency.
- Privacy and tenant markers where relevant.
Expose these fields in your traces and enforce them with CI probes.
Set replanning SLAs per scenario:
- Time ceilings in milliseconds for the end-to-end decision.
- Currency caps with split allocations (model vs. tools vs. compensation).
- Search breadth caps and declared search strategy.
- Cancellation and interruptibility semantics.
Couple failure classes to recovery actions and economic duties:
- Retry on transient capacity pressure with provider-supplied backoff; no billing for failed attempts, credits for induced replanning where negotiated.
- Alternate routing on schema or semantic failures; provider bears cost if contract was violated.
- Halt and escalate on non-degradable tasks when freshness windows are exceeded.
- Compensating actions defined and costed for actuation failures.
Wire it into operations:
- Contract tests and chaos fixtures in CI.
- Canary deployments that gate on budget adherence, not just success rate.
- Versioned policies, propagated through requests and traces.
- Backpressure-aware controllers that honor provider signals.
Reflect it in procurement:
- Vendors agree to the ontology mapping, observability fields, and credit terms for brownouts and semantic violations.
- Internal tools sign the same contracts to align expectations and speed incident resolution.
None of this requires exotic infrastructure. It requires discipline and an insistence that agents deserve the same rigor we bring to other critical systems, with the added twist that their dependencies are often probabilistic or third-party. Once your contracts are live, you will discover new failure classes. Add them. You will find that some confidence signals correlate poorly with real outcomes. Fix them or drop them. The point is not to predict everything. It is to make failure legible and recovery bounded.
The payoff is not just reliability. It is product clarity. When a product owner asks how the agent behaves under stress, you can answer in specifics: which steps degrade, which ones stop, how much it will cost at the edge, and how quickly it returns to steady state. Those are the contours of trust.
The next wave of agent engineering will reward teams that treat SLAs as living code and economic instruments, not static PDFs. Specify failure. Make observability machine-readable. Budget replanning like you budget money. Tie recovery to responsibility. Then ship with your eyes open. That is how agent orchestration becomes infrastructure rather than theater.