May 31, 2026 · 15 min read · Agent Architecture

Push Agent Skill Distillation to Devices, Not Just the Cloud

Makes the case for hybrid agent architectures that run signed skills on devices to cut cost, reduce data exposure, and sidestep platform gatekeeping.

Article · 16 min read · Agent Architecture · Edge AI

The Center of Gravity Is Wrong

We keep sending our agents to the cloud to think, decide, and act. Then we pay for the privilege. The bill shows up as compute, data egress, and a quiet dependency on other people’s rules.

There’s a cleaner path: distill the skills your agents need into small, verified kernels and push them to the devices that live closest to the work. Phones. Kiosks. IoT boxes in cabinets that never see daylight. Keep orchestration in the cloud, but move more execution to the edge. Treat the agent as a conductor, not a virtuoso. Treat skills as signed, auditable artifacts.

This is not nostalgia for offline software. It’s a practical response to costs, controls, and what recent research now makes achievable. Hybrid agent architecture, with on-device AI agents executing distilled skills, gives enterprises a way to shrink cloud spend, reduce data exposure, and sidestep platform gatekeeping—without giving up governance or verifiability.

Why Cloud-Only Agents Are Showing Their Seams

The last few years rewarded speed. Ship a thin client, lean on a giant model, call tools through a gateway, and meter the whole thing. That approach is fine for prototypes and some high-value workflows. It looks less fine when the volume goes up, or when the platform that grants you access decides to change the rules mid-flight.

Three kinds of friction keep appearing:

Economics. General-purpose reasoning is expensive. Chatty agents are worse. A single customer interaction can cascade into dozens of calls: retrieval, function calling, follow-ups, safety passes. You can spend heavily just to produce output you already compute elsewhere inside the organization.
Exposure. Every cloud call is a data event. Even with careful redaction and data processing agreements, some workloads don’t want their prompts or context swimming through third-party systems. In regulated or multi-entity environments, the math favors local processing that keeps data on the device or inside isolated edges.
Gatekeeping. The platforms that sit between your agents and the end user hold the switches. Rate limits. Policy swings. Silent content filters. Fine for consumer apps. Risky for enterprise processes you intend to run at scale or across jurisdictions. You don’t need to break rules to see the frame: you are downstream from someone else’s governance and margins.

When teams talk about “enterprise AI orchestration,” most of the focus lands on chaining, memory, and tool use in the cloud. That’s reasonable. But it assumes the decisive moves always happen there. What would change if small, capable skills sat on the endpoints themselves, ready to execute without a trip through the network?

Distill Skills, Not Personalities

“Skill distillation for AI agents” has a specific, practical goal: take a behavior that currently requires a heavy model and a lot of context, and compress it into a smaller, verifiable kernel that performs the narrow task well and consistently. The kernel can be a tiny model, a policy, a program, or a hybrid of these. The key is minimality and testability.

Recent work—threads like PANDO and AgentDoG among others—points to viable paths. Researchers have been showing that you can train smaller policies from traces generated by larger, more capable agents, especially when the task domain is tight and the tool interface is stable. This is not magic; it’s supervised imitation and targeted optimization. The result is a component that does one thing, or a small family of things, with less compute and fewer surprises.

Enterprises don’t need a philosophical breakthrough to act on this. They need a process:

Identify bounding boxes. Don’t try to distill “customer support.” Distill “intent classification for these 30 intents” or “on-device redaction of PII in invoices before OCR upload.” Draw crisp lines around the skill boundary.
Generate or collect traces. Use your current best agent stack to produce strong demonstrations. Curate. If you have production logs, anonymize and extract high-quality episodes. If not, synthesize scenarios with clear tool calls and ground truth outcomes. Keep the domain tight.
Train the kernel. Depending on the task, this might be fine-tuning a small model, learning a reactive policy, or compiling a code path with learned thresholds. Many skills end up as hybrid artifacts: a small model backs a deterministic program that handles structure and guardrails.
Fix the interface. Publish a stable ABI for the skill: inputs, outputs, errors, and resource claims. Bundle tests alongside the artifact.
Sign and ship. Create a signed skill artifact with provenance metadata, evaluation scores on a held-out set, and a safety profile. Push it to endpoints through your standard software distribution.
Observe and rotate. Collect privacy-preserving telemetry, re-run evaluations on drift samples, and rotate the skill when it degrades or the domain changes.
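The "fix the interface" and "sign and ship" steps can be sketched as a manifest that is bound to the exact artifact bytes. This is a minimal illustration, not a prescribed format: the `SkillManifest` fields, the `build_manifest` and `matches` helpers, and the example skill name are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass(frozen=True)
class SkillManifest:
    """Hypothetical manifest that travels with a skill artifact."""
    name: str
    version: str
    artifact_sha256: str      # hash of the packaged skill bytes
    inputs: dict              # field name -> type name
    outputs: dict             # field name -> type name
    errors: tuple             # error codes the skill may return
    max_memory_mb: int        # resource claim
    max_latency_ms: int       # resource claim
    eval_scores: dict = field(default_factory=dict)  # held-out metrics


def build_manifest(artifact: bytes, **fields) -> SkillManifest:
    """Bind the manifest to the exact artifact bytes via a content hash."""
    digest = hashlib.sha256(artifact).hexdigest()
    return SkillManifest(artifact_sha256=digest, **fields)


def matches(manifest: SkillManifest, artifact: bytes) -> bool:
    """True only if the artifact is byte-identical to what was evaluated."""
    return hashlib.sha256(artifact).hexdigest() == manifest.artifact_sha256


artifact = b"placeholder bytes standing in for a packaged skill"
manifest = build_manifest(
    artifact,
    name="intent-router",
    version="1.0.0",
    inputs={"text": "str"},
    outputs={"intent": "str", "confidence": "float"},
    errors=("UNCERTAIN", "INVALID_INPUT"),
    max_memory_mb=64,
    max_latency_ms=50,
    eval_scores={"accuracy": None},  # filled in by your own evaluation run
)
manifest_json = json.dumps(asdict(manifest), sort_keys=True)
```

Because the hash covers the packaged bytes, any change to the skill forces a new manifest, which is what lets evaluation scores "travel with" a specific version.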

The output of that pipeline is not an amorphous agent “persona.” It’s a small, hardened capability with a clear contract. A kiosk doesn’t need creativity to check a warranty. A technician’s phone doesn’t need a 100-billion-parameter model to pick the next troubleshooting step from a finite catalog. Most frontline agent actions are narrow. That’s where distillation pays.

A Hybrid Agent Architecture That Actually Ships

The operating model is simple to describe and specific to implement: orchestrate in the cloud, execute on the device when you can, and offload only when you must. The cloud remains the coordinator, evaluator, and policy brain. Devices hold signed skills and a runtime that can call them safely.

A concrete layout looks like this:

Cloud orchestrator. Maintains conversation state, policy checks, assignment logic (“which skills are allowed for this user/context?”), and fallbacks. Runs heavier reasoning when needed. Handles enterprise AI orchestration across channels and systems.
Skill registry. Stores signed skill artifacts with manifests: version, hash, supported platforms, interfaces, resource budgets, evaluation metrics, and safety profile. Also stores compatibility matrices (which runtime versions can load which skills).
Distribution layer. Uses existing MDM, app stores, or IoT fleet managers to deliver skills and runtimes to endpoints. The unit of delivery is small and frequent. No monoliths.
Endpoint runtime. A lightweight container or sandbox—WebAssembly is a good candidate—that loads skills, enforces permissions (files, sensors, network), and exposes a consistent call interface to local apps. The runtime handles local logging, signature checks, and quotas.
Local context cache. Per-device vector index or key-value store for recent interactions and device-specific data. Scoped to the device, not a cross-user pool.
Privacy shim. Redacts or tokenizes sensitive fields before any off-device call. Also performs local PII detection where possible.
Telemetry and attestation. Batches signed logs (hashes of inputs/outputs, timing, resource use) and ships them upstream. Supports device attestation so the cloud can verify it’s talking to a genuine runtime.
Fallback pathways. When a skill returns “uncertain” or exceeds a confusion budget, the orchestrator can escalate to a larger model in the cloud or to a human.
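The "execute on the device when you can, offload only when you must" rule reduces to a small routing decision. The sketch below assumes a skill reports a confidence between 0 and 1. The `route` function, the `SkillResult` type, the 0.8 threshold, and the toy kiosk functions are illustrative names, not part of any specific runtime.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SkillResult:
    value: Optional[str]
    confidence: float   # 0.0 to 1.0, as reported by the local skill


def route(
    request: str,
    local_skill: Optional[Callable[[str], SkillResult]],
    cloud_fallback: Callable[[str], str],
    min_confidence: float = 0.8,   # assumed threshold; tune per skill
) -> Tuple[str, str]:
    """Return (answer, path) where path is 'device' or 'cloud'."""
    if local_skill is not None:
        result = local_skill(request)
        if result.value is not None and result.confidence >= min_confidence:
            return result.value, "device"
    # Skill missing, uncertain, or below threshold: escalate.
    return cloud_fallback(request), "cloud"


# Toy stand-ins for a kiosk stock lookup and a larger cloud model.
def stock_lookup(request: str) -> SkillResult:
    catalog = {"sku-123": "in stock"}
    if request in catalog:
        return SkillResult(catalog[request], confidence=0.99)
    return SkillResult(None, confidence=0.0)


def cloud_model(request: str) -> str:
    return f"escalated: {request}"
```

Calling `route("sku-123", stock_lookup, cloud_model)` stays on the device, while an unfamiliar question falls through to `cloud_model`.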

Here’s how it plays out in practice:

A retail price-check kiosk needs to answer common product questions and fetch stock status. The skill set includes: barcode parsing, local product lookup, a small NLU model trained on the store’s taxonomy, and a query builder for stock APIs. Those skills live on the kiosk. The orchestrator handles rare questions (“compare two warranty plans across categories”) by escalating to a larger model with broader context. For the 95% case, the device responds in milliseconds, without shipping customer context back and forth.

Or consider a field technician diagnosing a power subsystem. The phone holds skills for error-code interpretation, a guided diagnostic tree distilled from service manuals, and a local summarizer that turns notes into structured updates. Photos never leave the device until redacted. The orchestrator steps in only when the device raises an “uncertain” flag or requests authorization for a risky action.

You still get the benefits of cloud brains where they matter: cross-session memory, policy enforcement, and learning from fleet telemetry. But you shed a large fraction of routine work to the edge, where it runs faster, cheaper, and with tighter data boundaries.

For organizations in the Gulf—banks, ports, airlines, city services—the pattern maps well onto regulated, multi-entity environments. You can keep sensitive operations within site boundaries while coordinating across a central brain. When a kiosk in Dubai Mall or a check-in desk at an airport loses connectivity, the distilled skills keep serving most requests, and logs reconcile when the link returns.

Governance You Can Prove: Signed, Auditable Skills

None of this is credible without governance. “Trust us, it’s distilled” is not a policy. Signed, auditable skill artifacts and repeatable evaluation make the model workable.

A minimal governance spec for on-device AI agents should include:

Reproducible builds. Skills are built from versioned code and data with deterministic pipelines. The build environment is recorded, the SBOM attached, and the hash anchored in a registry. If you rebuild with the same inputs, you get the same artifact.
Artifact signing. Every skill, runtime, and configuration bundle is signed by the issuing team. Devices verify signatures and pin trust to known issuers. Rotate keys on a schedule, and revoke compromised issuers quickly.
Safety profile. Each skill ships with a documented risk surface: allowed inputs, banned outputs, resource caps, escalation triggers. The profile references a test suite with clear pass/fail criteria.
Evaluation harness. Skills pass a battery of tests: functional, adversarial, and bias/quality checks suited to the domain. Scores travel with the artifact. When a skill changes, its scores are regenerated and published with the new version.
Attested execution. Devices attest their runtime integrity—ideally leveraging hardware-backed key stores—so the orchestrator can verify that a skill ran inside a known environment with expected permissions.
Auditable logs. Devices hash input/output pairs (or structured summaries thereof) and sign timing/resource records. The orchestrator keeps an append-only ledger of these hashes. You can’t reconstruct sensitive data from the hashes, but you can prove what ran and when, and tie it back to a signed artifact.
Kill switches. Skills carry a revocation list and can be disabled remotely. The orchestrator respects local safety overrides by design.
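The load-time checks implied by artifact signing and kill switches might look like the sketch below. One loud assumption: HMAC-SHA256 with a shared key stands in for a real asymmetric signature scheme, only so the example runs on the standard library. A production runtime would verify public-key signatures instead. The issuer name, the key, and the `may_load` helper are hypothetical.

```python
import hashlib
import hmac

# ASSUMPTION: HMAC with a shared key is a stand-in for asymmetric signatures.
TRUSTED_ISSUER_KEYS = {"skills-team": b"demo-key-not-for-production"}
REVOKED_DIGESTS = set()   # artifact hashes disabled by the kill switch


def sign(issuer: str, artifact: bytes) -> str:
    """Issuer-side signing of the packaged skill bytes."""
    key = TRUSTED_ISSUER_KEYS[issuer]
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def may_load(issuer: str, artifact: bytes, signature: str) -> bool:
    """Load only artifacts from pinned issuers that verify and are not revoked."""
    key = TRUSTED_ISSUER_KEYS.get(issuer)
    if key is None:
        return False                       # unknown issuer
    expected = hmac.new(key, artifact, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False                       # tampered or wrongly signed
    digest = hashlib.sha256(artifact).hexdigest()
    return digest not in REVOKED_DIGESTS   # kill switch check
```

Revoking a skill is then a matter of distributing its digest to `REVOKED_DIGESTS` on each device. A validly signed artifact still refuses to load once its digest is listed.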

This is not exotic. It’s standard software supply chain discipline applied to learned artifacts. The difference is that the unit of delivery is smaller and more frequent. The payoff is traceability: you can answer, with evidence, “Which version of the redaction skill processed this invoice on that kiosk at 14:03?” That’s agent verifiability in a form compliance teams can use.
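To make the traceability question concrete, here is a minimal sketch of an append-only ledger in which each entry commits to the one before it. The `Ledger` class, its field names, and the device and skill identifiers are assumptions for illustration. A real deployment would also sign entries and anchor the chain somewhere the device cannot rewrite.

```python
import hashlib
import json


def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Ledger:
    """Append-only log where each entry commits to the previous one."""

    def __init__(self):
        self.entries = []

    def append(self, device, skill, version, payload_in: bytes,
               payload_out: bytes, ts: str) -> None:
        record = {
            "device": device,
            "skill": skill,
            "version": version,
            "in_hash": _h(payload_in),    # only hashes are stored
            "out_hash": _h(payload_out),
            "ts": ts,
            "prev": self.entries[-1]["entry_hash"] if self.entries else "",
        }
        record["entry_hash"] = _h(json.dumps(record, sort_keys=True).encode())
        self.entries.append(record)

    def verify(self) -> bool:
        """Recompute the chain; an edited or removed entry breaks it."""
        prev = ""
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "entry_hash"}
            recomputed = _h(json.dumps(body, sort_keys=True).encode())
            if e["prev"] != prev or recomputed != e["entry_hash"]:
                return False
            prev = e["entry_hash"]
        return True

    def which_version(self, device, skill, payload_in: bytes):
        """Which version of this skill processed this input on this device?"""
        target = _h(payload_in)
        return [e["version"] for e in self.entries
                if e["device"] == device and e["skill"] == skill
                and e["in_hash"] == target]
```

An auditor who holds the original invoice can hash it and ask `which_version` for the answer. The ledger itself never stores the invoice.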

One edge case worth calling out: learned skills drift. Domain terms shift. Edge data diverges from training sets. This is where the hybrid pattern helps. You can set a “confusion budget” per skill: a rolling measure of uncertainty from local confidence signals and post-hoc checks. When it exceeds a threshold, the runtime escalates. The cloud can then schedule targeted retraining or swap in a new skill with a higher score on current data. The device doesn’t need to improvise; it needs to detect when not to.
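One way to read "confusion budget" is as a rolling mean of per-call uncertainty over a fixed window. The sketch below takes that reading. The window size, the threshold, and the rule that a failed post-hoc check counts as full uncertainty are all assumptions to calibrate per skill.

```python
from collections import deque


class ConfusionBudget:
    """Rolling mean of per-call uncertainty over the last `window` calls."""

    def __init__(self, window: int = 50, threshold: float = 0.3):
        self.scores = deque(maxlen=window)   # old scores fall off automatically
        self.threshold = threshold           # assumed value; calibrate per skill

    def record(self, confidence: float, posthoc_ok: bool = True) -> None:
        # Uncertainty is 1 - confidence; a failed post-hoc check counts as 1.0.
        uncertainty = 1.0 if not posthoc_ok else 1.0 - confidence
        self.scores.append(uncertainty)

    def exceeded(self) -> bool:
        """True when the rolling mean uncertainty is above the threshold."""
        if not self.scores:
            return False
        return sum(self.scores) / len(self.scores) > self.threshold
```

The runtime would call `record` after every skill invocation and escalate to the orchestrator whenever `exceeded()` returns true.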

Costs, Privacy, and Platform Gatekeeping

There’s a lot of heat around cost claims in AI systems, and little of it is stable. Prices move. Models change. What doesn’t change is the shape of the bill: you pay for tokens, you pay for vector calls, you pay for network, and you pay for guardrails. Every call out is another line item. Moving narrow, high-volume skills to devices removes most of those line items for the calls that move.

Three concrete reductions show up fast:

Token churn. The small classification or routing problems that dominate many agent workflows don’t merit large-model calls. Distilled skills handle these locally with a fraction of the compute. The orchestrator trims long chains because early steps finish on-device.
Context marshalling. You stop shuttling the same blobs around. A kiosk can embed its local catalog once, keep the index fresh, and answer lookups without round trips.
Overhead multipliers. Safety wrappers, retries, batching—these cost. Local skills cut retry rates on the fast path and reserve cloud wrappers for the cases that benefit from them.
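The shape of the saving can be written down without committing to any prices. The symbols here are assumptions introduced for illustration: $N$ is the number of interactions, $k$ the average cloud calls per interaction, $c$ the average all-in cost of one cloud call (tokens, vector calls, network, guardrails), $f$ the fraction of calls a distilled skill handles locally, and $d$ the amortized device-side cost of one local call.

```latex
\begin{align}
C_{\text{cloud}}  &= N\,k\,c \\
C_{\text{hybrid}} &= N\,k\,\bigl[(1-f)\,c + f\,d\bigr] \\
C_{\text{cloud}} - C_{\text{hybrid}} &= N\,k\,f\,(c - d)
\end{align}
```

The saving grows with volume ($N k$), with the share of work that moves ($f$), and with the gap between a cloud call and a local call ($c - d$). That is why narrow, high-volume skills are the natural first candidates.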

Privacy wins are more straightforward. On-device redaction before capture. Local summarization before sync. Local policy checks that stop data from flowing at all. A hybrid agent architecture lets security teams express rules at the only place that guarantees enforcement: the endpoint that sees the raw data.

Then there’s gatekeeping. Platform operators have their own constraints and priorities. When your agent relies on a single API for core behavior, you inherit those constraints. Terms of service can shift, moderation layers can tighten, and rate limits can cap your scaling just when demand shows up. On-device AI agents do not eliminate these dependencies—you still need models, updates, and distribution channels—but they reduce your exposure. If a vendor tightens access to a cloud endpoint, your fleet of devices keeps running its local skills while you adapt. You still escalate when you must, but the pressure drops.

None of this argues for tossing safety. A local skill is not a backdoor around guardrails. It’s a signed, tested component with a tighter mandate than any general-purpose model call. That narrowness is a safety feature, not a risk.

Objections Worth Hearing—and How To Respond

Skeptics have good points. Shipping skills to devices introduces new work. It complicates deployment. It opens attack surfaces. Let’s put the hard questions on the table.

“Devices are messy. How do you handle fragmentation?” You don’t, unless you plan for it. The antidote is a small, stable runtime with a clean ABI. Treat the runtime like a kernel that hides device variation from skills. Use feature flags and capability detection so the orchestrator knows what each device can do. Skills declare resource budgets up front. If a device can’t honor them, the orchestrator routes around it. This is not glamorous, but it is software engineering 101.

“Isn’t this just more things to update?” Yes, and that’s fine. Small skills are easy to rotate. Use your existing MDM or over-the-air mechanisms. Version skills separately from host apps. Pin them to device cohorts. Push updates gradually and roll back surgically. The operational footprint is closer to content updates than full app releases.
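Gradual pushes and surgical rollbacks are easier when cohort membership is deterministic. A minimal sketch, assuming each device has a stable identifier, follows. `rollout_bucket` and `target_version` are hypothetical helpers, not features of any particular MDM product.

```python
import hashlib


def rollout_bucket(device_id: str, skill: str) -> int:
    """Stable bucket in [0, 100) derived from the skill and device names."""
    digest = hashlib.sha256(f"{skill}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def target_version(device_id: str, skill: str, stable: str,
                   candidate: str, percent: int) -> str:
    """Devices whose bucket falls below `percent` get the candidate version."""
    if rollout_bucket(device_id, skill) < percent:
        return candidate
    return stable
```

Raising `percent` only ever adds devices to the candidate cohort, and setting it back to 0 returns every device to the stable version.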

“What about data exfiltration or rogue skills?” Signed artifacts and per-skill permissions bring the risk down to something you can talk about. Skills don’t get the network by default. They don’t get sensors by default. The runtime logs every capability use. Enterprise distribution channels don’t install unsigned code. In high-sensitivity settings, require device attestation before a skill will run. You can go further with sandboxes like WebAssembly or similar that offer a constrained execution model.
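The default-deny stance can be illustrated in a few lines. This is a sketch of the policy only: in practice the enforcement would live in the sandbox itself (for example a WebAssembly host), and the `Sandbox` class and capability names below are assumptions.

```python
class PermissionDenied(Exception):
    pass


class Sandbox:
    """Grants a skill only the capabilities named for it; denies the rest."""

    def __init__(self, skill_name: str, granted=()):
        self.skill_name = skill_name
        self.granted = frozenset(granted)   # empty by default: deny everything
        self.audit = []                     # every capability request is logged

    def use(self, capability: str) -> None:
        allowed = capability in self.granted
        self.audit.append((self.skill_name, capability, allowed))
        if not allowed:
            raise PermissionDenied(
                f"{self.skill_name} may not use {capability}"
            )
```

A redaction skill created with `Sandbox("pii-redactor", granted=("camera",))` can use the camera, but a call to `use("network")` raises and leaves a denied entry in the audit trail.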

“Won’t model theft get worse if the model is on the device?” If your skill includes a small model, an attacker can try to extract it. Avoid putting crown-jewel IP in a client. Favor distilled policies that are only valuable when paired with your systems and data. Encrypt at rest and rely on hardware-backed keys where available. More importantly, scope the risk: if the skill is valuable only inside your environment—because it calls tools you control—the incentive to steal it drops.

“Aren’t on-device models too weak?” For broad reasoning, yes. For bounded tasks with clear interfaces, not necessarily. The point of skill distillation is to stop asking a sledgehammer to pick a lock. Route the hard, rare cases to the cloud. Let devices handle their daily bread.

“Telemetry at the edge sounds like surveillance.” It can be, if you collect raw data. You don’t have to. Hash inputs and outputs. Sample sparsely. Do not store content unless the user consents for support. Use signed aggregates for performance and safety signals. Your goal is verifiability and fleet health, not peeking into conversations.
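A telemetry record that carries no content might be built as below. Two assumptions are worth flagging. The hashes are keyed with a fleet secret, because a plain hash of a short or guessable input can be matched by brute force. The 1% sampling rate is a placeholder. The function names are hypothetical.

```python
import hashlib
import hmac
import random


def telemetry_record(fleet_key: bytes, skill: str, version: str,
                     payload_in: bytes, payload_out: bytes,
                     latency_ms: float) -> dict:
    """Keyed hashes let the fleet match records without storing content."""
    def keyed(data: bytes) -> str:
        return hmac.new(fleet_key, data, hashlib.sha256).hexdigest()

    return {
        "skill": skill,
        "version": version,
        "in_hash": keyed(payload_in),
        "out_hash": keyed(payload_out),
        "latency_ms": latency_ms,
    }


def should_sample(rng: random.Random, rate: float = 0.01) -> bool:
    """Keep roughly `rate` of calls; the default rate is an assumed placeholder."""
    return rng.random() < rate
```

Only records that pass `should_sample` would be batched and shipped, and what ships is hashes plus timing, not the conversation.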

“Regulators will worry.” Good. Give them artifacts they can inspect: SBOMs, evaluation reports, signed manifests, and attested logs. It’s easier to show how a kiosk skill handles PII when the skill is a small, testable component than when it’s an opaque cloud call to a vendor’s black box.

Finally, the meta-objection: “This is a lot of architecture for something we can just buy as a SaaS.” If your problem is small and not core to your mission, SaaS is probably right. If your problem is large, touches sensitive data, or is the thing you do for a living, building the muscle to distill and ship skills is worth it. It’s the same rationale that led teams to own mobile apps instead of outsourcing every feature to webviews. Control and margin follow the work.

How to Start Without Boiling the Ocean

You don’t need a lab of PhDs or a heroic rewrite. Pick one or two skills where the economics and privacy math are obvious. Then build a simple path from cloud traces to on-device skill artifacts. A thin slice can look like this:

Choose a narrow, high-volume task. Examples: on-device intent routing for support chats; PII redaction before capture; product attribute extraction from barcode scans; offline FAQ responses for kiosks in areas with intermittent connectivity.
Define the interface and tests first. Inputs, outputs, latency target, acceptable error envelope, and a confusion signal. Build a small evaluation set with real, messy cases.
Train a baseline. Start with a tiny model or even a rules-plus-thresholds approach guided by a larger model’s outputs. The bar is “meets tests,” not “wins benchmarks.”
Wrap it in a runtime. Build the minimum spec for your device sandbox and signing flow. Do not overdesign. A WASM-based runtime that loads signed artifacts and enforces a few permissions will carry you a long way.
Ship to a small cohort. Watch telemetry. Set a high bar for escalation to the cloud so you see where the edges are.
Iterate. Add a second skill that composes with the first. Practice updates. Document the playbook.
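"Define the interface and tests first" and "the bar is meets tests" can be combined into a small acceptance gate. The sketch below is one possible shape. `passes_gate`, the accuracy and latency targets, and the toy keyword router with its intents are illustrative assumptions.

```python
import time


def passes_gate(skill, cases, min_accuracy: float, max_latency_ms: float) -> bool:
    """Run `skill` over (input, expected) pairs and check both targets."""
    correct = 0
    worst_ms = 0.0
    for text, expected in cases:
        start = time.perf_counter()
        got = skill(text)
        elapsed_ms = (time.perf_counter() - start) * 1000
        worst_ms = max(worst_ms, elapsed_ms)   # track the slowest call
        correct += int(got == expected)
    accuracy = correct / len(cases)
    return accuracy >= min_accuracy and worst_ms <= max_latency_ms


# Toy rules-based baseline; the intents and keywords are illustrative only.
def keyword_router(text: str) -> str:
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "warranty" in lowered:
        return "warranty"
    return "uncertain"   # the confusion signal from the interface definition
```

A candidate skill ships to the first cohort only if `passes_gate` returns true on the evaluation set of real, messy cases. The same gate can be re-run on drift samples later.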

Once you have the loop running, formalize the skill registry and build pipelines. Bring security in early to agree on signing, provenance, and attestation schemes. Align with compliance on the logging strategy. These conversations are easier when they anchor to concrete artifacts and tests rather than abstract “AI safety” debates.

As your fleet grows, your orchestrator’s job evolves from “decide everything” to “route and verify.” That shift matters. You stop paying for thought you don’t need, and you stop begging for platform favors. You start acting like a software organization again, not a reseller of someone else’s API.

This also changes vendor conversations. You still use strong cloud models. You still value managed services. You just stop pretending you can outsource your cost structure and privacy posture. When you carry signed skills in your pocket, your negotiation stance improves.

One last practical thread: people and skills. Distillation is a craft. It sits between MLOps, product engineering, and safety. Train a small group to own the pipeline. Reward them for removing cloud calls and tightening safety profiles. Measure wins in reduced token spend, lower latency, and verifiable behavior, not vanity benchmarks.

The Gulf’s delivery pace and its mix of public and private operators create fertile ground for this pattern. Many organizations already coordinate across free zones and vendors while keeping sensitive operations close. Pushing distilled skills to endpoints extends that habit to agents. The familiar playbook of sign, attest, rotate still applies. Only the payload changes.

The technology gates are open enough. Phones can run small models. Kiosks can carry sandboxes. IoT boxes can verify signatures and phone home. Research keeps tightening the loop from big-agent traces to small, stable behaviors. You don’t need to wait for the perfect model to start moving work off the wire.

Ship skills, not personalities. Distill what you can prove. Sign what you ship. Let the cloud coordinate, not carry. That’s how enterprises get agents they can afford, govern, and keep running when the network—or the platform—has other plans.