Technical By Michael Smith

Building on Frontier Models: A Stack for Multi-Agent Production Systems

What the production stack for multi-agent AI actually looks like in 2026 — frontier model routing, credential infrastructure, sandboxed tool execution, metering, and the middleware that holds it all together.

The toy stack stops working at ~5 agents

Most AI-agent tutorials hand you a notebook with an OpenAI key, a LangChain wrapper, and three tools. Tutorials are great. Production is different. Once you have more than about five agents talking to more than about ten external services, the toy stack breaks in five predictable places at once.

This is a walkthrough of what production actually looks like in 2026, based on the stack we run at Titanium Labs to operate our own agency (The Leading Practice) and to deliver client engagements. It’s opinionated. It’s also the same stack we’d recommend if you were building from scratch.

The five things that break

  1. Credential sprawl. Every agent needs credentials (mostly OAuth tokens) for Google, Microsoft, GoHighLevel (GHL), Stripe, Calendly, Slack, your CRM, your data warehouse, and twelve other things. Storing OAuth refresh tokens in app env vars is fine until you have eight apps. Then it’s a security nightmare and a constant rotation problem.

  2. Cost explosions. A single misbehaving agent in a loop can burn a four-figure API bill in an afternoon. We’ve seen it. Twice. You need budget enforcement at the platform layer, not the application layer.

  3. Tool execution risk. Every tool an agent calls is an attack surface. A malicious or hallucinated tool call can delete data, send emails, charge cards. You need sandboxing, policy enforcement, and an audit log per call — not “I trust the LLM.”

  4. Model whiplash. The frontier model landscape changes every six weeks. Claude 4.7 outperforms GPT-5 on your benchmark today. In two months, Qwen 3 Coder ships and changes the answer. You need a provider abstraction so you can swap models without rewriting application code.

  5. Observability gaps. When a multi-agent system produces a bad output, the only useful question is “show me the full trace of every model call, every tool call, and every state mutation that led to this answer.” If you can’t answer that, you can’t debug. Most teams can’t answer that.

The production stack we run

Model layer — ProviderRegistry, not a hardcoded key

We never instantiate a model directly in application code. Every model call goes through a ProviderRegistry that knows about every available provider — Anthropic (Claude Opus, Sonnet, Haiku), OpenAI (GPT-4o/5), Qwen (via Together or self-hosted), and a Cloudflare AI free tier for the cheap stuff. The registry routes based on a per-agent policy: “use Opus for planning, Sonnet for tool use, Haiku for classification.”

This is the single highest-leverage decision in the stack. When Claude 5 ships, we change one config. When OpenAI drops prices, we change one config. When an enterprise client requires zero-data-retention, we route their tenant to a different provider — one config.
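
A minimal Python sketch of the registry idea, not the actual implementation. The class shape, the method names, the `support-agent` name, and the lambda "providers" are all assumptions made for illustration; real adapters would wrap each vendor's SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# A provider is modelled as a plain callable: prompt in, completion out.
# Real adapters would wrap a vendor SDK; these are stand-ins.
ModelFn = Callable[[str], str]


@dataclass
class ProviderRegistry:
    """Maps model aliases to callables, and per-agent roles to aliases."""
    models: Dict[str, ModelFn] = field(default_factory=dict)
    policies: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def register(self, alias: str, fn: ModelFn) -> None:
        self.models[alias] = fn

    def set_policy(self, agent: str, routes: Dict[str, str]) -> None:
        # Fail at config time, not mid-request, if a route names an unknown model.
        unknown = sorted(set(routes.values()) - set(self.models))
        if unknown:
            raise KeyError(f"unregistered models in policy: {unknown}")
        self.policies[agent] = dict(routes)

    def complete(self, agent: str, role: str, prompt: str) -> str:
        # Application code names an agent and a role, never a vendor or model.
        alias = self.policies[agent][role]
        return self.models[alias](prompt)


registry = ProviderRegistry()
registry.register("opus", lambda p: f"[opus] {p}")
registry.register("sonnet", lambda p: f"[sonnet] {p}")
registry.register("haiku", lambda p: f"[haiku] {p}")

# Hypothetical per-agent policy mirroring the example in the text.
registry.set_policy(
    "support-agent",
    {"planning": "opus", "tool_use": "sonnet", "classification": "haiku"},
)
```

Because callers only ever say `registry.complete("support-agent", "planning", prompt)`, swapping the planning model is a change to the policy dictionary and nothing else.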

Credential layer — Claude Gateway

Every external API the agents touch requires credentials. We do not store OAuth refresh tokens in application env vars. Ever. Apps call our Claude Gateway at runtime to obtain a short-lived access token. The gateway handles refresh, rotation (GHL and Microsoft rotate refresh tokens on each use, which is fun), location-token exchange (GHL issues company tokens that are exchanged for location-scoped tokens), and credential storage.

Each application has exactly four env vars: GATEWAY_URL, GATEWAY_API_KEY, FALLBACK_GATEWAY_URL, FALLBACK_GATEWAY_API_KEY. That’s it. No GOOGLE_REFRESH_TOKEN. No STRIPE_SECRET_KEY. The blast radius of an app secret leak goes from “rebuild OAuth flows for fifteen services” to “rotate one gateway API key.”
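
On the application side, the four env vars could be consumed roughly like this. This is a sketch under stated assumptions: the `/token` path, the JSON request body, and the `access_token` response field are hypothetical, since the gateway's real contract is not described here. The transport is injectable so the fallback logic can be exercised without a network.

```python
import json
import urllib.request
from typing import Callable, Mapping

# transport(base_url, api_key, service) -> parsed JSON response as a dict.
Transport = Callable[[str, str, str], dict]


def http_transport(base_url: str, api_key: str, service: str) -> dict:
    # Hypothetical endpoint and payload shape; adjust to the real gateway contract.
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/token",
        data=json.dumps({"service": service}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def get_access_token(
    service: str,
    env: Mapping[str, str],
    transport: Transport = http_transport,
) -> str:
    """Try the primary gateway, then the fallback; return a short-lived token."""
    attempts = [
        (env["GATEWAY_URL"], env["GATEWAY_API_KEY"]),
        (env["FALLBACK_GATEWAY_URL"], env["FALLBACK_GATEWAY_API_KEY"]),
    ]
    last_error = None
    for url, key in attempts:
        try:
            return transport(url, key, service)["access_token"]
        except (OSError, KeyError) as exc:
            # OSError covers network failures (urllib's URLError subclasses it);
            # KeyError covers a response without the expected field.
            last_error = exc
    raise RuntimeError(f"no gateway could issue a token for {service!r}") from last_error
```

In application code, `env` would be `os.environ`. The app never holds a refresh token; it only holds the means to ask for a short-lived one.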

Middleware pipeline — Context → Policy → Sandbox → Budget → Hooks → Artifacts → Metering

Every agent invocation flows through a middleware pipeline. Each stage has a defined contract and can be inspected independently.

  • Context — load the agent’s working memory, prior messages, current task state
  • Policy — enforce per-tenant policy (allowed tools, allowed models, allowed external destinations)
  • Sandbox — execute tool calls in an isolated environment with timeout, memory, and network restrictions
  • Budget — check per-task, per-tenant, and per-day spending limits before the call; deduct after
  • Hooks — emit events for downstream consumers (logs, monitors, billing, customer notifications)
  • Artifacts — persist intermediate outputs (file uploads, screenshots, generated reports) to durable storage
  • Metering — record token usage, latency, success/failure for cost attribution and SLOs

If a stage fails, the pipeline halts with a structured error that an SRE can read. No more “the agent did something weird, can you check the logs?”
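
The halting behaviour can be sketched in a few lines of Python. Only two of the seven stages are shown, and every name here (`StageError`, `Invocation`, the stage factories) is an assumption for illustration. The budget stage reserves the estimated cost up front, which is a simplification of the check-before, deduct-after flow described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


class StageError(Exception):
    """Structured failure: which stage halted the pipeline, and why."""

    def __init__(self, stage: str, reason: str):
        super().__init__(f"{stage}: {reason}")
        self.stage = stage
        self.reason = reason


@dataclass
class Invocation:
    tenant: str
    tool: str
    estimated_cost_usd: float


Stage = Callable[[Invocation], None]


def policy_stage(allowed_tools: Dict[str, Set[str]]) -> Stage:
    def run(inv: Invocation) -> None:
        if inv.tool not in allowed_tools.get(inv.tenant, set()):
            raise StageError("policy", f"tool {inv.tool!r} not allowed for {inv.tenant!r}")
    return run


def budget_stage(remaining_usd: Dict[str, float]) -> Stage:
    def run(inv: Invocation) -> None:
        left = remaining_usd.get(inv.tenant, 0.0)
        if inv.estimated_cost_usd > left:
            raise StageError("budget", f"needs ${inv.estimated_cost_usd:.2f}, ${left:.2f} left")
        remaining_usd[inv.tenant] = left - inv.estimated_cost_usd
    return run


def run_pipeline(stages: List[Tuple[str, Stage]], inv: Invocation) -> List[str]:
    """Run stages in order and return the names that completed; any failure halts."""
    completed: List[str] = []
    for name, stage in stages:
        try:
            stage(inv)
        except StageError:
            raise
        except Exception as exc:
            # Wrap unexpected errors so callers always see a StageError.
            raise StageError(name, repr(exc)) from exc
        completed.append(name)
    return completed


stages = [
    ("policy", policy_stage({"acme": {"search"}})),
    ("budget", budget_stage({"acme": 1.00})),
]
```

A caller catches `StageError` and logs `stage` and `reason` as separate fields, which is what makes the failure readable without digging through raw logs.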

Orchestration — WAT pattern (Workflows / Agents / Tools)

We separate three concerns. Workflows are markdown SOPs that describe, in plain language, what the agent is trying to accomplish: objective, inputs, expected tool sequence, edge cases, output format. Agents are LLM-driven decision-makers that read the workflow and orchestrate tool calls. Tools are deterministic Python or TypeScript code that does the actual work: API calls, data transforms, file ops.

The probabilistic LLM is constrained to reasoning. The deterministic code does execution. When a workflow fails, you fix the tool (if it’s a bug), the workflow (if the SOP was wrong), or the agent prompt (if the LLM was confused). Three clean failure modes, three clean fixes. No “the AI is broken, we don’t know why” debugging.
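
To make the agent/tool split concrete, here is a small Python sketch. The `tool` decorator, the decision format (`{"tool": ..., "args": ...}`), and `stub_agent` are all hypothetical; the stub stands in for an LLM call that reads the workflow and picks the next tool.

```python
from typing import Any, Callable, Dict

# Registry of deterministic tools, keyed by function name.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a deterministic function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def normalize_email(address: str) -> str:
    # Pure, testable code: same input, same output, no model involved.
    return address.strip().lower()


def execute(decision: Dict[str, Any]) -> Any:
    """Run the tool an agent chose. The agent only decides; this code does the work."""
    name = decision.get("tool")
    if name not in TOOLS:
        # A hallucinated tool name fails loudly instead of doing something undefined.
        raise ValueError(f"agent requested unknown tool: {name!r}")
    return TOOLS[name](**decision.get("args", {}))


def stub_agent(workflow: str) -> Dict[str, Any]:
    # Stand-in for an LLM call; a real agent would derive this from the workflow text.
    return {"tool": "normalize_email", "args": {"address": "  Ada@Example.COM "}}


result = execute(stub_agent("Objective: clean up the contact's email address."))
```

The tool can be unit-tested with no model in the loop, and the agent's output can be inspected as plain data before anything runs.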

Storage — Postgres for everything, pgvector for retrieval

Supabase Postgres is the primary database. pgvector handles retrieval-augmented generation without a separate vector DB. The two exceptions are ClickHouse for high-cardinality observability data and S3-compatible object storage for artifacts. We don’t introduce a new datastore unless there’s a compelling reason, because each one is a new operational burden.

Observability — OpenTelemetry traces, per-call

Every model call, every tool call, every middleware stage emits an OpenTelemetry span. We can render the full trace of any agent invocation as a flame graph. When a customer reports a bad output, the answer to “what happened” is one query away. This single capability has eliminated more debugging time than any other investment.
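
The parent/child structure that makes a flame graph possible can be illustrated with a toy recorder. This is deliberately stdlib-only and is not the OpenTelemetry API; the `Recorder` and `Span` classes and the span names are assumptions, and a real deployment would use the OpenTelemetry SDK and an exporter instead.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    attributes: Dict[str, object] = field(default_factory=dict)
    duration_s: float = 0.0


class Recorder:
    """Toy in-memory span recorder that tracks nesting with a stack."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: object):
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, parent, dict(attributes))
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - start
            self._stack.pop()
            self.finished.append(current)


rec = Recorder()
with rec.span("agent.invoke", tenant="acme"):
    with rec.span("model.call", model="sonnet") as s:
        s.attributes["tokens"] = 128
    with rec.span("tool.call", tool="search"):
        pass
```

Each finished span knows its parent, its attributes, and its duration, which is the minimum needed to reconstruct the full tree of an invocation after the fact.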

The hard-won lesson

The most important thing we’ve learned running this stack in production for the last 18 months: the bottleneck is never the model. The frontier models are good enough. The bottleneck is always the infrastructure around the model — credentials, budgets, sandboxing, observability, governance. Teams that obsess over prompt engineering and ignore the platform layer get stuck at “interesting demo.” Teams that build the platform layer can swap in any frontier model and immediately get production-grade results.

That’s why we built Ottolax. That’s why we license it to clients. Most teams shouldn’t rebuild this — there’s no advantage in it, and the year of platform work is a year you’re not shipping product.

Tags:

#multi-agent #claude #architecture #ottolax

Michael Smith

Founder & Principal

Builder, Operator

AI Strategy & Roadmapping; Multi-Agent System Architecture; Frontier Model Integration (Claude, GPT, Qwen); Production AI Operations; Fractional CAIO Engagements
