MCP Servers for AI Agents: From Demo to Production

An MCP server exposing tools to multiple AI agents in production

You can stand up an MCP server in an afternoon. It works, your AI agent calls a tool, the demo convinces. Then you go to plug it into your real business tools — a CRM, an ERP, a desktop assistant, a coding agent, a Telegram bot — and give it access to live data. That's where it stops being easy.

We build and run MCP servers in production. One of ours exposed 91 tools across six domains, among them CRM, project management, invoicing, and product lookups, when we wrote about how it's structured; it's past 200 tools today. This post is the other half: the decisions the quickstart never mentions, the ones we'd actually argue about, and the gotchas that cost us real time.

What is an MCP server for an AI agent?

It's a single, typed interface between an AI agent and your business systems. Model Context Protocol is the standard: instead of building a custom integration for every agent you run, you build one server that exposes tools, and any MCP-compatible client can call them. The agent says "search contacts where status is discussing"; the server turns that into a query and returns structured results. The agent never touches your database, and your database never needs to know an agent exists.

That indirection is the whole point. It's also where the production problems live, because the server is now the thing standing between an autonomous agent and your real data.

We didn't use an MCP SDK

The protocol is small: initialize, tools/list, tools/call, ping, a couple of notifications, all JSON-RPC 2.0. We implemented it by hand in about a hundred lines of Go and pinned it to the 2025-03-26 spec, rather than pull in an SDK.

For our server that was the right call, and we'd make it again: the surface is tiny, and we'd rather own a hundred lines than track a fast-moving dependency that wraps the same five methods. But know the cost before you copy us — when the spec moves, we move it by hand. If you need OAuth flows, resumable streams, and the spec's sharp edges handled for you, take the SDK. If your server is a thin, typed door onto your own systems, the hand-rolled version is less code than the wrapper around it.
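To give a sense of scale, here is a minimal sketch of what a hand-rolled dispatcher for those five methods might look like. It is an illustration under assumptions, not the server described here: the registry map, the echo tool, and the serverInfo values are all hypothetical, and it skips input schemas and batching.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const protocolVersion = "2025-03-26" // the pinned spec revision

type request struct {
	ID     json.RawMessage `json:"id"` // absent on notifications
	Method string          `json:"method"`
	Params json.RawMessage `json:"params"`
}

type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type response struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  any             `json:"result,omitempty"`
	Error   *rpcError       `json:"error,omitempty"`
}

// toolFunc and registry are hypothetical stand-ins for a real tool registry.
type toolFunc func(args json.RawMessage) (string, error)

var registry = map[string]toolFunc{
	"echo": func(args json.RawMessage) (string, error) { return string(args), nil },
}

func encode(r response) []byte {
	r.JSONRPC = "2.0"
	b, _ := json.Marshal(r) // the response only holds marshalable values
	return b
}

// handle decodes one message and returns the encoded reply,
// or nil when the message is a notification and needs no reply.
func handle(raw []byte) []byte {
	var req request
	if err := json.Unmarshal(raw, &req); err != nil {
		return encode(response{ID: json.RawMessage("null"), Error: &rpcError{-32700, "parse error"}})
	}
	if len(req.ID) == 0 {
		return nil // e.g. notifications/initialized
	}
	resp := response{ID: req.ID}
	switch req.Method {
	case "initialize":
		resp.Result = map[string]any{
			"protocolVersion": protocolVersion,
			"capabilities":    map[string]any{"tools": map[string]any{}},
			"serverInfo":      map[string]any{"name": "example", "version": "0.1.0"},
		}
	case "ping":
		resp.Result = map[string]any{}
	case "tools/list":
		tools := []map[string]any{}
		for name := range registry {
			tools = append(tools, map[string]any{"name": name, "inputSchema": map[string]any{"type": "object"}})
		}
		resp.Result = map[string]any{"tools": tools}
	case "tools/call":
		var p struct {
			Name      string          `json:"name"`
			Arguments json.RawMessage `json:"arguments"`
		}
		if err := json.Unmarshal(req.Params, &p); err != nil || registry[p.Name] == nil {
			resp.Error = &rpcError{-32602, "unknown tool or bad params"}
			break
		}
		text, err := registry[p.Name](p.Arguments)
		if err != nil {
			text = err.Error() // tool failures are reported in the result, not as RPC errors
		}
		resp.Result = map[string]any{
			"content": []map[string]any{{"type": "text", "text": text}},
			"isError": err != nil,
		}
	default:
		resp.Error = &rpcError{-32601, "method not found"}
	}
	return encode(resp)
}

func main() {
	fmt.Println(string(handle([]byte(`{"jsonrpc":"2.0","id":1,"method":"ping"}`))))
}
```

Because handle takes bytes and returns bytes, the same function can sit behind any transport, which is what makes the one-registry approach in the next section cheap.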

One server, two transports, one registry

We run stdio and Streamable HTTP off the same binary, chosen by a TRANSPORT env var, both routing into one tool registry and one handler — so the two transports can't drift in behaviour. stdio is for local clients (Claude Desktop, a coding agent on your machine). The HTTP transport is gin serving POST/GET/DELETE /mcp, with a session id in the Mcp-Session-Id header for remote agents.

The gotcha that cost us an afternoon: in stdio mode, stdout is the JSON-RPC channel. Anything else written there, whether a stray log line or gin's startup banner, corrupts the stream, and the client just sees garbage. Every log has to go to stderr, and gin has to be silenced with gin.SetMode(gin.ReleaseMode). Obvious in hindsight, not obvious at 2am.
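The stdout rule is easier to keep when the writer is passed in explicitly. The sketch below assumes newline-delimited messages and a handler of the shape func([]byte) []byte; serveStdio, demoSession, and the echo handler are hypothetical names for illustration, using only the standard library.

```go
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
	"strings"
)

// serveStdio reads one JSON-RPC message per line from in and writes one
// reply per line to out. Only replies are ever written to out; every
// diagnostic goes to logger, which should be backed by stderr.
func serveStdio(in io.Reader, out io.Writer, logger *log.Logger, handle func([]byte) []byte) error {
	sc := bufio.NewScanner(in)
	sc.Buffer(make([]byte, 0, 64*1024), 4*1024*1024) // allow messages larger than the 64 KiB default
	for sc.Scan() {
		line := bytes.TrimSpace(sc.Bytes())
		if len(line) == 0 {
			continue
		}
		logger.Printf("received %d bytes", len(line))
		reply := handle(line)
		if reply == nil {
			continue // notifications get no reply
		}
		// Copy before appending so the scanner's buffer is never modified.
		framed := make([]byte, 0, len(reply)+1)
		framed = append(framed, reply...)
		framed = append(framed, '\n')
		if _, err := out.Write(framed); err != nil {
			return err
		}
	}
	return sc.Err()
}

// demoSession runs serveStdio over in-memory buffers and returns what
// landed on the protocol channel and what landed on the log channel.
func demoSession(input string) (stdout, stderr string) {
	var out, errBuf bytes.Buffer
	logger := log.New(&errBuf, "mcp ", 0)
	echo := func(msg []byte) []byte { return msg } // stand-in for a real handler
	if err := serveStdio(strings.NewReader(input), &out, logger, echo); err != nil {
		logger.Printf("serve: %v", err)
	}
	return out.String(), errBuf.String()
}

func main() {
	// In a real process: stdin in, stdout out, and the logger on stderr.
	logger := log.New(os.Stderr, "mcp ", log.LstdFlags)
	input := strings.NewReader(`{"jsonrpc":"2.0","id":1,"method":"ping"}` + "\n")
	echo := func(msg []byte) []byte { return msg }
	if err := serveStdio(input, os.Stdout, logger, echo); err != nil {
		logger.Fatalf("serve: %v", err)
	}
}
```

Keeping the protocol writer and the logger as separate parameters also makes the rule testable: feed in a few lines and assert that the protocol buffer contains replies and nothing else.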

The agent only sees the tools it's allowed to call

This is the decision we're most opinionated about. "Multi-tenancy" in most write-ups means filter query results by tenant. We do that — but we also filter the menu. tools/list returns a different set of tools per caller, on two axes.

First, the API key carries an allow-list of modules. Second, and this is the part that matters: each tool declares a Mode (sales, service, success, client_order, client_onboard), we resolve the caller's role from their org membership, and we hide every tool that doesn't match. A sales agent never sees the service-desk tools. A client placing an order sees ordering and nothing else. The model can't misuse a tool it was never shown — a far stronger guarantee than hoping a system prompt holds under a jailbreak.

When resolution fails — unknown caller, lookup error — we fall back to the smallest tool set, not the largest. Fail closed.
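A sketch of the two-axis filter with a fail-closed fallback, under assumptions: the Tool and Caller shapes, the role-to-mode table, and the choice of client_onboard as the smallest mode are all illustrative, not taken from the real server.

```go
package main

import (
	"errors"
	"fmt"
)

type Tool struct {
	Name   string
	Module string // checked against the API key's allow-list
	Mode   string // e.g. "sales", "service", "client_onboard"
}

type Caller struct {
	AllowedModules map[string]bool // carried by the API key
	OrgRole        string          // empty when the caller could not be resolved
}

// fallbackMode is assumed here to be the most restricted mode.
const fallbackMode = "client_onboard"

// resolveMode maps an org role to a tool mode. The table is hypothetical.
func resolveMode(c Caller) (string, error) {
	roleToMode := map[string]string{"sales_rep": "sales", "support": "service"}
	mode, ok := roleToMode[c.OrgRole]
	if !ok {
		return "", errors.New("cannot resolve role")
	}
	return mode, nil
}

// visibleTools returns only the tools that pass both axes:
// the key's module allow-list and the caller's resolved mode.
func visibleTools(all []Tool, c Caller) []Tool {
	mode, err := resolveMode(c)
	if err != nil {
		mode = fallbackMode // fail closed: smallest set, never the full menu
	}
	var out []Tool
	for _, t := range all {
		if c.AllowedModules[t.Module] && t.Mode == mode {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	all := []Tool{
		{Name: "search_contacts", Module: "crm", Mode: "sales"},
		{Name: "close_ticket", Module: "desk", Mode: "service"},
		{Name: "start_onboarding", Module: "crm", Mode: "client_onboard"},
	}
	sales := Caller{AllowedModules: map[string]bool{"crm": true, "desk": true}, OrgRole: "sales_rep"}
	for _, t := range visibleTools(all, sales) {
		fmt.Println(t.Name)
	}
}
```

The same filter has to guard tools/call as well as tools/list, otherwise a caller who guesses a hidden tool's name can still invoke it.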

Three kinds of callers, sorted at the door

The same server answers three very different callers, and it decides which one you are before any tool runs:

an authenticated teammate — the bearer key maps to a user and an org;
a known client — identified by a contact id passed in an _identity blob, scoped to their own account;
an unknown client — anonymous, dropped into a locked-down onboarding mode.

Those last two are why one server can sit behind both our internal Claude Code and a customer's WhatsApp thread without a second codebase. (When a caller needs a per-module credential, we forward it with a compound token, mcpKey::moduleKey, instead of widening the main one.)
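One way the sorting and the compound token could be expressed, as a sketch only. The Directory type, the contact_id field name inside _identity, and the precedence (team key first, then contact id) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

type CallerKind string

const (
	Teammate      CallerKind = "teammate"
	KnownClient   CallerKind = "known_client"
	UnknownClient CallerKind = "unknown_client"
)

// Identity mirrors the _identity blob; the field name is an assumption.
type Identity struct {
	ContactID string `json:"contact_id"`
}

// Directory is a hypothetical lookup, backed by maps for the sketch.
type Directory struct {
	TeamKeys map[string]string // MCP key -> user id
	Contacts map[string]bool   // known contact ids
}

// splitToken separates a compound "mcpKey::moduleKey" bearer token.
// A plain token yields an empty moduleKey.
func splitToken(bearer string) (mcpKey, moduleKey string) {
	mcpKey, moduleKey, _ = strings.Cut(bearer, "::")
	return mcpKey, moduleKey
}

// classify decides which kind of caller this is before any tool runs.
// Anything it cannot positively identify is treated as an unknown client.
func (d Directory) classify(bearer string, id *Identity) CallerKind {
	mcpKey, _ := splitToken(bearer)
	if _, ok := d.TeamKeys[mcpKey]; ok {
		return Teammate
	}
	if id != nil && d.Contacts[id.ContactID] {
		return KnownClient
	}
	return UnknownClient
}

func main() {
	d := Directory{
		TeamKeys: map[string]string{"team-key": "user-1"},
		Contacts: map[string]bool{"c42": true},
	}
	key, moduleKey := splitToken("team-key::invoicing-key")
	fmt.Println(key, moduleKey, d.classify("team-key::invoicing-key", nil))
	fmt.Println(d.classify("bridge-key", &Identity{ContactID: "c42"}))
	fmt.Println(d.classify("", nil))
}
```

Note that the default branch is the locked-down one, so a malformed token or a missing _identity blob degrades to onboarding mode rather than to an error path that someone later "fixes" by widening access.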

Writes are stateless dry-runs, not two-phase confirms

Every write tool takes an optional dry_run. With it on, the tool does everything except commit, and returns the proposed change as a preview:

{
  "dry_run": true,
  "preview": {
    "action": "create_invoice",
    "client": "Maison Mercier",
    "lines": [{ "item": "House Red 2021", "qty": 6, "unit_price": 52 }],
    "total": 312
  }
}

The agent shows that to a human, the human says yes, the agent re-sends the same call with dry_run off. We deliberately did not build a two-phase confirm with a server-held token — that means session state, expiry, cleanup, and a server that has to remember things between calls. Stateless re-send is uglier on paper and far simpler in production. That confirm step is what stops a hallucinated write from ever landing.
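The pattern is small enough to show in full. This is a sketch under assumptions: the argument and result types are modelled on the preview above, and Store is a hypothetical in-memory stand-in for the invoicing backend. The key property is that the dry-run and the real call share every step except the commit.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type Line struct {
	Item      string `json:"item"`
	Qty       int    `json:"qty"`
	UnitPrice int    `json:"unit_price"`
}

type InvoiceArgs struct {
	Client string `json:"client"`
	Lines  []Line `json:"lines"`
	DryRun bool   `json:"dry_run,omitempty"`
}

type Preview struct {
	Action string `json:"action"`
	Client string `json:"client"`
	Lines  []Line `json:"lines"`
	Total  int    `json:"total"`
}

type Result struct {
	DryRun  bool    `json:"dry_run"`
	Preview Preview `json:"preview"`
}

// Store is a hypothetical stand-in for the real backend.
type Store struct{ Invoices []Preview }

// createInvoice validates and prices the invoice the same way on both
// paths, and commits only when DryRun is false. No state is kept
// between the preview call and the confirming call.
func createInvoice(s *Store, a InvoiceArgs) (Result, error) {
	if a.Client == "" || len(a.Lines) == 0 {
		return Result{}, errors.New("client and at least one line are required")
	}
	total := 0
	for _, l := range a.Lines {
		if l.Qty <= 0 {
			return Result{}, fmt.Errorf("line %q: qty must be positive", l.Item)
		}
		total += l.Qty * l.UnitPrice
	}
	p := Preview{Action: "create_invoice", Client: a.Client, Lines: a.Lines, Total: total}
	if !a.DryRun {
		s.Invoices = append(s.Invoices, p) // the only side effect
	}
	return Result{DryRun: a.DryRun, Preview: p}, nil
}

func main() {
	store := &Store{}
	args := InvoiceArgs{
		Client: "Maison Mercier",
		Lines:  []Line{{Item: "House Red 2021", Qty: 6, UnitPrice: 52}},
		DryRun: true,
	}
	res, err := createInvoice(store, args)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, _ := json.MarshalIndent(res, "", "  ")
	fmt.Println(string(out))
}
```

Since the server holds no confirmation token, it cannot verify that a preview was shown before the real call arrives; that discipline lives in the client, and the dry-run only makes it cheap to follow.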

Let Go do the math, let the model do the judgment

The most useful line we drew is between deterministic work and reasoning. Our Supervisor module exposes anomaly rules tagged by kind. Rules of kind calculation are evaluated server-side in Go: thresholds, counts, deltas, the things code does perfectly and an LLM does expensively and sometimes wrong. Rules of kind llm are handed to the agent to work through in a ReAct loop. Don't ask the model to add up invoice totals; don't ask Go to judge whether a customer message sounds upset. Route each to the side that's actually good at it, and make the boundary explicit in the data.
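Making the boundary explicit in the data can be as simple as a kind field and a switch. The sketch below assumes a Rule shape with hypothetical Metric, Threshold, and Prompt fields; only the two kind values come from the text.

```go
package main

import "fmt"

type Rule struct {
	Name      string
	Kind      string  // "calculation" or "llm"
	Metric    string  // hypothetical: which metric a calculation rule reads
	Threshold float64 // hypothetical: triggers when the metric exceeds this
	Prompt    string  // hypothetical: instruction handed to the agent for llm rules
}

type Finding struct {
	Rule      string
	Triggered bool
}

// route evaluates calculation rules in code and returns llm rules
// untouched so the agent can reason over them. An unknown kind is an
// error rather than a silent default to either side.
func route(rules []Rule, metrics map[string]float64) ([]Finding, []Rule, error) {
	var findings []Finding
	var forAgent []Rule
	for _, r := range rules {
		switch r.Kind {
		case "calculation":
			findings = append(findings, Finding{Rule: r.Name, Triggered: metrics[r.Metric] > r.Threshold})
		case "llm":
			forAgent = append(forAgent, r)
		default:
			return nil, nil, fmt.Errorf("rule %q: unknown kind %q", r.Name, r.Kind)
		}
	}
	return findings, forAgent, nil
}

func main() {
	rules := []Rule{
		{Name: "overdue_invoices", Kind: "calculation", Metric: "overdue_count", Threshold: 5},
		{Name: "upset_customer", Kind: "llm", Prompt: "Does the latest message sound upset?"},
	}
	findings, forAgent, err := route(rules, map[string]float64{"overdue_count": 8})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(findings, len(forAgent))
}
```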

Fail fast, at boot

When a module registers, it health-checks its backend with a 10-second timeout. If a dependency it needs is unreachable, the server refuses to start rather than come up half-working and fail on the first tool call; a /health endpoint covers the rest. We'd rather a deploy fail loudly than have an agent discover mid-conversation that invoicing was never reachable.
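A sketch of the boot check using context deadlines from the standard library. The Module type, registerAll, and the healthy and hangs helpers are hypothetical; only the 10-second timeout and the refuse-to-start behaviour come from the text.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

const bootTimeout = 10 * time.Second

// Module is a hypothetical registration record: a name plus a health
// probe that must honour the context's deadline.
type Module struct {
	Name   string
	Health func(ctx context.Context) error
}

// registerAll probes every module in turn and returns the first
// failure, so the caller can refuse to start.
func registerAll(mods []Module, timeout time.Duration) error {
	for _, m := range mods {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := m.Health(ctx)
		cancel() // release the timer as soon as the probe returns
		if err != nil {
			return fmt.Errorf("module %s: health check failed: %w", m.Name, err)
		}
	}
	return nil
}

// healthy simulates a backend that answers immediately.
func healthy() func(context.Context) error {
	return func(context.Context) error { return nil }
}

// hangs simulates an unreachable backend: it blocks until the deadline.
func hangs() func(context.Context) error {
	return func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
}

func main() {
	mods := []Module{{Name: "crm", Health: healthy()}}
	if err := registerAll(mods, bootTimeout); err != nil {
		log.Fatalf("refusing to start: %v", err) // exits non-zero so the deploy fails loudly
	}
	log.Println("all modules healthy, serving")
}
```

Wrapping the cause with %w keeps the original error available to errors.Is, so the boot log can distinguish a timeout from a refused connection.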

Errors carry a category; Sentry groups by tool

Every error is an apperror with a category — validation, domain, infrastructure, security, system, network — and a code, mapped to a message the model can act on instead of a raw stack trace. Production errors go to Sentry tagged with the tool name and category, so "which tool got flaky this week" is one grouping, not a log grep. The honest gap: we don't yet time per-tool latency. If we rebuilt observability today, that's the first thing we'd add.
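A sketch of what a categorised error type could look like. The six category values come from the text; the AppError fields, the message format, and the reportTags helper are assumptions. The tags are returned as a plain map so that no particular error-reporting SDK call is implied.

```go
package main

import (
	"errors"
	"fmt"
)

type Category string

const (
	Validation     Category = "validation"
	Domain         Category = "domain"
	Infrastructure Category = "infrastructure"
	Security       Category = "security"
	System         Category = "system"
	Network        Category = "network"
)

type AppError struct {
	Category Category
	Code     string
	Message  string // safe to show to the model
	Err      error  // underlying cause, kept out of the model's view
}

func (e *AppError) Error() string { return fmt.Sprintf("%s/%s: %s", e.Category, e.Code, e.Message) }
func (e *AppError) Unwrap() error { return e.Err }

// classify finds an AppError anywhere in the chain, or wraps an
// unrecognised error as a generic system error.
func classify(err error) *AppError {
	var ae *AppError
	if errors.As(err, &ae) {
		return ae
	}
	return &AppError{Category: System, Code: "internal", Message: "unexpected server error", Err: err}
}

// toolMessage is what the model sees: category, code, and an
// actionable message, never the raw cause.
func toolMessage(err error) string {
	ae := classify(err)
	return fmt.Sprintf("[%s/%s] %s", ae.Category, ae.Code, ae.Message)
}

// reportTags builds the tags to attach to an error report so that
// failures can be grouped by tool and category.
func reportTags(tool string, err error) map[string]string {
	ae := classify(err)
	return map[string]string{"tool": tool, "category": string(ae.Category), "code": ae.Code}
}

// errOpaque stands in for a raw driver or network error.
var errOpaque = errors.New("dial tcp: connection refused")

func main() {
	wrapped := fmt.Errorf("create_invoice: %w",
		&AppError{Category: Validation, Code: "missing_client", Message: "client is required"})
	fmt.Println(toolMessage(wrapped))
	fmt.Println(toolMessage(errOpaque))
	fmt.Println(reportTags("create_invoice", wrapped))
}
```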

If you're taking one to production

Short list, in rough order of how much skipping each has cost us:

Own your JSON-RPC and pin the protocol version, or take an SDK on purpose — just decide, don't drift into it.
Serve every transport from one registry so behaviour can't fork between them.
In stdio, keep stdout clean — logs to stderr, framework banners off.
Filter the tool list per caller, by key scope and by role, and fail closed when you can't resolve them.
Gate writes behind a dry-run that returns a human-readable preview, and keep it stateless.
Split deterministic checks (code) from judgment (the model), and make which-is-which explicit.
Health-check dependencies at boot and refuse to start if one is down.
Give errors a category and tag them by tool so you can see what's actually breaking.

The point

A demo proves the protocol. Production is a pile of small, opinionated decisions — build versus buy, what the agent can see, who's allowed to write, what's allowed to fail the boot. None of them are in the quickstart, and all of them are the difference between a convincing demo and a server you'd trust with real data.

We design and run MCP servers as a service — the tools, the auth, the scoping, the deployment. Tell us your stack and we'll scope it.