AI Security

MCP Server Rate-Limiting Patterns

A practical look at rate-limiting patterns for Model Context Protocol servers, covering per-tool quotas, token budgets, burst control, and abuse-resistant designs.

Shadab Khan
Security Engineer
7 min read

When we first deployed a Model Context Protocol (MCP) server behind a production agent, we assumed our existing API gateway would handle traffic shaping. Within two weeks an LLM had looped a tool call three thousand times in nine minutes, exhausted an upstream quota, and triggered a paging incident that took the on-call through four systems before landing on the MCP layer. The gateway's flat "60 requests per minute per API key" rule was technically enforced the entire time. It simply did not model how agents actually use tools.

Rate limiting for MCP servers is not a gateway problem with a new name. It is a specialised control surface with its own abuse patterns, its own cost model, and its own notion of "a reasonable amount of work." This post walks through the patterns we have found durable, the ones that failed, and how to think about limits when the caller is a language model rather than a human clicking a button.

Why Traditional API Rate Limits Miss the Mark

Most API rate limits answer a simple question: "how many HTTP requests per second should this key be allowed to make?" That question presumes the requester is a deterministic client with a known call graph, a human in the loop, or both. MCP servers violate each assumption.

An agent calling an MCP server may fan out a single user prompt into dozens of tool invocations. Some of those calls are cheap metadata lookups that take two milliseconds and return a hundred bytes. Others trigger expensive downstream work: a vector search, a database scan, a third-party billed API, or a long-running build. Rate limiting by request count flattens this distribution and either starves fast tools or over-permits slow ones.

The second problem is that LLMs, unlike humans, have no instinct for "I have probably asked this enough times." Given an error response, they often retry with a slightly different argument, as if coaxing the tool will help. A rate limit that returns a generic 429 can look to an agent like guidance to rephrase the request, producing a storm of near-identical calls that all count against the same budget.

Per-Tool Quotas Over Per-Server Quotas

The first pattern we now apply by default: move the rate limit boundary from the MCP server as a whole to each tool it exposes. Tools are the natural unit of cost on an MCP server. A list_projects tool that hits a cache is not comparable to a run_deep_scan tool that spins up a worker for ninety seconds. Bucketing them together means the limit is always wrong for one of them.

Per-tool quotas let you express cost as a first-class idea. Cheap tools get generous limits — often high enough that they effectively do not rate-limit at all under normal use. Expensive tools get conservative limits with explicit backpressure. The tool schema is the obvious place to attach this metadata, alongside the input schema and description that the agent already reads.

We attach three values to every tool: a short-window burst quota, a sustained rate, and a concurrency cap. The burst quota handles legitimate parallel fan-out from a single agent turn. The sustained rate sets the long-run average we are willing to serve. The concurrency cap protects downstream systems from unbounded parallelism, which is especially important when the tool wraps a resource with its own connection pool.
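To make that concrete, here is a minimal sketch of a per-tool limiter: a token bucket whose capacity is the burst quota and whose refill rate is the sustained rate, wrapped with a semaphore for the concurrency cap. The class names and the numbers attached to list_projects and run_deep_scan are illustrative, not part of any MCP specification.

    import threading
    import time
    from dataclasses import dataclass


    @dataclass
    class ToolLimit:
        burst: int              # calls allowed in a short spike
        sustained_per_min: int  # long-run average calls per minute
        max_concurrency: int    # parallel executions allowed at once


    class ToolLimiter:
        """Token bucket (burst + sustained rate) plus a semaphore for the concurrency cap."""

        def __init__(self, limit: ToolLimit):
            self.limit = limit
            self.tokens = float(limit.burst)
            self.updated = time.monotonic()
            self.lock = threading.Lock()
            self.slots = threading.BoundedSemaphore(limit.max_concurrency)

        def try_acquire(self) -> bool:
            with self.lock:
                now = time.monotonic()
                # Refill at the sustained rate, capped at the burst size.
                self.tokens = min(
                    self.limit.burst,
                    self.tokens + (now - self.updated) * self.limit.sustained_per_min / 60.0,
                )
                self.updated = now
                if self.tokens < 1.0:
                    return False
                if not self.slots.acquire(blocking=False):
                    return False  # concurrency cap hit; the token is not spent
                self.tokens -= 1.0
                return True

        def release(self) -> None:
            self.slots.release()


    # Hypothetical sizing: a cached lookup gets a wide bucket, an expensive scan a narrow one.
    LIMITS = {
        "list_projects": ToolLimiter(ToolLimit(burst=60, sustained_per_min=600, max_concurrency=32)),
        "run_deep_scan": ToolLimiter(ToolLimit(burst=2, sustained_per_min=4, max_concurrency=1)),
    }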

Token Budgets as the Real Currency

Request counting is a proxy for cost. For MCP servers connected to LLM-backed features or metered upstream APIs, the actual currency is tokens or dollars, not requests. A pattern that has aged well is a dual-axis limiter: requests per minute as a coarse circuit breaker, and tokens or upstream-cost units per hour as the binding constraint.

Token budgets are particularly important when your MCP server wraps a retrieval-augmented generation pipeline, a summarisation tool, or any workflow that itself invokes an LLM. An agent that calls summarize_document on a hundred-page PDF ten times in a turn can burn through a weekly budget in under a minute. Counting those calls as ten requests dramatically understates the impact; counting them as several hundred thousand tokens consumed tells the truth.

Implementing this cleanly requires a post-call accounting hook. The tool runs, the actual cost is measured, and the limiter is debited by the observed usage rather than a pre-declared estimate. Pre-declared estimates are useful for admission control (rejecting requests that obviously would not fit in the remaining budget), but the ground truth has to be the measured cost.
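A minimal sketch of that flow, assuming the tool can report its own measured usage after it runs; the TokenBudget class, the hourly figure, and the call_tool wrapper are illustrative rather than a prescribed interface.

    import time


    class TokenBudget:
        """Sliding one-hour window: admit on an estimate, debit the measured cost."""

        def __init__(self, tokens_per_hour: int):
            self.tokens_per_hour = tokens_per_hour
            self.spend: list[tuple[float, int]] = []  # (timestamp, tokens) records

        def _used(self, now: float) -> int:
            cutoff = now - 3600
            self.spend = [(t, n) for t, n in self.spend if t > cutoff]
            return sum(n for _, n in self.spend)

        def admit(self, estimated_tokens: int) -> bool:
            # Admission control: reject work that obviously cannot fit in the remaining budget.
            return self._used(time.time()) + estimated_tokens <= self.tokens_per_hour

        def debit(self, measured_tokens: int) -> None:
            # Ground truth: charge what the tool actually consumed, not the estimate.
            self.spend.append((time.time(), measured_tokens))


    budget = TokenBudget(tokens_per_hour=500_000)  # hypothetical per-tenant budget

    def call_tool(run_tool, estimated_tokens: int):
        if not budget.admit(estimated_tokens):
            raise RuntimeError("token budget exhausted")  # surfaced as a 429 at the transport layer
        result, measured_tokens = run_tool()              # the tool reports its observed usage
        budget.debit(measured_tokens)
        return result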

Tenant-Aware Throttling and Noisy Neighbour Protection

Multi-tenant MCP deployments add another axis. A single tenant running an aggressive batch job should not be able to degrade latency for quieter tenants sharing the same server. The answer is a hierarchy of limits: per-identity (user or API key), per-tenant, and per-server. The most restrictive applicable limit wins.
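A sketch of the most-restrictive-wins check, assuming a simple sliding-window counter per scope; the request is only charged against the caller, tenant, and server windows once all three have headroom.

    import time
    from collections import deque


    class WindowLimit:
        """Sliding-window counter used for each scope in the hierarchy."""

        def __init__(self, max_calls: int, window_s: int = 60):
            self.max_calls = max_calls
            self.window_s = window_s
            self.calls: deque[float] = deque()

        def has_headroom(self, now: float) -> bool:
            while self.calls and self.calls[0] <= now - self.window_s:
                self.calls.popleft()
            return len(self.calls) < self.max_calls

        def record(self, now: float) -> None:
            self.calls.append(now)


    def admit(caller: WindowLimit, tenant: WindowLimit, server: WindowLimit) -> bool:
        """The most restrictive applicable limit wins: every scope must have headroom."""
        now = time.time()
        scopes = (caller, tenant, server)
        if not all(scope.has_headroom(now) for scope in scopes):
            return False
        for scope in scopes:
            scope.record(now)  # charge each level only when the request is actually admitted
        return True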

Per-tenant limits are the hardest to size correctly because tenant behaviour varies wildly. We have had the best results with soft limits that trigger a warning and telemetry event at one threshold, and hard limits that reject requests at a much higher threshold. The gap between the two gives you time to react before an abusive workload becomes an outage. Alerting on the soft-limit crossing is how we catch runaway agents before customers call support.
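In code, the soft/hard split can be as small as two thresholds and a telemetry hook. The numbers and the emit_warning callback below are placeholders, not recommendations.

    SOFT_LIMIT = 10_000  # hourly calls: warn and emit telemetry
    HARD_LIMIT = 50_000  # hourly calls: reject outright


    def check_tenant(tenant_id: str, calls_this_hour: int, emit_warning) -> bool:
        if calls_this_hour >= HARD_LIMIT:
            return False                      # hard limit: reject the request
        if calls_this_hour >= SOFT_LIMIT:
            emit_warning(                     # soft limit: serve it, but alert the humans
                "tenant over soft limit",
                tenant=tenant_id,
                calls=calls_this_hour,
            )
        return True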

A complementary pattern is the "fair share" queue. Rather than rejecting over-budget requests outright, a fair-share scheduler serves requests from each tenant in rotation, so a tenant burning budget does not monopolise worker threads. This works well for MCP tools whose latency is dominated by downstream calls and where a small queueing delay is acceptable.
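A sketch of that scheduler as a round-robin over per-tenant queues; a production version would also bound queue depth and age out stale requests, which this one omits.

    from collections import deque


    class FairShareQueue:
        """Round-robin over per-tenant queues so no single tenant monopolises workers."""

        def __init__(self):
            self.queues: dict[str, deque] = {}
            self.order: deque[str] = deque()

        def enqueue(self, tenant: str, request) -> None:
            if tenant not in self.queues:
                self.queues[tenant] = deque()
                self.order.append(tenant)
            self.queues[tenant].append(request)

        def dequeue(self):
            # Serve tenants in rotation, skipping any whose queue has drained.
            for _ in range(len(self.order)):
                tenant = self.order[0]
                self.order.rotate(-1)
                if self.queues[tenant]:
                    return tenant, self.queues[tenant].popleft()
            return None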

Burst Control and the Agent Retry Loop

An anti-pattern we see repeatedly: a server returns a 429 with no guidance, the agent reasons that the request failed, and it retries with a minor variation. Within ten seconds the agent has issued fifty calls that were all going to be rejected. The rate limiter did its job numerically and failed in practice.

The fix has two parts. First, every 429 response from an MCP server should include structured retry guidance. At minimum: a retry_after in seconds, a reason code that the agent's system prompt can instruct it to respect, and a hint about whether retrying with a different argument will help (it usually will not). Second, the agent framework on the calling side should treat rate-limit errors as terminal for the current turn unless the retry-after is very short. Agents that loop through a rate-limited tool produce the worst incidents we see.
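For illustration, the guidance can be a small, predictable payload; the field names below are our own convention, not something defined by MCP.

    def rejection_payload(retry_after_s: int, reason: str) -> dict:
        """Structured body attached to a 429 so the agent has something to reason over."""
        return {
            "error": "rate_limited",
            "retry_after": retry_after_s,              # seconds before a retry can succeed
            "reason": reason,                          # e.g. "tool_burst_exceeded", "tenant_token_budget"
            "retry_with_different_arguments": False,   # varying the input will not help
        }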

We also strongly recommend exponential backoff enforced server-side. If a caller hits the limit, subsequent rejections within a short window should carry progressively longer retry-after values. This creates a feedback signal even for poorly behaved clients and caps the damage they can do to shared resources.
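A sketch of that escalation, assuming rejections are tracked per caller identity; the base, cap, and reset-window values are placeholders.

    import time


    class EscalatingBackoff:
        """Serve progressively longer retry-after values to callers that keep hitting the limit."""

        def __init__(self, base_s: int = 1, cap_s: int = 300, reset_window_s: int = 600):
            self.base_s = base_s
            self.cap_s = cap_s
            self.reset_window_s = reset_window_s
            self.strikes: dict[str, tuple[int, float]] = {}  # caller -> (count, last rejection time)

        def retry_after(self, caller: str) -> int:
            now = time.time()
            count, last = self.strikes.get(caller, (0, 0.0))
            if now - last > self.reset_window_s:
                count = 0  # the caller has backed off long enough; start over
            count += 1
            self.strikes[caller] = (count, now)
            return min(self.cap_s, self.base_s * 2 ** (count - 1))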

Observability: Limiting Without Visibility Is Guessing

A rate limiter you cannot observe is a rate limiter you will size wrong. The telemetry we treat as non-negotiable: allow versus reject counters per tool, per tenant, and per caller identity; p50 and p99 latency per tool; token and dollar spend per tool per tenant per hour; and the distribution of retry-after values served. Without this data, tuning limits becomes a guess-and-pray exercise.

We also log every rejection with enough context to reconstruct the caller's intent. For MCP, this means the tool name, the input schema keys (not values, to avoid leaking data), the tenant, and an agent session identifier if present. When an abuse pattern appears, you want to answer "which agent, doing what, on whose behalf?" in under two minutes.
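A sketch of such a rejection record, logging argument keys rather than values; the field names are illustrative.

    import json
    import logging

    logger = logging.getLogger("mcp.ratelimit")


    def log_rejection(tool: str, arguments: dict, tenant: str, session_id: str | None = None) -> None:
        # Log the argument keys only, never the values, to avoid leaking tenant data.
        logger.warning(json.dumps({
            "event": "rate_limit_rejection",
            "tool": tool,
            "argument_keys": sorted(arguments.keys()),
            "tenant": tenant,
            "agent_session": session_id,
        }))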

How Safeguard Helps

Safeguard treats MCP servers as first-class assets in its agentic security model, applying policy-driven rate limits and token budgets per tool, per tenant, and per caller identity without custom infrastructure. The platform's guardrail engine enforces burst control, concurrency caps, and structured retry guidance out of the box, so rogue agents cannot loop a tool into an outage. Telemetry feeds into the Safeguard dashboard with allow-versus-reject breakdowns, abuse fingerprinting, and token spend trends per tool, turning rate limits from a static configuration into an observable control. When an abuse pattern is detected, Safeguard can automatically escalate to stricter limits or block a caller identity while preserving service for everyone else.
