Our LLM Prototype Worked Great. Then We Got Real Traffic.

The demo went perfectly. The CEO tried three queries, got great answers, and said "Ship it by Friday." That was Tuesday.
By Friday, we had a FastAPI endpoint wrapping an OpenAI call with zero caching, zero rate limiting, and a retry strategy of "cross fingers." It handled our internal beta of 30 users fine. Then marketing sent out the announcement email and 2,000 people showed up in the first hour.
P95 latency: 34 seconds. Error rate: 12%. And on Monday morning, I got an email from AWS with an $8,200 bill for four days of production traffic. The CFO was... not amused.
That was the beginning of a painful but educational eight weeks. This post is what I wish I'd known before that Friday deploy.
Related: cost optimization strategies, hallucination prevention, and RAG vs fine-tuning for architecture decisions.
The Gap Between Prototype and Production
Your prototype looks like this. Don't lie — mine did too:
```python
def process_request(user_input):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_input}]
    )
    return response.choices[0].message.content
```
No timeout. No retry. No caching. No cost tracking. No rate limiting. Every request goes to GPT-4 whether it needs to or not. This is fine for a demo. It's a disaster in production.
Here's what I've learned breaks first, in order:
- Cost — you have no idea how much this will cost until real users arrive
- Latency — GPT-4 is slow under load, and users don't wait 30 seconds
- Reliability — OpenAI's API has bad days, and your app crashes when it does
- Scaling — everything is sequential, nothing is cached, you can't handle traffic spikes
Let me walk through how I fix each one, in the order you should actually do it.
Fix #1: Caching (Do This First, Today)
The single highest-ROI change you can make: don't call the LLM for questions you've already answered.
We started with exact-match caching (Redis, one afternoon of work). Our cache hit rate: 12%. Disappointing. Users don't ask the exact same question twice — they rephrase.
Then we added semantic caching. "How do I reset my password?" and "I forgot my password" now return the same cached response. Cache hit rate jumped to 64%.
```python
import hashlib

import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, redis_client, similarity_threshold=0.95):
        self.redis = redis_client
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold

    def get(self, query, namespace="default"):
        """Return (response, similarity) for the closest cached query, or None."""
        query_embedding = self.encoder.encode(query)
        # Linear scan over cached embeddings; fine at small scale, swap in a
        # vector index once the cache grows.
        for key in self.redis.keys(f"cache:{namespace}:*"):
            cached = self.redis.hgetall(key)
            cached_embedding = np.frombuffer(cached[b'embedding'], dtype=np.float32)
            similarity = np.dot(query_embedding, cached_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity >= self.threshold:
                return cached[b'response'].decode('utf-8'), similarity
        return None

    def set(self, query, response, namespace="default", ttl=3600):
        embedding = self.encoder.encode(query)
        key = f"cache:{namespace}:{hashlib.sha256(query.encode()).hexdigest()[:16]}"
        self.redis.hset(key, mapping={
            'query': query,
            'response': response,
            'embedding': embedding.tobytes()
        })
        self.redis.expire(key, ttl)
```
The threshold matters a lot. We started at 0.90 and had the same problem I described in my cost optimization post — semantically similar queries with different intents getting the same response. 0.95 is my safe default.
Impact: API costs dropped 60% overnight. Latency for cache hits: <50ms vs 3-8 seconds for LLM calls.
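Wiring it in is the easy part. Here's a rough sketch of the request path with the cache in front of the model call; `call_llm` is a stand-in for whatever function actually hits OpenAI:

```python
import redis

# Sketch: check the semantic cache before paying for an LLM call.
# `call_llm(query)` is a placeholder for the real OpenAI call.
cache = SemanticCache(redis.Redis(), similarity_threshold=0.95)

def answer(query: str) -> str:
    hit = cache.get(query)
    if hit is not None:
        response, _similarity = hit
        return response              # cache hit: <50ms
    response = call_llm(query)       # cache miss: 3-8s
    cache.set(query, response)
    return response
```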
Fix #2: Rate Limiting (Before Someone Drains Your Budget)
After the $8K weekend, rate limiting became my second priority. Without it, a single user (or a bot, or a bug in a client app) can burn through your entire monthly budget in hours.
```python
import time


class RateLimiter:
    def __init__(self, redis_client):
        self.redis = redis_client

    def check_limit(self, key, limit, window, priority="normal"):
        """Fixed-window counter. Returns (allowed, seconds_until_window_resets)."""
        current = int(time.time())
        window_key = f"rate:{key}:{current // window}"
        pipe = self.redis.pipeline()
        pipe.incr(window_key)
        pipe.expire(window_key, window * 2)
        count = pipe.execute()[0]
        # Premium users get double the limit
        effective_limit = limit * 2 if priority == "premium" else limit
        if count > effective_limit:
            return False, window - (current % window)
        return True, None
```
We set three layers:
- Per-user: 100 requests/minute (prevents individual abuse)
- Per-IP: 200 requests/minute (catches bots)
- Global: 5,000 requests/minute (circuit breaker for the whole system)
The global limit saved us once when a partner's integration had a retry bug that sent 50,000 requests in 10 minutes. Without it, that would have been a very expensive morning.
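Stacked together, the three layers look roughly like this. `RateLimitExceeded` and the key names are mine; adapt them to however you identify users and clients:

```python
import redis

limiter = RateLimiter(redis.Redis())

class RateLimitExceeded(Exception):
    """Hypothetical exception; map it to an HTTP 429 in your framework."""

def check_all_limits(user_id: str, client_ip: str) -> None:
    # Reject on the first limit that trips.
    checks = [
        (f"user:{user_id}", 100, 60),   # per-user: 100 requests/minute
        (f"ip:{client_ip}", 200, 60),   # per-IP: 200 requests/minute
        ("global", 5000, 60),           # global: 5,000 requests/minute
    ]
    for key, limit, window in checks:
        allowed, retry_after = limiter.check_limit(key, limit, window)
        if not allowed:
            raise RateLimitExceeded(f"Rate limited on {key}, retry in {retry_after}s")
```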
Fix #3: Model Routing (Stop Using GPT-4 for Everything)
This was the insight from my cost optimization work, applied to our own system: not every request needs the most expensive model.
```python
import tiktoken


class ModelRouter:
    def __init__(self):
        # cl100k_base and the tier->model mapping are assumptions; adjust to your setup
        self.encoding = tiktoken.get_encoding("cl100k_base")
        self.models = {
            "fast": "gpt-3.5-turbo",
            "balanced": "gpt-4-turbo",
            "powerful": "gpt-4",
        }

    def select_model(self, prompt, user_tier="free"):
        token_count = len(self.encoding.encode(prompt))
        has_code = "```" in prompt or "def " in prompt
        is_analytical = any(w in prompt.lower() for w in ["analyze", "compare", "evaluate"])
        if token_count < 200 and not is_analytical:
            tier = "fast"       # GPT-3.5-turbo: $0.0015/1K tokens
        elif has_code or is_analytical:
            tier = "powerful"   # GPT-4: $0.03/1K tokens
        else:
            tier = "balanced"   # GPT-4-turbo: $0.01/1K tokens

        # Free users don't get GPT-4
        if user_tier == "free" and tier == "powerful":
            tier = "balanced"
        return self.models[tier]
```
Simple heuristic, but it knocked our average cost per request down by 67%. The majority of queries are simple enough for GPT-3.5-turbo. Users couldn't tell the difference for 80% of requests.
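In the request handler, routing is one extra line before the API call. A sketch, using the same (legacy) OpenAI client as the rest of this post:

```python
import openai

router = ModelRouter()

def handle(prompt: str, user_tier: str = "free") -> str:
    # Pick the cheapest model that can plausibly handle the request
    model = router.select_model(prompt, user_tier=user_tier)
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```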
Fix #4: Circuit Breakers (Because APIs Have Bad Days)
OpenAI's API went down on a Wednesday afternoon. Our system didn't handle it gracefully — it kept retrying, queuing up requests, and eventually crashed when memory ran out. Users saw nothing for 25 minutes.
After that, I added circuit breakers:
```python
from datetime import datetime, timedelta


class CircuitOpenError(Exception):
    """Raised while the circuit is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "CLOSED"  # CLOSED = normal, OPEN = rejecting, HALF_OPEN = testing
        self.last_failure = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._recovery_time_elapsed():
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Service unavailable, please retry later")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _recovery_time_elapsed(self):
        return (datetime.now() - self.last_failure) > timedelta(seconds=self.recovery_timeout)

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"
```
Combined with exponential backoff and jitter on retries:
```python
import asyncio
import random


async def call_llm_with_retry(prompt, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return circuit_breaker.call(make_llm_call, prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter so retries don't arrive in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
```
Recovery time dropped from 25 minutes to under 2 minutes. When the API is down, we now return cached responses where possible and a clean "temporarily unavailable" message otherwise, instead of hanging indefinitely.
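The fallback path is worth spelling out. A simplified sketch of how we fail over to the semantic cache from Fix #1 when the breaker is open:

```python
async def answer_with_fallback(prompt: str) -> str:
    try:
        return await call_llm_with_retry(prompt)
    except CircuitOpenError:
        # API is down: serve a cached answer if we have one, otherwise degrade cleanly
        hit = cache.get(prompt)
        if hit is not None:
            response, _similarity = hit
            return response
        return "The AI service is temporarily unavailable. Please try again in a few minutes."
```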
Fix #5: Monitoring That Actually Tells You Something
Console logs are not monitoring. I learned this when the CEO asked me "why is the AI slow today?" and I had no idea — no dashboards, no metrics, no alerts.
Now every LLM call gets instrumented:
```python
from prometheus_client import Counter, Histogram, Gauge

llm_requests = Counter('llm_requests_total', 'Total requests', ['model', 'status', 'cache_hit'])
llm_latency = Histogram('llm_request_duration_seconds', 'Request duration', ['model'],
                        buckets=[0.5, 1.0, 2.5, 5.0, 10.0, 30.0])
llm_cost = Counter('llm_cost_dollars_total', 'Cost in dollars', ['model', 'user_tier'])
llm_active = Gauge('llm_active_requests', 'Currently active requests')
```
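Defining metrics is the easy half; every call site has to record them. Roughly what that looks like around a single call (`call_llm` and `estimate_cost` are placeholders for your own call wrapper and cost math):

```python
import time

def instrumented_call(prompt: str, model: str, user_tier: str, cache_hit: bool) -> str:
    llm_active.inc()
    start = time.time()
    try:
        response = call_llm(prompt, model=model)
        llm_requests.labels(model=model, status="success", cache_hit=str(cache_hit).lower()).inc()
        llm_cost.labels(model=model, user_tier=user_tier).inc(estimate_cost(response))
        return response
    except Exception:
        llm_requests.labels(model=model, status="error", cache_hit="false").inc()
        raise
    finally:
        llm_latency.labels(model=model).observe(time.time() - start)
        llm_active.dec()
```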
And three alerts that have saved us multiple times:
```yaml
# Alert 1: Error rate spike
- alert: HighErrorRate
  expr: rate(llm_requests_total{status="error"}[5m]) > 0.05
  for: 2m
  labels: {severity: critical}

# Alert 2: Cost runaway
- alert: CostBudgetExceeded
  expr: sum(increase(llm_cost_dollars_total[24h])) > 1000
  labels: {severity: critical}

# Alert 3: Cache broke
- alert: LowCacheHitRate
  expr: rate(llm_requests_total{cache_hit="true"}[1h]) / rate(llm_requests_total[1h]) < 0.30
  for: 30m
  labels: {severity: warning}
```
The cache hit rate alert is especially valuable — if it drops, something changed (new query patterns, cache invalidation bug, TTL too short). Catching it early prevents cost surprises.
Fix #6: Streaming (The UX Game-Changer)
This isn't about infrastructure; it's about perception. A response that streams in over 3 seconds feels dramatically faster than a response that appears after 3 seconds of blank screen, even though the total time is identical.
```python
async def stream_response(prompt, max_tokens=2000, cost_limit=0.10):
    token_count = 0
    cost_per_token = 0.00006  # GPT-4 output tokens: $0.06/1K

    response = await openai.ChatCompletion.acreate(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True
    )
    async for chunk in response:
        # .get() because the first and last chunks carry no content
        content = chunk.choices[0].delta.get("content")
        if content:
            token_count += 1
            # Cost kill switch: stop generating once we pass the per-request budget
            if token_count * cost_per_token > cost_limit:
                yield "\n\n[Response truncated]"
                break
            yield content
```
That cost_limit parameter is a guardrail I add to every streaming endpoint. Without it, a runaway generation can produce 10,000 tokens before you notice. At $0.06/1K output tokens for GPT-4, that adds up.
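Wiring the generator into FastAPI (which is what we were already running) is mostly plumbing. A sketch; the route path and request model are illustrative:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    # stream_response is the generator above; tokens reach the client as they arrive
    return StreamingResponse(stream_response(req.prompt), media_type="text/plain")
```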
The Numbers: Before and After
After eight weeks of incremental improvements (not a big-bang rewrite — each fix was a few days of work):
| Metric | Week 1 (panic mode) | Week 8 (calm) | How |
|---|---|---|---|
| P95 Latency | 34s | 2.8s | Caching + model routing |
| Error Rate | 12% | 0.3% | Circuit breakers + retries |
| Cost per 1K requests | $12.40 | $4.10 | Caching + routing |
| Cache Hit Rate | 0% | 64% | Semantic caching |
| Monthly Cost | ~$32K (projected) | $10.5K | All of the above |
| Uptime | 97.2% | 99.8% | Circuit breakers + monitoring |
| Recovery Time | 25 min | 1.8 min | Circuit breakers |
None of these are exotic techniques. Rate limiting, caching, model routing, circuit breakers, monitoring, streaming. Standard backend engineering, applied to LLM-specific problems.
The Checklist I Use for Every New Deployment
Before calling anything "production-ready":
Must have (week 1):
- Rate limiting per user and globally
- Semantic caching with >50% hit rate
- Retry with exponential backoff
- Basic cost tracking per request
- Health checks and timeout limits
Should have (week 2-3):
- Model routing by complexity
- Circuit breakers on all external calls
- Prometheus/Grafana dashboards
- Cost and error rate alerts
- Streaming responses
Nice to have (month 2+):
- Blue-green deployment with canary
- Per-feature cost attribution
- Automated load testing in CI
- A/B testing infrastructure
What I'd Do Differently
Add monitoring before launch, not after. I spent the first week of production flying blind. The cost of adding Prometheus + three alerts on day one: 4 hours. The cost of not having it: an $8K bill and a lot of stress.
Don't skip load testing. We never tested beyond 50 concurrent users. Production peaked at 300. The difference between "works" and "works at scale" is everything.
Start with GPT-3.5-turbo, upgrade where needed. We launched with GPT-4 for everything because the quality was better in testing. But "better" for a simple FAQ question isn't worth 20x the cost. Route by complexity from day one.
Cache aggressively, but set TTLs thoughtfully. We had a caching incident where a product price changed but cached responses kept quoting the old price for 6 hours. Product-specific TTLs and event-driven invalidation solved it, but it shouldn't have happened.
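For what it's worth, the invalidation hook ended up being small. A sketch of the event-driven side, assuming product answers are cached under a per-product namespace like `product:<id>`:

```python
def on_product_updated(product_id: str, redis_client) -> None:
    # Drop every cached response in this product's namespace when the product changes
    pattern = f"cache:product:{product_id}:*"
    keys = list(redis_client.scan_iter(match=pattern))
    if keys:
        redis_client.delete(*keys)
```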
The gap between prototype and production isn't about complex distributed systems or Kubernetes orchestration. It's about six boring, well-understood patterns applied with care. Start there. The fancy stuff can wait.
Scaling your LLM app to production? I've done this enough times to know the landmines. Let's talk about your architecture before you get that first surprise bill.
