Our LLM Prototype Worked Great. Then We Got Real Traffic.

The demo went perfectly. The CEO tried three queries, got great answers, and said "Ship it by Friday." That was Tuesday.
By Friday, we had a FastAPI endpoint wrapping an OpenAI call with zero caching, zero rate limiting, and a retry strategy of "cross fingers." It handled our internal beta of 30 users fine. Then marketing sent out the announcement email and 2,000 people showed up in the first hour.
P95 latency: 34 seconds. Error rate: 12%. And on Monday morning, I got an email from AWS with an $8,200 bill for four days of production traffic. The CFO was... not amused.
That was the beginning of a painful but educational eight weeks. This post is what I wish I'd known before that Friday deploy.
Related: cost optimization strategies, hallucination prevention, and RAG vs fine-tuning for architecture decisions.
The Gap Between Prototype and Production
Your prototype looks like this. Don't lie — mine did too:
```python
def process_request(user_input):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_input}]
    )
    return response.choices[0].message.content
```
No timeout. No retry. No caching. No cost tracking. No rate limiting. Every request goes to GPT-4 whether it needs to or not. This is fine for a demo. It's a disaster in production.
Here's what I've learned breaks first, in order:
- Cost — you have no idea how much this will cost until real users arrive
- Latency — GPT-4 is slow under load, and users don't wait 30 seconds
- Reliability — OpenAI's API has bad days, and your app crashes when it does
- Scaling — everything is sequential, nothing is cached, you can't handle traffic spikes
Let me walk through how I fix each one, in the order you should actually do it.
Fix #1: Caching (Do This First, Today)
The single highest-ROI change you can make: don't call the LLM for questions you've already answered.
We started with exact-match caching (Redis, one afternoon of work). Our cache hit rate: 12%. Disappointing. Users don't ask the exact same question twice — they rephrase.
Then we added semantic caching. "How do I reset my password?" and "I forgot my password" now return the same cached response. Cache hit rate jumped to 64%.
```python
import hashlib

import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, redis_client, similarity_threshold=0.95):
        self.redis = redis_client
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold

    def get(self, query, namespace="default"):
        """Return (response, similarity) for the closest cached query, or None."""
        query_embedding = self.encoder.encode(query)
        # Linear scan over cached embeddings; fine at small scale, swap in a
        # vector index once the cache grows.
        for key in self.redis.keys(f"cache:{namespace}:*"):
            cached = self.redis.hgetall(key)
            cached_embedding = np.frombuffer(cached[b'embedding'], dtype=np.float32)
            similarity = np.dot(query_embedding, cached_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity >= self.threshold:
                return cached[b'response'].decode('utf-8'), similarity
        return None

    def set(self, query, response, namespace="default", ttl=3600):
        embedding = self.encoder.encode(query)
        key = f"cache:{namespace}:{hashlib.sha256(query.encode()).hexdigest()[:16]}"
        self.redis.hset(key, mapping={
            'query': query,
            'response': response,
            'embedding': embedding.tobytes()
        })
        self.redis.expire(key, ttl)
```
The threshold matters a lot. We started at 0.90 and had the same problem I described in my cost optimization post — semantically similar queries with different intents getting the same response. 0.95 is my safe default.
Impact: API costs dropped 60% overnight. Latency for cache hits: <50ms vs 3-8 seconds for LLM calls.
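Wiring it in is the easy part. Here's a rough sketch of the request path with the cache in front of the model call; `call_llm` is a stand-in for whatever function actually hits OpenAI:

```python
import redis

# Sketch: check the semantic cache before paying for an LLM call.
# `call_llm(query)` is a placeholder for the real OpenAI call.
cache = SemanticCache(redis.Redis(), similarity_threshold=0.95)

def answer(query: str) -> str:
    hit = cache.get(query)
    if hit is not None:
        response, _similarity = hit
        return response              # cache hit: <50ms
    response = call_llm(query)       # cache miss: 3-8s
    cache.set(query, response)
    return response
```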
Fix #2: Rate Limiting (Before Someone Drains Your Budget)
After the $8K weekend, rate limiting became my second priority. Without it, a single user (or a bot, or a bug in a client app) can burn through your entire monthly budget in hours.
```python
import time


class RateLimiter:
    def __init__(self, redis_client):
        self.redis = redis_client

    def check_limit(self, key, limit, window, priority="normal"):
        """Fixed-window counter. Returns (allowed, seconds_until_window_resets)."""
        current = int(time.time())
        window_key = f"rate:{key}:{current // window}"
        pipe = self.redis.pipeline()
        pipe.incr(window_key)
        pipe.expire(window_key, window * 2)
        count = pipe.execute()[0]
        # Premium users get double the limit
        effective_limit = limit * 2 if priority == "premium" else limit
        if count > effective_limit:
            return False, window - (current % window)
        return True, None
```
We set three layers:
- Per-user: 100 requests/minute (prevents individual abuse)
- Per-IP: 200 requests/minute (catches bots)
- Global: 5,000 requests/minute (circuit breaker for the whole system)
The global limit saved us once when a partner's integration had a retry bug that sent 50,000 requests in 10 minutes. Without it, that would have been a very expensive morning.
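Stacked together, the three layers look roughly like this. `RateLimitExceeded` and the key names are mine; adapt them to however you identify users and clients:

```python
import redis

limiter = RateLimiter(redis.Redis())

class RateLimitExceeded(Exception):
    """Hypothetical exception; map it to an HTTP 429 in your framework."""

def check_all_limits(user_id: str, client_ip: str) -> None:
    # Reject on the first limit that trips.
    checks = [
        (f"user:{user_id}", 100, 60),   # per-user: 100 requests/minute
        (f"ip:{client_ip}", 200, 60),   # per-IP: 200 requests/minute
        ("global", 5000, 60),           # global: 5,000 requests/minute
    ]
    for key, limit, window in checks:
        allowed, retry_after = limiter.check_limit(key, limit, window)
        if not allowed:
            raise RateLimitExceeded(f"Rate limited on {key}, retry in {retry_after}s")
```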
Fix #3: Model Routing (Stop Using GPT-4 for Everything)
This was the insight from my cost optimization work, applied to our own system: not every request needs the most expensive model.
```python
import tiktoken


class ModelRouter:
    def __init__(self):
        # cl100k_base and the tier->model mapping are assumptions; adjust to your setup
        self.encoding = tiktoken.get_encoding("cl100k_base")
        self.models = {
            "fast": "gpt-3.5-turbo",
            "balanced": "gpt-4-turbo",
            "powerful": "gpt-4",
        }

    def select_model(self, prompt, user_tier="free"):
        token_count = len(self.encoding.encode(prompt))
        has_code = "```" in prompt or "def " in prompt
        is_analytical = any(w in prompt.lower() for w in ["analyze", "compare", "evaluate"])
        if token_count < 200 and not is_analytical:
            tier = "fast"       # GPT-3.5-turbo: $0.0015/1K tokens
        elif has_code or is_analytical:
            tier = "powerful"   # GPT-4: $0.03/1K tokens
        else:
            tier = "balanced"   # GPT-4-turbo: $0.01/1K tokens

        # Free users don't get GPT-4
        if user_tier == "free" and tier == "powerful":
            tier = "balanced"
        return self.models[tier]
```
Simple heuristic, but it knocked our average cost per request down by 67%. The majority of queries are simple enough for GPT-3.5-turbo. Users couldn't tell the difference for 80% of requests.
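In the request handler, routing is one extra line before the API call. A sketch, using the same (legacy) OpenAI client as the rest of this post:

```python
import openai

router = ModelRouter()

def handle(prompt: str, user_tier: str = "free") -> str:
    # Pick the cheapest model that can plausibly handle the request
    model = router.select_model(prompt, user_tier=user_tier)
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```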
Fix #4: Circuit Breakers (Because APIs Have Bad Days)
OpenAI's API went down on a Wednesday afternoon. Our system didn't handle it gracefully — it kept retrying, queuing up requests, and eventually crashed when memory ran out. Users saw nothing for 25 minutes.
After that, I added circuit breakers:
```python
from datetime import datetime, timedelta


class CircuitOpenError(Exception):
    """Raised while the circuit is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "CLOSED"  # CLOSED = normal, OPEN = rejecting, HALF_OPEN = testing
        self.last_failure = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._recovery_time_elapsed():
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Service unavailable, please retry later")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _recovery_time_elapsed(self):
        return (datetime.now() - self.last_failure) > timedelta(seconds=self.recovery_timeout)

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"
```
Combined with exponential backoff and jitter on retries:
```python
import asyncio
import random


async def call_llm_with_retry(prompt, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return circuit_breaker.call(make_llm_call, prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter so retries don't arrive in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
```
Recovery time dropped from 25 minutes to under 2 minutes. When the API is down, we now return cached responses where possible and a clean "temporarily unavailable" message otherwise, instead of hanging indefinitely.
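The fallback path is worth spelling out. A simplified sketch of how we fail over to the semantic cache from Fix #1 when the breaker is open:

```python
async def answer_with_fallback(prompt: str) -> str:
    try:
        return await call_llm_with_retry(prompt)
    except CircuitOpenError:
        # API is down: serve a cached answer if we have one, otherwise degrade cleanly
        hit = cache.get(prompt)
        if hit is not None:
            response, _similarity = hit
            return response
        return "The AI service is temporarily unavailable. Please try again in a few minutes."
```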
Fix #5: Monitoring That Actually Tells You Something
Console logs are not monitoring. I learned this when the CEO asked me "why is the AI slow today?" and I had no idea — no dashboards, no metrics, no alerts.
Now every LLM call gets instrumented:
```python
from prometheus_client import Counter, Histogram, Gauge

llm_requests = Counter('llm_requests_total', 'Total requests', ['model', 'status', 'cache_hit'])
llm_latency = Histogram('llm_request_duration_seconds', 'Request duration', ['model'],
                        buckets=[0.5, 1.0, 2.5, 5.0, 10.0, 30.0])
llm_cost = Counter('llm_cost_dollars_total', 'Cost in dollars', ['model', 'user_tier'])
llm_active = Gauge('llm_active_requests', 'Currently active requests')
```
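Defining metrics is the easy half; every call site has to record them. Roughly what that looks like around a single call (`call_llm` and `estimate_cost` are placeholders for your own call wrapper and cost math):

```python
import time

def instrumented_call(prompt: str, model: str, user_tier: str, cache_hit: bool) -> str:
    llm_active.inc()
    start = time.time()
    try:
        response = call_llm(prompt, model=model)
        llm_requests.labels(model=model, status="success", cache_hit=str(cache_hit).lower()).inc()
        llm_cost.labels(model=model, user_tier=user_tier).inc(estimate_cost(response))
        return response
    except Exception:
        llm_requests.labels(model=model, status="error", cache_hit="false").inc()
        raise
    finally:
        llm_latency.labels(model=model).observe(time.time() - start)
        llm_active.dec()
```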
And three alerts that have saved us multiple times:
```yaml
# Alert 1: Error rate spike
- alert: HighErrorRate
  expr: rate(llm_requests_total{status="error"}[5m]) > 0.05
  for: 2m
  labels: {severity: critical}

# Alert 2: Cost runaway
- alert: CostBudgetExceeded
  expr: sum(increase(llm_cost_dollars_total[24h])) > 1000
  labels: {severity: critical}

# Alert 3: Cache broke
- alert: LowCacheHitRate
  expr: rate(llm_requests_total{cache_hit="true"}[1h]) / rate(llm_requests_total[1h]) < 0.30
  for: 30m
  labels: {severity: warning}
```
The cache hit rate alert is especially valuable — if it drops, something changed (new query patterns, cache invalidation bug, TTL too short). Catching it early prevents cost surprises.
Fix #6: Streaming (The UX Game-Changer)
This isn't about infrastructure; it's about perception. A response that streams in over 3 seconds feels dramatically faster than a response that appears after 3 seconds of blank screen, even though the total time is identical.
```python
async def stream_response(prompt, max_tokens=2000, cost_limit=0.10):
    token_count = 0
    cost_per_token = 0.00006  # GPT-4 output tokens: $0.06/1K

    response = await openai.ChatCompletion.acreate(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True
    )
    async for chunk in response:
        # .get() because the first and last chunks carry no content
        content = chunk.choices[0].delta.get("content")
        if content:
            token_count += 1
            # Cost kill switch: stop generating once we pass the per-request budget
            if token_count * cost_per_token > cost_limit:
                yield "\n\n[Response truncated]"
                break
            yield content
```
That cost_limit parameter is a guardrail I add to every streaming endpoint. Without it, a runaway generation can produce 10,000 tokens before you notice. At $0.06/1K output tokens for GPT-4, that adds up.
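Wiring the generator into FastAPI (which is what we were already running) is mostly plumbing. A sketch; the route path and request model are illustrative:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    # stream_response is the generator above; tokens reach the client as they arrive
    return StreamingResponse(stream_response(req.prompt), media_type="text/plain")
```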
The Numbers: Before and After
After eight weeks of incremental improvements (not a big-bang rewrite — each fix was a few days of work):
| Metric | Week 1 (panic mode) | Week 8 (calm) | How |
|---|---|---|---|
| P95 Latency | 34s | 2.8s | Caching + model routing |
| Error Rate | 12% | 0.3% | Circuit breakers + retries |
| Cost per 1K requests | $12.40 | $4.10 | Caching + routing |
| Cache Hit Rate | 0% | 64% | Semantic caching |
| Monthly Cost | ~$32K (projected) | $10.5K | All of the above |
| Uptime | 97.2% | 99.8% | Circuit breakers + monitoring |
| Recovery Time | 25 min | 1.8 min | Circuit breakers |
None of these are exotic techniques. Rate limiting, caching, model routing, circuit breakers, monitoring, streaming. Standard backend engineering, applied to LLM-specific problems.
The Checklist I Use for Every New Deployment
Before calling anything "production-ready":
Must have (week 1):
- Rate limiting per user and globally
- Semantic caching with >50% hit rate
- Retry with exponential backoff
- Basic cost tracking per request
- Health checks and timeout limits
Should have (week 2-3):
- Model routing by complexity
- Circuit breakers on all external calls
- Prometheus/Grafana dashboards
- Cost and error rate alerts
- Streaming responses
Nice to have (month 2+):
- Blue-green deployment with canary
- Per-feature cost attribution
- Automated load testing in CI
- A/B testing infrastructure
What I'd Do Differently
Add monitoring before launch, not after. I spent the first week of production flying blind. The cost of adding Prometheus + three alerts on day one: 4 hours. The cost of not having it: an $8K bill and a lot of stress.
Don't skip load testing. We never tested beyond 50 concurrent users. Production peaked at 300. The difference between "works" and "works at scale" is everything.
Start with GPT-3.5-turbo, upgrade where needed. We launched with GPT-4 for everything because the quality was better in testing. But "better" for a simple FAQ question isn't worth 20x the cost. Route by complexity from day one.
Cache aggressively, but set TTLs thoughtfully. We had a caching incident where a product price changed but cached responses kept quoting the old price for 6 hours. Product-specific TTLs and event-driven invalidation solved it, but it shouldn't have happened.
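For what it's worth, the invalidation hook ended up being small. A sketch of the event-driven side, assuming product answers are cached under a per-product namespace like `product:<id>`:

```python
def on_product_updated(product_id: str, redis_client) -> None:
    # Drop every cached response in this product's namespace when the product changes
    pattern = f"cache:product:{product_id}:*"
    keys = list(redis_client.scan_iter(match=pattern))
    if keys:
        redis_client.delete(*keys)
```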
The gap between prototype and production isn't about complex distributed systems or Kubernetes orchestration. It's about six boring, well-understood patterns applied with care. Start there. The fancy stuff can wait.
Scaling your LLM app to production? I've done this enough times to know the landmines. Let's talk about your architecture before you get that first surprise bill.
