We Cut AI Costs by 73% — Here's What Actually Worked (and What Didn't)

I still remember the Slack message. Our client's CTO — let's call him Marek — sent a screenshot of their AWS bill at 11 PM on a Tuesday. "$45,000. For one month. We haven't even launched the enterprise tier yet."
That was a fintech company, about 120 people, building a customer support platform powered by LLMs. They were doing well — growing fast, customers loved the product. But every new customer made the economics worse, not better. The AI bill was growing faster than revenue. Marek's exact words: "At this rate, we'll be profitable right around... never."
I've been consulting on AI infrastructure for years, but this project taught me more about cost optimization than any other. Not because the solutions were revolutionary — most of them are well-known techniques. But because the sequence matters, the trade-offs are real, and the stuff that looks easy on a blog post (ironically, like this one) has a way of biting you at 2 AM.
For related reading: our guides on the hidden costs of AI implementation and building reliable LLM systems, plus the LLM production scaling guide.
The $45K Problem: What Were They Actually Paying For?
The first thing I did was ask for a detailed breakdown. Not the AWS bill — that's useless for diagnosis. I wanted to know which features were calling which models and how often.
It took us two days just to instrument this. They didn't have it. (Most companies don't. If you're reading this and you don't know exactly where your AI dollars go — fix that first. Everything else is guesswork.)
Here's what we found:
- GPT-4 API calls: $28,500 (63% of total — this is the big one)
- GPT-3.5-turbo API calls: $12,800
- Embedding generation: $2,400
- Infrastructure (GPU instances): $1,300
- Total: $45,000/month
And the request volumes:
- Query classification and routing: 1.2M requests/month
- Automated response generation: 850K requests/month
- Sentiment analysis for escalation: 1.5M requests/month
- Document summarization: 200K requests/month
Stare at those numbers for a moment. 1.5 million sentiment analysis requests per month going through GPT-4. That's like hiring a brain surgeon to check if someone has a headache. Sentiment analysis is a solved problem — you can do it with a fine-tuned DistilBERT that costs essentially nothing to run.
This was our first clue that the problem wasn't "AI is expensive." The problem was "we're using a Ferrari to go grocery shopping."
The Spreadsheet That Changed Everything
Before writing a single line of code, I made a spreadsheet. Three columns: task, current model, minimum viable model. Nothing fancy, but Marek later said it was the single most valuable thing we delivered.
Here's what it showed:
| Task (% of traffic) | Using | Actually Needs |
|---|---|---|
| Simple classification (40%) | GPT-4 | Fine-tuned small model or even regex |
| Response generation (35%) | GPT-4 | GPT-3.5-turbo (or GPT-4 for edge cases) |
| Sentiment/escalation (20%) | GPT-4 | DistilBERT or similar |
| Complex analysis (5%) | GPT-4 | GPT-4 (legit) |
Only 5% of their requests actually needed GPT-4's reasoning capabilities. The other 95% were paying a premium for intelligence they weren't using.
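If it helps to see the spreadsheet's logic as code, here's roughly what the eventual routing decision boiled down to. This is a simplified sketch: the task names, model labels, and the lookup itself are illustrative, not the client's actual configuration.

```python
# Hypothetical routing table distilled from the task-vs-model audit.
# Task names and model choices are illustrative, not the client's config.
ROUTING_TABLE = {
    "simple_classification": "self_hosted_small",     # ~40% of traffic
    "response_generation": "gpt-3.5-turbo",           # ~35%
    "sentiment_escalation": "distilbert_sentiment",   # ~20%
    "complex_analysis": "gpt-4",                       # ~5%
}

def pick_model(task: str) -> str:
    """Return the cheapest model that is good enough for the task."""
    # Unknown tasks default to the safe (and expensive) option.
    return ROUTING_TABLE.get(task, "gpt-4")
```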

What We Did (In Order, Because Order Matters)
I could structure this as "Phase 1, Phase 2, Phase 3" but honestly, real projects don't work that way. We started with the easiest, lowest-risk changes and escalated from there. Some things we planned for week 4 got pulled forward because they were easier than expected. One thing we planned for week 2 took until week 5 because of a bug I'll tell you about.
First: Caching (the obvious one that nobody does)
I asked Marek: "Do you cache any LLM responses?"
Long pause. "No."
We analyzed their query logs. 23% of queries were exact duplicates. The same customer asks the same question three times because the chat UI doesn't show a loading indicator fast enough. Same query, same answer, three API calls to GPT-4.
A Redis exact-match cache took us three days to implement. Just hash the query, check Redis, done. Immediate 23% reduction in API calls, zero quality impact.
```javascript
const getCachedResponse = async (query, context) => {
  // Tier 1: Exact match (Redis) — this alone saved 23% of API calls
  const exactMatch = await redis.get(hashQuery(query));
  if (exactMatch) return { response: exactMatch, source: 'exact-cache' };

  // Tier 2: Semantic similarity (Vector DB)
  const embedding = await generateEmbedding(query);
  const similar = await vectorDB.search(embedding, { threshold: 0.95, limit: 1 });
  if (similar.length > 0) {
    await redis.set(hashQuery(query), similar[0].response, 'EX', 3600);
    return { response: similar[0].response, source: 'semantic-cache' };
  }

  // Tier 3: Intent-based templates
  const intent = await classifyIntent(query);
  if (intent.confidence > 0.9 && intentTemplates[intent.label]) {
    const response = await fillTemplate(intentTemplates[intent.label], context);
    return { response, source: 'template-cache' };
  }

  // Miss — call the LLM and cache for next time
  const response = await callLLM(query, context);
  await cacheResponse(query, embedding, response);
  return { response, source: 'llm' };
};
```
Then we added semantic caching — using a vector database (Pinecone) to find queries that aren't identical but mean the same thing. "How do I reset my password?" and "I forgot my password, help" should return the same cached response.
This is where our first real mistake happened.
The Cache Bug That Cost Us Three Days
We set the semantic similarity threshold at 0.90. Seemed reasonable. Tested it on a few hundred queries. Looked great.
Then we deployed it, and within an hour, we started getting customer complaints. The cache was matching "I want to cancel my subscription" with "I want to upgrade my subscription" (cosine similarity: 0.91). Customers asking to cancel were getting enthusiastic responses about premium features.
Not our finest moment.
We cranked the threshold up to 0.95, added category-aware caching (only match queries within the same intent category), and ran a week-long shadow test before turning it back on. Lesson learned: semantic similarity is not semantic understanding. Two sentences can be very similar in embedding space while meaning opposite things.
After fixing the threshold and adding intent guards, semantic caching worked beautifully. Combined with exact match, we hit a 64% overall cache rate.
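The intent guard is conceptually simple: only accept a semantic-cache hit if the cached entry was classified into the same intent category as the incoming query. Here's a minimal Python sketch of the idea; the metadata-filtered vector search and the helper names are assumptions for illustration, not the client's actual code (which lived in the Node service above).

```python
# Minimal sketch of intent-guarded semantic caching (hypothetical helpers).
SIMILARITY_THRESHOLD = 0.95  # raised from 0.90 after the cancel/upgrade incident

def semantic_cache_lookup(query: str):
    intent = classify_intent(query)        # e.g. "cancel", "upgrade", "password_reset"
    embedding = generate_embedding(query)

    # Only search cached entries tagged with the same intent category,
    # so "cancel my subscription" can never match "upgrade my subscription".
    hits = vector_db.search(
        embedding,
        top_k=1,
        filter={"intent": intent.label},
    )
    if hits and hits[0].score >= SIMILARITY_THRESHOLD:
        return hits[0].metadata["response"]
    return None  # cache miss: fall through to the model router
```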
Second: Model Right-Sizing
This was the boring, unglamorous part that saved the most money.
We migrated classification tasks from GPT-4 to GPT-3.5-turbo. A/B tested it with 5% of traffic for a week. Quality score went from 99.3% to 99.1%. The team debated for two days whether 0.2% mattered. Marek ended the debate: "That 0.2% costs us $24,000 a month. Ship it."
For sentiment analysis, we trained a DistilBERT model on 50K labeled examples from their historical data. Accuracy: 94.7%. Not as good as GPT-4's ~97%, but honestly? For determining if a customer is angry enough to escalate to a human agent, 94.7% is more than enough. The cost? Essentially zero — it runs on a single CPU.
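If you've never deployed one of these, the serving side really is this small. A sketch using the Hugging Face transformers pipeline; the model path, label name, and escalation threshold are placeholders standing in for the client's fine-tuned checkpoint.

```python
from transformers import pipeline

# Placeholder path; in practice this points at the checkpoint fine-tuned
# on ~50K labeled support messages.
sentiment = pipeline(
    "text-classification",
    model="./distilbert-support-sentiment",
    device=-1,  # CPU is plenty for a model this size
)

def should_escalate(message: str, threshold: float = 0.8) -> bool:
    result = sentiment(message[:512])[0]  # rough truncation for very long messages
    # "ANGRY" is an illustrative label from the hypothetical fine-tune.
    return result["label"] == "ANGRY" and result["score"] >= threshold
```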
Third: Self-Hosting (Proceed With Caution)
This is the part where I have to be honest: self-hosting is not for everyone, and it was almost not for this client either.
We deployed quantized Mistral 7B and Llama 2 7B models for the remaining simple-to-moderate tasks. The setup:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization via bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=quantization_config,
    device_map="auto",
)
# 13GB model → 3.5GB, runs at 45 tokens/sec on a single A10 GPU
```
Cost per request dropped from $0.0015 (GPT-3.5-turbo) to $0.0003 (self-hosted Mistral). But here's what blog posts about self-hosting never mention:
- Cold start is real. Our model had a 2-3 second cold start. For a customer support chat, that's unacceptable. We had to implement predictive pre-warming based on traffic patterns. That took an extra week.
- You need DevOps. Someone has to maintain the GPU instances, handle scaling, deal with out-of-memory errors at 3 AM. Budget 20-40 hours/month of engineering time.
- Auto-scaling is tricky. We set it up with 1-4 A10 instances. The scaling lag meant that during traffic spikes, some requests waited 5+ seconds. We eventually added a fallback: if self-hosted queue depth exceeds a threshold, route to the GPT-3.5-turbo API instead (sketched below).
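The fallback logic is nothing clever, which is exactly why it works. A rough Python sketch; the queue-depth check, the client objects, and the timeout are placeholders, since the real service tracked in-flight requests in its own metrics layer.

```python
MAX_QUEUE_DEPTH = 20  # illustrative threshold, tuned per instance count

async def generate_response(prompt: str) -> str:
    # Hypothetical helpers: queue_depth() reports in-flight requests on the
    # self-hosted fleet; the clients wrap the Mistral server and the OpenAI API.
    if queue_depth() <= MAX_QUEUE_DEPTH:
        try:
            return await self_hosted_client.generate(prompt, timeout=5.0)
        except TimeoutError:
            pass  # fall through to the managed API rather than make the user wait
    return await openai_client.generate(prompt, model="gpt-3.5-turbo")
```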
Was it worth it? For this client, yes — they had the volume (1M+ requests/month) and the engineering team to maintain it. For a startup processing 100K requests/month? Probably not. Stick with API calls and focus on caching.

The Numbers: Before and After
After about 8 weeks (originally planned for 6, but, you know... the cache bug, a GPU driver issue, and Marek's two-week vacation in the middle), here's where we ended up:
Monthly Costs — After:
- Self-hosted models (Mistral/Llama): $2,100
- GPT-3.5-turbo API calls: $4,200
- GPT-4 API calls (complex only): $3,800
- Embedding generation: $800
- Infrastructure (GPU + caching): $1,100
- Total: $12,000/month
Monthly savings: $33,000 (73%)
Annual savings: $396,000
| Category | Before | After | Savings |
|---|---|---|---|
| GPT-4 API | $28,500 | $3,800 | 87% |
| GPT-3.5-turbo API | $12,800 | $4,200 | 67% |
| Self-hosted models | $0 | $2,100 | (new cost) |
| Embeddings | $2,400 | $800 | 67% |
| Infrastructure | $1,300 | $1,100 | 15% |
| Total | $45,000 | $12,000 | 73% |
Where the requests go now:
- Cache hits (no LLM call): 64%
- Self-hosted models: 24%
- GPT-3.5-turbo: 9%
- GPT-4: 3%
And the part I'm most proud of — quality actually improved:
| Metric | Before | After |
|---|---|---|
| Quality Score | 99.3% | 99.5% |
| User Satisfaction | 4.6/5.0 | 4.7/5.0 |
| Average Latency | 2.3s | 2.0s |
| P95 Latency | 4.1s | 3.2s |
Why did quality go up? Because the fine-tuned DistilBERT was actually better at sentiment analysis than GPT-4 for this domain. It was trained on their specific data — their customers' language, their edge cases. A specialized tool outperformed a general-purpose genius. There's a life lesson in there somewhere.
What I'd Do Differently Next Time
Start with the audit. We spent two days instrumenting their system to figure out where the money was going. Next time, I'd insist on proper cost tagging from day one. If you can't answer "how much does feature X cost per month?" within 30 seconds, you're flying blind.
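Concretely, "proper cost tagging" just means writing one log row per model call with enough metadata to group by feature later. Here's a hedged sketch of what that write might look like; the per-1K-token prices are illustrative (check your provider's current pricing), and `db` is the same placeholder handle the audit snippet at the end of this post reads from.

```python
from datetime import datetime, timezone

# Illustrative per-1K-token prices; check your provider's current pricing.
PRICE_PER_1K = {"gpt-4": 0.03, "gpt-3.5-turbo": 0.0015}

async def log_llm_call(endpoint: str, model: str,
                       prompt_tokens: int, completion_tokens: int) -> None:
    tokens = prompt_tokens + completion_tokens
    cost = tokens / 1000 * PRICE_PER_1K.get(model, 0.0)
    # Same hypothetical ai_logs table that the audit query at the end reads from.
    await db.execute(
        "INSERT INTO ai_logs (timestamp, endpoint, model, tokens, cost) "
        "VALUES ($1, $2, $3, $4, $5)",
        datetime.now(timezone.utc), endpoint, model, tokens, cost,
    )
```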
Don't underestimate cache invalidation. We set cache TTLs between 1-24 hours depending on query type. But product info changes — prices, availability, policies. We had a situation where a product was discontinued but the cache kept serving responses referencing it for 6 hours. Now we have event-driven cache busting for content changes.
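Event-driven cache busting, in its simplest form: when content changes, delete every cached response that referenced it. A sketch under the assumption that each cached response is indexed by the product or policy IDs it mentions; the event shape and key scheme are placeholders, not the client's actual implementation.

```python
# Hypothetical scheme: each cached response key is also added to a Redis set
# like "product:123:cache_keys", so a content update can purge them all.
async def on_content_updated(event: dict) -> None:
    content_id = event["content_id"]          # e.g. "product:123"
    index_key = f"{content_id}:cache_keys"
    stale_keys = await redis.smembers(index_key)
    if stale_keys:
        await redis.delete(*stale_keys)       # bust every response that cited it
    await redis.delete(index_key)
```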
Progressive rollout, always. Our pattern: 10% traffic → validate for a week → 30% → validate → 60% → full rollout. It's slower, but it catches problems when they're small. The cancel/upgrade cache mix-up? Would have been a disaster at 100% traffic. At 10%, it was a funny Slack thread.
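Our rollout gate was deliberately dumb: deterministic bucketing on a stable ID, so a given customer always lands on the same side of the split. A sketch of the generic approach, not the client's exact code.

```python
import hashlib

def in_rollout(customer_id: str, percent: int) -> bool:
    """Deterministically bucket customers so the same user always gets the same path."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent

# Week 1: in_rollout(cid, 10); then 30, 60, 100 as confidence grows.
```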
Quality measurement is hard. "99.5% quality" sounds precise, but what does it actually mean? We used a combination of semantic similarity to baseline responses, factual consistency checks, and A/B-tested user satisfaction scores. None of these are perfect. The honest answer is that quality measurement in LLM systems is still more art than science.
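For what it's worth, the semantic-similarity check was the most mechanical of the three: embed the new pipeline's answer and the baseline answer for the same query and compare. A sketch with a generic embedding helper; the 0.9 pass threshold is illustrative, not a number you should adopt blindly.

```python
import numpy as np

def similarity_to_baseline(new_answer: str, baseline_answer: str) -> float:
    # generate_embedding is the same hypothetical helper used elsewhere in this post.
    a = np.array(generate_embedding(new_answer))
    b = np.array(generate_embedding(baseline_answer))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_quality_gate(new_answer: str, baseline_answer: str,
                        threshold: float = 0.9) -> bool:
    return similarity_to_baseline(new_answer, baseline_answer) >= threshold
```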
Should You Do This?
Here's my honest framework:
- Spending <$2K/month on AI: Don't bother with most of this. Add basic Redis caching (one day of work) and right-size your model selection. That's it.
- $2K-$10K/month: Caching + model right-sizing will get you 40-60% savings with minimal risk. Skip self-hosting.
- $10K-$50K/month: Full optimization makes sense. You'll save enough to justify the engineering investment. Consider self-hosting if you have the DevOps capacity.
- $50K+/month: You should have done this yesterday. Call me.
The one thing I'd tell every team, regardless of spend: know where your money goes. Build that spreadsheet. You'll be surprised.
```python
# Start here — figure out where the money goes
async def audit_ai_costs():
    requests = await db.query("""
        SELECT
            endpoint,
            model,
            AVG(tokens) as avg_tokens,
            COUNT(*) as request_count,
            SUM(cost) as total_cost
        FROM ai_logs
        WHERE timestamp > NOW() - INTERVAL '30 days'
        GROUP BY endpoint, model
        ORDER BY total_cost DESC
    """)
    for req in requests:
        complexity = analyze_complexity(req)
        potential_savings = calculate_savings(req, complexity)
        print(f"{req.endpoint}: ${potential_savings}/month potential")
```
The Architecture Now
For those who want the technical picture:
```
User Request → API Gateway → Request Classifier
                      ↓
        ┌─────────────┼─────────────┐
        ↓             ↓             ↓
   Exact Cache  Semantic Cache  Template Cache
        └──── any cache hit: return ┘
                      ↓ cache miss
              Complexity Analyzer
        ┌─────────────┼─────────────┐
        ↓             ↓             ↓
   Self-hosted   GPT-3.5-turbo    GPT-4
  (Simple: 24%) (Moderate: 9%) (Complex: 3%)
        ↓             ↓             ↓
           Cache & Return Results
```
Six months later, Marek messaged me again. This time it wasn't a panicked screenshot of a bill. It was: "Just crossed 3x our pre-optimization traffic. AI costs went from $12K to $14K. Thanks for building something that actually scales."
That felt good.
Dealing with growing AI costs? I've helped teams cut 40-70% without sacrificing quality. Let's talk — the audit alone is usually an eye-opener.
