Building AI Agents That Actually Work (Not Just Demo Well)

Let me tell you about the agent that almost got me fired.
It was a customer support agent. Simple task: look up orders, check refund eligibility, process refunds, send confirmation emails. The demo went perfectly — the CEO watched it handle three test cases flawlessly. "Ship it," he said.
Two hours into production, I got a call. The agent had processed 47 refunds in a row. Every single one approved. Including a $2,200 order from someone who had received their items six months ago and just wanted to see if they'd get lucky. The agent didn't just approve it — it sent a cheerful email saying "We're happy to process your refund! We value your satisfaction."
Why? Because the refund eligibility tool returned a generic error (the database was briefly unreachable), and the agent — lacking a check for that case — interpreted the empty response as "no restrictions found" and proceeded with the refund. Then, with the confirmation tool, it entered a retry loop, sending the same customer three emails.
Total damage: $14,000 in invalid refunds before someone noticed. Fixable, but embarrassing.
I tell you this because every "build AI agents" tutorial shows you the happy path. I want to show you the cliffs on either side. If you're new to LLMs in general, start with my enterprise LLM overview first.
What Actually Makes Agents Different
Forget the buzzwords. An agent is just an LLM in a loop:
User gives goal → Model thinks → Model calls a tool → Model sees result → Model thinks again → ...repeat until done
That's it. The "magic" is that the model decides which tools to call and when to stop. This is both the power and the danger — because sometimes it decides wrong, and sometimes it doesn't stop.
The difference from a regular chatbot? A chatbot answers your question. An agent does something about it. That's a massive difference in risk, cost, and value.
The Only Pattern You Need to Start: ReAct
There are papers about dozens of agent architectures. Multi-agent orchestration, plan-and-execute, tree-of-thought planning, reflexion... Most of them are academically interesting and practically unnecessary for your first agent.
Start with ReAct (Reasoning + Acting). It's the workhorse pattern, and the one Claude, GPT-4, and every other major model are tuned to follow. The model alternates between thinking ("I need to look up this order") and acting (calling the lookup tool). It builds on chain-of-thought prompting but adds the ability to take real actions.
Here's a minimal implementation:
```python
from anthropic import Anthropic

def react_agent(query: str, tools: list, max_iterations: int = 10):
    client = Anthropic()
    messages = [{"role": "user", "content": query}]

    for _ in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "tool_use":
            # Echo the assistant turn, then answer EVERY tool_use block in it.
            # The API rejects the next call if any tool_use lacks a tool_result.
            messages.append({"role": "assistant", "content": response.content})
            tool_results = [
                {
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": execute_tool(block.name, block.input),  # your dispatcher
                }
                for block in response.content
                if block.type == "tool_use"
            ]
            messages.append({"role": "user", "content": tool_results})
        else:
            return response.content[0].text

    return "Max iterations reached"
```
That max_iterations = 10 is not decoration. Without it, I've seen agents loop 200+ times trying to "figure out" a task, burning $15 in API calls on a single customer question. Always set a ceiling.
Tools: Where Agents Get Their Power (and Their Problems)
An agent without tools is just an LLM talking to itself. Tools are what make agents useful — and dangerous.
Here's what I've learned about designing tools:
Make tools stupid-simple. One tool, one job. Don't create a manage_customer tool that can look up orders and process refunds and send emails. Make three separate tools. The model is better at choosing between specific tools than navigating a complex one.
Write descriptions as if you're explaining to a new employee. The model reads the tool description to decide when and how to use it. Vague descriptions → wrong tool choices.
```python
tools = [
    {
        "name": "search_database",
        "description": "Search the company database for customer records by name, email, or order ID. Returns up to 10 matching records. Use this BEFORE attempting any customer-specific actions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Customer name, email address, or order ID to search for"
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum results to return (default: 10)",
                    "default": 10
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "send_email",
        "description": "Send an email to a customer. Only use AFTER confirming with the user that the email content is correct. This action cannot be undone.",
        "input_schema": {
            "type": "object",
            "properties": {
                "to": {"type": "string", "description": "Recipient email address"},
                "subject": {"type": "string", "description": "Email subject line"},
                "body": {"type": "string", "description": "Email body (plain text)"}
            },
            "required": ["to", "subject", "body"]
        }
    }
]
```
See that "This action cannot be undone" in the email tool description? That's not for humans — it's for the model. It makes Claude more cautious about calling that tool, which is exactly what you want for irreversible actions.
Distinguish between read and write tools. This is the lesson from my refund disaster. Read tools (search, lookup) are safe to retry. Write tools (send email, process payment, update record) are not. Your error handling should be very different for each.
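One way to make that split concrete is to tag each tool in the registry and gate retries on the flag. A sketch; the `Tool` dataclass, registry, and tool names here are illustrative, not a real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    handler: Callable[[dict], str]
    writes: bool  # True for irreversible, side-effecting tools

# Illustrative registry: handlers are stand-ins for real implementations.
REGISTRY = {
    "lookup_order": Tool("lookup_order", lambda p: f"order {p['id']}", writes=False),
    "send_email": Tool("send_email", lambda p: "sent", writes=True),
}

def execute(name: str, params: dict, max_retries: int = 3) -> str:
    tool = REGISTRY[name]
    # Read tools get retries; write tools get exactly one attempt.
    attempts = 1 if tool.writes else max_retries
    last_error = None
    for _ in range(attempts):
        try:
            return tool.handler(params)
        except Exception as exc:
            last_error = exc
    # Surface the failure to the model instead of swallowing it.
    return f"Error: {last_error}. Please try a different approach."
```

The flag also gives you one obvious place to hang approval checks and audit logging for every write.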
MCP: The Protocol That's Changing Everything
Model Context Protocol (MCP) standardizes how agents connect to external services. Instead of building custom tool integrations for every database, API, and service, you connect to MCP servers:
```python
mcp_client = MCPClient("postgresql://localhost/mydb")

# Agent can now query your database using tools provided by the MCP server
response = agent.run(
    "Find all orders from last week that haven't shipped",
    tools=mcp_client.get_tools()
)
```
MCP servers exist for databases, GitHub, Slack, file systems, and dozens of other services. It's cut our integration time by 70-80% on recent projects. For agents that need semantic search, pair MCP with a vector database.
Multi-Agent Systems: Usually Overkill, Sometimes Essential
The multi-agent pattern is simple in concept:
```python
class AgentOrchestrator:
    def __init__(self):
        self.agents = {
            "researcher": ResearchAgent(),
            "analyst": AnalysisAgent(),
            "writer": WritingAgent(),
        }

    def execute(self, task: str) -> str:
        research = self.agents["researcher"].run(task)
        analysis = self.agents["analyst"].run(research)
        output = self.agents["writer"].run(analysis)
        return output
```
I'll be blunt: for 80% of use cases, a single well-prompted agent with good tools will outperform a multi-agent system. Multi-agent adds complexity, latency, and cost. The hand-off between agents loses context. Debugging becomes a nightmare — which agent made the bad decision?
Use multi-agent when:
- Subtasks genuinely require different capabilities (e.g., code generation + code review)
- You need parallel processing (agents working simultaneously on different parts)
- Security boundaries demand it (one agent with read access, another with write access)
Don't use it because it sounds cool on a slide deck.
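When the parallel case does apply, plain threads are usually enough, since agent steps are I/O-bound API calls. A hypothetical sketch with a stand-in `run_agent`:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(role: str, task: str) -> str:
    # Placeholder: call your actual single-agent runner here.
    return f"[{role}] finished: {task}"

def parallel_research(task: str, roles: list[str]) -> str:
    # Each agent works on its slice concurrently; the work is waiting on
    # API responses, so threads suffice -- no multiprocessing needed.
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        futures = [pool.submit(run_agent, role, task) for role in roles]
        results = [f.result() for f in futures]
    return "\n".join(results)
```

Note the merge step is dumb concatenation here; in practice that's where a final summarizing call earns its keep.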
The Production Checklist (From Painful Experience)
Taking an agent from "works on my laptop" to "runs in production" is where most projects die. Here's my checklist, forged from every mistake I've described and a few I'm too embarrassed to mention. For the full journey, see my prototype-to-production guide.
1. Cost Limits (Non-Negotiable)
Without cost limits, a confused agent will drain your budget in minutes. I'm not exaggerating — I've seen a single agent task consume $47 because it kept retrying a failing tool call with increasingly elaborate prompts.
```python
class CostLimitExceeded(Exception):
    pass

class CostAwareAgent:
    def __init__(self, max_cost_usd: float = 1.0):
        self.max_cost = max_cost_usd
        self.total_cost = 0.0

    def run(self, task: str):
        # Check the budget before every model call, not after the fact.
        while self.total_cost < self.max_cost:
            response = self._call_model(task)
            self.total_cost += self._calculate_cost(response)
            if response.stop_reason == "end_turn":
                return response
        raise CostLimitExceeded(f"Exceeded ${self.max_cost} limit")
```
Set per-task limits and per-user limits and daily limits. Belt, suspenders, and a backup belt. For deeper strategies on controlling AI spend, see my cost optimization case study.
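One way to layer those limits is a single guard that every cost event flows through. A sketch; the class name, cap values, and use of RuntimeError are mine:

```python
from collections import defaultdict

class BudgetGuard:
    def __init__(self, task_cap=1.0, user_cap=5.0, daily_cap=200.0):
        self.caps = {"task": task_cap, "user": user_cap, "day": daily_cap}
        self.user_spend = defaultdict(float)
        self.day_spend = 0.0

    def charge(self, user_id: str, task_spend: float, cost: float) -> float:
        """Record a cost; refuse if any layer's cap would be breached."""
        if task_spend + cost > self.caps["task"]:
            raise RuntimeError("per-task budget exceeded")
        if self.user_spend[user_id] + cost > self.caps["user"]:
            raise RuntimeError("per-user budget exceeded")
        if self.day_spend + cost > self.caps["day"]:
            raise RuntimeError("daily budget exceeded")
        self.user_spend[user_id] += cost
        self.day_spend += cost
        return task_spend + cost  # updated running total for this task
```

Each layer catches a different failure mode: a runaway task, an abusive user, and a fleet-wide incident.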
2. Error Handling That Helps the Agent
When a tool fails, don't just swallow the error. Return a helpful message to the agent so it can adjust:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def execute_tool_with_retry(tool_name: str, inputs: dict):
    # RateLimitError / ToolExecutionError are your tool layer's exception types
    try:
        return execute_tool(tool_name, inputs)
    except RateLimitError:
        raise  # Transient -- let tenacity handle the retry
    except ToolExecutionError as e:
        logger.error(f"Tool {tool_name} failed: {e}")
        return f"Error: {e}. Please try a different approach."
```
That last line — "Please try a different approach" — is doing real work. It tells the model to change strategy instead of retrying the same failing call. Small thing, huge impact.
3. Observability (Because You Will Need to Debug at 2 AM)
```python
import structlog

logger = structlog.get_logger()

def agent_step(step_num: int, action: str, result: str):
    logger.info(
        "agent_step",
        step=step_num,
        action=action,
        result_length=len(result),
        tokens_used=count_tokens(result)
    )
```
Log every single step. Every tool call, every result, every decision. When something goes wrong in production (and it will), you'll be reading these logs at 2 AM trying to figure out why the agent decided to email a customer in Portuguese.
Track these numbers weekly:
- Steps per task — sudden increases mean the agent is confused
- Tool failure rate — by tool, not aggregate
- Cost per task — broken down by task type
- Completion rate — how often does the agent actually finish vs. hit limits
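Those numbers fall straight out of structured logs. A sketch of the weekly rollup, assuming a simple per-task record shape of my own invention:

```python
from statistics import mean

def weekly_report(tasks: list[dict]) -> dict:
    # Assumed record shape per task:
    # {"steps": int, "cost": float, "task_type": str,
    #  "completed": bool, "tool_failures": {tool_name: count}}
    cost_by_type: dict[str, list[float]] = {}
    failures_by_tool: dict[str, int] = {}
    for t in tasks:
        cost_by_type.setdefault(t["task_type"], []).append(t["cost"])
        for tool, n in t["tool_failures"].items():
            failures_by_tool[tool] = failures_by_tool.get(tool, 0) + n
    return {
        "avg_steps": mean(t["steps"] for t in tasks),
        "completion_rate": sum(t["completed"] for t in tasks) / len(tasks),
        "cost_by_type": {k: round(mean(v), 4) for k, v in cost_by_type.items()},
        "failures_by_tool": failures_by_tool,  # per tool, never aggregate
    }
```

Pipe this into whatever dashboard you already have; the point is that the four numbers exist and someone looks at them weekly.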
4. Human-in-the-Loop for Anything Irreversible
This is the lesson from my $14,000 refund adventure. For any action that can't be undone — sending emails, processing payments, modifying production data — require human approval. Yes, it slows things down. That's the point.
The exception: once you have enough data to trust the agent on specific action types (say, after 1,000 successful refunds under $50 with a <1% error rate), you can selectively remove the human check. But earn that trust with data, don't assume it.
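That earned-trust rule is easy to encode. A sketch using the thresholds above; the stats-store shape is illustrative:

```python
def needs_human(action: str, amount: float, stats: dict) -> bool:
    # stats maps action type -> {"successes": int, "errors": int},
    # accumulated from past agent runs (shape assumed for this sketch).
    s = stats.get(action, {"successes": 0, "errors": 0})
    total = s["successes"] + s["errors"]
    error_rate = s["errors"] / total if total else 1.0  # no data = no trust
    trusted = s["successes"] >= 1000 and error_rate < 0.01
    # Even a trusted action type goes to a human above the dollar cap.
    return not (trusted and amount < 50)
```

Everything starts gated; autonomy is granted per action type, and only after the numbers justify it.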
A Complete Example: The Support Agent (Done Right This Time)
Here's the customer support agent, rebuilt with everything I've learned from breaking the first one:
```python
from anthropic import Anthropic

class CustomerSupportAgent:
    def __init__(self):
        self.client = Anthropic()
        self.tools = [
            self._lookup_order_tool(),
            self._check_refund_eligibility_tool(),
            self._process_refund_tool(),
            self._send_notification_tool(),
        ]  # tool-definition builders and _handle_tool_use elided for brevity

    def handle_request(self, customer_message: str, customer_id: str):
        system_prompt = """You are a customer support agent.

Rules you MUST follow:
1. Always look up the order FIRST before any other action.
2. If any tool returns an error, tell the customer you need to
   check with a team member. Do NOT proceed with assumptions.
3. For refunds over $100, say you need manager approval.
4. Never send more than one email per interaction.
5. If you're unsure about anything, ask the customer to clarify
   rather than guessing.

Be warm but concise. Nobody likes reading a novel from support."""

        messages = [{
            "role": "user",
            "content": f"Customer ID: {customer_id}\n\nMessage: {customer_message}"
        }]

        for _ in range(8):  # Hard limit: 8 steps max
            response = self.client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                system=system_prompt,
                tools=self.tools,
                messages=messages
            )
            if response.stop_reason == "tool_use":
                messages = self._handle_tool_use(response, messages)
            else:
                return response.content[0].text

        return "I need to escalate this to a team member. You'll hear back within 2 hours."
```
Spot the differences from the naive version:
- Explicit rules in the system prompt, including what to do on error
- Hard iteration limit (8, not unlimited)
- Graceful fallback when limit is hit ("escalate to human" instead of "max iterations reached")
- Dollar threshold for human approval ($100+)
- One email per interaction rule to prevent my loop nightmare
The Honest Assessment: When NOT to Build Agents
Not everything should be an agent. I've talked clients out of agents more often than I've talked them into one.
Don't use an agent when a pipeline works. If your workflow is always the same steps in the same order — use a pipeline. Agents shine when the path to the solution isn't known in advance.
Don't use an agent for simple Q&A. If users ask questions and you look up answers, that's RAG, not an agent. Adding an agent loop to Q&A just adds cost and latency.
Don't use an agent when latency matters. Agent loops take 5-30 seconds depending on complexity. If your users expect sub-second responses, an agent is the wrong tool. (Unless you can do the agentic work async and notify users when it's done.)
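That async escape hatch can be as small as a ticket plus a background worker. A sketch where `run_agent`, `notify_user`, and the ticket scheme are all placeholders:

```python
import threading
import uuid

def run_agent(task: str) -> str:
    return f"done: {task}"  # stand-in for the real agent loop

def notify_user(user_id: str, message: str) -> None:
    print(f"to {user_id}: {message}")  # stand-in for email/webhook/push

def submit(task: str, user_id: str) -> str:
    """Return a ticket ID immediately; the slow agent work happens off-thread."""
    ticket = str(uuid.uuid4())

    def worker():
        result = run_agent(task)
        notify_user(user_id, f"[{ticket}] {result}")

    threading.Thread(target=worker, daemon=True).start()
    return ticket
```

The user gets a sub-second acknowledgment and a notification minutes later; in production you'd swap the thread for a real task queue.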
Don't use an agent when you can't tolerate mistakes. Agents will make wrong tool calls sometimes. If one wrong decision means regulatory violation, financial loss, or safety risk — and you can't add human approval without killing the UX — think hard before proceeding. To reduce incorrect outputs, apply hallucination prevention techniques.
What I'd Tell You Over Coffee
If you're just getting started with agents, here's my honest advice:
Build one agent. One. Pick a boring, low-stakes use case — a research assistant that searches the web, a code helper that runs tests, a data analyst that queries your database. Get it working. Watch it fail. Fix it. Watch it fail differently. Fix that too.
After about two weeks of this, you'll have a gut feeling for what agents are good at and where they break. That intuition is worth more than any architecture diagram.
Then, when someone asks you to build "an AI agent that handles our entire customer onboarding" — you'll know exactly which parts should be agentic, which should be pipelines, and which need a human in the chair.
The technology is genuinely powerful. But power without guardrails is just a liability. Build the guardrails first.
Want help designing agent systems for your team? I've deployed agents that range from "saved us 200 hours/month" to "nearly emailed the entire customer list." I can help you aim for the first outcome. Let's talk.
