# Building Enterprise AI Chatbots: Architecture and Best Practices
How to design production-grade AI chatbots — intent classification, RAG integration, conversation memory, escalation flows, and evaluation metrics.
Enterprise AI chatbots need intent routing, knowledge grounding, conversation memory, escalation logic, and measurable performance — not just an LLM API call.
## Architecture Overview

```
User Message → Input Guard → Intent Classifier
                     ↓
    FAQ Handler | RAG Handler | Task Handler
                     ↓
         Response Generator (LLM)
                     ↓
Output Guard → Escalation Check → Human Agent (if confidence < 0.7)
```
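The routing step in the diagram can be sketched as a plain dispatch table. This is a minimal illustration, not a prescribed implementation; the handler names (`handle_faq`, `handle_order_status`, `handle_fallback`) are hypothetical placeholders for the real FAQ/RAG/task handlers.

```python
from typing import Callable

# Hypothetical handlers; in production each would call the LLM / RAG pipeline.
def handle_faq(message: str) -> str:
    return f"FAQ answer for: {message}"

def handle_order_status(message: str) -> str:
    return f"Order status for: {message}"

def handle_fallback(message: str) -> str:
    return "Let me connect you with a human agent."

# Intent → handler dispatch table, mirroring the diagram's routing step.
HANDLERS: dict[str, Callable[[str], str]] = {
    "faq": handle_faq,
    "order_status": handle_order_status,
}

def route(intent: str, confidence: float, message: str, threshold: float = 0.7) -> str:
    # Low-confidence classifications fall through to escalation,
    # matching the confidence < 0.7 check in the diagram.
    if confidence < threshold:
        return handle_fallback(message)
    return HANDLERS.get(intent, handle_fallback)(message)
```

Keeping routing as data (a dict) rather than an if/elif chain makes adding a new intent a one-line change.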
## 1. Intent Classification

```python
import json

from openai import OpenAI

client = OpenAI()

INTENT_SYSTEM = """Classify the user message into one of these intents:
- faq: general questions about products/services
- order_status: queries about specific orders
- technical_support: technical issues
- complaint: customer complaints
- escalate: explicit request for human agent
Return JSON: {"intent": "...", "confidence": 0.0-1.0}"""

def classify_intent(message: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": INTENT_SYSTEM},
            {"role": "user", "content": message},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    # Parse with json.loads — never eval() model output, which would
    # execute arbitrary expressions the model happens to emit.
    return json.loads(response.choices[0].message.content)
```
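Even with `response_format` set, the model can return an unknown intent label or an out-of-range confidence. A small validation wrapper keeps downstream routing safe; this is a sketch, and `validate_intent` is a hypothetical helper name, not part of the OpenAI SDK.

```python
VALID_INTENTS = {"faq", "order_status", "technical_support", "complaint", "escalate"}

def validate_intent(raw: dict) -> dict:
    """Coerce a raw classifier result into a safe {intent, confidence} dict."""
    intent = raw.get("intent")
    if intent not in VALID_INTENTS:
        # An unrecognized label routes to a human rather than a wrong handler.
        intent = "escalate"
    try:
        confidence = float(raw.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Clamp to [0, 1] in case the model returns e.g. 95 instead of 0.95.
    confidence = min(max(confidence, 0.0), 1.0)
    return {"intent": intent, "confidence": confidence}
```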
## 2. RAG-Grounded Responses

```python
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://qdrant:6333")

def retrieve_context(query: str, collection: str, top_k: int = 4) -> list[str]:
    # get_embedding() is assumed to be defined elsewhere and to return the
    # query's embedding vector (e.g. via an embeddings API).
    embedding = get_embedding(query)
    results = qdrant.search(
        collection_name=collection,
        query_vector=embedding,
        limit=top_k,
        score_threshold=0.75,  # drop weak matches rather than pad the context
    )
    return [r.payload["text"] for r in results]

def generate_grounded_response(query: str, context_chunks: list[str]) -> str:
    # Number each chunk so the model can cite it as [1], [2], ...
    context = "\n\n".join(f"[{i+1}] {chunk}" for i, chunk in enumerate(context_chunks))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer using ONLY the provided context. Cite sources using [1], [2] notation.",
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.3,
        max_tokens=500,
    )
    return response.choices[0].message.content
```
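Because the system prompt requests `[1]`-style citations, it is useful to map them back to the retrieved chunks so the UI can display sources. A minimal regex-based sketch (`extract_citations` is a hypothetical helper, not a library function):

```python
import re

def extract_citations(answer: str, context_chunks: list[str]) -> list[str]:
    """Return the chunks actually cited as [n] in the generated answer."""
    # Collect all bracketed numbers, e.g. "[1]" or "[2]".
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    # Citation numbers are 1-based indexes into context_chunks;
    # silently skip out-of-range references the model may hallucinate.
    return [context_chunks[n - 1] for n in sorted(cited) if 0 < n <= len(context_chunks)]
```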
## 3. Conversation Memory

```python
import json

from redis import Redis

# Redis() takes host/port kwargs, not url; use from_url() for a URL.
redis_client = Redis.from_url("redis://localhost:6379")

def get_conversation_history(session_id: str, max_turns: int = 10) -> list[dict]:
    # Each turn is two entries (user + assistant), so fetch max_turns * 2 messages.
    raw = redis_client.lrange(f"chat:{session_id}", -max_turns * 2, -1)
    return [json.loads(m) for m in raw]

def save_message(session_id: str, role: str, content: str) -> None:
    msg = json.dumps({"role": role, "content": content})
    redis_client.rpush(f"chat:{session_id}", msg)
    redis_client.expire(f"chat:{session_id}", 3600)  # 1h TTL
```
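Putting memory to use means prepending the stored turns to each LLM call. A sketch of the message assembly (pure list construction, no I/O), assuming `history` comes from `get_conversation_history` above; `build_messages` is a hypothetical helper name:

```python
def build_messages(system_prompt: str, history: list[dict], user_message: str) -> list[dict]:
    """Assemble the chat payload: system prompt, prior turns, then the new message."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": user_message}]
    )
```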
## 4. Escalation Logic

```python
ESCALATION_TRIGGERS = [
    "speak to a human", "talk to agent", "frustrated", "urgent",
]

def should_escalate(message: str, intent: dict, consecutive_failures: int) -> bool:
    # Explicit requests for a human always win.
    if any(t in message.lower() for t in ESCALATION_TRIGGERS):
        return True
    # A weak classification plus repeated failures suggests the bot is stuck.
    if intent["confidence"] < 0.5 and consecutive_failures >= 2:
        return True
    # Complaints go straight to a human.
    if intent["intent"] == "complaint":
        return True
    return False
```
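The `consecutive_failures` counter has to persist across turns somewhere. One option is a small per-session tracker; this sketch is in-memory only (production would likely keep the counter in Redis alongside the chat history), and the class name is hypothetical.

```python
from collections import defaultdict

class FailureTracker:
    """Count consecutive low-confidence turns per session."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self._counts: dict[str, int] = defaultdict(int)

    def record(self, session_id: str, confidence: float) -> int:
        # A confident turn resets the streak; a weak one extends it.
        if confidence < self.threshold:
            self._counts[session_id] += 1
        else:
            self._counts[session_id] = 0
        return self._counts[session_id]
```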
## 5. Evaluation Metrics

| Metric | Target |
|--------|--------|
| Intent accuracy | > 90% |
| Resolution rate | > 75% |
| CSAT score | > 4.0/5 |
| Avg turns to resolve | < 4 |
| Escalation rate | < 20% |
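Several of these metrics can be computed directly from labeled conversation logs. A sketch assuming a simple per-conversation record format; the field names (`predicted_intent`, `true_intent`, `resolved`, `escalated`) are illustrative, not a standard schema:

```python
def compute_metrics(conversations: list[dict]) -> dict:
    """Aggregate intent accuracy, resolution rate, and escalation rate from logs."""
    total = len(conversations)
    correct = sum(1 for c in conversations if c["predicted_intent"] == c["true_intent"])
    resolved = sum(1 for c in conversations if c["resolved"])
    escalated = sum(1 for c in conversations if c["escalated"])
    return {
        "intent_accuracy": correct / total,
        "resolution_rate": resolved / total,
        "escalation_rate": escalated / total,
    }
```

Running this over a held-out, human-labeled sample (rather than all traffic) keeps intent accuracy honest, since production labels come from the classifier itself.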
## Conclusion
Ventra Rocket has built chatbots handling thousands of conversations daily, achieving 80%+ self-service resolution rates. The architecture — intent classification, RAG grounding, memory, escalation — is the minimum for production enterprise chatbots.
## Related Articles

- **RAG Architecture for Enterprise Document Processing**: A practical guide to designing a Retrieval-Augmented Generation system for querying enterprise internal documents with high accuracy using vector databases and LLMs.
- **LLM Integration Best Practices for Enterprise**: A battle-tested guide to integrating Large Language Models into enterprise systems — prompt engineering, cost optimisation, safety guardrails, structured output, and evaluation frameworks.