# Building Enterprise AI Chatbots: Architecture and Best Practices
How to design production-grade AI chatbots — intent classification, RAG integration, conversation memory, escalation flows, and evaluation metrics.
Enterprise AI chatbots need intent routing, knowledge grounding, conversation memory, escalation logic, and measurable performance — not just an LLM API call.
## Architecture Overview

```
User Message → Input Guard → Intent Classifier
                     ↓
    FAQ Handler | RAG Handler | Task Handler
                     ↓
         Response Generator (LLM)
                     ↓
Output Guard → Escalation Check → Human Agent (if confidence < 0.7)
```
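The routing step in the diagram can be sketched as a plain dispatch table. This is a minimal illustration, not a prescribed implementation; the handler names (`handle_faq`, `handle_order_status`, `handle_fallback`) are hypothetical placeholders for the real FAQ/RAG/task handlers.

```python
from typing import Callable

# Hypothetical handlers; in production each would call the LLM / RAG pipeline.
def handle_faq(message: str) -> str:
    return f"FAQ answer for: {message}"

def handle_order_status(message: str) -> str:
    return f"Order status for: {message}"

def handle_fallback(message: str) -> str:
    return "Let me connect you with a human agent."

# Intent → handler dispatch table, mirroring the diagram's routing step.
HANDLERS: dict[str, Callable[[str], str]] = {
    "faq": handle_faq,
    "order_status": handle_order_status,
}

def route(intent: str, confidence: float, message: str, threshold: float = 0.7) -> str:
    # Low-confidence classifications fall through to escalation,
    # matching the confidence < 0.7 check in the diagram.
    if confidence < threshold:
        return handle_fallback(message)
    return HANDLERS.get(intent, handle_fallback)(message)
```

Keeping routing as data (a dict) rather than an if/elif chain makes adding a new intent a one-line change.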
## 1. Intent Classification

```python
import json

from openai import OpenAI

client = OpenAI()

INTENT_SYSTEM = """Classify the user message into one of these intents:
- faq: general questions about products/services
- order_status: queries about specific orders
- technical_support: technical issues
- complaint: customer complaints
- escalate: explicit request for human agent
Return JSON: {"intent": "...", "confidence": 0.0-1.0}"""

def classify_intent(message: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": INTENT_SYSTEM},
            {"role": "user", "content": message},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    # Parse with json.loads — never eval() model output, which would
    # execute arbitrary expressions the model happens to emit.
    return json.loads(response.choices[0].message.content)
```
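Even with `response_format` set, the model can return an unknown intent label or an out-of-range confidence. A small validation wrapper keeps downstream routing safe; this is a sketch, and `validate_intent` is a hypothetical helper name, not part of the OpenAI SDK.

```python
VALID_INTENTS = {"faq", "order_status", "technical_support", "complaint", "escalate"}

def validate_intent(raw: dict) -> dict:
    """Coerce a raw classifier result into a safe {intent, confidence} dict."""
    intent = raw.get("intent")
    if intent not in VALID_INTENTS:
        # An unrecognized label routes to a human rather than a wrong handler.
        intent = "escalate"
    try:
        confidence = float(raw.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Clamp to [0, 1] in case the model returns e.g. 95 instead of 0.95.
    confidence = min(max(confidence, 0.0), 1.0)
    return {"intent": intent, "confidence": confidence}
```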
## 2. RAG-Grounded Responses

```python
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://qdrant:6333")

def retrieve_context(query: str, collection: str, top_k: int = 4) -> list[str]:
    # get_embedding() is assumed to be defined elsewhere and to return the
    # query's embedding vector (e.g. via an embeddings API).
    embedding = get_embedding(query)
    results = qdrant.search(
        collection_name=collection,
        query_vector=embedding,
        limit=top_k,
        score_threshold=0.75,  # drop weak matches rather than pad the context
    )
    return [r.payload["text"] for r in results]

def generate_grounded_response(query: str, context_chunks: list[str]) -> str:
    # Number each chunk so the model can cite it as [1], [2], ...
    context = "\n\n".join(f"[{i+1}] {chunk}" for i, chunk in enumerate(context_chunks))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer using ONLY the provided context. Cite sources using [1], [2] notation.",
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.3,
        max_tokens=500,
    )
    return response.choices[0].message.content
```
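Because the system prompt requests `[1]`-style citations, it is useful to map them back to the retrieved chunks so the UI can display sources. A minimal regex-based sketch (`extract_citations` is a hypothetical helper, not a library function):

```python
import re

def extract_citations(answer: str, context_chunks: list[str]) -> list[str]:
    """Return the chunks actually cited as [n] in the generated answer."""
    # Collect all bracketed numbers, e.g. "[1]" or "[2]".
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    # Citation numbers are 1-based indexes into context_chunks;
    # silently skip out-of-range references the model may hallucinate.
    return [context_chunks[n - 1] for n in sorted(cited) if 0 < n <= len(context_chunks)]
```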
## 3. Conversation Memory

```python
import json

from redis import Redis

# Redis() takes host/port kwargs, not url; use from_url() for a URL.
redis_client = Redis.from_url("redis://localhost:6379")

def get_conversation_history(session_id: str, max_turns: int = 10) -> list[dict]:
    # Each turn is two entries (user + assistant), so fetch max_turns * 2 messages.
    raw = redis_client.lrange(f"chat:{session_id}", -max_turns * 2, -1)
    return [json.loads(m) for m in raw]

def save_message(session_id: str, role: str, content: str) -> None:
    msg = json.dumps({"role": role, "content": content})
    redis_client.rpush(f"chat:{session_id}", msg)
    redis_client.expire(f"chat:{session_id}", 3600)  # 1h TTL
```
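Putting memory to use means prepending the stored turns to each LLM call. A sketch of the message assembly (pure list construction, no I/O), assuming `history` comes from `get_conversation_history` above; `build_messages` is a hypothetical helper name:

```python
def build_messages(system_prompt: str, history: list[dict], user_message: str) -> list[dict]:
    """Assemble the chat payload: system prompt, prior turns, then the new message."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": user_message}]
    )
```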
## 4. Escalation Logic

```python
ESCALATION_TRIGGERS = [
    "speak to a human", "talk to agent", "frustrated", "urgent",
]

def should_escalate(message: str, intent: dict, consecutive_failures: int) -> bool:
    # Explicit requests for a human always win.
    if any(t in message.lower() for t in ESCALATION_TRIGGERS):
        return True
    # A weak classification plus repeated failures suggests the bot is stuck.
    if intent["confidence"] < 0.5 and consecutive_failures >= 2:
        return True
    # Complaints go straight to a human.
    if intent["intent"] == "complaint":
        return True
    return False
```
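The `consecutive_failures` counter has to persist across turns somewhere. One option is a small per-session tracker; this sketch is in-memory only (production would likely keep the counter in Redis alongside the chat history), and the class name is hypothetical.

```python
from collections import defaultdict

class FailureTracker:
    """Count consecutive low-confidence turns per session."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self._counts: dict[str, int] = defaultdict(int)

    def record(self, session_id: str, confidence: float) -> int:
        # A confident turn resets the streak; a weak one extends it.
        if confidence < self.threshold:
            self._counts[session_id] += 1
        else:
            self._counts[session_id] = 0
        return self._counts[session_id]
```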
## 5. Evaluation Metrics

| Metric | Target |
|--------|--------|
| Intent accuracy | > 90% |
| Resolution rate | > 75% |
| CSAT score | > 4.0/5 |
| Avg turns to resolve | < 4 |
| Escalation rate | < 20% |
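Several of these metrics can be computed directly from labeled conversation logs. A sketch assuming a simple per-conversation record format; the field names (`predicted_intent`, `true_intent`, `resolved`, `escalated`) are illustrative, not a standard schema:

```python
def compute_metrics(conversations: list[dict]) -> dict:
    """Aggregate intent accuracy, resolution rate, and escalation rate from logs."""
    total = len(conversations)
    correct = sum(1 for c in conversations if c["predicted_intent"] == c["true_intent"])
    resolved = sum(1 for c in conversations if c["resolved"])
    escalated = sum(1 for c in conversations if c["escalated"])
    return {
        "intent_accuracy": correct / total,
        "resolution_rate": resolved / total,
        "escalation_rate": escalated / total,
    }
```

Running this over a held-out, human-labeled sample (rather than all traffic) keeps intent accuracy honest, since production labels come from the classifier itself.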
## Conclusion
Ventra Rocket has built chatbots handling thousands of conversations daily, achieving 80%+ self-service resolution rates. The architecture — intent classification, RAG grounding, memory, escalation — is the minimum for production enterprise chatbots.
## Related Articles

- **RAG Architecture for Enterprise Document Processing**: A practical guide to designing a Retrieval-Augmented Generation system for querying enterprise internal documents with high accuracy using vector databases and LLMs.
- **LLM Integration Best Practices for Enterprise**: A battle-tested guide to integrating Large Language Models into enterprise systems — prompt engineering, cost optimisation, safety guardrails, structured output, and evaluation frameworks.