Overview
AI Document Assistant is an enterprise document processing system built for a major legal consultancy in Ho Chi Minh City. The system allows lawyers and legal specialists to ask natural-language questions about the contents of thousands of contracts, case law documents, and legal texts stored in the knowledge base.
Rather than manually reading through files to locate a specific clause, users simply ask: "Find all contracts with penalty clauses exceeding 500 million VND" — and receive accurate, cited results within seconds.
The RAG (Retrieval-Augmented Generation) architecture ensures every answer is grounded in actual documents in the system, not the LLM's general knowledge — essential in legal work where factual accuracy is paramount.
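The grounding idea can be sketched in a few lines: the LLM is only shown retrieved passages and is instructed to cite them. This is a minimal pure-Python illustration; `build_grounded_prompt` and the passage fields are hypothetical names, not the project's actual code.

```python
# Sketch: constrain the LLM to answer only from retrieved, citable passages.
# The helper and field names (file, page, text) are illustrative assumptions.

def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Assemble an LLM prompt that restricts answers to the given passages."""
    context = "\n\n".join(
        f"[{i + 1}] ({p['file']}, p. {p['page']}): {p['text']}"
        for i, p in enumerate(passages)
    )
    return (
        "Answer ONLY from the numbered passages below. "
        "Cite passage numbers for every claim; if the answer is not "
        "in the passages, say you cannot find it.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is the penalty cap?",
    [{"file": "contract_042.pdf", "page": 7,
      "text": "The penalty shall not exceed 500,000,000 VND."}],
)
```

Because every passage carries its source file and page, the citations shown to lawyers fall out of the same structure.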
The Challenge
Legal documents have unique characteristics: specialized terminology, complex nested structures with multiple sub-clauses, and embedded tables. Many legacy documents are scanned PDFs with no text layer, requiring OCR before any processing. Vietnamese text with diacritics adds additional complexity for NLP models primarily trained on English.
Security requirements were strict: contract documents are highly confidential and cannot be sent to any external cloud API. The entire system had to run on-premise on the client's own servers.
Our Solution
Ventra Rocket deployed an on-premise RAG architecture using Ollama running a Llama 3.1 70B model locally, paired with ChromaDB as the vector store for semantic embedding search. The document processing pipeline uses Tesseract OCR for scanned PDFs and python-docx/openpyxl for Word and Excel files.
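The ingestion side of that pipeline amounts to routing each file to the right extractor. The sketch below shows the dispatch pattern under the assumption that each tool (Tesseract OCR, python-docx, openpyxl) is wrapped behind a callable; the extractor bodies here are illustrative stubs, not real integrations.

```python
# Sketch of multi-format ingestion routing; extractor bodies are stubs.
from pathlib import Path
from typing import Callable

def extract_pdf(path: Path) -> str:
    # Real pipeline: try the PDF text layer first, fall back to Tesseract OCR.
    return f"pdf-text:{path.name}"

def extract_docx(path: Path) -> str:
    # Real pipeline: python-docx paragraph and table walk.
    return f"docx-text:{path.name}"

def extract_xlsx(path: Path) -> str:
    # Real pipeline: openpyxl sheet iteration.
    return f"xlsx-text:{path.name}"

EXTRACTORS: dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".xlsx": extract_xlsx,
}

def ingest(path: Path) -> str:
    """Dispatch a document to its format-specific extractor."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"Unsupported format: {path.suffix}")
    return extractor(path)
```

Keeping the dispatch table explicit makes adding a new format (e.g. PowerPoint) a one-line change.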
The chunking strategy was customized for legal text: splitting by article and clause rather than by a fixed character count ensures each chunk is a complete semantic unit. Hybrid search combining vector similarity with BM25 keyword scoring significantly improves retrieval quality for specialized legal terminology.
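Both ideas can be sketched briefly. The regex below targets Vietnamese contract headings ("Điều N." = "Article N."), and the fusion weight `alpha` and min-max BM25 scaling are illustrative assumptions, not the production values.

```python
# Sketch: clause-aware chunking plus hybrid score fusion (assumed weights).
import re

# Zero-width split points before each article heading keep headings attached
# to their own chunk.
ARTICLE_RE = re.compile(r"(?m)^(?=Điều \d+\.)")

def split_by_article(text: str) -> list[str]:
    """Split on article headings so each chunk is one complete clause unit."""
    return [c.strip() for c in ARTICLE_RE.split(text) if c.strip()]

def hybrid_score(vec_sim: float, bm25: float, max_bm25: float,
                 alpha: float = 0.7) -> float:
    """Weighted fusion: cosine similarity (0..1) plus min-max scaled BM25."""
    bm25_norm = bm25 / max_bm25 if max_bm25 > 0 else 0.0
    return alpha * vec_sim + (1 - alpha) * bm25_norm

doc = ("Điều 1. Phạm vi hợp đồng.\n"
       "Điều 2. Phạt vi phạm không quá 500 triệu đồng.")
chunks = split_by_article(doc)  # one chunk per article
```

Splitting at heading boundaries means a retrieved chunk always carries its article number, which also feeds the citation metadata.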
Key Features
- Multi-format Ingestion: Automatically processes PDF (including scanned), Word, Excel, and PowerPoint — extracting text, tables, and metadata into a unified knowledge base.
- Semantic Q&A: Ask questions in natural Vietnamese about document contents — the system responds with precise citations (filename, page, paragraph).
- Document Summarization: Automatic structured summaries for lengthy documents: key points, obligations of each party, critical clauses, and potential risks.
- Contract Comparison: Compare two contract versions side by side, highlighting changed clauses and analyzing the legal impact of each difference.
- Batch Processing: Upload 100 documents at once; the system indexes in the background and notifies when ready for querying.
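The contract-comparison feature ultimately rests on a clause-level diff; the legal-impact analysis is layered on top. A minimal sketch using the standard library's difflib, with hypothetical clause strings:

```python
# Sketch: clause-level diff between two contract versions (stdlib only).
import difflib

def diff_clauses(old: list[str], new: list[str]) -> list[str]:
    """Return changed clauses, prefixed '-' (removed) / '+' (added)."""
    return [line for line in difflib.ndiff(old, new)
            if line.startswith(("- ", "+ "))]

changes = diff_clauses(
    ["Điều 5. Phạt 300 triệu đồng.", "Điều 6. Bảo mật thông tin."],
    ["Điều 5. Phạt 500 triệu đồng.", "Điều 6. Bảo mật thông tin."],
)
```

Unchanged clauses drop out, so only the clauses needing legal review are surfaced to the user.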
Impact & Results
In a pilot with a 15-lawyer team, the average time to locate information within documents decreased by 60%. A task that previously required 2 hours of reading through 50 contracts to find relevant clauses now completes in 5 minutes with higher accuracy.
Lawyers responded particularly positively to source citation — they can immediately verify any AI answer by clicking a reference to view the original passage, preserving the reliability standard required in legal practice.
Tech Stack Details
- LangChain: abstraction layer for the RAG pipeline, enabling easy swapping of LLMs and vector stores as requirements evolve.
- FastAPI: high-performance async API backend for compute-heavy tasks such as document ingestion.
- ChromaDB: on-premise vector store with strong performance for datasets under 1 million documents.
- Tesseract OCR: with the Vietnamese language pack, processes scanned PDFs at 94% character accuracy for clearly printed text.
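The value of that abstraction layer is that pipeline code depends on a small interface rather than a concrete store. This pure-Python stand-in illustrates the idea; it is not LangChain's actual classes, and `InMemoryStore` is a toy substitute for ChromaDB.

```python
# Sketch: retrieval code programs against an interface, so the vector store
# (ChromaDB today) can be swapped without touching the pipeline. Toy example.
from typing import Protocol

class VectorStore(Protocol):
    def search(self, query: str, k: int) -> list[str]: ...

class InMemoryStore:
    """Toy store standing in for ChromaDB behind the same interface."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str, k: int) -> list[str]:
        # Toy relevance: count words shared with the query.
        qwords = set(query.lower().split())
        scored = sorted(self.docs,
                        key=lambda d: -len(set(d.lower().split()) & qwords))
        return scored[:k]

def retrieve(store: VectorStore, question: str) -> list[str]:
    # Pipeline code sees only the interface, never the concrete store.
    return store.search(question, k=2)
```

Replacing `InMemoryStore` with a real ChromaDB-backed class would leave `retrieve` and everything above it unchanged.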