AI/ML · In Development

AI Document Assistant

AI document assistant — summarize, extract, answer questions from PDF/Word/Excel using RAG.

60% faster processing

Python · LangChain · OpenAI · FastAPI

Overview

AI Document Assistant is an enterprise document processing system built for a major legal consultancy in Ho Chi Minh City. The system allows lawyers and legal specialists to ask natural-language questions about the contents of thousands of contracts, case law documents, and legal texts stored in the knowledge base.

Rather than manually reading through files to locate a specific clause, users simply ask: "Find all contracts with penalty clauses exceeding 500 million VND" — and receive accurate, cited results within seconds.

The RAG (Retrieval-Augmented Generation) architecture ensures every answer is grounded in actual documents in the system, not the LLM's general knowledge — essential in legal work where factual accuracy is paramount.
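The grounding idea can be sketched in a few lines: every retrieved chunk carries a source label, and the prompt instructs the model to answer only from those numbered passages and cite them. This is an illustrative sketch, not the project's actual prompt; the `Chunk` type and `build_grounded_prompt` name are assumptions.

```python
# Minimal sketch of citation-grounded prompting: each retrieved chunk
# carries its source, and the model is told to cite passage numbers only.
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str  # e.g. "contract_042.pdf, Article 7"
    text: str

def build_grounded_prompt(question: str, chunks: list[Chunk]) -> str:
    # Number each passage and attach its source so answers can be verified.
    context = "\n\n".join(
        f"[{i + 1}] ({c.source})\n{c.text}" for i, c in enumerate(chunks)
    )
    return (
        "Answer ONLY from the numbered passages below. "
        "Cite passages like [1]. If the answer is not present, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```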

The Challenge

Legal documents have unique characteristics: specialized terminology, complex nested structures with multiple sub-clauses, and embedded tables. Many legacy documents are scanned PDFs with no text layer, requiring OCR before any processing. Vietnamese text with diacritics adds additional complexity for NLP models primarily trained on English.

Security requirements were strict: contract documents are highly confidential and cannot be sent to any external cloud API. The entire system had to run on-premise on the client's own servers.

Our Solution

Ventra Rocket deployed an on-premise RAG architecture using Ollama running a Llama 3.1 70B model locally, paired with ChromaDB as the vector store for semantic embedding search. The document processing pipeline uses Tesseract OCR for scanned PDFs and python-docx/openpyxl for Word and Excel files.
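The ingestion side of that pipeline amounts to routing each incoming file to the right extractor. A hedged sketch of that dispatch, assuming a simple extension-based mapping (the extractor names here are hypothetical stubs standing in for the Tesseract, python-docx, and openpyxl code paths):

```python
# Illustrative file-type routing for the ingestion pipeline: scanned PDFs
# go through OCR, Office files through their respective parsers.
from pathlib import Path

EXTRACTORS = {
    ".pdf": "tesseract_ocr",   # scanned PDFs with no text layer
    ".docx": "python_docx",    # Word documents
    ".xlsx": "openpyxl",       # Excel workbooks
}

def pick_extractor(path: str) -> str:
    """Return the extractor key for a file, or raise on unsupported types."""
    ext = Path(path).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext!r}")
```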

The chunking strategy was customized for legal text: splitting by article and clause rather than fixed character count ensures each chunk is a complete semantic unit. Hybrid search combining vector similarity with BM25 keyword scoring significantly improves retrieval quality for specialized legal terminology.
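Structure-aware splitting of this kind can be sketched with a regex that breaks Vietnamese legal text at article headings ("Điều 1", "Điều 2", ...), so each chunk is one complete article rather than an arbitrary character window. The pattern below is illustrative, not the project's actual splitter:

```python
# Sketch of clause-aware chunking: split at the start of each article
# heading ("Điều N") using a zero-width lookahead, keeping headings
# attached to their own article text.
import re

ARTICLE_RE = re.compile(r"(?m)^(?=Điều\s+\d+)")

def chunk_by_article(text: str) -> list[str]:
    parts = [p.strip() for p in ARTICLE_RE.split(text)]
    return [p for p in parts if p]  # drop the empty leading fragment
```

A fixed-size splitter would happily cut a sub-clause in half; splitting on headings keeps each retrieval unit semantically whole, at the cost of uneven chunk lengths.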

Impact & Results

In a pilot with a 15-lawyer team, average time to locate information within documents decreased by 60%. A task that previously required 2 hours of reading through 50 contracts to find relevant clauses now completes in 5 minutes, with higher accuracy.

Lawyers responded particularly positively to source citation — they can immediately verify any AI answer by clicking a reference to view the original passage, preserving the reliability standard required in legal practice.

Tech Stack Details

LangChain provides the abstraction layer for the RAG pipeline — enabling easy swapping of LLMs and vector stores as requirements evolve. FastAPI builds a high-performance async API backend for compute-heavy tasks like document ingestion. ChromaDB serves as the on-premise vector store with strong performance for datasets under 1 million documents. Tesseract OCR with a Vietnamese language pack processes scanned PDFs at 94% character accuracy for clearly printed text.
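The hybrid retrieval mentioned earlier (vector similarity plus BM25 keyword scoring) needs a way to merge two ranked result lists. The exact fusion the project uses isn't specified; reciprocal rank fusion (RRF) is one common choice, sketched here with the conventional k = 60 constant:

```python
# Sketch of reciprocal rank fusion: each document scores 1/(k + rank)
# in every ranked list it appears in, and the scores are summed.
def rrf_fuse(vector_ranked: list[str], bm25_ranked: list[str],
             k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

RRF needs no score normalization between the two retrievers, which makes it robust when cosine similarities and BM25 scores live on different scales.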
