Serverless RAG on AWS
Introduction
As artificial intelligence (AI) advances rapidly, businesses are looking for ways to integrate up-to-date information into large language models (LLMs) so their AI applications can deliver accurate, contextually relevant responses. One startup faced the challenge of building a smart Q&A application that could answer questions from internal data without managing complex infrastructure. It chose to implement a serverless Retrieval Augmented Generation (RAG) solution on Amazon Web Services (AWS).
Challenge
The goal was to build a Q&A application allowing employees to query information from internal documents, such as HR policies, product guidelines, or internal reports. However, they encountered the following issues:
- Outdated Information: Traditional LLMs are trained on static data and cannot incorporate the latest information from internal documents.
- High Operational Costs: Maintaining server infrastructure for storing and processing vector data for RAG was costly and complex for a startup.
- Complex Integration: Building a complete RAG pipeline, from document ingestion to response generation, required integrating multiple technologies and services.
Solution
The startup adopted the serverless RAG solution described in the AWS Samples repository, utilizing AWS services such as AWS Lambda, Amazon Bedrock, Amazon S3, and LanceDB to create a fully serverless RAG pipeline. The process included the following key steps:
1. Document Ingestion and Processing
- Amazon S3: The company stored internal documents (PDFs, HTML, text) in an S3 bucket. Documents were automatically ingested upon upload through an event-driven mechanism.
- AWS Lambda: A Lambda function was triggered to process each document, extracting its text content and converting it into vector embeddings with the Amazon Titan Text Embeddings v2 model.
- LanceDB: The embeddings were stored in LanceDB, a serverless vector database backed by S3, ensuring efficiency and low cost for storage and retrieval.
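The ingestion steps above can be sketched as a single Lambda handler. This is a minimal illustration, not the sample's actual code: the bucket URI, table name, and chunking parameters are assumptions, while the Titan model ID and the LanceDB calls follow those projects' public Python APIs.

```python
import json

EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"
VECTOR_DB_URI = "s3://my-vector-bucket/lancedb"  # assumed location of the LanceDB data

def chunk_text(text, max_chars=1000, overlap=100):
    """Split a document into overlapping chunks small enough to embed."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

def embed(bedrock, text):
    """Call the Titan Text Embeddings v2 model on Bedrock for one chunk."""
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def handler(event, context):
    """Triggered by an S3 event notification when a document is uploaded."""
    import boto3, lancedb  # deferred imports: only needed at runtime in Lambda
    s3 = boto3.client("s3")
    bedrock = boto3.client("bedrock-runtime")
    tbl = lancedb.connect(VECTOR_DB_URI).open_table("documents")  # assumed table name
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        tbl.add([
            {"text": c, "vector": embed(bedrock, c), "source": key}
            for c in chunk_text(text)
        ])
```

The overlap between chunks keeps sentences that straddle a chunk boundary retrievable from either side.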
2. Retrieval and Response Generation
- When a user submitted a query through the application interface, the query was converted into a vector embedding using the Amazon Titan model.
- LanceDB performed a similarity search to identify the most relevant documents based on the query’s embedding.
- The relevant documents were combined with the original query to form an augmented prompt, which was then sent to the Anthropic Claude v2 model on Amazon Bedrock to generate accurate and contextually relevant responses.
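The retrieval-and-generation path can be sketched as follows. This is a hedged outline rather than the sample's real code: the database URI, table name, passage format, and `top_k` value are assumptions; the Bedrock model IDs and the Human/Assistant prompt shape match the Claude v2 text-completions convention.

```python
import json

def build_prompt(question, passages):
    """Combine retrieved passages and the user's question into an augmented
    prompt in the Human/Assistant format Claude v2 text completions expect."""
    context = "\n\n".join(f"<passage>{p}</passage>" for p in passages)
    return (
        "\n\nHuman: Use only the passages below to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\n\nAssistant:"
    )

def answer(question, db_uri="s3://my-vector-bucket/lancedb", top_k=4):
    """End-to-end query path: embed, search, generate (db_uri/top_k assumed)."""
    import boto3, lancedb  # deferred imports: only needed at runtime
    bedrock = boto3.client("bedrock-runtime")
    # 1. Embed the query with the same Titan model used at ingestion time.
    emb_resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question}),
    )
    query_vec = json.loads(emb_resp["body"].read())["embedding"]
    # 2. Similarity search in LanceDB for the most relevant chunks.
    tbl = lancedb.connect(db_uri).open_table("documents")
    hits = tbl.search(query_vec).limit(top_k).to_list()
    # 3. Send the augmented prompt to Claude v2 on Amazon Bedrock.
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({
            "prompt": build_prompt(question, [h["text"] for h in hits]),
            "max_tokens_to_sample": 512,
        }),
    )
    return json.loads(resp["body"].read())["completion"]
```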
3. User Interface
- The application used a Vite React front end, hosted on Amazon CloudFront for fast delivery. The appconfig.json file contained public information such as the Lambda URL, the WebSocket endpoint, and Amazon Cognito credentials for communicating with the backend.
- Users could access the application via a provided URL (e.g., https://dxxxxxxxxxxx.cloudfront.net) and log in through Amazon Cognito, which enforced a strict password policy: at least 8 characters, including numbers, uppercase and lowercase letters, and special characters.
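For illustration, an appconfig.json of the kind described might look like the following. All field names and placeholder values here are assumptions for the sketch; only the kinds of values (Lambda URL, WebSocket endpoint, Cognito identifiers) come from the description above.

```json
{
  "lambda_url": "https://xxxxxxxx.lambda-url.us-east-1.on.aws/",
  "websocket_url": "wss://xxxxxxxx.execute-api.us-east-1.amazonaws.com/prod",
  "cognito": {
    "region": "us-east-1",
    "user_pool_id": "us-east-1_XXXXXXXXX",
    "client_id": "xxxxxxxxxxxxxxxxxxxxxxxxxx"
  }
}
```

These values are public client-side configuration, not secrets; Cognito still requires users to authenticate before the backend accepts requests.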
Architecture

Results
After implementing the serverless RAG solution, the startup achieved significant outcomes:
- Accurate and Up-to-Date Responses: The application could answer questions about HR policies, such as “How many leave days are employees entitled to annually?”, by retrieving relevant documents from S3 and providing accurate responses based on the latest internal data.
- Cost Efficiency: With the pay-per-use pricing of Amazon Bedrock and LanceDB, the company only paid for the resources used. Processing a 1MB document cost less than half a cent, significantly reducing expenses compared to maintaining traditional infrastructure.
- Rapid Deployment: Using AWS CloudFormation, the entire infrastructure was deployed in minutes, allowing the business to focus on application development rather than server management.
- Scalability: The serverless solution automatically scaled with demand, ensuring consistent performance even as the number of documents or users increased.
Features
- Chat Playground: Interact with LLMs and inspect the retrieved documents.
- Chat History Management: Manage your chat history, select which messages are forwarded, or add messages to test and debug your prompts.
- Serverless Knowledge Base: The sample uses LanceDB backed by S3 as its vector database, so you pay only for the storage you use and have no additional infrastructure to manage.
- Dynamic Prompt Management: Users can override the default system prompt by specifying new prompts in the settings.
Conclusion
By adopting the serverless RAG solution from the AWS Samples repository, the startup successfully built a cost-effective, scalable, and intelligent Q&A application. This solution not only improved the accuracy of AI responses but also alleviated the burden of infrastructure management, enabling the company to focus on innovation and product development. The project demonstrates the power of AWS’s serverless architecture and AI services in empowering startups to build advanced AI applications.
6/12/2025
QaiDora Products
QaiDora draws inspiration from the myth of Pandora’s box—a symbol of unexpected possibilities and hope. For us, AI models are like modern Pandora’s boxes, holding untapped potential to turn challenges into opportunities. At QAI, QaiDora serves as an ecosystem of AI products designed to drive innovation and deliver competitive advantages.