Retrieval-Augmented Generation (RAG)

Building a Retrieval-Augmented Generation (RAG) Application Using Local LLM and Milvus

Introduction

In this article, we will walk through the development of a Retrieval-Augmented Generation (RAG) application that processes markdown (.md) documents, generates embeddings using a fast embedding model, stores them in a Milvus vector database, and uses a locally hosted Large Language Model (LLM), served through Ollama, to answer user queries. This approach eliminates the need for cloud-based LLMs, offering more control over data and reducing API costs.

Key Components and Workflow

  1. Data Processing
    • Method: Extract all text content from the markdown (.md) documents and store it in a dictionary, with the filename as the key and the text content as the value (a minimal loading sketch appears after this list). This is one of the most critical steps in building a RAG pipeline: author your .md files so that related content is grouped under the same header, which lets the markdown chunker keep that text together in a single chunk. Vision-based layout models (such as LayoutParser) can be used to strip page headers and footers, which would otherwise introduce spurious headings of no use to the RAG, and basic string-processing techniques can clean the data further.
    • Advantages:
      • Better-cleaned, better-preprocessed data yields the best downstream results, making this the most important step.
    • Disadvantages:
      • Additional preprocessing increases the computational cost.
  2. Text Chunking with LangChain
    • Method: We used the markdown chunking method provided by LangChain to break the document content into smaller, manageable chunks (a chunking sketch appears after this list). Markdown chunking not only splits the document but also stores the header information in the metadata field, which can be used as a keyword in later stages of retrieval. For each chunk we also generated a summary and extracted important keywords; these are stored as metadata in the vector database and used for more precise searching later in the retrieval process (this is called metadata filtering). The choice of chunking method, along with its hyperparameters, is another very important decision: optimal chunks that keep related information together lead to better retrieval. The goal is to chunk so that each chunk contains a unique piece of information that does not overlap with the information in other chunks. The choice of chunking method is generally document- and field-specific; there is no rule of thumb for which one is better. You will usually have to try each strategy level by level, and whichever works best for your documents is the right one. There are generally five levels of chunking: CharacterSplit, RecursiveCharacterSplit, document-specific, semantic, and agentic chunking. Start with the basic one and move up level by level.
    • Advantages:
      • Improved retrieval accuracy by focusing on smaller text segments.
      • Enhanced performance in embedding generation.
    • Disadvantages:
      • May result in loss of broader context if not carefully chunked.
    • Challenges:
      • Choice of the chunk size and overlap hyperparameter for optimal embedding performance.
  3. Vector Embedding Generation
    • Method: Choose a good embedding model to generate vector embeddings for each chunk (an embedding sketch appears after this list). We used FastEmbed, an open-source embedding library, to generate the embeddings, but there are many open-source and cloud-based models available, both free and paid, that can be used. The choice of model is again document- and application-dependent. If your documents are private or restricted, you can use an open-source model and run it on a local machine with a good GPU. If there are no such concerns, many cloud-based general and specialized models are available online and generally perform better than free open-source models, although nowadays there are also application-specific open-source models that can give comparable performance.
    • Advantages:
      • Rapid processing even with large datasets.
      • Compatible with various downstream applications (e.g., similarity search).
    • Disadvantages:
      • Depending on the model, embedding accuracy may vary.
    • Challenges: Ensuring the embedding model is optimized for the specific domain or content type.
  4. Milvus Vector Database Integration
    • Method: We stored the generated vector embeddings in Milvus, an open-source vector database, for efficient retrieval (a Milvus sketch appears after this list). The database should also be chosen wisely: many vector databases are available, and each supports different features, is built for different tasks, and has distinct advantages and disadvantages. For example, I chose cosine similarity as the metric for similarity search, which is available in Milvus but might not be available in some other databases; similarly, some databases do not provide metadata filtering but do offer fast and efficient retrieval. So it comes down to which features you consider most important and which can be compromised, and making your decision accordingly.
    • Advantages:
      • High performance in handling large-scale vector data.
      • Seamless integration with the LangChain framework.
    • Disadvantages:
      • Requires a learning curve to fully utilize advanced Milvus features.
    • Challenges: Efficiently managing vector updates and ensuring optimal search performance.
  5. Local LLM Integration with Ollama
    • Method: The retrieved text chunks are passed as context to the llama3 LLM, which runs locally and is integrated via LangChain’s Ollama wrapper (a retrieval-and-generation sketch appears after this list). The choice of chat LLM is again application-dependent: specialized LLMs generally give better answers than general-purpose ones. Fine-tuning an LLM can also help, but it takes a lot of time and GPU resources, especially for a model like llama3-70b, which is around 40 GB in size. Focus on optimizing the preceding steps to improve RAG performance instead; fine-tuning should be a last resort, and only if you have sufficient resources available.
    • Advantages:
      • Cost-effective by eliminating the need for cloud-based LLM API calls.
      • Enhanced privacy and security with local data processing.
    • Disadvantages:
      • Limited by the computational resources available locally.
      • Requires manual updates and maintenance of the LLM.
    • Challenges: Ensuring that the locally hosted model performs comparably to cloud-based alternatives.
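
Putting the workflow together, the sketches below illustrate each of the five steps. They are minimal illustrations under stated assumptions, not the exact production code.

For step 1, a minimal loader that reads every .md file in a folder into a {filename: text} dictionary and applies only trivial cleanup. The data/markdown path is a placeholder, and any LayoutParser-style header/footer removal would happen before or inside this loop.

```python
from pathlib import Path

def load_markdown_files(folder: str) -> dict[str, str]:
    """Read every .md file under `folder` into a {filename: text} dictionary."""
    documents = {}
    for md_file in Path(folder).glob("*.md"):
        text = md_file.read_text(encoding="utf-8")
        # Basic string cleanup: strip trailing spaces. Heavier cleaning
        # (header/footer removal, boilerplate stripping) would go here.
        documents[md_file.name] = "\n".join(line.rstrip() for line in text.splitlines())
    return documents

docs = load_markdown_files("data/markdown")  # placeholder folder path
```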
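
For step 2, a sketch of chunking with LangChain's MarkdownHeaderTextSplitter. The header metadata keys and the source field added per chunk are illustrative choices; the summary and keyword extraction described above is indicated only by a comment.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Map markdown heading levels to metadata keys (names chosen for illustration).
headers_to_split_on = [("#", "header_1"), ("##", "header_2"), ("###", "header_3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

all_chunks = []
for filename, text in docs.items():
    for chunk in splitter.split_text(text):  # Documents carrying header metadata
        chunk.metadata["source"] = filename
        # A per-chunk summary and keyword list (e.g. produced by an LLM or a
        # keyword extractor) would also be added to chunk.metadata here so they
        # can be used for metadata filtering at retrieval time.
        all_chunks.append(chunk)
```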
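
For step 3, generating the embeddings with FastEmbed. BAAI/bge-small-en-v1.5 is FastEmbed's small default English model (384-dimensional) and is used here purely as an example; substitute whatever model suits your documents.

```python
from fastembed import TextEmbedding

embedder = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")  # example model choice

texts = [chunk.page_content for chunk in all_chunks]
vectors = list(embedder.embed(texts))  # one 384-dimensional numpy array per chunk
```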
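
For step 4, storing vectors and metadata in Milvus. This sketch assumes the pymilvus MilvusClient with a local Milvus Lite file (rag_demo.db); the collection name and the quick-setup schema with dynamic fields are illustrative, while COSINE is the similarity metric chosen for this project.

```python
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")  # Milvus Lite file; use a server URI in production

client.create_collection(
    collection_name="md_chunks",
    dimension=384,          # must match the embedding size
    metric_type="COSINE",   # cosine similarity, as discussed above
)

rows = [
    {
        "id": i,
        "vector": vectors[i].tolist(),
        "text": all_chunks[i].page_content,
        **all_chunks[i].metadata,  # headers, source, summary, keywords, ...
    }
    for i in range(len(all_chunks))
]
client.insert(collection_name="md_chunks", data=rows)
```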
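
Finally, for step 5, retrieval plus generation. This sketch uses the langchain_community Ollama wrapper and assumes `ollama pull llama3` has already been run locally; the prompt template, the top-k of 3, and the sample question are illustrative choices.

```python
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")  # local model served by Ollama

def answer(question: str) -> str:
    # Embed the question and retrieve the closest chunks from Milvus.
    query_vec = list(embedder.embed([question]))[0]
    hits = client.search(
        collection_name="md_chunks",
        data=[query_vec.tolist()],
        limit=3,                   # top-k, tune per application
        output_fields=["text"],
    )
    context = "\n\n".join(hit["entity"]["text"] for hit in hits[0])

    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm.invoke(prompt)

print(answer("How do I configure the service?"))  # sample question
```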

Challenges of Using a Local LLM Instead of OpenAI Models

  1. Computational Resource Management
    • Running a local LLM can be resource-intensive, requiring powerful GPUs or other hardware acceleration.
    • Solution: Regularly monitor and optimize resource usage, possibly leveraging model quantization or optimization techniques.
  2. Model Maintenance and Updates
    • Unlike cloud-based models that are frequently updated, local models require manual updates.
    • Solution: Establish a routine for updating models and incorporating the latest advancements in LLMs.
  3. Initial Setup Complexity
    • Setting up and configuring a local LLM involves a steeper learning curve compared to using an API-based model.
    • Solution: Follow detailed documentation and community resources to streamline the setup process.
  4. Performance Variability
    • The performance of local models may vary depending on the hardware and specific use case.
    • Solution: Benchmark the model on various tasks to identify strengths and weaknesses and optimize accordingly.

Conclusion

By developing a RAG application that leverages markdown processing, LangChain chunking, fast embedding models, Milvus for vector storage, and a locally hosted Ollama LLM, you gain greater control over your data and reduce reliance on third-party APIs. While there are challenges associated with managing and maintaining local LLMs, the benefits of cost savings, data security, and customization make it a compelling choice for certain use cases.