Build and Deploy Your Own Knowledge Retrieval App

Although there’s plenty of guidance on building LLM-based conversational tools, there’s still a lack of information on building full-stack web applications and deploying them. You can leverage modern libraries to build and deploy an LLM app to your own infrastructure.

Kedar Dabhadkar
6 min read · Oct 5, 2023
Web application built to explore custom documents and knowledge bases using Embedchain and Fast Dash. Reference: https://chatdocs.dkedar.com/. GIF by author.

How often do you find yourself buried under a mountain of documents, ranging from technical documentation, reports, and whitepapers to academic research and even YouTube videos? Have you ever thought of securely delegating the work of parsing and analyzing information to a Large Language Model (LLM) assistant?

While there’s an abundance of material on the implementation of LLM-based conversational tools, most lack practical guidance on deploying an LLM assistant within a web application context. Fortunately, with the help of modern innovative libraries, creating and deploying a personalized knowledge retrieval tool — both securely and efficiently — has become more attainable than ever.

Using natural language to interact with knowledge bases, PDFs, YouTube videos, and heaps of online information is a very effective use case for LLM-based applications. It’s no surprise that there’s no dearth of tools and services that enable such interaction. Notable mentions in this realm include ChatDOC, ChatDocuments, and LightPDF.

Most of these tools leverage a popular LLM technique called Retrieval Augmented Generation (RAG). Its primary function is to segment source documents into text chunks and then retrieve the chunks that align most closely with a user’s query.

Understanding RAG

Retrieval Augmented Generation, or RAG, has garnered substantial attention recently. Since extensive literature is already available on the topic, we’ll keep this overview brief.

In essence, RAG operates by segmenting source documents into manageable chunks. These chunks are then vectorized and stored in vector databases. When a user poses a query, the system fetches the relevant chunks, which are then processed by LLMs such as GPT-4 to craft a comprehensive response (Figure 1).
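To make that flow concrete, here’s a minimal, self-contained Python sketch of the three RAG stages. The bag-of-words “embedding”, in-memory index, and cosine-similarity search are stand-ins for a real embedding model and vector database, and the final LLM call is left as a placeholder:

import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "embedding": a bag-of-words vector (a real app would call an embedding model)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(document: str, size: int = 200) -> list:
    # Split a document into fixed-size character chunks
    return [document[i:i + size] for i in range(0, len(document), size)]

# 1. Index: chunk the source documents, embed each chunk, and store the vectors
documents = ["Retrieval Augmented Generation retrieves relevant chunks before generating an answer."]
index = [(embed(c), c) for doc in documents for c in chunk(doc)]

# 2. Retrieve: embed the query and fetch the closest chunks
query = "How does RAG work?"
top_chunks = [c for _, c in sorted(index, key=lambda pair: -cosine(pair[0], embed(query)))[:3]]

# 3. Generate: pass the retrieved context and the query to an LLM (placeholder here)
prompt = f"Answer using only this context:\n{top_chunks}\n\nQuestion: {query}"
# response = call_your_llm(prompt)  # e.g., GPT-4 via the OpenAI API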

For those interested in reading more, I recommend the following resources:

Figure 1. Retrieval Augmented Generation (RAG) workflow at a high level. Image by author.

Building a RAG workflow

Constructing a Retrieval Augmented Generation (RAG) workflow involves a sequence of interconnected components: a data loader, a chunking algorithm, an embedding model, a vector database, retrieval algorithms, and of course, an LLM. Designing a RAG workflow from scratch could be a tall order, requiring one to choose tools for each of these components and meticulously select their configurations.

Enter Embedchain, an open-source Python library that streamlines the entire process of building a RAG pipeline over many different source data formats. Embedchain builds on LangChain but trims down the configuration choices by providing sensible defaults. This makes it a convenient choice for our web application.

Let’s walk through how to deploy a rudimentary RAG workflow using Embedchain:

import os
from embedchain import App

os.environ["OPENAI_API_KEY"] = "sk-xxxx"

def explore_your_knowledge_base(web_page_urls: list,
                                query: str) -> str:
    "Interact with your web resources"

    # Define a custom app
    app = App()

    # Add all URLs to the app's vector store
    for url in web_page_urls:
        app.add(url)

    # Query the LLM (gpt-3.5-turbo by default)
    response = app.query(query)

    return response

Just like that, you’re set! By default, Embedchain opts for OpenAI’s gpt-3.5-turbo LLM, the text-embedding-ada-002 model for embeddings, and ChromaDB for the vector database. Read about all the other configurations on Embedchain’s documentation page.

To make our application conversational, we can adjust the configurations slightly by choosing a custom prompt and tracking the conversation history. We should also handle preliminary validations, like checking for API keys and queries. The complete modified code for the Embedchain app is hosted on GitHub.
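For illustration, here’s one hedged way those pieces could fit together, keeping a running history in plain Python and validating inputs before calling the LLM. The repository wires the history into a custom prompt template instead, so treat this as a sketch rather than the app’s actual code:

import os
from embedchain import App

conversation_history = []  # (query, response) pairs kept across turns

def explore_your_knowledge_base(web_page_urls: list, query: str) -> str:
    "Sketch: validate inputs and carry the conversation history forward"

    # Preliminary validations before calling the LLM
    if not os.getenv("OPENAI_API_KEY"):
        return "Please set the OPENAI_API_KEY environment variable."
    if not query:
        return "Please enter a question."

    app = App()
    for url in web_page_urls or []:
        app.add(url)

    # Fold earlier turns into the query so the assistant has context
    history = "\n".join(f"Q: {q}\nA: {a}" for q, a in conversation_history)
    response = app.query(f"{history}\n\nQ: {query}" if history else query)

    conversation_history.append((query, response))
    return response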

Deploying RAG as a web application

We saw how easy it is to build an LLM-based RAG workflow using Embedchain. Our next goal is its web deployment. This is where Fast Dash — another versatile Python library — comes into play.

Disclaimer: I am the author of Fast Dash. I built it to quickly create and launch Python-based prototypes, and to share complex analytics and machine learning results with my colleagues with minimal friction.

At its core, Fast Dash operates on two guiding principles.

  1. Simplicity in Function: Every web application can be distilled down to a single Python function.
  2. Function Annotations: With thorough annotations, a Python function can carry all the blueprint details needed for an interactive web application.

Fast Dash embodies these principles through the @fastdash decorator, which turns an annotated Python function into a web application. To learn more, refer to the Fast Dash documentation or read this post.
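As a minimal illustration of both principles, a single annotated function is all Fast Dash needs (this mirrors the quick-start pattern in the Fast Dash documentation; the exact example there may differ):

from fast_dash import fastdash

@fastdash
def greet(name: str) -> str:
    # Fast Dash reads the str annotations and builds a text-in, text-out web app
    return f"Hello, {name}!"

Running this script launches the app in your browser, with a text box for the name and a text output for the return value.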

To transform our Embedchain RAG workflow into a web application, here’s the adjusted code:

import os
from embedchain import App
from fast_dash import fastdash, Chat, dmc

os.environ["OPENAI_API_KEY"] = "sk-xxxx"

web_page_urls_component = dmc.MultiSelect(
    description="Include all the reference web URLs",
    placeholder="Enter URLs separated by commas",
    searchable=True,
    creatable=True,
)

@fastdash
def explore_your_knowledge_base(web_page_urls: web_page_urls_component,
                                query: str) -> Chat:
    "Interact with your web resources"

    # Define a custom app
    app = App()

    # Add all URLs to the app's vector store
    for url in web_page_urls:
        app.add(url)

    # Query the LLM (gpt-3.5-turbo by default)
    response = app.query(query)

    # Convert this conversation into a dictionary
    # to render a chat interface
    chat = dict(query=query, response=response)

    return chat

Running this code will deploy the Python function as a fully functioning web app! Notice the minor modifications we made to our original function:

  1. Modified Annotations: We modified input and output data type annotations. Fast Dash reads and understands these annotations. In this case, Fast Dash uses the Dash component web_page_urls_component to display a list of URLs and a text input box for the query. Besides that, we also use the Fast Dash component Chat to render a chat interface as the output.
  2. Refined Return Type: Instead of returning a response string, we modify the function to return a dictionary with two keys, query and response, as required by the Chat component.
  3. @fastdash Decorator: Finally, we add a @fastdash decorator to deploy our function as a fully functioning web application.

Running the above script serves the application on port 8080 by default. Here’s how it looks:

Result of deploying the code from the previous snippet. Image by author.

We can also incorporate additional input parameters, or even shift responsibility for providing the OpenAI API key to the user. Here’s what we get after doing that:

Embedchain RAG web application deployed with Fast Dash. Reference: https://chatdocs.dkedar.com/. Image by author.
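One way to accept the API key as a user input is sketched below. This is a hedged illustration, not the repository’s actual code; it reuses the web_page_urls_component and imports defined earlier and adds an openai_api_key argument that Fast Dash renders as a text input:

@fastdash
def explore_your_knowledge_base(openai_api_key: str,
                                web_page_urls: web_page_urls_component,
                                query: str) -> Chat:
    "Interact with your web resources"

    # Validate the user-supplied key before doing any work
    if not openai_api_key:
        return dict(query=query, response="Please enter your OpenAI API key.")
    os.environ["OPENAI_API_KEY"] = openai_api_key

    app = App()
    for url in web_page_urls:
        app.add(url)

    return dict(query=query, response=app.query(query))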

The full source code to build and deploy this web application is hosted publicly at https://github.com/dkedar7/embedchain-fastdash.

Deploying to a cloud service

One of the many benefits of Fast Dash being Flask-based is the ease of deploying apps using Gunicorn in production environments. Our next step is to encapsulate the app within a Docker container and host it using Google Cloud’s Cloud Run service.
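For context, the container image could look something like the Dockerfile below. This is a generic sketch rather than the repository’s actual Dockerfile; the app.py filename and the choice to run the script directly (instead of through Gunicorn) are assumptions:

FROM python:3.11-slim
WORKDIR /app

# Install dependencies first to take advantage of Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Fast Dash serves on port 8080 by default, which is also the port Cloud Run expects
EXPOSE 8080
CMD ["python", "app.py"]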

Google Cloud Run lets you deploy a web app directly from a publicly hosted GitHub repository. To see this in action, navigate to the app repository here and click the “Deploy on Google Cloud” button, which walks you through a streamlined deployment of the app onto your own Google Cloud project.

The entire deployment process typically takes 10 to 15 minutes; here’s a time-lapse video showing the deployment steps.

Time-lapse video demonstrating the deployment of the Fast Dash app for exploring custom knowledge bases on Google Cloud Run. Video by author.

Conclusion

The task of constructing a knowledge retrieval application can initially seem overwhelming. However, as we saw, the rise of innovative libraries and tools has remarkably simplified the process. These innovations streamline not only development but also deployment, making a once-arduous task nearly turnkey.

For one-click deployment of your own knowledge retrieval web application, visit the GitHub repository and use the “Deploy on Google Cloud” button!
