Building an AI-Driven Document Query Assistant: A Step-by-Step Python Tutorial

Introduction

This tutorial explains a Python script that combines several packages and techniques to create a sophisticated document-based question answering system. The system leverages AI models to interpret and respond to user queries based on a repository of documents. It’s an excellent example of how to integrate multiple AI and machine learning components into a practical application that can significantly enhance information retrieval and user interaction. By the end of this tutorial, even those new to coding will understand the mechanics behind the code and why it’s a valuable addition to any data-driven project. You can think of this script as a retrieval-augmented generation (RAG) system; it works best with PDF documents.

Setting Up Your Python Environment

Before diving into the code, it’s essential to set up a Python environment. This environment will contain all the libraries and their specific versions required to run the code smoothly without affecting other Python projects you might have. Here’s how to do it:

  1. Install Python: Ensure Python is installed on your system. Python 3.8 or newer is recommended.
  2. Create a Virtual Environment:
    1. Open your terminal or command prompt.
    2. Navigate to the project directory.
    3. Run python -m venv env to create a virtual environment named env.
    4. Activate the environment with source env/bin/activate on Unix or macOS, or .\env\Scripts\activate on Windows.
  3. Install Dependencies: Install the required packages by running pip install streamlit python-dotenv scikit-learn llama-index sentence-transformers. Note that some pip package names differ from their import names: dotenv installs as python-dotenv, sklearn as scikit-learn, and llama_index as llama-index.

Don’t forget to register for an OpenAI API key and store it in the .env file.

Code Explanation
Importing Packages
# code:
import streamlit as st
import os
from dotenv import load_dotenv
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, load_index_from_storage, StorageContext
from llama_index.core.settings import Settings
from llama_index.core.response.pprint_utils import pprint_response
import warnings
from llama_index.core.indices.postprocessor import SentenceTransformerRerank
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Each import statement brings a specific library into the scope:

  • streamlit: For creating the web app interface.
  • os: To handle file and directory operations.
  • dotenv: To manage environment variables.
  • llama_index: Components from llama_index to handle document indexing and response generation.
  • sklearn: Machine learning tools for text processing and similarity calculation.
  • warnings: To manage warnings during runtime.

Loading Environment Variables
# code:
load_dotenv('.env')

This command loads environment variables from a .env file, which is crucial for keeping sensitive data (like API keys) out of the source code.

Explanation of the .env File

In many development projects, particularly those that involve sensitive or environment-specific settings, it is crucial to manage such data securely and efficiently. The .env file plays a vital role in this context, as it allows developers to separate configuration and credentials from the codebase. This separation enhances security and flexibility, making the application easier to manage across different environments (development, testing, production).

What is a .env File?

A .env file is a simple text file that contains environment variables. These are key-value pairs that can be used to configure application settings or store sensitive information like passwords, API keys, or database URIs. The .env file is usually located in the root directory of a project, and it’s often excluded from version control (e.g., through .gitignore) to prevent sensitive data from being exposed publicly.

How Does It Work?

When a Python script runs, it typically has access to an environment—a set of variables that define how the operating system behaves and how various programs are executed. By using the .env file, developers can add or modify environment variables specific to their application without affecting the global system environment.

The dotenv package in Python is commonly used to load these variables into the application’s environment. When you use load_dotenv(), it reads the .env file, and the variables set there can then be accessed using Python’s os.environ like so:

# code:
import os
from dotenv import load_dotenv
load_dotenv() # Load environment variables from .env
api_key = os.environ.get("API_KEY") # Access the API_KEY environment variable

Examples of Variables in a .env File

Here are a few examples of what might be included in a .env file for a document-based question answering system:

# code:
OPENAI_API_KEY="your_api_key_here"
STORAGE_PATH="./vectorstore"
DOCUMENTS_DIR="./documents"
MODEL_NAME="gpt-3.5-turbo"
DATABASE_URL="postgres://username:password@localhost:5432/mydatabase"

Each line represents a different setting or piece of information necessary for the application (a sketch of reading these values back in Python follows this list). For example:

  • OPENAI_API_KEY might be used to authenticate requests to OpenAI’s API.
  • STORAGE_PATH and DOCUMENTS_DIR could specify where to store application data and documents, respectively.
  • MODEL_NAME might indicate which AI model to load.
  • DATABASE_URL could provide all the details needed to connect to a database.
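The script in this tutorial hard-codes its paths and model name, but they could just as easily be read from the .env file. Here is a minimal sketch, assuming the variable names from the example above:

# code:
import os
from dotenv import load_dotenv

load_dotenv()  # Load key-value pairs from .env into the environment

# Read settings, falling back to sensible defaults if a key is absent
storage_path = os.environ.get("STORAGE_PATH", "./vectorstore")
documents_path = os.environ.get("DOCUMENTS_DIR", "./documents")
model_name = os.environ.get("MODEL_NAME", "gpt-3.5-turbo")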

Why Use a .env File?

Using a .env file has several advantages:

  • Security: Sensitive data like API keys and passwords are kept out of the source code.
  • Flexibility: It’s easy to change settings without modifying the code, making the app more adaptable to different environments.
  • Convenience: Centralizes configuration, making it easier to manage and understand the application’s settings.

Detailed Description of Streamlit Functions

Streamlit is an open-source Python library used to create and share beautiful, custom web apps for machine learning and data science. In the context of our document-based question answering system, Streamlit is crucial for building the user interface that interacts with the backend logic. Here, we’ll explain some specific Streamlit functions used in the application:

st.cache_resource, st.error, st.file_uploader, st.rerun, and st.session_state

Understanding these functions will help clarify how the web interface manages user interaction and maintains state.

st.cache_resource

The st.cache_resource function is used to cache data or computations in a Streamlit app. This is especially useful when dealing with expensive data loading or processing operations that you don’t want to repeat every time the app re-runs. Caching can significantly speed up the app by storing the results of function calls and reusing them when the inputs haven’t changed.

# code:
@st.cache_resource(show_spinner=False)
def initialize():
    # Expensive data loading or processing goes here;
    # the cached return value is reused on later reruns.
    ...

  • show_spinner: This optional parameter, when set to False, prevents the spinner (loading animation) from displaying while the function executes.

st.error

st.error is used to display error messages in the app. If something goes wrong—such as if the necessary documents are not found in the specified directory—this function can provide a clear message to the user, improving the app’s usability and troubleshooting experience.

# code:
if not os.listdir(documents_path):
    st.error("No documents found. Please upload your documents.")

This function renders the message in a visually distinctive style, making it clear that an error has occurred.

st.file_uploader

st.file_uploader allows users to upload files through the web interface. This function is essential for the document-based system, as it enables users to provide the documents that the system will analyze and respond to questions about.

# code:
uploaded_files = st.file_uploader("Upload documents", accept_multiple_files=True, type=['pdf', 'txt', 'docx'])

  • accept_multiple_files: When set to True, this allows users to upload more than one file at a time.
  • type: This parameter specifies the types of files that can be uploaded, limiting uploads to PDF, TXT, and DOCX formats in this case. (Saving the uploads to disk is shown below.)
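Once uploaded, the files still need to be written to disk so the indexer can find them. The main() function in the complete code does this as follows:

# code:
for uploaded_file in uploaded_files:
    # Persist each upload into the documents directory
    with open(os.path.join(documents_path, uploaded_file.name), "wb") as f:
        f.write(uploaded_file.getvalue())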

st.rerun

st.rerun is used to programmatically rerun the app from the top. This is useful when the state of the app has changed significantly due to user actions, such as after uploading new documents. By rerunning the script, the app can update to reflect the new data without requiring the user to manually refresh or restart.

# code:
st.rerun()

This function resets the execution of the script, ensuring that any new or changed data is incorporated into the app’s state.

st.session_state

st.session_state is used to maintain state across reruns of the app. It acts like a persistent, modifiable dictionary that can store variables such as user inputs, computed values, or temporary data that needs to be accessed by different parts of the app over time.

# code:
if 'messages' not in st.session_state:
    st.session_state.messages = [{'role': 'assistant', 'content': 'Ask me a question!'}]

This code snippet checks if messages is not already a key in st.session_state. If it’s not, it initializes it with a welcome message. This ensures that as users interact with the app, their conversation can continue seamlessly even if the app reruns.
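Later in the app, new messages are appended to this same list, which is how the conversation history survives reruns. For example, from the complete code:

# code:
st.session_state.messages.append({'role': 'user', 'content': prompt})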

Setting Up and Handling Directories
# code:
storage_path = "./vectorstore"
documents_path = "./documents"
if not os.path.exists(storage_path):
    os.makedirs(storage_path, exist_ok=True)
if not os.path.exists(documents_path):
    os.makedirs(documents_path, exist_ok=True)

These lines set up directories for storing vector indices and documents, creating them if they don’t exist.

Configuring and Initializing Components
  • Settings: Configures the language model to be used.
  • Reranker: Initializes a sentence transformer model for improving response relevance.
  • Parser: Splits documents into sentences for indexing (see the configuration sketch below).
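These three components are configured near the top of the complete code at the end of this tutorial; here is that configuration in isolation:

# code:
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings
from llama_index.core.indices.postprocessor import SentenceTransformerRerank
from llama_index.core.node_parser import SentenceSplitter

# Use GPT-3.5 Turbo with a low temperature for focused, deterministic answers
Settings.llm = OpenAI(model='gpt-3.5-turbo', temperature=0.1)

# Cross-encoder reranker that keeps the 5 most relevant nodes
reranker = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=5)

# Sentence-level splitter used to chunk documents before indexing
parser = SentenceSplitter()
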
Enhancing and Reranking Responses

The enhance_and_rerank_responses function combines semantic reranking with text enhancement to find the most relevant and comprehensive response based on TF-IDF and cosine similarity.
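The TF-IDF selection step at the heart of this function can be sketched on its own. After the semantic reranker orders the candidates, the code vectorizes them with TF-IDF, computes pairwise cosine similarities, and picks the response most similar to all the others on average (the most "central", and thus typically most comprehensive, candidate). The helper name below is illustrative; the logic is extracted from enhance_and_rerank_responses in the complete code:

# code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_most_central(responses):
    # Turn each candidate response into a TF-IDF vector
    tfidf_matrix = TfidfVectorizer().fit_transform(responses)
    # Pairwise cosine similarity between all candidates
    cosine_matrix = cosine_similarity(tfidf_matrix)
    # The winner is the candidate most similar to the rest on average
    best_idx = cosine_matrix.mean(axis=0).argmax()
    return responses[best_idx]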

Additional Information on the ML Models and Their Roles

Machine learning models play a pivotal role in enhancing the capabilities of applications like the document-based question answering system described earlier. Two key components in this application are the Sentence Transformer Reranker and the TF-IDF Vectorizer. Understanding how these models work and why they are chosen will provide deeper insight into their contribution to the application’s functionality.

What is Sentence Transformer Rerank? The Sentence Transformer Rerank model is a type of neural network specifically designed for the task of semantic search and text ranking. It leverages transformers, which are advanced deep learning models known for their effectiveness in understanding the context and meaning of text.

How Does It Work? The model uses a technique known as sentence embedding, where sentences are converted into high-dimensional vectors. These vectors capture the semantic meanings of the sentences, allowing the model to perform operations such as ranking by comparing the semantic similarity between vectors. In the context of our application, it is used to rerank responses based on their relevance to the user’s query.
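To make this concrete, here is a minimal sketch of cross-encoder scoring using the sentence-transformers library directly (the app itself uses llama_index’s SentenceTransformerRerank wrapper around the same model); the query and passages are made-up examples:

# code:
from sentence_transformers import CrossEncoder

# The same cross-encoder model that the app's reranker wraps
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2")

query = "What is the refund policy?"
passages = [
    "Refunds are issued within 30 days of purchase.",
    "Our office is open Monday through Friday.",
]

# Score each (query, passage) pair; a higher score means more relevant
scores = model.predict([(query, p) for p in passages])
ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # The passage judged most relevant to the query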

Why Use It?

  • Improved Relevance: By understanding the deeper meaning behind the text, the Sentence Transformer can rerank responses in a way that prioritizes those most relevant to the query, rather than just those that contain similar words.
  • Efficiency: It helps in filtering out less relevant answers quickly, focusing on quality content that is most likely to satisfy the user’s informational needs.

What is TF-IDF Vectorizer? TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

How Does It Work?

  • Term Frequency (TF): Measures the frequency of a word in a document.
  • Inverse Document Frequency (IDF): Measures how common or rare a word is across all documents.
  • By multiplying these two values, TF-IDF gives a weight to each word, signifying its importance in the document.
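A small, made-up example shows the effect: a word that appears in every document receives a low weight, while a word distinctive to one document receives a high weight:

# code:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat",
    "the dog ran",
    "the bird flew",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Weights for the first document: "the" appears in every document,
# so it scores lower than the distinctive "cat" and "sat"
for word, col in vectorizer.vocabulary_.items():
    if tfidf[0, col] > 0:
        print(word, round(float(tfidf[0, col]), 3))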

Why Use It?

  • Identifying Key Words: Helps in pinpointing which words are most telling about the content of a document.
  • Relevance: It can be used to compute a kind of ‘relevance score’ between documents and queries, enhancing the quality of information retrieval.
  • Simplicity and Effectiveness: Despite its simplicity, TF-IDF remains one of the most effective methods for transforming text data into a usable format for machine learning models.

Combining the Models

In our application, the Sentence Transformer Rerank model initially processes responses to understand their meaning and rank them based on how well they match the query semantically. Then, the TF-IDF Vectorizer further processes these responses, comparing the importance of words in the response texts to determine which is most likely to be the comprehensive and best-matching answer. This combination ensures that the answers are not only contextually relevant but also rich in content that directly addresses the user’s questions.

Main Application Function
# code:
def main():
    ...  # Full implementation shown in the Complete Code section below

This function orchestrates the entire application, handling file uploads, initializing the index, processing user queries, and displaying responses.

Running the Application
# code:
if __name__ == "__main__":
    main()

This conditional ensures that main() runs only when the script is executed directly rather than imported as a module. Launch the app with:

# code:
# (app.py is the file containing the complete script below.)
streamlit run app.py

At the time of writing, the code runs without errors using the package versions pinned in requirements.txt below. If you install newer versions of these packages, some APIs may have changed and the code may break; consult the relevant package documentation to adapt it.

Conclusion

This code demonstrates a complex integration of AI, machine learning, and web development to create a responsive and intelligent document-based querying system. It’s an excellent project for anyone looking to delve into AI-powered applications, providing a practical foundation in handling user inputs, processing data, and generating meaningful outputs.

Complete Code:
import streamlit as st
import os
from dotenv import load_dotenv
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, load_index_from_storage, StorageContext
from llama_index.core.settings import Settings
from llama_index.core.response.pprint_utils import pprint_response
import warnings
from llama_index.core.indices.postprocessor import SentenceTransformerRerank
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Load environment variables
load_dotenv('.env')
# Suppress specific FutureWarnings from huggingface_hub
warnings.filterwarnings("ignore", category=FutureWarning, module='huggingface_hub')
class QueryBundle:
    def __init__(self, query_str):
        self.query_str = query_str
# Define paths
storage_path = './vectorstore'
documents_path = './documents'
# Set the model configuration
Settings.llm = OpenAI(model='gpt-3.5-turbo', temperature=0.1)
# Ensure directories exist
if not os.path.exists(storage_path):
    os.makedirs(storage_path, exist_ok=True)
if not os.path.exists(documents_path):
    os.makedirs(documents_path, exist_ok=True)
# Initialize the reranker
reranker = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=5)
# Initialize the parser
parser = SentenceSplitter()
# Override the pprint_response imported above so it can handle plain strings
def pprint_response(response, show_source=False):
    if isinstance(response, str):
        print(response)  # Handle the string directly
    else:
        if response.response is None:
            print("No response.")
        else:
            print(response.response)
            if show_source:
                print("Source:", response.source)
                
class EnhancedTextNode:
    def __init__(self, text_node):
        self.node = text_node  # Wrap the original TextNode
    def get_content(self, metadata_mode):
        return self.node.text  # Implement a method that the reranker might call
# Modify the enhance_and_rerank_responses function to wrap TextNodes
def enhance_and_rerank_responses(responses, query):
    """ Combine reranking and enhancing to select the most comprehensive and relevant response. """
    if not responses:
        return "No responses available."
    
    # Reranking using the semantic reranker
    query_bundle = QueryBundle(query)
    nodes = [EnhancedTextNode(TextNode(text=res)) for res in responses]  # Wrap TextNodes for compatibility
    reranked_nodes = reranker.postprocess_nodes(nodes=nodes, query_bundle=query_bundle)
    reranked_responses = [node.node.text for node in reranked_nodes]  # Adjust access to text
    # Enhance the response quality by selecting the most comprehensive answer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(reranked_responses)
    cosine_matrix = cosine_similarity(tfidf_matrix)
    avg_similarity = cosine_matrix.mean(axis=0)
    best_response_idx = avg_similarity.argmax()
    return reranked_responses[best_response_idx]
# Function to initialize or load the index
@st.cache_resource(show_spinner=False)
def initialize():
    if not os.path.isfile(os.path.join(storage_path, 'docstore.json')):
        documents = SimpleDirectoryReader(documents_path).load_data()
        nodes = parser.get_nodes_from_documents(documents)
        index = VectorStoreIndex(nodes)
        index.storage_context.persist(persist_dir=storage_path)
    else:
        storage_context = StorageContext.from_defaults(persist_dir=storage_path)
        index = load_index_from_storage(storage_context)
    return index
def main():
    # Check for documents and possibly upload new ones
    if not os.listdir(documents_path):
        st.error("No documents found. Please upload your documents.")
        uploaded_files = st.file_uploader("Upload documents", accept_multiple_files=True, type=['pdf', 'txt', 'docx'])
        if uploaded_files:
            for uploaded_file in uploaded_files:
                with open(os.path.join(documents_path, uploaded_file.name), "wb") as f:
                    f.write(uploaded_file.getvalue())
            st.rerun()  # Rerun the script after files are uploaded
            
    # Initialize index if documents are present
    if os.listdir(documents_path):
        index = initialize()
        st.title('Ask the Document')
        if 'messages' not in st.session_state:
            st.session_state.messages = [{'role': 'assistant', 'content': 'Ask me a question!'}]
        chat_engine = index.as_chat_engine(chat_mode='condense_question', verbose=True)
        if prompt := st.text_input('Your question'):
            st.session_state.messages.append({'role': 'user', 'content': prompt})
        for message in st.session_state.messages:
            with st.expander(f"{message['role'].title()} says:"):
                st.write(message['content'])
        if st.session_state.messages[-1]['role'] != 'assistant':
            with st.spinner('Thinking...'):
                response = chat_engine.chat(prompt)
                response_texts = response.response if isinstance(response.response, list) else [response.response]
                st.write(response_texts)
                
                # Create a QueryBundle object
                # query_bundle = QueryBundle(prompt)
                
                # Assume response.response is a list of strings
                # nodes = [TextNode(text=res) for res in response.response] if isinstance(response.response, list) else []
                # reranked_nodes = reranker.postprocess_nodes(nodes=nodes, query_bundle=query_bundle)
                # reranked_response = ' '.join([node.text for node in reranked_nodes])
                # st.write(reranked_response)
                best_response = enhance_and_rerank_responses(response_texts, prompt)
                pprint_response(best_response, show_source=True)
                st.session_state.messages.append({'role': 'assistant', 'content': best_response})
    else:
        st.write("Upload documents to start using the application.") 
        
if __name__ == "__main__":
    main()

Requirements.txt

Some packages in requirements.txt are not used by the script directly; they are left over from experiments with other libraries, and untangling their transitive dependencies is more trouble than it is worth, so they remain pinned here. Install everything with:

# code:
pip install -r requirements.txt

After running this command, all of the packages listed below are installed and available for the script to import.

Packages:
aiohttp==3.9.5
aiosignal==1.3.1
altair==5.3.0
annotated-types==0.6.0
anyio==4.3.0
async-timeout==4.0.3
attrs==23.2.0
beautifulsoup4==4.12.3
black==24.4.2
blinker==1.8.2
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
dataclasses-json==0.6.5
Deprecated==1.2.14
dirtyjson==1.0.8
distro==1.9.0
exceptiongroup==1.2.1
filelock==3.14.0
frozenlist==1.4.1
fsspec==2024.3.1
gitdb==4.0.11
GitPython==3.1.43
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.23.0
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
llama-index==0.10.35
llama-index-agent-openai==0.2.4
llama-index-cli==0.1.12
llama-index-core==0.10.35.post1
llama-index-embeddings-openai==0.1.9
llama-index-indices-managed-llama-cloud==0.1.6
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.18
llama-index-multi-modal-llms-openai==0.1.5
llama-index-program-openai==0.1.6
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.22
llama-index-readers-llama-parse==0.1.4
llama-parse==0.4.2
llamaindex-py-client==0.1.19
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.3
nltk==3.8.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.1.105
openai==1.27.0
packaging==24.0
pandas==2.2.2
pathspec==0.12.1
pillow==10.3.0
platformdirs==4.2.1
protobuf==4.25.3
pyarrow==16.0.0
pydantic==2.7.1
pydantic_core==2.18.2
pydeck==0.9.0
Pygments==2.18.0
pypdf==4.2.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.1
referencing==0.35.1
regex==2024.4.28
requests==2.31.0
rich==13.7.1
rpds-py==0.18.1
safetensors==0.4.3
scikit-learn==1.4.2
scipy==1.13.0
sentence-transformers==2.7.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
SQLAlchemy==2.0.30
streamlit==1.34.0
striprtf==0.0.26
sympy==1.12
tenacity==8.3.0
threadpoolctl==3.5.0
tiktoken==0.6.0
tokenizers==0.19.1
toml==0.10.2
tomli==2.0.1
toolz==0.12.1
torch==2.3.0
tornado==6.4
tqdm==4.66.4
transformers==4.40.2
triton==2.3.0
typing-inspect==0.9.0
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
watchdog==4.0.0
wrapt==1.16.0
yarl==1.9.4

