
FAISS-Powered RAG System with Llama 3.1

A high-performance Retrieval-Augmented Generation (RAG) system that uses FAISS for vector search and Llama 3.1 (via Ollama) for embeddings and text generation. This system allows you to ingest web content and ask questions about it, with the model providing contextually relevant answers.

System Architecture

  flowchart TD
    %% Define contrast for readability
    classDef setup fill:#2d6a4f,stroke:#1b4332,stroke-width:2px,color:#ffffff
    classDef ingestion fill:#004466,stroke:#002233,stroke-width:2px,color:#ffffff
    classDef embedding fill:#1d3557,stroke:#14213d,stroke-width:2px,color:#ffffff
    classDef storage fill:#663399,stroke:#442266,stroke-width:2px,color:#ffffff
    classDef retrieval fill:#994400,stroke:#662200,stroke-width:2px,color:#ffffff
    classDef reranking fill:#7a4f01,stroke:#553701,stroke-width:2px,color:#ffffff
    classDef generation fill:#8B0000,stroke:#600000,stroke-width:2px,color:#ffffff
    classDef explain fill:#f4f4f4,stroke:#888888,stroke-width:1px,font-style:italic

    subgraph Setup["⚙️ 1. System Initialization"]
        A["🛠️ Load NLTK, FAISS, Ollama"] --> B["📁 Create FAISS Index"]
        B --> C["⚡ Enable GPU (if available)"]
        C --> D["🗂️ Load existing index (if available)"]
        E["📝 **Process:**<br>1. Load required libraries<br>2. Initialize FAISS (CPU/GPU)<br>3. Prepare metadata storage"]:::explain
    end

    subgraph Ingestion["📥 2. Content Ingestion"]
        F["🌐 Fetch Webpage (requests)"] --> G["🛠️ Process HTML (BeautifulSoup)"]
        G --> H["📜 Extract & Clean Text"]
        H --> I["🔀 Chunking (Semantic Boundaries)"]
        J["📝 **Process:**<br>1. Fetch page<br>2. Remove unnecessary elements<br>3. Split into meaningful chunks"]:::explain
    end

    subgraph Embedding["📊 3. Vector Embeddings"]
        I --> K["🔢 Generate Embeddings (Ollama)"]
        K --> L["🗃️ Cache Embeddings (Avoid redundant calls)"]
        L --> M["📂 Add Vectors to FAISS Index"]
        N["📝 **Process:**<br>1. Convert chunks to vectors<br>2. Store in FAISS<br>3. Use caching for efficiency"]:::explain
    end

    subgraph Storage["💾 4. Index & Metadata Storage"]
        M --> O["💾 Save FAISS Index"]
        M --> P["📋 Save Metadata (JSON)"]
        Q["📝 **Process:**<br>1. Store FAISS index on disk<br>2. Maintain metadata for retrieval"]:::explain
    end

    subgraph QueryProcessing["🔍 5. Query Processing"]
        R["❓ User Question"] --> S["🔢 Convert Question to Vector"]
        S --> T["🔎 Search FAISS Index"]
        O --> T
        T --> U["📚 Retrieve Top Matching Chunks"]
        P --> U
        V["📝 **Process:**<br>1. Encode query<br>2. Find closest vectors<br>3. Retrieve relevant text"]:::explain
    end

    subgraph Reranking["📊 6. Reranking Results"]
        U --> W["🧐 Compute Keyword Match Score"]
        U --> X["📍 Compute Position Score"]
        W --> Y["📊 Adjust Weighted Scores"]
        X --> Y
        Y --> Z["📜 Return Top-K Relevant Chunks"]
        AA["📝 **Process:**<br>1. Score relevance based on<br>- Vector similarity<br>- Keyword match<br>- Chunk position"]:::explain
    end

    subgraph Answer["💡 7. Answer Generation"]
        Z --> AB["📝 Construct Prompt with Context"]
        R --> AB
        AB --> AC["🤖 Generate Response (Ollama)"]
        AC --> AD["📩 Return User Response"]
        AE["📝 **Process:**<br>1. Build question-context prompt<br>2. Generate LLM response"]:::explain
    end
    
    %% Links between subgraphs
    Setup -.-> Ingestion
    Ingestion -.-> Embedding
    Embedding -.-> Storage
    Storage -.-> QueryProcessing 
    QueryProcessing -.-> Reranking
    Reranking -.-> Answer
    
    %% Apply class definitions (explain nodes are styled inline via :::explain)
    class A,B,C,D setup
    class F,G,H,I ingestion
    class K,L,M embedding
    class O,P storage
    class R,S,T,U retrieval
    class W,X,Y,Z reranking
    class AB,AC,AD generation
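
Step 6 of the diagram combines vector similarity with keyword and position signals into a single relevance score. A minimal sketch of such a weighted reranker is shown below; the weight values and the candidate structure are illustrative assumptions, not the repository's exact implementation.

import re

def rerank(query, candidates, w_vec=0.6, w_kw=0.3, w_pos=0.1, top_k=3):
    """Rerank FAISS hits using keyword overlap and chunk position.
    candidates: list of dicts with "text", "similarity", "position".
    The weight values here are assumptions for illustration."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for c in candidates:
        chunk_terms = set(re.findall(r"\w+", c["text"].lower()))
        # Fraction of query terms that appear in the chunk
        kw = len(query_terms & chunk_terms) / max(len(query_terms), 1)
        # Chunks that appear earlier in the document get a small boost
        pos = 1.0 / (1 + c["position"])
        scored.append((w_vec * c["similarity"] + w_kw * kw + w_pos * pos, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]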

Key Components

  • Vector Database: FAISS (Facebook AI Similarity Search)
  • Language Model: Llama 3.1:8B (via Ollama)
  • Embeddings: Generated by Llama 3.1:8B through Ollama's embeddings API
  • Text Processing: NLTK and BeautifulSoup
  • Vector Dimensions: 4096-dimensional embeddings
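
For reference, here is a minimal sketch of generating one of these 4096-dimensional embeddings through Ollama's /api/embeddings endpoint, assuming the default local server address.

import numpy as np
import requests

def embed(text, model="llama3.1:8b", host="http://localhost:11434"):
    """Ask a local Ollama server for an embedding of `text`."""
    resp = requests.post(
        f"{host}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    # Llama 3.1:8B produces 4096-dimensional vectors
    return np.array(resp.json()["embedding"], dtype="float32")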

Dependencies

Required Packages

poetry install

Installing Ollama

  1. Follow the instructions at Ollama's official website (https://ollama.com) to install Ollama for your platform.

  2. Pull the Llama 3.1 8B model:

ollama pull llama3.1:8b

Creating the Custom Model

The system uses a custom Ollama model called custom-llama-model. To create this model:

  1. Create a file named ModelFile with the following content:
FROM llama3.1:8b
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
PARAMETER seed 42
PARAMETER top_p 0.9

SYSTEM """
You are an AI assistant that provides accurate, helpful, and concise answers based on the given context.
Your task is to analyze the provided context and answer questions based solely on that information.
If the context doesn't contain the answer, acknowledge that the information isn't available.
Always prioritize information from the provided context over your general knowledge.
"""
  2. Create the model using the following command:
ollama create custom-llama-model -f ModelFile
  3. Verify the model was created:
ollama list

This custom model uses Llama 3.1:8B as its base, but its parameters are tuned for RAG tasks: the low temperature encourages more factual responses, and the enlarged context window (num_ctx 8192) leaves room for retrieved chunks in the prompt.
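
As an illustration, here is a minimal sketch of querying this custom model with retrieved context through Ollama's /api/generate endpoint; the prompt template is an assumption, not the repository's exact wording.

import requests

def answer(question, chunks, model="custom-llama-model",
           host="http://localhost:11434"):
    """Build a question-plus-context prompt and ask the custom model."""
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]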

Running the System

Basic Usage

  1. Clone this repository:
git clone https://github.com/stay22/webrag22.git
cd webrag22
  2. Install dependencies:
poetry install
  3. Run the script:
python main.py
  4. When prompted, enter a URL to analyze:
Enter a webpage URL: https://example.com/some-article
  5. Ask questions about the content:
Ask a question (or type 'exit' to quit): What is the main topic of this article?
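
Internally, the interactive loop presumably looks something like the following sketch; the ingest_url and ask methods are hypothetical stand-ins for whatever main.py actually defines.

def repl(rag):
    """Hypothetical interactive loop matching the prompts shown above."""
    url = input("Enter a webpage URL: ")
    rag.ingest_url(url)  # fetch, chunk, embed, and index the page
    while True:
        question = input("Ask a question (or type 'exit' to quit): ")
        if question.strip().lower() == "exit":
            break
        print(rag.ask(question))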

Command Line Arguments

You can also run the script with a URL directly:

python main.py https://example.com/some-article

Configuration Options

The system provides several configuration options that can be modified in the code:

  • chunk_size: Size of text chunks in characters (default: 500)
  • chunk_overlap: Overlap between chunks in characters (default: 100)
  • use_gpu: Whether to use GPU for FAISS (default: False)
  • index_dir: Directory to store FAISS indexes (default: "./faiss_indexes")

To enable GPU acceleration, install the GPU version of FAISS:

pip uninstall faiss-cpu
pip install faiss-gpu

Then set use_gpu=True when initializing the WebRAG22 class.
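
At the FAISS level, the GPU switch amounts to moving the index onto a GPU resource. Here is a minimal sketch assuming a flat L2 index over the 4096-dimensional embeddings; faiss.index_cpu_to_gpu is the standard FAISS call, while the wrapper function itself is illustrative.

import faiss

def build_index(dim=4096, use_gpu=False):
    """Create a flat L2 index; optionally move it to GPU 0 (needs faiss-gpu)."""
    index = faiss.IndexFlatL2(dim)
    if use_gpu:
        res = faiss.StandardGpuResources()  # allocates GPU scratch memory
        index = faiss.index_cpu_to_gpu(res, 0, index)
    return index

Searching works the same either way: index.search(query_vectors, k) returns the distances and positions of the k nearest stored vectors.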

Detailed Process Flow

sequenceDiagram
    participant U as User
    participant WF as WebFetcher
    participant TP as TextProcessor
    participant CH as Chunker
    participant EM as Embedding Model
    participant FX as FAISS Index
    participant LLM as Language Model
    
    U->>WF: Provide URL
    WF->>TP: HTML Content
    TP->>TP: Remove scripts, styles
    TP->>TP: Extract main content
    TP->>TP: Clean & normalize text
    TP->>CH: Clean text
    
    CH->>CH: Split into sentences
    CH->>CH: Group sentences into chunks
    CH->>CH: Add overlap between chunks
    
    loop For each chunk
        CH->>EM: Send chunk text
        EM->>EM: Generate embedding
        EM->>FX: Store <chunk, embedding>
    end
    
    U->>EM: Ask question
    EM->>EM: Generate query embedding
    EM->>FX: Search similar embeddings
    FX->>FX: Calculate vector similarity
    FX->>FX: Rerank with keywords & position
    FX->>LLM: Top relevant chunks
    
    LLM->>LLM: Generate answer with context
    LLM->>U: Return answer
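
The fetch/clean and chunking steps in this flow could be implemented roughly as in the sketch below, using BeautifulSoup and NLTK's sentence tokenizer; the grouping logic is an assumption consistent with the chunk_size and chunk_overlap options described above.

import nltk
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def fetch_clean(url):
    """Fetch a page and strip scripts, styles, and extra whitespace."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Group sentences into ~chunk_size-character chunks with overlap."""
    sentences = sent_tokenize(text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current.strip())
            # Carry the tail of the previous chunk forward as overlap
            current = current[-chunk_overlap:]
        current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks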

Performance Optimizations

  • Embedding Caching: Prevents redundant embedding generation
  • Batched Processing: Processes chunks in groups for efficiency
  • FAISS for Vector Search: In-memory similarity search, typically much faster than running vector search through a general-purpose database
  • Index Persistence: Save/load capabilities to avoid reprocessing
  • Reranking: Combines vector similarity, keyword matching, and position scoring
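
A minimal sketch of the caching and persistence ideas combined is shown below; the file layout and cache structure are illustrative assumptions, while faiss.write_index and faiss.read_index are the standard FAISS persistence calls.

import hashlib
import json
import faiss

_cache = {}  # sha256(chunk text) -> embedding; avoids repeat Ollama calls

def cached_embed(text, embed_fn):
    """Memoize embeddings so identical chunks are only embedded once."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)
    return _cache[key]

def save(index, metadata, path="./faiss_indexes/page"):
    """Persist the FAISS index and its chunk metadata side by side."""
    faiss.write_index(index, path + ".index")
    with open(path + ".json", "w") as f:
        json.dump(metadata, f)

def load(path="./faiss_indexes/page"):
    """Reload a previously saved index and metadata pair."""
    with open(path + ".json") as f:
        metadata = json.load(f)
    return faiss.read_index(path + ".index"), metadata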

Troubleshooting

Common Issues

  1. "Failed to connect to Ollama server":

    • Make sure Ollama is running: ollama serve
    • Verify the model is downloaded: ollama list
  2. "Error saving index: maximum recursion depth exceeded":

    • This can happen with very large documents
    • Try decreasing the batch size in generate_embeddings(batch_size=10)
  3. Slow performance with large documents:

    • Consider enabling GPU acceleration if you have a compatible GPU
    • Increase batch size for faster processing (if memory permits)
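
For reference, batching in the spirit of generate_embeddings(batch_size=10) might look like the sketch below; the function name comes from the troubleshooting note above, and the body is an assumption.

import numpy as np

def generate_embeddings(chunks, embed_fn, index, batch_size=10):
    """Embed chunks in small batches and add each batch to FAISS."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = np.stack([embed_fn(text) for text in batch])
        index.add(vectors)  # smaller batches lower peak memory use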

