Setting up an AI/LLM Stack in Haiku: A Practical Guide, Part II
Using AI Components in Haiku
In the previous post we saw which tools and frameworks are available in Haiku and how to install them. Now let’s see how to use them with some practical examples. You can find the code used in this post in this repository.
Connecting to llama.cpp via OpenAI API and LangChain
from langchain_openai import OpenAI
from langchain.llms.llamacpp import LlamaCpp
import os

# Option 1: Connect to local llama.cpp
def connect_local_llama():
    # Path to the downloaded model (e.g., Mistral 7B)
    model_path = "/Dati/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"

    # Parameters optimized for CPU
    llm = LlamaCpp(
        model_path=model_path,
        temperature=0.7,
        max_tokens=512,
        n_ctx=2048,
        n_batch=512,  # Reduced batch size for CPU
        verbose=False
    )
    return llm

# Option 2: Connect to an OpenAI-compatible API
def connect_openai_compatible(base_url="http://localhost:8000/v1"):
    # Configuration for an OpenAI-compatible API server
    os.environ["OPENAI_API_KEY"] = "not-needed-for-local"
    os.environ["OPENAI_API_BASE"] = base_url

    llm = OpenAI(
        temperature=0.7,
        model_name="mistral-7b"  # The name must match the one configured in the server
    )
    return llm

# Test the connection
def test_connection(llm):
    result = llm.invoke("What are Large Language Models?")
    print(result)

# Example of usage
if __name__ == "__main__":
    # Choose one of the options
    llm = connect_local_llama()
    # llm = connect_openai_compatible()
    test_connection(llm)
In this first example, I’m demonstrating how to establish a connection with a llama.cpp model. There are two main approaches:
- Direct connection to local llama.cpp: This option uses the model directly from the llama-cpp-python library, which we installed previously. It’s useful when we want to run the model completely locally.
- Connection to an OpenAI-compatible API: If we have a local or remote server that exposes an OpenAI-compatible API (like llama.cpp with the server option), we can use this configuration.
The parameters have been optimized for CPU execution, reducing the batch size and limiting the context to improve performance. Obviously, on Haiku, the absence of GPU acceleration will impose limitations on the size of usable models, but lighter models like quantized Mistral 7B can work acceptably well.
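On a CPU-only system the thread count matters as much as the batch size. Below is a minimal sketch, assuming the same model path as above, that pins llama.cpp to the number of cores Haiku reports; the helper name connect_local_llama_tuned is mine, while n_threads is a regular llama-cpp-python parameter exposed by the LangChain wrapper.

import os
from langchain.llms.llamacpp import LlamaCpp

def connect_local_llama_tuned():
    # Hypothetical variant of connect_local_llama() that also sets the thread count
    model_path = "/Dati/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
    llm = LlamaCpp(
        model_path=model_path,
        temperature=0.7,
        max_tokens=512,
        n_ctx=2048,
        n_batch=512,
        n_threads=os.cpu_count(),  # use all the cores the OS reports
        verbose=False
    )
    return llm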
Creating Embeddings Using llama.cpp and FAISS
import numpy as np
from langchain_community.embeddings import LlamaCppEmbeddings
import faiss

# Configure embeddings
def setup_embeddings():
    # Path to the embedding model
    embed_model_path = "/Dati/models/ggml-all-MiniLM-L6-v2-f16.gguf"

    # Initialize the embedding model
    embeddings = LlamaCppEmbeddings(
        model_path=embed_model_path,
        n_ctx=512,   # Reduced context to improve performance
        n_batch=32,  # Reduced batch for CPU
        verbose=False
    )
    return embeddings

# Create a FAISS index
def create_faiss_index(documents, embeddings):
    # Get embeddings for each document
    embedded_documents = []
    for doc in documents:
        embed = embeddings.embed_query(doc)
        embedded_documents.append(embed)

    # Convert to numpy array
    vectors = np.array(embedded_documents).astype('float32')

    # Create the FAISS index
    dimension = len(vectors[0])
    index = faiss.IndexFlatL2(dimension)  # Simple L2 index, efficient on CPU

    # Add vectors to the index
    index.add(vectors)
    return index, documents

# Example usage
if __name__ == "__main__":
    # Sample documents
    documents = [
        "Haiku is an open source operating system inspired by BeOS",
        "Python is a versatile and powerful programming language",
        "Large Language Models are neural network-based models",
        "FAISS is a library for vector similarity search",
        "The CPU (Central Processing Unit) is the brain of a computer"
    ]

    # Configure embeddings
    embeddings = setup_embeddings()

    # Create the FAISS index
    index, docs = create_faiss_index(documents, embeddings)
    print(f"FAISS index created with {index.ntotal} vectors of dimension {index.d}")
In this example, I’m demonstrating how to create embeddings (vector representations of text) using llama.cpp and how to use FAISS to index them. Embeddings are fundamental for many AI applications, such as semantic search, recommendation systems, and RAG (Retrieval Augmented Generation).
I’m using a lightweight embedding model based on MiniLM, which is efficient even on CPUs. For Haiku, it’s important to select quantized embedding models (like those in GGUF format) to reduce memory consumption and improve performance.
The FAISS configuration has been kept simple, using a flat L2 index that works well on CPU without requiring excessive memory. For larger collections, it might be necessary to explore other types of FAISS indices that offer a better trade-off between speed and precision.
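As a pointer in that direction, here is a minimal sketch of an IVF (inverted file) index, which partitions the vectors into clusters and only searches a few of them per query. It is only worthwhile once the collection grows to thousands of documents; the nlist and nprobe values below are arbitrary example settings, and the vectors array is assumed to be the same float32 matrix built above.

import faiss
import numpy as np

def create_ivf_index(vectors, nlist=16):
    # vectors: float32 numpy array of shape (n_documents, dimension)
    dimension = vectors.shape[1]
    quantizer = faiss.IndexFlatL2(dimension)           # coarse quantizer for the clusters
    index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
    index.train(vectors)                               # IVF indices must be trained; needs far more vectors than nlist
    index.add(vectors)
    index.nprobe = 4                                   # clusters visited per query: speed/recall trade-off
    return index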
Content Retrieval via Semantic Query
import numpy as np
from langchain_community.embeddings import LlamaCppEmbeddings
import faiss

def semantic_search(query, index, documents, embeddings, k=3):
    """
    Performs a semantic search using FAISS

    Args:
        query (str): The search query
        index (faiss.Index): The FAISS index
        documents (list): List of original documents
        embeddings (LlamaCppEmbeddings): Embedding model
        k (int): Number of results to return

    Returns:
        list: The k most relevant documents
    """
    # Create the query embedding
    query_vector = embeddings.embed_query(query)

    # Convert the embedding to a numpy array
    query_vector = np.array([query_vector]).astype('float32')

    # Search in the FAISS index
    distances, indices = index.search(query_vector, k)

    # Collect the results
    results = []
    for i, idx in enumerate(indices[0]):
        if idx != -1:  # -1 indicates that not enough results were found
            results.append({
                'document': documents[idx],
                'score': float(distances[0][i])  # Convert to standard Python float
            })
    return results

# Example usage
def demo_semantic_search():
    # Sample documents
    documents = [
        "Haiku is an open source operating system inspired by BeOS",
        "Python is a versatile and powerful programming language",
        "Large Language Models are models based on neural networks",
        "FAISS is a library for vector similarity search",
        "The CPU (Central Processing Unit) is the brain of a computer",
        "Neural networks are the foundation of modern artificial intelligence",
        "BeOS was an advanced multimedia operating system for its time",
        "Artificial intelligence requires powerful computing capabilities"
    ]

    # Configure embeddings
    embed_model_path = "/Dati/models/ggml-all-MiniLM-L6-v2-f16.gguf"
    embeddings = LlamaCppEmbeddings(
        model_path=embed_model_path,
        n_ctx=512,
        n_batch=32,
        verbose=False
    )

    # Create embeddings and FAISS index
    embedded_documents = []
    for doc in documents:
        embed = embeddings.embed_query(doc)
        embedded_documents.append(embed)

    vectors = np.array(embedded_documents).astype('float32')
    dimension = len(vectors[0])
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)

    # Run a sample query
    query = "What is the relationship between AI and computing power?"
    results = semantic_search(query, index, documents, embeddings, k=3)

    print("Query:", query)
    print("Most relevant results:")
    for i, result in enumerate(results):
        print(f"{i+1}. {result['document']} (score: {result['score']:.4f})")

if __name__ == "__main__":
    demo_semantic_search()
In this third example, I’m showing how to use embeddings and FAISS to implement semantic search. Semantic search allows us to find relevant documents based on meaning rather than simple keyword matching.
The process is relatively simple:
- We convert the query into an embedding vector
- We use FAISS to find the most similar vectors in the index
- We retrieve the original documents corresponding to the found vectors
An interesting feature of FAISS is that it also returns a distance score (in this case, the L2 distance), which indicates how relevant the document is to the query. The lower the score, the more semantically similar the document is to the query.
On Haiku, this type of semantic search works well even with the CPU, especially for moderately sized document collections.
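If you prefer a score where higher means more relevant, one common alternative is to L2-normalize the vectors and use an inner-product index, which then behaves like cosine similarity. This is only a sketch under that assumption; build_cosine_index and cosine_search are illustrative names, and vectors/query_vector are assumed to be the same float32 arrays used above.

import numpy as np
import faiss

def build_cosine_index(vectors):
    # vectors: float32 numpy array of shape (n_documents, dimension)
    faiss.normalize_L2(vectors)                  # normalize in place to unit length
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on unit vectors
    index.add(vectors)
    return index

def cosine_search(index, query_vector, k=3):
    # query_vector: float32 numpy array of shape (1, dimension)
    faiss.normalize_L2(query_vector)
    scores, indices = index.search(query_vector, k)
    return scores, indices  # here a higher score means a more similar document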
Complete Implementation: Simple RAG System
Now let’s put together all the components we’ve seen so far to create a simple but functional RAG (Retrieval Augmented Generation) system:
import numpy as np
import os
from langchain.llms.llamacpp import LlamaCpp
from langchain_community.embeddings import LlamaCppEmbeddings
from langchain.prompts import PromptTemplate
import faiss

class SimpleRAGSystem:
    def __init__(self, llm_model_path, embed_model_path):
        """
        Initializes the RAG system

        Args:
            llm_model_path (str): Path to the LLM model
            embed_model_path (str): Path to the embedding model
        """
        # Initialize the LLM model
        self.llm = LlamaCpp(
            model_path=llm_model_path,
            temperature=0.7,
            max_tokens=512,
            n_ctx=2048,
            n_batch=512,
            verbose=False
        )

        # Initialize the embedding model
        self.embeddings = LlamaCppEmbeddings(
            model_path=embed_model_path,
            n_ctx=512,
            n_batch=32,
            verbose=False
        )

        # Prompt template for RAG
        self.prompt_template = PromptTemplate(
            input_variables=["context", "question"],
            template="""
Use the following context to answer the question.

Context:
{context}

Question: {question}

Answer:
"""
        )

        # Storage for documents and FAISS index
        self.documents = []
        self.index = None
        self.is_index_built = False

    def add_documents(self, documents):
        """
        Adds documents to the knowledge base

        Args:
            documents (list): List of documents to add
        """
        self.documents.extend(documents)
        self.is_index_built = False

    def build_index(self):
        """
        Builds the FAISS index from documents
        """
        if not self.documents:
            print("No documents to index")
            return

        # Create embeddings for all documents
        embedded_documents = []
        for doc in self.documents:
            embed = self.embeddings.embed_query(doc)
            embedded_documents.append(embed)

        # Convert to numpy array
        vectors = np.array(embedded_documents).astype('float32')

        # Create the FAISS index
        dimension = len(vectors[0])
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(vectors)

        self.is_index_built = True
        print(f"FAISS index built with {len(self.documents)} documents")

    def retrieve(self, query, top_k=3):
        """
        Retrieves the most relevant documents for the query

        Args:
            query (str): The search query
            top_k (int): Number of documents to retrieve

        Returns:
            list: The most relevant documents
        """
        if not self.is_index_built:
            self.build_index()

        # Create the query embedding
        query_vector = self.embeddings.embed_query(query)
        query_vector = np.array([query_vector]).astype('float32')

        # Search in the FAISS index
        distances, indices = self.index.search(query_vector, top_k)

        # Collect the results
        retrieved_docs = []
        for i, idx in enumerate(indices[0]):
            if idx != -1:
                retrieved_docs.append(self.documents[idx])
        return retrieved_docs

    def answer(self, question):
        """
        Answers a question using the RAG system

        Args:
            question (str): The question

        Returns:
            str: The generated answer
        """
        # Retrieve relevant documents
        retrieved_docs = self.retrieve(question)

        if not retrieved_docs:
            return "I couldn't find relevant information to answer your question."

        # Join the retrieved documents into a single context
        context = "\n\n".join(retrieved_docs)

        # Create the complete prompt
        prompt = self.prompt_template.format(context=context, question=question)

        # Generate the response
        response = self.llm.invoke(prompt)
        return response

# Example usage
if __name__ == "__main__":
    # Paths to models
    llm_model_path = "/Dati/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
    embed_model_path = "/Dati/models/ggml-all-MiniLM-L6-v2-f16.gguf"

    # Create the RAG system
    rag_system = SimpleRAGSystem(llm_model_path, embed_model_path)

    # Add sample documents
    documents = [
        "Haiku is an open-source operating system inspired by BeOS, born in 2001 as a spiritual continuation project of BeOS.",
        "BeOS was a highly advanced multimedia operating system for the '90s, developed by Be Inc.",
        "Haiku uses a block-level file system called BFS (Be File System) that supports extended attributes and indexing.",
        "The CPU (Central Processing Unit) is the main component of a computer responsible for executing calculations.",
        "GPUs (Graphics Processing Units) are specialized processors for graphics rendering and parallel calculations.",
        "Large Language Models (LLMs) are artificial intelligence models trained on enormous amounts of text.",
        "FAISS (Facebook AI Similarity Search) is a library for efficient vector similarity search in large datasets.",
        "Python is an interpreted, high-level, and general-purpose programming language created by Guido van Rossum.",
        "Machine learning is a subset of artificial intelligence that focuses on automatic learning from data.",
        "RAG (Retrieval Augmented Generation) is a technique that combines information retrieval with text generation."
    ]
    rag_system.add_documents(documents)

    # Test the RAG system
    questions = [
        "What is Haiku and what is its origin?",
        "How do Large Language Models work?",
        "What are the differences between CPU and GPU?",
        "What is RAG and how does it work?"
    ]

    for q in questions:
        print("\nQuestion:", q)
        print("\nAnswer:", rag_system.answer(q))
In this complete example, I’ve created a SimpleRAGSystem class that:
- Initializes both an LLM and an embedding model
- Manages a collection of documents and builds a FAISS index for fast retrieval
- Provides methods to retrieve relevant documents based on a query
- Generates responses to questions using the retrieved context
The RAG process works in two main steps:
- Retrieval: When a question is asked, the system finds the most relevant documents in its knowledge base using the semantic similarity between the question and the documents.
- Generation: The system then uses these relevant documents as context to help the LLM generate a more informed and accurate response.
This approach is particularly useful on Haiku because it allows us to leverage external knowledge without loading all the information into the LLM’s context, which would require more memory and processing power.
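Since re-embedding every document is the slowest step on a CPU-only machine, it can also be worth persisting the index between runs. The sketch below is one way to do that with FAISS’s write_index/read_index; the helper names and the file path are illustrative, and the original documents list still has to be saved separately (for example as JSON).

import faiss

def save_rag_index(rag_system, path="/Dati/indexes/rag.index"):
    # Persist the FAISS index built by SimpleRAGSystem (the target directory must exist)
    if not rag_system.is_index_built:
        rag_system.build_index()
    faiss.write_index(rag_system.index, path)

def load_rag_index(rag_system, path="/Dati/indexes/rag.index"):
    # Reload a previously saved index instead of re-embedding every document at startup
    rag_system.index = faiss.read_index(path)
    rag_system.is_index_built = True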
Final Considerations
I’ve shown how to configure and use an AI/LLM stack in Haiku, demonstrating that it’s possible to run language models and semantic retrieval systems even on a lightweight operating system without GPU acceleration.
LLM models, even quantized ones, require a significant amount of memory. For an acceptable experience, I recommend:
- Using quantized models: Models in GGUF format with 4-bit or 3-bit quantization can significantly reduce memory consumption.
- Limiting the context: Reducing the context size (n_ctx) is one of the most effective ways to reduce memory usage.
- Considering cloud services: For larger models or intensive workloads, it’s always possible to use AI cloud services through their APIs.
From my experiments, the memory footprint when running the SimpleRAGSystem is around 1.7GB.
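As a rough illustration of the first two recommendations, here is a lower-memory configuration sketch. The 3-bit model filename is hypothetical (use whatever quantization you actually download), and the real savings depend on the model and quantization chosen.

from langchain.llms.llamacpp import LlamaCpp

def connect_low_memory_llama():
    # Hypothetical smaller quantization with a halved context window
    model_path = "/Dati/models/mistral-7b-instruct-v0.2.Q3_K_M.gguf"  # example filename
    llm = LlamaCpp(
        model_path=model_path,
        temperature=0.7,
        max_tokens=256,
        n_ctx=1024,   # a smaller context means a smaller KV cache
        n_batch=256,
        verbose=False
    )
    return llm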
Despite the limitations, Haiku offers a stable and responsive environment for AI experimentation, especially for educational projects or developing applications that require moderate-sized language models.