Self-Hosted AI with Ollama

Run Memoid completely locally using Ollama for both LLM and embeddings - no cloud APIs required

What You’ll Build

A fully self-hosted AI memory system that:

  • Runs entirely on your local machine
  • Uses Ollama for LLM and embeddings
  • Stores data in local PostgreSQL with pgvector
  • Requires no external API keys
  • Gives you complete data privacy

Prerequisites

  • Python 3.8+
  • Docker and Docker Compose
  • At least 8GB RAM (16GB recommended)
  • Ollama installed (ollama.ai)

Architecture

+------------------+
|   Your App       |
+--------+---------+
         |
+--------v---------+
|   Memoid API     |
+--------+---------+
         |
    +----+----+
    |         |
+---v---+ +---v----+
|Ollama | |Postgres|
| LLM   | |pgvector|
+-------+ +--------+

Step 1: Install Ollama

Download and install Ollama:

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Or download from https://ollama.ai for Windows

Pull the required models:

# LLM for text generation
ollama pull llama3.1:8b

# Embedding model
ollama pull nomic-embed-text

Verify Ollama is running:

curl http://localhost:11434/api/tags
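
You can also confirm from Python that both models were pulled (a minimal sketch using the /api/tags endpoint; note that Ollama reports model names with a tag suffix such as ":latest"):

import requests

# List installed models and verify the two this tutorial uses are present
tags = requests.get("http://localhost:11434/api/tags").json()
names = [m["name"] for m in tags.get("models", [])]
for required in ("llama3.1:8b", "nomic-embed-text"):
    found = any(n.startswith(required) for n in names)
    print(f"{required}: {'found' if found else 'missing, run ollama pull'}")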

Step 2: Set Up PostgreSQL with pgvector

Create a docker-compose.yml:

version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: memoid
      POSTGRES_PASSWORD: memoid
      POSTGRES_DB: memoid
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Start PostgreSQL:

docker-compose up -d
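
Before moving on, you can verify the container is reachable and ships pgvector (a quick sketch, using the credentials from docker-compose.yml):

import psycopg2

# Connect with the credentials from docker-compose.yml and confirm
# the pgvector extension is available in this image
conn = psycopg2.connect(host="localhost", port=5432, user="memoid",
                        password="memoid", database="memoid")
with conn.cursor() as cur:
    cur.execute("SELECT default_version FROM pg_available_extensions WHERE name = 'vector'")
    print("pgvector available:", cur.fetchone())
conn.close()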

Step 3: Configure Memoid for Local Use

Create a configuration file config.py:

MEMOID_CONFIG = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:8b",
            "temperature": 0.1,
            "max_tokens": 2000,
            "ollama_base_url": "http://localhost:11434"
        }
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "ollama_base_url": "http://localhost:11434"
        }
    },
    "vector_store": {
        "provider": "pgvector",
        "config": {
            "host": "localhost",
            "port": 5432,
            "user": "memoid",
            "password": "memoid",
            "database": "memoid",
            "embedding_model_dims": 768  # nomic-embed-text dimensions
        }
    }
}
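
A mismatch between the embedder's output size and embedding_model_dims will only surface later as insert errors, so a one-off sanity check is worthwhile (a sketch that calls the same Ollama endpoint the memory system uses):

import requests
from config import MEMOID_CONFIG

embed_cfg = MEMOID_CONFIG["embedder"]["config"]
dims = MEMOID_CONFIG["vector_store"]["config"]["embedding_model_dims"]

# Embed a throwaway string and compare its length to the configured dims
emb = requests.post(
    f"{embed_cfg['ollama_base_url']}/api/embeddings",
    json={"model": embed_cfg["model"], "prompt": "dimension check"},
).json()["embedding"]
assert len(emb) == dims, f"model returns {len(emb)} dims, config says {dims}"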

Step 4: Build the Local Memory System

Create local_memory.py:

import requests
import psycopg2
from psycopg2.extras import RealDictCursor
import json
import re
import uuid

class LocalMemorySystem:
    def __init__(self, config):
        self.ollama_url = config["llm"]["config"]["ollama_base_url"]
        self.llm_model = config["llm"]["config"]["model"]
        self.embed_model = config["embedder"]["config"]["model"]

        # Connect to PostgreSQL
        pg_config = config["vector_store"]["config"]
        self.conn = psycopg2.connect(
            host=pg_config["host"],
            port=pg_config["port"],
            user=pg_config["user"],
            password=pg_config["password"],
            database=pg_config["database"]
        )

        self._init_db()

    def _init_db(self):
        """Initialize the database schema."""
        with self.conn.cursor() as cur:
            # Enable pgvector extension
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

            # Create memories table
            cur.execute("""
                CREATE TABLE IF NOT EXISTS memories (
                    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
                    user_id VARCHAR(255),
                    memory TEXT NOT NULL,
                    embedding vector(768),
                    metadata JSONB DEFAULT '{}',
                    created_at TIMESTAMP DEFAULT NOW()
                )
            """)

            # Create index for vector similarity search. Note: ivfflat
            # computes its list centroids at build time, so pgvector
            # recommends building after the table has data; rebuild this
            # index once real memories are loaded for better recall.
            cur.execute("""
                CREATE INDEX IF NOT EXISTS memories_embedding_idx
                ON memories USING ivfflat (embedding vector_cosine_ops)
                WITH (lists = 100)
            """)

            self.conn.commit()

    def _get_embedding(self, text: str) -> list:
        """Generate embedding using Ollama."""
        response = requests.post(
            f"{self.ollama_url}/api/embeddings",
            json={
                "model": self.embed_model,
                "prompt": text
            }
        )
        return response.json()["embedding"]

    def _generate_response(self, prompt: str) -> str:
        """Generate text using Ollama LLM."""
        response = requests.post(
            f"{self.ollama_url}/api/generate",
            json={
                "model": self.llm_model,
                "prompt": prompt,
                "stream": False
            }
        )
        return response.json()["response"]

    def _extract_facts(self, text: str) -> list:
        """Extract facts from text using LLM."""
        prompt = f"""Extract key facts from this conversation as a JSON array.
Each fact should be a single, standalone statement.

Text: {text}

Return only the JSON array, no other text.
Example: ["User likes coffee", "User works as an engineer"]"""

        response = self._generate_response(prompt)

        try:
            # Parse the JSON array out of the response; the brackets
            # must be escaped so the regex matches a literal [...] span
            json_match = re.search(r'\[.*\]', response, re.DOTALL)
            if json_match:
                return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass

        return [text]  # Fallback: store original text

    def add(self, messages: list, user_id: str, metadata: dict = None):
        """Add memories from a conversation."""
        # Combine messages into text
        text = "\n".join(
            f"{m['role']}: {m['content']}"
            for m in messages
        )

        # Extract facts
        facts = self._extract_facts(text)

        stored = []
        with self.conn.cursor() as cur:
            for fact in facts:
                embedding = self._get_embedding(fact)
                memory_id = str(uuid.uuid4())

                cur.execute("""
                    INSERT INTO memories (id, user_id, memory, embedding, metadata)
                    VALUES (%s, %s, %s, %s, %s)
                """, (memory_id, user_id, fact, embedding, json.dumps(metadata or {})))

                stored.append({"id": memory_id, "memory": fact})

            self.conn.commit()

        return {"memories": stored}

    def search(self, query: str, user_id: str, limit: int = 5, threshold: float = 0.7):
        """Search for similar memories."""
        embedding = self._get_embedding(query)

        with self.conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute("""
                SELECT
                    id,
                    memory,
                    metadata,
                    1 - (embedding <=> %s::vector) as score
                FROM memories
                WHERE user_id = %s
                AND 1 - (embedding <=> %s::vector) > %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s
            """, (embedding, user_id, embedding, threshold, embedding, limit))

            results = cur.fetchall()

        return {"results": [dict(r) for r in results]}

    def get_all(self, user_id: str, limit: int = 100):
        """Get all memories for a user."""
        with self.conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute("""
                SELECT id, memory, metadata, created_at
                FROM memories
                WHERE user_id = %s
                ORDER BY created_at DESC
                LIMIT %s
            """, (user_id, limit))

            results = cur.fetchall()

        return {"results": [dict(r) for r in results]}

    def delete(self, memory_id: str):
        """Delete a specific memory."""
        with self.conn.cursor() as cur:
            cur.execute("DELETE FROM memories WHERE id = %s", (memory_id,))
            self.conn.commit()
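
With the services from Steps 1 and 2 running, a quick smoke test of the class (hypothetical example data):

from config import MEMOID_CONFIG
from local_memory import LocalMemorySystem

memory = LocalMemorySystem(MEMOID_CONFIG)
memory.add(
    messages=[{"role": "user", "content": "I'm vegetarian and I love hiking"}],
    user_id="test_user"
)
print(memory.search("food preferences", user_id="test_user"))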

Step 5: Create a Chat Application

Create chat.py:

from config import MEMOID_CONFIG
from local_memory import LocalMemorySystem
import requests

class LocalChatbot:
    def __init__(self):
        self.memory = LocalMemorySystem(MEMOID_CONFIG)
        self.ollama_url = MEMOID_CONFIG["llm"]["config"]["ollama_base_url"]
        self.model = MEMOID_CONFIG["llm"]["config"]["model"]

    def chat(self, user_id: str, message: str) -> str:
        # Search for relevant memories
        memories = self.memory.search(message, user_id, limit=5)
        context = "\n".join(
            f"- {m['memory']}"
            for m in memories.get("results", [])
        )

        # Build prompt with context
        prompt = f"""You are a helpful assistant with memory of past conversations.

Relevant memories:
{context if context else "No relevant memories yet."}

User: {message}
Assistant:"""

        # Generate response
        response = requests.post(
            f"{self.ollama_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False
            }
        )
        answer = response.json()["response"]

        # Store the conversation
        self.memory.add(
            messages=[
                {"role": "user", "content": message},
                {"role": "assistant", "content": answer}
            ],
            user_id=user_id
        )

        return answer


def main():
    bot = LocalChatbot()
    user_id = "local_user"

    print("Local AI Chatbot (powered by Ollama)")
    print("=" * 40)
    print("Type 'quit' to exit, 'memories' to see stored memories\n")

    while True:
        message = input("You: ").strip()

        if message.lower() == "quit":
            break

        if message.lower() == "memories":
            memories = bot.memory.get_all(user_id)
            print("\nStored Memories:")
            for m in memories.get("results", []):
                print(f"  - {m['memory']}")
            print()
            continue

        if message:
            response = bot.chat(user_id, message)
            print(f"Bot: {response}\n")


if __name__ == "__main__":
    main()

Running the System

  1. Start PostgreSQL:
docker-compose up -d
  2. Ensure Ollama is running:
ollama serve
  3. Run the chatbot:
python chat.py

Performance Tuning

Model Selection

Model          RAM Required   Speed       Quality
llama3.1:8b    8GB            Fast        Good
llama3.1:70b   48GB           Slow        Excellent
mistral:7b     6GB            Fast        Good
phi3:mini      4GB            Very Fast   Moderate

Embedding Models

Model                    Dimensions   Quality
nomic-embed-text         768          Good
mxbai-embed-large        1024         Better
snowflake-arctic-embed   1024         Best
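
If you switch embedders, the vector dimensions must change everywhere at once: in the config and in the vector(...) column of the memories table. A sketch of the config side (mxbai-embed-large's 1024 dimensions as the assumed example):

from config import MEMOID_CONFIG

# Assumed example: mxbai-embed-large produces 1024-dim vectors
MEMOID_CONFIG["embedder"]["config"]["model"] = "mxbai-embed-large"
MEMOID_CONFIG["vector_store"]["config"]["embedding_model_dims"] = 1024
# The memories table must also be recreated with embedding vector(1024);
# existing 768-dim rows cannot be reused and need re-embedding.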

Database Optimization

-- Increase per-query working memory for sorts during search
SET work_mem = '256MB';

-- Rebuild the index with more lists as the dataset grows
DROP INDEX IF EXISTS memories_embedding_idx;
CREATE INDEX memories_embedding_idx ON memories
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
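
Query-time recall for an ivfflat index depends on how many lists are probed; pgvector exposes this as the ivfflat.probes setting. It can be set per session from Python (10 is an illustrative value; higher means better recall but slower queries):

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, user="memoid",
                        password="memoid", database="memoid")
with conn.cursor() as cur:
    # Applies to subsequent vector searches on this connection
    cur.execute("SET ivfflat.probes = 10")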

Docker Deployment

Create a complete docker-compose.yml for production:

version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: memoid
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-memoid}
      POSTGRES_DB: memoid
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: unless-stopped

  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  app:
    build: .
    environment:
      OLLAMA_URL: http://ollama:11434
      POSTGRES_HOST: postgres
    depends_on:
      - postgres
      - ollama
    restart: unless-stopped

volumes:
  pgdata:
  ollama:
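
The app service receives its endpoints through environment variables, so the hard-coded localhost values in config.py should yield to them when present. A sketch of that override (OLLAMA_URL and POSTGRES_HOST are the variable names set in the compose file above):

import os
from config import MEMOID_CONFIG

ollama_url = os.getenv("OLLAMA_URL", "http://localhost:11434")

# Inside the compose network, services are reached by service name
MEMOID_CONFIG["llm"]["config"]["ollama_base_url"] = ollama_url
MEMOID_CONFIG["embedder"]["config"]["ollama_base_url"] = ollama_url
MEMOID_CONFIG["vector_store"]["config"]["host"] = os.getenv("POSTGRES_HOST", "localhost")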

Security Considerations

  1. Network Isolation: Keep services on internal network
  2. Database Credentials: Use strong passwords
  3. Model Selection: Review model licenses for commercial use
  4. Data Encryption: Enable PostgreSQL SSL for production (see the snippet below)
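
On the client side, psycopg2 can insist on TLS via the sslmode parameter (a sketch; the server must be configured with certificates separately):

import psycopg2

# sslmode="require" refuses unencrypted connections
conn = psycopg2.connect(host="localhost", port=5432, user="memoid",
                        password="memoid", database="memoid",
                        sslmode="require")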

Next Steps

  • Add GPU acceleration for faster inference
  • Implement model fine-tuning with your data
  • Set up backup and recovery for PostgreSQL
  • Add monitoring with Prometheus/Grafana
