Self-Hosted AI with Ollama
Run Memoid completely locally using Ollama for both LLM and embeddings - no cloud APIs required
What You’ll Build
A fully self-hosted AI memory system that:
- Runs entirely on your local machine
- Uses Ollama for LLM and embeddings
- Stores data in local PostgreSQL with pgvector
- Requires no external API keys
- Gives you complete data privacy
Prerequisites
- Python 3.8+
- Docker and Docker Compose
- At least 8GB RAM (16GB recommended)
- Ollama installed (ollama.ai)
Architecture
+------------------+
|     Your App     |
+--------+---------+
         |
+--------v---------+
|    Memoid API    |
+--------+---------+
         |
    +----+----+
    |         |
+---v---+ +---v----+
|Ollama | |Postgres|
| LLM   | |pgvector|
+-------+ +--------+
Step 1: Install Ollama
Download and install Ollama:
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Or download from https://ollama.ai for Windows
Pull the required models:
# LLM for text generation
ollama pull llama3.1:8b
# Embedding model
ollama pull nomic-embed-text
Verify Ollama is running:
curl http://localhost:11434/api/tags
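Optionally, run a quick Python check that Ollama answers and that nomic-embed-text returns 768-dimensional vectors, which is the dimension we configure for pgvector below (a sketch assuming the default Ollama port):
import requests

# Ask Ollama for an embedding and check its dimensionality.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "hello"},
)
resp.raise_for_status()
print("Embedding dimensions:", len(resp.json()["embedding"]))  # expected: 768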
Step 2: Set Up PostgreSQL with pgvector
Create a docker-compose.yml:
version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: memoid
      POSTGRES_PASSWORD: memoid
      POSTGRES_DB: memoid
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
Start PostgreSQL:
docker-compose up -d
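Optionally, confirm from Python that the database is reachable and the pgvector extension can be created (a quick sanity check using the credentials from the compose file above):
import psycopg2

# Connect with the credentials defined in docker-compose.yml.
conn = psycopg2.connect(
    host="localhost", port=5432,
    user="memoid", password="memoid", database="memoid"
)
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
    print("pgvector version:", cur.fetchone()[0])
conn.commit()
conn.close()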
Step 3: Configure Memoid for Local Use
Create a configuration file config.py:
MEMOID_CONFIG = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:8b",
            "temperature": 0.1,
            "max_tokens": 2000,
            "ollama_base_url": "http://localhost:11434"
        }
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "ollama_base_url": "http://localhost:11434"
        }
    },
    "vector_store": {
        "provider": "pgvector",
        "config": {
            "host": "localhost",
            "port": 5432,
            "user": "memoid",
            "password": "memoid",
            "database": "memoid",
            "embedding_model_dims": 768  # nomic-embed-text dimensions
        }
    }
}
Step 4: Build the Local Memory System
Create local_memory.py:
import json
import re
import uuid

import psycopg2
import requests
from psycopg2.extras import RealDictCursor


class LocalMemorySystem:
    def __init__(self, config):
        self.ollama_url = config["llm"]["config"]["ollama_base_url"]
        self.llm_model = config["llm"]["config"]["model"]
        self.embed_model = config["embedder"]["config"]["model"]

        # Connect to PostgreSQL
        pg_config = config["vector_store"]["config"]
        self.conn = psycopg2.connect(
            host=pg_config["host"],
            port=pg_config["port"],
            user=pg_config["user"],
            password=pg_config["password"],
            database=pg_config["database"]
        )
        self._init_db()

    def _init_db(self):
        """Initialize the database schema."""
        with self.conn.cursor() as cur:
            # Enable pgvector extension
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

            # Create memories table
            cur.execute("""
                CREATE TABLE IF NOT EXISTS memories (
                    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
                    user_id VARCHAR(255),
                    memory TEXT NOT NULL,
                    embedding vector(768),
                    metadata JSONB DEFAULT '{}',
                    created_at TIMESTAMP DEFAULT NOW()
                )
            """)

            # Create index for vector similarity search
            cur.execute("""
                CREATE INDEX IF NOT EXISTS memories_embedding_idx
                ON memories USING ivfflat (embedding vector_cosine_ops)
                WITH (lists = 100)
            """)
        self.conn.commit()

    def _get_embedding(self, text: str) -> list:
        """Generate an embedding using Ollama."""
        response = requests.post(
            f"{self.ollama_url}/api/embeddings",
            json={
                "model": self.embed_model,
                "prompt": text
            }
        )
        return response.json()["embedding"]

    def _generate_response(self, prompt: str) -> str:
        """Generate text using the Ollama LLM."""
        response = requests.post(
            f"{self.ollama_url}/api/generate",
            json={
                "model": self.llm_model,
                "prompt": prompt,
                "stream": False
            }
        )
        return response.json()["response"]

    def _extract_facts(self, text: str) -> list:
        """Extract facts from text using the LLM."""
        prompt = f"""Extract key facts from this conversation as a JSON array.
Each fact should be a single, standalone statement.

Text: {text}

Return only the JSON array, no other text.
Example: ["User likes coffee", "User works as an engineer"]"""

        response = self._generate_response(prompt)
        try:
            # Parse the JSON array from the response
            json_match = re.search(r'\[.*\]', response, re.DOTALL)
            if json_match:
                return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass
        return [text]  # Fallback: store the original text

    def add(self, messages: list, user_id: str, metadata: dict = None):
        """Add memories from a conversation."""
        # Combine messages into a single text block
        text = "\n".join(
            f"{m['role']}: {m['content']}"
            for m in messages
        )

        # Extract facts
        facts = self._extract_facts(text)

        stored = []
        with self.conn.cursor() as cur:
            for fact in facts:
                embedding = self._get_embedding(fact)
                memory_id = str(uuid.uuid4())
                cur.execute("""
                    INSERT INTO memories (id, user_id, memory, embedding, metadata)
                    VALUES (%s, %s, %s, %s::vector, %s)
                """, (memory_id, user_id, fact, embedding, json.dumps(metadata or {})))
                stored.append({"id": memory_id, "memory": fact})
        self.conn.commit()
        return {"memories": stored}

    def search(self, query: str, user_id: str, limit: int = 5, threshold: float = 0.7):
        """Search for similar memories using cosine similarity."""
        embedding = self._get_embedding(query)
        with self.conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute("""
                SELECT
                    id,
                    memory,
                    metadata,
                    1 - (embedding <=> %s::vector) AS score
                FROM memories
                WHERE user_id = %s
                  AND 1 - (embedding <=> %s::vector) > %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s
            """, (embedding, user_id, embedding, threshold, embedding, limit))
            results = cur.fetchall()
        return {"results": [dict(r) for r in results]}

    def get_all(self, user_id: str, limit: int = 100):
        """Get all memories for a user."""
        with self.conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute("""
                SELECT id, memory, metadata, created_at
                FROM memories
                WHERE user_id = %s
                ORDER BY created_at DESC
                LIMIT %s
            """, (user_id, limit))
            results = cur.fetchall()
        return {"results": [dict(r) for r in results]}

    def delete(self, memory_id: str):
        """Delete a specific memory."""
        with self.conn.cursor() as cur:
            cur.execute("DELETE FROM memories WHERE id = %s", (memory_id,))
        self.conn.commit()
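Before wiring this into a chat loop, here is a minimal usage sketch of the class above (the example messages and user id are purely illustrative):
from config import MEMOID_CONFIG
from local_memory import LocalMemorySystem

memory = LocalMemorySystem(MEMOID_CONFIG)

# Store a short conversation; facts are extracted by the local LLM.
memory.add(
    messages=[
        {"role": "user", "content": "I'm allergic to peanuts and I love hiking."},
        {"role": "assistant", "content": "Got it, I'll keep that in mind."},
    ],
    user_id="local_user",
)

# Retrieve memories relevant to a new question.
results = memory.search("What foods should I avoid?", user_id="local_user")
for r in results["results"]:
    print(f"{r['score']:.2f}  {r['memory']}")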
Step 5: Create a Chat Application
Create chat.py:
import requests

from config import MEMOID_CONFIG
from local_memory import LocalMemorySystem


class LocalChatbot:
    def __init__(self):
        self.memory = LocalMemorySystem(MEMOID_CONFIG)
        self.ollama_url = MEMOID_CONFIG["llm"]["config"]["ollama_base_url"]
        self.model = MEMOID_CONFIG["llm"]["config"]["model"]

    def chat(self, user_id: str, message: str) -> str:
        # Search for relevant memories
        memories = self.memory.search(message, user_id, limit=5)
        context = "\n".join(
            f"- {m['memory']}"
            for m in memories.get("results", [])
        )

        # Build the prompt with memory context
        prompt = f"""You are a helpful assistant with memory of past conversations.

Relevant memories:
{context if context else "No relevant memories yet."}

User: {message}
Assistant:"""

        # Generate a response
        response = requests.post(
            f"{self.ollama_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False
            }
        )
        answer = response.json()["response"]

        # Store the conversation as new memories
        self.memory.add(
            messages=[
                {"role": "user", "content": message},
                {"role": "assistant", "content": answer}
            ],
            user_id=user_id
        )
        return answer


def main():
    bot = LocalChatbot()
    user_id = "local_user"

    print("Local AI Chatbot (powered by Ollama)")
    print("=" * 40)
    print("Type 'quit' to exit, 'memories' to see stored memories\n")

    while True:
        message = input("You: ").strip()
        if message.lower() == "quit":
            break
        if message.lower() == "memories":
            memories = bot.memory.get_all(user_id)
            print("\nStored Memories:")
            for m in memories.get("results", []):
                print(f"  - {m['memory']}")
            print()
            continue
        if message:
            response = bot.chat(user_id, message)
            print(f"Bot: {response}\n")


if __name__ == "__main__":
    main()
Running the System
- Start PostgreSQL:
docker-compose up -d
- Ensure Ollama is running:
ollama serve
- Run the chatbot:
python chat.py
Performance Tuning
Model Selection
| Model | RAM Required | Speed | Quality |
|---|---|---|---|
| llama3.1:8b | 8GB | Fast | Good |
| llama3.1:70b | 48GB | Slow | Excellent |
| mistral:7b | 6GB | Fast | Good |
| phi3:mini | 4GB | Very Fast | Moderate |
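Switching the chat model only requires pulling it and updating the LLM config; the vector schema is unaffected because only the embedder determines dimensions. A sketch, using a model name from the table above:
from config import MEMOID_CONFIG

# After `ollama pull mistral:7b`, point the LLM config at the new model.
MEMOID_CONFIG["llm"]["config"]["model"] = "mistral:7b"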
Embedding Models
| Model | Dimensions | Quality |
|---|---|---|
| nomic-embed-text | 768 | Good |
| mxbai-embed-large | 1024 | Better |
| snowflake-arctic-embed | 1024 | Best |
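Unlike the chat model, swapping the embedder changes the vector size, so embedding_model_dims and the vector(768) column must be kept in sync. A sketch, assuming the 1024-dimension figure from the table above:
from config import MEMOID_CONFIG

# mxbai-embed-large produces 1024-dimensional vectors (see table above).
MEMOID_CONFIG["embedder"]["config"]["model"] = "mxbai-embed-large"
MEMOID_CONFIG["vector_store"]["config"]["embedding_model_dims"] = 1024

# The memories table must match: recreate it with `embedding vector(1024)`
# and re-embed existing rows, since old 768-dim vectors are incompatible.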
Database Optimization
-- Increase work memory for better search
SET work_mem = '256MB';

-- Tune index parameters for your data size
CREATE INDEX ON memories
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);  -- Increase for larger datasets
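As the table grows, the ivfflat index benefits from being rebuilt with more lists (pgvector's documentation suggests roughly rows / 1000 as a starting point). A maintenance sketch, assuming the index name from local_memory.py; treat the numbers as assumptions to tune:
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    user="memoid", password="memoid", database="memoid"
)
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM memories")
    rows = cur.fetchone()[0]
    lists = max(100, rows // 1000)  # rough starting point; adjust for your data

    # Rebuild the similarity index with the new parameter.
    cur.execute("DROP INDEX IF EXISTS memories_embedding_idx")
    cur.execute(
        "CREATE INDEX memories_embedding_idx ON memories "
        "USING ivfflat (embedding vector_cosine_ops) "
        f"WITH (lists = {lists})"
    )
conn.commit()
conn.close()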
Docker Deployment
Create a complete docker-compose.yml for production:
version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: memoid
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-memoid}
      POSTGRES_DB: memoid
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: unless-stopped

  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  app:
    build: .
    environment:
      OLLAMA_URL: http://ollama:11434
      POSTGRES_HOST: postgres
    depends_on:
      - postgres
      - ollama
    restart: unless-stopped

volumes:
  pgdata:
  ollama:
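The app container receives OLLAMA_URL and POSTGRES_HOST through the environment, while config.py above hard-codes localhost. One way to bridge the two is to read the environment with local fallbacks (a sketch; the variable names come from the compose file, the defaults are assumptions):
import os

from config import MEMOID_CONFIG

# Prefer values injected by docker-compose; fall back to local defaults.
ollama_url = os.getenv("OLLAMA_URL", "http://localhost:11434")
MEMOID_CONFIG["llm"]["config"]["ollama_base_url"] = ollama_url
MEMOID_CONFIG["embedder"]["config"]["ollama_base_url"] = ollama_url

MEMOID_CONFIG["vector_store"]["config"]["host"] = os.getenv("POSTGRES_HOST", "localhost")
MEMOID_CONFIG["vector_store"]["config"]["password"] = os.getenv("POSTGRES_PASSWORD", "memoid")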
Security Considerations
- Network Isolation: Keep Ollama and PostgreSQL on an internal network and expose only your application
- Database Credentials: Use strong passwords
- Model Selection: Review model licenses for commercial use
- Data Encryption: Enable PostgreSQL SSL for production
Next Steps
- Add GPU acceleration for faster inference
- Implement model fine-tuning with your data
- Set up backup and recovery for PostgreSQL
- Add monitoring with Prometheus/Grafana