Can I use Ollama for RAG (Retrieval-Augmented Generation)?

Yes. Ollama works well for local RAG pipelines. Use nomic-embed-text or mxbai-embed-large (both available via ollama pull) for local embeddings with no API cost. Store vectors in ChromaDB. Then use any Ollama chat model (Llama 3.3, Mistral, etc.) to generate answers. The entire pipeline — embeddings, vector store, and generation — runs on your machine with zero API costs and complete data privacy.

Ollama Tutorial: Run LLMs Locally for Free — Python Guide (2026)

Q: What is Ollama and how does it work?

Ollama is an open-source tool that lets you download and run large language models (LLMs) on your own machine: no internet required after the initial download, no API key, no usage costs. It wraps popular models like Llama 3.3, Mistral, Phi-4, and Qwen 2.5 in a simple CLI and REST API. Models in the Ollama library are distributed already quantized (typically 4-bit by default), which is what lets a 7B–8B model fit in about 5GB. A 4-bit 70B model is still a download of roughly 40GB and needs a machine with more memory than that. Ollama exposes a local API at localhost:11434 that is compatible with the OpenAI Python SDK.

Q: What hardware do I need to run Ollama?

For 7B–8B parameter models (Llama 3.1 8B, Mistral 7B): 8GB RAM minimum, no GPU required; they run on CPU. For 14B models: 16GB RAM, GPU recommended. For 32B models, whose 4-bit files are around 20GB: 32GB RAM or a GPU with 24GB VRAM. For 70B models: 64GB RAM; a single 24GB GPU holds only part of the model, with the rest offloaded to system RAM. On Apple Silicon Macs (M1/M2/M3/M4), the unified memory architecture lets the GPU use system memory, so even a base 8GB M1 can run 7B models, although little memory is left for other applications. On Windows and Linux, a mid-range NVIDIA GPU (RTX 3060 or better) speeds things up significantly but is not required.

Q: Which is the best Ollama model for coding?

For coding tasks, qwen2.5-coder:7b is the best model available on Ollama in 2026 — it outperforms Llama 3.3 on most coding benchmarks while using the same amount of RAM. For general coding with good reasoning, deepseek-r1:8b (a reasoning model) is a strong alternative. For completions only (Copilot-style), starcoder2:7b is optimized for that use case. Pull any of these with: ollama pull qwen2.5-coder.

Q: Is Ollama compatible with the OpenAI Python SDK?

Yes. Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1. You can use the official OpenAI Python SDK by setting base_url='http://localhost:11434/v1' and api_key='ollama' (any string works). This means any code written for OpenAI works with Ollama by changing two lines. The same applies to LangChain, LlamaIndex, and any library that supports OpenAI-compatible endpoints.

Q: How fast is Ollama compared to GPT-4o?

On Apple Silicon or a modern NVIDIA GPU, Ollama running an 8B Llama model generates 40–80 tokens/second, roughly the same perceived speed as GPT-4o streaming over the internet. On CPU-only hardware it drops to 5–15 tokens/second, which feels slow for long outputs. Quality-wise, an 8B Llama model is competitive with GPT-3.5 for most tasks and about 15–20% behind GPT-4o on complex reasoning. For simple tasks (summarization, extraction, classification), the local model is often sufficient.

Q: What is the difference between Ollama and LM Studio?

Ollama is a command-line tool optimized for developers and server use — ideal for Python scripts, API servers, and automation. LM Studio is a desktop GUI application aimed at non-technical users who want to chat with models through a visual interface. Both run models locally. Ollama is better for building applications; LM Studio is better for casual use. Many developers use both: LM Studio to explore models, Ollama to deploy them in code.

Last Updated: June 2026 · 13 min read

Quick Answer

Ollama lets you run Llama 3.3, Mistral, Qwen 2.5, and 100+ other open-source LLMs entirely on your own machine: no API key, no cloud bill, no data leaving your network. Install in 2 minutes, pull a model, and call it from Python using the OpenAI SDK (it's API-compatible). 8GB RAM handles 7B–8B models; 16GB handles models up to about 14B. This guide covers installation, Python integration, streaming, structured output, local RAG, and tool calling.

GPT-4o costs money. Claude costs money. Every API call is logged, every token billed.

For development work, experimentation, privacy-sensitive data, and high-volume pipelines — paying per token stops making sense fast.

Ollama fixes this. It's an open-source runtime that downloads, manages, and runs state-of-the-art LLMs on your own hardware. The models run locally. The API is OpenAI-compatible. Your data never leaves your machine.

This guide goes from zero to a working Python integration — including streaming, structured JSON output, a local RAG pipeline, and tool calling.

Why Run an LLM Locally in 2026?

Reason	Detail
Cost	Zero per-token cost. Run millions of requests for free.
Privacy	Medical records, legal docs, internal IP — nothing leaves your machine.
Latency	No network round trip. On Apple Silicon, 40–80 tok/sec is common.
No rate limits	No 429 errors. No throttling. No quota resets.
Experimentation	Try 50 prompts in 5 minutes without watching a billing meter.
Offline	Works on a plane, on a ship, in a datacenter with no outbound internet.

The tradeoff: local 8B models are noticeably below GPT-4o on hard reasoning tasks. For simple tasks — summarization, extraction, classification, drafting, Q&A over your own docs — the gap is small enough not to matter.

Step 1 — Install Ollama

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com. Runs natively on Windows 11.

Verify:

ollama --version
# ollama version 0.6.x

Ollama starts a background service automatically. The local API runs at http://localhost:11434.

Step 2 — Pull Your First Model

ollama pull llama3.3

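Check the size of a tag before pulling it. In the Ollama library, llama3.3 has been published only as a 70B model, a download of roughly 40GB, not the 4.7GB file an 8B model would be. On a machine with 8–16GB RAM, pull an 8B tag such as llama3.1:8b (roughly 5GB) instead and substitute it wherever this guide uses llama3.3. The first pull takes a few minutes; after that the model loads from the local cache without downloading again.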

Other models worth pulling:

ollama pull mistral          # 4.1GB — fast, great for instruction following
ollama pull qwen2.5-coder    # 4.7GB — best local model for coding tasks
ollama pull phi4             # 9.1GB — Microsoft's 14B, punches above its weight
ollama pull deepseek-r1:8b  # 4.9GB — reasoning model, thinks before answering
ollama pull nomic-embed-text # 274MB — embeddings for local RAG (no GPU needed)

List what you've pulled:

ollama list

Test it immediately in the terminal:

ollama run llama3.3
# > Hello! How can I help you today?

Press Ctrl+D to exit the interactive session.

Step 3 — Call Ollama from Python

Ollama exposes an OpenAI-compatible REST API. Use the official OpenAI Python SDK — just change the base URL:

pip install openai

from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK, value doesn't matter
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a transformer model is in 3 sentences."}
    ]
)

print(response.choices[0].message.content)

That's it. If you have existing OpenAI code, swap two lines and your entire codebase runs locally.
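One way to keep that swap out of application code is to build the client arguments from environment variables. The sketch below is a minimal illustration; LLM_BACKEND and OLLAMA_BASE_URL are hypothetical variable names chosen for this example, not settings that Ollama or the OpenAI SDK read themselves.

```python
import os


def client_kwargs(env=None):
    """Return keyword arguments for OpenAI(...) based on environment variables.

    LLM_BACKEND and OLLAMA_BASE_URL are hypothetical names used only in this sketch.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "ollama") == "ollama":
        return {
            "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
            "api_key": "ollama",  # placeholder; Ollama ignores the value
        }
    # Hosted OpenAI: the SDK's default base URL applies, so only the key is needed.
    return {"api_key": env["OPENAI_API_KEY"]}


if __name__ == "__main__":
    print(client_kwargs({"LLM_BACKEND": "ollama"}))
```

With this in place, the client from the example above could be created as OpenAI(**client_kwargs()), and switching backends becomes a matter of changing one environment variable.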

Step 4 — Streaming Responses

Streaming prints tokens as they generate rather than waiting for the full response — much better UX for long outputs:

stream = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Write a Python function to parse JWT tokens."}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

print()  # newline after stream completes

Streaming works identically to the OpenAI API — same SDK, same interface, same code.
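Pipelines often need the full text after streaming finishes, for logging or post-processing. The sketch below collects the deltas while still allowing live printing. collect_stream and fake_chunk are hypothetical helper names; fake_chunk only mimics the chunk shape used in the loop above so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream, on_token=None):
    """Consume a chat-completions stream and return the concatenated text."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices; skip them
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_token is not None:
                on_token(delta)  # e.g. print each token as it arrives
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chunk (for demonstration only)."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


if __name__ == "__main__":
    demo = [fake_chunk("Hel"), fake_chunk("lo"), fake_chunk(None)]
    full = collect_stream(demo, on_token=lambda t: print(t, end="", flush=True))
    print()
    print(len(full))
```

Against a real server, the stream object returned by client.chat.completions.create(..., stream=True) would be passed in place of the demo list.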

Step 5 — Structured JSON Output

For pipelines that need machine-readable output, force the model to return valid JSON:

import json

response = client.chat.completions.create(
    model="llama3.3",
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction assistant. Always respond with valid JSON only. No explanation."
        },
        {
            "role": "user",
            "content": """Extract the following from this job posting:
- job_title
- company_name  
- location
- salary_range (null if not mentioned)
- remote (true/false)

Job posting:
Senior Python Engineer at DataFlow Inc. — Remote OK. San Francisco, CA. $180k-$220k.
"""
        }
    ],
    response_format={"type": "json_object"}  # supported in Ollama 0.5+
)

data = json.loads(response.choices[0].message.content)
print(data)
# {'job_title': 'Senior Python Engineer', 'company_name': 'DataFlow Inc.',
#  'location': 'San Francisco, CA', 'salary_range': '$180k-$220k', 'remote': True}

Tip: If the model keeps wrapping JSON in markdown code fences, add "Return only raw JSON, no markdown" to the system prompt.
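As a fallback when prompt changes are not enough, the fences can be stripped before parsing. This is a minimal sketch; parse_json_reply is a hypothetical helper name, and it only handles a single fenced block that wraps the whole reply.

```python
import json

FENCE = "`" * 3  # the three-backtick markdown fence marker


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating a surrounding markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json".
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    return json.loads(cleaned)


if __name__ == "__main__":
    fenced = FENCE + 'json\n{"remote": true}\n' + FENCE
    print(parse_json_reply(fenced))
```

In the example above, json.loads(response.choices[0].message.content) could be replaced with parse_json_reply(response.choices[0].message.content). Invalid JSON still raises json.JSONDecodeError, so callers should be prepared to retry or log the raw reply.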

Step 6 — Build a Fully Local RAG Pipeline

Combine Ollama's embedding model with ChromaDB and an Ollama chat model to build a RAG pipeline that costs exactly $0 and never sends data to any cloud:

pip install chromadb ollama

import ollama
import chromadb

# --- INDEXING ---

def embed_local(texts: list[str]) -> list[list[float]]:
    """Embed texts using Ollama's local embedding model."""
    embeddings = []
    for text in texts:
        result = ollama.embeddings(model="nomic-embed-text", prompt=text)
        embeddings.append(result["embedding"])
    return embeddings


def index_documents(chunks: list[str], sources: list[str]):
    client = chromadb.PersistentClient(path="./local_chroma")
    try:
        client.delete_collection("local_rag")
    except Exception:
        pass

    collection = client.create_collection(
        "local_rag",
        metadata={"hnsw:space": "cosine"}
    )

    embeddings = embed_local(chunks)

    collection.add(
        ids=[str(i) for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": s} for s in sources]
    )
    print(f"Indexed {len(chunks)} chunks locally.")


# --- QUERYING ---

def local_rag_query(question: str, top_k: int = 4) -> str:
    client = chromadb.PersistentClient(path="./local_chroma")
    collection = client.get_collection("local_rag")

    # Embed the question locally
    query_embedding = ollama.embeddings(
        model="nomic-embed-text", prompt=question
    )["embedding"]

    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas"]
    )

    context = "\n\n".join(results["documents"][0])

    # Generate answer locally
    response = ollama.chat(
        model="llama3.3",
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided context. If the answer isn't there, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )

    return response["message"]["content"]


# Example
if __name__ == "__main__":
    chunks = [
        "Our refund policy allows returns within 30 days of purchase.",
        "To reset your password, go to Settings > Security > Reset Password.",
        "Premium plans include unlimited API access and priority support.",
        "Business hours are Monday to Friday, 9 AM to 6 PM IST."
    ]
    sources = ["policy.txt", "help.txt", "pricing.txt", "contact.txt"]

    index_documents(chunks, sources)

    print(local_rag_query("How do I reset my password?"))
    print(local_rag_query("What does the premium plan include?"))

This is completely private. Zero API calls. Zero cost. Runs on an 8GB MacBook.
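The query above requests metadatas but only joins the documents. Labelling each retrieved chunk with its source lets the model, and the reader, see where an answer came from. The sketch below assumes the result layout returned by collection.query (one inner list per query); build_context is a hypothetical helper name, not part of ChromaDB or Ollama.

```python
def build_context(results):
    """Join retrieved chunks into one context string, prefixing each with its source."""
    docs = results["documents"][0]    # chunks for the first (only) query
    metas = results["metadatas"][0]   # matching metadata dicts, possibly None
    labelled = []
    for doc, meta in zip(docs, metas):
        source = (meta or {}).get("source", "unknown")
        labelled.append(f"[{source}] {doc}")
    return "\n\n".join(labelled)


if __name__ == "__main__":
    fake_results = {
        "documents": [["Returns are accepted within 30 days.", "Support is open 9 to 6."]],
        "metadatas": [[{"source": "policy.txt"}, {"source": "contact.txt"}]],
    }
    print(build_context(fake_results))
```

In local_rag_query, the line that builds context could call build_context(results) instead of joining results["documents"][0] directly.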

For the full RAG implementation with chunking and PDF support, see our RAG pipeline Python tutorial.
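The example above indexes four ready-made sentences. Real documents need to be split first. The sketch below is a deliberately simple character-based splitter with overlap, an assumption made for illustration rather than the method used in the linked tutorial; chunk_text and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

The resulting list can be passed to index_documents as chunks, with a sources list of the same length. Splitting on sentence or paragraph boundaries usually retrieves better than fixed character windows, at the cost of more code.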

Step 7 — Tool Calling (Function Calling)

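Ollama supports tool calling on models that have been instruction-tuned for it (for example Llama 3.3, Mistral, and Qwen 2.5). Support varies by model and tag, so check the model's page in the Ollama library for a tools label before relying on it: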

import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "What's the weather in Mumbai right now?"}],
    tools=tools,
    tool_choice="auto"
)

# Check if model wants to call a tool
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    print(f"Tool called: {function_name}")
    print(f"Arguments: {arguments}")
    # {'city': 'Mumbai', 'unit': 'celsius'}

    # Your code would now call the actual weather API here
    # Then pass the result back to the model for a final response

Tool calling works the same way as OpenAI's function calling — same format, same SDK, local execution.
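The last two comments in the example leave the execution step open. The sketch below shows one way to dispatch a tool call to a local Python function and build the message that goes back to the model. TOOL_REGISTRY and run_tool_call are hypothetical names, and get_weather here is a stand-in that returns placeholder data rather than calling a real weather API.

```python
import json
from types import SimpleNamespace


def get_weather(city, unit="celsius"):
    """Stand-in for a real weather lookup; returns fixed placeholder data."""
    return {"city": city, "unit": unit, "temperature": None, "note": "placeholder"}


# Maps the tool names declared in `tools` to the functions that implement them.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one tool call and return the 'tool' role message for the model."""
    func = TOOL_REGISTRY.get(tool_call.function.name)
    if func is None:
        result = {"error": f"unknown tool: {tool_call.function.name}"}
    else:
        try:
            args = json.loads(tool_call.function.arguments or "{}")
            result = func(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that do not match the function signature.
            result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


if __name__ == "__main__":
    fake_call = SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"city": "Mumbai"}'),
    )
    print(run_tool_call(fake_call))
```

To finish the round trip, append the assistant message that contained the tool call and then the message returned by run_tool_call to the messages list, and call client.chat.completions.create again so the model can write its final answer from the tool result.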

Model Comparison: Which Ollama Model for What?

Task	Best Model	RAM Needed	Why
General chat / Q&A	`llama3.1:8b`	8GB	Well-rounded general-purpose 8B model
Coding / debugging	`qwen2.5-coder`	8GB	Purpose-built for code, beats Llama on coding benchmarks
Complex reasoning	`deepseek-r1:8b`	8GB	Thinks before answering, much better on logic/math
Fast responses	`mistral`	8GB	Slightly smaller, faster output
Highest quality (local)	`phi4`	16GB	Microsoft 14B model, excellent instruction following
Local embeddings	`nomic-embed-text`	1GB	Fast, accurate, perfect for RAG

Using Ollama with the Native Python Library

Alternatively, skip the OpenAI SDK and use Ollama's own Python client:

pip install ollama

import ollama

# Simple chat
response = ollama.chat(
    model="llama3.3",
    messages=[{"role": "user", "content": "What is Retrieval-Augmented Generation?"}]
)
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(
    model="llama3.3",
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
    stream=True
):
    print(chunk["message"]["content"], end="", flush=True)

# Generate (completion, not chat)
response = ollama.generate(model="llama3.3", prompt="The capital of France is")
print(response["response"])

The native library is slightly simpler for pure Ollama use. The OpenAI SDK is better when you want your code to work against both OpenAI and Ollama without changes.

Serve Ollama as a Multi-User API

By default Ollama only accepts connections from localhost. To expose it on your network (for a team or a remote server):

OLLAMA_HOST=0.0.0.0 ollama serve

Now any machine on your network can call http://your-server-ip:11434/v1. Useful for:

- A shared local LLM for your development team
- A private GPU server that your laptops call
- Running Ollama on a cloud VM while calling it from anywhere

For production multi-user deployment, put Nginx in front with rate limiting and optional auth headers.
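On the client side, only the base URL and, if the proxy checks one, the token change. The OpenAI SDK sends api_key as an Authorization: Bearer header, so a proxy that validates a bearer token can reuse that field. The sketch below is a minimal illustration; remote_client_kwargs is a hypothetical helper, and the host and token values are placeholders.

```python
def remote_client_kwargs(host, token="ollama", port=11434, scheme="http"):
    """Build OpenAI(...) keyword arguments for an Ollama server on another machine.

    `token` is sent by the SDK as a bearer token; plain Ollama ignores it,
    while a reverse proxy in front of it may be configured to check it.
    """
    host = host.strip()
    if not host:
        raise ValueError("host must not be empty")
    return {
        "base_url": f"{scheme}://{host}:{port}/v1",
        "api_key": token,
    }


if __name__ == "__main__":
    print(remote_client_kwargs("192.168.1.50"))
```

A client would then be created with OpenAI(**remote_client_kwargs("your-server-ip")). When the proxy terminates TLS, pass scheme="https" and the proxy's port.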

Common Issues and Fixes

Model is slow on CPU

By default Ollama uses CPU if no GPU is detected. On CPU, expect 5–15 tokens/second for 7B models — slow but functional.

Fix: On Apple Silicon, Ollama uses the GPU through Metal automatically, so no setup is needed. On NVIDIA, make sure current GPU drivers are installed, then load a model (for example with ollama run) and check where it is running:

ollama ps
# The PROCESSOR column shows the split, for example "100% GPU" or "100% CPU"
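To put the 5–15 tokens/second figure in perspective, the wait for a reply is simply its length divided by the generation rate. The sketch below works that out for a 500-token answer, a length chosen only for illustration, using the CPU and GPU ranges quoted in this guide.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Return the time in seconds to generate num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    answer_tokens = 500  # illustrative reply length
    for label, low, high in [("CPU only", 5, 15), ("GPU / Apple Silicon", 40, 80)]:
        slowest = generation_seconds(answer_tokens, low)
        fastest = generation_seconds(answer_tokens, high)
        print(f"{label}: {fastest:.1f} to {slowest:.1f} s for {answer_tokens} tokens")
```

At the CPU rates that is between about half a minute and well over a minute per answer, which is why streaming matters most on CPU-only machines.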

`connection refused` on port 11434

Ollama isn't running.

Fix:

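ollama serve                  # starts the server in the foreground
# or, on a Linux install that set up the systemd service:
sudo systemctl start ollama
# on macOS or Windows, launching the Ollama desktop app also starts the server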

Model gives incoherent output

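Garbled output usually points to an overly aggressive quantization (2-bit and 3-bit variants lose noticeable quality) or to a prompt that overflows the model's context window.

Fix: Pull a q4_K_M variant explicitly; it is a widely used balance of quality and speed. Tag names differ per model, so copy the exact tag from the model's page in the Ollama library. For an 8B Llama model, for example:

ollama pull llama3.1:8b-instruct-q4_K_M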

Out of memory crash

You pulled a model too large for your RAM.

Fix: Use a smaller model or a smaller quantization. Rule of thumb: RAM needed at runtime ≈ 1.25× the model file size, so a 4.7GB file needs roughly 6GB free.
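The rule of thumb can be turned into a quick check before pulling a model. The sketch below reads it as runtime RAM ≈ 1.25 × model file size. The reserved_gb allowance for the operating system and other applications is an assumption made for this example, and both function names are hypothetical.

```python
def estimated_ram_gb(model_file_gb, overhead=1.25):
    """Estimate runtime RAM from the model file size using the 1.25x rule of thumb."""
    return model_file_gb * overhead


def fits_in_ram(model_file_gb, total_ram_gb, reserved_gb=2.0):
    """Return True if the model is likely to fit.

    reserved_gb is an assumed allowance for the OS and other running programs.
    """
    return estimated_ram_gb(model_file_gb) <= total_ram_gb - reserved_gb


if __name__ == "__main__":
    # File sizes taken from the pull list earlier in this guide.
    for name, size_gb in [("mistral", 4.1), ("qwen2.5-coder", 4.7), ("phi4", 9.1)]:
        for ram in (8, 16):
            verdict = "fits" if fits_in_ram(size_gb, ram) else "too large"
            print(f"{name} ({size_gb}GB) on {ram}GB RAM: {verdict}")
```

Long context windows increase memory use beyond this estimate, so treat a borderline result as a likely failure.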

Frequently Asked Questions

What is Ollama and how does it work?

Ollama is an open-source tool that downloads and runs large language models entirely on your own machine. It wraps models like Llama 3.3, Mistral, and Qwen 2.5 in a REST API at localhost:11434. No internet needed after the initial model download, no API key, no usage cost. The API is OpenAI-compatible, so existing Python code works with a two-line change.

What hardware do I need to run Ollama?

8GB RAM runs 7B–8B models (Llama 3.1 8B, Mistral 7B) on CPU. 16GB handles models up to about 14B; 32B models need roughly 32GB. A GPU is optional but speeds things up: an NVIDIA RTX 3060 or better, or any Apple Silicon M-series chip, typically brings 7B–8B generation to 40–80 tokens/second. Models larger than your available RAM will be slow or won't load.

Which is the best Ollama model for coding?

qwen2.5-coder:7b is the strongest coding model on Ollama in 2026 — outperforms Llama 3.3 on most coding benchmarks at the same RAM cost (8GB). For debugging complex multi-file issues, deepseek-r1:8b (a reasoning model) is worth trying. Pull with: ollama pull qwen2.5-coder.

Is Ollama compatible with the OpenAI Python SDK?

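Yes. Set base_url="http://localhost:11434/v1" and api_key="ollama" when creating the OpenAI client. The core features used in this guide (chat, streaming, JSON output, and tool calling) work through the same SDK calls. Ollama implements a subset of the OpenAI API, so check its OpenAI compatibility documentation before relying on less common parameters.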

Can I use Ollama for RAG?

Yes. Use ollama pull nomic-embed-text for local embeddings, ChromaDB as the vector store, and any Ollama chat model for generation. The full pipeline is free, private, and runs on an 8GB laptop. See the RAG pipeline Python tutorial for the full implementation.

How fast is Ollama compared to GPT-4o?

On Apple Silicon or a GPU-equipped machine: 40–80 tokens/second, roughly the same perceived speed as GPT-4o streaming. On CPU only: 5–15 tokens/second, noticeably slower for long outputs. Quality-wise, an 8B Llama model matches GPT-3.5 for most common tasks and is 15–20% behind GPT-4o on hard reasoning. For most developer tasks the gap is acceptable, especially at zero cost.

What is the difference between Ollama and LM Studio?

Ollama is a CLI and server tool built for developers — ideal for Python scripts, API servers, and CI pipelines. LM Studio is a desktop GUI for non-technical users who want to chat visually with models. Both run models locally. Use Ollama when you're building applications; use LM Studio when you just want to explore models interactively. Many developers use both.

Conclusion

Ollama makes running a local LLM as simple as ollama pull + three lines of Python. For development workflows, privacy-sensitive pipelines, and high-volume tasks where cloud API costs add up — it's the right default in 2026.

What you can do right now:

- Simple chat: swap your OpenAI base URL, zero other code changes
- Streaming: identical API to the cloud, instant results
- Structured output: JSON mode works on Llama 3.3 and Qwen 2.5
- Local RAG: nomic-embed-text + ChromaDB + any chat model = fully private document Q&A
- Tool calling: function calling works on instruction-tuned models

For heavier workloads — complex reasoning, long documents, highest accuracy — GPT-4o or Claude still win. But for the 80% of tasks where a local 8B model is good enough, you no longer need to pay for it.

Want to connect your local Ollama model to external tools and APIs? Read our guide on building MCP servers in Python to expose your Ollama backend as a tool Claude can call — the best of both worlds.

Need help building a production AI pipeline using local or cloud LLMs? SolutionGigs connects you with vetted AI engineers — free to post.

Mohammed Yaseen

Founder, SolutionGigs

Mohammed builds AI-powered tools and data pipelines. He has shipped production LLM systems on both cloud and local infrastructure, including privacy-first RAG pipelines using Ollama and open-source models.
