Running LLMs Locally with Ollama: Complete Setup Guide

Ollama is the easiest way to run large language models locally. Think "Docker for LLMs" — download a model, run a command, and you've got a local AI assistant. No cloud APIs, no data leaks, just your hardware.

This guide covers everything: installation, model selection, performance tuning, and integrating with your favorite tools.

Why Ollama?

One-command model downloads (no model file hunting)

Automatic GPU acceleration (CUDA, Metal, ROCm)

OpenAI-compatible API (point existing OpenAI client code at localhost)

Cross-platform (macOS, Linux, Windows)

Model library with 100+ pre-configured models

Installation (3 Minutes)

macOS

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download installer: https://ollama.com/download/windows

Verify Installation

ollama --version
# Example output: ollama version 0.6.2 (your version may differ)

Your First Model (30 Seconds)

# Download and run Mistral 7B (4.1GB)
ollama run mistral

# You'll see:
# pulling manifest
# pulling layers... 100% [████████████]
# success

# Now chat:
>>> Write a Python function to calculate Fibonacci

That's it. Ollama handles:

  • Model download (with progress bar)
  • GPU detection and setup
  • Server initialization
  • Chat interface

Choosing the Right Model

Beginner-Friendly Models

A few solid starting points from the library (sizes are for the default Q4 quant):

  • mistral (7B) — strong all-rounder, runs comfortably on 8GB RAM
  • llama3.2 (3B) — small and fast, fine for laptops
  • gemma2 (9B) — good quality for its size
  • phi3 (3.8B) — compact but surprisingly capable

Advanced Models (Better Quality, More Resources)

  • llama3.3 (70B) — near state-of-the-art, wants roughly 40GB+ of RAM/VRAM
  • mixtral (8x7B) — mixture-of-experts, strong at code and reasoning
  • qwen2.5 (72B) — excellent multilingual and coding ability
Browse All Models

ollama list  # Installed models

The CLI has no search command; browse the full library at https://ollama.com/library

Model Management

Download Models

# Pull specific version
ollama pull llama3.3:70b-instruct-q4_K_M

# Pull latest (default)
ollama pull mistral

List Installed Models

ollama list

Delete Models

ollama rm mistral:7b

Check Disk Usage

du -sh ~/.ollama/models/*
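If you'd rather script that check, the same idea fits in a few lines of Python; a minimal sketch, assuming the default model directory `~/.ollama/models`:

```python
import os

def dir_size_bytes(root: str) -> int:
    """Sum file sizes under root, roughly what `du -s` reports."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total

if __name__ == "__main__":
    models_dir = os.path.expanduser("~/.ollama/models")
    print(f"{dir_size_bytes(models_dir) / 1e9:.1f} GB")
```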

Running Models

Interactive Chat

ollama run llama3.3

>>> Hello!
Hello! How can I assist you today?

>>> /bye  # Exit chat

API Server Mode

# Start the API server (runs in the foreground; the desktop app
# and `ollama run` start it for you automatically)
ollama serve

# Test API
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
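With `"stream": true` (the API's default), the endpoint returns newline-delimited JSON chunks, each carrying a `response` fragment and a final record with `done: true`. A stdlib-only sketch of assembling those chunks; the live-server part is left commented since it needs `ollama serve` running:

```python
import json
from typing import Iterable

def join_stream(ndjson_lines: Iterable[bytes]) -> str:
    """Concatenate the `response` fragments from a /api/generate stream."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Feeding it from a live server would look like:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps({"model": "mistral", "prompt": "Hi"}).encode(),
# )
# with urllib.request.urlopen(req) as resp:
#     print(join_stream(resp))
```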

One-Shot Prompts

ollama run mistral "Explain quantum computing in 50 words"

Performance Tuning

GPU Acceleration

macOS (Metal): Automatic, no config needed

Linux (NVIDIA): Requires CUDA drivers

Linux (AMD): Requires ROCm

Check GPU usage:

# NVIDIA
nvidia-smi

# AMD
rocm-smi

# macOS
sudo powermetrics --samplers gpu_power

Adjust Context Window

# Raise the context from the default 2048 to 8192 tokens.
# There is no --num_ctx CLI flag; set it inside the chat session:
ollama run mistral
>>> /set parameter num_ctx 8192

Larger contexts = more RAM, slower inference. Find your sweet spot.
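The RAM cost of a bigger context is mostly the KV cache, and you can estimate it with back-of-envelope arithmetic: 2 tensors (K and V) per layer, each n_kv_heads × head_dim × context_length elements. The architecture numbers below are Mistral 7B's published config (32 layers, 8 KV heads via grouped-query attention, head dim 128); the fp16-cache assumption is mine:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int = 2) -> int:
    """Rough KV-cache size: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Mistral 7B at an 8192-token context, fp16 cache:
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.1f} GiB")  # 1.0 GiB
```

So going from 2048 to 8192 tokens quadruples that slice of memory; actual usage varies with Ollama's cache quantization settings.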

Quantization (Speed vs Quality)

Models come in different quantization levels:

Example:

# Default (Q4_K_M)
ollama pull mistral

# High quality (Q8)
ollama pull mistral:7b-instruct-q8_0
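A quant's download size is roughly parameters × bits-per-weight ÷ 8. A sketch of that arithmetic; the bits-per-weight figures are approximate averages I'm assuming for each scheme, not exact GGUF numbers:

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model file size in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Rough average bits per weight per scheme (assumption):
quants = {"q4_K_M": 4.8, "q8_0": 8.5, "f16": 16.0}
for name, bpw in quants.items():
    print(f"mistral 7B {name}: ~{approx_size_gb(7.2e9, bpw):.1f} GB")
```

That puts the default Q4_K_M build in the ~4GB range, consistent with the download size you saw earlier.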

Integrating with Tools

OpenAI-Compatible API

Ollama's API mimics OpenAI's format:

import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Works with:

  • LangChain
  • LlamaIndex
  • Anything built for OpenAI API

VS Code / Cursor

Install Continue.dev extension:

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Ollama Mistral",
      "provider": "ollama",
      "model": "mistral"
    }
  ]
}

Now you have local code autocomplete!

LiteLLM (Unified Router)

pip install litellm

litellm --model ollama/mistral

Routes requests to Ollama, OpenAI, Claude, etc. via one API.

n8n Automation

Use Ollama node in n8n for AI-powered workflows:

  1. Add "Ollama" node
  2. Set base URL: http://localhost:11434
  3. Select model: mistral
  4. Connect to your automation

Common Issues & Fixes

"Model not found"

# Re-pull the model
ollama pull mistral

Slow inference on macOS

Open Activity Monitor, add the "% GPU" column (View → Columns), and confirm the ollama process is actually using the GPU; Window → GPU History shows overall load.

Out of memory

# Use smaller model or lower quant
ollama pull mistral:7b-instruct-q4_0

Port already in use

# Change the bind address/port (default 11434) via OLLAMA_HOST
OLLAMA_HOST=127.0.0.1:8080 ollama serve
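Ollama's server and CLI both read the OLLAMA_HOST environment variable, so client code should honor it too. A small helper sketching that, with the stock default 127.0.0.1:11434 as the fallback:

```python
import os

def ollama_base_url() -> str:
    """Build the server URL from OLLAMA_HOST, falling back to the default."""
    host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if "://" not in host:
        host = "http://" + host
    return host

if __name__ == "__main__":
    print(ollama_base_url())
```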

Models won't download (firewall)

# Use HTTP proxy
export HTTP_PROXY=http://proxy.example.com:8080
ollama pull mistral

Advanced: Custom Models

Import GGUF Files

# Create Modelfile
cat > Modelfile <<EOF
FROM ./custom-model.gguf
PARAMETER temperature 0.7
SYSTEM You are a helpful assistant.
EOF

# Build custom model
ollama create mymodel -f Modelfile

# Run it
ollama run mymodel
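If you build models in a script or CI job, generating the Modelfile text programmatically keeps things reproducible. A sketch; the helper name `render_modelfile` is mine, and the directives it emits are the same FROM/PARAMETER/SYSTEM ones shown above:

```python
def render_modelfile(gguf_path: str, system: str, temperature: float = 0.7) -> str:
    """Produce Modelfile text suitable for `ollama create <name> -f Modelfile`."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
        f"SYSTEM {system}\n"
    )

if __name__ == "__main__":
    print(render_modelfile("./custom-model.gguf", "You are a helpful assistant."))
```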

Fine-Tuned Models

If you've fine-tuned a model:

  1. Export to GGUF format
  2. Import via Modelfile (above)
  3. Optionally push to Ollama registry

Security & Privacy

All inference happens locally — no data sent to cloud

Models live in `~/.ollama/models/` as plain files; they are encrypted at rest only if you have disk encryption (FileVault, LUKS, BitLocker) enabled

API runs on localhost by default (not exposed to internet)

Pro tip: Run Ollama in a VM or container for extra isolation.

Comparing Ollama to Alternatives

Ollama wins on ease of use. For raw speed, use vLLM. For GUI, try LM Studio.

Next Steps

  1. Try different models: Compare Mistral, Llama, Gemma on your tasks
  2. Integrate with tools: VS Code, n8n, LangChain
  3. Optimize performance: Test different quants, adjust context size
  4. Build something: Local chatbot, code assistant, document analyzer

Resources

  • Model library: https://ollama.com/library
  • Downloads: https://ollama.com/download
  • Source and issue tracker: https://github.com/ollama/ollama
  • API reference: https://github.com/ollama/ollama/blob/main/docs/api.md

Got stuck? Drop a comment or ping me — I've run Ollama on everything from Raspberry Pis to H100 clusters.

(Affiliate disclosure: Some links may include referral codes. I only recommend tools I actually use.)
