Complete Field Guide 2026
Running Open Source AI Models
From zero to production — every method, every tool, and everything you need to run powerful AI models on your own hardware or server.
6 Setup Methods
15+ Tools Covered
$0 Minimum Cost
You don't need expensive hardware to get started. Here's what different machines can realistically run:
| Machine | RAM | GPU | Can Run | Difficulty |
|---|---|---|---|---|
| Budget Laptop | 8 GB | None (CPU only) | 1B–3B models (slow) | Easy |
| Modern Laptop | 16 GB | None / iGPU | 7B–8B models (decent) | Easy |
| MacBook Pro M-series | 16–36 GB | Apple Silicon | 8B–34B models (fast) | Easy |
| PC + RTX 3060/4060 | 16 GB+ | 8–12 GB VRAM | 7B–13B models (fast) | Medium |
| PC + RTX 3090/4090 | 32 GB+ | 24 GB VRAM | Up to ~30B models (fast); 70B only with CPU offload (slow) | Medium |
| Cloud / VPS (GPU) | Any | Rented A100/H100 | Any model | Advanced |
💡
Apple Silicon tip: M-series Macs use unified memory, so the 16 GB of RAM is shared between CPU and GPU and the GPU can use a large share of it for model weights. This makes them exceptionally good at local AI inference; they often outperform similarly priced PC setups when running 7B–13B models.
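The figures in the hardware table can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes about 4.85 bits per weight (a typical 4-bit quantization) and a 20% allowance for context and runtime buffers. Both numbers are rough assumptions, not measurements, and the function names are made up for illustration.

```python
def estimate_memory_gb(params_billion: float,
                       bits_per_weight: float = 4.85,
                       overhead: float = 1.2) -> float:
    """Rough memory needed to run a model, in GB.

    bits_per_weight and overhead are assumed rule-of-thumb values.
    """
    # 1e9 parameters * (bits / 8) bytes each is (params_billion * bits / 8) GB
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead


def fits(params_billion: float, available_gb: float) -> bool:
    """True if the estimate fits in the given RAM or VRAM budget."""
    return estimate_memory_gb(params_billion) <= available_gb


for size in (3, 8, 13, 34, 70):
    print(f"{size}B model: ~{estimate_memory_gb(size):.1f} GB")
```

By this estimate a 70B model at 4-bit needs roughly 50 GB, which lines up with the 48+ GB requirement listed for llama3.1:70b later in this guide.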
1
Install Ollama — the easiest local runner
Ollama lets you download and run models with a single command. Supports macOS, Linux, and Windows. No Python, no config files.
# macOS / Linux — install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Llama 3.2 (3B — fast on any laptop)
ollama run llama3.2
# Pull and run Mistral 7B
ollama run mistral
# Pull and run Llama 3.1 8B
ollama run llama3.1:8b
# List all downloaded models
ollama list
2
Add a Chat UI on top of Ollama
Ollama runs in the terminal, but you can add a beautiful web UI for a ChatGPT-like experience. Open WebUI is the most popular option.
# Requires Docker installed. Runs a local web UI at localhost:3000
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
ℹ
LM Studio is the best option if you want a standalone desktop app — no terminal needed. Download, pick a model, and start chatting. Available for Mac, Windows, Linux.
3
Call the local model from your code
Ollama exposes a local REST API at localhost:11434. You can hit it directly or use the OpenAI-compatible endpoint.
# pip install ollama
import ollama
response = ollama.chat(
model='llama3.1:8b',
messages=[{'role': 'user', 'content': 'Explain neural networks simply'}]
)
print(response['message']['content'])
// Works like the OpenAI SDK — just change the baseURL
import OpenAI from 'openai'
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama' // required but ignored
})
const res = await client.chat.completions.create({
model: 'mistral',
messages: [{ role: 'user', content: 'What is a neural network?' }]
})
console.log(res.choices[0].message.content)
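For completeness, here is what hitting the REST API directly might look like without any SDK. This is a minimal sketch using only the Python standard library against Ollama's native `/api/chat` endpoint; the helper names are illustrative, and it assumes the server is running on the default address mentioned above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's native chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# Requires a running Ollama server with the model already pulled:
# print(chat("llama3.1:8b", "Explain neural networks simply"))
```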
4
Download models directly from Hugging Face
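For more control, download GGUF-format models directly. GGUF is the single-file format used by llama.cpp; it usually holds quantized (reduced-precision) weights, so models run efficiently on CPU and RAM, with optional GPU offload. Use llama.cpp to run them.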
⚠
When choosing a GGUF model, look for Q4_K_M quantization — it's the best balance between model quality and file size. Q2 is smaller but noticeably worse. Q8 is near-full quality but nearly twice the size.
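The size trade-off can be approximated as parameters times bits per weight. The sketch below uses approximate bits-per-weight values for each quantization level; they are assumptions for illustration (real files vary by model architecture), not figures from any official table.

```python
# Approximate bits per weight for common GGUF quantization levels (assumed values).
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB for a model with the given parameter count."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{gguf_size_gb(8, quant):.1f} GB for an 8B model")
```

With these assumed values, Q8_0 comes out about 1.75 times the size of Q4_K_M, consistent with the "nearly twice the size" rule of thumb.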
Browser playgrounds: perfect for experimenting with models before committing to a local setup. No installation, works in any browser.
Managed APIs: someone else hosts and maintains the model. You just call an API. Ideal for MVPs and startups.
1
Pick an inference provider
All providers below are OpenAI-compatible — meaning your existing OpenAI code works with a simple baseURL swap.
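In Python, the swap might look like the sketch below. The base URLs are the OpenAI-compatible endpoints for Ollama, Together AI, and Groq; the `PROVIDERS` table, the helper names, and the environment variable names are assumptions chosen for illustration. The `openai` package is imported lazily so the configuration logic can be read and run on its own.

```python
import os

# provider -> (OpenAI-compatible base URL, env var holding the API key)
PROVIDERS = {
    "ollama": ("http://localhost:11434/v1", None),  # local, key is ignored
    "together": ("https://api.together.xyz/v1", "TOGETHER_API_KEY"),
    "groq": ("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
}


def client_kwargs(provider: str) -> dict:
    """Return the two settings that differ between providers."""
    base_url, key_env = PROVIDERS[provider]
    # Ollama needs a placeholder key; hosted providers need a real one.
    api_key = os.environ[key_env] if key_env else "ollama"
    return {"base_url": base_url, "api_key": api_key}


def make_client(provider: str):
    """Create an OpenAI SDK client pointed at the chosen provider."""
    from openai import OpenAI  # pip install openai

    return OpenAI(**client_kwargs(provider))


# client = make_client("groq")
# reply = client.chat.completions.create(
#     model="llama-3.1-8b-instant",
#     messages=[{"role": "user", "content": "Hello!"}],
# )
```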
2
Call any open source model via API
Example using Together AI — the pattern is identical across all providers. Just swap the baseURL and model name.
# pip install together
from together import Together
client = Together(api_key="your_together_api_key")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=512
)
print(response.choices[0].message.content)
# pip install groq
from groq import Groq
client = Groq(api_key="your_groq_api_key")
chat = client.chat.completions.create(
messages=[{"role": "user", "content": "Explain RAG in one paragraph"}],
model="llama-3.1-8b-instant",
)
print(chat.choices[0].message.content)
3
Use a unified SDK for all providers
LiteLLM lets you switch between 100+ providers with one consistent interface. Saves you from rewriting code when providers change.
# pip install litellm
from litellm import completion
# Switch model with one string change
response = completion(
model="together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
# model="ollama/llama3.1" ← local
# model="groq/llama-3.1-8b-instant" ← groq
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
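Because the call shape is the same for every provider, a simple fallback chain is easy to hand-roll. The sketch below tries a list of model strings in order and returns the first successful response. `complete_with_fallback` is a hypothetical helper, not part of LiteLLM, which has its own routing features that are not shown here.

```python
def complete_with_fallback(models, messages, completion_fn=None):
    """Try each model string in order; return the first successful response.

    completion_fn defaults to litellm.completion and can be replaced in tests.
    """
    if completion_fn is None:
        from litellm import completion as completion_fn  # pip install litellm

    errors = []
    for model in models:
        try:
            return completion_fn(model=model, messages=messages)
        except Exception as exc:  # provider down, rate limited, bad key, ...
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed:\n" + "\n".join(errors))


# response = complete_with_fallback(
#     ["groq/llama-3.1-8b-instant", "ollama/llama3.1"],
#     [{"role": "user", "content": "Hello!"}],
# )
```

Listing a hosted model first and a local one second gives a cheap safety net when a provider has an outage.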
1
Rent a VPS or GPU cloud instance
A VPS gives you a remote Linux machine you fully control. For CPU-only inference, any basic VPS works. For GPU inference, use a dedicated GPU cloud.
2
Deploy Ollama on your VPS
Install Ollama on any Ubuntu/Debian VPS and expose it over the network. Then connect your apps to it remotely.
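# SSH into your VPS, then:
curl -fsSL https://ollama.com/install.sh | sh
# On systemd distros the installer registers an "ollama" service.
# Make that service listen on all interfaces with a drop-in override
# (an `export` in your shell would not reach the service):
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0:11434"\n' | \
sudo tee /etc/systemd/system/ollama.service.d/override.conf
# Reload systemd, enable the service on boot, and restart it
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl restart ollama
# Pull a model
ollama pull llama3.1:8b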
⚠
Never expose port 11434 publicly without authentication. Use a firewall (ufw) and either an nginx reverse proxy with basic auth, or a VPN like Tailscale to access it privately.
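On the client side, talking to a basic-auth proxy only requires one extra header. The sketch below builds it with the standard library; the hostname and credentials in the usage comment are placeholders, and the proxy itself (nginx or similar) is assumed to be configured separately.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Return the HTTP header expected by a basic-auth reverse proxy."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical usage with the ollama package, whose Client takes
# `host` and `headers` arguments according to its README:
#
# from ollama import Client
# client = Client(host="https://llm.example.com",
#                 headers=basic_auth_header("user", "pass"))
```

Use HTTPS in front of the proxy, since basic auth credentials are only base64-encoded, not encrypted.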
3
Hybrid: local model + globally accessible app
Run the model on your local machine (free, private) and host the app/API on a VPS (accessible worldwide). Connect them with a secure tunnel.
# Start Ollama locally
ollama serve
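# In another terminal, expose it via ngrok.
# Ollama's FAQ recommends rewriting the Host header when tunnelling:
ngrok http 11434 --host-header="localhost:11434"
# ngrok gives you a public URL like:
# https://abc123.ngrok.io ← your app calls this
# Anyone who has this URL can reach your model, so put authentication in front of it.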
4
Containerize with Docker for reproducibility
Use Docker to package your app alongside the model server. Makes deployment, updates, and scaling dramatically easier.
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: always
  myapp:
    build: .
    ports:
      - "3000:3000"
    environment:
      - OLLAMA_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama_data:
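One detail worth handling in the app: a plain `depends_on` only orders container startup; it does not wait for Ollama to be ready to answer requests. A minimal sketch of how `myapp` might read `OLLAMA_URL` and wait for the server follows. The function names are illustrative, and `/api/tags` (Ollama's model-listing endpoint) is used here as a cheap readiness probe.

```python
import os
import time
import urllib.error
import urllib.request


def ollama_url() -> str:
    """Read the server address from the environment set in the compose file."""
    return os.environ.get("OLLAMA_URL", "http://localhost:11434").rstrip("/")


def is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers on its model-list endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_ollama(probe=is_up, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll until the server responds or the attempts run out."""
    base_url = ollama_url()
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_ollama()` once at app startup avoids connection errors during the first few seconds after `docker compose up`.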
Start with these — they're well-documented, widely supported, and run on most hardware.
| Model | Size | Notes | Minimum hardware |
|---|---|---|---|
| llama3.2:3b | ~2 GB | Best for low-RAM machines. Surprisingly capable for its size. Great starting point. | 8 GB RAM |
| llama3.1:8b | ~5 GB | Sweet spot. Fast, capable, runs on any 16 GB laptop. Best all-around choice for beginners. | 16 GB RAM |
| mistral:7b | ~4.1 GB | Excellent instruction following. Fast on CPU. Popular for coding and writing tasks. | 16 GB RAM |
| phi3:mini | ~2.2 GB | Microsoft's small but mighty model. Strong reasoning for its size. Great for coding. | 8 GB RAM |
| qwen2.5:7b | ~4.7 GB | Alibaba's model. Multilingual, strong at code, beats Llama 3.1 8B on many benchmarks. | 16 GB RAM |
| llama3.1:70b | ~40 GB | Near GPT-4 quality for many tasks. Needs a good GPU or Apple Silicon M2/M3 Max/Ultra. | 48+ GB RAM or GPU VRAM |
| deepseek-coder-v2 | ~9 GB | Strong open source model for code generation. The ~9 GB download is the smaller 16B variant, so expect less than the GPT-4-class results reported for the full-size model. | GPU with 16 GB VRAM |
RAG Systems
- Chat with your own documents
- Vector DB + LLM pipeline
- Use LangChain or LlamaIndex
- Works with local models
LangChain Docs ↗
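To make the "vector DB + LLM pipeline" idea concrete, here is a toy sketch of the retrieval step in plain Python. It stands in a word-count vector for a real embedding model and a list for a real vector database, so it only illustrates the shape of the pipeline; LangChain and LlamaIndex provide production versions of each piece.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list, k: int = 2) -> str:
    """Stuff the retrieved documents into a prompt for the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Ollama serves a REST API on port 11434.",
    "Q4_K_M is a common GGUF quantization level.",
    "Open WebUI provides a browser chat interface.",
]
print(build_prompt("Which port does Ollama use?", docs, k=1))
```

The resulting prompt string is what would be sent to a local model as the user message.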
AI Agents
- Models that take actions
- Tool use, web browsing
- Use AutoGen or CrewAI
- Works locally with Ollama
CrewAI Docs ↗
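The core of an agent is a loop: the model either answers or asks for a tool, and the program runs the tool and feeds the result back. The sketch below shows that loop with a made-up JSON convention (`{"tool": ..., "args": ...}`) and a scripted stand-in for the model. AutoGen, CrewAI, and Ollama's own tool-calling support each define their own formats, so treat this as an illustration of the control flow only.

```python
import json


def add(a: float, b: float) -> float:
    """Example tool the model is allowed to call."""
    return a + b


TOOLS = {"add": add}


def run_agent(model_fn, user_msg: str, max_steps: int = 5) -> str:
    """Loop: ask the model, run any tool it requests, feed the result back."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model_fn(messages)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text means a final answer
        if not isinstance(call, dict) or call.get("tool") not in TOOLS:
            return reply
        result = TOOLS[call["tool"]](**call.get("args", {}))
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "stopped: step limit reached"


def scripted_model(messages):
    """Stand-in for a real LLM call: requests a tool once, then answers."""
    if len(messages) == 1:
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    return f"Done. {messages[-1]['content']}"


print(run_agent(scripted_model, "What is 2 + 3?"))
```

Replacing `scripted_model` with a function that calls a local model is the only change needed to try this against Ollama; the step limit guards against a model that keeps requesting tools forever.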
Cloud Scale
- Kubernetes + vLLM
- Auto-scaling inference
- High traffic production
- Use Modal or Replicate
Modal (serverless GPU) ↗
Key Learning Resources
📚
Documentation & Communities
Quick Decision Guide — Which Method is Right for You?
Just want to try models → Browser Playground
Privacy + offline use → Local (Ollama)
Build an MVP fast → Managed API
Full control + scale → VPS + Docker
Custom behavior → Fine-tuning
High traffic product → Cloud + vLLM