Open Source AI Models — Complete Guide
Complete Field Guide 2026

Running Open Source AI Models

From zero to production — every method, every tool, and everything you need to run powerful AI models on your own hardware or server.

6 Setup Methods
15+ Tools Covered
$0 Minimum Cost

00

Prerequisites & Hardware Reality Check

You don't need expensive hardware to get started. Here's what different machines can realistically run:

Machine              | RAM      | GPU              | Can Run                                                   | Difficulty
Budget Laptop        | 8 GB     | None (CPU only)  | 1B–3B models (slow)                                       | Easy
Modern Laptop        | 16 GB    | None / iGPU      | 7B–8B models (decent)                                     | Easy
MacBook Pro M-series | 16–36 GB | Apple Silicon    | 8B–34B models (fast)                                      | Easy
PC + RTX 3060/4060   | 16 GB+   | 8–12 GB VRAM     | 7B–13B models (fast)                                      | Medium
PC + RTX 3090/4090   | 32 GB+   | 24 GB VRAM       | Up to ~34B models (fast); 70B only with CPU offload (slow) | Medium
Cloud / VPS (GPU)    | Any      | Rented A100/H100 | Any model                                                 | Advanced

💡 Apple Silicon tip: M-series Macs use unified memory, so the same pool (16 GB on a base model) is shared between CPU and GPU. This makes them well suited to local AI inference, and they often outperform similarly priced PC setups when running 7B–13B models.
01

Run Models Locally — No Internet Required

1

Install Ollama — the easiest local runner

Ollama lets you download and run models with a single command. Supports macOS, Linux, and Windows. No Python, no config files.

Terminal
    # Linux: install Ollama with the official script
    # (on macOS or Windows, the desktop installer from ollama.com/download is the usual route)
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull and run Llama 3.2 (3B, fast on most laptops)
    ollama run llama3.2

    # Pull and run Mistral 7B
    ollama run mistral

    # Pull and run Llama 3.1 8B
    ollama run llama3.1:8b

    # List all downloaded models
    ollama list
2

Add a Chat UI on top of Ollama

Ollama runs in the terminal, but you can add a beautiful web UI for a ChatGPT-like experience. Open WebUI is the most popular option.

Docker — Open WebUI
    # Requires Docker installed. Runs a local web UI at localhost:3000
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui \
      ghcr.io/open-webui/open-webui:main

LM Studio is the best option if you want a standalone desktop app — no terminal needed. Download, pick a model, and start chatting. Available for Mac, Windows, Linux.
3

Call the local model from your code

Ollama exposes a local REST API at localhost:11434. You can hit it directly or use the OpenAI-compatible endpoint.

Python
    # pip install ollama
    import ollama

    response = ollama.chat(
        model='llama3.1:8b',
        messages=[{'role': 'user', 'content': 'Explain neural networks simply'}]
    )
    print(response['message']['content'])

JavaScript / Node.js
    // Works like the OpenAI SDK: just change the baseURL
    import OpenAI from 'openai'

    const client = new OpenAI({
      baseURL: 'http://localhost:11434/v1',
      apiKey: 'ollama' // required but ignored
    })

    const res = await client.chat.completions.create({
      model: 'mistral',
      messages: [{ role: 'user', content: 'What is a neural network?' }]
    })
    console.log(res.choices[0].message.content)
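
If you would rather hit the REST API directly, with no SDK at all, the sketch below shows one way to do it using only the Python standard library. It assumes Ollama is running on its default address and that the model has already been pulled; the helper names build_chat_request and chat are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address mentioned above


def build_chat_request(model, prompt, base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON reply instead of a token stream
    }
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt, base_url=OLLAMA_URL, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With Ollama running, chat("llama3.1:8b", "Explain neural networks simply") should return the reply as a plain string.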
4

Download models directly from Hugging Face

For more control, download GGUF-format models directly. GGUF is the single-file model format used by llama.cpp; it usually stores quantized (compressed) weights that run efficiently on CPU and system RAM. Use llama.cpp to run them.

When choosing a GGUF model, look for Q4_K_M quantization: it offers a good balance between model quality and file size. Q2 files are smaller but noticeably worse in output quality. Q8 is close to full quality but nearly twice the size of Q4_K_M.
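
To get a feel for what those quantization levels mean on disk, here is a rough back-of-the-envelope sketch. The bits-per-weight values are assumed averages (real GGUF files vary by architecture), and the 2 GB overhead for context and runtime is also an assumption, so treat the results as estimates only.

```python
# Approximate bits stored per weight for common GGUF quantization levels.
# These are rough, assumed averages; real files vary by model architecture.
BITS_PER_WEIGHT = {"Q2_K": 3.0, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}


def estimate_gguf_size_gb(params_billions, quant="Q4_K_M"):
    """Estimate file size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits_in_memory(params_billions, memory_gb, quant="Q4_K_M", overhead_gb=2.0):
    """Rough check: model weights plus an assumed runtime overhead must fit."""
    return estimate_gguf_size_gb(params_billions, quant) + overhead_gb <= memory_gb


if __name__ == "__main__":
    for quant in ("Q2_K", "Q4_K_M", "Q8_0"):
        print(f"8B model at {quant}: ~{estimate_gguf_size_gb(8, quant):.1f} GB")
```

Under these assumptions an 8B model at Q4_K_M comes out near 5 GB, while a 70B model at the same level lands around 42 GB, which is why it does not fit in 24 GB of VRAM.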
02

Browser Playgrounds — Zero Setup

Perfect for experimenting with models before committing to a local setup. No installation, works in any browser.

03

Managed Inference APIs — Build Without Infrastructure

Someone else hosts and maintains the model. You just call an API. Ideal for MVPs and startups.

1

Pick an inference provider

All providers below are OpenAI-compatible — meaning your existing OpenAI code works with a simple baseURL swap.

2

Call any open source model via API

Example using Together AI — the pattern is identical across all providers. Just swap the baseURL and model name.

Python — Together AI
    # pip install together
    from together import Together

    client = Together(api_key="your_together_api_key")

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=512
    )
    print(response.choices[0].message.content)

Python — Groq (ultra-fast)
    # pip install groq
    from groq import Groq

    client = Groq(api_key="your_groq_api_key")

    chat = client.chat.completions.create(
        messages=[{"role": "user", "content": "Explain RAG in one paragraph"}],
        model="llama-3.1-8b-instant",
    )
    print(chat.choices[0].message.content)
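
The baseURL swap itself can be sketched with the OpenAI Python SDK. The base URLs and environment variable names below are assumptions taken from each provider's public docs at the time of writing, so check them before relying on this; client_kwargs and make_client are hypothetical helper names.

```python
import os

# Assumed OpenAI-compatible endpoints; verify against each provider's docs.
PROVIDERS = {
    "together": {"base_url": "https://api.together.xyz/v1", "key_env": "TOGETHER_API_KEY"},
    "groq": {"base_url": "https://api.groq.com/openai/v1", "key_env": "GROQ_API_KEY"},
    "ollama": {"base_url": "http://localhost:11434/v1", "key_env": None},
}


def client_kwargs(provider, env=None):
    """Return the base_url and api_key to pass to the OpenAI client."""
    env = os.environ if env is None else env
    config = PROVIDERS[provider]
    # Local Ollama needs a placeholder key; hosted providers read theirs from env.
    api_key = env.get(config["key_env"], "") if config["key_env"] else "ollama"
    return {"base_url": config["base_url"], "api_key": api_key}


def make_client(provider):
    """Create an OpenAI SDK client pointed at the chosen provider."""
    from openai import OpenAI  # pip install openai (imported lazily)

    return OpenAI(**client_kwargs(provider))
```

After that, make_client("groq").chat.completions.create(model="llama-3.1-8b-instant", messages=[...]) follows the same call shape as the examples above; only the provider name and model string change.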
3

Use a unified SDK for all providers

LiteLLM lets you switch between 100+ providers with one consistent interface. Saves you from rewriting code when providers change.

Python — LiteLLM
    # pip install litellm
    from litellm import completion

    # Switch model with one string change
    response = completion(
        model="together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
        # model="ollama/llama3.1"            # local
        # model="groq/llama-3.1-8b-instant"  # Groq
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)
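
Because the model is just a string, one pattern this makes easy is trying a second provider when the first one fails. The helper below is a hypothetical sketch, not a LiteLLM feature: it takes the completion function as an argument, so in real use you would pass litellm's completion.

```python
def complete_with_fallback(models, messages, complete_fn):
    """Try each model string in order; return (model, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, complete_fn(model=model, messages=messages)
        except Exception as exc:  # broad on purpose: provider errors differ
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))
```

For example, complete_with_fallback(["groq/llama-3.1-8b-instant", "ollama/llama3.1"], messages, completion) would fall back to the local model if the hosted call raises.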
04

VPS & Own Server — Full Ownership

1

Rent a VPS or GPU cloud instance

A VPS gives you a remote Linux machine you fully control. For CPU-only inference, any basic VPS works. For GPU inference, use a dedicated GPU cloud.

2

Deploy Ollama on your VPS

Install Ollama on any Ubuntu/Debian VPS and expose it over the network. Then connect your apps to it remotely.

bash — VPS Setup
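    # SSH into your VPS, then install Ollama
    # (on systemd machines the installer also registers an "ollama" service)
    curl -fsSL https://ollama.com/install.sh | sh

    # Make the service listen on all interfaces via a systemd override
    sudo mkdir -p /etc/systemd/system/ollama.service.d
    printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0:11434"\n' | \
      sudo tee /etc/systemd/system/ollama.service.d/override.conf

    # Reload systemd, enable the service on boot, and restart it with the new setting
    sudo systemctl daemon-reload
    sudo systemctl enable ollama
    sudo systemctl restart ollama

    # Pull a model
    ollama pull llama3.1:8b
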
Never expose port 11434 publicly without authentication. Use a firewall (ufw) and either an nginx reverse proxy with basic auth, or a VPN like Tailscale to access it privately.
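
On the client side, calling Ollama through a reverse proxy with basic auth only means adding an Authorization header. The sketch below assumes a proxy that forwards /api/chat to Ollama; the URL and the credentials in the usage line are placeholders, and build_authed_request is a hypothetical helper.

```python
import base64
import json
import urllib.request


def build_authed_request(base_url, username, password, model, prompt):
    """Build a POST to /api/chat carrying HTTP Basic credentials for the proxy."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON reply instead of a token stream
    }
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A call such as build_authed_request("https://llm.example.com", "user", "pass", "llama3.1:8b", "Hello") produces a request you can send with urllib.request.urlopen. Use HTTPS on the proxy, since basic auth credentials are only base64-encoded, not encrypted.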
3

Hybrid: local model + globally accessible app

Run the model on your local machine (free, private) and host the app/API on a VPS (accessible worldwide). Connect them with a secure tunnel.

bash — Expose local Ollama via ngrok
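    # Start Ollama locally
    ollama serve

    # In another terminal, expose it via ngrok.
    # Rewriting the host header lets Ollama accept the forwarded requests.
    ngrok http 11434 --host-header="localhost:11434"

    # ngrok prints a public URL like:
    # https://abc123.ngrok.io  <- your app calls this
    # Anyone who has this URL can reach your model, so add authentication before sharing it.
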
4

Containerize with Docker for reproducibility

Use Docker to package your app alongside the model server. Makes deployment, updates, and scaling dramatically easier.

docker-compose.yml
    services:
      ollama:
        image: ollama/ollama
        ports:
          - "11434:11434"
        volumes:
          - ollama_data:/root/.ollama
        restart: always

      myapp:
        build: .
        ports:
          - "3000:3000"
        environment:
          - OLLAMA_URL=http://ollama:11434
        depends_on:
          - ollama

    volumes:
      ollama_data:
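
Inside the app container, the code only needs to read OLLAMA_URL rather than hard-coding localhost. A minimal sketch, assuming a Python app (the compose file does not say what myapp is written in) and a hypothetical helper name:

```python
import os

DEFAULT_OLLAMA_URL = "http://localhost:11434"  # fallback when running outside Docker


def ollama_endpoint(path="/api/chat", env=None):
    """Build the full Ollama URL from the OLLAMA_URL environment variable."""
    env = os.environ if env is None else env
    base = env.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).rstrip("/")
    return base + path
```

Under docker compose this resolves to http://ollama:11434/api/chat, because the service name doubles as the hostname on the compose network; run directly on your machine, it falls back to localhost.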
05

Recommended Open Source Models

Start with these — they're well-documented, widely supported, and run on most hardware.

Model             | Download | Notes                                                                                    | Minimum memory
llama3.2:3b       | ~2 GB    | Best for low-RAM machines. Surprisingly capable for its size. Great starting point.      | 8 GB RAM
llama3.1:8b       | ~5 GB    | Sweet spot. Fast, capable, runs on a 16 GB laptop. Best all-around choice for beginners. | 16 GB RAM
mistral:7b        | ~4.1 GB  | Excellent instruction following. Fast on CPU. Popular for coding and writing tasks.      | 16 GB RAM
phi3:mini         | ~2.2 GB  | Microsoft's small but capable model. Strong reasoning for its size. Good for coding.     | 8 GB RAM
qwen2.5:7b        | ~4.7 GB  | Alibaba's model. Multilingual, strong at code, beats Llama 3.1 8B on many benchmarks.    | 16 GB RAM
llama3.1:70b      | ~40 GB   | Near GPT-4 quality for many tasks. Needs a large GPU setup or Apple Silicon M2/M3 Max/Ultra. | 48+ GB RAM or GPU VRAM
deepseek-coder-v2 | ~9 GB    | One of the strongest open models for code generation. The ~9 GB download is the smaller 16B variant; the GPT-4-class coding results were reported for the much larger version. | 16 GB VRAM GPU
06

Advanced Paths

Fine-tuning
  • Train on your own data
  • Customize behavior and tone
  • Requires GPU (RTX 3090+)
  • Use LoRA or QLoRA, for example via Unsloth
Unsloth (fast fine-tuning) ↗
RAG Systems
  • Chat with your own documents
  • Vector DB + LLM pipeline
  • Use LangChain or LlamaIndex
  • Works with local models
LangChain Docs ↗
AI Agents
  • Models that take actions
  • Tool use, web browsing
  • Use AutoGen or CrewAI
  • Works locally with Ollama
CrewAI Docs ↗
Cloud Scale
  • Kubernetes + vLLM
  • Auto-scaling inference
  • High traffic production
  • Use Modal or Replicate
Modal (serverless GPU) ↗
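
To make the RAG idea above concrete, here is a toy sketch of the "retrieve, then prompt" step. It swaps the vector database and embedding model for simple word-count cosine similarity, which is an assumption made purely for illustration; LangChain or LlamaIndex would replace all of this in a real pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(documents, key=lambda doc: cosine(query, Counter(tokenize(doc))), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Put the retrieved documents into the prompt sent to the LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The string returned by build_prompt is what you would send as the user message to a local model, for example through ollama.chat.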

Key Learning Resources

Quick Decision Guide — Which Method is Right for You?

Just want to try models → Browser Playground
Privacy + offline use → Local (Ollama)
Build an MVP fast → Managed API
Full control + scale → VPS + Docker
Custom behavior → Fine-tuning
High traffic product → Cloud + vLLM