Running Open Source AI Models

00

Prerequisites & Hardware Reality Check

You don't need expensive hardware to get started. Here's what different machines can realistically run:

Machine	RAM	GPU	Can Run	Difficulty
Budget Laptop	8 GB	None (CPU only)	1B–3B models (slow)	Easy
Modern Laptop	16 GB	None / iGPU	7B–8B models (decent)	Easy
MacBook Pro M-series	16–36 GB	Apple Silicon	8B–34B models (fast)	Easy
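PC + RTX 3060/4060	16 GB+	8–12 GB VRAM	7B–13B models (fast); larger models only with partial CPU offload (slower)	Medium
PC + RTX 3090/4090	32 GB+	24 GB VRAM	Up to ~34B models (fast); 70B with partial CPU offload (slower)	Medium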
Cloud / VPS (GPU)	Any	Rented A100/H100	Any model	Advanced

💡 Apple Silicon tip: M-series Macs use unified memory, so the same 16 GB of RAM is shared between CPU and GPU. This makes them exceptional for local AI inference, and they often outperform equivalently priced PC setups when running 7B–13B models.
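
A rough way to read the table above: a model's weights take about (parameters × bits per weight ÷ 8) bytes, plus headroom for the context cache and the runtime. The sketch below assumes roughly 4.5 bits per weight (typical of 4-bit quantized models) and 20% overhead. Both numbers are illustrative assumptions, not measurements, and the function names are made up for this example.

```python
def estimated_memory_gb(params_billion: float,
                        bits_per_weight: float = 4.5,
                        overhead: float = 0.20) -> float:
    """Rough memory needed to load a model, in GB (1 GB = 1e9 bytes)."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * (1 + overhead)


def fits(params_billion: float, available_gb: float) -> bool:
    """True if the rough estimate fits in the given RAM or VRAM budget."""
    return estimated_memory_gb(params_billion) <= available_gb


if __name__ == "__main__":
    for size in (3, 8, 13, 34, 70):
        print(f"{size}B -> ~{estimated_memory_gb(size):.1f} GB")
```

Treat the result as a lower bound: long prompts, other running applications, and the operating system all need memory too.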

01

Run Models Locally — No Internet Required

1

Install Ollama — the easiest local runner

Ollama lets you download and run models with a single command. Supports macOS, Linux, and Windows. No Python, no config files.

ollama.com ↗ GitHub ↗ Model Library ↗

Terminal
# Linux: install Ollama with the official script
# (on macOS or Windows, download the installer from ollama.com instead)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.2 (3B — fast on any laptop)
ollama run llama3.2

# Pull and run Mistral 7B
ollama run mistral

# Pull and run Llama 3.1 8B
ollama run llama3.1:8b

# List all downloaded models
ollama list
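
To confirm from code that the server is up and see which models it has, you can query the same information `ollama list` shows through the REST endpoint `/api/tags`. This is a minimal sketch using only the Python standard library. It assumes the default address `localhost:11434`, and the helper names are illustrative.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling `list_local_models()` raises an `OSError` (for example, connection refused) if Ollama is not running, which makes it usable as a simple health check.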

2

Add a Chat UI on top of Ollama

Ollama runs in the terminal, but you can add a beautiful web UI for a ChatGPT-like experience. Open WebUI is the most popular option.

Open WebUI ↗ LM Studio ↗ Jan.ai (Desktop App) ↗

Docker — Open WebUI
# Requires Docker installed. Runs a local web UI at localhost:3000
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

ℹ LM Studio is the best option if you want a standalone desktop app — no terminal needed. Download, pick a model, and start chatting. Available for Mac, Windows, Linux.

3

Call the local model from your code

Ollama exposes a local REST API at localhost:11434. You can hit it directly or use the OpenAI-compatible endpoint.

Ollama API Docs ↗ OpenAI Compatibility ↗

Python
# pip install ollama
import ollama

response = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Explain neural networks simply'}]
)
print(response['message']['content'])

JavaScript / Node.js
// Works like the OpenAI SDK — just change the baseURL
import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama'  // required but ignored
})

const res = await client.chat.completions.create({
  model: 'mistral',
  messages: [{ role: 'user', content: 'What is a neural network?' }]
})
console.log(res.choices[0].message.content)
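
Both examples above go through a client library. To hit the native REST API directly, a request to `/api/chat` can be built by hand. The sketch below uses only the standard library and assumes the default local address. Setting `"stream": False` asks for a single JSON response rather than a stream of chunks. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.load(resp)["message"]["content"]
```

With Ollama running and the model pulled, `chat("llama3.1:8b", "Explain neural networks simply")` returns the reply as a string.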

4

Download models directly from Hugging Face

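For more control, download GGUF-format models directly. GGUF is the single-file model format used by llama.cpp; it usually stores quantized (reduced-precision) weights, which makes models smaller and lets them run efficiently on CPU + RAM. Use llama.cpp to run them.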

GGUF Models on HF ↗ llama.cpp GitHub ↗ GGUF Format Docs ↗

⚠ When choosing a GGUF model, look for Q4_K_M quantization — it's the best balance between model quality and file size. Q2 is smaller but noticeably worse. Q8 is near-full quality but nearly twice the size.
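
To make the size trade-off concrete, file size can be approximated as parameters × average bits per weight ÷ 8. The bits-per-weight figures below are rough assumptions for common llama.cpp quantization types. Real files vary by model architecture and llama.cpp version, so check the actual file sizes on the model page.

```python
# Approximate average bits per weight (assumed, rounded figures).
BITS_PER_WEIGHT = {"Q2_K": 3.0, "Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{gguf_size_gb(7, quant):.1f} GB")
```

Under these assumptions a Q8_0 file is about 1.75 times the size of the Q4_K_M file for the same model, which is where the "nearly twice the size" rule of thumb comes from.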

02

Browser Playgrounds — Zero Setup

Perfect for experimenting with models before committing to a local setup. No installation, works in any browser.

Hugging Face Spaces Free

Thousands of community-hosted demos. Try any model instantly in your browser.

huggingface.co/spaces ↗

Google Colab Free GPU

Free T4 GPU access in a Jupyter notebook. Great for running larger models and experiments.

colab.research.google.com ↗

LMSYS Chatbot Arena Compare

Chat with dozens of open source models side-by-side. Great for benchmarking.

chat.lmsys.org ↗

Perplexity Labs Fast

Hosted open source models (Llama, Mistral) with no signup required. Very fast inference.

labs.perplexity.ai ↗

Groq Console Ultra Fast

LPU hardware delivers blazing fast inference on Llama, Mixtral, Gemma. Free tier available.

console.groq.com ↗

OpenRouter Chat 100+ Models

One interface for 100+ models including all major open source options.

openrouter.ai/chat ↗

03

Managed Inference APIs — Build Without Infrastructure

Someone else hosts and maintains the model. You just call an API. Ideal for MVPs and startups.

1

Pick an inference provider

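Most providers below offer an OpenAI-compatible endpoint, meaning your existing OpenAI code works with a simple baseURL swap. Check each provider's docs, since some (Replicate, for example) primarily use their own API.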

Together AI ↗ Fireworks AI ↗ Groq API ↗ OpenRouter ↗ Replicate ↗
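
The baseURL swap can be sketched without any SDK: an OpenAI-style chat request differs between providers only in the base URL, the API key, and the model name. The base URLs below are the ones these providers commonly document, but treat them as assumptions and confirm them against current docs. The helper name is illustrative.

```python
import json
import urllib.request

# Assumed base URLs; verify against each provider's documentation.
BASE_URLS = {
    "together": "https://api.together.xyz/v1",
    "groq": "https://api.groq.com/openai/v1",
    "fireworks": "https://api.fireworks.ai/inference/v1",
    "openrouter": "https://openrouter.ai/api/v1",
    "ollama": "http://localhost:11434/v1",  # local Ollama server
}


def build_request(provider: str, model: str, prompt: str,
                  api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a listed provider."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URLS[provider]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` and reading `choices[0].message.content` from the JSON body completes the call. Switching provider means changing only the first two arguments and the key.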

2

Call any open source model via API

Example using Together AI. The pattern is nearly identical across OpenAI-compatible providers: swap the client (or the baseURL) and the model name.

Together AI Docs ↗ Fireworks Docs ↗

Python — Together AI
# pip install together
import os

from together import Together

# Read the key from the environment instead of hard-coding it
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512
)
print(response.choices[0].message.content)

Python — Groq (ultra-fast)
# pip install groq
import os

from groq import Groq

# Read the key from the environment instead of hard-coding it
client = Groq(api_key=os.environ["GROQ_API_KEY"])

chat = client.chat.completions.create(
    messages=[{"role": "user", "content": "Explain RAG in one paragraph"}],
    model="llama-3.1-8b-instant",
)
print(chat.choices[0].message.content)

3

Use a unified SDK for all providers

LiteLLM lets you switch between 100+ providers with one consistent interface. Saves you from rewriting code when providers change.

LiteLLM Docs ↗ GitHub ↗

Python — LiteLLM
# pip install litellm
from litellm import completion

# Switch model with one string change
response = completion(
    model="together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    # model="ollama/llama3.1"         ← local
    # model="groq/llama-3.1-8b-instant" ← groq
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
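
Because every provider sits behind the same call, a simple fallback across providers is easy to add. The helper below is a hypothetical sketch, not part of LiteLLM: it tries each model string in order and returns the first response that does not raise. The completion function is passed in, so `litellm.completion` (or a stub for testing) can be used.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    models: Sequence[str],
    messages: list[dict],
    complete: Callable[..., object],
) -> object:
    """Try each model string in order; return the first successful response.

    `complete` is any function called as complete(model=..., messages=...),
    for example litellm.completion.
    """
    last_error = None
    for model in models:
        try:
            return complete(model=model, messages=messages)
        except Exception as exc:  # provider down, bad key, rate limit, ...
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

For example, `complete_with_fallback(["groq/llama-3.1-8b-instant", "ollama/llama3.1"], messages, completion)` would use the hosted model first and fall back to the local one if the hosted call fails.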

04

VPS & Own Server — Full Ownership

1

Rent a VPS or GPU cloud instance

A VPS gives you a remote Linux machine you fully control. For CPU-only inference, a basic VPS works as long as it has enough RAM for the model (roughly 8 GB for a 3B model, 16 GB for a 7B–8B model). For GPU inference, use a dedicated GPU cloud.

Hostinger VPS ↗ DigitalOcean ↗ Vast.ai (cheap GPU) ↗ RunPod ↗ Lambda Labs ↗

2

Deploy Ollama on your VPS

Install Ollama on any Ubuntu/Debian VPS and expose it over the network. Then connect your apps to it remotely.

bash — VPS Setup
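# SSH into your VPS, then:
curl -fsSL https://ollama.com/install.sh | sh

# On systemd distributions the installer registers an "ollama" service,
# so configure that service rather than starting a second server by hand.
# Open an override file and add the two lines shown in the comments:
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"

# Apply the change and make sure the service starts on reboot
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl restart ollama

# Pull a model
ollama pull llama3.1:8b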

⚠ Never expose port 11434 publicly without authentication. Use a firewall (ufw) and either an nginx reverse proxy with basic auth, or a VPN like Tailscale to access it privately.
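
If you go the reverse-proxy route, the client then has to send credentials with every request. The sketch below shows how a Python client could build an HTTP Basic auth header for a proxied Ollama endpoint. The URL, username, and password are hypothetical placeholders, and it assumes the proxy forwards `/api/tags` to Ollama unchanged.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value for an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_tags_request(base_url: str, username: str,
                       password: str) -> urllib.request.Request:
    """Build an authenticated GET request for the proxied /api/tags endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/tags",
        headers={"Authorization": basic_auth_header(username, password)},
    )


# Hypothetical proxy address and credentials, for illustration only:
# req = build_tags_request("https://ollama.example.com", "user", "pass")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Basic auth only protects the credentials when the proxy serves HTTPS, so pair it with a TLS certificate.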

3

Hybrid: local model + globally accessible app

Run the model on your local machine (free, private) and host the app/API on a VPS (accessible worldwide). Connect them with a secure tunnel.

Tailscale (VPN mesh) ↗ ngrok (tunnel) ↗ Cloudflare Tunnel ↗

bash — Expose local Ollama via ngrok
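# Start Ollama locally (skip this if it is already running as an app or service)
ollama serve

# In another terminal, expose it via ngrok.
# Ollama's FAQ recommends rewriting the Host header so requests are accepted.
ngrok http 11434 --host-header="localhost:11434"

# ngrok gives you a public URL like:
# https://abc123.ngrok.io  ← your app calls this
# Anyone who knows this URL can use your model, so add authentication before sharing it.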

4

Containerize with Docker for reproducibility

Use Docker to package your app alongside the model server. Makes deployment, updates, and scaling dramatically easier.

Ollama Docker Image ↗ Install Docker ↗

docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: always

  myapp:
    build: .
    ports:
      - "3000:3000"
    environment:
      - OLLAMA_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
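
Inside the `myapp` container, the application can pick up the server address from the `OLLAMA_URL` variable set in the compose file. A minimal sketch, assuming the app is written in Python and falls back to `localhost` when run outside Docker (the function names are illustrative):

```python
import os


def ollama_base_url(default: str = "http://localhost:11434") -> str:
    """Return the Ollama address from OLLAMA_URL, without a trailing slash."""
    return os.environ.get("OLLAMA_URL", default).rstrip("/")


def chat_endpoint() -> str:
    """Full URL of the native chat endpoint on the configured server."""
    return f"{ollama_base_url()}/api/chat"
```

Within the compose network the hostname `ollama` resolves to the model container, so the same code works locally and in the deployed stack without changes.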

05

Recommended Open Source Models

Start with these — they're well-documented, widely supported, and run on most hardware.

Model	Size	Notes	Minimum memory
llama3.2:3b	~2 GB	Best for low-RAM machines. Surprisingly capable for its size. Great starting point.	8 GB RAM
llama3.1:8b	~5 GB	Sweet spot. Fast, capable, runs on a typical 16 GB laptop. Best all-around choice for beginners.	16 GB RAM
mistral:7b	~4.1 GB	Excellent instruction following. Fast on CPU. Popular for coding and writing tasks.	16 GB RAM
phi3:mini	~2.2 GB	Microsoft's small but capable model. Strong reasoning for its size. Good for coding.	8 GB RAM
qwen2.5:7b	~4.7 GB	Alibaba's model. Multilingual, strong at code, beats Llama 3.1 8B on many benchmarks.	16 GB RAM
llama3.1:70b	~40 GB	Near GPT-4 quality for many tasks. Needs a high-end GPU setup or Apple Silicon M2/M3 Max/Ultra.	48+ GB RAM or GPU VRAM
deepseek-coder-v2	~9 GB	One of the strongest open source models for code generation; competitive with GPT-4o on some coding tasks.	16 GB VRAM GPU

Full Ollama Model Library ↗ Open LLM Leaderboard ↗ LM Arena Benchmarks ↗

06

Advanced Paths

Fine-tuning

Train on your own data
Customize behavior and tone
Requires GPU (RTX 3090+)
Use LoRA/QLoRA, for example via Unsloth

unsloth (fast fine-tuning) ↗

RAG Systems

Chat with your own documents
Vector DB + LLM pipeline
Use LangChain or LlamaIndex
Works with local models

LangChain Docs ↗
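
The retrieval half of a RAG pipeline can be sketched in a few lines: embed the question and the documents, rank documents by similarity, and paste the best matches into the prompt. The example below is a toy under stated assumptions. It uses word counts as a stand-in for a real embedding model and a plain list instead of a vector DB, and all function names are illustrative.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Combine the retrieved context and the question into one prompt."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The string returned by `build_prompt` is what gets sent to the model, local or hosted. Frameworks such as LangChain or LlamaIndex replace the toy pieces with real embeddings, chunking, and a vector store.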

AI Agents

Models that take actions
Tool use, web browsing
Use AutoGen or CrewAI
Works locally with Ollama

CrewAI Docs ↗

Cloud Scale

Kubernetes + vLLM
Auto-scaling inference
High traffic production
Use Modal or Replicate

Modal (serverless GPU) ↗

Key Learning Resources

📚

Documentation & Communities

Hugging Face Docs ↗ Ollama Full Docs ↗ r/LocalLLaMA Community ↗ Ollama Discord ↗ vLLM (production serving) ↗

Quick Decision Guide — Which Method is Right for You?

Just want to try models → Browser Playground

Privacy + offline use → Local (Ollama)

Build an MVP fast → Managed API

Full control + scale → VPS + Docker

Custom behavior → Fine-tuning

High traffic product → Cloud + vLLM
