🏠 Why Run Your Own LLM? The Case for Self-Hosting in 2026
This is Part 1 of the Self-Hosting LLMs series — a practical guide to running open-source AI on your own hardware.
Two moments usually push developers toward running their own LLMs.
The first is the bill.
You build a small weekend project — a coding assistant, a document summarizer, a chatbot for your team. You wire it up to the OpenAI API, test it out, invite a few people. Two weeks later you check your usage dashboard and see a number you weren’t expecting.
The second is the pause.
You’re about to paste something into ChatGPT — a contract, some internal code, a customer email — and a small voice asks: “wait, where does this actually go?”
These two moments — cost and privacy — are why people start looking at local LLMs. And in 2026, the answer to both is better than most people expect.
This post makes the case for self-hosting, shows you what hardware you realistically need, and gets you running a real model in about 5 minutes.
🗺️ 1. What “Self-Hosting” Actually Means
The term covers a few different setups — all of them mean you control the model, not a vendor:
- Local — model runs on your laptop or desktop, right now
- On-prem server — a machine in your office, home lab, or rack
- Cloud self-hosted — you rent a GPU VM (Vast.ai, RunPod, AWS) and run your own model on it
All three share the same property: no third party sits between your data and the model. The rest of this series focuses primarily on the local case — a gaming PC, a workstation, or an Apple Silicon Mac — because that’s where most developers start.
💸 2. The Cost Problem
Cloud LLM APIs have gotten significantly cheaper over the past two years. But at any meaningful scale, they’re still expensive.
Current API Pricing (April 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| GPT-4o | $2.50 | $10.00 |
| Gemini 2.5 Pro | $2.00 | $12.00 |
| DeepSeek V3 | $0.14 | $0.28 |
Output tokens typically cost 2–6x more than input tokens (as in the table above), and that's where most of your spend goes: every word the model writes is output.
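To make the table concrete, the per-request arithmetic is a one-liner. The prices below are the illustrative figures from the table above; plug in whatever your provider actually charges:

```python
# Cost of a single API call, given per-million-token prices.
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Price of one request in USD."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 2,000-token prompt with a 1,000-token reply at the table's
# Claude Sonnet 4.6 prices ($3 in, $15 out):
print(f"${call_cost(2000, 1000, 3.00, 15.00):.3f}")  # about $0.021
```

Pennies per request, which is exactly why the bill sneaks up on you: it only looks small until you multiply by volume.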
What That Looks Like in Practice
A moderate developer use case, say a coding assistant pushing 10 million tokens per month, looks roughly like this at the prices above (the range spans a 50/50 input/output split up to all-output):
- Claude Sonnet 4.6: ~$90–150/month
- GPT-4o: ~$60–100/month
- Gemini 2.5 Pro: ~$70–120/month
For a side project, that’s real money. For a small team? It adds up fast. For high-volume production workloads — internal tooling, document pipelines, automated review systems — API costs often become the dominant infrastructure expense.
The Self-Hosting Math
An RTX 4090 costs about $1,600. Running a quantized Llama 3.1 70B on it at high utilization, your marginal cost per million tokens (electricity plus hardware depreciation) works out to well under a dollar, roughly an order of magnitude below a premium cloud API.

At 100 million tokens per month: cloud APIs cost roughly $600–1,500 at the prices above. Self-hosted, the same volume costs roughly $50–100 in electricity and depreciation.
Breakeven point (counting engineering time as well as dollars):
- Under 2M tokens/day → cloud APIs are likely cheaper (no idle infrastructure cost)
- 2–5M tokens/day → roughly equivalent once you factor in maintenance time
- Over 5M tokens/day → self-hosting is dramatically cheaper
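To see where the crossover lands for your own numbers, here is a minimal sketch. Every figure is an assumption pulled from the discussion above (Claude Sonnet 4.6 prices, a $1,600 GPU amortized over three years, ~$40/month of electricity); swap in your own:

```python
# Rough monthly cost comparison: cloud API vs. self-hosted GPU.
def cloud_cost(tokens_per_month: float, in_price: float = 3.00,
               out_price: float = 15.00, output_ratio: float = 0.5) -> float:
    """Monthly cloud spend in USD for a given token volume and an
    assumed input/output split."""
    millions = tokens_per_month / 1_000_000
    blended = in_price * (1 - output_ratio) + out_price * output_ratio
    return millions * blended

def self_hosted_cost(hardware_usd: float = 1600, amortize_months: int = 36,
                     electricity_usd: float = 40) -> float:
    """Monthly self-hosted cost: hardware depreciation plus power.
    Roughly flat, since it barely depends on token volume."""
    return hardware_usd / amortize_months + electricity_usd

for tokens in (10e6, 60e6, 150e6):
    print(f"{tokens / 1e6:>4.0f}M tok/mo: cloud ${cloud_cost(tokens):>7,.0f} "
          f"vs self-hosted ${self_hosted_cost():,.0f}")
```

The self-hosted line is essentially a flat monthly fee, which is the whole story: once volume pushes the cloud line above it, every additional token is nearly free locally.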
If you’re building anything beyond occasional queries, the math starts tilting toward local surprisingly quickly.
Honest caveat: Don’t forget your time. Getting a local LLM running is easy (seriously, 5 minutes). Keeping it running in production, monitoring it, updating models — that takes real engineering effort. For small teams without DevOps capacity, APIs stay cheaper even at medium scale.
🔒 3. The Privacy Problem
This one is subtler, and more important for some use cases than others.
What Happens to Your Prompts?
Consumer products (ChatGPT Plus, Claude Pro, Gemini Advanced): Training is typically enabled by default. You need to actively opt out, and opt-out only applies going forward. Your past conversations may already have been used.
APIs are better, but not zero-risk:
- OpenAI API: not used for training by default; data retained 30 days for abuse monitoring
- Anthropic API: not used for training; retained 7 days
- Both offer “Zero Data Retention” agreements for enterprise customers
The practical question is: are you comfortable with your prompts sitting on US servers, even briefly, even if they’re not used for training?
For many use cases, the answer is “sure, fine.” For others it’s not:
- Healthcare: Sending patient information to a third-party US-hosted API can violate HIPAA data handling requirements
- Legal: Client communications, contract details, case strategy — subject to attorney-client privilege and confidentiality
- Finance: Trading research, internal financial projections, M&A details — heavily regulated
- European businesses: GDPR requires that personal data either stays in the EU or is sent somewhere with adequate protections. Sending it to a US API via standard terms is a grey area, and EU AI Act enforcement (August 2026) makes this harder to ignore
The Only True Guarantee
If the model runs on your machine, your data never leaves your machine. No terms of service, no data retention policy, no regional compliance issue. It’s a simple guarantee that no cloud provider can match.
For most developers building personal tools, this doesn’t matter much. For anyone handling sensitive data — or building tools for clients who do — it often matters a lot.
⚡ 4. Control, Reliability, and Freedom
Cost and privacy get the attention, but there’s a third category: control.
No Rate Limits
Cloud APIs throttle you. When you’re running a batch job at 2am, or your product suddenly gets popular, or you’re testing something intensively — rate limits kick in and slow you down. With a local model, the only limit is your hardware.
No Outages
OpenAI, Anthropic, and Google all have status pages. They all have incidents. If your product depends on their API, their outage is your outage.
No Silent Model Changes
Cloud providers update their models regularly. Sometimes gpt-4o today behaves differently than gpt-4o three months ago — same name, different model, different behavior. This breaks evals, changes output formats, surprises users.
With a local model, the file on your disk doesn’t change unless you change it. You can pin to an exact version forever.
Works Offline
Plane, train, remote location, network outage, air-gapped environment — a local model doesn’t care.
🧠 5. But Is It Actually Good Enough?
Fair question. A year ago the honest answer was “for some things.” Today it’s closer to “for most things.”
Coding — The Gap Has Closed
| Benchmark | Cloud leader | Best open-source | Gap |
|---|---|---|---|
| HumanEval (coding) | GPT-4o: 80.5% | DeepSeek-Coder-V2: 82.6% | None — open source leads |
| SWE-bench (real GitHub issues) | Claude: 80.9% | DeepSeek: 78% | Small |
| General reasoning (MMLU) | GPT-4o: 88.7% | Llama 3.1 70B: 86.0% | Small |
| Complex math | GPT-4o: leading | Llama 3.1 70B: behind | Noticeable |
For everyday coding tasks — completing functions, writing tests, explaining code, generating documentation — local models running at Q4 quantization are genuinely competitive with cloud APIs. DeepSeek-Coder-V2 benchmarks above GPT-4o on HumanEval, and its 16B Lite variant runs on a single consumer GPU.
Where Cloud Still Leads
- Complex multi-step reasoning — frontier models still have an edge on hard math and science problems
- Very long context — 128K+ context with consistent quality is still cloud’s strength
- Multimodal tasks — understanding images and documents is more mature in cloud models
If your use case is: write code, summarize documents, answer questions, draft emails, build internal tools — a local 70B model will surprise you. If your use case is: solve graduate-level math problems or analyze dozens of images per request — cloud still has the edge.
💻 6. What Hardware Do You Actually Need?
Less than you think.
The key resource is VRAM (video RAM on your GPU). That’s what limits which models you can run and how fast they go. Part 3 of this series goes deep on quantization and VRAM math — but here’s the practical summary:
| What you have | What you can run | Quality |
|---|---|---|
| Any Mac M1/M2/M3/M4, 16GB | 13B models, comfortably | Great |
| Any Mac M1/M2/M3/M4, 8GB | 7B models | Good |
| Gaming PC, RTX 3060 12GB | 13B at Q4, 7B easily | Great |
| Gaming PC, RTX 4070 Ti 16GB | 30B at Q4 | Great |
| Gaming PC, RTX 4090 24GB | 70B at aggressive 2–3-bit quants; 30B at Q4 easily | Excellent |
| CPU only, no GPU | 7B models, slowly | Usable |
Apple Silicon is a special case. Unlike PCs, where the GPU has its own dedicated VRAM pool, Apple's M-series chips use unified memory: one pool shared between CPU, GPU, and Neural Engine. A 16GB MacBook Pro can therefore devote most of its 16GB to model weights (macOS reserves a slice for the system, so not quite the full amount), with no separate VRAM limit. It's one of the best consumer options for local LLMs today.
The honest minimum: A 7B model at Q4 quantization runs on almost anything built in the last 5 years. It won’t run fast on a laptop CPU, but it runs. For a meaningful experience, aim for 8GB+ of VRAM or an Apple Silicon Mac with 16GB.
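The VRAM rule of thumb behind that table fits in a few lines. The bits-per-weight and overhead figures below are rough assumptions for illustration (Part 3 does the real math), not exact sizes for any specific model file:

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
def vram_gb(params_billions: float, bits_per_weight: float = 4.5,
            overhead: float = 1.15) -> float:
    """Approximate memory footprint in GB: weights at ~Q4
    (about 4.5 bits/weight including quantization metadata)
    plus ~15% headroom for KV cache and activations."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for size in (7, 13, 30, 70):
    print(f"{size:>3}B at ~Q4: roughly {vram_gb(size):.0f} GB")
```

Run it and you'll see why 70B at a straight Q4 wants 40+ GB, and why fitting it on a 24GB card takes more aggressive quantization or partial CPU offload.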
🧰 7. The 4 Tools You’ll Encounter
The ecosystem has converged around four main options. We’ll go deep on each in Part 2, but here’s the lay of the land:
Ollama — One command downloads and runs any model. OpenAI-compatible API built in. The right starting point for almost everyone.
LM Studio — Ollama-level simplicity with a polished desktop UI. Great for people who prefer a GUI over a terminal.
llama.cpp — The C++ engine that powers most local inference under the hood. Maximum portability, CPU-friendly, runs on anything. More setup, but the most flexible.
vLLM — Built for production. Designed to serve many concurrent users with maximum throughput. Overkill for personal use; exactly right for team deployments.
🚀 8. Your First 5 Minutes
Theory aside — let’s run something.
Install Ollama
```shell
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com
```
Run Your First Model
```shell
# Llama 3.2 3B — small, fast, runs on almost any hardware (~2 GB)
ollama run llama3.2:3b
```
Ollama downloads the model automatically on first run. Then you get an interactive prompt:
>>> What's the difference between a list and a tuple in Python?
Great question! The key differences are:
1. **Mutability** — lists are mutable (you can change them after creation),
tuples are immutable (fixed once created).
2. **Syntax** — lists use square brackets [1, 2, 3],
tuples use parentheses (1, 2, 3).
3. **Performance** — tuples are slightly faster to iterate over
and use less memory.
4. **Use case** — use lists when data will change;
use tuples for fixed data like coordinates or config values.
That’s a real local LLM running on your hardware. No API key, no billing, no data leaving your machine.
Or Use the API
If you want to integrate with existing code:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required field, value ignored locally
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Write a Python function to flatten a nested list."}],
)

print(response.choices[0].message.content)
```
This is the standard OpenAI Python client, pointed at a local server instead of api.openai.com. Any code that works with the OpenAI API works here with a one-line URL change.
🎯 Is Self-Hosting Right for You?
Yes, if:
- You’re building something that touches sensitive data
- You’re running high token volumes and the bill is growing
- You want a stable, offline-capable development environment
- You have a gaming PC or Apple Silicon Mac sitting idle
Not yet, if:
- You need cutting-edge reasoning or vision capabilities
- You’re a small team with no DevOps capacity and low token volumes
- You need 100K+ token context windows reliably
The good news: you don’t have to choose permanently. Many developers use local models for development and testing, then route production traffic to a cloud API for the tasks where quality matters most. The OpenAI-compatible API standard makes switching between local and cloud trivially easy — same code, different URL.
📚 What’s Next
This series walks through everything you need to go from curious to confident:
- Part 1: Why Self-Host? ← you are here
- Part 2: Framework Showdown — Ollama, LM Studio, llama.cpp, vLLM (coming soon)
- Part 3: Quantization Explained — How a 70B Model Fits on Your Laptop
- Part 4: Open Source Model Landscape — Benchmarks and Picking the Right One (coming soon)
If you want to understand why a 70B model fits on a 24GB GPU, or what “Q4_K_M” means on a Hugging Face download page — jump to Part 3 now. Otherwise, Part 2 will walk you through the four main tools in detail.
Questions or corrections? Leave a comment below, or find me on GitHub at @scrowten.