I’ve been tinkering with LiteLLM at home, and pairing it with my OpenWebUI setup has unlocked something surprisingly practical: real budgets I can enforce automatically so there are no surprise bills. LiteLLM gives me one endpoint for many providers, per-key rate limits that fit each person in my household, and a clean way to bring local models into the same flow. In this post I break down how it works, why it matters for home-brew projects or small teams, and how to deploy it with Docker, end to end. I’ll also show simple recipes for monthly caps, per-key tokens-per-minute or requests-per-minute, and routing that prefers cheap or local models first.
Paired with my OpenWebUI instance, LiteLLM lets me set a monthly budget for AI usage, share access with my family, and stop worrying about overpaying. OpenWebUI talks to LiteLLM through an OpenAI-compatible (or Anthropic) endpoint, so everyone uses one URL and their own API key (or one shared key with limits). If someone hits their limit, LiteLLM blocks further paid usage and can route them to a local model so the chat keeps working without new spend.
What is LiteLLM?
LiteLLM is a lightweight proxy with an OpenAI-compatible API on the front and many providers on the back. You define providers and local deployments in a config file and expose a single URL (there is a quick request example right after the list below). LiteLLM can:
- Enforce per-key budgets that reset on a schedule, with requests failing once a cap is reached. Postgres is used to store keys, spend logs, and budget settings.
- Unify multiple providers and local models under one endpoint, including OpenAI, Azure, Anthropic, Bedrock, and Ollama for local models.
- Apply per-key RPM and TPM limits, so each API key gets its own speed and size envelope. These checks run before routing and are tracked across instances with Redis when needed.
- Route intelligently across deployments and even enforce budgets per provider, which helps you keep OpenAI or Anthropic spend inside limits while still failing over to other options if allowed.
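To make the "single URL" part concrete, here is roughly what a call against the proxy looks like from any OpenAI-compatible client. Treat it as a sketch: it assumes the proxy is running locally on port 4000 (as in the Docker setup later in this post), that sk-my-virtual-key is a placeholder for a virtual key minted by LiteLLM, and that claude-sonnet-4 is one of the friendly model names defined in the config.

# Standard OpenAI-style chat completion, sent to LiteLLM instead of a provider
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-my-virtual-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4",
    "messages": [{"role": "user", "content": "Say hi in one sentence."}]
  }'

Swap the model name for a local Ollama model or another provider and the client-side call stays the same; only the config behind the proxy changes.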
Examples
- Family access with a hard cap
One LiteLLM endpoint, everyone gets their own API key. I set a small monthly budget and per-key RPM so kids (or wife in my case) cannot spam requests. When the cap is reached, LiteLLM stops paid calls and we continue on a local model behind the same endpoint.
- Side project with “cheap-first” routing
Local Ollama models handle drafts. If latency spikes or prompts need better reasoning, the request can overflow to a cloud model, but only while the provider’s budget is not exhausted.
- Small-team gateway
Each teammate’s key has different TPM and RPM limits. If one person runs heavy jobs, it does not starve everyone else. Budgets and usage stats live in Postgres so you can audit who spent what.
Metaphors
You know I love using metaphors; they help me wrap my brain around a concept.
Think of LiteLLM like a power strip with individual fuses. Every plug is a provider or a local model, but the strip decides where to draw power from and trips the fuse for any outlet that exceeds its budget or rate. You still get light in the room, and you never melt the wiring.
Practical Use Cases
- One endpoint for the whole house with personal keys and a shared monthly cap.
- Unified dev endpoint that points your CLI tools, notebooks, and apps to the same URL while you swap providers behind the scenes.
- Cost-aware experiments where you cap OpenAI usage daily and fall back to local models if you cross the threshold.
How to Set Up LiteLLM (Docker)
Below is a compact, reproducible home-lab deployment that gives you:
- LiteLLM proxy on port 4000 (I placed a load balancer in front to add SSL and expose it on 443 for internal use only, smart!)
- Postgres for keys, budgets, and usage
- Redis for cross-instance TPM/RPM and provider-budget tracking
- Ollama for local models
- OpenWebUI that talks to LiteLLM through an OpenAI-compatible URL
Step 1: Create docker-compose.yaml
services:
  db:
    image: postgres:16
    environment:
      - POSTGRES_DB=litellm
      - POSTGRES_USER=llmproxy
      - POSTGRES_PASSWORD=change-me
    volumes:
      - db_data:/var/lib/postgresql/data

  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - ollama_data:/root/.ollama

  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    depends_on: [db, ollama]
    ports: ["4000:4000"]
    environment:
      - LITELLM_MASTER_KEY=sk-admin-master-change-me
      - UI_USERNAME=admin
      - UI_PASSWORD=password
      - STORE_MODEL_IN_DB=true
      - DATABASE_URL=postgresql://llmproxy:change-me@db:5432/litellm
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    command: ["--config", "/app/config.yaml", "--port", "4000"]

volumes:
  db_data:
  ollama_data:
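The compose file above covers Postgres, Ollama, and LiteLLM. The checklist also promised Redis and OpenWebUI, so here is a sketch of the extra blocks to add under services: (the port mapping, volume name, and key are assumptions you should adapt, and in practice you would hand OpenWebUI a virtual key rather than the master key):

  redis:
    image: redis:7-alpine

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    depends_on: [litellm]
    ports: ["3000:8080"]
    environment:
      # Point OpenWebUI at LiteLLM's OpenAI-compatible endpoint
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      # Replace with a virtual key once you have created one
      - OPENAI_API_KEY=sk-admin-master-change-me
    volumes:
      - openwebui_data:/app/backend/data

Remember to add openwebui_data under the top-level volumes block, and to point LiteLLM at Redis in config.yaml (router_settings, shown in the troubleshooting section below) so the rate-limit counters are actually shared.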
Step 2: Create config.yaml file
A bit annoying if you ask me, because I want to manage my models through the database anyway, but it seems to be required: the app won’t start unless at least one model is specified in the config.
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: change-me

litellm_settings:
  set_verbose: true
  drop_params: true
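If you also want Ollama behind the same endpoint, you can append a local model to model_list. A minimal sketch, assuming you keep the ollama service name from the compose file and that llama3 is the model you pull (the name is just an example):

  # Appended to the existing model_list
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

Pull the model first with docker compose exec ollama ollama pull llama3, and llama3-local will show up next to the paid models behind the same URL.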
Step 3: Start your Docker containers
docker compose up -d
Step 4: Access LiteLLM through the browser
Go to http://localhost:4000 for the main page, or go to http://localhost:4000/ui and log in with the credentials specified in your docker-compose.yaml to access the dashboard, where you can create keys, connect models, and all the fun stuff!
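Creating keys works fine through the UI, but the API is handy for scripting. A hedged example of minting a per-person key with a monthly cap and its own rate limits, using the master key from docker-compose.yaml (the alias and numbers are purely illustrative):

# max_budget is in USD; budget_duration resets the spend counter on that schedule
curl http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-admin-master-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "kid-1",
    "max_budget": 5,
    "budget_duration": "30d",
    "rpm_limit": 10,
    "tpm_limit": 20000
  }'

The response contains the generated key, which is what goes into OpenWebUI or any other client.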
Extras for home-brew and small enterprise
Feature checklist to highlight
- Virtual keys with per-key budgets and resets
Perfect for family sharing or per-team projects. Keys can be created, rotated, and deleted via API, and usage is tracked in Postgres.
- Per-key RPM and TPM
Gives each user a tailored ceiling. Great for throttling scripts or classroom setups.
- Provider-level budgets
Cap OpenAI or Anthropic spend per day or month and auto-reset. LiteLLM will skip a provider that is over budget and choose another if available (see the sketch after this list).
- Wildcard routing and provider unification
Proxy entire providers without enumerating every model, and keep clients calling the same friendly names.
- Local models via Ollama
Private, fast iteration and a safety net when budgets are exhausted.
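For those provider-level budgets, recent LiteLLM versions accept a provider_budget_config block under router_settings; it relies on Redis to track spend across instances. A sketch with made-up numbers (check the provider budget routing docs for your version):

router_settings:
  provider_budget_config:
    openai:
      budget_limit: 5      # USD
      time_period: 30d     # spend counter resets every 30 days
    anthropic:
      budget_limit: 10
      time_period: 30d

When a provider is over its budget, the router skips its deployments and picks another that still has headroom, provided your model_list offers an alternative.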
How to apply it beyond the basics
- Per-person keys
Generate a key per family member or teammate and set different rpm_limit and tpm_limit values. This also keeps spend attribution clean in the DB.
- Cheap-first routing
Put local or cheaper models first in model_list, then define paid fallbacks. Use provider budgets to make sure overflows never blow your cap.
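A sketch of what that cheap-first setup can look like in config.yaml, reusing the llama3-local and claude-sonnet-4 names from earlier; fallbacks trigger when the primary deployment errors out or is unavailable:

model_list:
  # Cheap/local first: this is the name clients actually call
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434
  # Paid model, used as the fallback target
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: change-me

router_settings:
  # If llama3-local fails, retry the request on claude-sonnet-4
  fallbacks:
    - llama3-local: ["claude-sonnet-4"]

Combine this with the provider budgets above and the overflow path dries up automatically once the paid provider hits its cap.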
Quick troubleshooting
- Budget not enforced
Confirm you are running with a Postgres DATABASE_URL and that the key was created with a max_budget and a budget_duration if you need monthly resets.
- Rate limits inconsistent across replicas
Add Redis and set router_settings.redis_host and redis_port. This synchronizes per-minute counters across LiteLLM instances.
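A minimal sketch of that Redis wiring in config.yaml, assuming the redis service name from the compose stack and no password; these keys sit alongside any other router_settings entries (fallbacks, provider budgets) rather than in a separate block:

router_settings:
  redis_host: redis      # docker-compose service name
  redis_port: 6379
  # redis_password: change-me   # only if your Redis requires auth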
My Conclusion
LiteLLM turns a messy mix of cloud and local models into a single, OpenAI-compatible endpoint that you can safely share with family or a small team. You can mint per-user API keys, give each one its own requests-per-minute or tokens-per-minute limits, and set real monthly budgets that the proxy enforces, so paid usage stops the moment a cap is hit. Pairing it with OpenWebUI keeps the experience simple, because OpenWebUI already speaks to OpenAI-compatible backends, and you can bring in local models like Ollama under the same roof for a no-cost fallback when the budget is exhausted.
If you want predictable AI at home or in a small office, deploy LiteLLM with Docker, back it with Postgres for key and spend tracking, add Redis if you plan to scale out, and hand out virtual keys with per-key limits. Start everyone on the local model for day-to-day tasks, allow overflow to a paid provider while there is budget left, and let provider or key budgets draw the line for you. The result is a single endpoint that grows with your needs while keeping cost and traffic firmly under control.