Architecture¶
This page explains how m-gpux works internally — from CLI structure to the LLM API server proxy layer.
Project Structure¶
m_gpux/
├── __init__.py # Package version
├── main.py # CLI entrypoint, command registration, welcome screen, top-level stop
└── commands/
├── __init__.py
├── account.py # Profile CRUD (add/list/switch/remove)
├── billing.py # Usage aggregation and billing dashboard links
├── hub.py # Interactive GPU runtime launcher (Jupyter/script/shell)
├── serve.py # LLM API deployment, auth proxy, API key management
├── load.py # Live GPU hardware metrics probe
└── _metrics_snippet.py # GPU metrics code injected into generated Modal scripts
How it works¶
CLI Framework¶
m-gpux is built on Typer with Rich for terminal output. The entrypoint is m_gpux.main:app, registered as the m-gpux console script in pyproject.toml.
Each command module (account, billing, hub, serve, load) defines its own typer.Typer() app, which is attached to the main app via app.add_typer().
Profile Management¶
Profiles are stored in ~/.modal.toml using the tomlkit library. Each section represents a profile:
[personal]
token_id = "ak-..."
token_secret = "as-..."
[work]
token_id = "ak-..."
token_secret = "as-..."
When switching profiles, m-gpux calls modal profile activate <name> to set the active profile for subsequent Modal CLI commands.
Script Generation (Hub & Serve)¶
Both hub and serve deploy follow the same pattern:
- Collect parameters via interactive prompts (GPU, action, model, etc.)
- Generate a Python script (
modal_runner.py) from a template with string substitution - Show the script for review (syntax-highlighted with Rich)
- Execute via
modal run(hub) ormodal deploy(serve)
The generated script is fully transparent — users can edit it before execution.
LLM API Server Architecture¶
The serve deploy command creates a unique two-process architecture inside a single Modal container:
┌─────────────────────────────────────────────────────────────┐
│ Modal Container (GPU) │
│ │
│ ┌──────────────────────┐ ┌───────────────────────────┐ │
│ │ Auth Proxy (8000) │───▶│ vLLM Backend (8001) │ │
│ │ FastAPI + uvicorn │ │ OpenAI-compatible API │ │
│ │ │ │ │ │
│ │ • Bearer token auth │ │ • /v1/chat/completions │ │
│ │ • 401/403 responses │ │ • /v1/models │ │
│ │ • Streaming proxy │ │ • /v1/completions │ │
│ │ • /health endpoint │ │ • Model loading/serving │ │
│ └──────────────────────┘ └───────────────────────────┘ │
│ ▲ │
│ │ Port 8000 (Modal web_server) │
└───────────┼─────────────────────────────────────────────────┘
│
┌──────┴──────┐
│ Internet │
│ Clients │
└─────────────┘
Why two processes?¶
Modal's @modal.web_server(port=8000) requires a server to be listening on the specified port quickly (startup timeout). vLLM takes several minutes to load large models. By starting FastAPI first (instant) and vLLM second (background), the container passes Modal's health probe immediately.
Auth Flow¶
Client Request
│
▼
┌─ Auth Middleware ──────────────────────────────────┐
│ 1. Is path /health, /, /docs, /openapi.json, │
│ /stats? │
│ → YES: Pass through (no auth) │
│ → NO: Check Authorization header │
│ │
│ 2. No "Bearer" prefix? │
│ → 401 {"error": "Missing API key..."} │
│ │
│ 3. Key doesn't match? │
│ → 403 {"error": "Invalid API key"} │
│ │
│ 4. Too many concurrent requests (>150)? │
│ → 429 {"error": "Too many concurrent..."} │
│ │
│ 5. Valid key + capacity available │
│ → Proxy to vLLM on localhost:8001 (with retry) │
└────────────────────────────────────────────────────┘
Proxy Resilience¶
The auth proxy includes several features to prevent request loss under load:
| Feature | Details |
|---|---|
| Retry with backoff | 3 attempts with 1s→2s→4s delays on ConnectError, TimeoutException, RemoteProtocolError, ReadError |
| Backpressure (429) | Tracks in-flight requests; rejects with 429 when >150 concurrent |
| Internal streaming | Non-stream requests use httpx.stream() internally to collect chunks — keeps connection alive during long inference (prevents timeout on 5+ min completions) |
| Stream error recovery | Streaming responses catch mid-stream errors and yield a proper SSE error event instead of leaving truncated responses (prevents TransferEncodingError) |
| Metrics tracking | Every request's latency, status, token usage tracked for /stats and dashboard |
Streaming¶
For streaming requests ("stream": true), the proxy uses httpx.AsyncClient.stream() to forward vLLM's SSE chunks directly to the client in real-time. This means:
- Time-to-first-token (TTFT) is the same as hitting vLLM directly
- No buffering — chunks are yielded as they arrive
text/event-streamcontent type is preserved
Model Caching¶
Two Modal Volumes persist model weights across deployments and cold starts:
| Volume | Mount Path | Contents |
|---|---|---|
m-gpux-hf-cache |
/root/.cache/huggingface |
HuggingFace model downloads |
m-gpux-vllm-cache |
/root/.cache/vllm |
vLLM compiled model artifacts |
First deploy: downloads weights from HuggingFace (5-15 min for large models). Subsequent deploys: weights are already cached (cold start ~1-3 min).
Container Lifecycle¶
| Setting | Behavior |
|---|---|
min_containers=0 |
Scale to zero when idle. Cold start on first request (~3-5 min). |
min_containers=1 |
One container always running. Response time ~0.9s. Costs GPU time continuously. |
scaledown_window=5min |
After last request, container stays warm for 5 minutes before scaling down. |
timeout=24h |
Max container lifetime before forced restart. |
max_inputs=200 |
Up to 200 concurrent requests per container (via @modal.concurrent). |
API Key Storage¶
~/.m-gpux/
└── api_keys.json
[
{
"name": "production",
"key": "sk-mgpux-8b37a191eb046c68...",
"created": "2026-04-12T10:30:00",
"active": true
}
]
Keys are generated with secrets.token_hex(24) (48 hex characters), prefixed with sk-mgpux-.
Only the first active key is embedded in the deployed Modal script. To use a different key, revoke the old one and redeploy.
App Discovery (stop command)¶
The top-level m-gpux stop command finds running apps by:
- Running
modal app list --jsonfor each profile - Filtering apps where
descriptionstarts withm-gpuxandstateisdeployedorrunning - Presenting them in a selection table
- Calling
modal app stop <app_id>on selected apps
The --all flag scans every profile in ~/.modal.toml. Without it, only the current active profile is scanned.
GPU Metrics¶
The _metrics_snippet.py module provides GPU monitoring functions that are injected into generated Modal scripts. These functions use nvidia-smi to report:
- GPU utilization percentage
- VRAM usage (used / total)
- GPU temperature
- Power consumption
The load probe command connects to a running container to display these metrics live.