Architecture¶
This page explains how m-gpux is put together: the CLI/plugin system, the generated Modal scripts, the Web Bash terminal path, and the LLM API server proxy layer.
Project Structure¶
m-gpux is organized around a small core plus feature plugins. The CLI entrypoint discovers plugins and attaches their Typer commands to the root app.
m_gpux/
    __init__.py
    main.py              # CLI entrypoint and plugin registration
    core/
        console.py       # shared Rich console
        gpus.py          # GPU and CPU catalogs
        metrics.py       # metrics snippet injected into generated scripts
        plugin.py        # PluginBase, registry, entry-point discovery
        profiles.py      # Modal profile loading, switching, selection
        runner.py        # generated script execution helpers
        state.py         # local sessions, presets, and other state files
        ui.py            # interactive arrow-key menu
    plugins/
        account/         # profile CRUD
        billing/         # usage reports and billing links
        compose/         # Docker Compose analysis, deployment, and sync
        dev/             # persistent Modal dev container command
        host/            # ASGI, WSGI, and static hosting
        hub/             # Jupyter, scripts, Web Bash, vLLM launchers
        info/            # CLI/package diagnostics
        load/            # GPU hardware probe
        preset/          # reusable workload presets
        serve/           # LLM API deployment and API key management
        sessions/        # local session tracking and workspace pull
        stop/            # discover and stop running apps
        video/           # text-to-video workflows
        vision/          # image-classification workflows
CLI Framework¶
The root command is m_gpux.main:app, registered in pyproject.toml as the m-gpux console script. The root app stays intentionally small: it loads built-in plugins, discovers third-party plugins through the m_gpux.plugins entry-point group, and lets each plugin register its own command tree.
This keeps the feature surface easy to extend:
- Core modules own shared behavior such as profiles, generated-script execution, GPU catalogs, and UI helpers.
- Plugins own user-facing workflows such as hub, serve, vision, host, and billing.
- Local state in ~/.m-gpux tracks sessions, API keys, and workload presets.
- Third-party packages can add commands without editing the root CLI.
Profile Management¶
Profiles are stored in ~/.modal.toml. Each section represents one Modal identity:
[personal]
token_id = "ak-..."
token_secret = "as-..."
[work]
token_id = "ak-..."
token_secret = "as-..."
When a workflow needs a target account, m-gpux either uses the selected profile or activates it through:
modal profile activate <name>
The hub can also offer an automatic profile choice when enough billing information is available.
Generated Script Pattern¶
Most workflows follow the same transparent execution pattern:
- Collect parameters through prompts or command flags.
- Generate a modal_runner.py file from a Python template.
- Show the script for review before execution.
- Run it with modal run, or deploy it with modal deploy.
This pattern is used by hub, host, compose, vision, video, and serve. The generated script is intentionally editable so users can inspect Modal decorators, dependencies, timeout settings, volumes, and uploaded paths before committing to a run.
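The generate-then-review steps can be illustrated with a small template renderer. This is a sketch under stated assumptions: the template body, the render_runner helper, and the chosen parameters are invented for illustration, and the real templates in m-gpux are likely richer. The generated text is only syntax-checked here, so Modal does not need to be installed to try it.

```python
from string import Template

# Hypothetical, heavily simplified template for modal_runner.py.
RUNNER_TEMPLATE = Template('''\
import modal

app = modal.App("$app_name")


@app.function(gpu="$gpu", timeout=$timeout)
def run():
    print("running on $gpu")


@app.local_entrypoint()
def main():
    run.remote()
''')


def render_runner(app_name: str, gpu: str, timeout: int) -> str:
    """Fill the template with collected parameters and return the script text."""
    return RUNNER_TEMPLATE.substitute(app_name=app_name, gpu=gpu, timeout=timeout)


script = render_runner("m-gpux-demo", "T4", 3600)
# Syntax check only; this does not import or execute anything from Modal.
compile(script, "modal_runner.py", "exec")
print(script)  # the review step: show the script before running it
```

Because the result is an ordinary file on disk, the user can edit decorators or timeouts before passing it to modal run.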
Compose Deployment Architecture¶
compose adds a bridge layer between Docker Compose and Modal.
The workflow:
- Finds and parses a local Compose file.
- Detects services, commands, ports, environment references, and common infra images.
- Generates a temporary Modal app that either runs services as subprocesses in one container or provisions a VM-oriented container flow.
- Attaches a workspace volume so local project files and later sync operations can reuse the same working directory.
This keeps Compose as the local source of truth while still giving users a reviewable Modal script before launch.
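The detection step might look roughly like the following, assuming the Compose file has already been parsed into a dict (for example by a YAML library). The summarize_services helper and its output shape are illustrative assumptions. It handles only the short port syntax such as "8080:80".

```python
import re

# Matches the variable name in ${VAR} and ${VAR:-default} references.
ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)")


def summarize_services(compose: dict) -> dict:
    """Collect image, command, container ports, and env references per service."""
    summary = {}
    for name, svc in compose.get("services", {}).items():
        # "8080:80/tcp" -> 80; a bare "80" is also accepted.
        ports = [
            int(str(mapping).split(":")[-1].split("/")[0])
            for mapping in svc.get("ports", [])
        ]
        env = svc.get("environment", {})
        values = env.values() if isinstance(env, dict) else env
        refs = sorted({ref for v in values for ref in ENV_REF.findall(str(v))})
        summary[name] = {
            "image": svc.get("image"),
            "command": svc.get("command"),
            "ports": ports,
            "env_refs": refs,
        }
    return summary


compose = {
    "services": {
        "web": {
            "image": "python:3.12",
            "command": "uvicorn app:app",
            "ports": ["8000:8000"],
            "environment": {"DB_URL": "${DATABASE_URL}"},
        },
        "db": {
            "image": "postgres:16",
            "ports": ["5432"],
            "environment": ["POSTGRES_PASSWORD=${PG_PASS:-dev}"],
        },
    }
}
summary = summarize_services(compose)
```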
Web Bash Terminal¶
The Web Bash shell uses ttyd as the browser terminal bridge. The default path is deliberately close to VS Code's integrated terminal.
The important choices are:
- ttyd launches bash --login directly by default.
- The prompt is simple ASCII, not a powerline or emoji-heavy prompt.
- Browser scrollback is enabled with a larger 10,000-line buffer and wheel scrolling.
- /workspace is backed by a Modal Volume for Jupyter, Web Bash, and interactive terminals, with auto-commit roughly every 20 seconds.
- tmux is still installed and configured, but users start it manually when they need detachable sessions.
- Terminal rendering options favor stable glyph layout and repaint behavior.
This reduces the common browser-terminal problems where tmux status lines, unicode prompt glyphs, or font-width mismatches make characters appear overwritten.
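A launch command in the spirit of those choices could be assembled as below. This is a guess at the shape, not the command m-gpux actually runs: the port number and the helper name are assumptions, and only the -p, -W, and -t options of ttyd are used. The scrollback value is passed through as an xterm.js client option.

```python
def build_ttyd_command(port=8000, scrollback=10_000, shell=("bash", "--login")):
    """Return an argv list for ttyd; nothing is executed here."""
    return [
        "ttyd",
        "-p", str(port),                    # listen port (assumed value)
        "-W",                               # let the browser send input
        "-t", f"scrollback={scrollback}",   # xterm.js scrollback buffer size
        *shell,                             # the shell started directly, no tmux
    ]


command = build_ttyd_command()
```

Starting bash directly, rather than wrapping it in tmux, is what keeps status lines and extra redraw layers out of the default path.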
Vision Training Architecture¶
vision train packages a local dataset into the Modal container with Image.add_local_dir, then runs a PyTorch image-classification training loop on the selected GPU.
The generated training app includes:
- Dataset layout validation for train/, val/, optional test/, or a single folder of class subdirectories.
- TorchVision model initialization with optional pretrained weights.
- Configurable optimizer, scheduler, augmentation, mixed precision, early stopping, and gradient accumulation.
- Persistent checkpoint and metrics storage in a Modal Volume.
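The layout check in the first item can be sketched with pathlib. The detect_layout function and its return shape are illustrative assumptions. The generated app may apply stricter rules, such as checking that val/ has the same classes as train/.

```python
import tempfile
from pathlib import Path


def detect_layout(root: Path) -> dict:
    """Classify a dataset folder as a train/val(/test) split or a single folder."""
    subdirs = sorted(p.name for p in root.iterdir() if p.is_dir())
    if "train" in subdirs and "val" in subdirs:
        splits = ["train", "val"] + (["test"] if "test" in subdirs else [])
        classes = sorted(p.name for p in (root / "train").iterdir() if p.is_dir())
        kind = "split"
    else:
        # Every subdirectory is treated as one class.
        splits, classes, kind = [], subdirs, "single"
    if not classes:
        raise ValueError(f"no class subdirectories found under {root}")
    return {"kind": kind, "splits": splits, "classes": classes}


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for split in ("train", "val"):
        for cls in ("cat", "dog"):
            (root / split / cls).mkdir(parents=True)
    layout = detect_layout(root)
```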
Downstream commands reuse the same artifacts:
- vision predict loads checkpoints and class labels to classify new local images.
- vision evaluate loads checkpoints and computes accuracy, top-k accuracy, confusion matrix, macro F1, and per-class metrics.
- vision export writes deployment artifacts such as ONNX, TorchScript, labels, and export summaries.
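To make the evaluation metrics concrete, here is a dependency-free sketch that derives accuracy, per-class F1, and macro F1 from a confusion matrix. It illustrates the standard definitions only. The actual command presumably computes these with PyTorch or a metrics library, and the function names are assumptions.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal; confusion[i][j] = true i, predicted j."""
    total = sum(map(sum, confusion))
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def per_class_f1(confusion):
    """F1 for each class, using F1 = 2*TP / (2*TP + FP + FN)."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


confusion = [[8, 2], [1, 9]]  # two classes, 20 samples
scores = per_class_f1(confusion)
```

Macro F1 weights every class equally, so a small class with poor recall lowers the score more than it lowers plain accuracy.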
LLM API Server Architecture¶
serve deploy creates a two-process architecture inside one Modal GPU container. The proxy starts quickly on port 8000, which satisfies Modal's web server startup probe. vLLM starts in the background on port 8001 and can take longer to load model weights.
Why Two Processes?¶
Large models may take minutes to load. Modal's @modal.web_server(port=8000) expects a listening web server within the startup timeout. If vLLM owned the public port directly, slow model loading could look like a failed startup.
The split solves that:
- FastAPI starts immediately and exposes /health.
- vLLM loads in the background on localhost:8001.
- The proxy returns clear loading or retry behavior instead of losing requests.
- The public API gets auth, metrics, and backpressure before GPU inference work begins.
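The core idea, answering health checks immediately while a slow backend loads in the background, can be modelled without FastAPI or vLLM. In the sketch below, a thread stands in for the vLLM process. The BackendState class and the 503 status for requests that arrive during loading are assumptions made for illustration.

```python
import threading


class BackendState:
    """Tracks whether the slow backend (vLLM in the real system) is ready."""

    def __init__(self):
        self.ready = threading.Event()

    def start(self, load_fn):
        """Run the slow load in a background thread and return at once."""
        def _load():
            load_fn()
            self.ready.set()

        thread = threading.Thread(target=_load, daemon=True)
        thread.start()
        return thread

    def health(self):
        """Always answers, so a startup probe sees a live server."""
        return {"status": "ready" if self.ready.is_set() else "loading"}

    def handle(self, request):
        """Return a clear retry signal while loading, otherwise serve."""
        if not self.ready.is_set():
            return 503, {"error": "model loading, retry shortly"}
        return 200, {"echo": request}
```

The public port is owned by the part that starts fast, so slow weight loading is reported as a status rather than as a failed startup.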
Auth Flow¶
The proxy applies these gates before forwarding a request:
- Public paths such as /health, /docs, /openapi.json, and /stats can pass without an API key.
- API paths require an Authorization: Bearer ... header.
- Missing keys return 401.
- Invalid keys return 403.
- Excess concurrent requests return 429.
- Valid requests are proxied to vLLM with retry and streaming support.
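The gate order above maps naturally onto a single function. This is a sketch of the decision logic only. The function name, its parameters, and the use of None to mean "forward the request" are assumptions, and the real proxy implements this inside its web framework.

```python
import hmac

PUBLIC_PATHS = {"/health", "/docs", "/openapi.json", "/stats"}


def gate(path, headers, valid_keys, in_flight, max_in_flight):
    """Return an HTTP status to reject with, or None if the request may proceed."""
    if path in PUBLIC_PATHS:
        return None
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        return 401  # missing key
    # Constant-time comparison avoids leaking key prefixes through timing.
    if not any(hmac.compare_digest(token.encode(), k.encode()) for k in valid_keys):
        return 403  # invalid key
    if in_flight >= max_in_flight:
        return 429  # backpressure
    return None


keys = ["sk-mgpux-example"]  # placeholder key for the demonstration
```

Checking the key before the concurrency limit means unauthenticated callers never learn how busy the server is.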
Proxy Resilience¶
The auth proxy includes several behaviors to reduce request loss under load:
| Feature | Details |
|---|---|
| Retry with backoff | Retries transient connect, timeout, protocol, and read errors. |
| Backpressure | Tracks in-flight requests and rejects excess work with 429. |
| Internal streaming | Keeps long non-stream responses alive while collecting chunks from vLLM. |
| Stream error recovery | Converts mid-stream failures into structured SSE error events when possible. |
| Metrics tracking | Tracks latency, status, and token usage for dashboard and stats views. |
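The retry row can be illustrated with a generic exponential-backoff helper. The exception types, attempt count, and delays below are assumptions chosen for the sketch. The proxy's actual settings and HTTP client exceptions are not documented on this page.

```python
import time

# Stand-ins for transient connect and timeout failures.
TRANSIENT = (ConnectionError, TimeoutError)


def call_with_retry(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying transient errors with delays of 0.5s, 1s, 2s, ..."""
    for attempt in range(attempts):
        try:
            return fn()
        except TRANSIENT:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration: a call that fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("backend not accepting connections yet")
    return "ok"


result = call_with_retry(flaky, sleep=delays.append)
```

Passing sleep as a parameter keeps the helper testable without real waiting.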
Streaming¶
For streaming requests ("stream": true), the proxy forwards vLLM's server-sent events directly to the client.
That means:
- Time-to-first-token stays close to direct vLLM access.
- Chunks are yielded as they arrive.
- text/event-stream is preserved.
- Client SDKs can consume the endpoint like an OpenAI-compatible API.
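A pass-through relay of this kind is essentially a generator that yields each upstream chunk unchanged. The sketch below also shows one possible form of the stream error recovery described in the resilience table. The relay_sse name and the JSON shape of the error event are assumptions.

```python
import json


def relay_sse(upstream):
    """Yield upstream SSE chunks as they arrive; emit an error event on failure."""
    try:
        for chunk in upstream:
            yield chunk  # forwarded immediately, no buffering
    except Exception as exc:
        payload = json.dumps({"error": {"message": str(exc), "type": "upstream_error"}})
        yield f"data: {payload}\n\n".encode()


def broken_upstream():
    """Simulated vLLM stream that fails after two chunks."""
    yield b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    yield b'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    raise ConnectionError("reset")


events = list(relay_sse(broken_upstream()))
```

Because nothing is accumulated before yielding, the first token reaches the client as soon as the backend produces it.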
Model Caching¶
Two Modal Volumes persist model artifacts across deployments and cold starts:
| Volume | Mount Path | Contents |
|---|---|---|
| m-gpux-hf-cache | /root/.cache/huggingface | HuggingFace model downloads |
| m-gpux-vllm-cache | /root/.cache/vllm | vLLM compiled model artifacts |
The first deploy may spend several minutes downloading weights. Later deploys reuse the cached model data, so startup is usually much faster.
Container Lifecycle¶
| Setting | Behavior |
|---|---|
| min_containers=0 | Scale to zero when idle. First request after idle may cold start. |
| min_containers=1 | Keep one container warm for lower first-token latency. |
| scaledown_window=5min | Keep the container warm briefly after the last request. |
| timeout=24h | Maximum function lifetime before restart. |
| max_inputs=200 | Allow many concurrent inputs per container through Modal concurrency. |
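The table uses minutes and hours for readability, while Modal's function options take durations in seconds. The sketch below converts the table values into plain keyword dictionaries. How the generated script actually passes them to Modal's decorators is an assumption and is shown only in comments.

```python
from datetime import timedelta

# Options intended for @app.function(...); durations converted to seconds.
function_options = {
    "min_containers": 0,  # set to 1 to keep one container warm
    "scaledown_window": int(timedelta(minutes=5).total_seconds()),
    "timeout": int(timedelta(hours=24).total_seconds()),
}

# Option intended for Modal's input concurrency setting.
concurrency_options = {"max_inputs": 200}

# In a generated script these would likely be applied along the lines of:
#   @app.function(gpu=..., **function_options)
#   @modal.concurrent(**concurrency_options)
#   @modal.web_server(port=8000)
```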
API Key Storage¶
API keys are stored locally:
~/.m-gpux/
    api_keys.json
Example:
[
  {
    "name": "production",
    "key": "sk-mgpux-8b37a191eb046c68...",
    "created": "2026-04-12T10:30:00",
    "active": true
  }
]
Keys are generated with secrets.token_hex(24) and prefixed with sk-mgpux-.
Only the active key selected during serve deploy is embedded in the generated Modal script. If you rotate keys, redeploy the server.
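Key generation and active-key lookup can be sketched as follows. The record fields mirror the JSON example above, and secrets.token_hex(24) with the sk-mgpux- prefix comes from this page. The helper names are illustrative assumptions.

```python
import json
import secrets
from datetime import datetime

PREFIX = "sk-mgpux-"


def new_key_record(name: str) -> dict:
    """Create one api_keys.json entry; token_hex(24) yields 48 hex characters."""
    return {
        "name": name,
        "key": PREFIX + secrets.token_hex(24),
        "created": datetime.now().isoformat(timespec="seconds"),
        "active": True,
    }


def active_key(records: list):
    """Return the first active key, or None when no key is active."""
    for record in records:
        if record["active"]:
            return record["key"]
    return None


records = [new_key_record("production")]
serialized = json.dumps(records, indent=2)  # what would be written to disk
```

Since the active key is embedded in the generated script at deploy time, a rotated key takes effect only after the server is redeployed.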
App Discovery¶
m-gpux stop finds running apps by:
- Running modal app list --json for one profile or every configured profile.
- Filtering apps whose description starts with m-gpux and whose state is deployed or running.
- Presenting the matches in a selection table.
- Calling modal app stop <app_id> on selected apps.
Use m-gpux stop --all when you want to scan every profile in ~/.modal.toml.
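The filtering step could look like the sketch below. The JSON key names ("App ID", "Description", "State") and the sample records are assumptions about the shape of the modal app list --json output and should be checked against the installed Modal version.

```python
import json


def find_m_gpux_apps(app_list_json, prefix="m-gpux", states=("deployed", "running")):
    """Keep apps whose description starts with the prefix and whose state is live."""
    apps = json.loads(app_list_json)
    return [
        app
        for app in apps
        if str(app.get("Description", "")).startswith(prefix)
        and str(app.get("State", "")).lower() in states
    ]


# Invented sample data for the demonstration.
sample = json.dumps([
    {"App ID": "ap-111", "Description": "m-gpux-serve", "State": "deployed"},
    {"App ID": "ap-222", "Description": "m-gpux-jupyter", "State": "stopped"},
    {"App ID": "ap-333", "Description": "other-app", "State": "deployed"},
])
matches = find_m_gpux_apps(sample)
stop_commands = [["modal", "app", "stop", app["App ID"]] for app in matches]
```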
GPU Metrics¶
The shared metrics snippet uses nvidia-smi to report:
- GPU utilization percentage.
- VRAM usage.
- GPU temperature.
- Power consumption.
Generated Modal scripts can print one-time metrics at startup, and the load probe command can run a live hardware check for a selected GPU.
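One way to collect those four readings is a single nvidia-smi query in CSV form, parsed into a dictionary. The field list and helper names below are assumptions about how the snippet might be written. The parser expects numeric values and would need extra handling for GPUs that report [N/A] for power draw.

```python
QUERY_FIELDS = [
    "utilization.gpu",   # percent
    "memory.used",       # MiB
    "memory.total",      # MiB
    "temperature.gpu",   # degrees Celsius
    "power.draw",        # watts
]


def nvidia_smi_command():
    """Build the argv for a one-shot, unit-free CSV query."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_metrics(line: str) -> dict:
    """Turn one CSV output line into a field -> float mapping."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(QUERY_FIELDS, map(float, values)))


# Example line in the format the command above would print for one GPU.
metrics = parse_metrics("37, 10240, 24576, 55, 142.50")
```

On a machine with an NVIDIA driver, the argv from nvidia_smi_command() can be passed to subprocess.run with capture_output=True to obtain real lines.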