Getting Started¶
This guide gets you from zero to running GPU workloads in under 5 minutes.
Prerequisites¶
| Requirement | Why |
|---|---|
| Python 3.10+ | Runtime for the CLI |
| Modal account | You need token_id and token_secret — get them at modal.com/settings |
modal CLI |
Must be installed and available in PATH (pip install modal) |
Install¶
From PyPI (recommended)¶
pip install m-gpux
From source (development)¶
git clone https://github.com/PuxHocDL/m-gpux.git
cd m-gpux
pip install -e .
Verify the install:
m-gpux --help
You should see the welcome screen with Quick Actions.
Step 1: Add your first profile¶
m-gpux account add
You will be prompted for:
| Field | Description | Where to find it |
|---|---|---|
| Profile name | A label like personal or work |
You choose this |
| Token ID | Modal API token ID | modal.com/settings → API Tokens |
| Token Secret | Modal API token secret | Same page, shown once at creation |
Profiles are stored in ~/.modal.toml. You can add as many as you need.
Step 2: Verify your profiles¶
m-gpux account list
You'll see a Rich table showing all configured profiles, with the active one marked.
Step 3: Launch a GPU session¶
m-gpux hub
The interactive hub walks you through:
- Select a profile (if multiple exist)
- Pick a GPU — from T4 (16GB, cheapest) to B200 (latest gen)
- Pick an action:
- Jupyter Lab — opens a GPU-backed notebook in your browser
- Run Python script — execute a local
.pyfile on the GPU - Web Bash shell — interactive terminal in the browser
- Review the generated script — the hub creates a transparent
modal_runner.py - Press Enter to launch, or edit the script first
What happens under the hood
The hub generates a Modal deployment script (modal_runner.py) with your chosen GPU and action, then runs modal run modal_runner.py. The script is fully editable — you can add pip packages, change timeouts, or customize the container image before launching.
Step 4: Check your costs¶
m-gpux billing usage --days 30 --all
This aggregates usage across all your configured profiles for the last 30 days.
To open the Modal billing dashboard in your browser:
m-gpux billing open
Deploy an LLM API¶
Turn any HuggingFace model into a production OpenAI-compatible API with authentication.
1. Create an API key¶
m-gpux serve keys create --name my-key
This generates a sk-mgpux-... key and stores it in ~/.m-gpux/api_keys.json.
Save your key
You can always retrieve it later with m-gpux serve keys show my-key, but treat it like a password.
2. Deploy with the interactive wizard¶
m-gpux serve deploy
The 5-step wizard guides you through:
| Step | What you choose | Default |
|---|---|---|
| 1. Model | 11 presets or custom HuggingFace ID | Qwen/Qwen2.5-7B-Instruct |
| 2. GPU | T4 through B200 | Based on model size |
| 3. Context length | Max tokens (lower = less VRAM) | 4096 |
| 4. Keep warm | 0 = scale-to-zero, 1 = always on |
1 |
| 5. API key | Select existing or auto-create | First active key |
After confirmation, m-gpux deploys a Modal app with:
- Auth proxy (FastAPI on port 8000) — validates Bearer tokens, returns 401/403 for invalid auth
- vLLM backend (port 8001) — serves the model with OpenAI-compatible API
- Shared caches — HuggingFace and vLLM model weights persist in Modal Volumes
3. Test your endpoint¶
# Health check (no auth required)
curl https://<workspace>--m-gpux-llm-api-serve.modal.run/health
# List models (requires auth)
curl https://<workspace>--m-gpux-llm-api-serve.modal.run/v1/models \
-H "Authorization: Bearer sk-mgpux-..."
# Chat completion
curl https://<workspace>--m-gpux-llm-api-serve.modal.run/v1/chat/completions \
-H "Authorization: Bearer sk-mgpux-..." \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
4. Use with OpenAI Python SDK¶
from openai import OpenAI
client = OpenAI(
base_url="https://<workspace>--m-gpux-llm-api-serve.modal.run/v1",
api_key="sk-mgpux-...",
)
# Non-streaming
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences."}],
)
print(response.choices[0].message.content)
# Streaming
stream = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
stream=True,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
5. Warm up the server¶
If you set keep_warm=0 (scale-to-zero), the first request triggers a cold start (~3-5 min). Pre-warm it:
m-gpux serve warmup --url https://<workspace>--m-gpux-llm-api-serve.modal.run
6. Stop when done¶
m-gpux serve stop # stop the LLM API server
m-gpux stop --all # stop ALL running m-gpux apps across profiles
Typical daily workflow¶
┌─ Morning ───────────────────────────────────┐
│ m-gpux account switch work │
│ m-gpux hub → Jupyter on A100 │
│ ... train / experiment ... │
├─ Afternoon ─────────────────────────────────┤
│ m-gpux serve deploy → API for team │
│ m-gpux billing usage --days 1 │
├─ Evening ───────────────────────────────────┤
│ m-gpux stop --all │
│ m-gpux billing usage --days 1 --all │
└─────────────────────────────────────────────┘