Tuna — Deploy and Serve LLM Models on GPU Infrastructure

Tuna is a hybrid GPU inference orchestrator. It lets you deploy, serve, and manage LLMs (Llama, Qwen, Mistral, DeepSeek, Gemma, and any other HuggingFace model) on serverless GPUs from Modal, RunPod, Cerebrium, Google Cloud Run, Baseten, or Azure Container Apps, with optional spot instance fallback on AWS via SkyPilot. Every deployment gets an OpenAI-compatible /v1/chat/completions endpoint.

The key idea: serverless GPUs handle requests immediately (fast cold start, pay-per-second) while a cheaper spot GPU boots in the background. Once the spot instance is ready, traffic shifts to it. If the spot instance is preempted, traffic automatically falls back to serverless. Because most traffic ends up on spot capacity, this can cut costs 3–5x compared with pure serverless, and the serverless fallback avoids downtime during preemptions.
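The routing behavior described above can be pictured as a small decision function. The sketch below is illustrative only and is not Tuna's router: `Backend`, `SpotState`, and `choose_backend` are hypothetical names, and a real router would also account for health checks, load, and in-flight requests.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    SERVERLESS = "serverless"
    SPOT = "spot"


@dataclass
class SpotState:
    """Hypothetical health snapshot of the spot replica (not a Tuna type)."""
    booted: bool = False      # instance has started and loaded the model
    preempted: bool = False   # cloud provider has reclaimed the instance


def choose_backend(spot: SpotState) -> Backend:
    """Prefer the cheaper spot GPU whenever it is usable, else use serverless."""
    if spot.booted and not spot.preempted:
        return Backend.SPOT
    return Backend.SERVERLESS


# Walk through the lifecycle: booting, ready, then preempted.
spot = SpotState()
timeline = [choose_backend(spot)]        # spot still booting
spot.booted = True
timeline.append(choose_backend(spot))    # spot ready
spot.preempted = True
timeline.append(choose_backend(spot))    # spot reclaimed
print([b.value for b in timeline])       # ['serverless', 'spot', 'serverless']
```

Serverless is the default in this sketch, so a request always has somewhere to go while the spot instance is booting or after it is reclaimed.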

Quick Start — Deploy a Model in 3 Commands

# 1. Install tuna
uv pip install tandemn-tuna

# 2. Deploy a model (auto-picks cheapest serverless provider for the GPU)
tuna deploy --model Qwen/Qwen3-0.6B --gpu L4 --service-name my-llm

# 3. Query your endpoint (shown in deploy output)
curl http://<router-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}]}'
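The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the response follows the usual OpenAI chat completions shape (`choices[0].message.content`), and `build_request`, `extract_reply`, and `chat` are illustrative helper names, not part of Tuna. Replace the `<router-ip>` placeholder with the address from your deploy output, and add any authentication headers your deployment requires.

```python
import json
import urllib.request

# Placeholder: substitute the router address printed by `tuna deploy`.
ENDPOINT = "http://<router-ip>:8080/v1/chat/completions"


def build_request(prompt: str, model: str = "Qwen/Qwen3-0.6B",
                  endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt: str, endpoint: str = ENDPOINT, timeout: float = 60.0) -> str:
    """Send the prompt and return the reply (requires a live deployment)."""
    request = build_request(prompt, endpoint=endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With a live deployment, `print(chat("Hello!", endpoint="http://<router-ip>:8080/v1/chat/completions"))` would print the model's reply.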

For a serverless-only deployment (no spot, no AWS account needed):

tuna deploy --model Qwen/Qwen3-0.6B --gpu L4 --serverless-only

All Commands

`tuna deploy` — Launch a model on GPU

Deploy a model across serverless + spot infrastructure. This is the main command.

tuna deploy --model <HuggingFace-model-ID> --gpu <GPU> [options]

Required arguments:

--model — HuggingFace model ID (e.g., Qwen/Qwen3-0.6B, meta-llama/Llama-3-70b)
--gpu — GPU type (e.g., T4, L4, L40S, A100, H100, B200)

Common options:

--service-name — Name for the deployment (auto-generated if omitted)
--serverless-provider — Force a specific provider: modal, runpod, cloudrun, baseten, azure, cerebrium (default: cheapest available)
--serverless-only — Serverless only, no spot backend or router (no AWS needed)
--gpu-count — Number of GPUs (default: 1)
--tp-size — Tensor parallel size (default: 1)
--max-model-len — Max sequence length (default: 4096)
--spots-cloud — Cloud for spot GPUs: aws or azure (default: aws)
--region — Cloud region for spot instances
--concurrency — Override serverless concurrency limit
--no-scale-to-zero — Keep at least 1 spot replica running
--public — Make endpoint publicly accessible (no auth)
--scaling-policy — Path to YAML with scaling parameters

Provider-specific options:

--gcp-project, --gcp-region — For Cloud Run
--azure-subscription, --azure-resource-group, --azure-region, --azure-environment — For Azure

Examples:

# Deploy Llama 3 on Modal with hybrid spot
tuna deploy --model meta-llama/Llama-3-8b --gpu A100 --serverless-provider modal
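For scripted deployments, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of Tuna; it only builds an argument list from flags documented above and leaves running the command to the caller.

```python
import shlex


def build_deploy_command(model, gpu, *, service_name=None, serverless_provider=None,
                         serverless_only=False, gpu_count=None, max_model_len=None):
    """Return the argv list for `tuna deploy`, using only flags documented above."""
    argv = ["tuna", "deploy", "--model", model, "--gpu", gpu]
    if service_name:
        argv += ["--service-name", service_name]
    if serverless_provider:
        argv += ["--serverless-provider", serverless_provider]
    if serverless_only:
        argv.append("--serverless-only")
    if gpu_count is not None:
        argv += ["--gpu-count", str(gpu_count)]
    if max_model_len is not None:
        argv += ["--max-model-len", str(max_model_len)]
    return argv


cmd = build_deploy_command("meta-llama/Llama-3-8b", "A100", serverless_provider="modal")
print(shlex.join(cmd))
# tuna deploy --model meta-llama/Llama-3-8b --gpu A100 --serverless-provider modal
```

To run the deployment from the script, pass the list to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems in model IDs and service names.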
