model-serving-stack

Production LLM serving infrastructure using Triton Inference Server, vLLM, and Ray Serve with OpenAI-compatible endpoints. Includes DCGM GPU autoscaling, Grafana monitoring, and BentoML packaging.

Stack

Triton Inference Server — multi-framework model serving
vLLM — high-throughput LLM serving with PagedAttention
Ray Serve — scalable model deployment with autoscaling
BentoML — portable model packaging and deployment
DCGM — GPU metrics and autoscaling signals

Structure

model-serving-stack/
├── triton/              # Triton model repo + config
├── vllm/               # vLLM server configs and scripts
├── ray_serve/          # Ray Serve deployment definitions
├── bentoml/            # BentoML service and packaging
├── autoscaling/        # HPA manifests driven by DCGM metrics
├── monitoring/         # Grafana dashboards + Prometheus rules
├── deploy/             # Docker Compose + Kubernetes manifests
├── api/                # OpenAI-compatible FastAPI gateway
├── configs/            # Model and serving configs
├── evals/              # Latency/throughput benchmarks
├── tests/              # Unit + integration tests
└── docs/               # Architecture docs

Quick Start

cp .env.template .env
# Fill in NGC_API_KEY, MODEL_PATH, etc.
docker compose -f deploy/docker-compose.yml up

Requirements

NVIDIA GPU with CUDA 12+
Docker + NVIDIA Container Toolkit
Kubernetes (optional, for production)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

model-serving-stack

Stack

Structure

Quick Start

Requirements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.cursor/agents		.cursor/agents
.github/workflows		.github/workflows
api		api
autoscaling		autoscaling
bentoml		bentoml
configs		configs
deploy		deploy
docs		docs
evals		evals
kubernetes		kubernetes
monitoring		monitoring
notebooks		notebooks
ray_serve		ray_serve
results		results
tests		tests
triton		triton
vllm		vllm
.cursorrules		.cursorrules
.env.template		.env.template
.gitignore		.gitignore
README.md		README.md
TASKS.md		TASKS.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

model-serving-stack

Stack

Structure

Quick Start

Requirements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages