Category: Content & Media · No API key required

llm-serving-patterns

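LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.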

Author: jakexiaohub (GitHub)

LLM Serving Patterns

When to Use This Skill

Use this skill when:

  • Designing LLM inference infrastructure
  • Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
  • Implementing quantization for production deployment
  • Optimizing batching and throughput
  • Building streaming response systems
  • Scaling LLM deployments cost-effectively

Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding

LLM Serving Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         LLM Serving Stack                           │
├─────────────────────────────────────────────────────────────────────┤
│  Clients (API, Chat UI, Agents)                                     │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │              Load Balancer / API Gateway                     │   │
│  │  • Rate limiting  • Authentication  • Request routing        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                   Inference Server                           │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │   │
│  │  │  Request    │  │  Batching   │  │  KV Cache           │  │   │
│  │  │  Queue      │──▶│  Engine     │──▶│  Management        │  │   │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘  │   │
│  │       │                                      │               │   │
│  │       ▼                                      ▼               │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │              Model Execution Engine                  │    │   │
│  │  │  • Tensor operations  • Attention  • Token sampling │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────────────┘   │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    GPU/TPU Cluster                           │   │
│  │  • Model sharding  • Tensor parallelism  • Pipeline parallel │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
| --------- | --------- | -------- | -------------- |
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |

Framework Selection Decision Tree

Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
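The tree above can be read as a chain of early returns. The sketch below is only an illustration of that reading; the function name and boolean parameters are hypothetical, and real selection usually weighs more factors than four yes/no questions.

```python
def choose_framework(lowest_latency_on_nvidia: bool,
                     high_concurrency: bool,
                     enterprise_hf: bool,
                     local_or_edge: bool) -> str:
    """Walk the decision tree top to bottom; the first 'yes' wins."""
    if lowest_latency_on_nvidia:
        return "TensorRT-LLM"
    if high_concurrency:
        return "vLLM"
    if enterprise_hf:
        return "TGI"
    if local_or_edge:
        return "Ollama or llama.cpp"
    return "vLLM"  # general-purpose default at the bottom of the tree


print(choose_framework(False, True, False, False))
```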

Quantization Techniques

Precision Levels

| Precision | Bits | Memory Reduction (vs. FP32) | Quality Impact | Use Case |
| --------- | ---- | --------------------------- | -------------- | -------- |
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
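To turn the bit widths above into a rough memory budget, multiply the parameter count by bytes per weight. This is a back-of-the-envelope sketch: the helper name is hypothetical, and it ignores quantization scales, zero-points, activations, and the KV cache.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4, "INT2": 2}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given precision."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


for precision in ("FP32", "FP16", "INT8", "INT4"):
    print(precision, weight_memory_gb(70e9, precision), "GB for a 70B model")
```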

Quantization Methods

| Method | Description | Quality | Speed |
| ------ | ----------- | ------- | ----- |
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp file formats, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |
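As a concrete reference point for the PTQ row, here is a minimal sketch of the simplest post-training scheme: symmetric, per-tensor, round-to-nearest INT8. It is not GPTQ, AWQ, or SmoothQuant, which add calibration data and error compensation on top of this idea; the function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes, [round(r, 4) for r in restored])
```

The round-trip error of each weight is bounded by half the scale, which is why a single outlier weight (a large `max|w|`) degrades precision for every other value in the tensor.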

Quantization Selection

Quality vs. Efficiency Trade-off:

Quality ────────────────────────────────────────────▶ Efficiency
   │                                                      │
   │  FP32    FP16    INT8+AWQ   INT8+GPTQ   INT4   INT2  │
   │   ○───────○────────○──────────○──────────○──────○    │
   │   │       │        │          │          │      │    │
   │  Best   Great    Good      Good       Fair   Poor   │
   │                                                      │

Batching Strategies

Static Batching

Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50]  ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80]  ─┘

Problem: Short requests wait for long ones (head-of-line blocking)
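The padding cost in the example above is easy to quantify. A small sketch (the helper name is illustrative):

```python
def padding_waste(prompt_lengths):
    """Fraction of a statically padded batch that is padding rather than real tokens."""
    padded = max(prompt_lengths) * len(prompt_lengths)
    useful = sum(prompt_lengths)
    return (padded - useful) / padded


# The three requests from the diagram: 100, 50, and 80 tokens padded to 100 each.
print(f"{padding_waste([100, 50, 80]):.1%} of the batch is padding")
```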

Continuous Batching (Preferred)

Time ──────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
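The timeline above can be reproduced with a toy scheduler. This is a simplified simulation, assuming every request costs one unit per decode step and ignoring prefill and memory limits; the function names are illustrative and do not mirror any framework's API.

```python
from collections import deque


def continuous_batching(decode_steps, max_num_seqs):
    """Return {request index: step at which it finishes} with iteration-level scheduling."""
    waiting = deque(enumerate(decode_steps))
    running = {}        # request index -> remaining decode steps
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill any free slots before the next iteration.
        while waiting and len(running) < max_num_seqs:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        step += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                finished_at[rid] = step
                del running[rid]
    return finished_at


def static_batching(decode_steps, batch_size):
    """Same workload, but each batch runs until its longest request completes."""
    finished_at, clock = {}, 0
    for i in range(0, len(decode_steps), batch_size):
        batch = decode_steps[i:i + batch_size]
        clock += max(batch)
        for j in range(len(batch)):
            finished_at[i + j] = clock
    return finished_at


workload = [8, 3, 5, 4, 2]
print("continuous:", continuous_batching(workload, max_num_seqs=3))
print("static:    ", static_batching(workload, batch_size=3))
```

In this workload the last request finishes at step 7 with continuous batching and at step 12 with static batching, because freed slots are reused immediately.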

Batching Parameters

| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| max_batch_size | Maximum concurrent requests | Memory vs. throughput |
| max_waiting_tokens | Tokens before forcing batch | Latency vs. throughput |
| max_num_seqs | Maximum sequences in batch | Memory vs. concurrency |

KV Cache Management

The KV Cache Problem

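Attention: softmax(Q × K^T / √d_k) × V

For each token generated:
• Attention must be computed against ALL previous tokens
• Caching K and V avoids recomputing them, but the cached tensors grow with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B-class model: 80 layers, hidden_dim 8192, FP16, 4K context, no grouped-query attention):
• KV cache per request: ~10GB
• 10 concurrent requests: ~100GB GPU memory
• Models that use grouped-query attention (GQA) need several times less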
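The memory estimate can be checked with a short calculation. The shapes below (80 layers, 64 attention heads of dimension 128, optionally 8 KV heads under GQA) are assumptions typical of a 70B-class decoder, not figures taken from this document; the result shifts with the architecture and with KV cache precision.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size in bytes. The leading 2 counts one K and one V vector per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 1024 ** 3

# Full multi-head attention: every one of the 64 heads stores its own K and V.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
# Grouped-query attention: 8 KV heads shared across the 64 query heads.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha / GIB:.2f} GiB per request, GQA: {gqa / GIB:.2f} GiB per request")
```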

PagedAttention (vLLM Innovation)

Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed)   │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE                  │
└──────────────────────────────────────────┘

PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │  ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
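The paging idea can be sketched as a block table per request, much like virtual memory pages. This toy allocator only tracks block ids; it is an illustration of the concept under simplifying assumptions, not vLLM's implementation, and the class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy block-table allocator: fixed-size KV blocks handed out on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.num_tokens = {}     # request id -> tokens stored so far

    def append_token(self, rid):
        n = self.num_tokens.get(rid, 0)
        if n % self.block_size == 0:  # first token, or the current block is full
            if not self.free:
                raise MemoryError("no free KV blocks; caller must queue or preempt")
            self.block_tables.setdefault(rid, []).append(self.free.pop(0))
        self.num_tokens[rid] = n + 1

    def release(self, rid):
        """Return all blocks of a finished request to the free pool."""
        self.free.extend(self.block_tables.pop(rid, []))
        self.num_tokens.pop(rid, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for rid in ["R1", "R1", "R1", "R2", "R1", "R1"]:
    alloc.append_token(rid)
print(alloc.block_tables)  # R1's blocks are not contiguous
```

Waste is limited to the unfilled tail of each request's last block, which is why smaller blocks reduce waste at the cost of more bookkeeping.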

KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
| -------- | ----------- | -------------- |
| Paged Attention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |

Streaming Response Patterns

Server-Sent Events (SSE)

Client                                Server
   │                                     │
   │──── POST /v1/chat/completions ─────▶│
   │      (stream: true)                 │
   │                                     │
   │◀──── HTTP 200 OK ───────────────────│
   │      Content-Type: text/event-stream│
   │                                     │
   │◀──── data: {"token": "Hello"} ──────│
   │◀──── data: {"token": " world"} ─────│
   │◀──── data: {"token": "!"} ──────────│
   │◀──── data: [DONE] ──────────────────│
   │                                     │

SSE Benefits:

  • HTTP/1.1 compatible
  • Auto-reconnection support
  • Simple to implement
  • Wide client support
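On the client side, consuming the stream amounts to reading `data:` lines until the `[DONE]` sentinel. The sketch below parses an already-received sequence of lines and assumes the simplified `{"token": ...}` payload from the diagram; real servers define their own chunk schema, so the field name here is an assumption.

```python
import json


def iter_sse_tokens(lines):
    """Yield the "token" field of each `data:` event until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)["token"]


stream = [
    'data: {"token": "Hello"}', "",
    'data: {"token": " world"}', "",
    'data: {"token": "!"}', "",
    "data: [DONE]", "",
]
print("".join(iter_sse_tokens(stream)))
```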

WebSocket Streaming

Client                                Server
   │                                     │
   │──── WebSocket Upgrade ─────────────▶│
   │◀──── 101 Switching Protocols ───────│
   │                                     │
   │──── {"prompt": "Hello"} ───────────▶│
   │                                     │
   │◀──── {"token": "Hi"} ───────────────│
   │◀──── {"token": " there"} ───────────│
   │◀──── {"token": "!"} ────────────────│
   │◀──── {"done": true} ────────────────│
   │                                     │

WebSocket Benefits:

  • Bidirectional communication
  • Lower per-message overhead after the initial handshake
  • Better for chat applications
  • Connection persistence

Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
| ------ | --- | --------- |
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |

Speculative Decoding

Concept

Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms = 50ms total

Speculative Decoding:
Draft Model: [T1 → T2 → T3 → T4 → T5] (sequential but cheap, ~5ms total)
                      │
                      ▼
Large Model: [Verify T1-T5 in one forward pass] (~15ms)
             Accept: T1, T2, T3 ✓  Reject: T4 ✗ (T5 discarded)
                      │
                      ▼
             [Emit the large model's own T4 from the same pass]

Total: ~20ms for 4 tokens vs. 40ms sequentially (~2x speedup at 60% acceptance)
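The accept/reject step can be sketched for the greedy case, where a draft token is accepted only if it equals the large model's own choice at that position. `target_preds` is assumed to hold the large model's predictions from a single forward pass over the draft (one entry per draft position plus one extra). Sampling-based variants use a probabilistic acceptance rule instead; the function name is illustrative.

```python
def verify_draft(draft, target_preds):
    """Greedy verification: keep the matching prefix, then take the target's next token.

    target_preds[i] is the large model's prediction at position i given draft[:i],
    so it must contain len(draft) + 1 entries.
    """
    if len(target_preds) != len(draft) + 1:
        raise ValueError("target_preds needs one prediction per draft token plus one")
    accepted = []
    for drafted, predicted in zip(draft, target_preds):
        if drafted != predicted:
            break
        accepted.append(drafted)
    # The prediction at the first mismatch (or the bonus position) comes from the same pass.
    return accepted + [target_preds[len(accepted)]]


draft = ["T1", "T2", "T3", "bad4", "bad5"]
target = ["T1", "T2", "T3", "T4", "x", "y"]
print(verify_draft(draft, target))
```

Every verification pass therefore yields at least one token, so output matches what the large model alone would have produced under greedy decoding.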

Speculative Decoding Trade-offs

| Factor | Impact |
| ------ | ------ |
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |

Scaling Strategies

Horizontal Scaling

┌─────────────────────────────────────────────────────────┐
│                    Load Balancer                        │
│         (Round-robin, Least-connections)                │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ vLLM    │    │ vLLM    │    │ vLLM    │
    │ Node 1  │    │ Node 2  │    │ Node 3  │
    │ (GPU×4) │    │ (GPU×4) │    │ (GPU×4) │
    └─────────┘    └─────────┘    └─────────┘
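For LLM traffic, least-connections usually tracks load better than round-robin because request durations vary widely. A minimal in-process sketch follows; the class name and node labels are hypothetical, and a production balancer would also handle health checks and concurrency.

```python
class LeastConnectionsBalancer:
    """Pick the node with the fewest in-flight requests."""

    def __init__(self, nodes):
        self.in_flight = {node: 0 for node in nodes}

    def acquire(self):
        # Ties resolve to the node listed first.
        node = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[node] += 1
        return node

    def release(self, node):
        self.in_flight[node] -= 1


lb = LeastConnectionsBalancer(["node1", "node2", "node3"])
first_three = [lb.acquire() for _ in range(3)]
lb.release("node2")            # node2 finishes its request early
print(first_three, "->", lb.acquire())
```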

Model Parallelism

| Strategy | Description | Use Case |
| -------- | ----------- | -------- |
| Tensor Parallelism | Split each layer's weight tensors across GPUs | Single large model |
| Pipeline Parallelism | Assign contiguous groups of layers to different GPUs | Very large models |
| Data Parallelism | Replicate the full model; each replica serves different batches | High throughput |

Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│              Layer N                     │
│  GPU0   │   GPU1   │   GPU2   │   GPU3  │
│  25%    │   25%    │   25%    │   25%   │
└─────────────────────────────────────────┘

Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
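The pipeline layout above is a contiguous partition of layer indices. A small sketch of that assignment (the helper name is illustrative; real frameworks may balance stages by memory or compute rather than layer count):

```python
def pipeline_stages(num_layers, num_gpus):
    """Contiguous layer ranges per GPU; earlier stages absorb any remainder."""
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for gpu, layers in enumerate(pipeline_stages(32, 4)):
    print(f"GPU{gpu}: Layers {layers.start}-{layers.stop - 1}")
```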

Latency Optimization Checklist

Pre-deployment

  • [ ] Choose appropriate quantization (INT8 for production)
  • [ ] Enable continuous batching
  • [ ] Configure KV cache size appropriately
  • [ ] Set optimal batch size for hardware
  • [ ] Enable prefix caching for system prompts

Runtime

  • [ ] Monitor GPU memory utilization
  • [ ] Track p50/p95/p99 latencies
  • [ ] Measure time-to-first-token (TTFT)
  • [ ] Monitor tokens-per-second (TPS)
  • [ ] Set appropriate timeouts
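The runtime metrics above can be derived from per-request timestamps. This sketch uses the nearest-rank percentile definition and assumes timestamps in seconds; the function names are illustrative, and a real deployment would normally rely on its metrics stack.

```python
def percentile(samples, p):
    """Nearest-rank percentile for integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p% of n
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def ttft_and_tps(request_start, token_times):
    """Time-to-first-token and tokens-per-second for one streamed response."""
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    return ttft, len(token_times) / total


print(latency_summary(list(range(1, 101))))
print(ttft_and_tps(0.0, [0.2, 0.3, 0.4, 0.5]))
```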

Infrastructure

  • [ ] Use fastest available interconnect (NVLink, InfiniBand)
  • [ ] Minimize network hops
  • [ ] Place inference close to users (edge)
  • [ ] Consider dedicated inference hardware

Cost Optimization

Cost Drivers

| Factor | Impact | Optimization |
| ------ | ------ | ------------ |
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |

Cost Estimation Formula

Monthly Cost =
  (Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
  ─────────────────────────────────────────────────────────────────────────────
                                    3600

Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
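The same formula as a function, reproducing the worked example. The parameter names are illustrative, and the estimate assumes GPUs are billed only for the time spent generating tokens (no idle capacity).

```python
def monthly_gpu_cost(requests_per_month, avg_tokens_per_request,
                     gpu_seconds_per_token, dollars_per_gpu_hour):
    """Monthly cost = total GPU-seconds, converted to hours, times the hourly rate."""
    gpu_seconds = requests_per_month * avg_tokens_per_request * gpu_seconds_per_token
    return gpu_seconds / 3600 * dollars_per_gpu_hour


cost = monthly_gpu_cost(10_000_000, 500, 0.001, 2.0)
print(f"${cost:,.0f}/month")
```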

Common Patterns

Multi-model Routing

┌─────────────────────────────────────────────────────────┐
│                     Router                              │
│  • Classify request complexity                          │
│  • Route to appropriate model                           │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ Small   │    │ Medium  │    │ Large   │
    │ Model   │    │ Model   │    │ Model   │
    │ (7B)    │    │ (13B)   │    │ (70B)   │
    │ Fast    │    │ Balanced│    │ Quality │
    └─────────┘    └─────────┘    └─────────┘
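A router can start as a simple heuristic before graduating to a learned classifier. In the sketch below the word-count thresholds, the `needs_reasoning` flag, and the model labels are all hypothetical placeholders chosen for illustration.

```python
def route(prompt: str, needs_reasoning: bool = False) -> str:
    """Route by a crude complexity proxy: explicit flag first, then prompt length."""
    words = len(prompt.split())
    if needs_reasoning or words > 200:
        return "large-70b"
    if words > 40:
        return "medium-13b"
    return "small-7b"


print(route("Translate 'hello' to French"))
print(route("Prove this lemma step by step", needs_reasoning=True))
```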

Caching Strategies

| Cache Type | What to Cache | TTL |
| ---------- | ------------- | --- |
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
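The response-cache row can be sketched as an exact-match lookup keyed on model and prompt, with a TTL. This is a minimal in-memory illustration with hypothetical names; it is only safe for deterministic settings (for example temperature 0), since sampled outputs differ between calls.

```python
import hashlib
import time


class ResponseCache:
    """Exact-match response cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock       # injectable so expiry can be tested without sleeping
        self.entries = {}        # key -> (expires_at, response)

    @staticmethod
    def _key(model, prompt):
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model, prompt):
        key = self._key(model, prompt)
        hit = self.entries.get(key)
        if hit is None:
            return None
        expires_at, response = hit
        if self.clock() >= expires_at:
            del self.entries[key]  # drop stale entry
            return None
        return response

    def put(self, model, prompt, response):
        self.entries[self._key(model, prompt)] = (self.clock() + self.ttl, response)


cache = ResponseCache(ttl_seconds=60)
cache.put("small-7b", "What is 2+2?", "4")
print(cache.get("small-7b", "What is 2+2?"))
```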

Related Skills

  • ml-system-design - End-to-end ML pipeline design
  • rag-architecture - Retrieval-augmented generation patterns
  • vector-databases - Vector search for LLM context
  • ml-inference-optimization - General inference optimization
  • estimation-techniques - Capacity planning for LLM systems

Version History

  • v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews

Last Updated

Date: 2025-12-26