Category: Development & Engineering (no API key required)

go-performance-best-practices

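Go performance optimization guide covering profiling, memory allocation, GC tuning, concurrency, PGO (profile-guided optimization), and I/O. Use this skill when writing, reviewing, or optimizing Go code for performance. It applies to tasks involving slow service responses, high latency, high memory usage, memory leaks, goroutine leaks, GC pressure, CPU profiling, pprof analysis, reducing allocations, sync.Pool usage, mutex contention, HTTP client tuning, profile-guided optimization, GOMEMLIMIT tuning, Go 1.24 features, Swiss Tables, or any other Go performance investigation.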

Author: jakexiaohub (GitHub)

Go Performance Best Practices

Comprehensive performance optimization guide for Go codebases. Contains 41 rules across 8 categories with real-world benchmarks, BOMvault-specific examples, and proven optimization patterns from 10+ years of production experience.

When to Apply

Reference these guidelines when:

  • Writing or refactoring Go code
  • Tuning latency, throughput, allocation rate, or GC behavior
  • Investigating performance regressions
  • Reviewing code for performance issues
  • Debugging memory leaks or goroutine leaks
  • Optimizing containerized services (ECS, Kubernetes)

The Performance Optimization Workflow

Phase 1: Measure First (Don't Guess)

Never optimize without data. The #1 mistake is optimizing based on intuition.

# Step 1: Establish baseline with benchmarks
go test -bench=. -benchmem -count=5 ./... | tee baseline.txt

# Step 2: Generate CPU profile for hot paths
go test -bench=BenchmarkCriticalPath -cpuprofile=cpu.prof
go tool pprof -http=:8080 cpu.prof

# Step 3: Generate heap profile for allocations
go test -bench=BenchmarkCriticalPath -memprofile=heap.prof
go tool pprof -http=:8080 heap.prof

# Step 4: Check allocation counts (correlates with latency)
go tool pprof -alloc_objects heap.prof

Key pprof views:

| View | Use For |
|------|---------|
| top | Quick ranking of hot functions |
| list funcname | Line-by-line attribution |
| web | Visual call graph |
| Flame Graph (web UI started with -http) | Flame graph for deep call stacks |
| peek funcname | Callers and callees |
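The baseline step assumes a benchmark already exists for the critical path. A minimal sketch of one follows, assuming a hypothetical joinIDs function stands in for the hot path. In a real project the benchmark would live in a _test.go file as a BenchmarkXxx function; testing.Benchmark is used here only so the sketch runs as a standalone program.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
)

// joinIDs is a hypothetical hot-path function used only for illustration.
func joinIDs(ids []int) string {
	var sb strings.Builder
	for i, id := range ids {
		if i > 0 {
			sb.WriteByte(',')
		}
		sb.WriteString(strconv.Itoa(id))
	}
	return sb.String()
}

// benchmarkJoinIDs has the same shape as a BenchmarkXxx function in a _test.go file.
func benchmarkJoinIDs(b *testing.B) {
	ids := []int{1, 2, 3, 4, 5, 6, 7, 8}
	b.ReportAllocs() // report allocations for this benchmark, like -benchmem
	b.ResetTimer()   // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		_ = joinIDs(ids)
	}
}

func main() {
	res := testing.Benchmark(benchmarkJoinIDs)
	fmt.Println(res.String(), res.MemString())
}
```

The printed ns/op, B/op, and allocs/op figures are the numbers to record before changing anything.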

Phase 2: Identify the Bottleneck

Use the right profile for the right problem:

| Symptom | Profile Type | pprof Flag |
| ---------------------------------- | ------------ | ----------------------------------- |
| High CPU usage | CPU | -cpuprofile |
| High memory usage | Heap (inuse) | -memprofile + -inuse_space |
| High allocation rate / GC pressure | Heap (alloc) | -memprofile + -alloc_objects |
| Goroutine leaks | Goroutine | runtime/pprof.Lookup("goroutine") |
| Lock contention | Mutex | -mutexprofile |
| Blocking operations | Block | -blockprofile |
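The goroutine row has no go test flag, so the profile is usually read in-process. A small sketch of that lookup, assuming a hypothetical helper named goroutineDump:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump returns the current goroutine profile in text form.
// debug=1 groups identical stacks and prints a count for each group.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := goroutineDump()
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Print(dump)
}
```

Logging this output at intervals is one way to spot stacks whose count keeps growing.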

Quick diagnosis commands:

# CPU: What's using the most cycles?
go tool pprof -top cpu.prof

# Memory: What's consuming the most heap?
go tool pprof -top -inuse_space heap.prof

# Allocations: What's creating the most objects?
go tool pprof -top -alloc_objects heap.prof

# Compare before/after
go tool pprof -base baseline.prof optimized.prof

Phase 3: Apply Targeted Optimization

Match the symptom to the optimization category:

| Symptom | Category | Key Rules |
| ------------------- | -------------- | ------------------------------------------------------ |
| CPU-bound | Work Avoidance | work-cache-*, work-short-circuit-* |
| Memory-bound | Allocation | alloc-preallocate-*, alloc-copy-to-avoid-retention |
| GC pauses | GC Tuning | gc-set-gomemlimit, gc-use-sync-pool |
| I/O latency | I/O | io-buffered-io, io-reuse-http-client |
| Lock contention | Concurrency | conc-reduce-lock-contention, conc-use-atomics |
| Goroutine explosion | Concurrency | conc-limit-goroutines, conc-bounded-channels |

Phase 4: Verify Improvement

# Run benchmark again
go test -bench=. -benchmem -count=5 ./... | tee optimized.txt

# Compare results
benchstat baseline.txt optimized.txt

# Verify no regressions in other benchmarks

Success criteria:

  • Measurable improvement (not just "feels faster")
  • No regressions in other areas
  • Code remains readable and maintainable
  • Changes are justified by data

Common Optimization Scenarios

Scenario 1: High Latency / Slow Response Times

Symptoms: P99 latency spikes, slow API responses, timeouts

Diagnosis:

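# CPU profile during slow requests (quote the URL so the shell does not treat ? as a glob)
curl "http://localhost:8080/debug/pprof/profile?seconds=30" > cpu.prof
# Serve the pprof UI on a different port than the service itself
go tool pprof -http=:8081 cpu.prof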

Common causes and fixes:

| Cause | Indicator | Fix |
| -------------------- | ---------------------------- | ---------------------------------------------------- |
| JSON encoding | encoding/json in top | Use json.NewEncoder streaming, consider jsoniter |
| Regex compilation | regexp.Compile in hot path | Cache compiled regex at init |
| Slice/map scanning | Loops in profile | Convert to map lookup |
| String concatenation | + operator in loops | Use strings.Builder |
| Excessive logging | Logger in top | Reduce log level in hot path |
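Two of the fixes above, caching a compiled regex and replacing a slice scan with a map lookup, can be sketched as follows. The pattern and the allow-list contents are illustrative assumptions, not taken from any real service.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRe is compiled once at package init instead of on every request.
var versionRe = regexp.MustCompile(`^v\d+\.\d+\.\d+$`)

// allowedLicenses is a hypothetical allow-list stored as a set,
// so membership checks are map lookups rather than slice scans.
var allowedLicenses = map[string]struct{}{
	"MIT":          {},
	"Apache-2.0":   {},
	"BSD-3-Clause": {},
}

func isValidVersion(s string) bool { return versionRe.MatchString(s) }

func isAllowed(license string) bool {
	_, ok := allowedLicenses[license]
	return ok
}

func main() {
	fmt.Println(isValidVersion("v1.24.0"), isValidVersion("1.24")) // true false
	fmt.Println(isAllowed("MIT"), isAllowed("GPL-3.0"))            // true false
}
```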

Scenario 2: High Memory Usage / OOM Kills

Symptoms: Container OOM killed, memory growing over time, swap thrashing

Diagnosis:

# Heap profile
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof -inuse_space -top heap.prof

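# Check cumulative allocations (shows allocation churn, not live memory)
go tool pprof -alloc_space -top heap.prof

# To confirm a leak, capture two heap profiles a few minutes apart and diff them
go tool pprof -inuse_space -top -base heap1.prof heap2.prof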

Common causes and fixes:

| Cause | Indicator | Fix |
| ------------------------- | ----------------------------- | -------------------------- |
| Large slice retention | append with small subslices | copy() to new slice |
| Unbounded caches | Map growing without eviction | Add LRU/TTL eviction |
| io.ReadAll on large files | Large []byte allocations | Stream with io.Copy |
| String/[]byte conversions | runtime.stringtoslicebyte | Stay in one domain |
| Goroutine leaks | Goroutine count growing | Check context cancellation |
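The slice retention row is easy to miss in review, so here is a minimal sketch of the copy() fix. The helper name header and the payload size are assumptions made for illustration.

```go
package main

import "fmt"

// header returns the first n bytes of data as an independent copy.
// Returning data[:n] directly would keep the whole backing array reachable.
func header(data []byte, n int) []byte {
	if n > len(data) {
		n = len(data)
	}
	out := make([]byte, n)
	copy(out, data[:n])
	return out
}

func main() {
	big := make([]byte, 10<<20) // 10 MiB payload
	h := header(big, 16)
	fmt.Println(len(h), cap(h)) // 16 16
}
```

Once big goes out of scope, only the 16-byte copy stays live.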

Scenario 3: High GC Pressure / CPU Spent in GC

Symptoms: gc_pause_seconds high, runtime.mallocgc in CPU profile

Diagnosis:

# Check GC stats
GODEBUG=gctrace=1 ./myservice 2>&1 | head -20

# Allocation profile
go tool pprof -alloc_objects -top heap.prof

Common causes and fixes:

| Cause | Indicator | Fix |
| ------------------------ | ---------------------------- | ------------------------------------- |
| Many small allocations | High alloc_objects | Use sync.Pool |
| Creating slices in loops | make([]T, ...) in hot path | Preallocate or pool |
| fmt.Sprintf in hot path | fmt.* allocations | Use strconv |
| Interface boxing | interface{} conversions | Use generics or concrete types |
| Not setting GOMEMLIMIT | Frequent GC cycles | Set GOMEMLIMIT to 80-90% of container |
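As a rough illustration of the fmt.Sprintf row, the sketch below compares allocations per call using testing.AllocsPerRun. The helper formatInto is hypothetical, and the exact counts depend on the Go version, so none are asserted here.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// formatInto writes n in decimal into buf, reusing its capacity.
func formatInto(buf []byte, n int64) []byte {
	return strconv.AppendInt(buf[:0], n, 10)
}

func main() {
	buf := make([]byte, 0, 32)

	// fmt.Sprintf allocates at least the result string on every call.
	sprintf := testing.AllocsPerRun(1000, func() {
		_ = fmt.Sprintf("%d", 1234567)
	})

	// strconv.AppendInt writes into the reused buffer instead.
	appendInt := testing.AllocsPerRun(1000, func() {
		buf = formatInto(buf, 1234567)
	})

	fmt.Printf("fmt.Sprintf: %.0f allocs/op, strconv.AppendInt: %.0f allocs/op\n", sprintf, appendInt)
}
```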

Scenario 4: Goroutine Leaks / Count Growing

Symptoms: Goroutine count increases over time, eventual resource exhaustion

Diagnosis:

# Goroutine profile with full stacks (quote the URL so the shell does not treat ? as a glob)
curl "http://localhost:8080/debug/pprof/goroutine?debug=2" > goroutine.txt
head -100 goroutine.txt

# Counts grouped by identical stack
curl "http://localhost:8080/debug/pprof/goroutine?debug=1" | head -50

Common causes and fixes:

| Cause | Indicator | Fix |
| ----------------------- | ---------------------------------- | ------------------------------------- |
| Blocked channel receive | chan receive in stack | Add timeout or close channel |
| HTTP client no timeout | net/http.(*persistConn).readLoop | Set client timeout |
| Ticker not stopped | time.Tick in stack | Use time.NewTicker + defer Stop() |
| Context not cancelled | context.Background() everywhere | Pass and check context |
| Worker pool leak | Workers waiting on a channel that is never closed | Proper shutdown signaling |

Scenario 5: Lock Contention / Serialized Execution

Symptoms: CPU not fully utilized, goroutines blocked on mutex

Diagnosis:

# Mutex profile (enable first with runtime.SetMutexProfileFraction)
curl http://localhost:8080/debug/pprof/mutex > mutex.prof
go tool pprof -top mutex.prof

# Block profile (enable first with runtime.SetBlockProfileRate)
curl http://localhost:8080/debug/pprof/block > block.prof
go tool pprof -top block.prof

Common causes and fixes:

| Cause | Indicator | Fix |
| --------------------------- | ------------------------------ | ----------------------- |
| Global mutex | Single lock in mutex profile | Shard by key |
| Write lock for reads | sync.Mutex on read-heavy map | Use sync.RWMutex |
| Lock held during I/O | I/O calls while holding lock | Release lock before I/O |
| Atomic operations on struct | atomic.Value for config | Use atomic.Pointer[T] |


BOMvault Service Optimization Guide

License Enricher

Profile: CPU-bound, high allocation rate from parsing

Key optimizations:

  • Cache compiled SPDX license regex patterns at init
  • Pool bytes.Buffer for license text processing
  • Preallocate slice for AffectedPackages based on typical size
  • Stream large license files instead of io.ReadAll

// BOMvault license-enricher pattern
var (
    spdxRegex = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9.-]*$`)
    bufPool   = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (e *Enricher) ProcessLicense(data []byte) (*License, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufPool.Put(buf)
    // ... use buf for processing
}

Vulnerability Enricher

Profile: I/O-bound (NVD API), memory spikes from CVE data

Key optimizations:

  • Reuse http.Client with connection pooling
  • Stream JSON responses for large CVE feeds
  • Set GOMEMLIMIT to 80% of container memory
  • Use map for CVE ID lookups instead of slice scanning
  • Batch database inserts (100-500 per batch)

// BOMvault vulnerability-enricher pattern
var nvdClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
    },
}

type CVEIndex struct {
    byID map[string]*CVE  // O(1) lookup
}
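The batching bullet above is not covered by the pattern. One way to sketch it is a small helper that slices the input into fixed-size batches. The CVE type, the helper name, and the insert callback are assumptions; a real implementation would issue a multi-row INSERT inside the callback.

```go
package main

import "fmt"

// CVE is a hypothetical record type for this sketch.
type CVE struct{ ID string }

// insertInBatches splits rows into batches of at most size and passes each
// batch to insert, stopping at the first error.
func insertInBatches(rows []CVE, size int, insert func([]CVE) error) error {
	if size <= 0 {
		return fmt.Errorf("batch size must be positive, got %d", size)
	}
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		if err := insert(rows[start:end]); err != nil {
			return fmt.Errorf("batch starting at %d: %w", start, err)
		}
	}
	return nil
}

func main() {
	rows := make([]CVE, 1050)
	var sizes []int
	err := insertInBatches(rows, 500, func(batch []CVE) error {
		sizes = append(sizes, len(batch)) // stand-in for one database round trip
		return nil
	})
	fmt.Println(sizes, err) // [500 500 50] <nil>
}
```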

Graph Ingest

Profile: Memory-bound, large SBOM processing

Key optimizations:

  • Stream SBOM JSON parsing with json.Decoder
  • Copy component slices to avoid retaining entire SBOM
  • Use GOMEMLIMIT with soft memory limit
  • Bounded worker pool for parallel component processing
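  • Context timeouts for database operations

// BOMvault graph-ingest pattern
// Assumes r is a stream of concatenated component objects.
func (g *GraphIngest) ProcessSBOM(ctx context.Context, r io.Reader) error {
    dec := json.NewDecoder(r)  // Stream, don't ReadAll

    // Bounded parallelism: at most 10 components in flight
    sem := make(chan struct{}, 10)
    var wg sync.WaitGroup
    defer wg.Wait()  // Don't return while workers are still running

    for dec.More() {
        var component Component
        if err := dec.Decode(&component); err != nil {
            return err
        }

        // Stop waiting for a slot if the caller gives up
        select {
        case sem <- struct{}{}:
        case <-ctx.Done():
            return ctx.Err()
        }
        wg.Add(1)
        go func(c Component) {
            defer wg.Done()
            defer func() { <-sem }()
            g.processComponent(ctx, c)
        }(component)
    }
    return nil
}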

Alert Writer

Profile: I/O-bound (SARIF generation), batch processing

Key optimizations:

  • Precompute report templates at startup
  • Batch writes to reduce syscalls
  • Pool buffers for SARIF report generation
  • Use strings.Builder for alert message construction

// BOMvault alert-writer pattern
var (
    reportTemplates = template.Must(template.ParseGlob("templates/*.html"))
    bufPool         = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (w *AlertWriter) GenerateSARIF(findings []*Finding) ([]byte, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    buf.Grow(len(findings) * 500)  // Estimate size
    defer bufPool.Put(buf)

    // Batch write to buffer, then single Write to output

    // Copy before returning: buf goes back to the pool and will be reused
    return bytes.Clone(buf.Bytes()), nil
}

Rule Categories by Priority

| Priority | Category | Impact | Prefix |
| -------- | ----------------------------- | -------- | -------- |
| 1 | Measurement & Profiling | CRITICAL | prof- |
| 2 | Allocation & Data Structures | HIGH | alloc- |
| 3 | Strings, Bytes & Encoding | HIGH | bytes- |
| 4 | Concurrency & Synchronization | HIGH | conc- |
| 5 | GC & Memory Limits | HIGH | gc- |
| 6 | I/O & Networking | HIGH | io- |
| 7 | Runtime & Scheduling | MEDIUM | rt- |
| 8 | Work Avoidance & Caching | MEDIUM | work- |

Quick Reference

1. Measurement & Profiling (CRITICAL)

| Rule | Impact | When to Apply |
| ----------------------------- | ---------- | ---------------------------------- |
| prof-use-testing-benchmarks | Foundation | Always benchmark before optimizing |
| prof-report-allocs | Foundation | When allocation rate matters |
| prof-benchmark-timers | Foundation | When setup skews results |
| prof-cpu-profile | Foundation | CPU-bound workloads |
| prof-heap-profile | Foundation | Memory issues, GC pressure |

2. Allocation & Data Structures (HIGH)

| Rule | Impact | When to Apply |
| ------------------------------- | ----------- | ------------------------- |
| alloc-preallocate-slices | 2-10x | Known size, append loops |
| alloc-preallocate-maps | 2-5x | Known cardinality |
| alloc-copy-to-avoid-retention | Memory leak | Subslices of large arrays |
| alloc-use-copy-builtin | 2-3x | Slice-to-slice moves |
| alloc-avoid-string-byte-conv | 2x | Frequent conversions |
| alloc-use-zero-value-buffers | Minor | Buffer initialization |
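The alloc-preallocate-slices rule can be checked directly by counting allocations. The sketch below uses two hypothetical functions, grow and prealloc, and a package-level sink so the results escape to the heap; the exact count for the growing version depends on the runtime's growth policy, so it is printed rather than assumed.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps results reachable so the slices are heap-allocated.
var sink []int

// grow appends without a capacity hint, so the backing array is
// reallocated and copied each time it fills up.
func grow(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

// prealloc sizes the backing array once, up front.
func prealloc(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	const n = 1000
	g := testing.AllocsPerRun(100, func() { sink = grow(n) })
	p := testing.AllocsPerRun(100, func() { sink = prealloc(n) })
	fmt.Printf("grow: %.0f allocs/op, prealloc: %.0f allocs/op\n", g, p)
}
```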

3. Strings, Bytes & Encoding (HIGH)

| Rule | Impact | When to Apply |
| ----------------------------- | --------- | ------------------------------------------ |
| bytes-use-strings-builder | 100-1000x | String concatenation loops (vs + operator) |
| bytes-use-bytes-buffer | 10-100x | Byte accumulation |
| bytes-grow-when-known | 2-5x | Known final size |
| bytes-avoid-fmt-in-hot-path | 5-10x | Number formatting |
| bytes-precompile-regexp | 10-100x | Regex in hot path |
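A short sketch of bytes-use-strings-builder combined with bytes-grow-when-known follows. The function names are hypothetical, and the speedup on real inputs should be measured rather than taken from the table.

```go
package main

import (
	"fmt"
	"strings"
)

// joinConcat builds the result with +=, which copies the accumulated
// string on every iteration.
func joinConcat(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder computes the final size first, grows the builder once,
// and then appends without further reallocation.
func joinBuilder(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	var sb strings.Builder
	sb.Grow(total)
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	parts := []string{"go", "-", "performance"}
	fmt.Println(joinConcat(parts) == joinBuilder(parts)) // true
	fmt.Println(joinBuilder(parts))                      // go-performance
}
```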

4. Concurrency & Synchronization (HIGH)

| Rule | Impact | When to Apply |
| ----------------------------- | --------------- | ----------------------- |
| conc-limit-goroutines | Stability | Unbounded parallelism |
| conc-bounded-channels | 2-5x | Burst absorption |
| conc-use-context-cancel | Resource safety | Long-running operations |
| conc-reduce-lock-contention | 2-10x | Mutex in profile |
| conc-use-atomics | 5-10x | Simple counters |
| conc-pass-context | Resource safety | All API boundaries |

5. GC & Memory Limits (HIGH)

| Rule | Impact | When to Apply |
| ------------------------ | ------------------- | ------------------------ |
| gc-set-gomemlimit | OOM prevention | Containerized apps |
| gc-tune-gogc | CPU/memory tradeoff | GC overhead visible |
| gc-use-sync-pool | 10-50x | Short-lived buffers |
| gc-reset-before-put | Memory leak | Pooled objects with refs |
| gc-avoid-pooling-large | Memory | Large objects (>32KB) |

6. I/O & Networking (HIGH)

| Rule | Impact | When to Apply |
| ------------------------ | ----------- | ------------------------ |
| io-buffered-io | 10x | Unbuffered file I/O |
| io-stream-large-bodies | O(1) memory | Large HTTP bodies |
| io-reuse-http-client | 7-10x | Multiple HTTP requests |
| io-tune-transport | 2-5x | High concurrency HTTP |
| io-set-timeouts | Stability | All HTTP servers/clients |

7. Runtime & Scheduling (MEDIUM)

| Rule | Impact | When to Apply |
| ------------------------- | ------------- | -------------------- |
| rt-avoid-busy-loop | 100x CPU | Polling loops |
| rt-stop-tickers | Resource leak | time.NewTicker usage |
| rt-set-gomaxprocs | Container CPU | Docker/ECS/K8s |
| rt-use-timeout-contexts | Stability | External calls |

8. Work Avoidance & Caching (MEDIUM)

| Rule | Impact | When to Apply |
| --------------------------- | ------------ | --------------------------- |
| work-cache-compiled-regex | 10-100x | Regex in request path |
| work-cache-lookups | O(1) vs O(n) | Repeated containment checks |
| work-batch-small-writes | 3-10x | Many small writes |
| work-precompute-templates | 10-100x | Template in request path |
| work-short-circuit-common | 2-10x | Common trivial inputs |


Decision Trees

"My service is slow"

Is it CPU-bound? (CPU near 100%)
├── Yes → Profile CPU
│   ├── Hot function is I/O → Check io-* rules
│   ├── Hot function is encoding → Check bytes-* rules
│   ├── Hot function is your code → Check work-* rules
│   └── Hot function is GC → Check gc-* rules
└── No → Profile for blocking
    ├── Mutex contention → Check conc-reduce-lock-contention
    ├── Channel blocking → Check conc-bounded-channels
    ├── Network I/O → Check io-* rules
    └── Disk I/O → Check io-buffered-io

"My service uses too much memory"

Is memory growing over time?
├── Yes (leak) →
│   ├── Goroutine count growing → Check context cancellation
│   ├── Map growing → Add eviction/TTL
│   ├── Slice retention → Use copy() for subslices
│   └── Pooled object refs → Reset before Put
└── No (steady but high) →
    ├── Large allocations → Stream instead of ReadAll
    ├── Many small allocations → Use sync.Pool
    ├── High peak usage → Set GOMEMLIMIT
    └── Buffer reallocation → Preallocate with known size

"My service has GC problems"

Is GC taking too much CPU?
├── Yes →
│   ├── Many objects → Pool short-lived objects
│   ├── Large heap → Set GOMEMLIMIT higher
│   └── Frequent cycles → Increase GOGC (200-400)
└── No, but pauses are long →
    ├── Large heap → Reduce allocation rate
    └── Pointer-heavy structures → Consider flat arrays

Profiling Cheat Sheet

Enable pprof in Production

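import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    go func() {
        // Bind to localhost so the profiling endpoints are not exposed publicly
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... rest of app
}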

Common pprof Commands

# Interactive mode
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof http://localhost:6060/debug/pprof/heap

# Web UI (recommended)
go tool pprof -http=:8080 cpu.prof

# Command-line analysis
go tool pprof -top cpu.prof
go tool pprof -list=FunctionName cpu.prof
go tool pprof -png -output=profile.png cpu.prof

# Compare profiles
go tool pprof -base before.prof after.prof

# Allocation analysis
go tool pprof -alloc_objects heap.prof  # Count of allocations
go tool pprof -alloc_space heap.prof    # Bytes allocated
go tool pprof -inuse_objects heap.prof  # Current live objects
go tool pprof -inuse_space heap.prof    # Current memory usage

Benchmark Commands

# Run all benchmarks
go test -bench=. -benchmem ./...

# Run specific benchmark
go test -bench=BenchmarkProcess -benchmem

# Multiple runs for statistical significance
go test -bench=. -benchmem -count=10 | tee results.txt

# Compare results
go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt

# Generate profiles from benchmarks
go test -bench=BenchmarkProcess -cpuprofile=cpu.prof -memprofile=mem.prof

Profile-Guided Optimization (PGO)

Go 1.21+ supports PGO (previewed in Go 1.20, generally available in 1.21), which typically yields a 2-7% performance improvement in production workloads.

PGO Workflow

# Step 1: Collect production CPU profile (30+ seconds recommended)
curl "http://localhost:6060/debug/pprof/profile?seconds=60" > default.pgo

# Step 2: Place profile in the main package directory
cp default.pgo ./cmd/myservice/default.pgo

# Step 3: Build with PGO (auto-detects default.pgo)
go build ./cmd/myservice

# Step 4: Verify PGO was applied (the recorded build settings should include a -pgo entry)
go version -m ./myservice | grep -- -pgo

Best practices:

  • Collect profiles under realistic production load
  • Re-collect profiles periodically (weekly/monthly)
  • PGO improves inlining and devirtualization decisions
  • Works best for CPU-bound workloads

PGO Impact by Workload Type

| Workload Type | Expected Improvement | Notes |
| ----------------- | -------------------- | --------------------------------------- |
| HTTP services | 2-4% | Helps with routing, JSON, template code |
| gRPC services | 3-5% | Protocol buffer encoding benefits |
| CLI tools | 2-3% | Shorter startup time |
| Computation-heavy | 5-7% | Best for math, parsing, encoding |

Go 1.24 Features (February 2025+)

Go 1.24 introduces significant runtime improvements:

Swiss Tables for Maps

Maps now use Swiss Tables internally for ~10% faster operations on average:

// No code changes required - automatic in Go 1.24+
m := make(map[string]int)  // Uses Swiss Tables internally

Impact: Lookup and iteration 10-30% faster depending on workload.

testing.B.Loop for Benchmarks

New idiomatic benchmark pattern (Go 1.24+):

// Go 1.23 and earlier
func BenchmarkProcess(b *testing.B) {
    for i := 0; i < b.N; i++ {
        process()
    }
}

// Go 1.24+ (preferred)
func BenchmarkProcess(b *testing.B) {
    for b.Loop() {
        process()
    }
}

Benefits: Setup before the loop is excluded from timing without manual b.ResetTimer calls, the compiler cannot optimize away the loop body, and the syntax is cleaner.

Version Compatibility Table

| Feature | Minimum Go Version | Impact |
| ----------------------- | ------------------ | ------------------ |
| Generics | 1.18 | Type-safe pools |
| GOMEMLIMIT | 1.19 | OOM prevention |
| PGO | 1.21 | 2-7% |
| maps stdlib package | 1.21 | Clone, Copy (Keys and Values iterators arrived in 1.23) |
| slices stdlib package | 1.21 | Sort, Clone |
| sync.OnceFunc | 1.21 | Lazy init |
| cmp package | 1.21 | Generic compare |
| log/slog | 1.21 | Structured logs |
| Swiss Tables (maps) | 1.24 | 10% faster maps |
| testing.B.Loop | 1.24 | Cleaner benchmarks |

References

Full Compiled Document

For the complete guide with all rules expanded: AGENTS.md