Category: Content and Media (no API key required)

mechinterp-decoder

Analyzes SAE decoder weights: output influence, feature importance, and decoder similarity

Author: jakexiaohub (GitHub)

MechInterp Decoder

Analyze sparse autoencoder (SAE) features through their decoder weights. This skill answers "What does this feature RECOMMEND?" rather than "What activates this feature?"

Purpose

Decoder analysis provides a complementary perspective to activation analysis:

| Analysis Type | Question Answered |
|---------------|-------------------|
| Activation (overview, sweeps) | "What inputs activate this feature?" |
| Decoder (this skill) | "What outputs does this feature promote?" |

For diffuse or heterogeneous features where activation analysis shows multiple modes, decoder analysis often reveals the unifying concept.
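
One way to read "output influence" is as a feature's decoder direction projected through the model's output layer, giving one contribution per token. The sketch below assumes that reading. The function name, the plain-list data layout, and the toy numbers are all illustrative and are not taken from decoder_cli, whose actual computation may differ.

```python
def output_influence(decoder_row, output_weights, vocab):
    """Rank tokens by how strongly one feature pushes their logits.

    decoder_row:    list of d_model floats (one feature's decoder direction)
    output_weights: one list of d_model floats per token (assumed output layer)
    vocab:          token names aligned with output_weights
    """
    contributions = [
        sum(d * w for d, w in zip(decoder_row, token_row))
        for token_row in output_weights
    ]
    # Most promoted first, most suppressed last.
    return sorted(zip(vocab, contributions), key=lambda pair: pair[1], reverse=True)


# Toy numbers, invented for illustration only.
decoder_row = [1.0, -0.5]
output_weights = [[0.2, 0.0], [0.0, 0.6], [-0.1, 0.1]]
vocab = ["comeback", "run_speed_up_57", "ink_saver_main_6"]

ranked = output_influence(decoder_row, output_weights, vocab)
for token, value in ranked:
    print(f"{token}: {value:+.3f}")
```

Plain lists keep the sketch dependency-free; real decoder matrices would normally be handled as tensors.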

When to Use

Use this skill when:

  1. Activation analysis is inconclusive - Multiple modes or no clear pattern
  2. Feature appears heterogeneous - Different builds activate it for different reasons
  3. Looking for "what does it recommend" - Shift from inputs to outputs
  4. Checking AP level preferences - Does the feature prefer low-AP (_3, _6) or high-AP (_57) tokens?
  5. Finding similar features - Cluster features by decoder similarity

Commands

Output Influence

Show what tokens a feature promotes (positive contribution) or suppresses (negative contribution):

cd /root/dev/SplatNLP

# Basic output influence
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra

# JSON output
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra \
    --format json

# More tokens
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra \
    --top-k 25

Sample Output:

## Feature 13934 Output Influence (ultra)

### Tokens This Feature PROMOTES

| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| respawn_punisher | +0.232 | respawn_punisher | binary |
| comeback | +0.159 | comeback | binary |
| quick_super_jump_6 | +0.155 | quick_super_jump | 6 |
| intensify_action_3 | +0.140 | intensify_action | 3 |
| ink_saver_main_6 | +0.128 | ink_saver_main | 6 |

### Tokens This Feature SUPPRESSES

| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| run_speed_up_57 | -0.301 | run_speed_up | 57 |
| quick_respawn_57 | -0.247 | quick_respawn | 57 |
| swim_speed_up_57 | -0.209 | swim_speed_up | 57 |

### Interpretation
- **Top promoted**: respawn_punisher (+0.232)
- **Top suppressed**: run_speed_up_57 (-0.301)
- **Pattern**: Promotes low-AP tokens, suppresses high-AP stacking
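
The Family and AP Level columns above appear to follow from the token name: a trailing number is the AP level, and a token without one is reported as binary. A minimal sketch of that split, assuming the naming convention holds for every token (the helper name `split_token` is hypothetical and not part of the CLI):

```python
import re

# A token such as "quick_super_jump_6" ends in "_<digits>"; binary tokens do not.
_AP_SUFFIX = re.compile(r"^(?P<family>.+)_(?P<ap>\d+)$")


def split_token(token):
    """Return (family, ap_level); ap_level is the string "binary" if no suffix."""
    match = _AP_SUFFIX.match(token)
    if match:
        return match.group("family"), int(match.group("ap"))
    return token, "binary"


for token in ["respawn_punisher", "quick_super_jump_6", "run_speed_up_57"]:
    print(token, "->", split_token(token))
```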

Weight Percentile

Check how important a feature is by its decoder weight magnitude:

poetry run python -m splatnlp.mechinterp.cli.decoder_cli weight-percentile \
    --feature-id 13934 \
    --model ultra

Sample Output:

## Feature 13934 Decoder Weight (ultra)

- **Magnitude**: 2.3456
- **Percentile**: 78.5%
- **Total features**: 24576

Interpretation:

  • High percentile (>90%): Feature has strong output influence
  • Low percentile (<10%): Feature has weak output influence
  • Note: Low-magnitude features may still be important for specific tokens
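
To make the two reported numbers concrete, the sketch below assumes that "magnitude" means the L2 norm of a feature's decoder row and that "percentile" is the share of features with a strictly smaller norm. Both definitions are assumptions, as are the function names; the CLI may compute them differently.

```python
import math


def decoder_magnitude(row):
    """L2 norm of one decoder row (assumed definition of 'magnitude')."""
    return math.sqrt(sum(x * x for x in row))


def weight_percentile(decoder, feature_id):
    """Return (magnitude, percentile) for one feature.

    Percentile here is the percentage of features whose magnitude is
    strictly smaller than the target feature's magnitude.
    """
    magnitudes = [decoder_magnitude(row) for row in decoder]
    target = magnitudes[feature_id]
    below = sum(1 for m in magnitudes if m < target)
    return target, 100.0 * below / len(magnitudes)


# Toy decoder with four features, invented for illustration.
toy_decoder = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
magnitude, percentile = weight_percentile(toy_decoder, feature_id=0)
print(f"Magnitude: {magnitude:.4f}, Percentile: {percentile:.1f}%")
```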

Similar Features (by Decoder)

Find features with similar decoder patterns (what they recommend):

poetry run python -m splatnlp.mechinterp.cli.decoder_cli similar \
    --feature-id 13934 \
    --model ultra \
    --top-k 10

Sample Output:

## Features Similar to 13934 (ultra)

| Feature ID | Cosine Similarity |
|------------|-------------------|
| 13892 | 0.9234 |
| 14501 | 0.8876 |
| 12044 | 0.8521 |
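
The ranking above can be pictured as cosine similarity between decoder rows. The following is a dependency-free sketch under that assumption; the function names and toy vectors are hypothetical and do not come from the SplatNLP code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(decoder, feature_id, top_k=10):
    """Return up to top_k (feature_id, similarity) pairs, excluding the query feature."""
    target = decoder[feature_id]
    scores = [
        (other_id, cosine(target, row))
        for other_id, row in enumerate(decoder)
        if other_id != feature_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy decoder rows, invented for illustration.
toy_decoder = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbours = most_similar(toy_decoder, feature_id=0, top_k=2)
for other_id, similarity in neighbours:
    print(f"{other_id}: {similarity:.4f}")
```

Because cosine similarity ignores vector length, two features can rank as near-identical here while having very different weight percentiles.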

Experiment Runner

For programmatic use or integration with runner_cli:

# Create spec file
cat > decoder_spec.json << 'EOF'
{
  "type": "decoder_output_analysis",
  "feature_id": 13934,
  "model_type": "ultra",
  "variables": {
    "top_k_promoted": 15,
    "top_k_suppressed": 15,
    "group_by_family": true,
    "include_ap_level": true
  }
}
EOF

# Run via runner CLI
poetry run python -m splatnlp.mechinterp.cli.runner_cli \
    --spec-path decoder_spec.json
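
When several features need the same analysis, the spec files can be generated from Python rather than written by hand. This sketch reuses the spec fields shown above unchanged; the helper names and the `decoder_spec_<id>.json` file naming are assumptions made for illustration.

```python
import json
from pathlib import Path


def build_decoder_spec(feature_id, model_type="ultra", top_k=15):
    """Build a spec dict with the same fields as decoder_spec.json above."""
    return {
        "type": "decoder_output_analysis",
        "feature_id": feature_id,
        "model_type": model_type,
        "variables": {
            "top_k_promoted": top_k,
            "top_k_suppressed": top_k,
            "group_by_family": True,
            "include_ap_level": True,
        },
    }


def write_specs(feature_ids, out_dir):
    """Write one spec file per feature and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for feature_id in feature_ids:
        # File naming is a hypothetical convention, not one the runner requires.
        path = out_dir / f"decoder_spec_{feature_id}.json"
        path.write_text(json.dumps(build_decoder_spec(feature_id), indent=2))
        paths.append(path)
    return paths
```

Each path returned by `write_specs` can then be passed to the runner CLI through `--spec-path`, exactly as in the single-file command above.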

Interpretation Guide

AP Level Patterns

| Pattern | Meaning |
|---------|---------|
| Promotes _3, _6; Suppresses _51, _57 | "Use balanced spread, not stacking" |
| Promotes _57; Suppresses low AP | "Heavy stacking is the goal" |
| Promotes binary (RP, CB, OG) | "These specific abilities are key" |
| Mixed AP levels promoted | "Ability presence matters, not amount" |
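
As a rough aid for reading these patterns, the mean AP level of the promoted tokens can be compared with that of the suppressed tokens. The heuristic below is an illustrative assumption and not part of the skill: the function names and label strings are invented, and binary tokens are simply ignored when averaging.

```python
import re


def ap_level(token):
    """Return the trailing AP level as an int, or None for binary tokens."""
    match = re.search(r"_(\d+)$", token)
    return int(match.group(1)) if match else None


def mean_ap(tokens):
    """Average AP level over tokens that carry one; None if there are none."""
    levels = [ap for ap in map(ap_level, tokens) if ap is not None]
    return sum(levels) / len(levels) if levels else None


def describe_ap_pattern(promoted, suppressed):
    """Label the AP preference implied by promoted vs suppressed token lists."""
    promoted_mean, suppressed_mean = mean_ap(promoted), mean_ap(suppressed)
    if promoted_mean is None:
        return "only binary abilities promoted"
    if suppressed_mean is None:
        return "no AP-level tokens suppressed"
    if promoted_mean < suppressed_mean:
        return "promotes lower AP than it suppresses"
    if promoted_mean > suppressed_mean:
        return "promotes higher AP than it suppresses"
    return "no AP-level preference"


# Token lists copied from the Feature 13934 sample output.
promoted = ["respawn_punisher", "comeback", "quick_super_jump_6",
            "intensify_action_3", "ink_saver_main_6"]
suppressed = ["run_speed_up_57", "quick_respawn_57", "swim_speed_up_57"]
print(describe_ap_pattern(promoted, suppressed))
```

A single mean hides mixed cases (for example, both _3 and _57 promoted), so the full promoted and suppressed lists remain the primary evidence.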

Common Feature Types

| Output Pattern | Feature Type |
|----------------|--------------|
| Single family promoted | Family detector (e.g., SCU detector) |
| Low-AP promoted, high-AP suppressed | "Balanced utility recommendation" |
| Binary abilities promoted | "Build style marker" (aggressive, defensive) |
| Death perks promoted (QR, SS, CB) | "Death-tolerant" archetype |
| Death perks suppressed | "Death-averse" archetype |

Integration with Investigation Workflow

Decoder analysis fits into the investigation workflow as follows:

1. Overview (mechinterp-overview)
   ↓
2. Hypothesis formation
   ↓
3. 1D Sweeps (mechinterp-runner)
   ↓
4. Core Coverage Check ← NEW: Catch tail markers
   ↓
5. If diffuse/heterogeneous:
   → Decoder Output Analysis ← THIS SKILL
   ↓
6. Label formulation

Example: Feature 13934 (from investigation log)

Problem: Activation analysis showed two opposite modes (RP anchor vs Zombie builds).

Solution: Decoder analysis revealed unifying pattern:

PROMOTES: low-AP utility (_3, _6 tokens)
SUPPRESSES: heavy stacking (_51, _57 tokens)

→ Feature recommends "balanced utility spread" regardless of death strategy

Key Insight: Different builds (RP vs Zombie) activate the feature because they share a NEED (balanced utility), not a BUILD pattern.

See Also

  • mechinterp-overview: Initial feature assessment
  • mechinterp-runner: Run experiments (including core_coverage_analysis, decoder_output_analysis)
  • mechinterp-investigator: Full investigation workflow
  • mechinterp-labeler: Save labels after investigation