Category: Other (no API key required)

huawei-cloud-ascend-op-mfu-calculator

Calculates MFU (Machine FLOP Utilization) for operators such as matmul, GEMM, and FlashAttention on Ascend NPUs, with clear formulas and derivation steps.

Author: huaweiclouddevhubclawhub

Huawei Cloud Ascend Operator MFU Calculator

Overview

This skill calculates MFU (Machine FLOP Utilization) for operators such as matmul, GEMM, and FlashAttention on Ascend NPUs, and shows the formulas and derivation steps behind each result.

Architecture: Input Validation → FLOPs Calculation → Achieved TFLOP/s → MFU Calculation → Result Analysis

Related Skills:

  • huawei-cloud-ascend-profiler-db-explorer - Profiling data analysis for operator performance data

Prerequisites

  1. Python 3.8+ installed
  2. Basic understanding of FLOPs calculation concepts

Usage Scenarios

Typical Problem Scenarios:

  • Evaluating how well an operator utilizes Ascend NPU compute power
  • Comparing performance of different operator implementations
  • Identifying optimization opportunities for matrix operations

Typical User Utterances:

  • "Calculate MFU for my GEMM operator"
  • "What's the machine FLOP utilization for FlashAttention?"
  • "Analyze my matmul operator performance efficiency"

Workflow

  1. Input Collection: Gather operator parameters (matrix dimensions, data types, execution time)
  2. FLOPs Calculation: Compute theoretical FLOPs for the operation
  3. Achieved Performance: Divide the operator's FLOPs by its execution time to get the achieved throughput in TFLOP/s
  4. MFU Calculation: Apply the formula MFU = Achieved FLOP/s / Peak FLOP/s
  5. Result Analysis: Provide interpretation and optimization suggestions
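
Step 2 depends on the operator type. As a minimal sketch, assuming the common convention that one multiply-add counts as 2 FLOPs and that attention cost is dominated by its two matrix multiplications (softmax and scaling ignored), the operation counts might be computed as follows. The names `matmul_flops` and `attention_flops` are illustrative and are not part of the skill's interface.

```python
def matmul_flops(m: int, n: int, k: int, batch: int = 1) -> int:
    """FLOPs for C[m, n] = A[m, k] @ B[k, n], counting a multiply-add as 2 FLOPs."""
    for name, value in (("m", m), ("n", n), ("k", k), ("batch", batch)):
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return 2 * batch * m * n * k


def attention_flops(batch: int, heads: int, seq_q: int, seq_kv: int,
                    head_dim: int, causal: bool = False) -> int:
    """Forward-pass FLOPs for attention, counting only the two matmuls.

    Q @ K^T and P @ V each cost 2 * seq_q * seq_kv * head_dim per head.
    Softmax and scaling are ignored (assumed convention). With a causal
    mask about half of the score matrix is skipped, so the count is halved.
    """
    qk = matmul_flops(seq_q, seq_kv, head_dim, batch * heads)   # scores
    pv = matmul_flops(seq_q, head_dim, seq_kv, batch * heads)   # weighted sum
    total = qk + pv
    return total // 2 if causal else total


if __name__ == "__main__":
    print(matmul_flops(4096, 4096, 4096))          # 137438953472
    print(attention_flops(1, 8, 1024, 1024, 64))   # 2147483648
```

Whichever counting convention is chosen, it should be applied consistently when comparing implementations, because MFU scales linearly with the FLOP count.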

MFU Calculation Formula

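MFU = (Achieved FLOP/s / Peak FLOP/s) × 100%

Where:

  • Achieved FLOP/s = Operation FLOPs / Execution Time (in seconds)
  • Peak FLOP/s = Hardware-specific peak throughput for the data type in use (e.g., Ascend 910B: 256 TFLOPS for FP16; confirm the figure for your exact device variant against the official specifications)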
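
To make the formula concrete, here is a small sketch that applies it to a 4096 × 4096 × 4096 FP16 matmul. The function name `compute_mfu`, the 1.0 ms execution time, and the use of 256 TFLOPS as the peak are assumptions chosen for illustration only.

```python
from typing import Tuple


def compute_mfu(flops: float, time_ms: float, peak_tflops: float) -> Tuple[float, float]:
    """Return (achieved TFLOP/s, MFU in percent)."""
    if flops <= 0 or time_ms <= 0 or peak_tflops <= 0:
        raise ValueError("flops, time_ms and peak_tflops must all be positive")
    # Convert milliseconds to seconds, then FLOP/s to TFLOP/s.
    achieved_tflops = flops / (time_ms * 1e-3) / 1e12
    mfu_percent = achieved_tflops / peak_tflops * 100.0
    return achieved_tflops, mfu_percent


if __name__ == "__main__":
    flops = 2 * 4096 ** 3  # 2*M*N*K for a square matmul
    achieved, mfu = compute_mfu(flops, time_ms=1.0, peak_tflops=256.0)
    print(f"achieved = {achieved:.2f} TFLOP/s, MFU = {mfu:.2f}%")
    # achieved = 137.44 TFLOP/s, MFU = 53.69%
```

An MFU above 100% usually signals a wrong input, such as a peak figure for a different data type or a time given in the wrong unit.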

Reference Documents

| Document | Description |
| -------- | ----------- |
| Ascend 910B Series Technical Specifications | Official Ascend 910B series product specifications |
| MFU Calculation Methodology | Detailed MFU calculation formulas and examples |
| FlashAttention Technical Paper | Original FlashAttention research paper |

Enhanced Features

Intelligent Bottleneck Diagnoser

  • AI-powered bottleneck diagnosis that analyzes profiling data to identify root causes automatically
  • Classifies bottlenecks into categories: memory-bound, compute-bound, communication-bound, or operator-fallback
  • Provides actionable optimization recommendations with priority ranking
  • Includes pattern matching for known performance anti-patterns
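
One common way to separate memory-bound from compute-bound operators is a roofline-style comparison of arithmetic intensity against the hardware's ridge point. The sketch below illustrates that idea only; it is not necessarily how the diagnoser works, and the function `classify_bound` and the 1000 GB/s bandwidth figure are hypothetical placeholders rather than Ascend specifications.

```python
def classify_bound(flops: float, bytes_moved: float,
                   peak_tflops: float, mem_bw_gbps: float) -> str:
    """Roofline-style check: compare FLOP/byte against the ridge point."""
    if min(flops, bytes_moved, peak_tflops, mem_bw_gbps) <= 0:
        raise ValueError("all inputs must be positive")
    intensity = flops / bytes_moved                        # FLOP per byte
    ridge = (peak_tflops * 1e12) / (mem_bw_gbps * 1e9)     # FLOP per byte
    return "compute-bound" if intensity >= ridge else "memory-bound"


if __name__ == "__main__":
    # FP16 4096^3 matmul: read A and B, write C, 2 bytes per element.
    n = 4096
    matmul = classify_bound(2 * n ** 3, 3 * n * n * 2,
                            peak_tflops=256.0, mem_bw_gbps=1000.0)
    # Elementwise FP16 add: 1 FLOP per element, 6 bytes moved per element.
    add = classify_bound(n * n, 6 * n * n,
                         peak_tflops=256.0, mem_bw_gbps=1000.0)
    print(matmul, add)
```

Communication-bound and operator-fallback cases cannot be detected this way; they would need timeline data from profiling.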

Parameter Confirmation

| Parameter | Description | Required |
|-----------|-------------|----------|
| operator | Operator type (matmul/flash_attention/gemm, etc.) | Yes |
| flops | Theoretical FLOPs of the operator | Yes |
| time_ms | Operator execution time (milliseconds) | Yes |
| peak_tflops | Hardware peak computing power (TFLOPS) | Yes |
| device | NPU device type (910B/910, etc.) | No |
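
The input validation step could look roughly like the sketch below, which checks the parameters from the table before any calculation runs. The function `validate_params` and its error messages are hypothetical; only the parameter names come from the table.

```python
import math

REQUIRED = ("operator", "flops", "time_ms", "peak_tflops")
NUMERIC = ("flops", "time_ms", "peak_tflops")


def validate_params(params: dict) -> dict:
    """Check required parameters and return a cleaned copy."""
    missing = [name for name in REQUIRED if params.get(name) in (None, "")]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")

    cleaned = {
        "operator": str(params["operator"]).strip().lower(),
        "device": params.get("device"),  # optional, may be None
    }
    for name in NUMERIC:
        try:
            value = float(params[name])
        except (TypeError, ValueError):
            raise ValueError(f"{name} must be a number, got {params[name]!r}") from None
        # Reject NaN, infinity, zero, and negative values.
        if not math.isfinite(value) or value <= 0:
            raise ValueError(f"{name} must be a positive finite number, got {value}")
        cleaned[name] = value
    return cleaned


if __name__ == "__main__":
    print(validate_params({"operator": "MatMul", "flops": "1e9",
                           "time_ms": 2, "peak_tflops": 256}))
```

Rejecting non-positive `time_ms` early avoids a division by zero when the achieved throughput is computed.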