Huawei Cloud Ascend Operator MFU Calculator

Overview

This skill calculates MFU (Machine FLOP Utilization) for operators such as matmul, GEMM, and FlashAttention on Ascend NPUs, and shows the formulas and derivation steps behind each result.

Architecture: Input Validation → FLOPs Calculation → Achieved TFLOPs/s → MFU Calculation → Result Analysis

Related Skills:

huawei-cloud-ascend-profiler-db-explorer - Profiling data analysis for operator performance data

Prerequisites

Python 3.8+ installed
Basic understanding of FLOPs calculation concepts

Usage Scenarios

Typical Problem Scenarios:

Evaluating how well an operator utilizes Ascend NPU compute power
Comparing performance of different operator implementations
Identifying optimization opportunities for matrix operations

Typical User Utterances:

"Calculate MFU for my GEMM operator"
"What's the machine FLOP utilization for FlashAttention?"
"Analyze my matmul operator performance efficiency"

Workflow

Input Collection: Gather operator parameters (matrix dimensions, data types, execution time)
FLOPs Calculation: Compute theoretical FLOPs for the operation
Achieved Performance: Divide the operation's FLOPs by its execution time to get the achieved throughput in TFLOPS
MFU Calculation: Apply the formula MFU = (Achieved FLOPS / Peak FLOPS) × 100%
Result Analysis: Provide interpretation and optimization suggestions
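
The FLOPs Calculation step depends on the operator type. The sketch below uses the standard counting conventions: 2·M·N·K for a matmul/GEMM (one multiply and one add per accumulated term) and 4·B·H·S²·D for the two matmuls in an attention forward pass. The function names are hypothetical and not part of this skill, and the attention count deliberately ignores softmax, scaling, and masking work, so treat it as an approximation.

```python
def matmul_flops(m: int, n: int, k: int, batch: int = 1) -> int:
    """FLOPs for C[m, n] = A[m, k] @ B[k, n], optionally batched.

    Each of the m * n outputs accumulates k products: one multiply
    plus one add per term, hence the factor of 2.
    """
    return 2 * batch * m * n * k


def attention_forward_flops(batch: int, heads: int, seq_len: int,
                            head_dim: int, causal: bool = False) -> int:
    """Matmul FLOPs of one attention forward pass.

    Counts Q @ K^T and P @ V, each 2 * seq_len^2 * head_dim FLOPs per head.
    Softmax, scaling, and masking are not counted (assumed convention).
    With a causal mask roughly half of the score matrix is skipped.
    """
    flops = 4 * batch * heads * seq_len * seq_len * head_dim
    return flops // 2 if causal else flops


# Illustrative shapes only; substitute the shapes of the profiled operator.
print(matmul_flops(4096, 4096, 4096))
print(attention_forward_flops(batch=1, heads=32, seq_len=2048, head_dim=128))
```

Whichever convention is used, it should stay the same across the implementations being compared, because MFU scales directly with the FLOPs count.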

MFU Calculation Formula

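MFU = (Achieved FLOPS / Peak FLOPS) × 100%

Where:

Achieved FLOPS = Operation FLOPs / Execution Time (in seconds), i.e. the achieved throughput in FLOP/s
Peak FLOPS = Hardware-specific peak throughput for the data type in use (e.g., Ascend 910B: 256 TFLOPS for FP16); take the exact figure for your device variant from the official specifications

Both values must be expressed in the same unit (for example TFLOPS) before dividing.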
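
As a worked example of the formula, the following minimal sketch converts an execution time in milliseconds into achieved TFLOPS and then into MFU. The function name `compute_mfu` is hypothetical; its arguments are named after the entries under Parameter Confirmation. The 1.0 ms timing is an assumed value for illustration, and the 256 TFLOPS peak is the example figure quoted above.

```python
def compute_mfu(flops: float, time_ms: float, peak_tflops: float) -> dict:
    """Return achieved throughput (TFLOPS) and MFU (percent) for one operator run."""
    if flops <= 0 or time_ms <= 0 or peak_tflops <= 0:
        raise ValueError("flops, time_ms and peak_tflops must all be positive")
    time_s = time_ms / 1e3                     # milliseconds -> seconds
    achieved_tflops = flops / time_s / 1e12    # FLOP/s -> TFLOPS
    mfu_percent = achieved_tflops / peak_tflops * 100.0
    return {"achieved_tflops": achieved_tflops, "mfu_percent": mfu_percent}


# Illustrative input: a 4096 x 4096 x 4096 FP16 GEMM (2 * M * N * K FLOPs)
# with an assumed measured time of 1.0 ms.
result = compute_mfu(flops=2 * 4096**3, time_ms=1.0, peak_tflops=256.0)
print(f"{result['achieved_tflops']:.1f} TFLOPS, MFU = {result['mfu_percent']:.1f}%")
```

With these inputs the operator reaches about 137.4 TFLOPS, which is about 53.7% of the 256 TFLOPS peak.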

Reference Documents

| Document | Description |
| -------- | ----------- |
| Ascend 910B Series Technical Specifications | Official Ascend 910B series product specifications |
| MFU Calculation Methodology | Detailed MFU calculation formulas and examples |
| FlashAttention Technical Paper | Original FlashAttention research paper |

Enhanced Features

Intelligent Bottleneck Diagnoser

AI-powered bottleneck diagnosis that analyzes profiling data to identify root causes automatically
Classifies bottlenecks into categories: memory-bound, compute-bound, communication-bound, or operator-fallback
Provides actionable optimization recommendations with priority ranking
Includes pattern matching for known performance anti-patterns
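
One common way to separate the memory-bound and compute-bound categories is a roofline check: compare the operator's arithmetic intensity (FLOPs per byte moved) with the hardware ridge point (peak FLOP/s divided by peak memory bandwidth). The sketch below is an assumption about how such a check could look, not the diagnoser's actual logic. It covers only those two categories, the function name is hypothetical, and the 1000 GB/s bandwidth is a placeholder rather than an Ascend specification.

```python
def classify_roofline(flops: float, bytes_moved: float,
                      peak_tflops: float, peak_bandwidth_gbs: float) -> str:
    """Label an operator memory-bound or compute-bound with a roofline test."""
    if min(flops, bytes_moved, peak_tflops, peak_bandwidth_gbs) <= 0:
        raise ValueError("all inputs must be positive")
    intensity = flops / bytes_moved                            # FLOP per byte
    ridge = (peak_tflops * 1e12) / (peak_bandwidth_gbs * 1e9)  # FLOP per byte
    return "memory-bound" if intensity < ridge else "compute-bound"


# Placeholder hardware numbers: 256 TFLOPS peak (the example value used in
# this document) and an assumed 1000 GB/s memory bandwidth.
PEAK_TFLOPS = 256.0
ASSUMED_BANDWIDTH_GBS = 1000.0

# FP16 GEMM, 4096^3: reads A and B, writes C, 2 bytes per element.
n = 4096
gemm_label = classify_roofline(2 * n**3, 3 * n * n * 2,
                               PEAK_TFLOPS, ASSUMED_BANDWIDTH_GBS)

# FP16 elementwise add over n*n elements: 1 FLOP per element, 3 tensors touched.
add_label = classify_roofline(n * n, 3 * n * n * 2,
                              PEAK_TFLOPS, ASSUMED_BANDWIDTH_GBS)

print(gemm_label, add_label)
```

A low MFU on an operator that lands on the memory-bound side usually points at data movement rather than at the compute units, which changes which optimization is worth trying first.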

Parameter Confirmation

| Parameter | Description | Required |
|-----------|-------------|----------|
| operator | Operator type (matmul/flash_attention/gemm, etc.) | Yes |
| flops | Theoretical FLOPs of the operator | Yes |
| time_ms | Operator execution time (milliseconds) | Yes |
| peak_tflops | Hardware peak computing power (TFLOPS) | Yes |
| device | NPU device type (910B/910, etc.) | No |
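
The parameter table above can be checked before any calculation runs. The following is a minimal sketch of such an input validation step, assuming the parameters arrive as a plain dictionary. The `MfuParams` class and `validate_params` function are hypothetical names used for illustration, not part of this skill's interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MfuParams:
    operator: str
    flops: float
    time_ms: float
    peak_tflops: float
    device: Optional[str] = None   # optional, e.g. "910B"


def validate_params(raw: dict) -> MfuParams:
    """Check required fields and value ranges, then return typed parameters."""
    required = ("operator", "flops", "time_ms", "peak_tflops")
    missing = [name for name in required if raw.get(name) is None]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))

    for name in ("flops", "time_ms", "peak_tflops"):
        value = raw[name]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value <= 0:
            raise ValueError(f"{name} must be a positive number, got {value!r}")

    operator = str(raw["operator"]).strip().lower()
    if not operator:
        raise ValueError("operator must be a non-empty string")

    return MfuParams(operator=operator, flops=float(raw["flops"]),
                     time_ms=float(raw["time_ms"]),
                     peak_tflops=float(raw["peak_tflops"]),
                     device=raw.get("device"))


# Illustrative request; the numbers are placeholders.
params = validate_params({"operator": "MatMul", "flops": 2 * 4096**3,
                          "time_ms": 1.0, "peak_tflops": 256.0})
print(params)
```

Rejecting non-positive `time_ms` and `peak_tflops` up front avoids division by zero and meaningless negative MFU values later in the pipeline.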