b200-tuning

Category: Development & Coding | Uploader: dsl-learn | Version: v1.0 (Latest)

B200 / Blackwell (sm_100a) performance tuning for Triton and CUTLASS kernels. Use when the user asks to optimize, benchmark, or tune a kernel for B200, mentions TFLOPS targets, tile sizes, pipeline stages, TMA, WGMMA, or asks why a kernel is slow on Blackwell.

---

# B200 Performance Tuning

## Hardware reference: NVIDIA B200 (sm_100a)

| Property | Value |
|---|---|
| BF16 compute | ~2.25 PFLOPS (WGMMA / tcgen05) |
| SRAM per SM | 228 KB |
| L2 cache | 96 MB |
| HBM3e bandwidth | 8 TB/s |
| Warp groups per SM | 4 × 128 threads |
| TMA | Hardware async copy with bounds checking |
| WGMMA atom (BF16) | `64×N×K`, K multiple of 16 |

## Triton

### Starting autotune configs

```python
triton.Config({"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64}, num_warps=8, num_stages=4),
triton.Config({"BLOCK_M": 256, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=8, num_stages=4),
triton.Config({"BLOCK_M": 256, "BLOCK_N": 256, "BLOCK_K": 64}, num_warps=16, num_stages=4),
triton.Config({"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64}, num_warps=16, num_stages=5),
triton.Config({"BLOCK_M": 256, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=16, num_stages=5),
```

| Parameter | Guidance |
|---|---|
| `BLOCK_K` | 64 (= 128-byte cache line for bf16) |
| `num_warps` | 16 → 4 warpgroups → full SM occupancy |
| `num_stages` | 4–5: deep pipeline hides HBM latency |
| `key` | Always include problem-size dims, e.g. `key=["M", "N", "K"]` |

### TMA

Use `tl.make_tensor_descriptor` + `tl.load_tensor_descriptor` for **every** global access. Benefits: hardware bounds checking (no mask overhead), async prefetch, coalesced access independent of tile alignment.

### Fused epilogue

Do residual add / bias / activation inside the same kernel before the final store; this avoids a separate memory round-trip.

### SMEM budget check

```
(BLOCK_M × BLOCK_K + BLOCK_N × BLOCK_K) × 2 bytes × num_stages ≤ 228 KB
```
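The budget formula above is easy to apply mechanically before handing configs to the autotuner. The following is a minimal sketch in plain Python, not part of the original skill: the helper names `smem_bytes` and `fits_smem` are hypothetical, the 228 KB limit is taken from the hardware table as 228 × 1024 bytes, and the estimate counts only the pipelined A and B tiles exactly as the formula does.

```python
# Rough shared-memory estimate for a pipelined GEMM tile config.
# Assumption: 228 KB is interpreted as 228 * 1024 bytes, and only the
# A (BLOCK_M x BLOCK_K) and B (BLOCK_N x BLOCK_K) tiles are counted.
SMEM_LIMIT_BYTES = 228 * 1024


def smem_bytes(block_m, block_n, block_k, num_stages, elem_bytes=2):
    """Bytes of shared memory for A and B tiles across all pipeline stages."""
    return (block_m * block_k + block_n * block_k) * elem_bytes * num_stages


def fits_smem(block_m, block_n, block_k, num_stages,
              elem_bytes=2, limit=SMEM_LIMIT_BYTES):
    """True if the estimated footprint is within the per-SM budget."""
    return smem_bytes(block_m, block_n, block_k, num_stages, elem_bytes) <= limit


# (BLOCK_M, BLOCK_N, BLOCK_K, num_stages) for the starting configs listed above.
STARTING_CONFIGS = [
    (128, 256, 64, 4),
    (256, 128, 64, 4),
    (256, 256, 64, 4),
    (128, 256, 64, 5),
    (256, 128, 64, 5),
]

if __name__ == "__main__":
    for cfg in STARTING_CONFIGS:
        kib = smem_bytes(*cfg) / 1024
        status = "ok" if fits_smem(*cfg) else "over budget"
        print(f"{cfg}: {kib:.0f} KiB ({status})")
```

By this estimate the two 4-stage configs with a 128×256 or 256×128 tile come to 192 KiB and fit, while the 256×256 config at 4 stages (256 KiB) and both 5-stage configs (240 KiB) exceed the 228 KB figure. Treat the check as a first-pass filter only: the real allocation depends on what the compiler places in shared memory, which this formula does not model.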

Changelog: Source: GitHub https://github.com/dsl-learn/kernel-to-sol

Directory Structure

Current level: .claude/skills/b200-tuning/

SKILL.md
