batch-sweep

Category: Data & AI | Uploader: AMD-AGI | Downloads: 0 | Version: v1.0 (Latest)

Four sweep operations:

(1) Model perf sweep: find the optimal batch size / TGS for a model. Use for: sweep batch size, tune TGS, benchmark throughput, find optimal config.
(2) Node perf sweep: compare per-node GPU performance to find outliers. Use for: check nodes, node performance, find slow node, compare nodes.
(3) Node network health sweep: detect inter-node network issues via multi-node bisection. Use for: network health, IB issues, RCCL problems, node pair testing, isolate network problem.
(4) Model sweep: run all model configs on one or two commits. Use for: regression test, validate commit, test all models, smoke test, CI, compare branches.
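
The bisection idea behind the node network health sweep can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function name `isolate_bad_node` and the probe callable `group_is_healthy` are hypothetical, and the sketch assumes there is exactly one faulty node and that a group test fails whenever the group contains it. A real inter-node test needs at least two nodes, so in practice a single-node half would have to be padded with a node already known to be good.

```python
from typing import Callable, List, Optional, Sequence


def isolate_bad_node(
    nodes: Sequence[str],
    group_is_healthy: Callable[[List[str]], bool],
) -> Optional[str]:
    """Narrow a failing node set down to one suspect by repeated halving.

    `group_is_healthy` is a hypothetical probe supplied by the caller; it
    should return False when a test run on the given nodes shows a problem.
    """
    candidates = list(nodes)
    if not candidates or group_is_healthy(candidates):
        return None  # nothing to isolate

    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        # Under the single-fault assumption, a healthy left half means
        # the fault must be in the right half, and vice versa.
        candidates = right if group_is_healthy(left) else left

    return candidates[0]


# Simulated usage: "node5" is the (made-up) faulty node.
all_nodes = [f"node{i}" for i in range(8)]
simulated_probe = lambda group: "node5" not in group
print(isolate_bad_node(all_nodes, simulated_probe))  # node5
```

With N nodes this needs roughly log2(N) group tests after the initial full-set run, instead of testing every node pair.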

Changelog: Source: GitHub https://github.com/AMD-AGI/maxtext-slurm

Directory Structure

Current level: tree/main/skills/batch-sweep/

  • 📄 SKILL.md 23.4 KB

