Coordinates LiveBench benchmark runs. Reads a pre-built manifest, dispatches one solver per question by sequential index using batch_async, then scores all answers via score-run.sh and produces the benchmark report. Use when running LiveBench evaluations. Do NOT use for MMLU-Pro or general tasks.
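The dispatch step above can be sketched as follows. This is a minimal sketch, not the skill's implementation: the manifest shape and job fields are assumptions, and the actual batch_async call is stood in for by a plain list of jobs keyed by sequential index.

```python
import json

def build_dispatch(manifest):
    """Map each manifest question to one solver job, keyed by its
    sequential index (the ordering the dispatch step runs on)."""
    return [
        {"index": i, "question_id": q["id"]}
        for i, q in enumerate(manifest["questions"])
    ]

# Hypothetical two-question manifest.
manifest = json.loads('{"questions": [{"id": "q1"}, {"id": "q2"}]}')
jobs = build_dispatch(manifest)
print(jobs[0])  # {'index': 0, 'question_id': 'q1'}
```

Scoring then runs once over all answers (score-run.sh in the skill), so the dispatch only needs to preserve the index-to-question mapping.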
Add a new SWE benchmark task from a real GitHub bug-fix. Use when the user provides a GitHub issue or PR URL and wants to add it to the bench-swe pipeline.
Run scalex performance benchmarks using hyperfine against the scala3 repo. Use this skill whenever the user asks to benchmark scalex, measure performance, profile index/query times, compare before/after performance of a change, or mentions "benchmark", "perf", "how fast", "timing", "hyperfine". Also use after implementing performance improvements to verify the gains.
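For the before/after comparison, one way to consume hyperfine's `--export-json` output is to compute a speedup ratio from the mean runtimes. The JSON shape below matches hyperfine's export format as I understand it (a `results` list with a `mean` field per command), but treat the field names as an assumption to verify against your hyperfine version:

```python
import json

def speedup(baseline_json: str, candidate_json: str) -> float:
    """Ratio of baseline mean runtime to candidate mean runtime
    (> 1.0 means the candidate is faster)."""
    base = json.loads(baseline_json)["results"][0]["mean"]
    cand = json.loads(candidate_json)["results"][0]["mean"]
    return base / cand

# Hypothetical exports from two hyperfine runs against scala3.
before = '{"results": [{"mean": 2.0}]}'
after = '{"results": [{"mean": 0.5}]}'
print(speedup(before, after))  # 4.0
```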
Search and analyze LLM benchmark results within a fixed benchmark universe, then produce evidence-based model strength-and-weakness reports or domain-leader summaries. Use when comparing a model across benchmarks, ranking the best models by domain, explaining what a benchmark measures, checking predecessor-vs-current progress, or writing benchmark reports that must foreground the exact model version, evaluation date, benchmark variant, score semantics, sub-scores, and benchmark defect warnings. Works with browser, web, and multimodal extraction for text, table, canvas, or image-only leaderboards.
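A report record that prioritizes those fields might look like the sketch below; every field name and value here is illustrative, not a fixed schema:

```python
# Illustrative finding record; all names and values are hypothetical.
finding = {
    "model": "example-model-v2.1",            # exact model version
    "benchmark": "ExampleBench (variant A)",  # benchmark variant
    "evaluated_on": "2024-06-01",             # evaluation date
    "score": 61.3,
    "score_semantics": "pass@1, higher is better",
    "sub_scores": {"math": 55.0, "code": 70.2},
    "defect_warnings": ["possible training-set contamination"],
}
```

Carrying score semantics and defect warnings alongside the raw number is what keeps cross-benchmark comparisons honest.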
Plan, execute, and analyze LLM serving benchmarks across vLLM and SGLang configurations. Use when user says "run benchmarks", "benchmark config", "QPS sweep", "compare serving configs", "custbench", "benchmark analysis", or "generate benchmark report". Do NOT use for writing Terraform (use terraform-automation), deployment validation (use deployment-orchestrator), or GPU hardware diagnostics (use gpu-infra-troubleshooting).
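The QPS-sweep planning step reduces to enumerating (engine, QPS) combinations. The sketch below assumes two serving engines and a small QPS grid; the config dict fields are placeholders, not the skill's actual config format:

```python
from itertools import product

def sweep_grid(engines, qps_values):
    """One serving-benchmark config per (engine, QPS) pair."""
    return [{"engine": e, "qps": q} for e, q in product(engines, qps_values)]

grid = sweep_grid(["vllm", "sglang"], [1, 4, 16])
print(len(grid))  # 6
```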