dataset-curation

Category: Data & AI | Uploader: fcakyonfcakyon | Downloads: 0 | Version: v1.0 (latest)

Use when the user wants to analyze dataset bias, create stratified samples, evaluate fairness, or plan dataset collection. Triggers on phrases like "dataset bias", "stratified sample", "class imbalance", "data distribution", "fairness analysis", or "ethical review".

---

# Dataset Curation Methodology

You are helping a researcher curate, analyze, or expand a dataset with attention to bias, fairness, and quality.

## Step 1: Distribution Analysis

Before any curation action, understand the current state:

### Per-Class Distribution

- Count instances per class/label/tag
- Compute imbalance ratio (max_count / min_count)
- Identify severely underrepresented classes (< 5% of max class)
- Visualize: bar chart of class frequencies sorted by count

### Co-occurrence Analysis

- Build co-occurrence matrix: which labels appear together
- Identify spurious correlations (e.g., "violence" always co-occurs with "male")
- Check for label leakage between splits

### Metadata Distribution

- Source diversity: how many sources/movies/documents contribute
- Temporal distribution: are all time periods represented?
- Content diversity: genre, style, domain coverage

## Step 2: Bias Assessment
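As a minimal sketch of the Step 1 checks (per-class counts, imbalance ratio, underrepresented classes, and label co-occurrence), the following uses only the Python standard library. The function names, the `rare_fraction` and `min_support` parameters, and the toy data are assumptions made for illustration; they are not part of the skill itself. It assumes a multi-label dataset given as a non-empty list of label lists.

```python
from collections import Counter
from itertools import combinations


def class_distribution(labels_per_item, rare_fraction=0.05):
    """Count items per label, compute max/min imbalance, flag rare labels.

    A label is flagged when its count is below rare_fraction * max_count
    (0.05 mirrors the "< 5% of max class" rule of thumb above).
    """
    counts = Counter(l for labels in labels_per_item for l in set(labels))
    max_count = max(counts.values())
    min_count = min(counts.values())
    rare = sorted(l for l, c in counts.items() if c < rare_fraction * max_count)
    return {
        "counts": counts,
        "imbalance_ratio": max_count / min_count,
        "underrepresented": rare,
    }


def cooccurrence(labels_per_item):
    """Count how often each unordered pair of labels appears on the same item."""
    pairs = Counter()
    for labels in labels_per_item:
        for a, b in combinations(sorted(set(labels)), 2):
            pairs[(a, b)] += 1
    return pairs


def implied_pairs(labels_per_item, min_support=2):
    """Return (src, dst) pairs where every item tagged src is also tagged dst.

    These are candidates for spurious correlation, not proof of it;
    min_support skips labels too rare to say anything about.
    """
    counts = Counter(l for labels in labels_per_item for l in set(labels))
    flagged = []
    for (a, b), n in cooccurrence(labels_per_item).items():
        for src, dst in ((a, b), (b, a)):
            if counts[src] >= min_support and n == counts[src]:
                flagged.append((src, dst))
    return sorted(flagged)


# Hypothetical toy data: one list of tags per item.
items = [
    ["violence", "male"],
    ["violence", "male"],
    ["dialogue", "female"],
    ["dialogue", "male"],
    ["dialogue"],
]

report = class_distribution(items, rare_fraction=0.5)
print(report["imbalance_ratio"], report["underrepresented"])
print(implied_pairs(items))
```

On real data, the flagged pairs are a starting point for manual review: a pair that always co-occurs may reflect a genuine property of the domain or an artifact of how the data was collected.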

Changelog: Source: GitHub https://github.com/fcakyon/phd-skills

Directory structure

Current level: plugin/skills/dataset-curation/

SKILL.md
