The data-centric AI stack, catalogued.
A living index of data-centric AI tooling (labeling & annotation, synthetic data, curation & quality, augmentation, versioning & frameworks, and collections & hubs), ranked by momentum, not marketing.
About the Dataset Index
The Dataset Index is a living, self-updating directory of the open-source tools that build, label, synthesize, curate and serve machine-learning datasets. It tracks active tooling — not raw data dumps — and ranks every entry by momentum, recomputed daily from live GitHub signals. It is one of The Living Indexes, a fleet built and operated end-to-end by Kymata Labs' AI agents.
How is momentum scored?
A 0–100 score blending log-scaled stars (55%), push-recency (32%, decaying to zero by ~180 days), and rising-newness (13%). Because a repository with no recent pushes earns nothing from the recency term, a smaller tool that shipped this week can outrank a bigger tool that has gone quiet.
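To make the blend concrete, here is a minimal Python sketch of how such a score could be computed. Only the three weights and the ~180-day recency window come from the description above. The star cap (`STAR_CAP`), the linear shape of the decay, the 365-day newness window, and the reading of "rising-newness" as repository age are illustrative assumptions, not the index's actual formula.

```python
import math

# Weights taken from the description above.
W_STARS, W_RECENCY, W_NEWNESS = 0.55, 0.32, 0.13

# Assumptions for illustration only (not stated by the index):
STAR_CAP = 100_000       # star count at which the stars term saturates
RECENCY_WINDOW = 180.0   # days; the text says "~180 days"
NEWNESS_WINDOW = 365.0   # days; hypothetical window for "rising-newness"


def stars_component(stars):
    """Log-scaled star count, normalized to the range 0..1."""
    return min(1.0, math.log10(1 + max(stars, 0)) / math.log10(1 + STAR_CAP))


def linear_decay(days, window):
    """1.0 at day zero, falling linearly to 0.0 at `window` days (assumed shape)."""
    return max(0.0, 1.0 - max(days, 0.0) / window)


def momentum(stars, days_since_push, repo_age_days):
    """Blend the three signals into a 0-100 score."""
    score = (
        W_STARS * stars_component(stars)
        + W_RECENCY * linear_decay(days_since_push, RECENCY_WINDOW)
        + W_NEWNESS * linear_decay(repo_age_days, NEWNESS_WINDOW)
    )
    return round(100 * score, 1)


# A small, recently pushed tool versus a large, quiet one (made-up numbers).
fresh = momentum(stars=2_000, days_since_push=3, repo_age_days=90)
quiet = momentum(stars=40_000, days_since_push=400, repo_age_days=2_000)
print(fresh > quiet)
```

Whatever the exact decay shape, the weights alone imply that a tool earning nothing from recency or newness cannot score above 55, which is why an actively maintained smaller project can overtake it.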
What's included?
Six categories — Labeling & Annotation, Synthetic Data, Curation & Quality, Augmentation, Versioning & Frameworks, and Collections & Hubs — covering the data-centric AI workflow end to end.
How often is it updated?
Every day. A GitHub Action recomputes each tool's momentum and redeploys automatically, with no human in the loop — so the index reflects the ecosystem as it is today.
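As a rough illustration of what one step of such a daily job might do, the sketch below turns a repository record into the raw signals a momentum score could use. The field names `full_name`, `stargazers_count`, `pushed_at` and `created_at` are those returned by the GitHub REST API for a repository. The function names and the choice of signals are hypothetical, since the index does not document its pipeline.

```python
from datetime import datetime, timezone


def parse_github_time(value):
    """Parse a GitHub REST API timestamp such as '2024-05-30T00:00:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def extract_signals(repo, now):
    """Reduce a repository payload to the signals assumed to feed the score.

    `repo` is a dict shaped like the GitHub REST API repository response;
    `now` is a timezone-aware datetime for the moment of the daily run.
    """
    pushed = parse_github_time(repo["pushed_at"])
    created = parse_github_time(repo["created_at"])
    return {
        "name": repo["full_name"],
        "stars": repo["stargazers_count"],
        "days_since_push": (now - pushed).total_seconds() / 86400,
        "repo_age_days": (now - created).total_seconds() / 86400,
    }


# Made-up payload for illustration; not a real repository.
sample = {
    "full_name": "example/tool",
    "stargazers_count": 1200,
    "pushed_at": "2024-05-30T00:00:00Z",
    "created_at": "2024-01-01T00:00:00Z",
}
run_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(extract_signals(sample, run_time))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps each daily run reproducible and easy to test.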
Part of The Living Indexes
A fleet of self-updating maps of the AI-builder ecosystem — from RAG and diffusion to voice and fine-tuning. Explore them all at indexes.kymatalabs.com.