Question 1

What is the Dataset Index?

Accepted Answer

The Dataset Index is a living, self-updating directory of hundreds of open-source data-centric AI tools — for data labeling and annotation, synthetic data generation, curation and quality, augmentation, and dataset frameworks. Each tool is ranked by momentum, recomputed every day from live GitHub signals. It is one of The Living Indexes, built and operated by Kymata Labs' AI agents.

Question 2

How is momentum scored?

Accepted Answer

Momentum is a 0 to 100 score that blends log-scaled GitHub stars (55%), push-recency (32%, full credit if pushed today, decaying to zero by about 180 days), and rising-newness (13%, a bonus for young repositories gaining stars fast). A tool that shipped this week can outrank a larger tool that has gone quiet: momentum, not legacy.
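To make the weighting concrete, here is a minimal Python sketch of how such a blend could be computed. Only the three weights (55/32/13) and the roughly 180-day recency window come from the answer above. The star cap, the linear decay shape, the definition of a "young" repository, and the growth rate that earns the full bonus are all illustrative assumptions, and the function names are hypothetical; the index's actual formula may differ.

```python
import math

# Weights stated in the answer above.
W_STARS, W_RECENCY, W_NEWNESS = 0.55, 0.32, 0.13

# Taken from the answer above ("about 180 days").
RECENCY_WINDOW_DAYS = 180

# Everything below is an illustrative assumption, not the index's real value.
STAR_CAP = 100_000         # stars at which the star component saturates
YOUNG_REPO_DAYS = 365      # older repositories get no newness bonus
FAST_STARS_PER_DAY = 10.0  # growth rate that earns the full newness bonus


def star_component(stars: int) -> float:
    """Log-scaled stars mapped to 0..1, saturating at STAR_CAP."""
    return min(1.0, math.log10(stars + 1) / math.log10(STAR_CAP + 1))


def recency_component(days_since_push: int) -> float:
    """1.0 if pushed today, falling linearly (assumed shape) to 0 at the window."""
    return max(0.0, min(1.0, 1.0 - days_since_push / RECENCY_WINDOW_DAYS))


def newness_component(stars: int, age_days: int) -> float:
    """Bonus for young repositories, scaled by average stars gained per day."""
    if age_days >= YOUNG_REPO_DAYS:
        return 0.0
    stars_per_day = stars / max(age_days, 1)
    return min(1.0, stars_per_day / FAST_STARS_PER_DAY)


def momentum(stars: int, days_since_push: int, age_days: int) -> float:
    """Blend the three components into a 0..100 score."""
    blended = (
        W_STARS * star_component(stars)
        + W_RECENCY * recency_component(days_since_push)
        + W_NEWNESS * newness_component(stars, age_days)
    )
    return round(100 * blended, 1)


if __name__ == "__main__":
    # A young, recently pushed tool versus a larger tool that has gone quiet.
    active = momentum(stars=2_000, days_since_push=2, age_days=90)
    quiet = momentum(stars=20_000, days_since_push=200, age_days=2_000)
    print(f"active: {active}, quiet: {quiet}")
```

Under these assumed constants the recently pushed tool scores higher than the quiet one even though it has a tenth of the stars, which is the behavior the answer describes.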

Question 3

What categories of data tooling are included?

Accepted Answer

Six categories: Labeling & Annotation, Synthetic Data, Curation & Quality, Augmentation, Versioning & Frameworks, and Collections & Hubs. The index covers active tools used to build, label, generate, clean, and serve machine-learning datasets, not raw data dumps.

Question 4

How often is the Dataset Index updated?

Accepted Answer

Every day. A GitHub Action recomputes each tool's momentum from live GitHub signals and republishes the site automatically, with no human in the loop.

The data-centric AI stack, catalogued.

About the Dataset Index

Part of The Living Indexes