# The Dataset Index

> The living index of data-centric AI tooling — data labeling & annotation, synthetic
> data, curation & quality, augmentation, and dataset frameworks — ranked daily by GitHub momentum.

Updated: 2026-06-13T11:32:07.713997+00:00

Tools indexed: 249

## Top data-centric AI tools by momentum

- [HumanSignal/label-studio](https://github.com/HumanSignal/label-studio) — momentum 87, ⭐27590 — Labeling & Annotation — Label Studio is a multi-type data labeling and annotation tool with standardized output format
- [dolthub/dolt](https://github.com/dolthub/dolt) — momentum 86, ⭐23412 — Versioning & Frameworks — Dolt – Git for Data
- [huggingface/datasets](https://github.com/huggingface/datasets) — momentum 86, ⭐21619 — Collections & Hubs — 🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data m
- [joke2k/faker](https://github.com/joke2k/faker) — momentum 85, ⭐19272 — Synthetic Data — Faker is a Python package that generates fake data for you.
- [stefan-jansen/machine-learning-for-trading](https://github.com/stefan-jansen/machine-learning-for-trading) — momentum 85, ⭐19113 — Synthetic Data — Code for Machine Learning for Algorithmic Trading, 2nd edition.
- [cvat-ai/cvat](https://github.com/cvat-ai/cvat) — momentum 84, ⭐16060 — Labeling & Annotation — Computer Vision Annotation Tool (CVAT) is a leading platform for building high-quality visual datase
- [treeverse/dvc](https://github.com/treeverse/dvc) — momentum 83, ⭐15675 — Versioning & Frameworks — 🦉 Data Versioning and ML Experiments
- [simonw/datasette](https://github.com/simonw/datasette) — momentum 82, ⭐11185 — Versioning & Frameworks — An open source multi-tool for exploring and publishing data
- [voxel51/fiftyone](https://github.com/voxel51/fiftyone) — momentum 82, ⭐10776 — Versioning & Frameworks — Refine high-quality datasets and visual AI models
- [datajuicer/data-juicer](https://github.com/datajuicer/data-juicer) — momentum 79, ⭐6536 — Synthetic Data — Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
- [snorkel-team/snorkel](https://github.com/snorkel-team/snorkel) — momentum 78, ⭐5975 — Labeling & Annotation — A system for quickly generating training data with weak supervision
- [NVIDIA/DALI](https://github.com/NVIDIA/DALI) — momentum 78, ⭐5708 — Augmentation — A GPU-accelerated library containing highly optimized building blocks and an execution engine for da
- [treeverse/lakeFS](https://github.com/treeverse/lakeFS) — momentum 78, ⭐5399 — Versioning & Frameworks — lakeFS - Data version control for your data lake | Git for data
- [argilla-io/argilla](https://github.com/argilla-io/argilla) — momentum 77, ⭐4998 — Labeling & Annotation — Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
- [torchgeo/torchgeo](https://github.com/torchgeo/torchgeo) — momentum 77, ⭐4066 — Versioning & Frameworks — TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
- [ConardLi/easy-dataset](https://github.com/ConardLi/easy-dataset) — momentum 76, ⭐14450 — Versioning & Frameworks — A powerful tool for creating datasets for LLM fine-tuning, RAG and Eval
- [tensorflow/datasets](https://github.com/tensorflow/datasets) — momentum 76, ⭐4570 — Collections & Hubs — TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
- [OpenCSGs/csghub](https://github.com/OpenCSGs/csghub) — momentum 76, ⭐4173 — Versioning & Frameworks — CSGHub is a brand-new open-source platform for managing LLMs, developed by the OpenCSG team. It offe
- [sdv-dev/SDV](https://github.com/sdv-dev/SDV) — momentum 76, ⭐3502 — Synthetic Data — Synthetic data generation for tabular data
- [argilla-io/distilabel](https://github.com/argilla-io/distilabel) — momentum 75, ⭐3250 — Synthetic Data — Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable a
- [logpai/loghub](https://github.com/logpai/loghub) — momentum 75, ⭐2740 — Collections & Hubs — A large collection of system log datasets for AI-driven log analytics [ISSRE'23]
- [unsplash/datasets](https://github.com/unsplash/datasets) — momentum 74, ⭐2728 — Versioning & Frameworks — 🎁 7,400,000+ Unsplash images made available for research and machine learning
- [TorchIO-project/torchio](https://github.com/TorchIO-project/torchio) — momentum 74, ⭐2408 — Augmentation — Medical imaging processing for AI applications.
- [cuevhv/mamma](https://github.com/cuevhv/mamma) — momentum 74, ⭐529 — Synthetic Data — Official code for MAMMA: Markerless Accurate Multi-person Motion Acquisition.
- [synthetichealth/synthea](https://github.com/synthetichealth/synthea) — momentum 73, ⭐3185 — Synthetic Data — Synthetic Patient Population Simulator
- [imaNNeo/fl_chart](https://github.com/imaNNeo/fl_chart) — momentum 72, ⭐7531 — Versioning & Frameworks — FL Chart is a highly customizable Flutter chart library that supports Line Chart, Bar Chart, Pie Cha
- [diffgram/diffgram](https://github.com/diffgram/diffgram) — momentum 72, ⭐1906 — Labeling & Annotation — The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human
- [bespokelabsai/curator](https://github.com/bespokelabsai/curator) — momentum 72, ⭐1686 — Synthetic Data — Synthetic data curation for post-training and structured data extraction
- [intellicia-public/parastore](https://github.com/intellicia-public/parastore) — momentum 72, ⭐607 — Synthetic Data — Draw a store, generate LLM personas, and watch them shop — an isometric 3D sandbox for synthetic-con
- [doccano/doccano](https://github.com/doccano/doccano) — momentum 71, ⭐10673 — Labeling & Annotation — Open source annotation tool for machine learning practitioners.
- [hitsz-ids/synthetic-data-generator](https://github.com/hitsz-ids/synthetic-data-generator) — momentum 71, ⭐2422 — Synthetic Data — SDG is a specialized framework designed to generate high-quality structured tabular data.
- [sdv-dev/CTGAN](https://github.com/sdv-dev/CTGAN) — momentum 71, ⭐1560 — Synthetic Data — Conditional GAN for generating synthetic tabular data.
- [quiltdata/quilt](https://github.com/quiltdata/quilt) — momentum 71, ⭐1366 — Versioning & Frameworks — Quilt is a Scientific Data Management Platform on AWS that helps teams and AI find, trust, and reuse
- [Renumics/spotlight](https://github.com/Renumics/spotlight) — momentum 70, ⭐1257 — Versioning & Frameworks — Interactively explore unstructured datasets from your dataframe.
- [python-adaptive/adaptive](https://github.com/python-adaptive/adaptive) — momentum 70, ⭐1221 — Labeling & Annotation — 📈 Adaptive: parallel active learning of mathematical functions
- [GreenmaskIO/greenmask](https://github.com/GreenmaskIO/greenmask) — momentum 69, ⭐1692 — Synthetic Data — Database anonymization and test data management
- [huggingface/aisheets](https://github.com/huggingface/aisheets) — momentum 69, ⭐1636 — Synthetic Data — Build, enrich, and transform datasets using AI models with no code
- [Somnusochi/VLM-AutoYOLO](https://github.com/Somnusochi/VLM-AutoYOLO) — momentum 69, ⭐118 — Labeling & Annotation — AI Auto Annotation & YOLO Training Pipeline, End-to-end object detection auto-labeling and YOLO trai
- [expressive-code/expressive-code](https://github.com/expressive-code/expressive-code) — momentum 68, ⭐936 — Labeling & Annotation — A text marking & annotation engine for presenting source code on the web.
- [asreview/asreview](https://github.com/asreview/asreview) — momentum 68, ⭐930 — Labeling & Annotation — Active learning for systematic reviews