cabinet-data_curation
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Paper
• 2507.01352
• Published
• 56
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Paper
• 2507.13563
• Published
• 53
Scaling Laws for Optimal Data Mixtures
Paper
• 2507.09404
• Published
• 37
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper
• 2511.14993
• Published
• 231
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper
• 2512.16676
• Published
• 219
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Paper
• 2512.04324
• Published
• 155
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
• 2512.14051
• Published
• 46
DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Paper
• 2511.06307
• Published
• 53
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Paper
• 2511.18050
• Published
• 38
FineVision: Open Data Is All You Need
Paper
• 2510.17269
• Published
• 75
A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Paper
• 2510.23587
• Published
• 67
RAG-Anything: All-in-One RAG Framework
Paper
• 2510.12323
• Published
• 67
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Paper
• 2508.21148
• Published
• 140
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Paper
• 2509.18154
• Published
• 55
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Paper
• 2508.01191
• Published
• 238
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Paper
• 2508.10975
• Published
• 60
Alchemist: Turning Public Text-to-Image Data into Generative Gold
Paper
• 2505.19297
• Published
• 84
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Paper
• 2509.09680
• Published
• 43
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Paper
• 2512.16905
• Published
• 32
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Paper
• 2511.12609
• Published
• 105
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
Paper
• 2511.16043
• Published
• 109
Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models
Paper
• 2512.00590
• Published
• 49
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
Paper
• 2510.16872
• Published
• 109
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Paper
• 2509.12201
• Published
• 106
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Paper
• 2503.18406
• Published
• 3