cabinet-data_curation - a mangoxb Collection

updated Jan 16

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Paper • 2507.01352 • Published Jul 2, 2025 • 56
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

Paper • 2507.13563 • Published Jul 17, 2025 • 53
Scaling Laws for Optimal Data Mixtures

Paper • 2507.09404 • Published Jul 12, 2025 • 37
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Paper • 2511.14993 • Published Nov 19, 2025 • 231
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Paper • 2512.16676 • Published Dec 18, 2025 • 219
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Paper • 2512.04324 • Published Dec 3, 2025 • 155
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

Paper • 2512.14051 • Published Dec 16, 2025 • 46
DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Paper • 2511.06307 • Published Nov 9, 2025 • 53
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Paper • 2511.18050 • Published Nov 22, 2025 • 38
FineVision: Open Data Is All You Need

Paper • 2510.17269 • Published Oct 20, 2025 • 75
A Survey of Data Agents: Emerging Paradigm or Overstated Hype?

Paper • 2510.23587 • Published Oct 27, 2025 • 67
RAG-Anything: All-in-One RAG Framework

Paper • 2510.12323 • Published Oct 14, 2025 • 67
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Paper • 2508.21148 • Published Aug 28, 2025 • 140
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Paper • 2509.18154 • Published Sep 16, 2025 • 55
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Paper • 2508.01191 • Published Aug 2, 2025 • 238
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Paper • 2508.10975 • Published Aug 14, 2025 • 60
Alchemist: Turning Public Text-to-Image Data into Generative Gold

Paper • 2505.19297 • Published May 25, 2025 • 84
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Paper • 2509.09680 • Published Sep 11, 2025 • 43
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Paper • 2512.16905 • Published Dec 18, 2025 • 32
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Paper • 2511.12609 • Published Nov 16, 2025 • 105
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Paper • 2511.16043 • Published Nov 20, 2025 • 109
Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models

Paper • 2512.00590 • Published Nov 29, 2025 • 49
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

Paper • 2510.16872 • Published Oct 19, 2025 • 109
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Paper • 2509.12201 • Published Sep 15, 2025 • 106
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

Paper • 2503.18406 • Published Mar 24, 2025 • 3