FineVision: Open Data Is All You Need

huggingface.co/datasets/HuggingFaceM4/FineVision

Today, we release FineVision, a new multimodal dataset with 24 million samples. We created FineVision by collecting over 200 datasets containing 17M images, 89M question-answer turns, and 10B answer tokens, totaling 5TB of high-quality data. Additionally, we extensively processed all datasets to unify their format, clean them of duplicates and poor data, and rated all turns using 32B VLMs across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.

To enable everyone to construct state-of-the-art open Vision-Language Models (VLMs), we ran extensive ablations on FineVision, and compared it to publicly available alternatives. Models trained on FineVision lead in performance over 11 common benchmarks compared against every baseline, thanks to FineVision’s scale and diversity of data.

To use the dataset, simply load it with:

python

from datasets import load_dataset, get_dataset_config_names

# Get all subset names and load the first one
available_subsets = get_dataset_config_names('HuggingFaceM4/FineVision')
ds = load_dataset(
  'HuggingFaceM4/FineVision',
  name=available_subsets[0],
  split='train', streaming=True,
)

# Inspect the first sample
ds[0]

Why this dataset?

Even though open-weight Vision-Language Models are becoming ever more powerful, the accessibility of the training data used for these models is lagging behind. This data is often proprietary and inaccessible to the broader community. Projects like The Cauldron, LLaVa, and Cambrian aim to provide such datasets, but get quickly outpaced by the speed of the field and the emergence of novel applications for VLMs, like agentic tasks. For FineVision we set out to combine and unify existing available data sources to create a large and high-quality dataset. As a first step we need to collect and standardize the datasets.

How did we build FineVision?

FineVision was a giant act of data curation. We started by collecting publicly available datasets, and augmenting underrepresented categories. We then evaluated all datasets for duplicated data internally and benchmark contamination. This data is then cleaned and rated, before being added to the final mixture.

Data Collection

We manually collected over 200 image-text datasets from various publicly available sources and processed them to unify their formatting. On top of that, some datasets are not presented in chat form, so we converted them into question-answer pairs. In some cases, this goes as far as synthetically creating questions for all samples. Finally, we adressed underrepresented domains, such as GUI-oriented data. To fill this gap, we create and add a new dataset that was compiled from existing GUI datasets, after applying chat normalization and unifying the action space to convert their specific formats into a more general GUI action space.

FineVision Subsets - click to see more

Subset Name	Total Images	Total Samples	Total Turns	Total Question Tokens	Total Answer Tokens	Category	Source
coco_colors	118,287	118,287	118,287	1,301,157	6,376,672	Captioning & Knowledge	(1)
densefusion_1m	1,058,751	1,058,751	1,058,751	10,692,478	263,718,217	Captioning & Knowledge	(2) (Converted)
face_emotion	797	797	797	8,767	8,066	Captioning & Knowledge	(3)
google_landmarks	299,993	299,993	842,127	6,194,978	10,202,980	Captioning & Knowledge	(4) (Converted)
image_textualization(filtered)	99,573	99,573	99,573	917,577	19,374,090	Captioning & Knowledge	(5)
laion_gpt4v	9,301	9,301	9,301	93,950	1,875,283	Captioning & Knowledge	(6)
localized_narratives	199,998	199,998	199,998	2,167,179	8,021,473	Captioning & Knowledge	(7)
sharegpt4o	57,284	57,284	57,284	558,647	36,555,323	Captioning & Knowledge	(8)
sharegpt4v(coco)	50,017	50,017	50,017	460,893	9,825,387	Captioning & Knowledge	(9)
sharegpt4v(knowledge)	1,988	1,988	1,988	18,250	293,850	Captioning & Knowledge	(9)
sharegpt4v(llava)	29,986	29,986	29,986	275,783	6,175,899	Captioning & Knowledge	(9)
sharegpt4v(sam)	8,990	8,990	8,990	82,874	1,668,797	Captioning & Knowledge	(9)
textcaps	21,906	21,906	21,906	240,966	355,991	Captioning & Knowledge	(10)
chart2text	26,961	26,961	30,215	342,215	2,670,580	Chart & Table	(11)
chartqa	18,265	18,265	28,287	625,569	134,793	Chart & Table	(12)
CoSyn_400k_chart	116,814	116,814	1,085,882	17,617,591	57,641,030	Chart & Table	(13)
CoSyn_400k_table	46,518	46,518	416,519	6,280,455	23,335,054	Chart & Table	(13)
dvqa	200,000	200,000	2,325,316	44,603,372	5,477,966	Chart & Table	(14)
figureqa	100,000	100,000	1,327,368	18,515,153	2,654,736	Chart & Table	(15)
figureqa(mathv360k)	17,587	17,587	17,587	722,959	97,404	Chart & Table	(16)
finqa	5,276	5,276	6,251	5,552,943	224,015	Chart & Table	(17)
hitab	2,500	2,500	7,782	177,999	335,013	Chart & Table	(18)
lrv_chart	1,776	1,776	5,372	76,477	158,711	Chart & Table	(19)
mmc_instruct	168,178	168,178	168,178	50,008,824	74,581,055	Chart & Table	(20)
multihiertt	30,875	7,619	7,830	218,840	244,744	Chart & Table	(21)
plotqa	157,070	157,070	20,249,479	738,371,054	118,122,387	Chart & Table	(22)
robut_sqa	8,514	8,514	34,141	368,957	1,794,570	Chart & Table	(23)
robut_wikisql	74,989	74,989	86,202	1,454,920	9,276,100	Chart & Table	(23)
robut_wtq	38,246	38,246	44,096	587,040	6,415,830	Chart & Table	(23)
SynthChartNet	500,000	500,000	500,000	2,169,240	67,392,223	Chart & Table	(24)
tabmwp	22,722	22,722	23,021	639,639	1,883,243	Chart & Table	(25)
tabmwp(mathv360k)	22,452	22,452	22,452	963,498	158,042	Chart & Table	(16)
tat_dqa	2,448	2,207	13,251	320,356	1,177,852	Chart & Table	(26)
tat_qa	2,199	2,199	13,215	989,419	254,790	Chart & Table	(27)
Unichart	611,925	611,925	6,898,324	96,702,288	211,989,247	Chart & Table	(28) (Converted)
vistext	9,969	9,969	9,969	88,770	1,191,127	Chart & Table	(29)
vqaonbd	39,986	39,986	1,254,165	36,066,807	5,620,523	Chart & Table	(30)
alfworldgpt	45,073	45,073	45,073	17,864,033	6,276,573	General VQA	(31)
allava_laion	468,664	468,664	937,328	18,654,303	145,799,426	General VQA	(32)
allava_vflan	177,078	177,078	387,872	12,444,711	55,305,642	General VQA	(32)
cambrian(filtered)_processed	83,123	83,124	98,534	1,410,321	5,503,211	General VQA	(33)
chinesememe	54,212	54,212	54,212	538,938	21,122,723	General VQA	(34)
cocoqa	46,287	46,287	78,736	1,136,238	212,480	General VQA	(35)
CoSyn_400k_graphic	26,968	26,968	26,968	1,678,862	8,235,679	General VQA	(13)
datik	220,537	222,385	222,385	2,234,054	187,757,952	General VQA	(36)
datikz	47,441	47,974	48,296	441,040	59,116,193	General VQA	(36)
drivelm	90,049	4,072	161,030	2,399,362	1,431,417	General VQA	(37)
hateful_memes	8,500	8,500	8,500	128,375	17,000	General VQA	(38)
iconqa	27,307	27,307	29,841	906,877	72,492	General VQA	(39)
iconqa(mathv360k)	22,589	22,589	22,589	952,183	134,029	General VQA	(16)
idk	11,123	11,123	27,614	235,262	665,247	General VQA	(40)
indoor_qa	3,350	3,350	3,350	36,832	19,700	General VQA
LLaVA_Instruct_150K	157,710	157,710	361,405	4,412,600	28,719,278	General VQA	(41)
llavar_gpt4_20k	19,790	19,790	43,167	546,703	1,516,730	General VQA	(42)
lnqa	302,780	302,780	1,520,942	16,530,323	19,027,663	General VQA	(43)
lrv_normal(filtered)	10,489	10,489	155,269	2,108,321	3,134,247	General VQA	(44)
lvis_instruct4v	222,711	222,711	1,050,622	12,556,173	43,726,782	General VQA	(45)
mimic_cgd	141,878	70,939	141,869	1,789,740	4,304,380	General VQA	(46)
mmevol	160,215	160,215	630,441	16,203,127	50,445,237	General VQA	(47)
mmra	2,048	1,024	1,024	72,523	25,764	General VQA	(48)
nlvr2	100,852	50,426	86,373	4,629,641	172,746	General VQA	(49)
sketchyvqa	8,000	8,000	8,000	182,192	8,000	General VQA	(50)
spark	3,904	3,904	6,248	65,982	73,973	General VQA	(51)
spatialsense	10,440	10,440	17,498	200,963	418,883	General VQA	(52)
spot_the_diff	17,132	8,566	9,524	82,670	209,630	General VQA	(53)
vision_flan(filtered)	175,964	175,964	175,964	9,983,758	3,009,891	General VQA	(54)
visual7w	14,366	14,366	69,817	3,054,334	209,451	General VQA	(55)
vizwiz(mathv360k)	6,604	6,604	6,604	197,143	44,876	General VQA	(56)
vqav2	82,772	82,772	443,757	5,722,488	1,100,837	General VQA	(57)
vsr	2,157	2,157	3,354	79,596	6,708	General VQA	(58)
websight	10,000	10,000	10,000	113,114	5,237,381	General VQA	(59)
wildvision	333	333	405	50,161	72,820	General VQA	(60)
yesbut	4,318	4,318	4,318	38,365	157,229	General VQA	(61)
aguvis-stage-1	458,957	458,957	3,831,666	36,151,272	93,546,182	Grounding & Counting	(62) (Converted)
groundui	13,531	13,531	18,016	200,094	883,274	Grounding & Counting	(63)
objects365_qa	1,742,287	1,742,287	12,329,259	135,681,680	2,146,619,635	Grounding & Counting	(64) (Converted)
oodvqa	8,488	8,488	8,488	227,028	8,488	Grounding & Counting	(50)
tallyqa	98,680	98,680	183,986	2,674,306	370,282	Grounding & Counting	(65)
clevr	70,000	70,000	699,989	19,277,813	1,570,525	Mathematics	(66)
clevr_math	70,000	70,000	556,082	7,888,064	580,324	Mathematics	(16)
clevr_math(mathv360k)	5,280	5,280	5,280	174,879	27,536	Mathematics	(16)
CoSyn_400k_math	66,714	66,714	66,714	500,554	28,631,388	Mathematics	(13)
geo170k(align)	35,297	35,297	35,297	336,151	1,866,019	Mathematics	(67)
geo170k(qa)	12,101	12,101	12,101	1,254,831	1,115,242	Mathematics	(67)
geo3k	2,091	2,091	2,091	130,287	2,091	Mathematics	(68)
geometry3k(mathv360k)	9,724	9,724	9,724	541,908	69,075	Mathematics	(16)
geomverse	9,303	9,303	9,339	662,756	2,454,014	Mathematics	(69)
geoqa+(mathv360k)	17,162	17,162	17,162	1,449,094	117,740	Mathematics	(70)
geos(mathv360k)	498	498	498	32,394	3,509	Mathematics	(71)
intergps	1,280	1,280	1,760	97,799	5,280	Mathematics	(72)
mavis_math_metagen	87,348	87,348	87,348	6,668,920	5,486,485	Mathematics	(73)
mavis_math_rule_geo	99,986	99,986	99,986	8,211,079	12,535,251	Mathematics	(73)
raven	63,081	42,000	42,000	584,843	63,081	Mathematics	(74)
super_clevr(mathv360k)	8,642	8,642	8,642	307,438	44,129	Mathematics	(16)
unigeo(mathv360k)	11,949	11,949	11,949	1,011,069	81,781	Mathematics	(16)
art	5,603	5,603	5,603	56,573	283,138	Naive OCR	(75)
captcha	113,062	113,062	113,062	1,469,548	466,856	Naive OCR
chrome_writting	8,825	8,825	8,825	150,025	172,940	Naive OCR	(76)
cocotext	16,169	16,169	16,169	143,818	177,111	Naive OCR	(77)
ctw	24,290	24,290	180,621	9,787,485	1,653,254	Naive OCR	(78)
funsd	194	194	3,879	16,856	29,996	Naive OCR	(79)
hme100k	74,492	74,492	74,492	1,117,380	1,757,743	Naive OCR	(80)
hw_squad	20,457	20,457	83,682	1,071,534	388,518	Naive OCR	(81)
iam	5,663	5,663	5,663	45,582	130,794	Naive OCR	(82)
iiit5k	1,990	1,990	1,990	35,820	4,259	Naive OCR	(83)
imgur5k	5,934	5,934	5,934	89,010	288,054	Naive OCR	(84)
k12_printing	256,636	256,636	256,636	14,114,980	7,465,001	Naive OCR	(19)
latex_handwritten	39,583	39,583	39,583	390,343	1,874,733	Naive OCR	(85)
latexformulas	552,340	552,340	552,340	5,138,603	43,094,747	Naive OCR	(86)
maptext	200	200	799	9,434	70,813	Naive OCR	(87)
mathwriting-google	300,000	300,000	300,000	2,461,270	5,954,806	Naive OCR	(88) (Converted)
memotion	6,991	6,991	6,991	194,718	177,429	Naive OCR	(89)
orand_car_a	1,999	1,999	1,999	43,978	9,035	Naive OCR	(90)
rendered_text	10,000	10,000	10,000	85,879	244,183	Naive OCR	(91)
sroie	33,616	33,616	33,616	605,088	243,240	Naive OCR	(92)
svrd	4,396	4,396	4,396	65,400	834,514	Naive OCR	(93)
SynthCodeNet	499,983	499,983	499,983	2,000,683	253,422,136	Naive OCR	(24)
synthdog	500,000	500,000	500,000	8,849,848	48,010,145	Naive OCR	(94)
SynthFormulaNet	499,997	499,997	499,997	1,999,631	51,215,097	Naive OCR	(24)
tal_ocr_eng	256,646	256,646	256,646	3,385,012	7,465,207	Naive OCR	(95)
wordart	19,066	4,804	4,804	78,032	54,263	Naive OCR	(96)
olmOCR-mix-0225-documents	228,864	228,864	228,858	2,197,147	163,194,337	Naive OCR	(97) (Converted)
olmOCR-mix-0225-books	15,194	15,194	15,194	145,750	7,962,779	Naive OCR	(97) (Converted)
a_okvqa	54,602	54,602	54,602	1,065,188	360,990	OCR QA	(98)
aokvqa	16,539	16,539	17,056	743,458	218,917	OCR QA	(98)
arxivqa	100,000	100,000	100,000	7,022,001	6,422,269	OCR QA	(99)
bentham	10,843	10,843	10,843	103,042	124,459	OCR QA	(81)
blockdiagramcomputerized	502	502	502	5,067	34,453	OCR QA	(100)
blockdiagramhandwritten	1,029	1,029	1,029	11,444	75,598	OCR QA	(100)
CoSyn_400k_diagram	34,963	34,963	300,357	3,356,844	11,943,321	OCR QA	(13)
CoSyn_400k_document	71,282	71,282	605,173	6,216,517	16,095,526	OCR QA	(13)
CoSyn_400k_music	11,969	11,969	81,786	792,129	3,175,586	OCR QA	(13)
CoSyn_400k_nutrition	6,931	6,931	112,097	1,642,936	3,687,254	OCR QA	(13)
diagram_image_to_text	300	300	300	3,631	20,723	OCR QA	(101)
DoclingMatix	2,465,202	1,270,911	10,626,898	162,581,660	2,996,338,775	OCR QA	(24)
docvqa	10,189	10,189	39,463	724,814	275,510	OCR QA	(102)
est_vqa	19,358	19,358	19,358	286,343	143,270	OCR QA	(103)
handwriting_forms	1,400	1,400	1,400	81,200	41,490	OCR QA	(104)
infographic_vqa	1,982	4,394	23,717	392,456	86,951	OCR QA	(105)
infographic_vqa_llava_format	4,394	2,113	10,054	174,352	43,912	OCR QA	(105)
infographic(gpt4v)	2,113	1,982	1,982	275,498	1,044,183	OCR QA	(105)
invoices_receipts	3,013	3,013	3,013	36,745	771,948	OCR QA	(106)
mapqa	37,417	37,417	483,416	8,454,722	5,657,339	OCR QA	(107)
mapqa(mathv360k)	5,225	5,225	5,225	168,390	44,560	OCR QA	(16)
mmsoc_memotion	6,991	6,991	6,991	188,505	421,250	OCR QA	(108)
ocrvqa	165,746	165,746	801,579	12,217,564	4,801,833	OCR QA	(109)
pdfvqa	8,593	8,593	95,000	1,272,618	939,948	OCR QA	(110)
screen2words	15,730	15,730	15,743	133,014	120,781	OCR QA	(111)
screenqa	80,761	80,761	80,761	940,729	826,795	OCR QA	(112)
slidevqa	11,868	1,919	10,617	333,065	156,036	OCR QA	(113)
st_vqa	17,247	17,247	23,121	338,837	98,892	OCR QA	(114)
sujet_finance	9,801	9,801	107,050	1,395,624	1,925,361	OCR QA	(115)
textocr(gpt4v)	25,060	25,060	25,060	150,360	2,436,974	OCR QA	(116)
textvqa	21,953	21,953	34,602	553,990	141,882	OCR QA	(117)
ureader_cap	91,215	91,215	91,215	1,086,484	1,435,964	OCR QA	(118)
ureader_ie	17,320	17,320	17,320	406,237	128,229	OCR QA	(118)
ureader_kg_processed	37,550	37,550	37,550	352,907	2,013,731	OCR QA	(118)
ureader_qa_processed	252,953	252,953	252,953	7,100,750	930,617	OCR QA	(118)
visualmrc	3,027	3,027	11,988	139,751	147,385	OCR QA	(119)
ai2d_merged	4,858	4,858	12,325	755,455	1,319,140	Science	(120)
CoSyn_400k_chemical	8,942	8,942	55,391	634,881	2,450,290	Science	(13)
CoSyn_400k_circuit	10,470	10,470	67,939	713,575	2,637,618	Science	(13)
pathvqa	32,632	32,632	32,632	639,385	85,168	Science	(121)
pmc_vqa(mathv360k)	35,948	35,948	35,948	1,889,167	255,109	Science	(16)
scienceqa	4,976	4,976	6,149	1,081,220	18,447	Science	(122)
scienceqa(nona_context)	19,208	19,208	19,208	1,624,583	25,311	Science	(19)
tqa	2,749	2,749	12,567	395,956	149,776	Science	(123)
visualwebinstruct(filtered)	263,581	263,581	263,581	8,341,540	31,802,459	Science	(124)
vqarad	313	313	1,793	25,181	6,003	Science	(125)
text_code_feedback	0	66,383	221,096	19,349,056	79,752,351	Text-only	(126)
text_codefeedback_filtered_instruction	0	156,525	156,525	27,684,170	62,764,414	Text-only	(126)
text_infinitymath	0	101,380	101,380	9,158,132	212,543	Text-only	(127)
text_mathinstruct	0	262,039	262,039	20,405,295	44,145,362	Text-only	(128)
text_mathqa	0	394,996	394,996	23,552,035	72,451,061	Text-only	(129)
text_mathstepdpo10k	0	10,795	10,795	557,233	989,312	Text-only	(130)
text_numinamath_cot	0	859,494	859,494	75,818,870	387,758,581	Text-only	(131)
text_openhermes_2_5	0	1,001,551	1,008,268	142,376,960	233,561,291	Text-only	(132)
text_openorca	0	4,233,853	4,233,853	1,049,478,873	468,042,176	Text-only	(133)
text_orcamath	0	200,035	200,035	12,691,014	61,860,987	Text-only	(134)
text_pythoncode25k	0	49,626	49,626	1,629,286	4,945,892	Text-only	(135)
text_pythoncodealpaca	0	18,612	18,612	655,127	2,683,469	Text-only	(136)
text_ruozhiba	0	1,496	1,496	69,795	234,822	Text-only	(137)
text_theoremqa	0	800	800	50,065	3,468	Text-only	(138)
text_wizardlm_evol	0	69,999	69,999	7,753,963	21,955,856	Text-only	(139)
text_OpenMathInstruct-2	0	1,000,000	1,000,000	74,905,850	413,132,418	Text-only	(140)
Totals	17,372,293	24,322,193	88,928,343	3,168,958,417	9,459,677,828

Cleaning

After gathering all the sub-datasets, every turn is cleaned. We removed all individual turns whose combined question and answer length exceeds 8192 tokens. We resize big images to have a longest side of 2048 pixels while keeping the aspect ratio, and discard samples with corrupted images.

Rating

Finally, we rate every single turn in our dataset across 4 axes. For this, we used a LLM and VLM-as-a-judge pipeline (using Qwen3-32B and Qwen2.5VL-32B-Instruct), to rate every turn on a scale from 1-5 in these 4 categories:

Text Formatting Quality: How is the quality of the answer both linguistically and structurally? (Question and Answer)
Question-Answer Relevance: Does the answer properly respond to the question? (Question and Answer)
Visual Dependency: How much does the question depend on visual information to be answered? (Question only)
Image-Question Correspondence: How well does the image support answering the question? (Image and Question)

This is the distribution of scores across the different filters for FineVision:

Filter	1	2	3	4	5
Formatting	0.5	0.7	1.1	77.5	20.3
Relevance	2.9	0.5	14.7	16.5	65.4
Visual Dependency	11.0	20.4	2.6	24.2	41.8
Image Correspondence	8.1	3.6	17.3	26.8	44.1

FineVision Base Dataset

We classify FineVision’s subsets into 9 categories: Captioning & Knowledge, Chart & Table, General VQA, Grounding & Counting, Mathematics, Naive OCR, OCR QA, Science and Text-only (Fig. 1).

There are multiple ways to count the data in a multimodal dataset. The most common are the number of samples and the number of images. Additionally, a single sample can consist of multiple question/answer pairs in the form of a multi-turn conversation. Similarly to text-only datasets, the number of answer tokens is also interesting, since these are the tokens the model is actually trained on. We count all these characteristics for FineVision and arrive at 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens. Based on these 4 distributions, multiple different mixtures are possible. In conjunction with the provided ratings, we encourage the community to create their own mixtures and experiment with the data. For example, large categories could be downsamples, while high-quality data could be upsampled. After collecting and processing the data, we run multiple experiments and ablations to provide practical recommendations on how to train small, data-centric VLMs.

Figure 1: Distribution of Categories in FineVision by Answer Tokens, Number of Samples, Turns, and Images. While the distributions differ a bit with the different metrics, FineVision provides a good baseline mixture, especially when judging by the number of images in the individual categories. Samples from Chart & Table usually lend themselves well to multi-turn conversations, since multiple similar questions can be asked for a single Chart. Samples from OCR QA often have a lot of answer tokens, since they aim at detailed document understanding, which are rarely answered with a short sentence.

Experimental Setup

To ensure a fair comparison between different configurations, we use the same setup and evaluations for all of our ablations. This enables us to compare FineVision to other publicly available datasets as well as experiment with different intra-dataset configurations.

Model Architecture: nanoVLM

For all ablations and experiments, we train a 460M parameter VLM, since it provides a good trade-off between training time and model performance. We utilize the lightweight nanoVLM training framework with SmolLM2-360M-Instruct as the text backbone, and SigLIP2-Base-512 as the vision encoder. We experimented with a classic 2-stage training schedule where the first stage is used to train mainly the Modality Projection to align the Language and Image Embeddings, and the second stage is used to train the whole model. Interestingly, we did not observe any significant benefits from this additional first stage compared to training the whole model directly at our size and training duration, so we settled on a single-stage training for most ablations.

Baseline Datasets

We use three similar open source alternatives as baselines to compare our dataset to: The Cauldron, LLaVA-OneVision and Cambrian-7M.

Name	Images	Samples	Turns	Answer Tokens
Cauldron	2.0M	1.8M	27.8M	0.3B
LLaVa-Vision	2.5M	3.9M	9.1M	1.0B
Cambrian-7M	5.4M	7M	12.2M	0.8B
FineVision	17.3M	24.3M	88.9M	9.5B

Evaluations

We utilize lmms-eval during training to evaluate our ablations in a reproducible manner. We evaluate on a diverse set of 11 benchmarks: AI2D, ChartQA, DocVQA, InfoVQA, MME, MMMU, MMStar, OCRBench, ScienceQA, TextVQA and SEED-Bench. Since these benchmarks cover different topics and produce results on different scales, e.g. AI2D returns the accuracy of the exact matches (0-1), but MME returns a continuous score (0-2800), aggregating them is not trivial. In our ablations the relative performance between the different configurations matters, so to provide a robuts summary metric we determine the rank of each model compared to the others in every benchmark at every training step and average it over all the benchmarks. This way, we can judge where different configurations rank among each other over the course of training. To keep a sense of how big the absolute difference between models is, we also provide an average over all metrics and incorporate MME by normalizing it between 0 and 1.

Training Configuration

Each of our ablations trains the 460M model with a maximal image size of 1536x1536 pixel (without resizing smaller images) and a maximal input token length of 4096. This results in a maximum batch size of 2 for a single H100, which we adapt with 8 steps of gradient accumulation on each of the 32 GPUs for an effective batch size of 512. In all single stage configurations we train for 20k Steps on 32 H100s for approximately 20h while evaluating all 11 benchmarks every 1k Steps. If not specified otherwise, the “Baseline” in our intra dataset ablations refers to a training run on the full unfiltered and unchanged dataset. In this configuration, a full epoch of the unfiltered FineVision dataset takes 12k steps.

Experiments

While many interesting questions could be investigated, we mainly focus on the aspects of the training that are influenced by the data. Before we dive into the internal details of FineVision, let’s have a look at our performance against the baselines.

How does FineVision compare to other open datasets?

Here we see the first interesting trend: VLMs still benefit from training on a larger, more diverse dataset than what was available until today. FineVision doesn’t lead the race in the first few thousand training steps, after all, it does include new tasks such as pointing and agentic browsing, so it shouldn’t be better at first. But after seeing enough varied data, FineVision clearly shows the best performance across a wide set of benchmarks, which can be seen in its average ranking (Fig. 2). One epoch of FineVision in our setup takes 12k training steps, so we train for close to 2 epochs in these ablations. Looking at the average benchmark, we can see how the models saturate around different points: 18k steps for Cambrian, 12k for LLaVa and 7k for the Cauldron. In particular, over 11 different benchmarks, FineVision achieves an average improvement of 40.7% over the Cauldron, 12.1% over Cambrian, and 46.3% over LLaVa, which increases to 51.3%, 18.6% and 58.0% when comparing the deduplicated versions of the datasets. Additionally, FineVision includes data for tasks such as agentic browsing, and counting and pointing, which are not part of the other baselines.

Figure 2: Average Rank of Models trained on different open source datasets. FineVision shows both the highest average rank as well as the highest average over benchmarks.

How much test data is in publicly available datasets?

We investigate data leakage by finding images from test sets that appear in the dataset. For this, we constructed an image deduplication pipeline. We used this pipeline to compare all images in FineVision to all images of 66 image-text benchmarks from the lmms-eval framework.

For the comparison, we embed the images using the SSCD descriptor, and compute the cosine similarity between a given image in FineVision and all images from the test-set embeddings. Whenever a sample has a similarity higher than a threshold of 0.95 it is assumed to be a duplicate.

While our tests with various thresholds show that this is still flagging more false-positives than false-negatives, given the scale of data we have, we preferred to err on the side of caution.

Below is an example of a correctly identified Duplicate (“Photo”), a false-positive with a similarity score above 0.95 (“Chart”) and a false-negative with a similarity score below 0.95 (“Drawing”) (Fig. 3).

We open-source the deduplication pipeline here as well as the precomputed test-set embedding’s here.

Figure 3: Examples of the Deduplication Results.

We repeated this deduplication procedure on all the baselines to analyse how contaminated they are. We found that all baselines contain between 2-3% images from test benchmarks, and removing them results in a performance drop of 2.4-2.8%. Interestingly, we find that for some benchmarks the difference is negligible, while other benchmarks suffer significantly. For example, after deduplicating, ScienceQA falls by 14.49% on average while OCRBench only drops by 1.08%. This deduplications also shows that FineVision contains the smallest relative amount of duplicated data at 1%, and also suffers the smallest performance drop over all benchmarks after deduplication at just 1.45%.

Name	Samples	Contamination Rate	Performance Drop
Cauldron	1.8M	3.05%	2.39%
Llava-Vision	3.9M	2.15%	2.72%
Cambrian-7M	7.0M	2.29%	2.78%
FineVision	24.3M	1.02%	1.45%

Additionally, we experimented with removing all found samples from all datasets to see if the outcome is different from Fig. 2, but we observe the same distribution (Fig. 4).

Figure 4: Average Rank of Models trained on different deduplicated open source datasets. Even after deduplicating all dataset, FineVision shows the best performance.

How diverse are the datasets?

Similarly to the size comparison, we also wanted to evaluate the datasets for diversity. Evaluating the diversity of a dataset is a field of study for itself, which we will not dive into here, rather we borrow techniques from computer vision and use the already computed SSCD embeddings as a proxy of visual diversity. To not rely on a subsample of the dataset in estimating the diversity, we analyse the covariance metric of the full embeddings. From this covariance matrix, we can calculate the eigenvalues for analysis. We get the effective rank of the covariance matrix, which measures how uniformly the variance is distributed across dimensions, as well as the participation ratio, which measures how many dimensions actively contribute to the overall variance. To obtain a single diversity score for the datasets, we normalize the effective rank and participation ratio with the embedding dimension and compute their geometric mean. We observe that FineVision is not only the biggest, but also the most diverse dataset. Additionally, you can also clearly see that more images do not necessarily result in more diversity, since LLaVa is substantially less diverse than the Cauldron, even with more images.

Name	Images	Effective Rank	Participation Ratio	Diversity
Cauldron	2.0M	324.05	129.22	0.400
LLaVa-Vision	2.5M	267.89	87.05	0.298
Cambrian-7M	5.4M	359.73	152.70	0.458
FineVision	17.3M	359.22	182.52	0.500

Should you merge multiple questions for the same image into a single multi turn conversation?

Since the training of a VLM already builds upon pretrained vision and language backbones, datasets are usually not completely unstructured, but follow an image+question and answer structure. Some works have shown that consolidating multiple questions for the same image into a multi-turn conversation where the image is shown only once improves model performance, reduces training budget, and reduces the datasets’ memory footprint. We therefore experiment with deduplicating every image in our dataset internally using the same SSCD descriptors, manually inspect the resulting clusters, and merge fitting samples into a multi-turn conversation.

When training with the same training budget, we find that both models perform very similarly (Fig. 5). Some benchmarks favor one image/several turns, while others favor one image/one turn. Given this, we decide to release the dataset without merging multiple questions for the same image, and open-source the pipeline in case users want to explore this further.

Figure 5: Average Ranking of Models trained with internally deduplicated / merged samples. No clear benefit in merging can be seen with respect to model performance.

Should you train on multilingual data if your language backbone was not?

There are some multilingual datasets in our mixture, but since our Language Backbone is only trained on English data, we experimented with removing all the multilingual, mainly Chinese, subsets. Our results show that there is a slight advantage in leaving the multilingual data, even if it was not part of the Language Backbone’s initial training. We believe this reinforces our hypothesis that more diversity in the dataset is generally preferable for VLM training. In our training setup with this configuration, one epoch over the whole non-deduplicated dataset equals ~12k steps, so the benefit of unseen languages only materializes after the first full epoch (Fig. 5).

Figure 6: Average Rank of Models trained with and without multilingual samples. Keeping samples in unseen langauges improves performance after the first epoch.

How can you assess the quality of the dataset?

The usual goal for every dataset, to collect samples with the highest quality possible, is quite an abstract endeavour in practice, especially for multimodal datasets. Additionally, different training stages usually have different qualitative and quantitative requirements. Finally, tuning the mixtures of different categories is also reliant on how much data with what quality is available. For image-text datasets, there are 3 different combinatorial ways to evaluate a sample: text-only, image-only, and image-text correspondence. The question persists, how do you actually measure the quality of a sample, especially if you have to do so in 3 different ways. We propose doing so by leveraging both a LLM and a VLM as a judge.

To try to quantify the quality of the training data and the effect it has on the model’s performance, we run extensive ablations on our generated ratings.

Figure 6: Average Rank of Models trained with samples that have all 4 ratings above a certain threshold. Keeping all samples results in the best performance.

Interestingly, both when only training on turns that have any of the 4 ratings under a certain threshold, as well as when training on turns where only a single rating at a time is used, we observe the same behaviour. Simply training on the most diverse data, that one containing all samples, outperforms in benchmarks (Fig. 6) (Fig. 7). This could mean multiple things. Firstly, we can see almost the same distribution in the ranks across all filters: from best to worst with an increase in the rating threshold. For example the visual dependency and the image correspondence rating both result in exactly the same distribution of rankings, corresponding to the natural order of options, 1 through 5. This could indicate that with a sufficiently large dataset that you train on long enough, it hurts more to remove samples, even if they were judged to be of low quality, than to train on them.

Additionally, the notion of quality in VLM datasets is inherently nuanced. Unlike LLMs, where pre-training often relies on massive web crawls, training a VLM is closer to the supervised fine-tuning (SFT) stage. We do not train on crawls of internet data, instead we train on individual samples of Image-Question and Answer pairs, and these datapoints are usually ‘curated rather than collected’. We also do not train on trillions of tokens, but on billions. This built-in curation provides a baseline level of quality from the start. FineVision follows this pattern: it brings together widely used VLM datasets along with a few new ones in low-resource domains. We could therefore be trying to measure and quantify noisy nuances in the quality of Image-Question-Answer Pairs, instead of using the fact that they are already curated SFT datasets as the measure for quality.

Alternatively, while we used state-of-the-art open source models to judge our datapoints, we still had to find a compromise between model quality and cost due to the raw required effort to rate every single turn of FineVision. The chosen models could simply not be powerful enough to recognize and judge the quality of samples. Even though our first proposal to judge the quality of multimodal data on a per-turn basis did not yield any improvement in model performance, we believe that this is still an exciting and important direction of research and hope the release of FineVision encourages the community to develop techniques for this at large scale.

Model Performance After Applying Individual Filters

Figure 7: Comparison across thresholds for all four filters individually: Formatting, Relevance, Visual Dependency, and Image-Question Correspondence. Keeping all samples results in the best average performance.

Should you train in multiple stages?

The standard training procedure of a VLM usually follows at least two stages. First, you train only the connecting module, potentially in addition the image encoder, and then you train the whole model in a second stage. Some work has even introduced an additional Stage 2.5 (141), where you train the full model on a smaller subset of higher quality data. To investigate this on small models, we experiment both with single, dual and triple stage training.

1 Stage vs 2 Stages

To evaluate if pre-training the Modality Projection and the Vision Encoder provides any benefits to the final model performance, we conduct this experiment at a higher image resolution of 2048px and train substantially longer. We can see that even for training longer, the general difference in model performance is quite small. Individual benchmarks, do show differences (ScienceQA drops by 5% but OCRBench improves by 5% in the two-stage setup) (Fig. 8), so the better setup is individual to the desired model capabilities. This also shows that evaluation (and through this also correctly training) a VLM is not straightforward tasks, since availible benchmarks are limited proxies for the underlying model performance.

Figure 8: Average Rank of a model trained for 60K steps in a single stage, and a model trained for the same 60k steps on top of pretraining the Modality Projection and Vision Encoder for 15k steps. The pre-training procedure is not depicted in this graph.

2 Stages vs 2.5 Stages

We also experiment if splitting the second stage results in any performance improvements.

We take the baseline, and continue training for another 20k steps, both with the unfiltered (>= 1) as well as filtered subsets of FineVision according to our ratings.

Figure 9: Average Rank if a model trained for an additional 20K steps on top of unfiltered training for 20K steps. Subselecting data for the final training steps does not yield a performance improvement with our quality measure. Only the 20k steps for the final stage are depicted here, the first 20k steps are the same for all variations.

As in the previous results, we observe that the best outcome is simply achieved by training on as much and as diverse data as possible (Fig. 9). Like before, this could also be due to the way we filter the data, and a different quality measure might yield different results.

Conclusion

We introduce FineVision, a new state of the art open dataset to train VLMs, that is both bigger and more diverse than previous open source datasets. We provide extensive analysis regarding size, diversity, contamination and data-centric model training, and hope we can empower both further research and the community with this.

1.
hazal-karakus/mscoco-controlnet-canny-less-colors · Datasets at Hugging Face [Internet]. [cited 2025 Aug 28]. Available from: https://huggingface.co/datasets/hazal-karakus/mscoco-controlnet-canny-less-colors/viewer
2.
Li X, Zhang F, Diao H, Wang Y, Wang X, Duan L. Densefusion-1m: Merging vision experts for comprehensive multimodal perception. Advances in Neural Information Processing Systems [Internet]. 2024 [cited 2025 Aug 28];37:18535–56. Available from: https://proceedings.neurips.cc/paper_files/paper/2024/hash/20ffc2b42c7de4a1960cfdadf305bbe2-Abstract-Datasets_and_Benchmarks_Track.html
3.
Mollahosseini A, Hasani B, Mahoor MH. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing [Internet]. 2017 [cited 2025 Sep 2];10(1):18–31. Available from: https://ieeexplore.ieee.org/abstract/document/8013713/
4.
Weyand T, Araujo A, Cao B, Sim J. Google Landmarks Dataset v2 – A Large-Scale Benchmark for Instance-Level Recognition and Retrieval [Internet]. 2020. Available from: https://arxiv.org/abs/2004.01804
5.
Pi R, Zhang J, Zhang J, Pan R, Chen Z, Zhang T. Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2406.07502
6.
laion/gpt4v-dataset · Datasets at Hugging Face [Internet]. 2023 [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/laion/gpt4v-dataset
7.
Pont-Tuset J, Uijlings J, Changpinyo S, Soricut R, Ferrari V. Connecting Vision and Language with Localized Narratives. In: Vedaldi A, Bischof H, Brox T, Frahm J-M, editors. Computer Vision – ECCV 2020 [Internet]. Cham: Springer International Publishing; 2020 [cited 2025 Aug 28]. p. 647–64. Available from: https://link.springer.com/10.1007/978-3-030-58558-7_38
8.
ShareGPT-4o [Internet]. [cited 2025 Aug 28]. Available from: https://sharegpt4o.github.io/
9.
Chen L, Li J, Dong X, Zhang P, He C, Wang J, et al. ShareGPT4V: Improving Large Multi-modal Models with Better Captions. In: Leonardis A, Ricci E, Roth S, Russakovsky O, Sattler T, Varol G, editors. Computer Vision – ECCV 2024 [Internet]. Cham: Springer Nature Switzerland; 2025 [cited 2025 Sep 2]. p. 370–87. Available from: https://link.springer.com/10.1007/978-3-031-72643-9_22
back: 1, 2, 3, 4
10.
Sidorov O, Hu R, Rohrbach M, Singh A. TextCaps: A Dataset for Image Captioning with Reading Comprehension. In: Vedaldi A, Bischof H, Brox T, Frahm J-M, editors. Computer Vision – ECCV 2020 [Internet]. Cham: Springer International Publishing; 2020 [cited 2025 Aug 28]. p. 742–58. Available from: https://link.springer.com/10.1007/978-3-030-58536-5_44
11.
Kantharaj S, Leong RTK, Lin X, Masry A, Thakkar M, Hoque E, et al. Chart-to-Text: A Large-Scale Benchmark for Chart Summarization [Internet]. arXiv; 2022 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2203.06486
12.
Masry A, Long DX, Tan JQ, Joty S, Hoque E. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning [Internet]. arXiv; 2022 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2203.10244
13.
Yang Y, Patel A, Deitke M, Gupta T, Weihs L, Head A, et al. Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation [Internet]. arXiv; 2025 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2502.14846
back: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
14.
Kafle K, Price B, Cohen S, Kanan C. Dvqa: Understanding data visualizations via question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition [Internet]. 2018 [cited 2025 Aug 28]. p. 5648–56. Available from: http://openaccess.thecvf.com/content_cvpr_2018/html/Kafle_DVQA_Understanding_Data_CVPR_2018_paper.html
15.
Kahou SE, Michalski V, Atkinson A, Kadar A, Trischler A, Bengio Y. FigureQA: An Annotated Figure Dataset for Visual Reasoning [Internet]. arXiv; 2018 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/1710.07300
16.
Shi W, Hu Z, Bin Y, Liu J, Yang Y, Ng S-K, et al. Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [Internet]. arXiv; 2024 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2406.17294
back: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
17.
Chen Z, Chen W, Smiley C, Shah S, Borova I, Langdon D, et al. FinQA: A Dataset of Numerical Reasoning over Financial Data [Internet]. arXiv; 2022 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2109.00122
18.
Cheng Z, Dong H, Wang Z, Jia R, Guo J, Gao Y, et al. HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation [Internet]. arXiv; 2022 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2108.06712
19.
lmms-lab/LLaVA-OneVision-Data · Datasets at Hugging Face [Internet]. 2025 [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data
back: 1, 2, 3
20.
Liu F, Wang X, Yao W, Chen J, Song K, Cho S, et al. MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2311.10774
21.
Zhao Y, Li Y, Li C, Zhang R. MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data [Internet]. arXiv; 2022 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2206.01347
22.
Methani N, Ganguly P, Khapra MM, Kumar P. Plotqa: Reasoning over scientific plots. In: Proceedings of the ieee/cvf winter conference on applications of computer vision [Internet]. 2020 [cited 2025 Aug 28]. p. 1527–36. Available from: http://openaccess.thecvf.com/content_WACV_2020/html/Methani_PlotQA_Reasoning_over_Scientific_Plots_WACV_2020_paper.html
23.
Zhao Y, Zhao C, Nan L, Qi Z, Zhang W, Tang X, et al. RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations [Internet]. arXiv; 2023 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2306.14321
back: 1, 2, 3
24.
Nassar A, Marafioti A, Omenetti M, Lysak M, Livathinos N, Auer C, et al. SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion [Internet]. arXiv; 2025 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2503.11576
back: 1, 2, 3, 4
25.
Lu P, Qiu L, Chang K-W, Wu YN, Zhu S-C, Rajpurohit T, et al. Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning [Internet]. arXiv; 2023 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2209.14610
26.
Zhu F, Lei W, Feng F, Wang C, Zhang H, Chua T-S. Towards Complex Document Understanding By Discrete Reasoning. In: Proceedings of the 30th ACM International Conference on Multimedia [Internet]. Lisboa Portugal: ACM; 2022 [cited 2025 Aug 28]. p. 4857–66. Available from: https://dl.acm.org/doi/10.1145/3503161.3548422
27.
Zhu F, Lei W, Huang Y, Wang C, Zhang S, Lv J, et al. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance [Internet]. arXiv; 2021 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2105.07624
28.
Masry A, Kavehzadeh P, Do XL, Hoque E, Joty S. UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning [Internet]. arXiv; 2023 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2305.14761
29.
Tang BJ, Boggust A, Satyanarayan A. VisText: A Benchmark for Semantically Rich Chart Captioning [Internet]. arXiv; 2023 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2307.05356
30.
jp1924/VQAonBD · Datasets at Hugging Face [Internet]. [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/jp1924/VQAonBD
31.
Shridhar M, Yuan X, Côté M-A, Bisk Y, Trischler A, Hausknecht M. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [Internet]. arXiv; 2021 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2010.03768
32.
Chen GH, Chen S, Zhang R, Chen J, Wu X, Zhang Z, et al. ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2402.11684
back: 1, 2
33.
Tong P, Brown E, Wu P, Woo S, IYER AJV, Akula SC, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems [Internet]. 2024 [cited 2025 Aug 28];37:87310–56. Available from: https://proceedings.neurips.cc/paper_files/paper/2024/hash/9ee3a664ccfeabc0da16ac6f1f1cfe59-Abstract-Conference.html
34.
REILX/chinese-meme-description-dataset · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/REILX/chinese-meme-description-dataset
35.
Ren M, Kiros R, Zemel R. Exploring models and data for image question answering. Advances in neural information processing systems [Internet]. 2015 [cited 2025 Aug 28];28. Available from: https://proceedings.neurips.cc/paper/2015/hash/831c2f88a604a07ca94314b56a4921b8-Abstract.html
36.
Belouadi J, Lauscher A, Eger S. AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2310.00367
back: 1, 2
37.
Sima C, Renz K, Chitta K, Chen L, Zhang H, Xie C, et al. DriveLM: Driving with Graph Visual Question Answering. In: Leonardis A, Ricci E, Roth S, Russakovsky O, Sattler T, Varol G, editors. Computer Vision – ECCV 2024 [Internet]. Cham: Springer Nature Switzerland; 2025 [cited 2025 Aug 28]. p. 256–74. Available from: https://link.springer.com/10.1007/978-3-031-72943-0_15
38.
Kiela D, Firooz H, Mohan A, Goswami V, Singh A, Ringshia P, et al. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems [Internet]. 2020 [cited 2025 Aug 28];33:2611–24. Available from: https://proceedings.neurips.cc/paper/2020/hash/1b84c4cee2b8b3d823b30e2d604b1878-Abstract.html
39.
Lu P, Qiu L, Chen J, Xia T, Zhao Y, Zhang W, et al. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning [Internet]. arXiv; 2022 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2110.13214
40.
Cha S, Lee J, Lee Y, Yang C. Visually Dehallucinative Instruction Generation: Know What You Don’t Know [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2402.09717
41.
Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. Advances in neural information processing systems [Internet]. 2023 [cited 2025 Aug 28];36:34892–916. Available from: https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html
42.
Zhang Y, Zhang R, Gu J, Zhou Y, Lipka N, Yang D, et al. LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2306.17107
43.
Vik’s ML Research Blog [Internet]. [cited 2025 Aug 28]. Available from: https://vikhyat.net/posts/2024-08-17-lnqa.html
44.
Liu F, Lin K, Li L, Wang J, Yacoob Y, Wang L. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2306.14565
45.
Wang J, Meng L, Weng Z, He B, Wu Z, Jiang Y-G. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning [Internet]. arXiv; 2023 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2311.07574
46.
Li B, Zhang Y, Chen L, Wang J, Pu F, Yang J, et al. MIMIC-IT: Multi-Modal In-Context Instruction Tuning [Internet]. arXiv; 2023 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2306.05425
47.
Luo R, Zhang H, Chen L, Lin T-E, Liu X, Wu Y, et al. MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2409.05840
48.
Wu S, Zhu K, Bai Y, Liang Y, Li Y, Wu H, et al. MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2407.17379
49.
Suhr A, Zhou S, Zhang A, Zhang I, Bai H, Artzi Y. A Corpus for Reasoning About Natural Language Grounded in Photographs [Internet]. arXiv; 2019 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/1811.00491
50.
Tu H, Cui C, Wang Z, Zhou Y, Zhao B, Han J, et al. How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [Internet]. arXiv; 2023 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2311.16101
back: 1, 2
51.
Yu Y, Chung S, Lee B-K, Ro YM. SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2408.12114
52.
Yang K, Russakovsky O, Deng J. Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision [Internet]. 2019 [cited 2025 Aug 28]. p. 2051–60. Available from: http://openaccess.thecvf.com/content_ICCV_2019/html/Yang_SpatialSense_An_Adversarially_Crowdsourced_Benchmark_for_Spatial_Relation_Recognition_ICCV_2019_paper.html
53.
Jhamtani H, Berg-Kirkpatrick T. Learning to Describe Differences Between Pairs of Similar Images [Internet]. arXiv; 2018 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/1808.10584
54.
Xu Z, Feng C, Shao R, Ashby T, Shen Y, Jin D, et al. Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2402.11690
55.
Zhu Y, Groth O, Bernstein M, Fei-Fei L. Visual7w: Grounded question answering in images. In: Proceedings of the IEEE conference on computer vision and pattern recognition [Internet]. 2016 [cited 2025 Aug 28]. p. 4995–5004. Available from: http://openaccess.thecvf.com/content_cvpr_2016/html/Zhu_Visual7W_Grounded_Question_CVPR_2016_paper.html
56.
Gurari D, Li Q, Stangl AJ, Guo A, Lin C, Grauman K, et al. Vizwiz grand challenge: Answering visual questions from blind people. In: Proceedings of the IEEE conference on computer vision and pattern recognition [Internet]. 2018 [cited 2025 Aug 28]. p. 3608–17. Available from: http://openaccess.thecvf.com/content_cvpr_2018/html/Gurari_VizWiz_Grand_Challenge_CVPR_2018_paper.html
57.
Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition [Internet]. 2017 [cited 2025 Aug 28]. p. 6904–13. Available from: http://openaccess.thecvf.com/content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_paper.html
58.
Liu F, Emerson G, Collier N. Visual spatial reasoning. Transactions of the Association for Computational Linguistics [Internet]. 2023 [cited 2025 Aug 28];11:635–51. Available from: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00566/116470
59.
Laurençon H, Tronchon L, Sanh V. Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2403.09029
60.
Lu Y, Jiang D, Chen W, Wang WY, Choi Y, Lin BY. Wildvision: Evaluating vision-language models in the wild with human preferences. Advances in Neural Information Processing Systems [Internet]. 2024 [cited 2025 Aug 28];37:48224–55. Available from: https://proceedings.neurips.cc/paper_files/paper/2024/hash/563991b5c8b45fe75bea42db738223b2-Abstract-Datasets_and_Benchmarks_Track.html
61.
Nandy A, Agarwal Y, Patwa A, Das MM, Bansal A, Raj A, et al. YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2409.13592
62.
Xu Y, Wang Z, Wang J, Lu D, Xie T, Saha A, et al. Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [Internet]. arXiv; 2025 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2412.04454
63.
Zheng L, Huang Z, Xue Z, Wang X, An B, Yan S. AgentStudio: A Toolkit for Building General Virtual Agents [Internet]. arXiv; 2025 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2403.17918
64.
Shao S, Li Z, Zhang T, Peng C, Yu G, Zhang X, et al. Objects365: A Large-Scale, High-Quality Dataset for Object Detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) [Internet]. Seoul, Korea (South): IEEE; 2019 [cited 2025 Sep 2]. p. 8429–38. Available from: https://ieeexplore.ieee.org/document/9009553/
65.
Acharya M, Kafle K, Kanan C. Tallyqa: Answering complex counting questions. In: Proceedings of the AAAI conference on artificial intelligence [Internet]. 2019 [cited 2025 Aug 28]. p. 8076–84. Available from: https://ojs.aaai.org/index.php/AAAI/article/view/4815
66.
Lindström AD, Abraham SS. CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning [Internet]. arXiv; 2022 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2208.05358
67.
Gao J, Pi R, Zhang J, Ye J, Zhong W, Wang Y, et al. G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [Internet]. arXiv; 2025 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2312.11370
back: 1, 2
68.
Lu P, Gong R, Jiang S, Qiu L, Huang S, Liang X, et al. Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning [Internet]. arXiv; 2021 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2105.04165
69.
Kazemi M, Alvari H, Anand A, Wu J, Chen X, Soricut R. GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning [Internet]. arXiv; 2023 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2312.12241
70.
Cao J, Xiao J. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In: Proceedings of the 29th international conference on computational linguistics [Internet]. 2022 [cited 2025 Aug 28]. p. 1511–20. Available from: https://aclanthology.org/2022.coling-1.130/
71.
Seo M, Hajishirzi H, Farhadi A, Etzioni O, Malcolm C. Solving geometry problems: Combining text and diagram interpretation. In: Proceedings of the 2015 conference on empirical methods in natural language processing [Internet]. 2015 [cited 2025 Aug 28]. p. 1466–76. Available from: https://aclanthology.org/D15-1171.pdf
72.
Lu P, Gong R, Jiang S, Qiu L, Huang S, Liang X, et al. Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning [Internet]. arXiv; 2021 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2105.04165
73.
Zhang R, Wei X, Jiang D, Guo Z, Li S, Zhang Y, et al. MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2407.08739
back: 1, 2
74.
Zhang C, Gao F, Jia B, Zhu Y, Zhu S-C. Raven: A dataset for relational and analogical visual reasoning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition [Internet]. 2019 [cited 2025 Aug 28]. p. 5317–27. Available from: http://openaccess.thecvf.com/content_CVPR_2019/html/Zhang_RAVEN_A_Dataset_for_Relational_and_Analogical_Visual_REasoNing_CVPR_2019_paper.html
75.
Chng CK, Liu Y, Sun Y, Ng CC, Luo C, Ni Z, et al. Icdar2019 robust reading challenge on arbitrary-shaped text-rrc-art. In: 2019 International Conference on Document Analysis and Recognition (ICDAR) [Internet]. IEEE; 2019 [cited 2025 Sep 2]. p. 1571–6. Available from: https://ieeexplore.ieee.org/abstract/document/8978157/
76.
Mouchere H, Viard-Gaudin C, Zanibbi R, Garain U, Kim DH, Kim JH. Icdar 2013 crohme: Third international competition on recognition of online handwritten mathematical expressions. In: 2013 12th International Conference on Document Analysis and Recognition [Internet]. IEEE; 2013 [cited 2025 Aug 28]. p. 1428–32. Available from: https://ieeexplore.ieee.org/abstract/document/6628849/
77.
Veit A, Matera T, Neumann L, Matas J, Belongie S. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images [Internet]. arXiv; 2016 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/1601.07140
78.
Yuan T-L, Zhu Z, Xu K, Li C-J, Mu T-J, Hu S-M. A Large Chinese Text Dataset in the Wild. J Comput Sci Technol [Internet]. 2019 May [cited 2025 Sep 2];34(3):509–21. Available from: http://link.springer.com/10.1007/s11390-019-1923-y
79.
Jaume G, Ekenel HK, Thiran J-P. Funsd: A dataset for form understanding in noisy scanned documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) [Internet]. IEEE; 2019 [cited 2025 Aug 28]. p. 1–6. Available from: https://ieeexplore.ieee.org/abstract/document/8892998/
80.
Yuan Y, Liu X, Dikubab W, Liu H, Ji Z, Wu Z, et al. Syntax-aware network for handwritten mathematical expression recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition [Internet]. 2022 [cited 2025 Aug 28]. p. 4553–62. Available from: http://openaccess.thecvf.com/content/CVPR2022/html/Yuan_Syntax-Aware_Network_for_Handwritten_Mathematical_Expression_Recognition_CVPR_2022_paper.html
81.
Mathew M, Gomez L, Karatzas D, Jawahar CV. Asking questions on handwritten document collections. IJDAR [Internet]. 2021 Sep [cited 2025 Aug 28];24(3):235–49. Available from: https://link.springer.com/10.1007/s10032-021-00383-3
back: 1, 2
82.
Marti U-V, Bunke H. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition [Internet]. 2002 Nov 1 [cited 2025 Aug 28];5(1):39–46. Available from: http://link.springer.com/10.1007/s100320200071
83.
Mishra A, Alahari K, Jawahar CV. Scene text recognition using higher order language priors. In: BMVC-British machine vision conference [Internet]. BMVA; 2012 [cited 2025 Aug 28]. Available from: https://inria.hal.science/hal-00818183/
84.
Krishnan P, Kovvuri R, Pang G, Vassilev B, Hassner T. Textstylebrush: transfer of text aesthetics from a single example. IEEE Transactions on Pattern Analysis and Machine Intelligence [Internet]. 2023 [cited 2025 Aug 28];45(7):9122–34. Available from: https://ieeexplore.ieee.org/abstract/document/10027471/
85.
Im2Latex [Internet]. [cited 2025 Sep 2]. Available from: https://sujayr91.github.io/Im2Latex/
86.
OleehyO/latex-formulas · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Aug 28]. Available from: https://huggingface.co/datasets/OleehyO/latex-formulas
87.
Li Z, Lin Y, Chiang Y-Y, Weinman J, Tual S, Chazalon J, et al. ICDAR 2024 Competition on Historical Map Text Detection, Recognition, and Linking. In: Barney Smith EH, Liwicki M, Peng L, editors. Document Analysis and Recognition - ICDAR 2024 [Internet]. Cham: Springer Nature Switzerland; 2024 [cited 2025 Aug 28]. p. 363–80. Available from: https://link.springer.com/10.1007/978-3-031-70552-6_22
88.
Gervais P, Fadeeva A, Maksai A. MathWriting: A Dataset For Handwritten Mathematical Expression Recognition. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V2 [Internet]. Toronto ON Canada: ACM; 2025 [cited 2025 Sep 2]. p. 5459–69. Available from: https://dl.acm.org/doi/10.1145/3711896.3737436
89.
Sharma C, Bhageria D, Scott W, PYKL S, Das A, Chakraborty T, et al. SemEval-2020 Task 8: Memotion Analysis – The Visuo-Lingual Metaphor! [Internet]. arXiv; 2020 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2008.03781
90.
Diem M, Fiel S, Kleber F, Sablatnig R, Saavedra JM, Contreras D, et al. ICFHR 2014 competition on handwritten digit string recognition in challenging datasets (HDSRC 2014). In: 2014 14th International Conference on Frontiers in Handwriting Recognition [Internet]. IEEE; 2014 [cited 2025 Aug 28]. p. 779–84. Available from: https://ieeexplore.ieee.org/abstract/document/6981115/
91.
wendlerc/RenderedText · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Aug 28]. Available from: https://huggingface.co/datasets/wendlerc/RenderedText
92.
Huang Z, Chen K, He J, Bai X, Karatzas D, Lu S, et al. Icdar2019 competition on scanned receipt ocr and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR) [Internet]. IEEE; 2019 [cited 2025 Aug 28]. p. 1516–20. Available from: https://ieeexplore.ieee.org/abstract/document/8977955/
93.
Yu W, Zhang C, Cao H, Hua W, Li B, Chen H, et al. ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images. In: Fink GA, Jain R, Kise K, Zanibbi R, editors. Document Analysis and Recognition - ICDAR 2023 [Internet]. Cham: Springer Nature Switzerland; 2023 [cited 2025 Sep 2]. p. 536–52. Available from: https://link.springer.com/10.1007/978-3-031-41679-8_32
94.
Kim G, Hong T, Yim M, Nam J, Park J, Yim J, et al. OCR-Free Document Understanding Transformer. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors. Computer Vision – ECCV 2022 [Internet]. Cham: Springer Nature Switzerland; 2022 [cited 2025 Aug 28]. p. 498–517. Available from: https://link.springer.com/10.1007/978-3-031-19815-1_29
95.
https://ai.100tal.com/dataset [Internet]. [cited 2025 Sep 2]. Available from: https://ai.100tal.com/dataset
96.
Xie X, Fu L, Zhang Z, Wang Z, Bai X. Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors. Computer Vision – ECCV 2022 [Internet]. Cham: Springer Nature Switzerland; 2022 [cited 2025 Aug 28]. p. 303–21. Available from: https://link.springer.com/10.1007/978-3-031-19815-1_18
97.
Poznanski J, Rangapur A, Borchardt J, Dunkelberger J, Huff R, Lin D, et al. olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models [Internet]. arXiv; 2025 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2502.18443
back: 1, 2
98.
Schwenk D, Khandelwal A, Clark C, Marino K, Mottaghi R. A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors. Computer Vision – ECCV 2022 [Internet]. Cham: Springer Nature Switzerland; 2022 [cited 2025 Aug 28]. p. 146–62. Available from: https://link.springer.com/10.1007/978-3-031-20074-8_9
back: 1, 2
99.
Li L, Wang Y, Xu R, Wang P, Feng X, Kong L, et al. Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models [Internet]. arXiv; 2024 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2403.00231
100.
Bhushan S, Lee M. Block diagram-to-text: Understanding block diagram images by generating natural language descriptors. In: Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022 [Internet]. 2022 [cited 2025 Sep 2]. p. 153–68. Available from: https://aclanthology.org/2022.findings-aacl.15/
back: 1, 2
101.
Kamizuru00/diagram_image_to_text · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Aug 28]. Available from: https://huggingface.co/datasets/Kamizuru00/diagram_image_to_text
102.
Mathew M, Karatzas D, Jawahar CV. Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision [Internet]. 2021 [cited 2025 Aug 28]. p. 2200–9. Available from: http://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html
103.
Wang X, Liu Y, Shen C, Ng CC, Luo C, Jin L, et al. On the general value of evidence, and bilingual scene-text visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition [Internet]. 2020 [cited 2025 Aug 28]. p. 10126–35. Available from: http://openaccess.thecvf.com/content_CVPR_2020/html/Wang_On_the_General_Value_of_Evidence_and_Bilingual_Scene-Text_Visual_CVPR_2020_paper.html
104.
ift/handwriting_forms · Datasets at Hugging Face [Internet]. [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/ift/handwriting_forms
105.
Mathew M, Bagal V, Tito R, Karatzas D, Valveny E, Jawahar CV. Infographicvqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision [Internet]. 2022 [cited 2025 Aug 28]. p. 1697–706. Available from: http://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html
back: 1, 2, 3
106.
mychen76/invoices-and-receipts_ocr_v1 · Datasets at Hugging Face [Internet]. 2025 [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1
107.
Chang S, Palzer D, Li J, Fosler-Lussier E, Xiao N. MapQA: A Dataset for Question Answering on Choropleth Maps [Internet]. arXiv; 2022 [cited 2025 Aug 28]. Available from: http://arxiv.org/abs/2211.08545
108.
Sharma C, Paka W, Scott DB, Das A, Poria S, Chakraborty T, et al. Task report: Memotion analysis 1.0@ semeval 2020: The visuo-lingual metaphor. In: Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, Sep Association for Computational Linguistics. 2020. p. 759–73.
109.
Mishra A, Shekhar S, Singh AK, Chakraborty A. Ocr-vqa: Visual question answering by reading text in images. In: 2019 international conference on document analysis and recognition (ICDAR) [Internet]. IEEE; 2019 [cited 2025 Sep 1]. p. 947–52. Available from: https://ieeexplore.ieee.org/abstract/document/8978122/
110.
Ding Y, Luo S, Chung H, Han SC. PDF-VQA: A New Dataset for Real-World VQA on PDF Documents. In: De Francisci Morales G, Perlich C, Ruchansky N, Kourtellis N, Baralis E, Bonchi F, editors. Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track [Internet]. Cham: Springer Nature Switzerland; 2023 [cited 2025 Sep 1]. p. 585–601. Available from: https://link.springer.com/10.1007/978-3-031-43427-3_35
111.
Wang B, Li G, Zhou X, Chen Z, Grossman T, Li Y. Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning. In: The 34th Annual ACM Symposium on User Interface Software and Technology [Internet]. Virtual Event USA: ACM; 2021 [cited 2025 Sep 1]. p. 498–510. Available from: https://dl.acm.org/doi/10.1145/3472749.3474765
112.
Hsiao Y-C, Zubach F, Baechler G, Sunkara S, Carbune V, Lin J, et al. ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots [Internet]. arXiv; 2025 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2209.08199
113.
Tanaka R, Nishida K, Nishida K, Hasegawa T, Saito I, Saito K. Slidevqa: A dataset for document visual question answering on multiple images. In: Proceedings of the AAAI Conference on Artificial Intelligence [Internet]. 2023 [cited 2025 Sep 1]. p. 13636–45. Available from: https://ojs.aaai.org/index.php/AAAI/article/view/26598
114.
Biten AF, Tito R, Mafla A, Gomez L, Rusinol M, Valveny E, et al. Scene text visual question answering. In: Proceedings of the IEEE/CVF international conference on computer vision [Internet]. 2019 [cited 2025 Sep 1]. p. 4291–301. Available from: http://openaccess.thecvf.com/content_ICCV_2019/html/Biten_Scene_Text_Visual_Question_Answering_ICCV_2019_paper.html
115.
sujet-ai/Sujet-Finance-QA-Vision-100k · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/sujet-ai/Sujet-Finance-QA-Vision-100k
116.
jimmycarter/textocr-gpt4v · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/jimmycarter/textocr-gpt4v
117.
Singh A, Natarajan V, Shah M, Jiang Y, Chen X, Batra D, et al. Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition [Internet]. 2019 [cited 2025 Sep 1]. p. 8317–26. Available from: http://openaccess.thecvf.com/content_CVPR_2019/html/Singh_Towards_VQA_Models_That_Can_Read_CVPR_2019_paper.html
118.
Ye J, Hu A, Xu H, Ye Q, Yan M, Xu G, et al. UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [Internet]. arXiv; 2023 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2310.05126
back: 1, 2, 3, 4
119.
Tanaka R, Nishida K, Yoshida S. Visualmrc: Machine reading comprehension on document images. In: Proceedings of the AAAI Conference on Artificial Intelligence [Internet]. 2021 [cited 2025 Sep 1]. p. 13878–88. Available from: https://ojs.aaai.org/index.php/AAAI/article/view/17635
120.
andito/ai2d-merged · Datasets at Hugging Face [Internet]. [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/andito/ai2d-merged
121.
He X, Zhang Y, Mou L, Xing E, Xie P. PathVQA: 30000+ Questions for Medical Visual Question Answering [Internet]. arXiv; 2020 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2003.10286
122.
Lu P, Mishra S, Xia T, Qiu L, Chang K-W, Zhu S-C, et al. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems [Internet]. 2022 [cited 2025 Sep 1];35:2507–21. Available from: https://proceedings.neurips.cc/paper_files/paper/2022/hash/11332b6b6cf4485b84afadb1352d3a9a-Abstract-Conference.html
123.
Kembhavi A, Seo M, Schwenk D, Choi J, Farhadi A, Hajishirzi H. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern recognition [Internet]. 2017 [cited 2025 Sep 1]. p. 4999–5007. Available from: http://openaccess.thecvf.com/content_cvpr_2017/html/Kembhavi_Are_You_Smarter_CVPR_2017_paper.html
124.
Jia Y, Li J, Yue X, Li B, Nie P, Zou K, et al. VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search [Internet]. arXiv; 2025 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2503.10582
125.
Lau JJ, Gayen S, Ben Abacha A, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Scientific data [Internet]. 2018 [cited 2025 Sep 1];5(1):1–10. Available from: https://www.nature.com/articles/sdata2018251
126.
Zheng T, Zhang G, Shen T, Liu X, Lin BY, Fu J, et al. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement [Internet]. arXiv; 2025 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2402.14658
back: 1, 2
127.
Zhang B-W, Yan Y, Li L, Liu G. Infinity Math: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning. In: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management [Internet]. Boise ID USA: ACM; 2024 [cited 2025 Sep 1]. p. 5405–9. Available from: https://dl.acm.org/doi/10.1145/3627673.3679122
128.
Yue X, Qu X, Zhang G, Fu Y, Huang W, Sun H, et al. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning [Internet]. arXiv; 2023 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2309.05653
129.
Amini A, Gabriel S, Lin P, Koncel-Kedziorski R, Choi Y, Hajishirzi H. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms [Internet]. arXiv; 2019 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/1905.13319
130.
Lai X, Tian Z, Chen Y, Yang S, Peng X, Jia J. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Internet]. arXiv; 2024 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2406.18629
131.
AI-MO/NuminaMath-CoT · Datasets at Hugging Face [Internet]. 2025 [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT
132.
teknium/OpenHermes-2.5 · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/teknium/OpenHermes-2.5
133.
Open-Orca/OpenOrca · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/Open-Orca/OpenOrca
134.
Mitra A, Khanpour H, Rosset C, Awadallah A. Orca-Math: Unlocking the potential of SLMs in Grade School Math [Internet]. arXiv; 2024 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2402.14830
135.
flytech/python-codes-25k · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/flytech/python-codes-25k
136.
sahil2801/CodeAlpaca-20k · Datasets at Hugging Face [Internet]. 2023 [cited 2025 Sep 2]. Available from: https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k
137.
qywu/ruozhiba_en · Datasets at Hugging Face [Internet]. [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/qywu/ruozhiba_en
138.
Chen W, Yin M, Ku M, Lu P, Wan Y, Ma X, et al. TheoremQA: A Theorem-driven Question Answering dataset [Internet]. arXiv; 2023 [cited 2025 Sep 1]. Available from: http://arxiv.org/abs/2305.12524
139.
WizardLMTeam/WizardLM_evol_instruct_70k · Datasets at Hugging Face [Internet]. 2024 [cited 2025 Sep 1]. Available from: https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_70k
140.
Toshniwal S, Du W, Moshkov I, Kisacanin B, Ayrapetyan A, Gitman I. OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data [Internet]. arXiv; 2024 [cited 2025 Sep 2]. Available from: http://arxiv.org/abs/2410.01560
141.
Li Z, Chen G, Liu S, Wang S, VS V, Ji Y, et al. Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models [Internet]. 2025. Available from: https://arxiv.org/abs/2501.14818

FineVision:
Open Data Is All You Need

Authors

Affiliations

Published

PDF