Evaluating Reading Ability of Large Multimodal Models


Adobe Research · University at Buffalo

The Multi-Modal Reading (MMR) benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, spatial relations, and grounding, together with carefully designed evaluation metrics.

Figure: Question distributions of MMR.
Figure: Performance of different methods.

Abstract

Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. However, most existing text-rich image benchmarks consist of simple extraction-based question answering, on which many LMMs now easily achieve high scores. As a result, these benchmarks no longer reliably distinguish between models, which motivates a new benchmark for evaluating complex reasoning and spatial understanding. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. Evaluating several state-of-the-art LMMs, including GPT-4o, reveals the limited capabilities of existing models and underscores the value of our benchmark.
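To make the evaluation setup concrete, below is a minimal sketch of a scoring loop over MMR-style question-answer pairs. The annotation fields ("image", "question", "answer", "task") and the ask_model helper are illustrative assumptions, not the benchmark's actual schema or API, and exact-match scoring stands in for the task-specific metrics.

    # Hypothetical sketch of a scoring loop over MMR-style QA pairs.
    # The annotation fields and ask_model() are illustrative assumptions,
    # not the benchmark's actual schema or API.
    import json
    from collections import defaultdict

    def ask_model(image_path: str, question: str) -> str:
        """Placeholder for querying the LMM under evaluation."""
        raise NotImplementedError

    def evaluate(annotation_file: str) -> dict:
        with open(annotation_file) as f:
            # Assumed format: list of {"image", "question", "answer", "task"}
            qa_pairs = json.load(f)
        per_task_correct = defaultdict(int)
        for qa in qa_pairs:
            prediction = ask_model(qa["image"], qa["question"])
            # Exact match stands in for the task-specific metrics
            # (e.g., the grounding tasks score predicted bounding boxes).
            if prediction.strip().lower() == qa["answer"].strip().lower():
                per_task_correct[qa["task"]] += 1
        return dict(per_task_correct)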

[UPDATE 08/26] Initial Paper Release of MMR.

Leaderboard

The MMR benchmark evaluates the reasoning and spatial understanding capabilities of LMMs in text-rich image comprehension. The scores below show how well different models perform and offer guidance for model selection in real-world applications.

Task groups: Text (Label, Pos) · Font (Size, Color) · Localization (Obj, Text) · Spatial Relation (O-T, O-O, T-T) · Grounding (O-Box, T-PNLS, T-Box)
Model Size Label Pos Size Color Obj Text O-T O-O T-T O-Box T-PNLS T-Box Total
Claude 3.5 Sonnet - 42 31 43 46 38 45 39 36 46 39 47 11 463
GPT-4o - 46 34 43 41 40 42 34 37 40 33 46 21 457
GPT-4V - 43 33 43 40 37 38 26 26 45 26 48 10 415
LLaVA-NEXT 34B 39 27 42 39 39 39 28 31 46 40 37 5 412
Phi-3-Vision 4B 40 34 42 39 41 42 31 33 42 38 13 2 397
InternVL2 8B 42 30 46 44 39 42 27 33 45 15 5 0 368
Qwen-vl-max - 39 27 41 36 34 33 26 32 37 24 32 5 366
LLaVA-NEXT 13B 36 27 37 33 38 38 23 31 37 39 2 0 335
Qwen-vl-plus - 38 23 32 35 26 23 24 23 27 34 22 3 310
Idefics-2 8B 36 23 36 29 31 27 20 21 33 0 0 0 256
LLaVA 1.5 13B 30 10 25 20 32 17 16 24 26 33 0 4 243
InternVL2 1B 35 29 32 24 28 25 17 27 19 0 1 0 237
Monkey-Chat 7B 36 22 33 27 26 16 9 18 27 0 0 0 214
Idefics 80B 0 1 21 20 21 17 20 19 20 0 0 0 139
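For reference, the Total column is the sum of the twelve per-task scores; the short check below reproduces that arithmetic for the Claude 3.5 Sonnet row, with task keys copied from the column headers.

    # The Total column is the sum of the twelve per-task scores.
    # Numbers copied from the Claude 3.5 Sonnet row of the leaderboard.
    claude_scores = {
        "Label": 42, "Pos": 31, "Size": 43, "Color": 46,
        "Obj": 38, "Text": 45,
        "O-T": 39, "O-O": 36, "T-T": 46,
        "O-Box": 39, "T-PNLS": 47, "T-Box": 11,
    }
    print(sum(claude_scores.values()))  # 463, matching the Total column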

BibTeX


@article{chen2024mmr,
  title={MMR: Evaluating Reading Ability of Large Multimodal Models},
  author={Chen, Jian and Zhang, Ruiyi and Zhou, Yufan and Rossi, Ryan and Gu, Jiuxiang and Chen, Changyou},
  journal={arXiv preprint arXiv:2408.14594},
  year={2024}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and we thank the open-source projects Alpaca, Vicuna, and LLaVA.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, GPT-4, and LLaVA. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.