Evaluating Reading Ability of Large Multimodal Models


Adobe Research · University at Buffalo

The Multi-Modal Reading (MMR) benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, spatial relations, and grounding, together with carefully designed evaluation metrics.

Figure: Question distributions of MMR.
Figure: Performance of different methods.

Abstract

Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. However, most existing text-rich image benchmarks consist of simple extraction-based question answering, on which many LMMs now easily achieve high scores. As a result, these benchmarks no longer reliably distinguish between models, which motivates a new benchmark for evaluating complex reasoning and spatial understanding. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. Evaluating several state-of-the-art LMMs, including GPT-4o, reveals the limited capabilities of existing models and underscores the value of our benchmark.
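To make the evaluation setup concrete, below is a minimal sketch of a scoring loop over MMR-style question-answer pairs. The annotation fields ("image", "question", "answer", "task") and the ask_model helper are illustrative assumptions, not the benchmark's actual schema or API, and exact-match scoring stands in for the task-specific metrics.

    # Hypothetical sketch of a scoring loop over MMR-style QA pairs.
    # The annotation fields and ask_model() are illustrative assumptions,
    # not the benchmark's actual schema or API.
    import json
    from collections import defaultdict

    def ask_model(image_path: str, question: str) -> str:
        """Placeholder for querying the LMM under evaluation."""
        raise NotImplementedError

    def evaluate(annotation_file: str) -> dict:
        with open(annotation_file) as f:
            # Assumed format: list of {"image", "question", "answer", "task"}
            qa_pairs = json.load(f)
        per_task_correct = defaultdict(int)
        for qa in qa_pairs:
            prediction = ask_model(qa["image"], qa["question"])
            # Exact match stands in for the task-specific metrics
            # (e.g., the grounding tasks score predicted bounding boxes).
            if prediction.strip().lower() == qa["answer"].strip().lower():
                per_task_correct[qa["task"]] += 1
        return dict(per_task_correct)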

[UPDATE 08/26] Initial Paper Release of MMR.

Leaderboard

The MMR benchmark evaluates the reasoning and spatial understanding capabilities of LMMs in text-rich image comprehension. The scores below show how well different models perform and offer guidance for model selection in real-world applications.

Task groups: Text (Label, Pos) · Font (Size, Color) · Localization (Obj, Text) · Spatial Relation (O-T, O-O, T-T) · Grounding (O-Box, T-PNLS, T-Box)
Model Size Label Pos Size Color Obj Text O-T O-O T-T O-Box T-PNLS T-Box Total
Claude 3.5 Sonnet - 42 31 43 46 38 45 39 36 46 39 47 11 463
GPT-4o - 46 34 43 41 40 42 34 37 40 33 46 21 457
GPT-4V - 43 33 43 40 37 38 26 26 45 26 48 10 415
LLaVA-NEXT 34B 39 27 42 39 39 39 28 31 46 40 37 5 412
Phi-3-Vision 4B 40 34 42 39 41 42 31 33 42 38 13 2 397
InternVL2 8B 42 30 46 44 39 42 27 33 45 15 5 0 368
Qwen-vl-max - 39 27 41 36 34 33 26 32 37 24 32 5 366
LLaVA-NEXT 13B 36 27 37 33 38 38 23 31 37 39 2 0 335
Qwen-vl-plus - 38 23 32 35 26 23 24 23 27 34 22 3 310
Idefics-2 8B 36 23 36 29 31 27 20 21 33 0 0 0 256
LLaVA 1.5 13B 30 10 25 20 32 17 16 24 26 33 0 4 243
InternVL2 1B 35 29 32 24 28 25 17 27 19 0 1 0 237
Monkey-Chat 7B 36 22 33 27 26 16 9 18 27 0 0 0 214
Idefics 80B 0 1 21 20 21 17 20 19 20 0 0 0 139
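For reference, the Total column is the sum of the twelve per-task scores; the short check below reproduces that arithmetic for the Claude 3.5 Sonnet row, with task keys copied from the column headers.

    # The Total column is the sum of the twelve per-task scores.
    # Numbers copied from the Claude 3.5 Sonnet row of the leaderboard.
    claude_scores = {
        "Label": 42, "Pos": 31, "Size": 43, "Color": 46,
        "Obj": 38, "Text": 45,
        "O-T": 39, "O-O": 36, "T-T": 46,
        "O-Box": 39, "T-PNLS": 47, "T-Box": 11,
    }
    print(sum(claude_scores.values()))  # 463, matching the Total column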

BibTeX


@article{chen2024mmr,
  title={MMR: Evaluating Reading Ability of Large Multimodal Models},
  author={Chen, Jian and Zhang, Ruiyi and Zhou, Yufan and Rossi, Ryan and Gu, Jiuxiang and Chen, Changyou},
  journal={arXiv preprint arXiv:2408.14594},
  year={2024}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and we thank the open-source projects Alpaca, Vicuna, and LLaVA.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, GPT-4, and LLaVA. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.