The MMR benchmark evaluates the reasoning and spatial understanding capabilities of large multimodal models (LMMs) on text-rich image comprehension. The per-task scores below highlight where each model is strong or weak, offering guidance on model selection for real-world applications.
Task columns are grouped as Text (Label, Pos), Font (Size, Color), Localization (Obj, Text), Spatial Relation (O-T, O-O, T-T), and Grounding (O-Box, T-PNLS, T-Box).

Model | Size | Label | Pos | Size | Color | Obj | Text | O-T | O-O | T-T | O-Box | T-PNLS | T-Box | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude 3.5 Sonnet | - | 42 | 31 | 43 | 46 | 38 | 45 | 39 | 36 | 46 | 39 | 47 | 11 | 463 |
GPT-4o | - | 46 | 34 | 43 | 41 | 40 | 42 | 34 | 37 | 40 | 33 | 46 | 21 | 457 |
GPT-4V | - | 43 | 33 | 43 | 40 | 37 | 38 | 26 | 26 | 45 | 26 | 48 | 10 | 415 |
LLaVA-NEXT | 34B | 39 | 27 | 42 | 39 | 39 | 39 | 28 | 31 | 46 | 40 | 37 | 5 | 412 |
Phi-3-Vision | 4B | 40 | 34 | 42 | 39 | 41 | 42 | 31 | 33 | 42 | 38 | 13 | 2 | 397 |
InternVL2 | 8B | 42 | 30 | 46 | 44 | 39 | 42 | 27 | 33 | 45 | 15 | 5 | 0 | 368 |
Qwen-vl-max | - | 39 | 27 | 41 | 36 | 34 | 33 | 26 | 32 | 37 | 24 | 32 | 5 | 366 |
LLaVA-NEXT | 13B | 36 | 27 | 37 | 33 | 38 | 38 | 23 | 31 | 37 | 39 | 2 | 0 | 335 |
Qwen-vl-plus | - | 38 | 23 | 32 | 35 | 26 | 23 | 24 | 23 | 27 | 34 | 22 | 3 | 310 |
Idefics-2 | 8B | 36 | 23 | 36 | 29 | 31 | 27 | 20 | 21 | 33 | 0 | 0 | 0 | 256 |
LLaVA 1.5 | 13B | 30 | 10 | 25 | 20 | 32 | 17 | 16 | 24 | 26 | 33 | 0 | 4 | 243 |
InternVL2 | 1B | 35 | 29 | 32 | 24 | 28 | 25 | 17 | 27 | 19 | 0 | 1 | 0 | 237 |
Monkey-Chat | 7B | 36 | 22 | 33 | 27 | 26 | 16 | 9 | 18 | 27 | 0 | 0 | 0 | 214 |
Idefics | 80B | 0 | 1 | 21 | 20 | 21 | 17 | 20 | 19 | 20 | 0 | 0 | 0 | 139 |
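For readers working with the table programmatically, the Total column is simply the sum of the twelve per-task scores. A minimal sketch that checks this for the first two rows of the table above (the tuple layout is an illustrative transcription of the leaderboard, not an official data format):

```python
# Sanity-check the leaderboard: Total should equal the sum of the twelve
# per-task scores. Each tuple mirrors the table columns:
# (model, size, 12 task scores, total).
rows = [
    ("Claude 3.5 Sonnet", "-", 42, 31, 43, 46, 38, 45, 39, 36, 46, 39, 47, 11, 463),
    ("GPT-4o",            "-", 46, 34, 43, 41, 40, 42, 34, 37, 40, 33, 46, 21, 457),
]

for model, _size, *scores, reported_total in rows:
    computed = sum(scores)                      # e.g. 42+31+...+11 = 463
    assert computed == reported_total, f"total mismatch for {model}"
    print(f"{model}: {computed} (matches reported total)")
```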
@article{chen2024mmr,
title={MMR: Evaluating Reading Ability of Large Multimodal Models},
author={Chen, Jian and Zhang, Ruiyi and Zhou, Yufan and Rossi, Ryan and Gu, Jiuxiang and Chen, Changyou},
journal={arXiv preprint arXiv:2408.14594},
year={2024}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Alpaca, Vicuna, and LLaVA.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, GPT-4, and LLaVA. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset must not be used outside of research purposes.