| Random | - | 22.0 | 21.9 | 21.8 | 21.9 |
| Llama-3.2-11b-Vision-Instruct | Small | 30.3 | 32.4 | 29.3 | 28.7 |
| LLaVA-Mistral-7B | Medical | 39.8 | 31.6 | 43.1 | 37.1 |
| VILA1.5-13b | Small | 41.8 | 41.8 | 47.5 | 40.9 |
| Llama-3.2-90b-Vision-Instruct | Large | 42.4 | 44.9 | 42.1 | 38.7 |
| LLaVA-Med-Mistral-7B | Medical | 43.0 | 37.3 | 47.1 | 41.6 |
| Llama-3.1-Nemotron-70b-Instruct | Large | 44.2 | 44.9 | 43.3 | 44.8 |
| Pixtral-12b | Small | 45.6 | 46.9 | 44.8 | 44.8 |
| *GPT-4o | Large | 45.6 | 48.7 | 43.1 | 44.8 |
| GPT-4o-mini | Small | 46.2 | 48.5 | 43.6 | 47.0 |
| Gemini-Flash-1.5-8b | Small | 46.7 | 48.7 | 43.6 | 49.1 |
| Claude-3.5-Haiku | Small | 47.1 | 48.0 | 43.8 | 51.7 |
| VILA1.5-40b | Large | 47.5 | 47.2 | 47.9 | 47.4 |
| Qwen-2-vl-72b-Instruct | Large | 47.5 | 49.2 | 45.7 | 47.8 |
| Grok-2-Vision | Large | 48.4 | 50.3 | 46.4 | 48.7 |
| Qwen-2-VL-7b | Small | 48.8 | 54.1 | 43.3 | 49.6 |
| Pixtral-Large | Large | 49.8 | 50.8 | 49.5 | 48.7 |
| Human | - | 50.3 | 52.7 | 47.5 | 51.4 |
| Gemini-Pro-1.5 | Large | 51.1 | 52.0 | 50.2 | 50.9 |
| *Claude-3.5-Sonnet | Large | 51.7 | 54.1 | 50.2 | 50.4 |
| o1 | Reasoning | 52.8 | 55.4 | 50.2 | 53.0 |
| Claude Sonnet 4.5 | Reasoning | 54.4 | 55.1 | 56.4 | 49.6 |
| o4-mini | Reasoning | 55.6 | 57.9 | 56.1 | 50.4 |
| o3 | Reasoning | 59.3 | 61.5 | 60.5 | 53.5 |
| GPT-5 | Reasoning | 59.4 | 63.3 | 58.9 | 53.9 |