1Stanford University, 2Tsinghua University, 3University of North Carolina at Chapel Hill, 4Princeton University, 5KTH Royal Institute of Technology, 6Chan Zuckerberg Biohub Network
Two key themes in AI research are reasoning in large language models (LLMs) and
applying LLMs to scientific research. Our work bridges both by introducing MicroVQA, a benchmark that evaluates LLM reasoning on multiple-choice
questions about microscopy images, created by expert biologists.
Contributions:
Manually created by practicing biologists, these questions reflect tasks where LLMs
could meaningfully assist biological research, and each requires multimodal reasoning.
We also advance the practice of AI benchmarking with our RefineBot method,
which makes multiple-choice questions more challenging by using LLM feedback to remove language shortcuts.
Key Elements of MicroVQA
Useful for real scientific researchers: MicroVQA has 1042 questions from
three tasks that are important to scientific research. The tasks are: expert visual understanding,
hypothesis generation, and experimental proposal.
Difficult reasoning: The questions require domain knowledge and multistep reasoning at a research level of difficulty. Existing vision-language benchmarks for science fall short on one of these dimensions: for example, MMMU-Pro has complex reasoning but at the college level (it is drawn from exams), while OmnimedVQA is graduate-level but requires less complex reasoning.
High-quality, human-created questions: The questions were created and verified by 11
experts in diverse biological research disciplines. Each question took on average 30 minutes to write.
Challenging for current LLMs: We tested state-of-the-art MLLMs and specialized medical
models on MicroVQA, with the top model scoring only 52.8% (leaderboard).
Multimodal: There are text-only benchmarks that are both research-level and require difficult reasoning – GPQA and (most of) LAB-Bench. But scientific research is multimodal, so we create questions that need image understanding. To make sure that MicroVQA is vision-centric, we introduce the RefineBot method to remove language shortcuts from multiple-choice questions.
Explore the benchmark here:
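You can also load the benchmark programmatically. Here is a minimal sketch, assuming the data is hosted on the Hugging Face Hub; the dataset ID, split name, and field names are placeholders rather than the official schema:

```python
# Minimal sketch of loading MicroVQA and inspecting one question. The Hugging
# Face dataset ID, split name, and field names are placeholders; check the
# project page for the official location and schema.
from datasets import load_dataset

dataset = load_dataset("your-org/MicroVQA", split="test")  # placeholder ID and split

sample = dataset[0]
print(sample["question"])   # question stem referring to the microscopy image(s)
print(sample["choices"])    # answer options: one correct answer plus distractors
print(sample["answer"])     # the correct option
sample["image"].show()      # the associated microscopy image (a PIL image)
```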
Building MicroVQA
Why build MicroVQA?
Scientific research involves reasoning about complex data while integrating domain-specific knowledge. A
lot of recent excitement around LLMs comes from the hope that they can augment humans, either as chat
assistants [1,2,3,4] or as agents [1,2,3,4]. There are some compelling visions for how these systems might look [1,2], but there is now a pressing need to define concrete tasks that would be useful for real scientists - we need realistic and challenging benchmarks.
Defining the tasks. Step 1 was to define tasks, or categories of
questions to guide benchmark collection. We developed them with 8 collaborators over ~16h of discussions. The tasks are:
Expert visual understanding, which commonly involves anomaly detection or image comparisons. Analysis
must consider the sample preparation context and expert knowledge is needed to evaluate biological
features and technical artifacts.
Hypothesis generation considers the experimental context and tries to find an explanation for
what we see in the image. It requires abductive reasoning, since you need to select from many possible
hypotheses given incomplete information.
Experimental proposal requires deciding on next steps to validate a given hypothesis.
It requires knowledge about experimental protocols and reasoning about whether the experiment would prove
that hypothesis.
RefineBot: Making MCQs That Test Reasoning and Remove Language Shortcuts
When building QA & VQA evaluations, you typically start with a question string and answer string written by a person. Then it's common to use an LLM to transform it into a good multiple-choice question (MCQ), including the wrong answers that are called 'distractors'. Here we'll use a very easy biology question to demonstrate.
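For concreteness, a rough sketch of this naive QA-to-MCQ conversion might look like the following; the prompt wording, the example question, and the model choice are our own illustration, not the exact pipeline used for MicroVQA:

```python
# Rough sketch of the naive QA -> MCQ conversion described above. The prompt
# wording and the example question are illustrative only.
from openai import OpenAI

client = OpenAI()

def qa_to_mcq(question: str, answer: str, n_distractors: int = 3) -> str:
    prompt = (
        "Rewrite this question-answer pair as a multiple-choice question.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Write {n_distractors} plausible but incorrect distractors, list all "
        "options as lettered choices in random order, and state the correct letter."
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# An easy biology question, for illustration only:
print(qa_to_mcq(
    "Which organelle shown in this image produces most of the cell's ATP?",
    "The mitochondrion",
))
```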
We found that this naive approach leads to questions that are very easy for LLMs to solve - GPT-4o could
get 90% of questions correct, even without the image. We were confident that the original questions really do need the
image, so we concluded that the question generation process was introducing language shortcuts, meaning LLMs could
cheat and solve them using test-taking strategies. We identified 3 types of shortcuts:
Shortcut 1 - visual giveaway. The text 'gives away' the image content so it's trivial to answer the
question.
Shortcut 2 - weak distractors. The LLM generates implausible or weak distractors that are easy to eliminate.
Shortcut 3 - language bias. The LLM can make an educated guess based on its learned prior [1].
We created RefineBot to fix this (code here). The key
idea is that if you give the question to the LLM without the image (box 1 below), then the LLM's
chain-of-thought response can show you what language shortcut it used (box 2). Then you can simply prompt
the LLM to rewrite the question in a way that removes the shortcut (boxes 3 and 4).
The final RefineBot system applies this basic idea in a loop, with an extra check that the revised
question matches the original question.
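A minimal sketch of that loop might look like this. It is a simplified illustration, not the released RefineBot code: the prompts, the correctness check, and the model choice are all assumptions.

```python
# Simplified sketch of the RefineBot loop: answer the MCQ without the image,
# use the chain of thought to expose a language shortcut, rewrite the question
# to remove it, and check the revision still matches the original question.
from openai import OpenAI

client = OpenAI()

def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def solved_without_image(mcq: str, correct_letter: str) -> tuple[bool, str]:
    reasoning = call_llm(
        "Answer this multiple-choice question WITHOUT seeing the image it refers to. "
        f"Think step by step, then end with 'Answer: <letter>'.\n\n{mcq}"
    )
    # Crude check: does the final line contain the correct letter?
    return correct_letter.upper() in reasoning.strip().splitlines()[-1].upper(), reasoning

def refine_mcq(mcq: str, correct_letter: str, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        solved, reasoning = solved_without_image(mcq, correct_letter)
        if not solved:
            return mcq  # no obvious language shortcut left to exploit
        revised = call_llm(
            "The reasoning below solved this question without the image, so the text "
            "contains a language shortcut. Identify the shortcut, then rewrite the "
            "question and answer options to remove it while keeping the same meaning "
            "and the same correct answer.\n\n"
            f"Question:\n{mcq}\n\nReasoning:\n{reasoning}"
        )
        # Extra check: the revision must still test the same underlying question.
        same = call_llm(
            "Do these two multiple-choice questions test the same thing? "
            f"Answer yes or no.\n\nA:\n{mcq}\n\nB:\n{revised}"
        )
        if same.strip().lower().startswith("yes"):
            mcq = revised
    return mcq
```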
While we designed this to deal with language shortcuts in particular, we think that a similar strategy
could make any multiple-choice question harder.
Testing MLLMs on MicroVQA
We tested MicroVQA on current frontier models:
MicroVQA is challenging for all existing multimodal LLMs: every model scores below 53%.
There is little gap between the top closed- and open-source models.
For a given model family, there is very little difference between a larger model like Qwen2-VL-72B
and its smaller counterpart Qwen2-VL-7B. Note that the smaller models often have the same vision
encoder.
Standard medical fine-tuning (as in LLaVA-Med compared with LLaVA-Mistral-7B) does improve performance a bit.
The human baseline was 51% - this reflects that biology is very specialized, so even expert humans
will get many questions wrong.
Starred models (*), GPT-4o and Claude-3.5-Sonnet, were used by RefineBot during MCQ generation. This could induce
a small bias that makes their final performance relatively worse.
We perform 'no-image' and 'choices-only' ablations in the paper.
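For reference, a sketch of how such ablations can be run is below; the field names, the `answer_fn` callable, and the letter-format answers are illustrative placeholders, not the official evaluation harness:

```python
# Sketch of the 'no-image' and 'choices-only' ablations: score a model on the
# MCQs while withholding parts of the input. Field names, `answer_fn`, and the
# letter-format answers are placeholders.

def build_prompt(sample: dict, mode: str) -> str:
    choices = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(sample["choices"]))
    if mode in ("full", "no_image"):   # same text; "full" also attaches the image
        return f"{sample['question']}\n\n{choices}\n\nAnswer with a single letter."
    if mode == "choices_only":         # no question stem and no image
        return f"{choices}\n\nAnswer with a single letter."
    raise ValueError(f"unknown mode: {mode}")

def accuracy(dataset, answer_fn, mode: str) -> float:
    correct = 0
    for sample in dataset:
        image = sample["image"] if mode == "full" else None   # withhold the image otherwise
        pred = answer_fn(build_prompt(sample, mode), image)   # returns a letter, e.g. "B"
        correct += pred.strip().upper().startswith(sample["answer"].upper())
    return correct / len(dataset)
```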
What do MLLMs get wrong?
Understanding error modes.
We manually reviewed 30 random samples of errors made by Claude-3.5-Sonnet, spending about 45 minutes per sample to properly understand the failures. Here is the breakdown:
50% were perception errors - failing to interpret the image. Many
responses rely on the 'language bias' towards common image content that we discussed above.
30% were misconception (knowledge) errors about nuanced
biomedical knowledge.
13% were over-generalization or over-simplification - the model
answers a less-specific and less-nuanced version of the actual question.
7% were hallucinations about the question text, introduced in the
chain-of-thought response.
This suggests that the most important next steps for stronger microscopy MLLMs are: (1) to improve microscopy-specific image
understanding in MLLMs, and (2) to improve knowledge, possibly with some RAG-based methods.
Bloom's Taxonomy to Measure Reasoning Difficulty
MicroVQA's focus on high reasoning levels.
We argued that prior MLLM benchmarks for science tend to emphasize recall and basic comprehension, probably
because they are derived from educational exams and textbooks. Our goal was to test more challenging scientific
reasoning by recruiting experts to create questions that reflect real research tasks, like analyzing microscopy
images in novel experimental contexts, and evaluating hypotheses.
Bloom's taxonomy.
To quantify this more challenging reasoning, we used Bloom's taxonomy [1], which classifies cognitive skills into six levels, from basic recall to complex reasoning. While multiple-choice questions (MCQs) cannot assess the highest level, 'creation', they effectively test comprehension, application, analysis, and evaluation [1,2,3].
We trained an LLM to classify MCQs into Bloom's levels. This lets us compare reasoning levels between different vision-language MCQ benchmarks in science:
Composition of scientific MLLM benchmarks by estimated Bloom's taxonomy level.
Among research-level benchmarks, MicroVQA has the highest-level reasoning.
The next highest are LAB-Bench at the research level and MMMU-Pro at the college level.
Another finding is that among benchmarks with higher-level reasoning, data scale becomes very challenging - harder questions are harder to collect. Specifically, MicroVQA has 1,042 samples, LAB-Bench's vision-language subset has only 181, and MMMU-Pro has 1,730.
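As a rough illustration of the LLM-based Bloom's classifier mentioned above, the sketch below uses simple prompting; the classifier used in the paper may be trained or calibrated differently, and the level wording is our own paraphrase:

```python
# Sketch of an LLM-based classifier that assigns an MCQ to a Bloom's taxonomy
# level. The prompt and level names are illustrative only.
from openai import OpenAI

client = OpenAI()

BLOOM_LEVELS = (
    "1 Recall", "2 Comprehension", "3 Application",
    "4 Analysis", "5 Evaluation", "6 Creation",
)

def bloom_level(mcq: str, model: str = "gpt-4o") -> str:
    prompt = (
        "Classify the cognitive skill needed to answer this multiple-choice question "
        f"into one Bloom's taxonomy level: {', '.join(BLOOM_LEVELS)}. "
        "Reply with the level only.\n\n" + mcq
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()
```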