PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

EACL 2026 (European Chapter of the Association for Computational Linguistics)

James Burgess1, Jan Niklas Hansen1, Duo Peng2, Yuhui Zhang1, Alejandro Lozano1, Min Woo Sun1, Emma Lundberg1,2,3, Serena Yeung-Levy1,2

1Stanford University, 2Chan Zuckerberg Biohub Network, 3KTH Royal Institute of Technology

We release an environment for training search agents over scientific literature using reinforcement learning with verifiable rewards (RLVR): a corpus of 16 million biomedical abstracts, a 60k QA dataset, and evaluation benchmarks, all compatible with the Search-R1 codebase. Our data creation methods are scalable and extend readily to other scientific domains.

Search Agents

Search agents are LLMs that interleave reasoning and retrieval to answer questions. Rather than relying on fixed retrieval pipelines, they learn search strategies through RL, supervised only on final answer correctness. This capability is essential for building AI systems that can autonomously navigate and reason over scientific literature.

Overview of a search agent interleaving reasoning and retrieval
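
The loop below is a minimal sketch of this interleaving, assuming a Search-R1-style tag protocol (<think>, <search>, <information>, <answer>). The generate and retrieve callables stand in for any LLM backend and any retriever over the abstract corpus, and the prompt wording is illustrative rather than the training prompt from the paper.

import re

def run_search_agent(question, generate, retrieve, max_turns=4):
    """Minimal interleaved reason/search loop (Search-R1-style tag protocol)."""
    prompt = (
        "Answer the question. Think inside <think>...</think> tags, issue search "
        "queries inside <search>...</search> tags, and give a short final answer "
        f"inside <answer>...</answer> tags.\nQuestion: {question}\n"
    )
    for _ in range(max_turns):
        output = generate(prompt)          # one reasoning/search step from the LLM
        prompt += output
        done = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
        if done:                           # the model committed to a final answer
            return done.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", output, re.DOTALL)
        if query:                          # retrieve and feed passages back in
            passages = retrieve(query.group(1).strip())
            prompt += f"\n<information>{passages}</information>\n"
    return None                            # no answer within the turn budget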

What We Release

Artifacts

We release an RLVR training environment for scientific paper QA:

- A retrieval corpus of 16 million biomedical paper abstracts
- A 60k QA dataset for RLVR training, generated from those abstracts
- Evaluation benchmarks, including a reformatted version of BioASQ compatible with the Search-R1 codebase
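
The reward in this environment is verifiable: a rollout is scored only on whether its final answer matches the reference, with no supervision of intermediate searches. Below is a minimal sketch of such an outcome-only reward, assuming SQuAD-style exact-match normalization; the paper's exact scoring rule may differ.

import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def outcome_reward(prediction, gold_answers):
    """Return 1.0 if the final answer exactly matches any reference, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))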

Data Creation Pipeline

We also release a data creation pipeline for constructing QA training data from paper abstracts. It requires only a corpus of abstracts and access to an LLM.

Data generation pipeline
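
As a rough sketch of one pipeline step, a single LLM call can turn an abstract plus a target category into a QA pair. The prompt wording, JSON schema, and call_llm interface below are illustrative assumptions, not the paper's implementation.

import json

# Illustrative prompt; the paper's actual prompts and output schema differ.
QA_PROMPT = """Write one question of type "{category}" that can be answered from the
abstract below, with a short, verifiable answer. Reply as JSON:
{{"question": "...", "answer": "..."}}

Abstract:
{abstract}"""

def generate_qa(abstract, category, call_llm):
    """Turn one abstract into one QA item; call_llm(prompt) -> str is any LLM API."""
    raw = call_llm(QA_PROMPT.format(category=category, abstract=abstract))
    item = json.loads(raw)
    return {"question": item["question"], "answer": item["answer"], "category": category}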

Question Categories

Our question categories were defined with domain experts in biomedicine. To extend this pipeline to other fields, define new categories relevant to your domain.

Question categories
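
As an illustration only, extending the pipeline could amount to supplying a new category-to-guidance mapping and looping it through a generation step like the generate_qa sketch above; the category names and descriptions here are hypothetical placeholders, not the paper's biomedical categories.

# Hypothetical categories for a new domain; each name maps to guidance text
# that is slotted into the generation prompt.
NEW_DOMAIN_CATEGORIES = {
    "mechanism": "asks how or why an effect reported in the abstract occurs",
    "quantitative_result": "asks for a specific number reported in the abstract",
    "entity": "asks which system, material, or dataset the abstract studies",
}

# Usage with the generate_qa sketch above; corpus_abstracts and call_llm are yours.
qa_dataset = [
    generate_qa(abstract, category, call_llm)
    for abstract in corpus_abstracts
    for category in NEW_DOMAIN_CATEGORIES
]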

Results

RL-trained search agents (Search-R1) substantially outperform retrieval baselines. With Qwen2.5-7B: 51.0% on PaperSearchQA vs 36.5% for RAG (+14.5 pts). Agents trained on PaperSearchQA also generalize to BioASQ, a human-created biomedical QA benchmark: 44.8% vs 29.7% for RAG (+15.1 pts). We release a reformatted version of BioASQ compatible with the Search-R1 codebase.

Learned Behaviors

Through qualitative analysis of reasoning traces, we observe agents learning to plan searches, reason before retrieving, and verify their own knowledge:

Three emergent agent behaviors observed in reasoning traces

BibTeX

@misc{burgess2026papersearchqalearningsearchreason,
      title={PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR},
      author={James Burgess and Jan N. Hansen and Duo Peng and Yuhui Zhang and Alejandro Lozano and Min Woo Sun and Emma Lundberg and Serena Yeung-Levy},
      year={2026},
      eprint={2601.18207},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.18207},
}