ResearchBench

Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

¹Fudan University · ²Shanghai AI Laboratory · ³Nanyang Technological University · ⁴UNSW · ⁵NUS · ⁶Wuhan University

*Equal contribution · Corresponding author

liuyujie.cs@gmail.com · {zonglin001,cambria}@ntu.edu.sg · zhoudongzhan@pjlab.org.cn

ResearchBench is accepted to ACL 2026 (Findings). Thanks to all my collaborators! 🎉🎉🎉

Can LLMs discover new scientific hypotheses?

Large language models (LLMs) have shown potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs on a sufficient set of scientific discovery sub-tasks—inspiration retrieval, hypothesis composition, and hypothesis ranking—where sufficient means that perfectly solving these sub-tasks perfectly solves the overall discovery task. We develop an automated LLM-based framework that extracts critical components—research questions, background surveys, inspirations, and hypotheses—from papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on publications from 2024 onward, ensuring minimal overlap with LLM pretraining data; our automated framework further enables automatic extraction of even more recent papers as LLM pretraining cutoffs advance, supporting scalable and contamination-free automatic renewal of this discovery benchmark. Our evaluation shows that, across disciplines, LLMs excel at inspiration retrieval—an out-of-distribution task—suggesting their ability to surface novel knowledge associations.

The Benchmark

1,386 frontier papers across 12 disciplines

Papers from Nature, Science, and venues of a similar level — all published in 2024 or later. Extracted automatically so the benchmark stays contamination-resistant as model cutoffs advance.

1,386 papers · 12 disciplines · 3 sub-tasks · 12 LLMs

| Model | Cutoff | Released |
| --- | --- | --- |
| GPT-4o | Oct 2023 | May 2024 |
| GPT-4o Mini | Oct 2023 | Jul 2024 |
| Llama-3.1-8B | Dec 2023 | Jul 2024 |
| Llama-3.1-70B | Dec 2023 | Jul 2024 |
| Llama-3.2-1B | | Sep 2024 |
| Claude 3.5 Sonnet | Apr 2024 | Jun 2024 |
| Claude 3.5 Haiku | Jul 2024 | Oct 2024 |
| Gemini 2.0 Flash | Jun 2024 | Dec 2024 |
| Gemini 2.0 FT | Jun 2024 | Dec 2024 |
| Qwen Plus | | Nov 2024 |
| Qwen Turbo | | Nov 2024 |
| DeepSeek-V3 | | Dec 2024 |

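To make the contamination argument concrete, the sketch below checks which papers fall after a given model's knowledge cutoff. The cutoff months are copied from the table above; the paper records and the helper name `post_cutoff_papers` are hypothetical, and treating a paper published in the cutoff month itself as possibly seen is an assumption rather than a rule stated by the benchmark.

```python
from datetime import date

# Knowledge cutoffs as (year, month), copied from the table above.
CUTOFFS = {
    "GPT-4o": (2023, 10),
    "Llama-3.1-70B": (2023, 12),
    "Claude 3.5 Sonnet": (2024, 4),
    "Claude 3.5 Haiku": (2024, 7),
}

# Hypothetical paper records, for illustration only.
PAPERS = [
    {"id": "paper-a", "published": date(2024, 1, 15)},
    {"id": "paper-b", "published": date(2024, 5, 2)},
    {"id": "paper-c", "published": date(2024, 9, 30)},
]

def post_cutoff_papers(papers, model):
    """Return ids of papers published strictly after the model's cutoff month."""
    cutoff = CUTOFFS[model]
    return [
        p["id"]
        for p in papers
        if (p["published"].year, p["published"].month) > cutoff
    ]

for model in CUTOFFS:
    print(model, post_cutoff_papers(PAPERS, model))
```

Under these assumptions, a model with a later cutoff leaves fewer usable papers, which is why re-running the extraction on newer publications matters as cutoffs advance.
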
Main Results

How 12 LLMs perform across the discovery pipeline

Overall scores averaged across all 12 disciplines and 1,386 papers, reported separately for the three sub-tasks.

Hit Ratio (%) — top 4% of 75 candidates selected (3 papers kept).

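| Model | Cell | Chem | ETS | MS | Phys | EGS | EVS | BL | BS | Law | Math | A | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B | 10.70 | 11.60 | 12.26 | 9.59 | 11.15 | 8.49 | 14.57 | 13.00 | 17.09 | 12.50 | 11.42 | 10.71 | 11.91 |
| Llama-3.1-8B | 32.39 | 38.00 | 40.61 | 31.37 | 32.80 | 59.85 | 36.61 | 28.52 | 55.64 | 28.88 | 36.22 | 34.69 | 37.87 |
| Gemini 2.0 FT | 31.27 | 41.20 | 40.61 | 30.63 | 32.48 | 71.04 | 39.37 | 33.57 | 59.64 | 37.07 | 34.65 | 33.16 | 40.18 |
| GPT-4o Mini | 30.42 | 43.60 | 41.00 | 34.69 | 33.44 | 66.80 | 40.16 | 28.88 | 64.73 | 32.76 | 37.80 | 35.71 | 40.59 |
| Qwen Turbo | 35.49 | 42.40 | 42.15 | 33.95 | 35.03 | 66.80 | 43.31 | 33.21 | 61.45 | 29.74 | 36.61 | 34.69 | 41.21 |
| Gemini 2.0 Flash | 31.55 | 38.80 | 44.06 | 34.32 | 34.39 | 74.52 | 37.40 | 32.49 | 64.00 | 37.50 | 37.80 | 32.65 | 41.46 |
| Claude 3.5 Sonnet | 36.34 | 41.20 | 42.91 | 30.63 | 36.31 | 67.57 | 40.55 | 34.30 | 63.64 | 34.91 | 37.40 | 33.67 | 41.62 |
| Qwen Plus | 36.06 | 47.20 | 45.21 | 33.58 | 34.39 | 72.97 | 43.31 | 35.38 | 64.36 | 34.91 | 39.37 | 36.22 | 43.43 |
| Claude 3.5 Haiku | 41.13 | 48.40 | 45.98 | 34.69 | 33.44 | 69.88 | 44.09 | 34.30 | 64.00 | 37.93 | 38.19 | 41.33 | 44.28 |
| DeepSeek-V3 | 38.87 | 46.00 | 44.06 | 36.90 | 36.62 | 75.29 | 41.73 | 40.07 | 65.45 | 36.64 | 38.58 | 37.76 | 44.78 |
| Llama-3.1-70B | 41.41 | 44.00 | 47.51 | 36.90 | 34.39 | 70.66 | 45.28 | 37.18 | 65.45 | 39.22 | 38.19 | 39.29 | 44.87 |
| GPT-4o | 39.44 | 46.40 | 47.13 | 38.38 | 35.35 | 75.29 | 44.88 | 38.63 | 65.82 | 39.22 | 40.16 | 38.78 | 45.65 |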

Retrieval capability climbs fast up to ~8B parameters, then plateaus around 70B regardless of training strategy.
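
The arithmetic behind the metric can be sketched as follows. Keeping the top 4% of 75 candidates leaves 3 papers, as stated above. The definition of Hit Ratio used here (the share of ground-truth inspirations that appear among the kept papers), the function names, and the paper ids are assumptions made for illustration.

```python
def num_kept(num_candidates: int, keep_fraction: float) -> int:
    """Number of candidate papers that survive screening."""
    return round(num_candidates * keep_fraction)

def hit_ratio(selected, ground_truth) -> float:
    """Assumed definition: fraction of ground-truth inspirations that were selected."""
    truth = set(ground_truth)
    hits = len(truth & set(selected))
    return hits / len(truth)

kept = num_kept(75, 0.04)            # 4% of 75 candidates
selected = ["p07", "p21", "p40"]     # hypothetical model picks
truth = ["p07", "p33"]               # hypothetical ground-truth inspirations
print(kept, hit_ratio(selected, truth))  # prints: 3 0.5
```

In this toy case the model recovers one of two ground-truth inspirations, so the hit ratio for the paper is 50%.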

Normalized matched score (%) — how well a generated hypothesis covers ground-truth key points.

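| Model | Cell | Chem | ETS | MS | Phys | EGS | EVS | BL | BS | Law | Math | A | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3.5 Haiku | 40.42 | 40.87 | 38.71 | 46.75 | 45.00 | 45.34 | 48.00 | 46.15 | 35.14 | 37.85 | 43.59 | 34.29 | 42.56 |
| Llama-3.1-8B | 44.58 | 47.83 | 42.78 | 46.04 | 45.05 | 44.30 | 46.47 | 47.37 | 44.21 | 47.58 | 48.21 | 45.14 | 45.68 |
| Gemini 2.0 FT | 45.67 | 39.79 | 48.48 | 47.22 | 48.77 | 49.24 | 48.57 | 48.02 | 41.47 | 47.03 | 42.81 | 40.00 | 46.30 |
| Gemini 2.0 Flash | 46.25 | 45.63 | 48.64 | 51.63 | 47.97 | 51.47 | 49.41 | 48.77 | 47.03 | 55.91 | 56.24 | 49.71 | 50.15 |
| Llama-3.1-70B | 46.67 | 49.86 | 50.83 | 51.53 | 50.60 | 50.61 | 52.10 | 54.36 | 49.47 | 53.94 | 51.11 | 49.14 | 50.92 |
| GPT-4o Mini | 46.67 | 49.42 | 50.91 | 52.63 | 53.82 | 53.33 | 54.86 | 54.36 | 46.92 | 56.97 | 52.48 | 53.14 | 52.47 |
| Qwen Turbo | 52.92 | 51.45 | 49.55 | 51.06 | 52.64 | 50.97 | 52.57 | 56.92 | 53.16 | 55.76 | 55.38 | 53.14 | 52.71 |
| GPT-4o | 55.00 | 53.04 | 54.09 | 53.95 | 53.82 | 52.97 | 53.14 | 55.38 | 46.15 | 53.99 | 54.53 | 52.57 | 53.37 |
| DeepSeek-V3 | 52.78 | 52.27 | 53.18 | 54.25 | 54.91 | 53.91 | 53.71 | 56.32 | 50.27 | 55.15 | 52.14 | 53.71 | 53.79 |
| Qwen Plus | 60.00 | 53.72 | 57.27 | 56.63 | 58.14 | 56.63 | 60.57 | 58.97 | 51.05 | 62.19 | 55.90 | 56.57 | 57.46 |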

All models can associate background with inspirations, but composition quality scales steadily with model capability.

Ranking accuracy (%) — each hypothesis pair is compared twice with reversed order to cancel position bias. The rightmost column shows the percentage where both orderings are correctly ranked.

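| Model | Cell | Chem | ETS | MS | Phys | EGS | EVS | BL | BS | Law | Math | A | Overall | Debiased |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | 36.94 | 35.57 | 30.57 | 37.71 | 43.35 | 47.18 | 36.02 | 43.11 | 41.63 | 46.09 | 30.73 | 25.40 | 38.06 | 8.17 |
| GPT-4o Mini | 42.25 | 39.94 | 34.39 | 42.98 | 39.78 | 43.78 | 40.63 | 43.72 | 45.03 | 42.24 | 32.67 | 31.50 | 40.13 | 1.33 |
| Gemini 2.0 Flash | 43.73 | 44.38 | 35.95 | 51.86 | 54.63 | 55.16 | 40.98 | 44.00 | 46.88 | 48.31 | 38.24 | 35.75 | 45.11 | 12.83 |
| Qwen Turbo | 46.42 | 45.11 | 42.88 | 48.74 | 45.61 | 46.40 | 45.26 | 49.20 | 50.92 | 49.27 | 37.15 | 37.62 | 45.48 | 15.00 |
| Gemini 2.0 FT | 43.52 | 44.96 | 36.88 | 52.81 | 54.08 | 54.95 | 42.27 | 44.53 | 46.15 | 48.09 | 37.80 | 38.40 | 45.49 | 13.83 |
| Qwen Plus | 46.00 | 46.00 | 41.72 | 49.35 | 50.64 | 49.11 | 44.80 | 46.93 | 43.36 | 45.43 | 40.16 | 41.97 | 45.56 | 5.67 |
| Claude 3.5 Haiku | 48.15 | 46.88 | 45.55 | 52.45 | 54.10 | 52.48 | 48.83 | 48.06 | 51.23 | 52.93 | 44.49 | 40.27 | 48.86 | 13.67 |
| Llama-3.1-8B | 55.48 | 54.20 | 55.90 | 56.60 | 54.35 | 55.48 | 55.91 | 56.71 | 54.69 | 55.55 | 55.60 | 55.49 | 55.65 | 5.83 |
| GPT-4o | 60.75 | 60.99 | 53.24 | 61.69 | 61.34 | 61.20 | 60.52 | 64.11 | 64.67 | 61.14 | 52.60 | 51.80 | 59.60 | 27.00 |
| DeepSeek-V3 | 80.88 | 82.03 | 78.85 | 83.63 | 80.82 | 81.47 | 83.98 | 81.77 | 83.48 | 80.69 | 76.78 | 75.88 | 80.99 | 76.44 |
| Claude 3.5 Sonnet | 80.23 | 80.83 | 80.93 | 83.20 | 84.33 | 84.72 | 82.63 | 82.48 | 84.87 | 81.81 | 76.20 | 76.51 | 81.59 | 77.67 |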

Position bias is the main failure mode. The "Debiased" column shows the % of pairs where both orderings are correctly ranked — only Claude 3.5 Sonnet and DeepSeek-V3 exceed 70%.
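
The reversed-order protocol can be sketched as below. Each (ground-truth, negative) pair is judged twice, once in each order, and the debiased score counts only pairs where both judgments are right. The `Judge` interface, the two toy judges, and the reading of plain accuracy as the share of individual judgments that are correct are assumptions for illustration, not the benchmark's published code.

```python
from typing import Callable, List, Tuple

# A judge returns 0 if it prefers the first hypothesis shown, 1 if the second.
Judge = Callable[[str, str], int]

def evaluate_pairs(pairs: List[Tuple[str, str]], judge: Judge) -> Tuple[float, float]:
    """Return (per-judgment accuracy, debiased accuracy) over (truth, negative) pairs."""
    judgments_correct = 0
    both_correct = 0
    for truth, negative in pairs:
        forward_ok = judge(truth, negative) == 0    # ground truth shown first
        reversed_ok = judge(negative, truth) == 1   # ground truth shown second
        judgments_correct += int(forward_ok) + int(reversed_ok)
        both_correct += int(forward_ok and reversed_ok)
    n = len(pairs)
    return judgments_correct / (2 * n), both_correct / n

pairs = [("truth-1", "neg-1"), ("truth-2", "neg-2")]

def always_first(a: str, b: str) -> int:
    """Toy judge with pure position bias: always prefers the first slot."""
    return 0

def oracle(a: str, b: str) -> int:
    """Toy judge that always recognises the ground truth."""
    return 0 if a.startswith("truth") else 1

print(evaluate_pairs(pairs, always_first))  # (0.5, 0.0)
print(evaluate_pairs(pairs, oracle))        # (1.0, 1.0)
```

A purely position-biased judge gets half of its individual judgments right yet scores zero once both orderings must agree, which is the gap the Debiased column exposes.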

The Tasks

A sufficient decomposition of discovery

Perfectly solving these three sub-tasks perfectly solves the overall discovery task.

01

Inspiration Retrieval

Screen 75 candidate papers and surface the few that could spark a hypothesis — even when they look unrelated to the research question.

Metric: Hit Ratio
02

Hypothesis Composition

Associate the research background with retrieved inspirations via mutate / refine / recombine to compose a novel hypothesis.

Metric: Matched Score
03

Hypothesis Ranking

From 16 hypotheses (1 ground-truth, 15 negatives), pick the best via pairwise comparison with reversed-order debiasing.

Metric: Pairwise Accuracy
Examples

See the benchmark in action

Question: How can a low-cost, eco-friendly adsorbent be developed to efficiently remove methylene blue (MB), a common pollutant, from aqueous solutions?
GPT-4o Selected Inspirations

Biochar-supported zerovalent iron for removal of various contaminants from aqueous solutions
Integrates biochar and zero-valent iron for cost-effective, eco-friendly enhancement of adsorption — directly applicable to MB removal.
Uniform silver nanowires synthesis by reducing AgNO3 with ethylene glycol
Control of particle characteristics informs structural optimization of porous, functionalized adsorbent surfaces.
Biochar composites from biomass & waste residues
Explores metal-enriched biochar composites; relevant but not a ground-truth inspiration for this task.
Figure: Hypothesis composition framework. An evolutionary unit mutates different ways to combine background and inspiration, then recombines the best parts into a final hypothesis.
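
The mutate, refine, and recombine steps of such an evolutionary unit can be sketched as a small pipeline. This is a structural illustration only: the function `compose_hypothesis`, its parameters, the explicit scoring step used to pick the best drafts, and the toy callables are all assumptions. In practice each callable would wrap an LLM prompt.

```python
from typing import Callable, List

def compose_hypothesis(
    background: str,
    inspiration: str,
    mutate: Callable[[str, str, int], str],
    refine: Callable[[str], str],
    score: Callable[[str], float],
    recombine: Callable[[List[str]], str],
    n_mutations: int = 3,
    keep: int = 2,
) -> str:
    # Mutate: draft several different ways to combine background and inspiration.
    drafts = [mutate(background, inspiration, k) for k in range(n_mutations)]
    # Refine: polish each draft independently.
    refined = [refine(d) for d in drafts]
    # Keep the highest-scoring drafts (this scoring step is an assumption).
    best = sorted(refined, key=score, reverse=True)[:keep]
    # Recombine: merge the best drafts into one final hypothesis.
    return recombine(best)

# Toy stand-ins so the sketch runs without any model access.
result = compose_hypothesis(
    "bg",
    "insp",
    mutate=lambda b, i, k: f"v{k}:{b}+{i}",
    refine=str.upper,
    score=lambda d: int(d[1]),
    recombine=" & ".join,
)
print(result)  # V2:BG+INSP & V1:BG+INSP
```

Passing the steps in as callables keeps the control flow separate from the prompting details, so each stage can be swapped or tested on its own.
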
Question: How can we enhance the targeted delivery and therapeutic efficacy of dacarbazine in breast cancer cells while minimizing systemic toxicity?
GPT-4o Generated
A hybrid delivery system of sugar-originated carbon nanodots conjugated with dacarbazine, encapsulated in a glucose-mimetic, pH-responsive hydrogel — targeting GLUT1-overexpressing cells via the acidic tumor microenvironment.
Ground Truth
Dacarbazine-primed carbon quantum dots coated with breast cancer cell-derived exosomes (Ex-DC@CQDs) for enhanced targeted delivery, improved tumor-site accumulation, and reduced systemic toxicity.
Score: 2 / 5. Captures the carbon nanodot + dacarbazine combination, but replaces exosome-mediated targeting with a hydrogel mechanism.
Question: How does the relationship between SMBHs and their host galaxies evolve during the star-formation peak at redshifts 1–3?
Ground-Truth Hypothesis
During the peak epoch (z ≈ 2), some host galaxies grow faster than their SMBHs — evidenced by an undermassive SMBH accreting at a super-Eddington rate in a high-redshift quasar.
Negative Hypothesis
At redshifts 1–3, coevolution is driven by companion-galaxy interactions and cold-gas inflows shaping molecular-gas accretion under AGN feedback, quantified via ALMA data. (GPT-4o picked this hypothesis.)
Extraction Pipeline

How the benchmark is built

An LLM-based agentic framework decomposes each paper into research question, background survey, inspirations, and hypothesis — verified by experts at 91.9% accuracy.

Figure: Extraction framework overview. The decomposition module proposes candidate inspirations; the necessary checker removes redundant ones; the sufficient checker confirms coverage of the hypothesis.
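
The three stages in the overview can be sketched as a short pipeline. The function `extract_inspirations`, the callable signatures, and the toy coverage data are hypothetical stand-ins; the real modules are LLM-based and their interfaces are not described on this page.

```python
from typing import Callable, Dict, List, Optional, Set

def extract_inspirations(
    paper: Dict,
    decompose: Callable[[Dict], List[str]],
    is_necessary: Callable[[Dict, str], bool],
    is_sufficient: Callable[[Dict, List[str]], bool],
) -> Optional[List[str]]:
    # Decomposition module: propose candidate inspirations.
    candidates = decompose(paper)
    # Necessary checker: drop candidates that contribute nothing to the hypothesis.
    kept = [c for c in candidates if is_necessary(paper, c)]
    # Sufficient checker: accept only if the kept set covers the whole hypothesis.
    return kept if is_sufficient(paper, kept) else None

# Toy data: which hypothesis key points each candidate covers (hypothetical).
COVERS: Dict[str, Set[str]] = {
    "insp-a": {"k1"},
    "insp-b": {"k2", "k3"},
    "insp-c": set(),  # redundant: covers no key point
}
paper = {"key_points": {"k1", "k2", "k3"}}

def toy_necessary(p: Dict, c: str) -> bool:
    return bool(COVERS[c] & p["key_points"])

def toy_sufficient(p: Dict, kept: List[str]) -> bool:
    covered = set().union(*(COVERS[c] for c in kept))
    return p["key_points"] <= covered

result = extract_inspirations(paper, lambda p: list(COVERS), toy_necessary, toy_sufficient)
print(result)  # ['insp-a', 'insp-b']
```

Returning `None` when coverage fails is one possible design choice; a real pipeline might instead retry the decomposition or flag the paper for expert review.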

Worked example: terahertz quasicrystal absorber paper.

Input (target paper): AlCuFe quasicrystal terahertz absorber
A tunable absorber film of grating structure of AlCuFe quasicrystal on a gold substrate, with three perfect absorption bands.

Extracted research question:
How can we design a tunable, high-performance terahertz absorber that is easy to fabricate, offers multi-band absorption, and exhibits high sensitivity for sensing?

Decomposed inspirations (3):
PtTe2-based type-II Dirac semimetal for THz photodetection
Dual-band tunable NIR light trapping via grating-based Fabry-Perot structure
Perfect metamaterial absorber

Extracted hypothesis:
An AlCuFe quasicrystal (Dirac semimetal) grating on gold achieves perfect multi-band THz absorption, tunable via Fermi energy, grating parameters, or polarization.

Citation

Cite ResearchBench

bibtex
@article{liu2025researchbench,
  title   = {ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition},
  author  = {Liu, Yujie and Yang, Zonglin and Xie, Tong and Ni, Jinjie and Gao, Ben and Li, Yuqiang and Tang, Shixiang and Ouyang, Wanli and Cambria, Erik and Zhou, Dongzhan},
  journal = {arXiv preprint arXiv:2503.21248},
  year    = {2025}
}