
A Benchmark for Multimodal Information‑Seeking and Reasoning in Agricultural Expert‑Guided Conversations

1University of Illinois Urbana-Champaign   2Amazon

* Equal contributions

MIRAGE Benchmark Overview

Abstract

MIRAGE is a new multimodal benchmark designed to evaluate vision-language models in realistic expert consultation settings. It incorporates natural user queries, expert responses, and images drawn from real interactions between users and domain experts. The benchmark poses challenges that current models struggle with, such as underspecified information and rare biological entities, highlighting the need for stronger grounded reasoning, clarification abilities, and long-form response generation. It includes both single-turn (MIRAGE-MMST) and multi-turn (MIRAGE-MMMT) tasks, assessing not only accuracy and diagnostic parsimony but also the ability to simulate expert conversational decisions, such as whether to clarify or respond.

Drawing on approximately 285,000 real-world agricultural consultations from the AskExtension platform (218,431 single-turn Q&A pairs and 66,962 multi-turn dialogues) and encompassing over 7,000 unique plant, pest, and disease entities, MIRAGE offers both multimodal single-turn and multimodal multi-turn challenges.

MIRAGE provides a rigorous testbed for evaluating vision–language models on critical AI capabilities: grounded reasoning, through complex cause-effect inference tasks; multimodal understanding, by requiring fine-grained recognition from real-world user images; and conversational decision-making, by simulating dynamic, multi-turn expert consultations that challenge models to clarify ambiguities or deliver immediate guidance.

MIRAGE-MMST: Multimodal Single-Turn Benchmark

MIRAGE-MMST Overview

MIRAGE-MMST is a benchmark designed to assess multimodal vision-language models on expert-level, single-turn agricultural consultations. Each instance includes a natural-language question, user-submitted images, and associated metadata (e.g., timestamp, location). Models must identify relevant agronomic entities, reason causally about observed visual symptoms, and generate explanatory or actionable management recommendations.
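For concreteness, a single MMST instance can be pictured as a record like the sketch below. The field names and values here are hypothetical illustrations of the components just described, not the released data format.

mmst_instance = {
    "question": "My tomato leaves are curling and turning yellow. What is wrong?",
    "images": ["photo_1.jpg", "photo_2.jpg"],  # user-submitted photos
    "metadata": {"timestamp": "2023-07-14", "location": "Champaign County, IL"},
    # Expert-derived targets for the two evaluation tasks
    "identification": "Likely tomato yellow leaf curl virus ...",
    "management": "Remove infected plants and manage whitefly vectors ...",
}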

The benchmark features two subsets: MMST-Standard and MMST-Contextual.

MIRAGE-MMST Statistics

MIRAGE-MMST Leaderboard

The leaderboard reports results separately for open-source and proprietary models. Columns: Identification (Acc, Reason, WS), Management (Acc, Rel, Comp, Pars, WS), and Overall.

This leaderboard benchmarks large vision-language models on the MMST-Standard subset. Evaluation uses an LLM-as-Judge protocol, with scores averaged over three open-source reasoning models: DeepSeek-R1-Distill-Llama-70B, Qwen2-32B, and Phi-4-Reasoning.

The evaluation comprises two tasks:

  • Identification (ID) Task
    • Acc – identification accuracy (0 – 1)
    • Reason – reasoning accuracy (0 – 4)
    • ID-WS – weighted score WS = \(\dfrac{2 \times Acc + \tfrac{Reason}{4}}{3} \times 100\)
  • Management (MG) Task
    • Metrics: Acc (Accuracy), Rel (Relevance), Comp (Completeness), Pars (Parsimony) — each 0 – 4
    • MG-WS – weighted score WS = \(\dfrac{2 \times Acc + Rel + Comp + Pars}{20} \times 100\)

Overall – average of the two weighted scores: \(\tfrac{ID\text{-}WS + MG\text{-}WS}{2}\)
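The weighted scores follow directly from the judge ratings above. Below is a minimal Python sketch of these formulas; the function names and example ratings are illustrative only and are not the official evaluation code.

def id_weighted_score(acc, reason):
    # acc in [0, 1], reason in [0, 4]
    return (2 * acc + reason / 4) / 3 * 100

def mg_weighted_score(acc, rel, comp, pars):
    # each rating in [0, 4]
    return (2 * acc + rel + comp + pars) / 20 * 100

def overall_score(id_ws, mg_ws):
    return (id_ws + mg_ws) / 2

# Example: perfect identification, mid-range management answer
id_ws = id_weighted_score(acc=1.0, reason=4)             # 100.0
mg_ws = mg_weighted_score(acc=3, rel=3, comp=2, pars=4)  # 75.0
print(overall_score(id_ws, mg_ws))                       # 87.5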

Metric Descriptions

Identification Task Metrics

  • Accuracy (Acc): Measures the correctness of entity identification (e.g., plant species, pests, diseases)
  • Reasoning (Reason): Evaluates the quality of causal explanations for observed symptoms and conditions

Management Task Metrics

  • Accuracy (Acc): Measures the correctness of management recommendations
  • Relevance (Rel): Assesses how well the recommendations address the specific problem
  • Completeness (Comp): Evaluates whether all necessary management steps are included
  • Parsimony (Pars): Measures the efficiency and practicality of the recommendations, favoring the simplest sufficient intervention (Occam's razor)

MIRAGE-MMMT: Multimodal Multi-Turn Benchmark

MIRAGE-MMMT Overview

MIRAGE-MMMT is a multimodal decision-making task grounded in real-world agricultural consultations. Users pose complex, often image-supported questions about plant health, pest identification, growing conditions, and other agronomic concerns. Each dialogue reflects a practical scenario in which the expert must reason over the conversation history and visual context to decide: (1) whether to respond with guidance based on what is known, or (2) whether to pause and seek additional input to resolve a knowledge gap. This introduces a decision-making challenge tightly coupled with natural language generation.

MIRAGE-MMMT Statistics

MIRAGE-MMMT Task Example


MIRAGE-MMMT Leaderboard

The leaderboard reports results separately for open-source and proprietary models, under both zero-shot and chain-of-thought prompting. Columns for each prompting setting: Acc %, Clarify, and Respond.

This leaderboard benchmarks large vision-language models on the MMMT task of the MIRAGE Benchmark. Evaluation combines ground-truth decision labels with an LLM-as-Judge protocol.

Metric Descriptions

  • Decision Accuracy (Acc %): Percentage of turns where the model correctly chooses between Clarify and Respond actions, matching the expected decision.
  • Goal Relevance for Clarify (%): Percentage of clarification questions that directly target missing facts essential to achieving the user's goal.
  • Goal Relevance for Respond (%): Percentage of responses that appropriately address the user's goal using known facts effectively.
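As a rough illustration of how these percentages can be computed from per-turn annotations, here is a small Python sketch; the record fields and example values are hypothetical and do not reflect the benchmark's actual data format or judging pipeline.

def decision_accuracy(turns):
    # Share of turns where the predicted action (Clarify/Respond) matches the gold action
    correct = sum(t["predicted_action"] == t["gold_action"] for t in turns)
    return 100 * correct / len(turns)

def goal_relevance(turns, action):
    # Share of turns with the given predicted action that a judge marked goal-relevant
    subset = [t for t in turns if t["predicted_action"] == action]
    relevant = sum(t["judge_goal_relevant"] for t in subset)
    return 100 * relevant / len(subset) if subset else 0.0

# Hypothetical per-turn records
turns = [
    {"predicted_action": "Clarify", "gold_action": "Clarify", "judge_goal_relevant": True},
    {"predicted_action": "Respond", "gold_action": "Clarify", "judge_goal_relevant": False},
    {"predicted_action": "Respond", "gold_action": "Respond", "judge_goal_relevant": True},
]
print(round(decision_accuracy(turns), 1))   # 66.7
print(goal_relevance(turns, "Clarify"))     # 100.0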

Error Analysis

Error Analysis Overview

Our comprehensive error analysis reveals key insights into model performance across different agricultural domains and task types. The analysis examines common failure patterns, domain-specific challenges, and areas where current vision-language models struggle in agricultural expert consultation scenarios.

These visualizations summarize systematic failure patterns in agricultural AI across three tiers: (1) GPT-4.1-only failures, highlighting core domain challenges like fine-grained species distinctions and poor image quality; (2) Qwen-only failures, revealing model-specific issues like vision-language misalignment and confidence errors; and (3) joint failures, showing frontier challenges such as ambiguous visuals and rare species. Bar charts break down the frequency of each error type, while the heatmap offers a side-by-side comparison across models. Together, these plots reveal where models struggle most—and why—helping guide future improvements.

Error Distribution Heatmap


Comparison of error categories across GPT-4.1, Qwen2.5, and joint model failures

Tier 1: Fundamental Domain Challenges (GPT-4.1)

GPT-4.1 Domain Errors

Failures reflecting core agricultural AI challenges such as fine-grained taxonomy and visual ambiguity

Tier 2: Systematic Gaps in Qwen2.5

Qwen2.5 Failures

Errors where Qwen2.5 fails but GPT-4.1 succeeds, including misalignment and reasoning bias

Tier 3: Joint Failures – Frontier Cases

Joint Model Failures

Hardest cases involving rare species, overlapping symptoms, and ambiguous visual inputs

Key Findings

Domain-Specific Challenges

  • Plant disease identification shows the highest error rates due to visual similarity between different diseases
  • Pest identification challenges arise from small-scale visual features and seasonal variations
  • Management recommendations often lack contextual awareness of geographic and seasonal factors

Model Limitations

  • Limited fine-grained visual understanding for agricultural entities
  • Insufficient domain knowledge for causal reasoning in plant health
  • Difficulty in generating practical, actionable recommendations

MIRAGE Benchmark Examples

MIRAGE-MMST Standard Benchmark

MIRAGE-MMST Contextual Benchmark

Reasoning-LLM-as-a-Judge

Acknowledgements

This work is partly supported by the Amazon-Illinois Center on AI for Interactive Conversational Experiences Award, the AIFARMS National AI Institute, and the Center for Digital Agriculture at the University of Illinois. We thank the AskExtension team for providing the data. This work used the Delta advanced computing and data resource at the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications through allocation CIS250434 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.

Citation

If you find this work useful, please cite our paper:

@article{dongre2025mirage,
  title={MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations},
  author={Dongre, Vardhan and Gui, Chi and Garg, Shubham and Nayyeri, Hooshang and Tur, Gokhan and Hakkani-T{\"u}r, Dilek and Adve, Vikram S},
  journal={arXiv preprint arXiv:2506.20100},
  year={2025}
}