
A Benchmark for Multimodal Information‑Seeking and Reasoning in Agricultural Expert‑Guided Conversations

1University of Illinois Urbana-Champaign   2Amazon

* Equal contributions

MIRAGE Benchmark Overview

Abstract

MIRAGE is a new multimodal benchmark designed to evaluate vision-language models in realistic expert consultation settings. It incorporates natural user queries, expert responses, and images drawn from real interactions between users and domain experts. The benchmark poses challenges that current models struggle with, such as underspecified information and rare biological entities, highlighting the need for stronger grounded reasoning, clarification abilities, and long-form response generation. It includes both single-turn (MIRAGE-MMST) and multi-turn (MIRAGE-MMMT) tasks, assessing not only accuracy and diagnostic parsimony but also the ability to simulate expert conversational decisions, such as whether to clarify or respond.

Drawing on approximately 285,000 real-world agricultural consultations from the AskExtension platform (218,431 single-turn Q&A pairs and 66,962 multi-turn dialogues) and encompassing over 7,000 unique plant, pest, and disease entities, MIRAGE offers both multimodal single-turn and multimodal multi-turn challenges.

MIRAGE provides a rigorous testbed for evaluating vision–language models on critical AI capabilities: grounded reasoning, through complex cause-effect inference tasks; multimodal understanding, by requiring fine-grained recognition from real-world user images; and conversational decision-making, by simulating dynamic, multi-turn expert consultations that challenge models to clarify ambiguities or deliver immediate guidance.

MIRAGE-MMST: Multimodal Single-Turn Benchmark

MIRAGE-MMST Overview

MIRAGE-MMST is a benchmark designed to assess multimodal vision-language models on expert-level, single-turn agricultural consultations. Each instance includes a natural-language question, user-submitted images, and associated metadata (e.g., timestamp, location). Models must identify relevant agronomic entities, reason causally about observed visual symptoms, and generate explanatory or actionable management recommendations.
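For concreteness, a single MMST instance can be pictured as a record like the sketch below. The field names and values here are hypothetical illustrations of the components just described, not the released data format.

mmst_instance = {
    "question": "My tomato leaves are curling and turning yellow. What is wrong?",
    "images": ["photo_1.jpg", "photo_2.jpg"],  # user-submitted photos
    "metadata": {"timestamp": "2023-07-14", "location": "Champaign County, IL"},
    # Expert-derived targets for the two evaluation tasks
    "identification": "Likely tomato yellow leaf curl virus ...",
    "management": "Remove infected plants and manage whitefly vectors ...",
}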

The benchmark features two subsets: MMST-Standard and MMST-Contextual.

MIRAGE-MMST Statistics

MIRAGE-MMST Leaderboard

The leaderboard reports results separately for open-source and proprietary models. Columns: Identification (Acc, Reason, WS), Management (Acc, Rel, Comp, Pars, WS), and Overall.

This leaderboard benchmarks large vision-language models on the MMST-Standard subset. Evaluation uses an LLM-as-Judge protocol, with scores averaged over three open-source reasoning models: DeepSeek-R1-Distill-Llama-70B, Qwen2-32B, and Phi-4-Reasoning.

The evaluation comprises two tasks:

  • Identification (ID) Task
    • Acc – identification accuracy (0 – 1)
    • Reason – reasoning accuracy (0 – 4)
    • ID-WS – weighted score WS = \(\dfrac{2 \times Acc + \tfrac{Reason}{4}}{3} \times 100\)
  • Management (MG) Task
    • Metrics: Acc (Accuracy), Rel (Relevance), Comp (Completeness), Pars (Parsimony) — each 0 – 4
    • MG-WS – weighted score WS = \(\dfrac{2 \times Acc + Rel + Comp + Pars}{20} \times 100\)

Overall – average of the two weighted scores: \(\tfrac{ID\text{-}WS + MG\text{-}WS}{2}\)
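The weighted scores follow directly from the judge ratings above. Below is a minimal Python sketch of these formulas; the function names and example ratings are illustrative only and are not the official evaluation code.

def id_weighted_score(acc, reason):
    # acc in [0, 1], reason in [0, 4]
    return (2 * acc + reason / 4) / 3 * 100

def mg_weighted_score(acc, rel, comp, pars):
    # each rating in [0, 4]
    return (2 * acc + rel + comp + pars) / 20 * 100

def overall_score(id_ws, mg_ws):
    return (id_ws + mg_ws) / 2

# Example: perfect identification, mid-range management answer
id_ws = id_weighted_score(acc=1.0, reason=4)             # 100.0
mg_ws = mg_weighted_score(acc=3, rel=3, comp=2, pars=4)  # 75.0
print(overall_score(id_ws, mg_ws))                       # 87.5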

Metric Descriptions

Identification Task Metrics

  • Accuracy (Acc): Measures the correctness of entity identification (e.g., plant species, pests, diseases)
  • Reasoning (Reason): Evaluates the quality of causal explanations for observed symptoms and conditions

Management Task Metrics

  • Accuracy (Acc): Measures the correctness of management recommendations
  • Relevance (Rel): Assesses how well the recommendations address the specific problem
  • Completeness (Comp): Evaluates whether all necessary management steps are included
  • Parsimony (Pars): Measures the efficiency and practicality of the recommendations, favoring the simplest sufficient intervention (Occam's razor)

MIRAGE-MMMT: Multimodal Multi-Turn Benchmark

MIRAGE-MMMT Overview

MIRAGE-MMMT is a multimodal decision-making task grounded in real-world agricultural consultations. Users pose complex, often image-supported questions about plant health, pest identification, growing conditions, and other agronomic concerns. Each dialogue reflects a practical scenario in which the expert must reason over the conversation history and visual context to decide: (1) whether to respond with guidance based on what is known, or (2) whether to pause and seek additional input to resolve a knowledge gap. This introduces a decision-making challenge tightly coupled with natural language generation.

MIRAGE-MMMT Statistics

MIRAGE-MMMT Task Example


MIRAGE-MMMT Leaderboard

The leaderboard reports results separately for open-source and proprietary models, under both zero-shot and chain-of-thought prompting. Columns for each prompting setting: Acc %, Clarify, and Respond.

This leaderboard benchmarks large vision-language models on the MMMT task of the MIRAGE Benchmark. Evaluation combines ground-truth decision labels with an LLM-as-Judge protocol.

Metric Descriptions

  • Decision Accuracy (Acc %): Percentage of turns where the model correctly chooses between Clarify and Respond actions, matching the expected decision.
  • Goal Relevance for Clarify (%): Percentage of clarification questions that directly target missing facts essential to achieving the user's goal.
  • Goal Relevance for Respond (%): Percentage of responses that appropriately address the user's goal using known facts effectively.
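As a rough illustration of how these percentages can be computed from per-turn annotations, here is a small Python sketch; the record fields and example values are hypothetical and do not reflect the benchmark's actual data format or judging pipeline.

def decision_accuracy(turns):
    # Share of turns where the predicted action (Clarify/Respond) matches the gold action
    correct = sum(t["predicted_action"] == t["gold_action"] for t in turns)
    return 100 * correct / len(turns)

def goal_relevance(turns, action):
    # Share of turns with the given predicted action that a judge marked goal-relevant
    subset = [t for t in turns if t["predicted_action"] == action]
    relevant = sum(t["judge_goal_relevant"] for t in subset)
    return 100 * relevant / len(subset) if subset else 0.0

# Hypothetical per-turn records
turns = [
    {"predicted_action": "Clarify", "gold_action": "Clarify", "judge_goal_relevant": True},
    {"predicted_action": "Respond", "gold_action": "Clarify", "judge_goal_relevant": False},
    {"predicted_action": "Respond", "gold_action": "Respond", "judge_goal_relevant": True},
]
print(round(decision_accuracy(turns), 1))   # 66.7
print(goal_relevance(turns, "Clarify"))     # 100.0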

Error Analysis

Error Analysis Overview

Our comprehensive error analysis reveals key insights into model performance across different agricultural domains and task types. The analysis examines common failure patterns, domain-specific challenges, and areas where current vision-language models struggle in agricultural expert consultation scenarios.

These visualizations summarize systematic failure patterns in agricultural AI across three tiers: (1) GPT-4.1-only failures, highlighting core domain challenges like fine-grained species distinctions and poor image quality; (2) Qwen-only failures, revealing model-specific issues like vision-language misalignment and confidence errors; and (3) joint failures, showing frontier challenges such as ambiguous visuals and rare species. Bar charts break down the frequency of each error type, while the heatmap offers a side-by-side comparison across models. Together, these plots reveal where models struggle most—and why—helping guide future improvements.

Error Distribution Heatmap


Comparison of error categories across GPT-4.1, Qwen2.5, and joint model failures

Tier 1: Fundamental Domain Challenges (GPT-4.1)

GPT-4.1 Domain Errors

Failures reflecting core agricultural AI challenges such as fine-grained taxonomy and visual ambiguity

Tier 2: Systematic Gaps in Qwen2.5

Qwen2.5 Failures

Errors where Qwen2.5 fails but GPT-4.1 succeeds, including misalignment and reasoning bias

Tier 3: Joint Failures – Frontier Cases

Joint Model Failures

Hardest cases involving rare species, overlapping symptoms, and ambiguous visual inputs

Key Findings

Domain-Specific Challenges

  • Plant disease identification shows the highest error rates due to visual similarity between different diseases
  • Pest identification challenges arise from small-scale visual features and seasonal variations
  • Management recommendations often lack contextual awareness of geographic and seasonal factors

Model Limitations

  • Limited fine-grained visual understanding for agricultural entities
  • Insufficient domain knowledge for causal reasoning in plant health
  • Difficulty in generating practical, actionable recommendations

MIRAGE Benchmark Examples

MIRAGE-MMST Standard Benchmark

MIRAGE-MMST Contextual Benchmark

Reasoning-LLM-as-a-Judge

Acknowledgements

This work is partly supported by the Amazon-Illinois Center on AI for Interactive Conversational Experiences Award, the AIFARMS National AI Institute, and the Center for Digital Agriculture at the University of Illinois. We thank the AskExtension team for providing the data. This work used the Delta advanced computing and data resource at the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications through allocation CIS250434 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.

Citation

If you find this work useful, please cite our paper:

@article{dongre2025mirage,
  title={MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations},
  author={Dongre, Vardhan and Gui, Chi and Garg, Shubham and Nayyeri, Hooshang and Tur, Gokhan and Hakkani-T{\"u}r, Dilek and Adve, Vikram S},
  journal={arXiv preprint arXiv:2506.20100},
  year={2025}
}