MIRAGE

A Benchmark for Multimodal Information‑Seeking and Reasoning in Agricultural Expert‑Guided Conversations

¹University of Illinois Urbana-Champaign   ²Amazon

* Equal contributions

[Figure: MIRAGE benchmark overview]

Abstract

MIRAGE is a new multimodal benchmark designed to evaluate vision-language models in realistic expert consultation settings. It incorporates natural user queries, expert responses, and images drawn from real interactions between users and domain experts. These consultations pose challenges that current models struggle with, such as underspecified information and rare biological entities, highlighting the need for improved grounded reasoning, clarification abilities, and long-form response generation. The benchmark includes both single-turn (MIRAGE-MMST) and multi-turn (MIRAGE-MMMT) tasks, assessing not only accuracy and diagnostic parsimony but also the ability to simulate expert conversational decisions, such as whether to clarify or respond.

Drawing on approximately 285,000 real-world agricultural consultations from the AskExtension platform (218,431 single-turn Q&A pairs and 66,962 multi-turn dialogues) and encompassing over 7,000 unique plant, pest, and disease entities, MIRAGE offers both multimodal single-turn and multimodal multi-turn challenges.

MIRAGE provides a rigorous testbed for evaluating vision–language models on critical AI capabilities: grounded reasoning, through complex cause-effect inference tasks; multimodal understanding, by requiring fine-grained recognition from real-world user images; and conversational decision-making, by simulating dynamic, multi-turn expert consultations that challenge models to clarify ambiguities or deliver immediate guidance.

MIRAGE-MMST: Multimodal Single-Turn Benchmark

MIRAGE-MMST Overview

MIRAGE-MMST is a benchmark designed to assess multimodal vision-language models on expert-level, single-turn agricultural consultations. Each instance includes a natural-language question, user-submitted images, and associated metadata (e.g., timestamp, location). Models must identify relevant agronomic entities, reason causally about observed visual symptoms, and generate explanatory or actionable management recommendations.
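For concreteness, an MMST instance can be pictured as a small record holding the question, the images, the metadata, and the expert's reference answer. The sketch below is illustrative only; the field names and example values are assumptions, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class MMSTInstance:
    """Illustrative (unofficial) structure of one MMST example."""
    question: str             # natural-language user query
    image_paths: list[str]    # user-submitted photos
    metadata: dict[str, str]  # e.g. timestamp and location
    expert_response: str      # reference answer from the domain expert

# Hypothetical example in the spirit of the benchmark:
example = MMSTInstance(
    question="My tomato leaves are curling and turning yellow. What is wrong?",
    image_paths=["images/tomato_leaf_curl.jpg"],
    metadata={"timestamp": "2021-06-14", "location": "Champaign County, IL"},
    expert_response="The symptoms are consistent with ...",
)
```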

The benchmark features two subsets, MMST-Standard and MMST-Contextual (examples of each are shown below).

MIRAGE-MMST Statistics

MIRAGE-MMST Leaderboard

The leaderboard can be filtered by open-source vs. proprietary models. Its columns are: Name; Identification (Acc, Reason, WS); Management (Acc, Rel, Comp, Pars, WS); and Overall.

This leaderboard benchmarks large vision-language models on the MMST-Standard subset using a reasoning-LLM-as-a-judge protocol: each prediction is scored by three open-source reasoning models (DeepSeek-R1-Distill-Llama-70B, Qwen2-32B, and Phi-3-Reasoning), and the scores are averaged.

The evaluation comprises two tasks:

  • Identification (ID) Task
    • Acc – identification accuracy (0–1)
    • Reason – reasoning accuracy (0–4)
    • ID-WS – weighted score \(WS = \dfrac{2 \times Acc + \tfrac{Reason}{4}}{3} \times 100\)
  • Management (MG) Task
    • Metrics: Acc (Accuracy), Rel (Relevance), Comp (Completeness), Pars (Parsimony), each 0–4
    • MG-WS – weighted score \(WS = \dfrac{2 \times Acc + Rel + Comp + Pars}{20} \times 100\)

Overall – average of the two weighted scores: \(\tfrac{ID\text{-}WS + MG\text{-}WS}{2}\)
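In code, the scoring above reduces to two small formulas plus an average. This is a minimal sketch of the stated definitions; the function names and the example judge scores are made up for illustration.

```python
def id_weighted_score(acc: float, reason: float) -> float:
    """ID-WS: Acc in [0, 1], Reason in [0, 4]; returns a 0-100 score."""
    return (2 * acc + reason / 4) / 3 * 100

def mg_weighted_score(acc: float, rel: float, comp: float, pars: float) -> float:
    """MG-WS: all four judge scores in [0, 4]; returns a 0-100 score."""
    return (2 * acc + rel + comp + pars) / 20 * 100

# Hypothetical judge scores for one model:
id_ws = id_weighted_score(acc=0.8, reason=3)  # (1.6 + 0.75) / 3 * 100 ≈ 78.3
mg_ws = mg_weighted_score(3, 3, 2, 4)         # 15 / 20 * 100 = 75.0
overall = (id_ws + mg_ws) / 2                 # ≈ 76.7
```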

Metric Descriptions

Identification Task Metrics

  • Accuracy (Acc): Measures the correctness of entity identification (e.g., plant species, pests, diseases)
  • Reasoning (Reason): Evaluates the quality of causal explanations for observed symptoms and conditions

Management Task Metrics

  • Accuracy (Acc): Measures the correctness of management recommendations
  • Relevance (Rel): Assesses how well the recommendations address the specific problem
  • Completeness (Comp): Evaluates whether all necessary management steps are included
  • Parsimony (Pars): Measures the efficiency and practicality of the recommendations (Occam's Razor)
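To make the rubric concrete, a judge prompt along the following lines could elicit the four 0–4 management scores. The wording is a hypothetical reconstruction for illustration, not the prompt MIRAGE actually uses.

```python
# Illustrative rubric prompt for the Management task; intended to be
# filled via str.format(question=..., reference=..., answer=...).
MG_JUDGE_PROMPT = """\
You are grading a management recommendation for an agricultural consultation.

Question: {question}
Reference expert answer: {reference}
Model answer: {answer}

Rate the model answer from 0 to 4 on each criterion:
- Accuracy: is the recommendation correct?
- Relevance: does it address this specific problem?
- Completeness: are all necessary management steps included?
- Parsimony: is it efficient and practical, with no superfluous steps?

Reply with JSON: {{"Acc": ..., "Rel": ..., "Comp": ..., "Pars": ...}}
"""
```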

MIRAGE-MMMT: Multimodal Multi-Turn Benchmark

MIRAGE-MMMT Overview

MIRAGE-MMMT is a multimodal decision-making task grounded in real-world agricultural consultations. Users pose complex, often image-supported questions about plant health, pest identification, growing conditions, and other agronomic concerns. Each dialogue reflects a practical scenario in which the expert must reason over the conversation history and visual context to decide whether to (1) respond with guidance based on what is known, or (2) pause and seek additional input to resolve a knowledge gap. This couples a decision-making challenge tightly with natural language generation.
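A minimal sketch of that control flow is shown below; the `model.decide`, `model.ask_clarifying_question`, and `model.answer` calls are hypothetical placeholders, not the benchmark's harness or API.

```python
from typing import Literal

Action = Literal["respond", "clarify"]

def expert_turn(model, history: list[dict], images: list[str]) -> tuple[Action, str]:
    """One expert turn: first decide whether enough information is
    available, then generate the matching utterance."""
    action: Action = model.decide(history, images)  # hypothetical call
    if action == "clarify":
        text = model.ask_clarifying_question(history, images)
    else:
        text = model.answer(history, images)
    return action, text
```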

MIRAGE-MMMT Statistics

MIRAGE-MMMT Task Example

MIRAGE-MMMT Leaderboard

Leaderboard — coming soon

MIRAGE Benchmark Examples

MIRAGE-MMST Standard Benchmark

MIRAGE-MMST Contextual Benchmark

Reasoning-LLM-as-a-Judge

Acknowledgements

This work is partly supported by the Amazon-Illinois Center on AI for Interactive Conversational Experiences Award, the AIFARMS National AI Institute, and the Center for Digital Agriculture at the University of Illinois. We thank the AskExtension team for providing the data. This work used the Delta advanced computing and data resource at the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications through allocation CIS250434 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.