Introduction
As organizations increasingly deploy multi-agent LLM systems to handle complex queries and workflows, robust evaluation frameworks become crucial. Drawing on our recent work implementing hypothesis-driven experimentation for a large enterprise client, this post shares insights on building a hierarchical, waterfall-style evaluation framework focused on query classification and retrieval-augmented generation (RAG) in multi-agent systems.
This blog explores how we established structured evaluation pipelines, worked with domain experts, and implemented key metrics that went beyond simple accuracy scores to provide actionable insights for system improvement.
The Evaluation Challenge
Multi-agent systems, where different specialist AI agents handle different aspects of user queries, introduce unique evaluation challenges:
- Routing Accuracy: If queries aren’t routed to the right agent(s), the entire system breaks down, potentially producing plausible but incorrect answers that mislead users
- Multi-label Classification: Queries often require multiple agents to work together
- Retrieval Quality: Agents need to find the right documents and content chunks from knowledge sources, making retrieval accuracy a critical factor in multi-agent systems just as it is in traditional RAG implementations
- Answer Generation: Final responses must be accurate, relevant, and properly grounded
Our approach focused on developing comprehensive evaluation pipelines that addressed each of these aspects. We prioritized providing transparent, actionable feedback rather than opaque performance scores. Because our current implementation primarily uses retrieval-based agents, retrieval quality is a key metric in our evaluation framework. However, the framework is extensible—additional metrics, such as task completion rates and execution accuracy, can be incorporated for other types of agents.
Evaluation Framework Development
A key focus of our engagement was implementing hypothesis-driven experimentation, which required establishing a robust evaluation framework. This framework was built on high-quality evaluation data collected from multiple sources including subject matter experts (SMEs), existing system data, actual user queries, and supplementary synthetic data.
We prioritized creating evaluation datasets that genuinely represented real-world use cases rather than merely generating evaluation metrics. This approach ensured our improvements directly impacted business outcomes and user experience.
The data annotation process involved close collaboration with SMEs who provided critical annotations including:
- Query classification
- Appropriate agent selection
- Link/document retrieval for knowledge indexes
- Chunk retrieval within documents
- Reference answers
We treated evaluation data with the same rigor as production code. This included implementing versioning, comprehensive documentation, and detailed process tracking. This methodical approach prevents metric misinterpretation and ensures teams maintain context when analyzing results.
Fig 1: Comprehensive multi-agent evaluation framework – Our hierarchical approach evaluates routing, retrieval, and generation components to provide a complete assessment of system performance.
Hierarchical Schema for Evaluation Data
To structure our evaluation data effectively, we developed a simple hierarchical schema that follows our sequential evaluation process:
EvaluationData/
├── Queries/
│ ├── all_queries.json # All test queries
│ └── query_metadata.json # Domain, difficulty, query type metadata
│
├── AgentClassification/
│ ├── ground_truth_labels.json # Correct agent assignments for each query
│ ├── model_predictions.json # System's agent selections
│ └── classification_results.json # Accuracy, precision, recall, F1 metrics
│
├── Retrieval/ # Only evaluated for correctly classified queries
│ ├── document_retrieval/
│ │ ├── ground_truth_links.json # Expected documents/links
│ │ ├── retrieved_links.json # Actually retrieved documents
│ │ └── document_metrics.json # Precision@k, recall@k scores
│ │
│ └── chunk_retrieval/
│ ├── ground_truth_chunks.json # Expected content chunks
│ ├── retrieved_chunks.json # Actually retrieved chunks
│ └── chunk_metrics.json # Rouge scores, precision, recall
│
└── Generation/ # Final answer evaluation
├── reference_answers.json # SME-provided reference answers
├── system_responses.json # Generated answers
└── answer_metrics.json # Groundedness, similarity, factuality scores
This sequential schema ensures we:
- First evaluate if queries are routed to the correct agent(s)
- Only for correctly routed queries, measure document and chunk retrieval accuracy
- Finally assess answer quality based on groundedness and alignment with references
This approach prevents misattribution of errors and allows us to clearly identify which stage of the pipeline may need improvement.
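To make the gating concrete, here is a minimal sketch of the routing gate (field names such as gt_agents and pred_agents are illustrative, not our production schema): downstream retrieval and generation metrics are computed only for queries whose predicted agent set exactly matches the ground truth.

def routing_gate(queries):
    """Split queries into those that passed the routing stage (predicted agent set
    exactly matches ground truth) and those that did not. Retrieval and generation
    metrics are computed only on the passed subset."""
    passed, failed = [], []
    for q in queries:
        if set(q["gt_agents"]) == set(q["pred_agents"]):
            passed.append(q)
        else:
            failed.append(q)
    return passed, failed

# Example: only the first query proceeds to retrieval/generation evaluation
queries = [
    {"id": "1", "gt_agents": ["billing", "search"], "pred_agents": ["search", "billing"]},
    {"id": "2", "gt_agents": ["faq"], "pred_agents": ["billing"]},
]
passed, failed = routing_gate(queries)
print([q["id"] for q in passed], [q["id"] for q in failed])  # ['1'] ['2']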
Multi-Label Router Classification Evaluation Pipeline
User queries sent to a multi-agent application must first be routed to the relevant AI agent(s), which are then invoked to generate a response. This amounts to classifying each query into one or more categories, each aligned with an AI agent.
In our system, a router agent was responsible for deciding which AI agents to invoke for each user query. Because each agent corresponds to a label and a query may require several agents, evaluating the router is a multi-label classification problem: we assess whether the full set of required agents was identified.
Multi-Class vs. Multi-Label
- Multi-Class
- Each data point (query) is assigned exactly one label/class/agent
- Example: A router picks only one agent (e.g., knowledge base, generic LLM, or web search) for each query, even if more than one agent might apply
- This is typically simpler to implement and interpret because each query has a single “best” choice
- Multi-Label
- Each data point (query) can be assigned multiple labels/classes
- Example: A query could be routed to both “knowledge base” and “web search” if appropriate
- More challenging because you have to handle overlaps between classes, and your metrics (precision, recall, F1) become more complex in a multi-label setting
Generally, multi-label classification is harder than multi-class because you must learn not just to pick the correct single label, but to identify all correct labels simultaneously. This requires a more robust approach, especially to ensure that the router captures how labels/agents relate to each other.
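A small example makes the difference concrete. If the ground truth for a query is {knowledge base, web search} and the router predicts only {knowledge base}, exact-match accuracy scores the prediction as wrong, while label-level precision is 1.0 and recall is 0.5:

gt = {"knowledge_base", "web_search"}   # both agents are required
pred = {"knowledge_base"}               # the router picked only one

exact_match = gt == pred                                 # False: the full set must match
precision = len(gt & pred) / len(pred) if pred else 0.0  # 1.0: everything predicted was needed
recall = len(gt & pred) / len(gt) if gt else 0.0         # 0.5: only half of the required agents were found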
Our evaluation pipeline for multi-label classification serves as the critical entry point and primary accuracy bottleneck for the entire multi-agent system. If queries aren’t routed to the correct agents at this stage, even perfect downstream components will fail to produce accurate results.
Rather than providing just another “black box” metric like F1 score, our pipeline generates detailed diagnostic error analysis reports that illuminate exactly where and how the system is failing, enabling targeted improvements. Specifically, the pipeline evaluates how effectively the multi-agent router classifies each query. It compares predicted agent assignments against expert-labeled ground truth data to identify both strengths and areas for improvement.
Error Analysis Framework
The error analysis report serves as the cornerstone of our development process, addressing critical questions:
- Which specific query types consistently trigger incorrect agent selection?
- What patterns exist in queries where the system performs well vs. poorly?
- Are there particular agent combinations the system struggles to identify?
- Do specific keywords or phrasings correlate with routing errors?
- Which errors would have the highest business impact if fixed?
Fig 2: Error analysis driven development approach – Our evaluation pipeline focuses on detailed error analysis rather than single metrics, enabling iterative improvements through targeted insights.
Our pipeline provides more than standard metrics (precision, recall, F1):
- Per-query analysis showing exactly which agent assignments were correct and incorrect
- Detailed examination of multi-label cases where some but not all agents were correctly assigned
- Pattern recognition across error categories to identify systematic failure modes
- Performance breakdowns by query type, complexity, and domain
The analysis results include several key components:
- Dataset Statistics: General statistics about the overall dataset
- Missed Entries: Data queries with ground truth but no prediction entries
- Filtered Data: Any data queries excluded from evaluation
- Class Distribution: The distribution between different agent classes
- Multi-Label Data: Analysis of queries requiring multiple agents
- Incorrect Predictions: Breakdown of errors as partial matches or complete misses
- Overall Metrics: Exact match accuracy, average precision, recall, and F1-scores
- Per-Class Performance: Metrics broken down by agent type
Implementation and Example
We implemented this approach in a Python evaluation script that processes ground truth and prediction data to generate detailed reports. Here’s a simplified version of our implementation:
#!/usr/bin/env python3
"""
General Multi-Label Classification Evaluation with Detailed Analysis
Usage:
python eval_multilabel_full.py \
--gt ground_truth.json \
--pred predictions.json \
--outdir results/ \
--remove unknown outofscope
"""
import json
import argparse
from pathlib import Path
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
def load_json(path):
return json.loads(Path(path).read_text())
def norm(raw):
"""Normalize a raw label field into a list of lowercase labels."""
if raw is None:
return []
if isinstance(raw, str):
raw = [x.strip() for x in raw.split(',')]
if isinstance(raw, (list, tuple)):
return [x.lower() for x in raw]
return [str(raw).lower()]
def align(gt, pred):
"""Align GT and predictions by ID, return common IDs, missing, extra, and maps."""
gt_map = {d['id']: d for d in gt}
pred_map = {d['id']: d for d in pred}
common = sorted(set(gt_map) & set(pred_map))
missing = sorted(set(gt_map) - set(pred_map))
extra = sorted(set(pred_map) - set(gt_map))
return common, missing, extra, gt_map, pred_map
def filter_idxs(labels_list, remove):
"""Return indices of entries whose labels do not include any remove-categories."""
keep = []
for i, labs in enumerate(labels_list):
if not any(r in labs for r in remove):
keep.append(i)
return keep
def analyze(gt_raw, pred_raw, remove, outdir):
# Align by ID
common, missing, extra, gt_map, pred_map = align(gt_raw, pred_raw)
outdir = Path(outdir)
outdir.mkdir(parents=True, exist_ok=True)
# Extract & normalize labels
gt = [norm(gt_map[i].get('labels')) for i in common]
pr = [norm(pred_map[i].get('labels')) for i in common]
ids = list(common)
# Filter out unwanted categories
keep = filter_idxs(gt, remove)
gt = [gt[i] for i in keep]
pr = [pr[i] for i in keep]
ids = [ids[i] for i in keep]
# Build class distributions
def dist(lst):
cnt = Counter([l for row in lst for l in row])
total = sum(cnt.values())
return [
{'label': lab, 'count': cnt[lab], 'pct': cnt[lab]/total*100 if total else 0}
for lab, _ in cnt.most_common()
]
stats = {
'dataset': {
'gt_total': len(gt_raw),
'pred_total': len(pred_raw),
'common': len(common),
'missing': missing,
'extra': extra
},
'distribution': {
'gt': dist(gt),
'pred': dist(pr)
}
}
# Multi-label analysis
multi = [i for i, labs in enumerate(gt) if len(labs) > 1]
ml_exact = sum(1 for i in multi if set(gt[i]) == set(pr[i]))
# Single→multi cases
s2m = [i for i, labs in enumerate(gt) if len(labs) == 1 and len(pr[i]) > 1]
s2m_corr = sum(1 for i in s2m if gt[i][0] in pr[i])
# Collect incorrect predictions
incorrect = []
for i, (g0, p0) in enumerate(zip(gt, pr)):
if set(g0) != set(p0):
incorrect.append({
'id': ids[i],
'gt': g0,
'pred': p0,
'partial': bool(set(g0) & set(p0)),
'missed': [x for x in g0 if x not in p0],
'extra': [x for x in p0 if x not in g0],
})
# Compute multi-label metrics
all_classes = sorted({lab for row in gt for lab in row})
mlb = MultiLabelBinarizer(classes=all_classes)
y_true = mlb.fit_transform(gt)
y_pred = mlb.transform(pr)
report = classification_report(
y_true, y_pred,
target_names=mlb.classes_,
output_dict=True,
zero_division=0
)
exact_match_acc = sum(1 for g0, p0 in zip(gt, pr) if set(g0) == set(p0)) / len(gt) if gt else 0
# Save metrics.json
out_metrics = {
**stats,
'exact_match': exact_match_acc,
'report': report,
'multi_label': {
'total': len(multi),
'exact_match': ml_exact,
'accuracy': ml_exact / len(multi) if multi else None
},
'single_to_multi': {
'total': len(s2m),
'correct': s2m_corr,
'pct_correct': (s2m_corr / len(s2m) * 100) if s2m else None
},
'incorrect': incorrect
}
with open(outdir / 'metrics.json', 'w') as f:
json.dump(out_metrics, f, indent=2)
# Write human-readable report
rpt = outdir / 'analysis_report.txt'
with rpt.open('w') as f:
f.write("=== DATASET STATISTICS ===\n")
f.write(f"GT entries: {stats['dataset']['gt_total']}\n")
f.write(f"Pred entries: {stats['dataset']['pred_total']}\n")
f.write(f"Common IDs: {stats['dataset']['common']}\n")
f.write(f"Missing IDs: {len(missing)}\n")
f.write(f"Extra IDs: {len(extra)}\n\n")
f.write("=== CLASS DISTRIBUTIONS ===\n")
for side in ['gt', 'pred']:
f.write(f"-- {side.upper()} --\n")
for d in stats['distribution'][side]:
f.write(f" {d['label']}: {d['count']} ({d['pct']:.1f}%)\n")
f.write("\n")
f.write("=== OVERALL METRICS ===\n")
f.write(f"Exact-match accuracy: {exact_match_acc:.3f}\n")
f.write(f"Weighted F1-score: {report['weighted avg']['f1-score']:.3f}\n\n")
f.write("=== MULTI-LABEL ANALYSIS ===\n")
f.write(f"Total multi-label: {len(multi)}\n")
f.write(f"Exact matches: {ml_exact}\n\n")
f.write("=== SINGLE→MULTI INSPECTION ===\n")
f.write(f"Total single→multi: {len(s2m)}, correct includes: {s2m_corr}\n\n")
f.write("=== INCORRECT PREDICTIONS ===\n")
for e in incorrect:
f.write(
f"- ID {e['id']}: GT={e['gt']} PRED={e['pred']} "
f"partial={e['partial']} missed={e['missed']} extra={e['extra']}\n"
)
print(f"Metrics and report saved to: {outdir}")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Multi-label classification evaluation with detailed analysis"
    )
    parser.add_argument("--gt", required=True, help="Path to ground truth JSON")
    parser.add_argument("--pred", required=True, help="Path to predictions JSON")
    parser.add_argument("--outdir", default="results", help="Directory for metrics.json and analysis_report.txt")
    parser.add_argument("--remove", nargs="*", default=[], help="Label categories to exclude (e.g., unknown outofscope)")
    args = parser.parse_args()
    analyze(load_json(args.gt), load_json(args.pred), [r.lower() for r in args.remove], args.outdir)
Example Usage
Below are example JSON files that demonstrate how to use the evaluation script:
ground_truth.json
[
{ "id": "1", "labels": ["billing", "search"] },
{ "id": "2", "labels": ["faq"] },
{ "id": "3", "labels": ["unknown"] },
{ "id": "4", "labels": ["faq", "billing"] }
]
predictions.json
[
{ "id": "1", "labels": ["search", "billing"] },
{ "id": "2", "labels": ["faq"] },
{ "id": "3", "labels": ["fileupload"] },
{ "id": "4", "labels": ["faq"] }
]
Place these files alongside your script, then run:
python eval_multilabel_full.py --gt ground_truth.json --pred predictions.json --outdir my_results
Your my_results/metrics.json and my_results/analysis_report.txt will contain the full diagnostics that enable teams to identify and address specific issues in the multi-agent routing system.
The weighted average F1-score provides a good overall performance tracker, while per-class metrics help identify which query types need improvement. This approach enabled our teams to iteratively improve routing accuracy from the initial 70% range to the high 80% range by focusing on the most impactful error categories.
Retrieval and Generation Metrics
Once we’ve established that queries are correctly routed to the appropriate agents, we need to evaluate two critical downstream components: retrieval accuracy and answer generation quality. These evaluations only proceed for queries that were correctly classified in the routing stage.
Retrieval Evaluation
For agents that retrieve information before generating answers, we implemented two levels of retrieval evaluation:
- Document-Level Retrieval Metrics:
- Precision@k (P@k): Proportion of retrieved documents that are relevant
- Recall@k (R@k): Proportion of relevant documents that are retrieved
- F1@k: Harmonic mean of precision and recall at k documents
We used these metrics to evaluate how effectively agents found the correct documents or knowledge sources/links for each query. These evaluations were applied to both our web search agent and our internal knowledge retrieval agent that queried the enterprise knowledge index.
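As a minimal, self-contained sketch (not our production code), document-level precision@k, recall@k, and F1@k can be computed directly from the set of expected links and the ranked list of retrieved links:

def precision_recall_at_k(relevant, retrieved, k):
    """relevant: set of ground-truth document IDs/links for a query.
    retrieved: ranked list of retrieved document IDs/links.
    Returns (precision@k, recall@k, f1@k)."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: two of the top-3 retrieved links appear in the ground truth
p, r, f1 = precision_recall_at_k({"doc_a", "doc_b", "doc_c"}, ["doc_a", "doc_x", "doc_b"], k=3)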
- Chunk-Level Retrieval Metrics:
- Rouge-L scores measuring overlap between retrieved text chunks and ground truth chunks
- Semantic similarity between retrieved chunks and reference chunks using embedding models
- Coverage assessment evaluating whether retrieved chunks contain all necessary information
This approach helped us identify whether retrieval errors were occurring at the document selection stage or during chunk extraction. This insight provided valuable feedback to improve our indexing and chunking approaches while creating the knowledge index.
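For the chunk-level checks, one possible implementation (an illustrative sketch assuming the rouge-score and sentence-transformers packages, not necessarily the exact libraries we used) combines ROUGE-L overlap with embedding cosine similarity:

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def chunk_scores(ground_truth_chunk: str, retrieved_chunk: str):
    """Lexical overlap (ROUGE-L F1) plus embedding cosine similarity for one chunk pair."""
    rouge_l = scorer.score(ground_truth_chunk, retrieved_chunk)["rougeL"].fmeasure
    emb = model.encode([ground_truth_chunk, retrieved_chunk], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()
    return {"rougeL_f1": rouge_l, "cosine_similarity": cosine}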
Answer Generation Evaluation
For the final generated responses, we developed a multifaceted evaluation approach:
- Reference-Based Metrics:
- Lexical similarity using Rouge, BLEU, and GLEU scores to measure word and n-gram overlap
- Semantic similarity using embedding-based models (BERTScore, SentenceBERT) to capture meaning beyond exact matches
- Structure and style assessment comparing formatting, citations, and organizational elements
- Reference-Free Metrics:
- Groundedness evaluation ensuring answers don’t contain hallucinated information
- Factual correctness assessment using frameworks like RAGAS to verify claims
- Answer relevancy scoring to determine if responses address the original query
These comprehensive metrics gave us a holistic view of answer quality and identified specific improvement areas without relying on single aggregate scores.
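For the reference-based side, a minimal sketch might look like the following, using NLTK for BLEU/GLEU and the bert-score package for semantic similarity (illustrative library choices, not a definitive implementation):

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.gleu_score import sentence_gleu
from bert_score import score as bert_score

def reference_based_scores(reference: str, response: str):
    """Lexical overlap (BLEU, GLEU) plus semantic similarity (BERTScore F1) against an SME reference."""
    ref_tokens, resp_tokens = reference.split(), response.split()
    bleu = sentence_bleu([ref_tokens], resp_tokens)
    gleu = sentence_gleu([ref_tokens], resp_tokens)
    _, _, f1 = bert_score([response], [reference], lang="en")
    return {"bleu": bleu, "gleu": gleu, "bertscore_f1": f1.item()}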
Implementation Details
For generation evaluation, we used a combination of:
- Custom metrics calculated directly on the response texts
- Off-the-shelf libraries like RAGAS for automated assessment
- Human evaluation for a subset of responses to calibrate automated metrics
We found that factuality and groundedness were particularly important for enterprise applications, where incorrect information poses significant risks. Our evaluation framework prioritized these aspects alongside traditional metrics like relevance and completeness.
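As one concrete option for the reference-free checks, a RAGAS-based sketch might look like the following (shown roughly against the ragas 0.1.x interface; the API and metric names have changed across versions, and an LLM backend such as an OpenAI key must be configured, so treat this as a starting point rather than our exact implementation):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# faithfulness approximates groundedness in the retrieved contexts;
# answer_relevancy checks whether the response addresses the original query
data = Dataset.from_dict({
    "question": ["How do I reset my billing password?"],
    "answer": ["You can reset it from Account Settings > Security."],
    "contexts": [["To reset your billing password, open Account Settings and choose Security."]],
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)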
For reference-based evaluations, we worked with SMEs to develop gold-standard reference answers. We also implemented embedding-based similarity scores to assess semantic alignment between system responses and reference answers.
Additionally, we incorporated reference-free evaluators focusing on factuality and relevancy, providing multiple implementation options that the team could select based on specific use cases.
Evaluation Complexities
Web Search Evaluation Challenges
When evaluating web search accuracy, we encountered a significant challenge: web links change over time, and equivalent or better content might be available at different URLs than those in our ground truth data. For web search agents, we shifted focus from exact URL matching to chunk-based retrieval evaluation and answer accuracy metrics.
This approach acknowledges the dynamic nature of the web while still ensuring the agent retrieves high-quality, relevant information regardless of its specific source URL.
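One way to operationalize this, sketched here with illustrative names rather than the exact logic we shipped, is to accept a web result when its URL still matches loosely or when its content is sufficiently similar to the reference chunk:

from urllib.parse import urlparse

def url_match(gt_url: str, retrieved_url: str) -> bool:
    """Loose match: same domain and path, ignoring scheme and query parameters."""
    a, b = urlparse(gt_url), urlparse(retrieved_url)
    return (a.netloc, a.path.rstrip("/")) == (b.netloc, b.path.rstrip("/"))

def web_result_is_acceptable(gt_url, retrieved_url, gt_chunk, retrieved_chunk,
                             content_similarity, threshold=0.8):
    """Accept the result if the URL matches loosely OR the retrieved content scores
    above a similarity threshold against the reference chunk; content_similarity is
    any callable returning a 0-1 score (e.g., embedding cosine similarity)."""
    if url_match(gt_url, retrieved_url):
        return True
    return content_similarity(gt_chunk, retrieved_chunk) >= threshold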
SME Bias and Inter-Annotator Agreement
Another critical insight involved potential bias in SME-prepared reference answers. To mitigate this, we proposed implementing:
- Inter-annotator agreement protocols requiring multiple SMEs to review each reference answer (a simple agreement check is sketched after this list)
- Multiple review cycles to ensure consensus
- Clear guidelines for what constitutes a good answer in each domain
- Clear annotation criteria and protocols for evaluation data, shared with SMEs before labeling begins
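To quantify agreement on multi-label annotations, a simple starting point (a sketch only; formal statistics such as Cohen's kappa or Krippendorff's alpha can be layered on top) is the average Jaccard overlap between two annotators' label sets:

def mean_jaccard_agreement(annotator_a, annotator_b):
    """annotator_a / annotator_b: dicts mapping query ID -> set of labels.
    Returns the average Jaccard overlap over the queries both SMEs annotated."""
    shared_ids = set(annotator_a) & set(annotator_b)
    if not shared_ids:
        return None
    scores = []
    for qid in shared_ids:
        a, b = set(annotator_a[qid]), set(annotator_b[qid])
        scores.append(len(a & b) / len(a | b) if (a | b) else 1.0)
    return sum(scores) / len(scores)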
Domain-Specific Evaluation Design
We recognized that effective evaluation must be domain-centric, not AI-centric. This core principle guided our entire approach: the people who will use and benefit from the system should determine how it’s evaluated.
In each domain, we consistently placed subject matter experts (SMEs) at the center of our evaluation frameworks. Rather than imposing metrics defined by AI researchers, we collaborated with domain experts to establish criteria that reflected real-world requirements and usage contexts.
This approach ensured that:
- Evaluation criteria aligned with actual business needs and domain-specific requirements
- Systems were measured on metrics that mattered to end-users
- Results were interpretable and actionable for stakeholders without AI expertise
For every agent in our multi-agent system, we partnered with the respective domain SMEs to define appropriate success metrics. This centered the actual users’ needs rather than emphasizing abstract AI metrics that might not translate to real-world value.
Conclusion
Developing effective evaluation frameworks for multi-agent LLM systems requires:
- A comprehensive approach that addresses routing, retrieval, and generation
- Collaboration with domain experts to establish relevant evaluation criteria
- Detailed error analysis that goes beyond simple accuracy metrics
- Recognition of domain-specific nuances and challenges
- Iterative improvement based on insights from evaluation reports
By implementing these principles, we were able to create evaluation pipelines that provided actionable insights and drove significant improvements in our multi-agent system performance.
The future of AI evaluation must continue moving toward domain-expert-driven approaches rather than AI-centric metrics. Only by centering the actual users and use cases can we ensure our systems deliver real-world value.
Acknowledgements
Special thanks to the ISE crew—Juan, Jarre, Abdo, Court, Paul, Nidhi, and Ankur—for their valuable contributions to this project.