Introduction
As organizations increasingly deploy multi-agent LLM systems to handle complex queries and workflows, robust evaluation frameworks become crucial. Drawing on our recent work implementing hypothesis-driven experimentation for a large enterprise client, this post shares insights on building a hierarchical, waterfall-style evaluation framework focused on query classification and retrieval-augmented generation (RAG) in multi-agent systems.
This blog explores how we established structured evaluation pipelines, worked with domain experts, and implemented key metrics that went beyond simple accuracy scores to provide actionable insights for system improvement.
The Evaluation Challenge
Multi-agent systems, where different specialist AI agents handle different aspects of user queries, introduce unique evaluation challenges:
- Routing Accuracy: If queries aren’t routed to the right agent(s), the entire system breaks down, potentially producing plausible but incorrect answers that mislead users
- Multi-label Classification: Queries often require multiple agents to work together
- Retrieval Quality: Agents need to find the right documents and content chunks from knowledge sources, making retrieval accuracy a critical factor in multi-agent systems just as it is in traditional RAG implementations
- Answer Generation: Final responses must be accurate, relevant, and properly grounded
Our approach focused on developing comprehensive evaluation pipelines that addressed each of these aspects. We prioritized providing transparent, actionable feedback rather than opaque performance scores. Because our current implementation primarily uses retrieval-based agents, retrieval quality is a key metric in our evaluation framework. However, the framework is extensible—additional metrics, such as task completion rates and execution accuracy, can be incorporated for other types of agents.
Evaluation Framework Development
A key focus of our engagement was implementing hypothesis-driven experimentation, which required establishing a robust evaluation framework. This framework was built on high-quality evaluation data collected from multiple sources including subject matter experts (SMEs), existing system data, actual user queries, and supplementary synthetic data.
We prioritized creating evaluation datasets that genuinely represented real-world use cases rather than merely generating evaluation metrics. This approach ensured our improvements directly impacted business outcomes and user experience.
The data annotation process involved close collaboration with SMEs who provided critical annotations including:
- Query classification
- Appropriate agent selection
- Link/document retrieval for knowledge indexes
- Chunk retrieval within documents
- Reference answers
We treated evaluation data with the same rigor as production code. This included implementing versioning, comprehensive documentation, and detailed process tracking. This methodical approach prevents metric misinterpretation and ensures teams maintain context when analyzing results.
Fig 1: Comprehensive multi-agent evaluation framework – Our hierarchical approach evaluates routing, retrieval, and generation components to provide a complete assessment of system performance.
Hierarchical Schema for Evaluation Data
To structure our evaluation data effectively, we developed a simple hierarchical schema that follows our sequential evaluation process:
EvaluationData/
├── Queries/
│ ├── all_queries.json # All test queries
│ └── query_metadata.json # Domain, difficulty, query type metadata
│
├── AgentClassification/
│ ├── ground_truth_labels.json # Correct agent assignments for each query
│ ├── model_predictions.json # System's agent selections
│ └── classification_results.json # Accuracy, precision, recall, F1 metrics
│
├── Retrieval/ # Only evaluated for correctly classified queries
│ ├── document_retrieval/
│ │ ├── ground_truth_links.json # Expected documents/links
│ │ ├── retrieved_links.json # Actually retrieved documents
│ │ └── document_metrics.json # Precision@k, recall@k scores
│ │
│ └── chunk_retrieval/
│ ├── ground_truth_chunks.json # Expected content chunks
│ ├── retrieved_chunks.json # Actually retrieved chunks
│ └── chunk_metrics.json # Rouge scores, precision, recall
│
└── Generation/ # Final answer evaluation
├── reference_answers.json # SME-provided reference answers
├── system_responses.json # Generated answers
└── answer_metrics.json # Groundedness, similarity, factuality scores
This sequential schema ensures we:
- First evaluate if queries are routed to the correct agent(s)
- Only for correctly routed queries, measure document and chunk retrieval accuracy
- Finally assess answer quality based on groundedness and alignment with references
This approach prevents misattribution of errors and allows us to clearly identify which stage of the pipeline may need improvement.
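To make the gating concrete, here is a minimal sketch of the routing gate (field names such as gt_agents and pred_agents are illustrative, not our production schema): downstream retrieval and generation metrics are computed only for queries whose predicted agent set exactly matches the ground truth.

def routing_gate(queries):
    """Split queries into those that passed the routing stage (predicted agent set
    exactly matches ground truth) and those that did not. Retrieval and generation
    metrics are computed only on the passed subset."""
    passed, failed = [], []
    for q in queries:
        if set(q["gt_agents"]) == set(q["pred_agents"]):
            passed.append(q)
        else:
            failed.append(q)
    return passed, failed

# Example: only the first query proceeds to retrieval/generation evaluation
queries = [
    {"id": "1", "gt_agents": ["billing", "search"], "pred_agents": ["search", "billing"]},
    {"id": "2", "gt_agents": ["faq"], "pred_agents": ["billing"]},
]
passed, failed = routing_gate(queries)
print([q["id"] for q in passed], [q["id"] for q in failed])  # ['1'] ['2']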
Multi-Label Router Classification Evaluation Pipeline
User queries sent to a multi-agent application must first be routed to the relevant AI agent(s), which are then invoked to generate a response. This amounts to classifying each query into one or more categories, each aligned with an AI agent.
In our system, a router agent was responsible for deciding which AI agents to invoke for each user query. Because each agent corresponds to a label and a query may require several agents, evaluating the router is a multi-label classification problem: we assess whether the full set of required agents was identified.
Multi-Class vs. Multi-Label
- Multi-Class
- Each data point (query) is assigned exactly one label/class/agent
- Example: A router picks only one agent (e.g., knowledge base, generic LLM, or web search) for each query, even if more than one agent might apply
- This is typically simpler to implement and interpret because each query has a single “best” choice
- Multi-Label
- Each data point (query) can be assigned multiple labels/classes
- Example: A query could be routed to both “knowledge base” and “web search” if appropriate
- More challenging because you have to handle overlaps between classes, and your metrics (precision, recall, F1) become more complex in a multi-label setting
Generally, multi-label classification is harder than multi-class because you must learn not just to pick the correct single label, but to identify all correct labels simultaneously. This requires a more robust approach, especially to ensure that the router captures how labels/agents relate to each other.
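A small example makes the difference concrete. If the ground truth for a query is {knowledge base, web search} and the router predicts only {knowledge base}, exact-match accuracy scores the prediction as wrong, while label-level precision is 1.0 and recall is 0.5:

gt = {"knowledge_base", "web_search"}   # both agents are required
pred = {"knowledge_base"}               # the router picked only one

exact_match = gt == pred                                 # False: the full set must match
precision = len(gt & pred) / len(pred) if pred else 0.0  # 1.0: everything predicted was needed
recall = len(gt & pred) / len(gt) if gt else 0.0         # 0.5: only half of the required agents were found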
Our evaluation pipeline for multi-label classification serves as the critical entry point and primary accuracy bottleneck for the entire multi-agent system. If queries aren’t routed to the correct agents at this stage, even perfect downstream components will fail to produce accurate results.
Rather than providing just another “black box” metric like F1 score, our pipeline generates detailed diagnostic error analysis reports that illuminate exactly where and how the system is failing, enabling targeted improvements. Specifically, the pipeline evaluates how effectively the multi-agent router classifies each query. It compares predicted agent assignments against expert-labeled ground truth data to identify both strengths and areas for improvement.
Error Analysis Framework
The error analysis report serves as the cornerstone of our development process, addressing critical questions:
- Which specific query types consistently trigger incorrect agent selection?
- What patterns exist in queries where the system performs well vs. poorly?
- Are there particular agent combinations the system struggles to identify?
- Do specific keywords or phrasings correlate with routing errors?
- Which errors would have the highest business impact if fixed?
Fig 2: Error analysis driven development approach – Our evaluation pipeline focuses on detailed error analysis rather than single metrics, enabling iterative improvements through targeted insights.
Our pipeline provides more than standard metrics (precision, recall, F1):
- Per-query analysis showing exactly which agent assignments were correct and incorrect
- Detailed examination of multi-label cases where some but not all agents were correctly assigned
- Pattern recognition across error categories to identify systematic failure modes
- Performance breakdowns by query type, complexity, and domain
The analysis results include several key components:
- Dataset Statistics: General statistics about the overall dataset
- Missed Entries: Data queries with ground truth but no prediction entries
- Filtered Data: Any data queries excluded from evaluation
- Class Distribution: The distribution between different agent classes
- Multi-Label Data: Analysis of queries requiring multiple agents
- Incorrect Predictions: Breakdown of errors as partial matches or complete misses
- Overall Metrics: Exact match accuracy, average precision, recall, and F1-scores
- Per-Class Performance: Metrics broken down by agent type
Implementation and Example
We implemented this approach in a Python evaluation script that processes ground truth and prediction data to generate detailed reports. Here’s a simplified version of our implementation:
#!/usr/bin/env python3
"""
General Multi-Label Classification Evaluation with Detailed Analysis
Usage:
python eval_multilabel_full.py \
--gt ground_truth.json \
--pred predictions.json \
--outdir results/ \
--remove unknown outofscope
"""
import json
import argparse
from pathlib import Path
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
def load_json(path):
return json.loads(Path(path).read_text())
def norm(raw):
"""Normalize a raw label field into a list of lowercase labels."""
if raw is None:
return []
if isinstance(raw, str):
raw = [x.strip() for x in raw.split(',')]
if isinstance(raw, (list, tuple)):
return [x.lower() for x in raw]
return [str(raw).lower()]
def align(gt, pred):
"""Align GT and predictions by ID, return common IDs, missing, extra, and maps."""
gt_map = {d['id']: d for d in gt}
pred_map = {d['id']: d for d in pred}
common = sorted(set(gt_map) & set(pred_map))
missing = sorted(set(gt_map) - set(pred_map))
extra = sorted(set(pred_map) - set(gt_map))
return common, missing, extra, gt_map, pred_map
def filter_idxs(labels_list, remove):
"""Return indices of entries whose labels do not include any remove-categories."""
keep = []
for i, labs in enumerate(labels_list):
if not any(r in labs for r in remove):
keep.append(i)
return keep
def analyze(gt_raw, pred_raw, remove, outdir):
# Align by ID
common, missing, extra, gt_map, pred_map = align(gt_raw, pred_raw)
outdir = Path(outdir)
outdir.mkdir(parents=True, exist_ok=True)
# Extract & normalize labels
gt = [norm(gt_map[i].get('labels')) for i in common]
pr = [norm(pred_map[i].get('labels')) for i in common]
ids = list(common)
# Filter out unwanted categories
keep = filter_idxs(gt, remove)
gt = [gt[i] for i in keep]
pr = [pr[i] for i in keep]
ids = [ids[i] for i in keep]
# Build class distributions
def dist(lst):
cnt = Counter([l for row in lst for l in row])
total = sum(cnt.values())
return [
{'label': lab, 'count': cnt[lab], 'pct': cnt[lab]/total*100 if total else 0}
for lab, _ in cnt.most_common()
]
stats = {
'dataset': {
'gt_total': len(gt_raw),
'pred_total': len(pred_raw),
'common': len(common),
'missing': missing,
'extra': extra
},
'distribution': {
'gt': dist(gt),
'pred': dist(pr)
}
}
# Multi-label analysis
multi = [i for i, labs in enumerate(gt) if len(labs) > 1]
ml_exact = sum(1 for i in multi if set(gt[i]) == set(pr[i]))
# Single→multi cases
s2m = [i for i, labs in enumerate(gt) if len(labs) == 1 and len(pr[i]) > 1]
s2m_corr = sum(1 for i in s2m if gt[i][0] in pr[i])
# Collect incorrect predictions
incorrect = []
for i, (g0, p0) in enumerate(zip(gt, pr)):
if set(g0) != set(p0):
incorrect.append({
'id': ids[i],
'gt': g0,
'pred': p0,
'partial': bool(set(g0) & set(p0)),
'missed': [x for x in g0 if x not in p0],
'extra': [x for x in p0 if x not in g0],
})
# Compute multi-label metrics
all_classes = sorted({lab for row in gt for lab in row})
mlb = MultiLabelBinarizer(classes=all_classes)
y_true = mlb.fit_transform(gt)
y_pred = mlb.transform(pr)
report = classification_report(
y_true, y_pred,
target_names=mlb.classes_,
output_dict=True,
zero_division=0
)
exact_match_acc = sum(1 for g0, p0 in zip(gt, pr) if set(g0) == set(p0)) / len(gt) if gt else 0
# Save metrics.json
out_metrics = {
**stats,
'exact_match': exact_match_acc,
'report': report,
'multi_label': {
'total': len(multi),
'exact_match': ml_exact,
'accuracy': ml_exact / len(multi) if multi else None
},
'single_to_multi': {
'total': len(s2m),
'correct': s2m_corr,
'pct_correct': (s2m_corr / len(s2m) * 100) if s2m else None
},
'incorrect': incorrect
}
with open(outdir / 'metrics.json', 'w') as f:
json.dump(out_metrics, f, indent=2)
# Write human-readable report
rpt = outdir / 'analysis_report.txt'
with rpt.open('w') as f:
f.write("=== DATASET STATISTICS ===\n")
f.write(f"GT entries: {stats['dataset']['gt_total']}\n")
f.write(f"Pred entries: {stats['dataset']['pred_total']}\n")
f.write(f"Common IDs: {stats['dataset']['common']}\n")
f.write(f"Missing IDs: {len(missing)}\n")
f.write(f"Extra IDs: {len(extra)}\n\n")
f.write("=== CLASS DISTRIBUTIONS ===\n")
for side in ['gt', 'pred']:
f.write(f"-- {side.upper()} --\n")
for d in stats['distribution'][side]:
f.write(f" {d['label']}: {d['count']} ({d['pct']:.1f}%)\n")
f.write("\n")
f.write("=== OVERALL METRICS ===\n")
f.write(f"Exact-match accuracy: {exact_match_acc:.3f}\n")
f.write(f"Weighted F1-score: {report['weighted avg']['f1-score']:.3f}\n\n")
f.write("=== MULTI-LABEL ANALYSIS ===\n")
f.write(f"Total multi-label: {len(multi)}\n")
f.write(f"Exact matches: {ml_exact}\n\n")
f.write("=== SINGLE→MULTI INSPECTION ===\n")
f.write(f"Total single→multi: {len(s2m)}, correct includes: {s2m_corr}\n\n")
f.write("=== INCORRECT PREDICTIONS ===\n")
for e in incorrect:
f.write(
f"- ID {e['id']}: GT={e['gt']} PRED={e['pred']} "
f"partial={e['partial']} missed={e['missed']} extra={e['extra']}\n"
)
print(f"Metrics and report saved to: {outdir}")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Multi-label classification evaluation with detailed analysis"
    )
    parser.add_argument("--gt", required=True, help="Path to ground truth JSON")
    parser.add_argument("--pred", required=True, help="Path to predictions JSON")
    parser.add_argument("--outdir", default="results", help="Directory for metrics.json and analysis_report.txt")
    parser.add_argument("--remove", nargs="*", default=[], help="Label categories to exclude (e.g., unknown outofscope)")
    args = parser.parse_args()
    analyze(load_json(args.gt), load_json(args.pred), [r.lower() for r in args.remove], args.outdir)
Example Usage
Below are example JSON files that demonstrate how to use the evaluation script:
ground_truth.json
[
{ "id": "1", "labels": ["billing", "search"] },
{ "id": "2", "labels": ["faq"] },
{ "id": "3", "labels": ["unknown"] },
{ "id": "4", "labels": ["faq", "billing"] }
]
predictions.json
[
{ "id": "1", "labels": ["search", "billing"] },
{ "id": "2", "labels": ["faq"] },
{ "id": "3", "labels": ["fileupload"] },
{ "id": "4", "labels": ["faq"] }
]
Place these files alongside your script, then run:
python eval_multilabel_full.py --gt ground_truth.json --pred predictions.json --outdir my_results
Your my_results/metrics.json and my_results/analysis_report.txt will contain the full diagnostics that enable teams to identify and address specific issues in the multi-agent routing system.
The weighted average F1-score provides a good overall performance tracker, while per-class metrics help identify which query types need improvement. This approach enabled our teams to iteratively improve routing accuracy from the initial 70% range to the high 80% range by focusing on the most impactful error categories.
Retrieval and Generation Metrics
Once we’ve established that queries are correctly routed to the appropriate agents, we need to evaluate two critical downstream components: retrieval accuracy and answer generation quality. These evaluations only proceed for queries that were correctly classified in the routing stage.
Retrieval Evaluation
For agents that retrieve information before generating answers, we implemented two levels of retrieval evaluation:
- Document-Level Retrieval Metrics:
- Precision@k (P@k): Proportion of retrieved documents that are relevant
- Recall@k (R@k): Proportion of relevant documents that are retrieved
- F1@k: Harmonic mean of precision and recall at k documents
We used these metrics to evaluate how effectively agents found the correct documents or knowledge sources/links for each query. These evaluations were applied to both our web search agent and our internal knowledge retrieval agent that queried the enterprise knowledge index.
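As a minimal, self-contained sketch (not our production code), document-level precision@k, recall@k, and F1@k can be computed directly from the set of expected links and the ranked list of retrieved links:

def precision_recall_at_k(relevant, retrieved, k):
    """relevant: set of ground-truth document IDs/links for a query.
    retrieved: ranked list of retrieved document IDs/links.
    Returns (precision@k, recall@k, f1@k)."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: two of the top-3 retrieved links appear in the ground truth
p, r, f1 = precision_recall_at_k({"doc_a", "doc_b", "doc_c"}, ["doc_a", "doc_x", "doc_b"], k=3)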
- Chunk-Level Retrieval Metrics:
- Rouge-L scores measuring overlap between retrieved text chunks and ground truth chunks
- Semantic similarity between retrieved chunks and reference chunks using embedding models
- Coverage assessment evaluating whether retrieved chunks contain all necessary information
This approach helped us identify whether retrieval errors were occurring at the document selection stage or during chunk extraction. This insight provided valuable feedback to improve our indexing and chunking approaches while creating the knowledge index.
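For the chunk-level checks, one possible implementation (an illustrative sketch assuming the rouge-score and sentence-transformers packages, not necessarily the exact libraries we used) combines ROUGE-L overlap with embedding cosine similarity:

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def chunk_scores(ground_truth_chunk: str, retrieved_chunk: str):
    """Lexical overlap (ROUGE-L F1) plus embedding cosine similarity for one chunk pair."""
    rouge_l = scorer.score(ground_truth_chunk, retrieved_chunk)["rougeL"].fmeasure
    emb = model.encode([ground_truth_chunk, retrieved_chunk], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()
    return {"rougeL_f1": rouge_l, "cosine_similarity": cosine}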
Answer Generation Evaluation
For the final generated responses, we developed a multifaceted evaluation approach:
- Reference-Based Metrics:
- Lexical similarity using Rouge, BLEU, and GLEU scores to measure word and n-gram overlap
- Semantic similarity using embedding-based models (BERTScore, SentenceBERT) to capture meaning beyond exact matches
- Structure and style assessment comparing formatting, citations, and organizational elements
- Reference-Free Metrics:
- Groundedness evaluation ensuring answers don’t contain hallucinated information
- Factual correctness assessment using frameworks like RAGAS to verify claims
- Answer relevancy scoring to determine if responses address the original query
These comprehensive metrics gave us a holistic view of answer quality and identified specific improvement areas without relying on single aggregate scores.
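For the reference-based side, a minimal sketch might look like the following, using NLTK for BLEU/GLEU and the bert-score package for semantic similarity (illustrative library choices, not a definitive implementation):

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.gleu_score import sentence_gleu
from bert_score import score as bert_score

def reference_based_scores(reference: str, response: str):
    """Lexical overlap (BLEU, GLEU) plus semantic similarity (BERTScore F1) against an SME reference."""
    ref_tokens, resp_tokens = reference.split(), response.split()
    bleu = sentence_bleu([ref_tokens], resp_tokens)
    gleu = sentence_gleu([ref_tokens], resp_tokens)
    _, _, f1 = bert_score([response], [reference], lang="en")
    return {"bleu": bleu, "gleu": gleu, "bertscore_f1": f1.item()}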
Implementation Details
For generation evaluation, we used a combination of:
- Custom metrics calculated directly on the response texts
- Off-the-shelf libraries like RAGAS for automated assessment
- Human evaluation for a subset of responses to calibrate automated metrics
We found that factuality and groundedness were particularly important for enterprise applications, where incorrect information poses significant risks. Our evaluation framework prioritized these aspects alongside traditional metrics like relevance and completeness.
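As one concrete option for the reference-free checks, a RAGAS-based sketch might look like the following (shown roughly against the ragas 0.1.x interface; the API and metric names have changed across versions, and an LLM backend such as an OpenAI key must be configured, so treat this as a starting point rather than our exact implementation):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# faithfulness approximates groundedness in the retrieved contexts;
# answer_relevancy checks whether the response addresses the original query
data = Dataset.from_dict({
    "question": ["How do I reset my billing password?"],
    "answer": ["You can reset it from Account Settings > Security."],
    "contexts": [["To reset your billing password, open Account Settings and choose Security."]],
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)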
For reference-based evaluations, we worked with SMEs to develop gold-standard reference answers. We also implemented embedding-based similarity scores to assess semantic alignment between system responses and reference answers.
Additionally, we incorporated reference-free evaluators focusing on factuality and relevancy, providing multiple implementation options that the team could select based on specific use cases.
Evaluation Complexities
Web Search Evaluation Challenges
When evaluating web search accuracy, we encountered a significant challenge: web links change over time, and equivalent or better content might be available at different URLs than those in our ground truth data. For web search agents, we shifted focus from exact URL matching to chunk-based retrieval evaluation and answer accuracy metrics.
This approach acknowledges the dynamic nature of the web while still ensuring the agent retrieves high-quality, relevant information regardless of its specific source URL.
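One way to operationalize this, sketched here with illustrative names rather than the exact logic we shipped, is to accept a web result when its URL still matches loosely or when its content is sufficiently similar to the reference chunk:

from urllib.parse import urlparse

def url_match(gt_url: str, retrieved_url: str) -> bool:
    """Loose match: same domain and path, ignoring scheme and query parameters."""
    a, b = urlparse(gt_url), urlparse(retrieved_url)
    return (a.netloc, a.path.rstrip("/")) == (b.netloc, b.path.rstrip("/"))

def web_result_is_acceptable(gt_url, retrieved_url, gt_chunk, retrieved_chunk,
                             content_similarity, threshold=0.8):
    """Accept the result if the URL matches loosely OR the retrieved content scores
    above a similarity threshold against the reference chunk; content_similarity is
    any callable returning a 0-1 score (e.g., embedding cosine similarity)."""
    if url_match(gt_url, retrieved_url):
        return True
    return content_similarity(gt_chunk, retrieved_chunk) >= threshold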
SME Bias and Inter-Annotator Agreement
Another critical insight involved potential bias in SME-prepared reference answers. To mitigate this, we proposed implementing:
- Inter-annotator agreement protocols requiring multiple SMEs to review each reference answer (a simple agreement check is sketched after this list)
- Multiple review cycles to ensure consensus
- Clear guidelines for what constitutes a good answer in each domain
- Clear annotation criteria and protocols for evaluation data, shared with SMEs before labeling begins
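To quantify agreement on multi-label annotations, a simple starting point (a sketch only; formal statistics such as Cohen's kappa or Krippendorff's alpha can be layered on top) is the average Jaccard overlap between two annotators' label sets:

def mean_jaccard_agreement(annotator_a, annotator_b):
    """annotator_a / annotator_b: dicts mapping query ID -> set of labels.
    Returns the average Jaccard overlap over the queries both SMEs annotated."""
    shared_ids = set(annotator_a) & set(annotator_b)
    if not shared_ids:
        return None
    scores = []
    for qid in shared_ids:
        a, b = set(annotator_a[qid]), set(annotator_b[qid])
        scores.append(len(a & b) / len(a | b) if (a | b) else 1.0)
    return sum(scores) / len(scores)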
Domain-Specific Evaluation Design
We recognized that effective evaluation must be domain-centric, not AI-centric. This core principle guided our entire approach: the people who will use and benefit from the system should determine how it’s evaluated.
In each domain, we consistently placed subject matter experts (SMEs) at the center of our evaluation frameworks. Rather than imposing metrics defined by AI researchers, we collaborated with domain experts to establish criteria that reflected real-world requirements and usage contexts.
This approach ensured that:
- Evaluation criteria aligned with actual business needs and domain-specific requirements
- Systems were measured on metrics that mattered to end-users
- Results were interpretable and actionable for stakeholders without AI expertise
For every agent in our multi-agent system, we partnered with the respective domain SMEs to define appropriate success metrics. This centered the actual users’ needs rather than emphasizing abstract AI metrics that might not translate to real-world value.
Conclusion
Developing effective evaluation frameworks for multi-agent LLM systems requires:
- A comprehensive approach that addresses routing, retrieval, and generation
- Collaboration with domain experts to establish relevant evaluation criteria
- Detailed error analysis that goes beyond simple accuracy metrics
- Recognition of domain-specific nuances and challenges
- Iterative improvement based on insights from evaluation reports
By implementing these principles, we were able to create evaluation pipelines that provided actionable insights and drove significant improvements in our multi-agent system performance.
The future of AI evaluation must continue moving toward domain-expert-driven approaches rather than AI-centric metrics. Only by centering the actual users and use cases can we ensure our systems deliver real-world value.
Acknowledgements
Special thanks to the ISE crew—Juan, Jarre, Abdo, Court, Paul, Nidhi, and Ankur—for their valuable contributions to this project.