[observability][evaluation01] AIMon's LlamaIndex Extension for LLM Response Evaluation

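Case Objectives

This case shows how to use AIMon's evaluators to assess the quality and accuracy of responses generated by a large language model (LLM) within the LlamaIndex framework. The main goals are:

Demonstrate how AIMon's hallucination evaluator identifies generated information that the context does not support
Show how the guideline evaluator checks that the model's response follows predefined instructions and guidelines
Introduce the context relevance evaluator, which assesses how relevant and accurate the supplied context is in supporting the model's response
Provide a complete evaluation workflow for a RAG (retrieval-augmented generation) application

Note: this case focuses specifically on evaluating a RAG application with the hallucination, guideline, and context relevance evaluators.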

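Tech Stack and Core Dependencies

This case uses the following tech stack and core dependencies:

LlamaIndex, AIMon, OpenAI API, datasets, requests

Main packages:

pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai

Core components:

AIMon evaluators: the hallucination evaluator, the guideline evaluator, and the context relevance evaluator
LlamaIndex: the core framework used to build the RAG application
OpenAI models: gpt-4o-mini as the LLM and text-embedding-3-small as the embedding model
MeetingBank dataset: a dataset of meeting transcripts used as the context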

Environment Setup

1. Install dependencies

%%capture
!pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai

2. Configure API keys

import os
import json
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

Important: both OPENAI_API_KEY and AIMON_API_KEY must be stored in Google Colab secrets, and the notebook must be granted access to them. The AIMon API key is obtained from AIMon.
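Outside Google Colab, `userdata` is not available, so the keys have to come from somewhere else. The following is a minimal sketch that reads them from environment variables and fails early when one is missing. The helper names `require_env` and `aimon_auth_header` are hypothetical and are not part of AIMon or LlamaIndex; only the "Bearer <key>" header format is taken from the client setup shown later in this case.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def aimon_auth_header(api_key: str) -> str:
    """Build the 'Bearer <key>' string in the format used for the AIMon client."""
    return f"Bearer {api_key}"
```

With these helpers, `aimon_auth_header(require_env("AIMON_API_KEY"))` produces the header string, and a missing key surfaces as a clear error before any API call is made.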

3. Load the dataset

from datasets import load_dataset
meetingbank = load_dataset("huuuyeah/meetingbank")

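Implementation

1. Data preparation

Extract meeting transcripts from the MeetingBank dataset and convert them into LlamaIndex Document objects:

from llama_index.core import Document

def extract_and_create_documents(transcripts):
    """Wrap each transcript string in a LlamaIndex Document."""
    documents = []
    for transcript in transcripts:
        try:
            doc = Document(text=transcript)
            documents.append(doc)
        except Exception as e:
            # Skip records that cannot be converted, but report the reason
            print(f"Failed to create document: {e}")
    return documents

transcripts = [meeting["transcript"] for meeting in meetingbank["train"]]
documents = extract_and_create_documents(transcripts[:5])  # Use only 5 transcripts to keep the example short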
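Slicing the first five transcripts assumes that every record has a usable "transcript" field. As a hedged sketch, the selection step could skip empty or missing transcripts before any Document is created. The function `select_transcripts` and its parameters are hypothetical, and the records are treated as plain dictionaries.

```python
def select_transcripts(records, limit=5, min_chars=1):
    """Return up to `limit` transcript strings, skipping missing or blank ones."""
    selected = []
    for record in records:
        if len(selected) >= limit:
            break
        text = record.get("transcript")
        # Keep only real strings with at least `min_chars` non-whitespace characters
        if isinstance(text, str) and len(text.strip()) >= min_chars:
            selected.append(text)
    return selected
```

The returned list can then be passed to `extract_and_create_documents` in place of `transcripts[:5]`.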

2. Build the vector index

Set up the embedding model and generate the document embeddings:

from llama_index.embeddings.openai import OpenAIEmbedding
from aimon_llamaindex import generate_embeddings_for_docs, build_index, build_retriever

embedding_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100,
    max_retries=3
)
nodes = generate_embeddings_for_docs(documents, embedding_model)
index = build_index(nodes)
retriever = build_retriever(index, similarity_top_k=5)
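To make the `similarity_top_k=5` setting more concrete, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. It only illustrates the general idea of ranking stored vectors against a query vector. It is an assumption that the retriever behaves along these lines; the internals of `build_retriever` are not shown in this case, and the function names below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (index, score) pairs with the highest similarity to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors standing in for real embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.0], doc_vecs, k=2)
```

In this toy example the first vector matches the query exactly and the third is the next closest, so those two are the ones returned.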

3. Configure the LLM

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    temperature=0.4,
    system_prompt="""
Please be professional and polite. Answer the user's question in a single line. Even if the context lacks information to answer the question, make sure that you answer the user's question based on your own knowledge.
""",
)

4. Define the query and instructions

user_query = "Which council bills were amended for zoning regulations?"
user_instructions = [
    "Keep the response concise, preferably under the 100 word limit."
]
# Append the user instructions to the LLM's system prompt
llm.system_prompt += (
    f"Please comply to the following instructions {user_instructions}."
)
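Interpolating the list directly into the f-string places its Python representation, including brackets and quotes, into the prompt. As an alternative sketch, the instructions could be rendered as a numbered list first. The helpers `format_instructions` and `extend_system_prompt` are hypothetical and are not part of LlamaIndex or AIMon; changing the prompt wording may also change the model's response.

```python
def format_instructions(instructions):
    """Render a list of instruction strings as a numbered block of text."""
    lines = [f"{i}. {text.strip()}" for i, text in enumerate(instructions, start=1)]
    return "Please comply with the following instructions:\n" + "\n".join(lines)


def extend_system_prompt(base_prompt, instructions):
    """Append the formatted instructions to an existing system prompt."""
    return base_prompt.rstrip() + "\n" + format_instructions(instructions)


user_instructions = [
    "Keep the response concise, preferably under the 100 word limit."
]
prompt = extend_system_prompt("Please be professional and polite.", user_instructions)
```

The resulting string could then be assigned to `llm.system_prompt` instead of using the `+=` form above.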

5. Get the response

from aimon_llamaindex import get_response
llm_response = get_response(user_query, retriever, llm)

6. Configure the AIMon client

from aimon import Client

aimon_client = Client(
    auth_header="Bearer {}".format(userdata.get("AIMON_API_KEY"))
)

7. Run the evaluations

Guideline evaluation:

from aimon_llamaindex.evaluators import GuidelineEvaluator

guideline_evaluator = GuidelineEvaluator(aimon_client)
evaluation_result = guideline_evaluator.evaluate(
    user_query,
    llm_response,
    user_instructions
)

Hallucination evaluation:

from aimon_llamaindex.evaluators import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(aimon_client)
evaluation_result = hallucination_evaluator.evaluate(user_query, llm_response)

Context relevance evaluation:

from aimon_llamaindex.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator(aimon_client)
task_definition = (
    "Find the relevance of the context data used to generate this response."
)
evaluation_result = evaluator.evaluate(
    user_query,
    llm_response,
    task_definition
)

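Results

Guideline evaluation result

The guideline evaluator checks whether the model's response follows the instructions supplied by the user. The evaluation returns: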

{
    "extractions": [],
    "instructions_list": [
        {
            "explanation": "",
            "follow_probability": 0.982,
            "instruction": "Keep the response concise, preferably under the 100 word limit.",
            "label": true
        }
    ],
    "score": 1.0
}

The score of 1.0 indicates that the model fully followed the instruction to keep the response concise.
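When several instructions are evaluated at once, it helps to list only the ones that were not followed. The sketch below assumes the result can be handled as a dictionary shaped like the JSON above. The function `unmet_instructions` and the 0.5 probability cutoff are hypothetical choices, not values defined by AIMon.

```python
import json


def unmet_instructions(result, min_probability=0.5):
    """Return instructions labeled as not followed or below the probability cutoff."""
    unmet = []
    for item in result.get("instructions_list", []):
        followed = bool(item.get("label", False))
        probability = item.get("follow_probability", 0.0)
        if not followed or probability < min_probability:
            unmet.append(item["instruction"])
    return unmet


# The guideline result shown above, parsed from JSON
guideline_result = json.loads("""
{
    "extractions": [],
    "instructions_list": [
        {
            "explanation": "",
            "follow_probability": 0.982,
            "instruction": "Keep the response concise, preferably under the 100 word limit.",
            "label": true
        }
    ],
    "score": 1.0
}
""")
unmet = unmet_instructions(guideline_result)
```

For the result in this case the list is empty, which matches the overall score of 1.0.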

Hallucination evaluation result

The hallucination evaluator identifies generated information that the context does not support:

{
    "is_hallucinated": "False",
    "score": 0.22446,
    "sentences": [
        {
            "score": 0.22446,
            "text": "The council bills amended for zoning regulations include the small lot moratorium and the text amendment related to off-street parking exemptions for preexisting small lots. These amendments aim to balance the interests of local neighborhoods, health institutions, and developers."
        }
    ]
}

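The hallucination score is 0.22446 (on a scale of 0.0 to 1.0), which indicates that the response contains little hallucinated content and that its information is relatively reliable.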
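Two details of this output are easy to mishandle in code: `is_hallucinated` is the string "False" rather than a boolean, and each sentence carries its own score. The following sketch, which assumes the result is a dictionary shaped like the JSON above, parses the flag safely and lists sentences at or above a cutoff. The function names and the 0.5 threshold are hypothetical, not AIMon defaults.

```python
def is_hallucinated(result):
    """Interpret the flag whether it arrives as a bool or as a 'True'/'False' string."""
    value = result.get("is_hallucinated", False)
    if isinstance(value, str):
        # bool("False") would be True, so compare the text explicitly
        return value.strip().lower() == "true"
    return bool(value)


def flag_sentences(result, threshold=0.5):
    """Return the text of every sentence whose score is at or above the threshold."""
    return [
        sentence["text"]
        for sentence in result.get("sentences", [])
        if sentence.get("score", 0.0) >= threshold
    ]


# The hallucination result shown above
hallucination_result = {
    "is_hallucinated": "False",
    "score": 0.22446,
    "sentences": [
        {
            "score": 0.22446,
            "text": (
                "The council bills amended for zoning regulations include the small lot "
                "moratorium and the text amendment related to off-street parking exemptions "
                "for preexisting small lots. These amendments aim to balance the interests "
                "of local neighborhoods, health institutions, and developers."
            ),
        }
    ],
}
flagged = flag_sentences(hallucination_result)
```

With the assumed cutoff of 0.5 nothing is flagged for this response; lowering the threshold makes the check stricter.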

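Context relevance evaluation result

The context relevance evaluator assesses how relevant the context data used to generate the response is: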

[
    {
        "explanations": [
            "Document 1 discusses a council bill related to zoning regulations, specifically mentioning a text amendment that aims to balance neighborhood interests with developer needs. However, it primarily focuses on parking issues and personal experiences rather than detailing specific zoning regulation amendments or the council bills directly related to them, which makes it less relevant to the query.",
            "Document 2 mentions zoning and development issues, including the need for mass transit and affordability, but it does not provide specific information on which council bills were amended for zoning regulations...",
            // ... explanations for the remaining documents
        ],
        "query": "Which council bills were amended for zoning regulations?",
        "relevance_scores": [
            40.5,
            40.25,
            44.25,
            38.5,
            43.0
        ]
    }
]

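The evaluation returns a relevance score and an explanation for each document, which helps the user see how closely the context data relates to the query.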
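The per-document scores become easier to read once they are ranked. This is a minimal sketch that pairs each score with a 1-based document number, matching the "Document 1", "Document 2" naming in the explanations, and sorts from most to least relevant. The function `rank_documents` is hypothetical.

```python
from statistics import mean


def rank_documents(relevance_scores):
    """Return (document_number, score) pairs sorted from most to least relevant."""
    numbered = enumerate(relevance_scores, start=1)
    return sorted(numbered, key=lambda pair: pair[1], reverse=True)


# The relevance scores reported above
relevance_scores = [40.5, 40.25, 44.25, 38.5, 43.0]
ranking = rank_documents(relevance_scores)
average_score = mean(relevance_scores)
```

Here document 3 ranks first and document 4 last. The scores sit within a fairly narrow band, which is consistent with the explanations describing the documents as only partly relevant to the query.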

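Implementation Approach

The implementation follows these steps:

Environment preparation: install the required libraries and configure the API keys so that the OpenAI and AIMon services are accessible.
Data preparation: load meeting transcripts from the MeetingBank dataset and convert them into Document objects that LlamaIndex can process.
Vector index construction: generate vector representations of the documents with the OpenAI embedding model and build a vector index to support efficient retrieval.
LLM configuration: set up the OpenAI gpt-4o-mini model and configure the system prompt and user instructions so that the model responds as required.
Response generation: use the retriever and the LLM to generate a response to the user query.
Evaluation setup: configure the AIMon client in preparation for using the evaluators.
Multi-dimensional evaluation: evaluate the generated response with the guideline, hallucination, and context relevance evaluators.
Result analysis: analyze the evaluation results to understand the quality and reliability of the model's response.

Key idea: evaluating along several dimensions gives a fuller picture of the RAG system's performance, exposes likely problems, and guides optimization.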

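Suggested Extensions

1. Add evaluation dimensions

Besides the three evaluators used in this case, AIMon provides other evaluators that are worth considering:

Completeness evaluator: checks whether the response fully addresses every aspect of the query or task
Conciseness evaluator: assesses whether the response is brief yet complete, without unnecessary verbosity
Toxicity evaluator: flags harmful, offensive, or inappropriate language in the response

2. Batch evaluation

Extend the case to evaluate multiple queries and responses in a batch, which gives a more complete view of system performance:

# Batch evaluation example
queries = ["Query 1", "Query 2", "Query 3"]
results = []
for query in queries:
    response = get_response(query, retriever, llm)
    guideline_result = guideline_evaluator.evaluate(query, response, user_instructions)
    hallucination_result = hallucination_evaluator.evaluate(query, response)
    # `evaluator` is the ContextRelevanceEvaluator created in step 7
    context_result = evaluator.evaluate(query, response, task_definition)
    results.append({
        "query": query,
        "response": response,
        "guideline_score": guideline_result["score"],
        "hallucination_score": hallucination_result["score"],
        "context_relevance": context_result
    })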
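Once the batch loop has filled `results`, the per-query scores can be condensed into a short summary. The sketch below assumes each entry holds numeric "guideline_score" and "hallucination_score" values, as in the loop above. The function `summarize` and the sample values are hypothetical and serve only to show the shape of the data.

```python
from statistics import mean


def summarize(results):
    """Average the guideline and hallucination scores over a batch of results."""
    if not results:
        return {"count": 0, "mean_guideline_score": None, "mean_hallucination_score": None}
    return {
        "count": len(results),
        "mean_guideline_score": mean(r["guideline_score"] for r in results),
        "mean_hallucination_score": mean(r["hallucination_score"] for r in results),
    }


# Hypothetical values, not real evaluation output
sample_results = [
    {"query": "Query 1", "guideline_score": 1.0, "hallucination_score": 0.25},
    {"query": "Query 2", "guideline_score": 0.5, "hallucination_score": 0.75},
]
summary = summarize(sample_results)
```

A summary like this makes it easier to compare runs after changing the prompt, the retriever settings, or the model.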