Paper Share ➲ arXiv 2026 | H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

📄 Paper • 🤗 Dataset • 🏆 Leaderboard • 🌐 Project Page • 💻 Code

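I. Why Do We Need H2HMem?

In recent years, agents have expanded beyond chatbots into new scenarios:

🏥 Medical record assistants (listening in on doctors' conversations)
💼 AI meeting-minutes tools
🎓 Classroom teaching assistants
🧑‍🤝‍🧑 Multi-person conversation analysis systems

The agent's role differs between Human-Assistant Interaction and Human-Human Interaction: in the former it is a participant in the conversation, while in the latter it observes a conversation between people.

The key capability in Human-Human Interaction scenarios:

👉 Continuously remembering, understanding, and using information across complex human conversations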

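❌ Limitations of Existing Memory Benchmarks

Most benchmarks are built around:

Human ↔ AI dialogue
Single-user interaction
A single modality, or only weak multimodality
No complex speaker structure

In Human-Human Interaction scenarios, by contrast:

❗ The AI is "overhearing a human conversation" rather than taking part in it directly

The paper therefore proposes a new benchmark:

🧠 H2HMem Benchmark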

II. What Is H2HMem?

H2HMem (Human-to-Human Multimodal Memory Benchmark) is a benchmark for evaluating:

👉 The long-term memory of multimodal agents in conversations between people

🎯 Key Features

H2HMem combines:

🧑‍🤝‍🧑 Human-Human conversation (dyadic and multi-party)
🖼️ Multimodal data (text + images)
🔁 Multi-session long-term memory
🧠 Memory reasoning + retrieval + application

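III. Dataset Design

📌 Dataset Scale

20 dyadic conversations
5 multi-party conversations
300+ sessions
7000+ dialogue rounds
1000+ images
2000+ QA pairs

📌 Conversation Structure

Each conversation contains:

Multiple sessions (spread over time)
Multiple topics (e.g. travel / food / shopping)
Multimodal input (images + text)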
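
The structure above can be pictured as a small set of nested records. This is a minimal sketch only: the class and field names (`Turn`, `Session`, `Conversation`, `image_path`) and the sample data are assumptions made for illustration, not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One dialogue round: who spoke, what they said, and an optional image."""
    speaker: str
    text: str
    image_path: Optional[str] = None  # hypothetical field: path to an attached image


@dataclass
class Session:
    """One sitting of a conversation; sessions are ordered in time."""
    session_id: int
    topic: str  # e.g. "travel", "food", "shopping"
    turns: List[Turn] = field(default_factory=list)


@dataclass
class Conversation:
    """A dyadic or multi-party conversation spanning several sessions."""
    participants: List[str]
    sessions: List[Session] = field(default_factory=list)

    def is_multi_party(self) -> bool:
        # More than two human speakers means a multi-party conversation.
        return len(self.participants) > 2

    def image_turns(self) -> List[Turn]:
        # Collect every turn that carries an image, across all sessions.
        return [t for s in self.sessions for t in s.turns if t.image_path]


# Made-up sample data, for illustration only.
conv = Conversation(
    participants=["Alice", "Bob", "Carol"],
    sessions=[
        Session(1, "travel", [
            Turn("Alice", "Look at this beach!", image_path="beach.jpg"),
            Turn("Bob", "Where was that taken?"),
        ]),
        Session(2, "food", [Turn("Carol", "I tried the noodle place you mentioned.")]),
    ],
)
print(conv.is_multi_party(), len(conv.image_turns()))  # True 1
```

Keeping the speaker on every turn and the session on every group of turns is what lets later questions ask "who said this, and when?".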

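📌 Construction Pipeline (Important)

The paper uses a human-in-the-loop pipeline in which the human acts as director and the LLM as scriptwriter:

Persona generation (character profiles)
Scene and topic planning
Image collection and verification
Dialogue generation (LLM + image captions)
Automatic QA generation + human verification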

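IV. Task Design (Core Innovation)

H2HMem organizes memory evaluation into 9 task types:

🧠 Memory Recall

1. UPR (Basic Recall)

Recalling simple facts

2. CRR (Cross-modal Retrieval)

Retrieval that combines images and text

3. KR (Knowledge Resolution)

Handling updated or conflicting information

🧠 Memory Reasoning

4. MCR (Multimodal Reasoning)

Joint reasoning over images and text

5. RET (Reference Tracking)

Resolving references such as "this" / "that"

6. TR (Temporal Reasoning)

Understanding temporal order

🧠 Memory Application

7. TTL (Test-time Learning)

Using memory to solve new problems

8. CD (Conflict Detection)

Judging whether pieces of information conflict

9. AR (Answer Refusal)

Refusing to answer when the information does not exist

👉 The key significance of this design:

It no longer tests only "did the model remember?" but "understanding + alignment + reasoning + updating"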
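
The three-group taxonomy above can be written down as a simple lookup table, which is handy when reporting scores per group. This is a sketch under assumptions: the names `Category`, `TASKS`, `tasks_in`, and `mean_by_category` are hypothetical, and the unweighted per-category mean is an illustrative aggregation, not necessarily the metric used in the paper.

```python
from enum import Enum


class Category(Enum):
    RECALL = "Memory Recall"
    REASONING = "Memory Reasoning"
    APPLICATION = "Memory Application"


# Task abbreviation -> (category, short description), as listed in the post.
TASKS = {
    "UPR": (Category.RECALL, "basic recall of simple facts"),
    "CRR": (Category.RECALL, "cross-modal retrieval over images and text"),
    "KR": (Category.RECALL, "resolving updated or conflicting information"),
    "MCR": (Category.REASONING, "joint reasoning over images and text"),
    "RET": (Category.REASONING, "resolving references such as 'this' / 'that'"),
    "TR": (Category.REASONING, "understanding temporal order"),
    "TTL": (Category.APPLICATION, "using memory to solve new problems"),
    "CD": (Category.APPLICATION, "judging whether information conflicts"),
    "AR": (Category.APPLICATION, "refusing to answer when information is absent"),
}


def tasks_in(category: Category) -> list:
    """Return the task abbreviations that belong to one category."""
    return [name for name, (cat, _) in TASKS.items() if cat is category]


def mean_by_category(scores: dict) -> dict:
    """Average per-task scores within each category (assumed, unweighted)."""
    out = {}
    for category in Category:
        vals = [scores[t] for t in tasks_in(category) if t in scores]
        if vals:
            out[category.value] = sum(vals) / len(vals)
    return out


print(tasks_in(Category.REASONING))  # ['MCR', 'RET', 'TR']
```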

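V. Experimental Results and Findings

The experiments in the paper lead to several key findings:

❗ 1. Multi-party conversations are significantly harder

In the multi-party setting:

KR performance falls from 0.49 to 0.25
Performance drops substantially

👉 Takeaway: multi-party interaction severely disrupts the memory system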
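
To put the KR numbers above in perspective, the drop can be expressed relative to the starting score. The helper name `relative_drop` is an assumption for illustration; only the two scores come from the post.

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction of the original score that was lost."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


drop = relative_drop(0.49, 0.25)
print(f"absolute drop: {0.49 - 0.25:.2f}, relative drop: {drop:.0%}")
# absolute drop: 0.24, relative drop: 49%
```

In other words, roughly half of the KR score is lost in the multi-party setting.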

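❗ 2. The biggest problem is not "failing to remember" but "failing to align"

Errors mainly come from:

🖼️ Modal misalignment (images and text are matched incorrectly)
👤 Speaker attribution error (confusion about who said what)

👉 Models often:

Remember the content but do not know who said it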
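
The speaker attribution failure described above is easy to reproduce with a toy memory store. The sketch below is not the paper's method or any evaluated system: `MemoryEntry`, `search`, and the sample utterances are all hypothetical, and keyword matching stands in for whatever retriever a real system would use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryEntry:
    session_id: int
    speaker: str
    text: str


def search(memory: List[MemoryEntry], keyword: str,
           speaker: Optional[str] = None) -> List[MemoryEntry]:
    """Keyword match, optionally restricted to one speaker."""
    hits = [m for m in memory if keyword.lower() in m.text.lower()]
    if speaker is not None:
        hits = [m for m in hits if m.speaker == speaker]
    return hits


# Made-up utterances, for illustration only.
memory = [
    MemoryEntry(1, "Alice", "I am allergic to peanuts."),
    MemoryEntry(1, "Bob", "I love peanuts on my noodles."),
    MemoryEntry(2, "Carol", "Let's book the Thai restaurant."),
]

# Content-only lookup finds both statements and cannot say whose allergy it is.
print(len(search(memory, "peanuts")))  # 2
# Speaker-aware lookup keeps the attribution attached to the fact.
print(search(memory, "peanuts", speaker="Alice")[0].text)  # I am allergic to peanuts.
```

A store that drops the `speaker` field would still "remember" both peanut statements, yet could not answer who is allergic.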

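❗ 3. Retrieval is not the same as understanding

Although models can retrieve information, they cannot:

Filter out noise
Understand how pieces of context relate to each other
Handle conflicting information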
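
One way to see why retrieval alone is insufficient: a retriever can return both an outdated and an updated statement, and something still has to decide which one holds. The sketch below uses an assumed "latest session wins" policy purely for illustration. The `Fact` record, the `resolve` function, and the policy itself are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fact:
    session_id: int
    speaker: str
    attribute: str
    value: str


def resolve(facts: List[Fact]) -> Tuple[Dict[Tuple[str, str], Fact],
                                        List[Tuple[Fact, Fact]]]:
    """Keep the most recent fact per (speaker, attribute); report overwritten ones."""
    latest: Dict[Tuple[str, str], Fact] = {}
    conflicts: List[Tuple[Fact, Fact]] = []
    # Process in time order so that later sessions overwrite earlier ones.
    for fact in sorted(facts, key=lambda f: f.session_id):
        key = (fact.speaker, fact.attribute)
        previous = latest.get(key)
        if previous is not None and previous.value != fact.value:
            conflicts.append((previous, fact))
        latest[key] = fact
    return latest, conflicts


# Made-up retrieved facts, deliberately out of order.
retrieved = [
    Fact(3, "Alice", "city", "Berlin"),
    Fact(1, "Alice", "city", "Paris"),
    Fact(2, "Bob", "city", "Paris"),
]
latest, conflicts = resolve(retrieved)
print(latest[("Alice", "city")].value, len(conflicts))  # Berlin 1
```

Note that the key includes the speaker: Bob's "Paris" does not conflict with Alice's "Berlin", which ties conflict handling back to speaker attribution.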

VI. Core Conclusion

❗ Memory systems are not failing because they forget,
but because they fail to reconstruct coherent multimodal interaction history.

In other words:

❌ It is not a "memory capacity" problem
✔ It is a "structured understanding" problem

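VII. Summary

H2HMem points to an important direction:

👉 Future AI memory systems should be more than "RAG + vector store"; they should perform "structured modeling of interaction history"

If you are interested in our work, please consider starring our GitHub repository so that more people can find it.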
