[Paper Reading] Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Abstract:

Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ∼ 19 and ∼ 32 points on two new benchmarks at 3.9M and 36M tokens, respectively.

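In real-world financial, medical, legal, or social-science analysis, answering a single question often requires consulting several documents at once, and sometimes combining sections that sit far apart within the same document. The key evidence is rarely concentrated in one place; it is scattered in fragments across pages, tables, paragraphs, and documents.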

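In recent years, LLM context windows have kept growing and have now reached the million-token scale. That invites a tempting illusion: once the context window is large enough, the difficulties of long-document question answering will resolve themselves. They do not. The challenge does not disappear as the window grows, because the question is not only whether the model can read everything, but whether it can reliably integrate scattered evidence.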

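Even when all relevant content is fed to the LLM at once, the model still has to filter evidence, align information, compare across passages, and reason over the whole within a large body of text. As the input gets longer, the relevant evidence makes up a shrinking share of the context, and the model's attention is easily diluted by irrelevant content, so key details get overlooked. Very long-context inference also brings substantial compute cost and latency, which are hard to sustain in real production settings. More importantly, current LLMs cannot reliably combine information from distant sections or from different documents, and they tend to produce omissions, duplicates, or mutually contradictory conclusions.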

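For these reasons, the more common practice today is to split long documents into chunks and store those chunks in a vector database. When a user asks a question, the system first retrieves a handful of relevant chunks by semantic similarity and then passes them to the LLM to generate an answer. RAG and other chunk-based methods do ease the context-window limit and shorten each individual input, but they do not fundamentally solve the evidence-integration problem in long-document question answering.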
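The chunk-then-retrieve pipeline described above can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: the word-count "embedding" is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and the function names `chunk_text`, `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase token counts (assumption: replaces a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up chunks for illustration only.
chunks = [
    "revenue grew in 2023",
    "the weather was mild",
    "net revenue fell in 2022",
]
top = retrieve("What was revenue in 2023?", chunks, k=1)
```

With `k=1` only the 2023 chunk reaches the LLM. A question that needs both years would silently lose the 2022 evidence unless `k` is raised, which is the retrieval-miss risk in miniature.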

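On the one hand, if retrieval misses key evidence, downstream reasoning rests on incomplete information, and the model may reach a wrong conclusion even when its answer looks plausible. On the other hand, as the document collection and the number of chunks grow, the system still has to aggregate, deduplicate, align, and resolve conflicts among an ever larger set of chunk-level results. In other words, chunking merely turns the problem of "the source documents are too long" into the problem of "there is too much intermediate evidence to aggregate". This is what the paper calls the aggregation bottleneck.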
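A back-of-the-envelope calculation makes the bottleneck concrete. The corpus sizes below (3.9M and 36M tokens) are the two new benchmark scales quoted in the abstract; the chunk size of 4,000 tokens and the 200 tokens of extracted output per chunk are assumptions chosen purely for illustration, and `aggregated_evidence_tokens` is a hypothetical helper.

```python
import math


def aggregated_evidence_tokens(
    corpus_tokens: int, chunk_tokens: int, output_tokens_per_chunk: int
) -> int:
    """Total size of the chunk-level outputs that must later be combined."""
    num_chunks = math.ceil(corpus_tokens / chunk_tokens)
    return num_chunks * output_tokens_per_chunk


# Assumed: 4,000-token chunks, 200 tokens of extracted evidence per chunk.
small = aggregated_evidence_tokens(3_900_000, 4_000, 200)   # 975 chunks
large = aggregated_evidence_tokens(36_000_000, 4_000, 200)  # 9,000 chunks
print(small, large)  # 195000 1800000
```

Under these assumptions the intermediate evidence for the 36M-token collection is already about 1.8M tokens, larger than a million-token context window, so concatenating chunk-level outputs only postpones the problem.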

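This is the core bottleneck the paper emphasizes: the real difficulty of long-document question answering is not one that simply enlarging the context window resolves. It lies in building a reliable mechanism to organize, integrate, and reason over evidence scattered across large numbers of documents and passages.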
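The abstract describes SLIDERS as extracting salient information into a relational database and reasoning over it with SQL. The sketch below illustrates only that general idea using Python's standard `sqlite3` module; it is not the paper's implementation. The table schema, the company names, the figures, and the file names are all made up, and `SELECT DISTINCT` is a crude stand-in for the paper's data reconciliation stage, which the abstract says relies on provenance, extraction rationales, and metadata.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create an in-memory table of hypothetical chunk-level extractions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE revenue (
            company TEXT,
            fiscal_year INTEGER,
            revenue_musd REAL,
            source_doc TEXT,      -- provenance: which document
            source_chunk INTEGER  -- provenance: which chunk
        )
        """
    )
    rows = [
        ("Acme", 2022, 120.0, "acme_2022.pdf", 14),
        ("Acme", 2023, 150.0, "acme_2023.pdf", 9),
        ("Acme", 2023, 150.0, "acme_2023.pdf", 31),  # same fact, extracted twice
        ("Globex", 2023, 90.0, "globex_2023.pdf", 5),
    ]
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def revenue_by_company(conn: sqlite3.Connection, year: int) -> list[tuple]:
    """Aggregate over deduplicated records instead of concatenated text."""
    query = """
        SELECT company, SUM(revenue_musd)
        FROM (SELECT DISTINCT company, fiscal_year, revenue_musd FROM revenue)
        WHERE fiscal_year = ?
        GROUP BY company
        ORDER BY company
    """
    return conn.execute(query, (year,)).fetchall()


conn = build_db()
result = revenue_by_company(conn, 2023)
print(result)  # [('Acme', 150.0), ('Globex', 90.0)]
```

The point of the design is that the cost of the final query depends on the database engine rather than on how much text fits in a prompt. Without the deduplication step, the duplicated Acme record would be summed twice, which is why a reconciliation stage matters.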
