ysf
Published on 2025-09-08

[Paper Reading] Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Abstract

Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be “over-compressed” in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called “late chunking”, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term “late” in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.

The core contribution of this paper is not a new model architecture, but a reordering of the "chunking" and "embedding" steps: first let a long-context embedding model read the entire document, then pool at the token-embedding level for each chunk, so that every chunk embedding carries the context of the full text.
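A minimal sketch of this late-pooling idea, assuming a Hugging Face long-context embedding model with mean pooling; the model name and character spans below are placeholder assumptions, not the paper's exact implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder long-context embedding model; any mean-pooling model works.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunking_embed(text: str, chunk_char_spans):
    """Run the transformer over the *whole* text once, then mean-pool
    token embeddings per chunk: chunking happens after the model."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]              # (seq_len, 2) char spans
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (seq_len, hidden)

    chunk_vecs = []
    for start, end in chunk_char_spans:
        # Keep tokens whose character span lies inside the chunk; the extra
        # condition drops special tokens, whose offsets are (0, 0).
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) \
               & (offsets[:, 1] > offsets[:, 0])
        chunk_vecs.append(token_embs[mask].mean(dim=0))
    return chunk_vecs
```

The chunk boundaries themselves can come from any chunker (sentences, fixed-size windows, and so on); only the pooling step is moved to after the transformer.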

In a typical RAG pipeline, a long document is first split into many short chunks, each chunk is encoded into a vector independently, and the vectors are stored in a vector database. The upside is fine-grained retrieval, and the text the LLM later receives is short. The problem is that a chunk encoded in isolation cannot see what comes before or after it, so pronouns, anaphoric references, elided information, and cross-paragraph dependencies are all lost; for instance, a chunk beginning "It has 3.7 million inhabitants" becomes ambiguous once separated from the earlier sentence naming the city.
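For contrast, a sketch of the conventional chunk-then-embed flow described above, reusing the tokenizer and model from the previous snippet (again an illustration under the same assumptions, not a specific library's API):

```python
def naive_chunking_embed(text: str, chunk_char_spans):
    """Encode each chunk independently: the model never sees the
    surrounding text, so cross-chunk references are lost."""
    chunk_vecs = []
    for start, end in chunk_char_spans:
        enc = tokenizer(text[start:end], return_tensors="pt")
        with torch.no_grad():
            token_embs = model(**enc).last_hidden_state[0]
        chunk_vecs.append(token_embs.mean(dim=0))  # per-chunk mean pooling
    return chunk_vecs
```

The only difference from late chunking is where the model boundary sits: here each chunk is a separate forward pass, whereas late chunking does one forward pass over the full document and moves pooling after it.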