A Complete Guide to 11 Spring AI Document Splitting Strategies

Project source: github.com/xifyuw/spring-ai-course/tree/main/phase-15

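1. Introduction

1.1 Why document splitting matters

In a RAG (retrieval-augmented generation) system, document splitting is one of the most important preprocessing steps:

Retrieval precision: well-placed chunk boundaries make retrieval more accurate
Context integrity: each chunk stays semantically coherent
Token utilization: the LLM's context window is not wasted
Response quality: better chunks generally lead to better answers

1.2 What you will learn

After reading this article, you will understand:

The core principle and typical use cases of all 11 splitting strategies
How to use the 5 new advanced strategies in detail
How to choose a strategy based on document type
Practical performance tuning tips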

2. Environment Setup

2.1 Project dependencies

<dependencies>
    <!-- Spring AI OpenAI -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-openai</artifactId>
    </dependency>
    <!-- Spring AI Elasticsearch vector store -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-elasticsearch</artifactId>
    </dependency>
    <!-- WebFlux -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-webflux</artifactId>
    </dependency>
</dependencies>

2.2 Configuration

# application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small
    vectorstore:
      elasticsearch:
        index-name: document-store
        dimensions: 1536

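3. Review of the Basic Strategies

Before diving into the new strategies, here is a quick review of the 6 basic ones:

Strategy	Characteristics	Best for
recursive	Recursive character splitting; preserves semantic integrity	General text
markdown	Recognizes heading levels; preserves document structure	Markdown documents
token	Estimates token counts to fit LLM limits	Cases that need tight token control
semantic	Splits on embedding similarity	High-quality semantic splitting
smart_paragraph	Hybrid paragraph + character splitting	Documents with long paragraphs
character	Fixed-size character splitting; simple and fast	Quick processing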

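4. The New Strategies in Detail

4.1 Agentic splitting - smart topic boundaries

4.1.1 Core idea

Agentic splitting borrows the idea behind LLM agents: it uses heuristic rules to cut at topic boundaries, so that each chunk centers on one complete topic.

4.1.2 How it works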

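/**
 * Core logic of agentic splitting
 *
 * 1. Run a recursive split first to get preliminary chunks
 * 2. Identify topic-transition markers
 * 3. Optimize the cut points at topic boundaries
 */
public List<Document> splitAgentic(String text, String filename, SplitOptions options) {
    List<Document> documents = new ArrayList<>();

    // Step 1: recursive split to get preliminary chunks
    List<TextChunk> initialChunks = recursiveSplitInternal(
        normalizeText(text),
        0,
        chunkSize * 2, // larger initial chunks
        chunkOverlap,
        0
    );

    // Step 2: optimize the cut points with heuristic rules
    for (TextChunk chunk : initialChunks) {
        // Identify topic-transition markers
        // (assumes TextChunk exposes its text through getContent())
        String optimizedContent = optimizeChunkBoundary(chunk.getContent(), chunkSize);
        // ...
    }
    return documents;
}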

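4.1.3 Recognizing topic markers

// Topic-transition markers (Chinese and English)
String[] topicMarkers = {
    // Punctuation
    "\n\n", "\n", ". ", "? ", "! ", "。", "？", "！",
    // Chinese sequence words ("first", "next", "then", "after that", "finally",
    // "firstly", "secondly", "thirdly", "on the other hand", "in addition", "in short")
    "首先", "其次", "然后", "接下来", "最后",
    "第一", "第二", "第三", "另一方面", "此外", "总之"
};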
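
One way such markers could be applied is sketched below: search backward from the size limit for the strongest marker and cut just after it. The class name `TopicBoundary`, the method `findCutPoint`, the marker ordering, and the rule that a chunk never shrinks below half the limit are assumptions made for illustration; the project's own boundary optimization may work differently. Sequence words such as 首先 call for a cut before the marker rather than after it, so the sketch covers punctuation markers only.

```java
import java.util.List;

final class TopicBoundary {
    // Markers ordered from strongest to weakest boundary (an assumption of this sketch).
    static final List<String> MARKERS =
        List.of("\n\n", "\n", "。", "！", "？", ". ", "! ", "? ");

    /** Returns the index at which to cut text so the chunk has at most maxLen characters. */
    static int findCutPoint(String text, int maxLen) {
        if (text.length() <= maxLen) {
            return text.length(); // the whole text fits in one chunk
        }
        int minCut = maxLen / 2; // never shrink a chunk below half the limit
        for (String marker : MARKERS) {
            // lastIndexOf(str, fromIndex) searches backward starting at fromIndex,
            // so the marker found here ends at or before maxLen.
            int idx = text.lastIndexOf(marker, maxLen - marker.length());
            if (idx >= minCut) {
                return idx + marker.length(); // cut just after the marker
            }
        }
        return maxLen; // no marker in range: fall back to a hard cut
    }
}
```

Because stronger markers are tried first, a paragraph break near the middle of the window wins over a sentence end closer to the limit. That favors topic coherence over chunk size, which matches the stated goal of this strategy.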

4.1.4 Usage example

# Upload a long document and split it with the agentic strategy
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=agentic&chunkSize=1500" \
  -F "file=@long-article.txt"

Returned metadata:

{
  "split_method": "agentic",
  "original_separator": "\n\n",
  "optimization_applied": true,
  "chunk_group": 0
}

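4.1.5 Use cases

Long articles: each chunk stays on one topic
Technical documentation: concepts stay intact
Papers and reports: no cuts in the middle of an argument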

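4.2 Code-aware splitting - a boon for programmers

4.2.1 Core idea

Code-aware splitting is designed for source code files. It recognizes:

Function and method boundaries
Class, interface, and enum definitions
Import and dependency statements
Code comments

4.2.2 Supported languages

Language	Extensions	Recognized constructs
Java	`.java`	Classes, interfaces, enums, methods, annotations
Python	`.py`	Classes, functions, decorators
JavaScript/TypeScript	`.js`, `.ts`	Functions, classes, arrow functions
Go	`.go`	Functions, structs, interfaces
Rust	`.rs`	Functions, structs, enums, traits
C/C++	`.c`, `.cpp`, `.cc`	Functions, classes, structs
C#	`.cs`	Classes, interfaces, methods, properties
PHP	`.php`	Classes, functions, namespaces
Ruby	`.rb`	Classes, methods, modules
Swift	`.swift`	Classes, functions, structs
Kotlin	`.kt`	Classes, functions, interfaces
Scala	`.scala`	Classes, traits, functions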

4.2.3 How it works

/**
 * Code-aware splitting
 *
 * 1. Detect the programming language
 * 2. Parse code blocks (classes, methods, etc.)
 * 3. Extract import statements
 * 4. Split along the code structure
 */
public List<Document> splitCodeAware(String text, String filename, SplitOptions options) {
    List<Document> documents = new ArrayList<>();
    int chunkIndex = 0;

    // Detect the language
    String language = detectLanguage(filename);

    // Parse code blocks
    List<CodeBlock> codeBlocks = parseCodeBlocks(text, language);

    // Extract imports
    List<String> imports = extractImports(text, language);

    // Build one document per code block
    for (CodeBlock block : codeBlocks) {
        Map<String, Object> metadata = buildCodeMetadata(
            filename, chunkIndex, block, imports, language
        );
        // ...
    }
    return documents;
}
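
The language detection step can be as simple as a lookup on the file extension. The sketch below follows the extension table in 4.2.2; the class name `LanguageDetector`, the returned language labels, and the `"unknown"` fallback are assumptions for illustration, not the project's actual implementation.

```java
import java.util.Locale;
import java.util.Map;

final class LanguageDetector {
    // Extension-to-language table; the labels on the right are illustrative.
    private static final Map<String, String> BY_EXTENSION = Map.ofEntries(
        Map.entry("java", "java"), Map.entry("py", "python"),
        Map.entry("js", "javascript"), Map.entry("ts", "typescript"),
        Map.entry("go", "go"), Map.entry("rs", "rust"),
        Map.entry("c", "c"), Map.entry("cpp", "cpp"), Map.entry("cc", "cpp"),
        Map.entry("cs", "csharp"), Map.entry("php", "php"), Map.entry("rb", "ruby"),
        Map.entry("swift", "swift"), Map.entry("kt", "kotlin"), Map.entry("scala", "scala"));

    /** Maps a filename to a language label, or "unknown" when the extension is not listed. */
    static String detectLanguage(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return "unknown"; // no extension at all
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "unknown");
    }
}
```

A caller could route `"unknown"` files to the recursive strategy, which is what the selection guide in section 5.1 suggests for other code.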

4.2.4 Code block parsing example

// Regex for parsing Java code blocks
// Note: [^}]* stops at the first '}', so a method body that contains nested
// braces is matched only up to its first closing brace.
Pattern pattern = Pattern.compile(
    "(?:(public|private|protected|static|final|abstract|class|interface|enum|record|void|@\\w+)\\s+)?" +
    "(?:<[^>]+>\\s*)?" +
    "(\\w+)\\s*\\([^)]*\\)\\s*(?:throws\\s+\\w+(?:\\s*,\\s*\\w+)*)?\\s*\\{[^}]*\\}" +
    "|(?:public|private|protected)?\\s*(?:static\\s+)?(?:final\\s+)?(?:abstract\\s+)?" +
    "(?:class|interface|enum|record)\\s+(\\w+)",
    Pattern.DOTALL
);
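
A regular expression alone cannot match nested braces, so a parser of this kind usually pairs the regex with a brace counter that finds where a block really ends. The following is a minimal sketch of that idea; `BraceMatcher` and `findClosingBrace` are hypothetical names, and the counter deliberately ignores braces inside string literals and comments, which a real parser would have to skip.

```java
final class BraceMatcher {
    /**
     * Returns the index of the '}' that closes the '{' at openIndex,
     * or -1 if the braces are unbalanced.
     */
    static int findClosingBrace(String source, int openIndex) {
        if (openIndex < 0 || openIndex >= source.length() || source.charAt(openIndex) != '{') {
            throw new IllegalArgumentException("openIndex must point at '{'");
        }
        int depth = 0;
        for (int i = openIndex; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i; // depth is back to zero: this brace closes the block
            }
        }
        return -1;
    }
}
```

Given the start of a method signature found by a regex, the block body would then be `source.substring(openIndex, findClosingBrace(source, openIndex) + 1)`.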

4.2.5 Usage example

# Upload a Java file
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=code_aware" \
  -F "file=@UserService.java"
# Upload a Python file
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=code_aware" \
  -F "file=@main.py"

Returned metadata:

{
  "split_method": "code_aware",
  "language": "java",
  "block_type": "method",
  "block_name": "getUserById",
  "start_line": 45,
  "imports_count": 12,
  "has_imports": true
}

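4.2.6 Use cases

Codebase RAG: building a code knowledge base
Code review: retrieving related code method by method
Technical documentation: docs that come with code examples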

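4.3 Hierarchical splitting - building a document tree

4.3.1 Core idea

Hierarchical splitting organizes the document into a tree:

Document (root)
├── Chapter 1 (chapter)
│   ├── Section 1.1 (section)
│   │   ├── Paragraph 1 (paragraph)
│   │   └── Paragraph 2 (paragraph)
│   └── Section 1.2 (section)
│       └── Paragraph 3 (paragraph)
└── Chapter 2 (chapter)
    └── Section 2.1 (section)
        └── Paragraph 4 (paragraph)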

4.3.2 How it works

/**
 * Hierarchical splitting
 *
 * Builds a document tree: document → chapter → section → paragraph
 */
public List<Document> splitHierarchical(String text, String filename, SplitOptions options) {
    // Build the hierarchy tree
    DocumentNode root = buildHierarchicalTree(text, filename);

    // Flatten the tree into a list of documents
    List<Document> documents = new ArrayList<>();
    flattenTree(root, documents, filename, 0, maxChunkSize);

    return documents;
}

/**
 * Builds the hierarchy tree
 */
private DocumentNode buildHierarchicalTree(String text, String filename) {
    DocumentNode root = new DocumentNode("root", "document", text, 0, 0);

    // Level 1: split by chapter
    String[] chapters = text.split(
        "(?=(?:^|\\n)\\s*(?:第[一二三四五六七八九十\\d]+[章节篇]|Chapter\\s+\\d+|\\d+\\.\\s+[^\\n]+))"
    );

    // Level 2: split by section
    // Level 3: split by paragraph
    // ...
    return root;
}

4.3.3 Chapter recognition patterns

Pattern	Examples
Chinese chapters	第一章, 第二节, 第三篇
English chapters	Chapter 1, Chapter 2
Numbered chapters	1. Introduction, 2. Background
4.3.4 Usage example

# Upload a long structured document
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=hierarchical&chunkSize=2000" \
  -F "file=@book.txt"

Returned metadata:

{
  "split_method": "hierarchical",
  "node_id": "section_0_1",
  "node_type": "section",
  "hierarchy_level": 2,
  "node_index": 1,
  "has_children": true,
  "children_count": 3
}

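4.3.5 Use cases

Books and textbooks: chapter structure is preserved
Technical manuals: retrieval by chapter
Legal documents: the hierarchy of clauses is preserved
Papers: the logical order of sections is preserved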

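4.4 Sliding-window splitting - a RAG workhorse

4.4.1 Core idea

Sliding-window splitting moves a window across the text with a high overlap ratio (50% by default), which gives:

No lost information: at 50% overlap or more, every position away from the ends of the text is covered by at least two windows
Context continuity: adjacent chunks share a large overlap
Better retrieval recall: well suited to RAG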
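4.4.2 How it works

/**
 * Sliding-window splitting
 *
 * Characteristics:
 * - High overlap ratio (50% by default)
 * - No information is lost at chunk boundaries
 * - Suited to retrieval-augmented generation (RAG)
 */
public List<Document> splitSlidingWindow(String text, String filename, SplitOptions options) {
    int windowSize = options.getChunkSize() != null ? options.getChunkSize() : DEFAULT_CHUNK_SIZE;
    // For this strategy, chunkOverlap is read as a percentage rather than a character count
    int overlapRatio = options.getChunkOverlap() != null ? options.getChunkOverlap() : 50;

    // Compute the stride
    int stride = Math.max(1, windowSize * (100 - overlapRatio) / 100);

    String normalizedText = normalizeText(text);
    List<Document> documents = new ArrayList<>();
    int startIndex = 0;

    while (startIndex < normalizedText.length()) {
        int endIndex = Math.min(startIndex + windowSize, normalizedText.length());

        // Try to end on a sentence boundary
        if (endIndex < normalizedText.length()) {
            endIndex = findBestSplitPoint(normalizedText, endIndex);
        }

        // ... build a Document from normalizedText.substring(startIndex, endIndex)

        // Stop once a window reaches the end, so no redundant tail windows are emitted
        if (endIndex >= normalizedText.length()) {
            break;
        }

        // Slide the window
        startIndex += stride;
    }
    return documents;
}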

4.4.3 Computing the overlap

Window size	Overlap ratio	Stride	Effect
1000	50%	500	Moderate overlap
1000	70%	300	High overlap, high recall
1000	30%	700	Low overlap, high efficiency
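
The stride column follows from stride = windowSize × (100 − overlap) / 100. The sketch below reproduces that arithmetic and applies it to plain character windows. `SlidingWindow`, `stride`, and `windows` are hypothetical names, and unlike the article's version the sketch does not adjust window ends to sentence boundaries.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindow {
    /** Stride in characters for a window size and an overlap given in percent. */
    static int stride(int windowSize, int overlapPercent) {
        return Math.max(1, windowSize * (100 - overlapPercent) / 100);
    }

    /** Cuts text into fixed-size windows; the last window may be shorter. */
    static List<String> windows(String text, int windowSize, int overlapPercent) {
        if (windowSize <= 0 || overlapPercent < 0 || overlapPercent >= 100) {
            throw new IllegalArgumentException(
                "windowSize must be positive and overlapPercent in [0, 100)");
        }
        int step = stride(windowSize, overlapPercent);
        List<String> result = new ArrayList<>();
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + windowSize, text.length());
            result.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the rest of the text is already inside this window
            }
        }
        return result;
    }
}
```

For example, `windows("abcdefghij", 4, 50)` uses a stride of 2 and yields `abcd`, `cdef`, `efgh`, `ghij`. A higher overlap means more chunks to embed and store: at 70% overlap the index holds roughly 3.3 times as many characters as the source text.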

4.4.4 Usage example

# High overlap (70%) - suited to high-precision retrieval
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=sliding_window&chunkSize=1000&chunkOverlap=70" \
  -F "file=@document.txt"
# Moderate overlap (50%) - a balanced choice
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=sliding_window&chunkSize=1000&chunkOverlap=50" \
  -F "file=@document.txt"

Returned metadata:

{
  "split_method": "sliding_window",
  "window_start": 0,
  "window_end": 1000,
  "window_size": 1000,
  "stride": 500,
  "overlap_ratio": 50,
  "overlap_with_previous": 0
}

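4.4.5 Use cases

RAG systems: higher retrieval recall
Question answering: answers are not split across chunks
Long-document retrieval: no information lost at boundaries
Key information extraction: multiple coverage keeps content complete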

4.5 Structured-data splitting - built for data files

4.5.1 Core idea

Structured-data splitting targets structured formats such as CSV, JSON, XML, and YAML, and keeps each record intact.

4.5.2 Automatic format detection

/**
 * Detects the structured data format
 */
private String detectStructuredFormat(String filename, String content) {
    String lower = filename.toLowerCase();
    if (lower.endsWith(".csv")) return "csv";
    if (lower.endsWith(".json")) return "json";
    if (lower.endsWith(".xml")) return "xml";
    if (lower.endsWith(".yaml") || lower.endsWith(".yml")) return "yaml";

    // Fall back to inspecting the content
    String trimmed = content.trim();
    if (trimmed.startsWith("[") || trimmed.startsWith("{")) return "json";
    if (trimmed.startsWith("<?xml") || trimmed.startsWith("<")) return "xml";
    // ...
}

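4.5.3 CSV splitting

/**
 * CSV splitting - keeps the header row
 */
private List<Document> splitCsv(String text, String filename, SplitOptions options) {
    // chunkSize / 50 converts a character budget into a row budget,
    // which amounts to assuming roughly 50 characters per row
    int maxRows = options.getChunkSize() != null ? options.getChunkSize() / 50 : 100;

    // Extract the header row
    String header = lines[0];
    List<String> headers = parseCsvLine(header);

    // Every chunk contains the header
    for (String line : dataLines) {
        if (rowCount >= maxRows) {
            // Save the current chunk
            // The new chunk starts with the header
            currentChunk.append(header).append("\n");
        }
        currentChunk.append(line).append("\n");
    }
}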
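
The fragment above leaves out the bookkeeping around `rowCount` and `currentChunk`. A self-contained version of the same header-repeating idea might look like the sketch below. `CsvChunker` and its `split` method are hypothetical names, chunks are returned as plain strings rather than `Document` objects, and quoted fields that contain line breaks are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvChunker {
    /** Splits CSV text into chunks of at most maxRows data rows, each prefixed with the header. */
    static List<String> split(String csv, int maxRows) {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String[] lines = csv.strip().split("\\R"); // \R matches any line break
        if (lines[0].isEmpty()) {
            return chunks; // empty input
        }
        String header = lines[0];
        StringBuilder current = new StringBuilder();
        int rowCount = 0;
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isBlank()) {
                continue; // skip empty lines between rows
            }
            if (rowCount == 0) {
                current.append(header).append('\n'); // every chunk starts with the header
            }
            current.append(lines[i]).append('\n');
            rowCount++;
            if (rowCount == maxRows) {
                chunks.add(current.toString());
                current.setLength(0);
                rowCount = 0;
            }
        }
        if (rowCount > 0) {
            chunks.add(current.toString()); // trailing partial chunk
        }
        return chunks;
    }
}
```

Repeating the header costs a few tokens per chunk, but it lets each chunk be interpreted on its own once it is retrieved without its neighbors.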

4.5.4 JSON splitting

/**
 * JSON splitting - splits by object or entry
 */
private List<Document> splitJson(String text, String filename, SplitOptions options) {
    List<Document> documents = new ArrayList<>();
    String trimmed = text.trim();

    // JSON array: extract each object
    if (trimmed.startsWith("[")) {
        List<String> objects = extractJsonObjects(trimmed);
        processJsonObjects(objects, filename, documents, maxSize);
    }
    // JSON object: split by top-level key
    else if (trimmed.startsWith("{")) {
        List<String> entries = extractJsonEntries(trimmed);
        processJsonObjects(entries, filename, documents, maxSize);
    }
    return documents;
}
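
Extracting the elements of a JSON array without breaking records requires tracking nesting depth and string literals, because commas and brackets can appear inside both. The sketch below is one hand-rolled way to do that; `JsonArraySplitter` and `topLevelElements` are hypothetical names, and the scanner does not validate the JSON. In production code a JSON library such as Jackson would be the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;

final class JsonArraySplitter {
    /** Returns the top-level elements of a JSON array as raw strings. */
    static List<String> topLevelElements(String json) {
        String s = json.strip();
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            throw new IllegalArgumentException("expected a JSON array");
        }
        List<String> elements = new ArrayList<>();
        int depth = 0;            // nesting depth inside the outer array
        boolean inString = false; // true while scanning a string literal
        int start = 1;
        for (int i = 1; i < s.length() - 1; i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                elements.add(s.substring(start, i).strip()); // element boundary
                start = i + 1;
            }
        }
        String last = s.substring(start, s.length() - 1).strip();
        if (!last.isEmpty()) {
            elements.add(last);
        }
        return elements;
    }
}
```

The resulting strings can then be grouped into chunks up to a size limit, so that no object is ever cut in the middle.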

4.5.5 XML splitting

/**
 * XML splitting - splits by element
 */
private List<Document> splitXml(String text, String filename, SplitOptions options) {
    // Extract the XML declaration
    String declaration = "";
    int declEnd = text.indexOf("?>");
    if (declEnd > 0) {
        declaration = text.substring(0, declEnd + 2) + "\n";
    }

    // Extract the main elements
    List<String> elements = extractXmlElements(text);

    // Every chunk includes the declaration
    for (String elem : elements) {
        currentChunk.append(declaration);
        currentChunk.append(elem);
    }
}

4.5.6 Usage example

# CSV file
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=structured_data" \
  -F "file=@users.csv"
# JSON file
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=structured_data" \
  -F "file=@data.json"
# XML file
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=structured_data" \
  -F "file=@config.xml"
# YAML file
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=structured_data" \
  -F "file=@config.yaml"

Returned metadata (CSV):

{
  "split_method": "structured_data",
  "format": "csv",
  "start_index": 1,
  "end_index": 100,
  "record_count": 100,
  "headers": ["id", "name", "email"],
  "header_count": 3
}

Returned metadata (JSON):

{
  "split_method": "structured_data",
  "format": "json",
  "start_index": 0,
  "end_index": 49,
  "record_count": 50
}

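4.5.7 Use cases

Data file processing: CSV, JSON, XML, YAML
Configuration file retrieval: lookup by configuration item
Log analysis: processing structured logs
API documentation: OpenAPI/Swagger specifications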

5. Strategy Selection Guide

5.1 Choosing by document type

Document type
├── Plain text
│   ├── Short text → character
│   ├── Long articles → recursive / agentic
│   └── High recall needed → sliding_window
├── Markdown
│   ├── Technical docs → markdown
│   └── Blog posts → markdown / agentic
├── Code files
│   ├── Java/Python/JS → code_aware
│   └── Other code → recursive
├── Structured data
│   ├── CSV → structured_data
│   ├── JSON → structured_data
│   ├── XML → structured_data
│   └── YAML → structured_data
├── Books and papers
│   ├── Textbooks → hierarchical
│   ├── Papers → hierarchical
│   └── Manuals → hierarchical
└── Special requirements
    ├── Precise token control → token
    ├── Semantic coherence → semantic
    └── Paragraph integrity → smart_paragraph

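5.2 Choosing by scenario

Scenario	Recommended strategy	Reason
Rapid prototyping	character	Simple and fast
Production	recursive	Balances quality and performance
High-quality RAG	sliding_window	High recall
Code knowledge base	code_aware	Preserves code structure
Enterprise documents	hierarchical	Preserves document hierarchy
Data retrieval	structured_data	Keeps records intact
Semantic search	semantic	Splits on meaning

5.3 Parameter tuning suggestions

Strategy	chunkSize	chunkOverlap	Notes
recursive	1000-1500	200	Standard configuration
agentic	1500-2000	200	Larger chunks keep a topic together
code_aware	2000-3000	0	Splits naturally along code blocks
hierarchical	2000-5000	0	Splits naturally along the hierarchy
sliding_window	1000	50-70 (percent)	High overlap ratio
structured_data	-	-	Controlled by record count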

6. API Usage Examples

6.1 Full API list

# Basic strategies
POST /api/vector-store/upload?splitStrategy=recursive
POST /api/vector-store/upload?splitStrategy=markdown
POST /api/vector-store/upload?splitStrategy=token
POST /api/vector-store/upload?splitStrategy=semantic
POST /api/vector-store/upload?splitStrategy=smart_paragraph
POST /api/vector-store/upload?splitStrategy=character
# New strategies
POST /api/vector-store/upload?splitStrategy=agentic
POST /api/vector-store/upload?splitStrategy=code_aware
POST /api/vector-store/upload?splitStrategy=hierarchical
POST /api/vector-store/upload?splitStrategy=sliding_window
POST /api/vector-store/upload?splitStrategy=structured_data

6.2 Full request examples

# 1. Agentic splitting - long article
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=agentic&chunkSize=1500&chunkOverlap=200" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@article.txt"
# 2. Code-aware splitting - Java file
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=code_aware" \
  -F "file=@UserService.java"
# 3. Hierarchical splitting - book
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=hierarchical&chunkSize=3000" \
  -F "file=@book.txt"
# 4. Sliding-window splitting - RAG scenario
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=sliding_window&chunkSize=1000&chunkOverlap=70" \
  -F "file=@knowledge-base.txt"
# 5. Structured-data splitting - CSV
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=structured_data" \
  -F "file=@products.csv"
# 6. Structured-data splitting - JSON
curl -X POST "http://localhost:8080/api/vector-store/upload?splitStrategy=structured_data" \
  -F "file=@api-docs.json"

6.3 Response example

{
  "success": true,
  "message": "file uploaded and processed successfully",
  "data": {
    "filename": "article.txt",
    "chunkCount": 12,
    "totalCharacters": 15420,
    "splitMethod": "agentic",
    "message": "processed with agentic strategy"
  }
}

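7. Performance Tuning

7.1 Large files

// For large files, consider:
// 1. Streaming the input
// 2. Increasing chunkSize to reduce the number of chunks
// 3. Processing asynchronously

// Recommended configuration
SplitOptions options = new SplitOptions()
    .withChunkSize(2000)      // larger chunks
    .withChunkOverlap(100);   // smaller overlap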

7.2 Batch processing

// Upload several files in one batch
for (File file : files) {
    // Pick a strategy based on the file type
    SplitStrategy strategy = selectStrategy(file);

    documentSplitterService.split(
        readFile(file),
        file.getName(),
        strategy,
        options
    );
}
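
The loop above relies on a strategy selection helper that the article does not show. A minimal sketch of such a helper, based on the guide in section 5.1, could look like this. The class `StrategySelector`, the reduced `SplitStrategy` enum, and the choice to take a filename instead of a `File` are assumptions made to keep the example self-contained.

```java
import java.util.Locale;

final class StrategySelector {
    // Reduced stand-in for the project's strategy enum (illustrative only).
    enum SplitStrategy { RECURSIVE, MARKDOWN, CODE_AWARE, STRUCTURED_DATA }

    /** Picks a splitting strategy from the file extension, defaulting to RECURSIVE. */
    static SplitStrategy selectStrategy(String filename) {
        int dot = filename.lastIndexOf('.');
        String ext = dot < 0 ? "" : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "md":
            case "markdown":
                return SplitStrategy.MARKDOWN;
            case "java":
            case "py":
            case "js":
            case "ts":
            case "go":
            case "rs":
                return SplitStrategy.CODE_AWARE;
            case "csv":
            case "json":
            case "xml":
            case "yaml":
            case "yml":
                return SplitStrategy.STRUCTURED_DATA;
            default:
                return SplitStrategy.RECURSIVE; // general-purpose fallback
        }
    }
}
```

Extension-based selection is cheap and predictable. Choices that depend on content, such as hierarchical splitting for books, would still need an explicit parameter or a content check.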

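7.3 Caching

// For documents that are processed repeatedly, the split result can be cached.
// Note: this key ignores the text itself, so a changed file that keeps its
// name will return the previously cached chunks.
@Cacheable(value = "documentChunks", key = "#filename + '_' + #strategy")
public List<Document> split(String text, String filename, SplitStrategy strategy) {
    // ...
}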
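
A cache key built from the filename and strategy cannot tell two versions of the same file apart. One common alternative is to key on a hash of the content together with the splitting parameters, as sketched below. `ChunkCacheKey` and the key layout are hypothetical and not part of the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChunkCacheKey {
    /** Builds a key from the strategy, the size parameters, and a SHA-256 hash of the text. */
    static String of(String text, String strategy, int chunkSize, int chunkOverlap) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return strategy + ":" + chunkSize + ":" + chunkOverlap + ":" + hex;
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With a key like this, identical content uploaded under different names shares one cache entry, and any edit to the text or to the parameters produces a new one.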

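8. Summary

8.1 The 11 strategies at a glance

Strategy	Type	Core strength
recursive	Basic	Semantic integrity
markdown	Basic	Structure preservation
token	Basic	Precise token control
semantic	Basic	Semantic similarity
smart_paragraph	Basic	Paragraph integrity
character	Basic	Simple and fast
agentic	New	Smart topic boundaries
code_aware	New	Code structure recognition
hierarchical	New	Document tree construction
sliding_window	New	High recall
structured_data	New	Data integrity

8.2 Key takeaways

There is no best strategy, only the one that fits best
Choose by document type: code_aware for code, structured_data for data
Choose by scenario: sliding_window for RAG, hierarchical for books
Parameter tuning matters: adjust chunkSize and chunkOverlap to your actual needs

8.3 Further study

A deeper understanding of how embeddings work
Vector database optimization
Multimodal document processing
Agentic RAG architectures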

Published: March 13, 2026 • Java
