Fine-Grained Splitting of PDF Content
1. Background
In a RAG project, the first step is to split the corpus (PDF, Word, Excel, HTML, Markdown) into chunks, encode those chunks into vectors with an embedding model, and store the vectors in a vector database (Milvus).
This article shows how to split a PDF at a fine-grained level, for example into paragraphs, titles, tables, and other structural elements, or to extract text from embedded images, using the Unstructured toolkit.
2. Setting Up the Unstructured Environment
1) Install the required system software and Python packages
Poppler (PDF parsing)
- Linux: apt-get install poppler-utils
- macOS: brew install poppler
- Windows: https://github.com/oschwartz10612/poppler-windows
Note: on Windows, download and unzip the release, then add its bin directory to the PATH environment variable (e.g. D:\installedsoft\poppler-windows\poppler-24.08.0\Library\bin).
Tesseract (OCR)
- Linux: apt-get install tesseract-ocr
- macOS: brew install tesseract
- Windows: https://github.com/UB-Mannheim/tesseract/wiki#tesseract-installer-for-windows (just run the .exe installer)
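Both tools must be discoverable on the PATH at runtime. A quick way to check from Python is shutil.which, which returns None when an executable cannot be found:

import shutil

# Each call should print a full executable path, not None
print(shutil.which("pdftoppm"))   # part of Poppler
print(shutil.which("tesseract")) # Tesseract OCR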
You also need to install the following Python packages:
langchain-unstructured
langchain-milvus
langchain_community
ipython
python-dotenv
sentence-transformers
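Assuming a standard pip setup, they can all be installed in one go:

pip install langchain-unstructured langchain-milvus langchain_community ipython python-dotenv sentence-transformers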
If you run into problems installing any of these packages, search for a solution yourself (e.g. ask DeepSeek)!
3. API Overview and Examples
1) API overview
UnstructuredLoader is the LangChain loader for unstructured documents (PDF, Word, HTML, etc.). The parameters used in the code below are explained here; a sketch that combines all of them follows the list.
- file_path=pdf_file: the path of the PDF file to load. It can be a local file path or a URL.
- strategy="hi_res": the parsing strategy, which determines how document content is processed. Options: "fast" (quick parsing, suitable for simple documents, but it may miss complex layouts), "hi_res" (high-accuracy parsing, suitable for complex layouts such as multi-column pages, tables, and images, but slower), and "auto" (choose automatically; the default).
- partition_via_api=True: whether to partition the document (i.e. split it into structured elements) via the Unstructured API. If True, an api_key is required and network requests are made; if False, parsing runs locally (which needs extra dependencies).
- coordinates=True: whether to keep each text element's coordinates in the original document (position, bounding box). This is useful for applications that need to locate text precisely, such as table extraction.
- api_key='IhWKAZRBmZ14c8tmCsOLabqwIKLJ2e': the access key for the Unstructured API, used to process documents via the cloud service. Without a key, partitioning has to run locally.
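Putting the parameters together, a minimal sketch of an API-backed loader (the api_key value here is a placeholder; real keys are issued by the Unstructured platform):

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path="example.pdf",              # local path or URL
    strategy="hi_res",                    # high-accuracy layout parsing
    partition_via_api=True,               # partition via the Unstructured cloud API
    coordinates=True,                     # keep bounding boxes in metadata
    api_key="YOUR_UNSTRUCTURED_API_KEY",  # placeholder, not a real key
)

The example in section 2.1 below passes only file_path and strategy, so partitioning runs locally instead of via the API.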
2) Examples
2.1 Parse a PDF into a List[Document] and save each Document to a JSON file
from langchain_unstructured import UnstructuredLoader
import json
import os

# 1. Path of the PDF to parse
pdf_file = r'D:\aiproject\mymilvus\datas\layout-parser-paper.pdf'
output_dir = r'D:\aiproject\mymilvus\datas\output'

# 2. Initialize an UnstructuredLoader instance
loader = UnstructuredLoader(
    file_path=pdf_file,
    strategy="hi_res",
)

def write_json(data, file_name):
    with open(os.path.join(output_dir, file_name), 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# 3. For now, dump each parsed Document to a JSON file -> later they go into Milvus
docs = []
counter = 0
for doc in loader.lazy_load():
    docs.append(doc)
    json_file_name = f"{doc.metadata.get('page_number')}_{counter}.json"
    counter += 1
    # Convert the Document to a dict before serializing
    write_json(doc.model_dump(), json_file_name)

print(f'Number of documents: {len(docs)}')
print(f'Metadata of the first document: {docs[0].metadata}')
print(f'Text of the first document: {docs[0].page_content}')
Output:
Page 5 of the source PDF contains a table:
The table is parsed into the following Document:
{"id": null,
"metadata": {
"source": "D:\\aiproject\\mymilvus\\datas\\layout-parser-paper.pdf",
"detection_class_prob": 0.9028143882751465,
"coordinates": {
"points": [[379.4666442871094,383.0013427734375],
[379.4666442871094,570.6318969726562],
[1321.4498291015625,570.6318969726562],
[1321.4498291015625,383.0013427734375]],
"system": "PixelSpace",
"layout_width": 1700,
"layout_height": 2200},
"links": [{"text": "[ 38 ]","url": "cite.zhong2019publaynet","start_index": 10},
{"text": "[ 3 ]","url": "cite.antonacopoulos2009realistic","start_index": 21},
{"text": "[ 17 ]","url": "cite.newspaper_navigator_dataset","start_index": 35},
{"text": "[ 18 ]","url": "cite.li2019tablebank","start_index": 50},
{"text": "[ 31 ]","url": "cite.shen2020large","start_index": 65}],
"last_modified": "2025-04-07T09:54:09",
"filetype": "application/pdf",
"languages": ["eng"],
"page_number": 5,
"parent_id": "5a1838a8f40b4523094652cf14ab974c",
"file_directory": "D:\\aiproject\\mymilvus\\datas",
"filename": "layout-parser-paper.pdf",
"category": "Table",
"element_id": "cb534ba64da736dc53d60b660f5e1153"
},
"page_content": "Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents",
"type": "Document"}
2.2 Document format
- metadata
  - page_number: the page the element was extracted from
  - category: the element type (e.g. Title, NarrativeText, Table)
  - element_id: the unique ID of this element
  - parent_id: the element_id of the parent element (e.g. the section title an element belongs to)
- page_content: the extracted text of the element
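As a closing step toward the goal in section 1, the parsed documents can be embedded and written to Milvus. A minimal sketch using langchain-milvus with a sentence-transformers embedding model; the model name, Milvus URI, and collection name are assumptions, and filter_complex_metadata drops nested metadata fields (coordinates, links) that the vector store cannot index:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_milvus import Milvus

# Drop nested metadata (coordinates, links) that cannot be stored as scalar fields
docs_clean = filter_complex_metadata(docs)

# Assumed embedding model; any sentence-transformers model works the same way
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = Milvus.from_documents(
    docs_clean,
    embedding=embeddings,
    connection_args={"uri": "http://localhost:19530"},  # assumed local Milvus instance
    collection_name="pdf_chunks",                       # assumed collection name
)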