Fine-Grained Splitting of PDF Content
1. Background
In a RAG project, the first step is to split the corpus (PDF, Word, Excel, HTML, Markdown) into chunks, encode those chunks into vectors with an embedding model, and store the vectors in a vector database (Milvus).
This article shows how to split a PDF at a fine-grained level, for example into paragraphs, titles, tables, and other structural elements, or to extract text from embedded images, using the Unstructured toolkit.
2. Setting Up the Unstructured Environment
1) Install the required system software and Python packages
Poppler (PDF parsing)
- Linux: apt-get install poppler-utils
- macOS: brew install poppler
- Windows: https://github.com/oschwartz10612/poppler-windows
Note: on Windows, download and unzip the release, then add its bin directory to the PATH environment variable (e.g. D:\installedsoft\poppler-windows\poppler-24.08.0\Library\bin).
Tesseract (OCR)
- Linux: apt-get install tesseract-ocr
- macOS: brew install tesseract
- Windows: https://github.com/UB-Mannheim/tesseract/wiki#tesseract-installer-for-windows (just run the .exe installer)
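Both tools must be discoverable on the PATH at runtime. A quick way to check from Python is shutil.which, which returns None when an executable cannot be found:

import shutil

# Each call should print a full executable path, not None
print(shutil.which("pdftoppm"))   # part of Poppler
print(shutil.which("tesseract")) # Tesseract OCR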
You also need to install the following Python packages:
langchain-unstructured
langchain-milvus
langchain_community
ipython
python-dotenv
sentence-transformers
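Assuming a standard pip setup, they can all be installed in one go:

pip install langchain-unstructured langchain-milvus langchain_community ipython python-dotenv sentence-transformers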
If you run into problems installing any of these packages, search for a solution yourself (e.g. ask DeepSeek)!
3. API Overview and Examples
1) API overview
UnstructuredLoader is the LangChain loader for unstructured documents (PDF, Word, HTML, etc.). The parameters used in the code below are explained here; a sketch that combines all of them follows the list.
- file_path=pdf_file: the path of the PDF file to load. It can be a local file path or a URL.
- strategy="hi_res": the parsing strategy, which determines how document content is processed. Options: "fast" (quick parsing, suitable for simple documents, but it may miss complex layouts), "hi_res" (high-accuracy parsing, suitable for complex layouts such as multi-column pages, tables, and images, but slower), and "auto" (choose automatically; the default).
- partition_via_api=True: whether to partition the document (i.e. split it into structured elements) via the Unstructured API. If True, an api_key is required and network requests are made; if False, parsing runs locally (which needs extra dependencies).
- coordinates=True: whether to keep each text element's coordinates in the original document (position, bounding box). This is useful for applications that need to locate text precisely, such as table extraction.
- api_key='IhWKAZRBmZ14c8tmCsOLabqwIKLJ2e': the access key for the Unstructured API, used to process documents via the cloud service. Without a key, partitioning has to run locally.
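Putting the parameters together, a minimal sketch of an API-backed loader (the api_key value here is a placeholder; real keys are issued by the Unstructured platform):

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path="example.pdf",              # local path or URL
    strategy="hi_res",                    # high-accuracy layout parsing
    partition_via_api=True,               # partition via the Unstructured cloud API
    coordinates=True,                     # keep bounding boxes in metadata
    api_key="YOUR_UNSTRUCTURED_API_KEY",  # placeholder, not a real key
)

The example in section 2.1 below passes only file_path and strategy, so partitioning runs locally instead of via the API.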
2) Examples
2.1 Parse a PDF into a List[Document] and save each Document to a JSON file
from langchain_unstructured import UnstructuredLoader
import json
import os

# 1. Path of the PDF to parse
pdf_file = r'D:\aiproject\mymilvus\datas\layout-parser-paper.pdf'
output_dir = r'D:\aiproject\mymilvus\datas\output'

# 2. Initialize an UnstructuredLoader instance
loader = UnstructuredLoader(
    file_path=pdf_file,
    strategy="hi_res",
)

def write_json(data, file_name):
    with open(os.path.join(output_dir, file_name), 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# 3. For now, dump each parsed Document to a JSON file -> later they go into Milvus
docs = []
counter = 0
for doc in loader.lazy_load():
    docs.append(doc)
    json_file_name = f"{doc.metadata.get('page_number')}_{counter}.json"
    counter += 1
    # Convert the Document to a dict before serializing
    write_json(doc.model_dump(), json_file_name)

print(f'Number of documents: {len(docs)}')
print(f'Metadata of the first document: {docs[0].metadata}')
print(f'Text of the first document: {docs[0].page_content}')
Output:
Page 5 of the source PDF contains a table:
The table is parsed into the following Document:
{"id": null,
"metadata": {
"source": "D:\\aiproject\\mymilvus\\datas\\layout-parser-paper.pdf",
"detection_class_prob": 0.9028143882751465,
"coordinates": {
"points": [[379.4666442871094,383.0013427734375],
[379.4666442871094,570.6318969726562],
[1321.4498291015625,570.6318969726562],
[1321.4498291015625,383.0013427734375]],
"system": "PixelSpace",
"layout_width": 1700,
"layout_height": 2200},
"links": [{"text": "[ 38 ]","url": "cite.zhong2019publaynet","start_index": 10},
{"text": "[ 3 ]","url": "cite.antonacopoulos2009realistic","start_index": 21},
{"text": "[ 17 ]","url": "cite.newspaper_navigator_dataset","start_index": 35},
{"text": "[ 18 ]","url": "cite.li2019tablebank","start_index": 50},
{"text": "[ 31 ]","url": "cite.shen2020large","start_index": 65}],
"last_modified": "2025-04-07T09:54:09",
"filetype": "application/pdf",
"languages": ["eng"],
"page_number": 5,
"parent_id": "5a1838a8f40b4523094652cf14ab974c",
"file_directory": "D:\\aiproject\\mymilvus\\datas",
"filename": "layout-parser-paper.pdf",
"category": "Table",
"element_id": "cb534ba64da736dc53d60b660f5e1153"
},
"page_content": "Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents",
"type": "Document"}
2.2 Document format
- metadata
  - page_number: the page the element was extracted from
  - category: the element type (e.g. Title, NarrativeText, Table)
  - element_id: the unique ID of this element
  - parent_id: the element_id of the parent element (e.g. the section title an element belongs to)
- page_content: the extracted text of the element
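As a closing step toward the goal in section 1, the parsed documents can be embedded and written to Milvus. A minimal sketch using langchain-milvus with a sentence-transformers embedding model; the model name, Milvus URI, and collection name are assumptions, and filter_complex_metadata drops nested metadata fields (coordinates, links) that the vector store cannot index:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_milvus import Milvus

# Drop nested metadata (coordinates, links) that cannot be stored as scalar fields
docs_clean = filter_complex_metadata(docs)

# Assumed embedding model; any sentence-transformers model works the same way
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = Milvus.from_documents(
    docs_clean,
    embedding=embeddings,
    connection_args={"uri": "http://localhost:19530"},  # assumed local Milvus instance
    collection_name="pdf_chunks",                       # assumed collection name
)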