Fine-grained splitting of PDF content (how to split a PDF into single pages)

Programming articles | jaq123 | 2025-07-08 0:42:13

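I. Background

In a RAG project, the first step is to split the corpus (PDF, Word, Excel, HTML, Markdown) into chunks, encode each chunk as a vector with an embedding model, and store the vectors in a vector database (Milvus).

This article shows how to split a PDF at a fine granularity, for example into paragraphs, titles, tables, and other structural elements, including cases where text has to be extracted from images. The main tool is Unstructured.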

II. Setting up the Unstructured environment

1) Install the required software and packages

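Poppler (PDF analysis)

  • Linux: apt-get install poppler-utils
  • macOS: brew install poppler
  • Windows: https://github.com/oschwartz10612/poppler-windows

Note: on Windows, download the archive, unzip it, and add its bin directory to the PATH environment variable (for example D:\installedsoft\poppler-windows\poppler-24.08.0\Library\bin).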

Tesseract (OCR)

  • Linux: apt-get install tesseract-ocr
  • macOS: brew install tesseract
  • Windows: https://github.com/UB-Mannheim/tesseract/wiki#tesseract-installer-for-windows (just run the exe installer)
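Before moving on, it can help to confirm that both tools are actually reachable from Python. The following is a minimal sketch, assuming that Poppler exposes the pdftoppm executable and Tesseract exposes tesseract on PATH; the helper name find_missing_tools is made up for this illustration.

```python
import shutil


def find_missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler and Tesseract are both reachable.")
```

If a tool is reported as missing on Windows, the usual cause is that its bin directory was not added to PATH, or that the terminal was opened before PATH was changed.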

The following Python packages are also required:

langchain-unstructured
langchain-milvus
langchain_community
ipython
python-dotenv
sentence-transformers

If a package fails to install, look up the error yourself (for example, by asking DeepSeek).
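A quick way to see which of the packages above are still missing is to ask Python whether it can locate their modules. This is only a sketch: the helper missing_modules is hypothetical, and the list uses import names, which differ from the pip names for two packages (python-dotenv is imported as dotenv, ipython as IPython).

```python
from importlib.util import find_spec

# Import names of the packages listed above (not always equal to the pip names).
REQUIRED_MODULES = [
    "langchain_unstructured",
    "langchain_milvus",
    "langchain_community",
    "IPython",
    "dotenv",
    "sentence_transformers",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
```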

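III. API overview and examples

1) API overview

UnstructuredLoader is the LangChain loader for unstructured documents (PDF, Word, HTML, and so on). The parameters used in the code are explained below: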

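  1. file_path=pdf_file: the path of the PDF file to load, here a local file path. To load a web page instead, the loader takes a separate web_url parameter.
  2. strategy="hi_res": the parsing strategy, which decides how the document content is processed. Possible values include:
     • "fast": quick parsing, suitable for simple documents, but it may miss complex layouts.
     • "hi_res": high-accuracy parsing, suitable for complex layouts (multiple columns, tables, images), but slower.
     • "auto": picks a strategy automatically (the default).
  3. partition_via_api=True: whether to partition the document (split it into structured chunks) through the Unstructured API. If True, an api_key is required and parsing depends on network requests; if False, the local parsing logic is used, which needs the extra dependencies installed.
  4. coordinates=True: whether to keep the coordinates of each text element in the original document (position, bounding box). This is useful for applications that need to locate text precisely, such as table extraction.
  5. api_key="<YOUR_API_KEY>": the access key for the Unstructured API, used when documents are processed by the cloud service. Without a key, partitioning has to run locally. Keep the real key out of source code.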
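The parameters above can be combined as in the following sketch. It assumes that the API key is stored in an environment variable named UNSTRUCTURED_API_KEY (for example loaded from a .env file with python-dotenv) rather than written into the script; build_loader_kwargs and make_loader are hypothetical helpers, not part of LangChain.

```python
import os


def build_loader_kwargs(pdf_file, use_api=False, api_key=None):
    """Assemble keyword arguments for UnstructuredLoader (hypothetical helper)."""
    kwargs = {"file_path": pdf_file, "strategy": "hi_res"}
    if use_api:
        # Assumption: the key lives in an environment variable, not in the code.
        key = api_key or os.environ.get("UNSTRUCTURED_API_KEY")
        if not key:
            raise ValueError("partition_via_api=True needs an API key")
        kwargs.update(partition_via_api=True, coordinates=True, api_key=key)
    return kwargs


def make_loader(pdf_file, use_api=False):
    """Create the loader; local parsing by default, cloud API on request."""
    # Imported lazily so build_loader_kwargs works without the package installed.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(**build_loader_kwargs(pdf_file, use_api))
```

With use_api=False only file_path and strategy are passed, which matches the local setup used in the example in this article.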

2) Examples

2.1 Parse a PDF into a List[Document] and save the result as JSON files

import json
from pathlib import Path

from langchain_unstructured import UnstructuredLoader

# 1. Path of the PDF to load
pdf_file = r'D:\aiproject\mymilvus\datas\layout-parser-paper.pdf'
# Directory that receives one JSON file per parsed element
output_dir = Path(r'D:\aiproject\mymilvus\datas\output')
output_dir.mkdir(parents=True, exist_ok=True)

# 2. Create the UnstructuredLoader instance (local parsing, hi_res strategy)
loader = UnstructuredLoader(
    file_path=pdf_file,
    strategy="hi_res",
)


def write_json(data, file_name):
    """Write one Document (as a dict) to output_dir/file_name as UTF-8 JSON."""
    with open(output_dir / file_name, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


docs = []
# 3. For now, dump each parsed Document to JSON; later they will go into Milvus
for counter, doc in enumerate(loader.lazy_load()):
    docs.append(doc)
    # File name pattern: <page_number>_<running index>.json
    json_file_name = f"{doc.metadata.get('page_number')}_{counter}.json"
    # Convert the Document to a dict before serializing it
    write_json(doc.model_dump(), json_file_name)

print(f'Number of documents: {len(docs)}')
print(f'Metadata of the first document: {docs[0].metadata}')
print(f'Text of the first document: {docs[0].page_content}')
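Once the JSON files are written, a quick sanity check is to read them back and count how many elements of each category were found on each page. The sketch below assumes the files have the shape produced by doc.model_dump() (a metadata dict containing page_number and category); count_categories is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def count_categories(output_dir):
    """Count saved elements per (page_number, category) pair."""
    counts = Counter()
    for path in sorted(Path(output_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        meta = record.get("metadata", {})
        counts[(meta.get("page_number"), meta.get("category"))] += 1
    return counts
```

For example, count_categories(r'D:\aiproject\mymilvus\datas\output') returns a Counter whose keys look like (5, 'Table'), which makes it easy to spot pages where no table or title was detected.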

Output:

Page 5 of the source PDF contains a table:

The Document parsed from that table looks like this:

{"id": null,
"metadata": {
"source": "D:\\aiproject\\mymilvus\\datas\\layout-parser-paper.pdf",
"detection_class_prob": 0.9028143882751465,
"coordinates": {
"points": [[379.4666442871094,383.0013427734375],
[379.4666442871094,570.6318969726562],
[1321.4498291015625,570.6318969726562],
[1321.4498291015625,383.0013427734375]],
"system": "PixelSpace",
"layout_width": 1700,
"layout_height": 2200},
"links": [{"text": "[ 38 ]","url": "cite.zhong2019publaynet","start_index": 10},
{"text": "[ 3 ]","url": "cite.antonacopoulos2009realistic","start_index": 21},
{"text": "[ 17 ]","url": "cite.newspaper_navigator_dataset","start_index": 35},
{"text": "[ 18 ]","url": "cite.li2019tablebank","start_index": 50},
{"text": "[ 31 ]","url": "cite.shen2020large","start_index": 65}],
"last_modified": "2025-04-07T09:54:09",
"filetype": "application/pdf",
"languages": ["eng"],
"page_number": 5,
"parent_id": "5a1838a8f40b4523094652cf14ab974c",
"file_directory": "D:\\aiproject\\mymilvus\\datas",
"filename": "layout-parser-paper.pdf",
"category": "Table",
"element_id": "cb534ba64da736dc53d60b660f5e1153"
},
"page_content": "Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents",
"type": "Document"}
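The coordinates entry holds the four corner points of the element in pixels, together with the size of the page layout. A minimal sketch of turning it into a bounding box relative to the page is shown here; bounding_box is a hypothetical helper, and the assumption is that the points and the layout size share the same pixel coordinate system ("PixelSpace").

```python
def bounding_box(coordinates):
    """Return (x_min, y_min, x_max, y_max) as fractions of the page layout."""
    xs = [point[0] for point in coordinates["points"]]
    ys = [point[1] for point in coordinates["points"]]
    width = coordinates["layout_width"]
    height = coordinates["layout_height"]
    return (min(xs) / width, min(ys) / height, max(xs) / width, max(ys) / height)


# Coordinates copied from the table Document shown above
table_coords = {
    "points": [
        [379.4666442871094, 383.0013427734375],
        [379.4666442871094, 570.6318969726562],
        [1321.4498291015625, 570.6318969726562],
        [1321.4498291015625, 383.0013427734375],
    ],
    "system": "PixelSpace",
    "layout_width": 1700,
    "layout_height": 2200,
}

print(bounding_box(table_coords))
```

Relative values like these stay valid if the page is later rendered at a different resolution, which is convenient when cropping the table region out of a page image.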


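2.2 Document format

  • metadata: attributes describing the parsed element, including:
      • page_number: the page the element was found on (5 in the example above)
      • category: the element type (for example "Table")
      • element_id: the identifier of the element
      • parent_id: the element_id of the element's parent
  • page_content: the text extracted from the element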
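Because parent_id points at another element's element_id, the flat list of Documents can be regrouped into a simple hierarchy. The following sketch works on plain dicts shaped like the JSON above; group_by_parent is a hypothetical helper and the ids in the sample data are made up for illustration.

```python
from collections import defaultdict


def group_by_parent(records):
    """Map each parent element_id to the page_content of its child elements."""
    children = defaultdict(list)
    for record in records:
        parent_id = record["metadata"].get("parent_id")
        if parent_id is not None:
            children[parent_id].append(record["page_content"])
    return dict(children)


# Hypothetical miniature input, not taken from the parsed paper
records = [
    {"metadata": {"element_id": "t1", "category": "Title"},
     "page_content": "1 Introduction"},
    {"metadata": {"element_id": "p1", "parent_id": "t1", "category": "NarrativeText"},
     "page_content": "First paragraph."},
    {"metadata": {"element_id": "p2", "parent_id": "t1", "category": "NarrativeText"},
     "page_content": "Second paragraph."},
]

print(group_by_parent(records))
```

Grouping like this is one possible way to build larger chunks (a title plus its paragraphs) before embedding, instead of embedding every element on its own.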