[Hands-on Deep Dive] Build a PDF Question-Answering System with the DeepSeek API: Full Project Code Included, Lighter Than LangChain

Edward.W · Published 2025-04-16 22:00:25

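Hi everyone, I'm Edward.W, and I regularly share practical AI deployment recipes. After the previous post on calling the basic DeepSeek API, many readers messaged me asking how to build document Q&A on top of it. Today's tutorial answers that: a PDF question-answering system built on the DeepSeek API that is lighter and leaner than a LangChain-based setup.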

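🔥 What you get from this post:

Pure Python implementation, no heavyweight framework required

Parsing of PDFs that mix Chinese and English text

A text-chunking strategy that works around the model's context-length limit

Full project code uploaded to GitHub (see the end of the post)

👉 Bookmarks outnumber likes 5 to 1 on posts like this, so bookmark it first, then read on!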

I. System Architecture

graph TD
    A[PDF upload] --> B[Text extraction]
    B --> C[Semantic chunking]
    C --> D[Vector storage]
    D --> E[Query retrieval]
    E --> F[Answer generation with DeepSeek]

II. Core Implementation

1. Environment setup

pip install PyPDF2 sentence-transformers numpy requests

2. PDF parsing module

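from PyPDF2 import PdfReader
import re

def extract_text_from_pdf(pdf_path):
    """Extract text from a Chinese/English PDF, preserving paragraph breaks."""
    text = ""
    with open(pdf_path, "rb") as f:
        reader = PdfReader(f)
        for page in reader.pages:
            # extract_text() can return None for pages without a text layer
            page_text = page.extract_text() or ""
            # Rejoin English words hyphenated across a line break
            page_text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', page_text)
            # Replace single line breaks inside a paragraph with a space
            page_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', page_text)
            text += page_text + "\n\n"
    return text.strip()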
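To see what the two substitutions in extract_text_from_pdf do, here is a small sketch that applies them to a hand-written string instead of a real PDF page. The helper name normalize_page_text and the sample text are illustrative assumptions, not part of the original project.

```python
import re

def normalize_page_text(page_text: str) -> str:
    """Apply the same two substitutions used in extract_text_from_pdf."""
    # Rejoin words hyphenated across a line break: "Seman-\ntic" -> "Semantic"
    page_text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', page_text)
    # Replace a lone newline (one that is not part of a blank line) with a space
    page_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', page_text)
    return page_text

sample = "Seman-\ntic chunking keeps\nrelated sentences together.\n\nNext paragraph."
print(normalize_page_text(sample))
```

Note that the second substitution also inserts a space when a Chinese sentence wraps across lines. If that hurts readability in your documents, substituting an empty string for Chinese-only lines may work better; that is a judgment call for your own data.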

3. Text-chunking strategy

import re

import numpy as np
from sentence_transformers import SentenceTransformer

class TextChunker:
    def __init__(self):
        self.encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    def semantic_chunk(self, text, max_length=1000):
        """Split text into chunks guided by semantic similarity."""
        # Split on Chinese and English sentence-ending punctuation, keeping the punctuation
        sentences = [s.strip() for s in re.findall(r'[^。！？.!?]+[。！？.!?]*', text) if s.strip()]
        if not sentences:
            return []

        # Compute sentence embeddings once, up front
        embeddings = self.encoder.encode(sentences)

        # Dynamic chunking
        chunks = []
        current_chunk, current_embs = [], []
        current_len = 0

        for sent, emb in zip(sentences, embeddings):
            sent_len = len(sent)
            if current_len + sent_len > max_length and current_chunk:
                # Similarity between this sentence and the mean embedding of the current chunk
                chunk_emb = np.mean(current_embs, axis=0)
                sim = np.dot(emb, chunk_emb) / (np.linalg.norm(emb) * np.linalg.norm(chunk_emb))

                # Below the threshold, start a new chunk; otherwise keep the sentence
                # with its chunk, so a chunk may grow past max_length
                if sim < 0.7:
                    chunks.append(" ".join(current_chunk))
                    current_chunk, current_embs = [], []
                    current_len = 0

            current_chunk.append(sent)
            current_embs.append(emb)
            current_len += sent_len

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks
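The split decision in the chunker comes down to one cosine-similarity comparison against a threshold. The sketch below reproduces that check in plain Python on toy 2-D vectors so the arithmetic is easy to follow. The function names and the toy vectors are assumptions for illustration; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def should_split(sentence_vec, chunk_vecs, threshold=0.7):
    """Return True when the sentence is dissimilar to the chunk's mean vector."""
    dim = len(sentence_vec)
    mean_vec = [sum(v[i] for v in chunk_vecs) / len(chunk_vecs) for i in range(dim)]
    return cosine_similarity(sentence_vec, mean_vec) < threshold

chunk = [[1.0, 0.0], [0.8, 0.2]]        # toy "embeddings" of the current chunk
print(should_split([0.9, 0.1], chunk))  # same direction as the chunk mean
print(should_split([0.0, 1.0], chunk))  # nearly orthogonal to the chunk mean
```

In the chunker this check only runs once the current chunk would exceed max_length, so the threshold decides whether an over-long chunk is closed or allowed to keep growing.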

4. Question-answering engine

import requests
import numpy as np
from typing import List
from sentence_transformers import SentenceTransformer

class PDFQA:
    def __init__(self, api_key):
        self.api_key = api_key
        self.api_url = "https://api.deepseek.com/v1/chat/completions"
        self.encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
        self.chunks = []
        self.chunk_embeddings = []
    
    def index_document(self, pdf_path):
        """Build the document index."""
        text = extract_text_from_pdf(pdf_path)
        chunker = TextChunker()
        self.chunks = chunker.semantic_chunk(text)
        # Unit-length embeddings turn the dot product below into cosine similarity
        self.chunk_embeddings = self.encoder.encode(self.chunks, normalize_embeddings=True)
    
    def search_relevant_chunks(self, query: str, top_k: int = 3) -> List[str]:
        """Return the top_k chunks most semantically similar to the query."""
        if not self.chunks:
            return []
        query_embedding = self.encoder.encode(query, normalize_embeddings=True)
        scores = np.dot(self.chunk_embeddings, query_embedding)
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]
    
    def ask(self, question: str) -> str:
        """Generate an answer that cites the retrieved chunks."""
        relevant_chunks = self.search_relevant_chunks(question)
        context = "\n\n".join([f"[Ref {i+1}] {chunk}" for i, chunk in enumerate(relevant_chunks)])
        
        prompt = f"""Answer the question based on the following document content:
        {context}
        
        Question: {question}
        Requirements:
        1. Include specific figures and details in the answer
        2. Cite sources, e.g. [Ref 1]
        3. If unsure, answer 'Not clearly stated in the document'"""
        
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        
        response = requests.post(
            self.api_url,
            headers=headers,
            json={
                "model": "deepseek-chat",
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.3  # lower randomness
            },
            timeout=60  # fail instead of hanging forever on a stalled connection
        )
        response.raise_for_status()  # surface HTTP errors instead of a KeyError below
        
        return response.json()["choices"][0]["message"]["content"]
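Because the retrieved chunks are pasted straight into the prompt, a few long chunks can push the request toward the model's context limit. As a sketch of one way to guard against that, the helper below assembles the prompt under a character budget. The name build_prompt, the max_context_chars parameter, its default value, and the English "[Ref n]" labels are all assumptions for illustration, not part of the original project.

```python
def build_prompt(question, chunks, max_context_chars=3000):
    """Assemble a cited-context prompt, stopping before a chunk that would exceed the budget."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[Ref {i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # chunks are ranked, so drop this one and everything after it
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the document excerpts below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Cite sources such as [Ref 1]. "
        'If the excerpts do not contain the answer, reply "Not clearly stated in the document".'
    )

# Sample data: one short chunk that fits and one oversized chunk that is dropped
prompt = build_prompt("What was the R&D growth rate?", ["R&D spending grew 15.7%.", "x" * 5000])
print(prompt)
```

A character count is only a rough proxy for tokens; a tokenizer-based budget would be more precise if you need to pack the context tightly.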

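III. Live Demo

Test code

qa = PDFQA("your_api_key")
qa.index_document("financial_report.pdf")

question = "What is this year's R&D spending growth rate?"
answer = qa.ask(question)
print(f"Q: {question}")
print(f"A: {answer}")

Sample output

Q: What is this year's R&D spending growth rate?
A: According to the 2023 annual financial report [Ref 1]:
the company's R&D spending grew by 15.7% (see Chapter 3, Section 2),
mainly directed at AI and large-model work [Ref 2].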

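IV. Performance Tips

1. Caching: store the embeddings of documents you have already processed (add these as methods of PDFQA)

import pickle

def save_index(self, path='doc_index.pkl'):
    """Save chunks and embeddings so the PDF does not have to be re-encoded."""
    with open(path, 'wb') as f:
        pickle.dump({'chunks': self.chunks, 'embeddings': self.chunk_embeddings}, f)

def load_index(self, path='doc_index.pkl'):
    """Load a saved index. Only unpickle files you created yourself."""
    with open(path, 'rb') as f:
        data = pickle.load(f)
    self.chunks = data['chunks']
    self.chunk_embeddings = data['embeddings']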
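A fixed file name such as doc_index.pkl serves only one document, and it silently goes stale if the PDF changes. One possible refinement, sketched below, derives the cache file name from a hash of the PDF's bytes. The function name index_path_for and the index_cache directory are hypothetical names chosen for this example.

```python
import hashlib
import tempfile
from pathlib import Path

def index_path_for(pdf_path, cache_dir="index_cache"):
    """Return a cache file path derived from the PDF's SHA-256 digest."""
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    return Path(cache_dir) / f"{digest[:16]}.pkl"

# Demo with a throwaway file standing in for a real PDF
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "demo.pdf"
    pdf.write_bytes(b"%PDF-1.4 demo bytes")
    first = index_path_for(pdf, cache_dir=tmp)
    pdf.write_bytes(b"%PDF-1.4 changed bytes")
    second = index_path_for(pdf, cache_dir=tmp)
    # A changed file maps to a different cache entry
    print(first.name, second.name)
```

Before indexing, you would check whether the returned path exists, load the pickled index if it does, and otherwise build and save a fresh one.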

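2. Hybrid retrieval strategy:

from itertools import zip_longest

def hybrid_search(self, query, top_k=3):
    # Semantic search
    semantic_results = self.search_relevant_chunks(query, top_k)
    
    # Keyword search: chunks containing any whitespace-separated query term.
    # Note: query.split() treats an unspaced Chinese query as a single term.
    keyword_results = [
        chunk for chunk in self.chunks 
        if any(word.lower() in chunk.lower() 
              for word in query.split())
    ][:top_k]
    
    # Interleave both rankings so keyword hits are not crowded out, then de-duplicate
    merged = [c for pair in zip_longest(semantic_results, keyword_results)
              for c in pair if c is not None]
    return list(dict.fromkeys(merged))[:top_k]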
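Keyword matching based on query.split() works for English but not for Chinese, because an unspaced Chinese query becomes one long term. A minimal sketch of a workaround is to tokenize English words normally and break Chinese runs into overlapping two-character bigrams. The helper names keyword_tokens and keyword_score are assumptions for illustration; a proper word segmenter would give cleaner terms.

```python
import re

def keyword_tokens(query):
    """Lowercased alphanumeric words plus overlapping bigrams of CJK runs."""
    tokens = [w.lower() for w in re.findall(r'[A-Za-z0-9]+', query)]
    for run in re.findall(r'[\u4e00-\u9fff]+', query):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

def keyword_score(chunk, tokens):
    """Count how many query tokens occur in the chunk (case-insensitive)."""
    text = chunk.lower()
    return sum(1 for t in tokens if t in text)

tokens = keyword_tokens("AI研发投入")
print(tokens)  # ['ai', '研发', '发投', '投入']
print(keyword_score("公司研发投入增长", tokens))  # 3
```

Ranking chunks by keyword_score instead of a plain any() match also gives the keyword branch an ordering, which the original list comprehension lacks.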

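3. Asynchronous requests (using aiohttp):

import aiohttp

async def async_ask(self, question):
    # Build the same kind of request as ask(); retrieval itself still runs synchronously
    chunks = self.search_relevant_chunks(question)
    context = "\n\n".join(f"[Ref {i+1}] {c}" for i, c in enumerate(chunks))
    headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
    payload = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
        "temperature": 0.3,
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(self.api_url, headers=headers, json=payload) as response:
            response.raise_for_status()
            return await response.json()  # full JSON body, not just the answer text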
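The benefit of an async client shows up when several questions are sent at once. The sketch below shows one way to fan out requests while capping how many are in flight. ask_many and max_concurrency are hypothetical names, and fake_ask is a stand-in for async_ask so the example runs without network access or an API key.

```python
import asyncio

async def ask_many(ask_fn, questions, max_concurrency=3):
    """Run ask_fn(question) for every question, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(question):
        async with semaphore:
            return await ask_fn(question)

    # gather returns results in the same order as the input questions
    return await asyncio.gather(*(guarded(q) for q in questions))

async def fake_ask(question):
    # Stand-in for PDFQA.async_ask; simulates a short network delay
    await asyncio.sleep(0.01)
    return f"answer to: {question}"

answers = asyncio.run(ask_many(fake_ask, ["q1", "q2", "q3", "q4"]))
print(answers)
```

With a real PDFQA instance you would pass qa.async_ask as ask_fn; the concurrency cap helps you stay under whatever rate limit applies to your API key.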

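V. Comparison with a LangChain-Based Setup

Feature	This approach	LangChain approach
Dependency complexity	★☆☆☆☆ (lightweight)	★★★★☆ (heavyweight)
Chinese support	★★★★★ (tuned handling)	★★★☆☆
Startup time	Seconds	May need to load multiple models
Customization flexibility	★★★★★	★★★☆☆
Distributed support	Must be implemented yourself	Built in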

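VI. Troubleshooting

🛠 Problem 1: Text extracted from the PDF is garbled
✅ Solution: try the pdfplumber library instead:

import pdfplumber
with pdfplumber.open(pdf_path) as pdf:
    # extract_text() can return None for pages without a text layer
    text = "\n".join([page.extract_text() or "" for page in pdf.pages])

🛠 Problem 2: API requests time out
✅ Solution:

Check your network connection
Add a retry mechanism:

from tenacity import retry, stop_after_attempt, wait_exponential

# Retry up to 3 times, waiting longer between attempts
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def ask_with_retry(self, question):
    return self.ask(question)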
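If you would rather not add a dependency, the same retry idea can be written by hand. The following is a minimal sketch of retrying with exponential backoff in plain Python. The names call_with_backoff, max_attempts and base_delay are assumptions for this example, and the sleep function is injectable only so the demo can record the delays instead of actually waiting.

```python
import time

def call_with_backoff(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the last error propagate
            sleep(base_delay * 2 ** attempt)

# Demo: a function that fails twice with a simulated timeout, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

delays = []
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In practice you would wrap the API call, for example call_with_backoff(lambda: qa.ask(question)), and narrow the except clause to the network errors you actually expect.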

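VII. Where to Take the Project Next

Multi-document management: build an index over a whole document library
Table parsing: integrate camelot to handle tables in PDFs
Visual interface: use Gradio to put together a web UI quickly
Knowledge graph: extract entities and relations from the documents

If you found this post helpful, please like, bookmark, and share! The next post will cover "Building an Automated Weekly Report Generator with the DeepSeek API", so stay tuned!

💬 Discussion: what feature would you like this PDF Q&A system to gain? Tell me in the comments and it may show up in the next update!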
