【大模型部署】基于知识图谱构建的检索增强生成-GraphRAG（尝鲜篇）

AI-智能

978人浏览 · 2025-01-22 14:56:06

AI-智能 · 2025-01-22 14:56:06 发布

之前专栏有介绍过LLM应用的利器RAG，通过它的实现原理，我们可以看出它有个很大的缺点就是在检索过程中只是对切片片段进行召回，所以也只能回答局部文档问题，无法回答知识库全局问题。

微软前段时间开源了GraphRAG（Graph-based Retrieval Augmented Generation）是在通用RAG基础上结合了知识图谱。GraphRAG的主要流程：利用LLM从知识库中提取实体以及实体关系；利用LLM对实体联系进行聚类，生成社区摘要；在回答用户问题时利用LLM结合社区摘进行回答。所以基于GraphRAG的问答是提供了更丰富的上下文和关系信息，提升生成RAG的检索生成效果，整个pipline如下图所示。

paper: https://arxiv.org/pdf/2404.16130

code: https://github.com/microsoft/graphrag

本文就手把手教你如何使用GraphRAG~

1. 搭建环境

GraphRAG的环境非常简单，安装graphrag包就行。

代码语言：bash

复制

conda create -n graphrag python=3.10
git clone https://github.com/microsoft/graphrag.git
pip install graphrag

2. 数据和配置

2.1 准备知识库

知识图谱我马上能想到的最好的例子就是红楼梦人物关系图了。所以可以让大模型生成一份红楼梦四大家族简介和关键人物介绍，然后保存到txt就可以。当然你也可以直接用demo数据或者业务数据（需要注意的是GraphRAG整个流程是非常消耗token的）。

我们创建目录htmtest/input，将txt文件放入目录

代码语言：bash

复制

mkdir -p ./hlmtest/input

2.2 配置环境变量

再GraphRAG的根目录执行如下命令

代码语言：bash

复制

python -m graphrag.index --init --root ./hlmtest

会得到默认的.env和setting.yaml文件，结构如下：

（1）配置API_KEY

.env 文件中是配置你的openai_key

代码语言：bash

复制

GRAPHRAG_API_KEY=your_openai_key

（2）配置项目中参数

settings.yaml 文件配置主要是配置模型，切片，prompt等项目参数设置。

比如LLM模块，我使用的azure账号，配置如下：

代码语言：bash

复制

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type:  azure_openai_chat # or openai_chat
  model: model_name
  model_supports_json: ture # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: your_api_base
  api_version: your_api_version
  deployment_name:  your_deployment_name

比如文档格式和切片规则模块，这里我使用默认值。通过参数可以看出输入模块不管是文件支持还是切片都有很大优化空间。

代码语言：bash

复制

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

再看实体抽取等模块，我们可以看到自动生成了几个prompt。

代码语言：javascript

复制

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  # 注意这里可以根据你实际数据去定义自己的实体类型以及修改对应的prompt示例
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

其中几个prompt大家可以重点看看，大致就能了解大模型生成的实体，关系和report的输出结构了。

我们看看实体抽取的prompt:

代码语言：python

代码运行次数：0

复制

Cloud Studio 代码运行


-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
 Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)

3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

4. When finished, output {completion_delimiter}

######################
-Examples-
######################
Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.

Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. ��If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.��

The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.

It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
################
Output:
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."{tuple_delimiter}6){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."{tuple_delimiter}5){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}9){completion_delimiter}
#############################
Example 2:

Entity_types: [person, technology, mission, organization, location]
Text:
They were no longer mere operatives; they had become guardians of a threshold, keepers of a message from a realm beyond stars and stripes. This elevation in their mission could not be shackled by regulations and established protocols��it demanded a new perspective, a new resolve.

Tension threaded through the dialogue of beeps and static as communications with Washington buzzed in the background. The team stood, a portentous air enveloping them. It was clear that the decisions they made in the ensuing hours could redefine humanity's place in the cosmos or condemn them to ignorance and potential peril.

Their connection to the stars solidified, the group moved to address the crystallizing warning, shifting from passive recipients to active participants. Mercer's latter instincts gained precedence�� the team's mandate had evolved, no longer solely to observe and report but to interact and prepare. A metamorphosis had begun, and Operation: Dulce hummed with the newfound frequency of their daring, a tone set not by the earthly
#############
Output:
("entity"{tuple_delimiter}"Washington"{tuple_delimiter}"location"{tuple_delimiter}"Washington is a location where communications are being received, indicating its importance in the decision-making process."){record_delimiter}
("entity"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"mission"{tuple_delimiter}"Operation: Dulce is described as a mission that has evolved to interact and prepare, indicating a significant shift in objectives and activities."){record_delimiter}
("entity"{tuple_delimiter}"The team"{tuple_delimiter}"organization"{tuple_delimiter}"The team is portrayed as a group of individuals who have transitioned from passive observers to active participants in a mission, showing a dynamic change in their role."){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Washington"{tuple_delimiter}"The team receives communications from Washington, which influences their decision-making process."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"The team is directly involved in Operation: Dulce, executing its evolved objectives and activities."{tuple_delimiter}9){completion_delimiter}
#############################
Example 3:

Entity_types: [person, role, technology, organization, event, location, concept]
Text:
their voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.

"It's like it's learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers' a whole new meaning."

Alex surveyed his team��each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."

Together, they stood on the edge of the unknown, forging humanity's response to a message from the heavens. The ensuing silence was palpable��a collective introspection about their role in this grand cosmic play, one that could rewrite human history.

The encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation
#############
Output:
("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}
("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}
("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}
("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}
("entity"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity's Response is the collective action taken by Alex's team in response to a message from an unknown intelligence."){record_delimiter}
("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity's Response to the unknown intelligence."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}7){completion_delimiter}
#############################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:

大家务必基于自己的数据去调整prompt，否则生成的效果会很差。如果数据集是中文，可以尝试将prompt

3 运行GraphRAG的pipline

设置好env和setting后，我们就可以执行了~

再次提醒：整个过程LLM调用次数非常多，大家对token消耗现有个心里预期。有网友使用185K个字符的文档，使用GPT-4o，消耗大概7美元。

代码语言：javascript

复制

python -m graphrag.index --root ./hlmtest

执行成功会出现如下日志，可以看到工作流流程和进度：

我们看下运行后输出了什么

indexing-engine.log 这个文件记录了详细的pipline日志信息。

再看看生成的create_final_relationships和create_final_community_reports。这里我使用的是gpt3.5，prompt改为中文（否则效果很差）。

4 基于知识问答

基于以上pipline后，我们生成了实体图，关系图，社区摘要等中间文件。下面就基于graphrag进行问答。

代码语言：javascript

复制

python -m graphrag.query --root ./hlmtest --method global "贾元春和贾宝玉的关系？"

代码语言：javascript

复制

python -m graphrag.query --root ./hlmtest --method global "红楼梦中金陵十二钗有哪些人物？"

以上就是使用GraphRAG进行问答的整个过程_{大家可以先尝鲜体验，后面有时间，我们仔细分析源码，来看看每个步骤的实现逻辑}

如何系统的去学习大模型LLM ？

大模型时代，火爆出圈的LLM大模型让程序员们开始重新评估自己的本领。 “AI会取代那些行业？”“谁的饭碗又将不保了？”等问题热议不断。

事实上，抢你饭碗的不是AI，而是会利用AI的人。

继科大讯飞、阿里、华为等巨头公司发布AI产品后，很多中小企业也陆续进场！超高年薪，挖掘AI大模型人才！ 如今大厂老板们，也更倾向于会AI的人，普通程序员，还有应对的机会吗？

与其焦虑……

不如成为「掌握AI工具的技术人」，毕竟AI时代，谁先尝试，谁就能占得先机！

但是LLM相关的内容很多，现在网上的老课程老教材关于LLM又太少。所以现在小白入门就只能靠自学，学习成本和门槛很高。

针对所有自学遇到困难的同学们，我帮大家系统梳理大模型学习脉络，将这份 LLM大模型资料 分享出来：包括LLM大模型书籍、640套大模型行业报告、LLM大模型学习视频、LLM大模型学习路线、开源大模型学习教程等, 😝有需要的小伙伴，可以 扫描下方二维码领取🆓↓↓↓

👉CSDN大礼包🎁：全网最全《LLM大模型入门+进阶学习资源包》免费分享（安全链接，放心点击）👈

一、LLM大模型经典书籍

AI大模型已经成为了当今科技领域的一大热点，那以下这些大模型书籍就是非常不错的学习资源。

在这里插入图片描述

二、640套LLM大模型报告合集

这套包含640份报告的合集，涵盖了大模型的理论研究、技术实现、行业应用等多个方面。无论您是科研人员、工程师，还是对AI大模型感兴趣的爱好者，这套报告合集都将为您提供宝贵的信息和启示。(几乎涵盖所有行业)

在这里插入图片描述

三、LLM大模型系列视频教程

在这里插入图片描述

四、LLM大模型开源教程（LLaLA/Meta/chatglm/chatgpt）

在这里插入图片描述

LLM大模型学习路线 ↓

阶段1：AI大模型时代的基础理解

目标：了解AI大模型的基本概念、发展历程和核心原理。
内容：
- L1.1 人工智能简述与大模型起源
- L1.2 大模型与通用人工智能
- L1.3 GPT模型的发展历程
- L1.4 模型工程
- L1.4.1 知识大模型
- L1.4.2 生产大模型
- L1.4.3 模型工程方法论
- L1.4.4 模型工程实践
- L1.5 GPT应用案例

阶段2：AI大模型API应用开发工程

目标：掌握AI大模型API的使用和开发，以及相关的编程技能。
内容：
- L2.1 API接口
- L2.1.1 OpenAI API接口
- L2.1.2 Python接口接入
- L2.1.3 BOT工具类框架
- L2.1.4 代码示例
- L2.2 Prompt框架
- L2.3 流水线工程
- L2.4 总结与展望