Neural Language Models as Domain-Specific Knowledge Bases

Natural Language Processing | Transformer-Based Language Models | Domain Knowledge and Domain Ontology

The fundamental challenge of natural language processing (NLP) is resolution of the ambiguity that is present in the meaning of and intent carried by natural language. Ambiguity occurs at multiple levels of language understanding, as depicted below:

To resolve ambiguity within a text, algorithms use knowledge from the context within which the text appears. For example, the presence of the sentence “I visited the zoo.” before the sentence “I saw a bat” can be used to conclude that bat represents an animal and not a wooden club.

While in many situations neighboring text is sufficient for reducing ambiguity, typically it is not sufficient when dealing with text from specialized domains. Processing domain-specific text requires an understanding of a large number of domain-specific concepts and processes that NLP algorithms cannot glean from neighboring text alone.

As an example, in the title insurance and settlement domain, an algorithm may require an understanding of concepts like:

A 1003 is a form used to apply for a mortgage
Balancing is a processing step on the path to closing a real estate transaction
When Balancing is completed, one possible next processing step is the Release of Funds

This is where domain-specific knowledge bases come in.

Domain-specific knowledge bases capture domain knowledge that NLP algorithms need to correctly interpret domain-specific text.

Depending on the use case, knowledge housed in a knowledge base could be of a specific type or of multiple types.

Traditionally, knowledge bases have been modeled as graph-based ontologies and added as one of the components in text processing pipelines. However, with the advent of Transformer-based language models, like BERT and GPT-2, there has been research and evaluation[1][2] on the type and quality of knowledge inherent in a trained language model.

Note: Statistical language models are probability distributions over sequences of words. The probability distributions are used to predict the most likely word(s) at a position in a text sequence, given preceding and/or succeeding words of the sequence.
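
To make the note above concrete, here is a minimal sketch of a statistical language model: a bigram model that estimates the probability of the next word given only the preceding word. The toy corpus is invented for illustration, and the function names are assumptions of this example. Transformer-based models condition on far more context, but the underlying idea of a probability distribution over words is the same.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next_word | previous_word) from whitespace-tokenized sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize raw counts into conditional probability distributions.
    return {
        prev: {word: n / sum(following.values()) for word, n in following.items()}
        for prev, following in counts.items()
    }

def most_likely_next(model, prev_word):
    """Return the most probable word after prev_word, or None if it was never seen."""
    dist = model.get(prev_word.lower())
    return max(dist, key=dist.get) if dist else None

# Toy corpus, invented for illustration only.
corpus = [
    "please provide payoff statement",
    "please provide payoff statement today",
    "please provide updated fee sheet",
]
model = train_bigram_model(corpus)
print(most_likely_next(model, "provide"))  # -> payoff
```

Here "payoff" follows "provide" in two of the three toy sentences, so it receives the highest conditional probability.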

A favorable result of such evaluation can make a case for dropping the separate knowledge base component from text processing pipelines, significantly reducing their complexity. Focusing on this idea, this blog post describes the challenges of graph-based knowledge bases and provides an evaluation of the domain knowledge present in language models that have been fine-tuned on text from the title and settlement domain.

Types of knowledge expected in a knowledge base

To evaluate a knowledge base, one needs to test it on all types of knowledge that will be required to process text from the domain. Below is the list of select knowledge types that might be required in a knowledge base for domain-specific text processing:

1. Knowledge of business-specific entities, like artifacts, events, and actors

Examples: documents, files, notaries, and specific types of documents, e.g. 1003
This type of knowledge enables the definition and application of entity-level domain attributes. For example, once something is identified as a document, one knows that it can be shared and that it has a page count attribute

2. Knowledge of entity attributes

Example: residential refinance borrowers have attributes such as marital status
This type of knowledge defines the attributes associated with an entity type and facilitates interpretation of text and extraction of information

3. Knowledge of relationships amongst and between entities

Example: a closing disclosure is part of a lender package
This type of knowledge helps evaluate the impact on parent entities when child entities are modified

4. Knowledge of how it all fits together into business processes

Example: finding a notary leads to scheduling a closing appointment
This type of knowledge dictates the next action, given information about the business process state

Depending on the business domain and use case, more types of knowledge may be required in the knowledge base.
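
One way to picture the four knowledge types above is as plain lookup structures. The sketch below is illustrative only: the class name `DomainKnowledge`, its field names, and the process state labels are assumptions made for this example, populated with the examples given in this section and in the introduction.

```python
from dataclasses import dataclass, field

@dataclass
class DomainKnowledge:
    entity_types: dict = field(default_factory=dict)  # entity -> entity type
    attributes: dict = field(default_factory=dict)    # entity type -> attribute names
    part_of: dict = field(default_factory=dict)       # child entity -> parent entity
    next_steps: dict = field(default_factory=dict)    # process state -> next actions

    def affected_parents(self, entity):
        """Walk part-of links upward to list entities impacted by a change."""
        parents = []
        # The second condition guards against accidental cycles in part_of.
        while entity in self.part_of and self.part_of[entity] not in parents:
            entity = self.part_of[entity]
            parents.append(entity)
        return parents

kb = DomainKnowledge(
    entity_types={"1003": "document"},
    attributes={"document": ["page count"], "borrower": ["marital status"]},
    part_of={"closing disclosure": "lender package"},
    next_steps={
        "notary found": ["schedule closing appointment"],
        "balancing completed": ["release of funds"],
    },
)

print(kb.affected_parents("closing disclosure"))  # -> ['lender package']
print(kb.next_steps["notary found"])              # -> ['schedule closing appointment']
```

A real knowledge base would hold far more entries and relation types; the point is only that each knowledge type answers a different kind of question.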

Typical knowledge base design, lifecycle, and costs

To create a knowledge base, one needs to define the knowledge that would be present in the knowledge base. Depending on the knowledge base, this is done either completely by human experts, for example, WordNet[3], or by automated algorithms that may or may not build upon human-defined knowledge, for example, YAGO[4].

Knowledge, once defined, is modeled in knowledge bases as graphs or ontologies. Concepts like classes and individuals are modeled as nodes, and relations amongst them are modeled as edges of graphs. Classes express concepts like documents and events. Individuals express instances of classes, for example, 1003s, closing disclosures, and deeds all being instances of class documents. Edges capture relationships between classes and individuals. Examples of relationships that edges capture are is-type-of, is-instance-of, and has-attribute. In the example below, a directed edge between 1003 and Document is used to capture the knowledge that a 1003 is a type of Document.
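
The graph model described above can be sketched in a few lines of Python. This is a simplified stand-in for a real ontology store, not how any particular knowledge base is implemented: the `KnowledgeGraph` class and its methods are hypothetical, while the relation labels and the 1003-to-Document edge come from the text.

```python
from collections import defaultdict

class KnowledgeGraph:
    """A tiny directed, labeled graph: nodes are concepts, edges are relations."""

    def __init__(self):
        self._edges = defaultdict(set)  # (subject, relation) -> set of objects

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        return set(self._edges.get((subject, relation), set()))

    def is_type_of(self, subject, target):
        """Follow is-type-of edges transitively from subject, looking for target."""
        seen, stack = set(), [subject]
        while stack:
            node = stack.pop()
            for parent in self.objects(node, "is-type-of"):
                if parent == target:
                    return True
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False

kg = KnowledgeGraph()
kg.add("1003", "is-type-of", "Document")
kg.add("Closing Disclosure", "is-type-of", "Document")
kg.add("Deed", "is-type-of", "Document")
kg.add("Document", "has-attribute", "page count")

print(kg.is_type_of("1003", "Document"))  # -> True
print(kg.is_type_of("Document", "1003"))  # -> False
```

Because the edges are directed, the graph records that a 1003 is a type of Document without implying the reverse.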

The ontology or graph itself is represented using knowledge representation languages like RDF, RDFS, and OWL, and stored in formats like XML.

There are many popular knowledge bases, like YAGO and ConceptNet, that model a variety of relationships this way. They often pick up entities to model from other knowledge bases like Wikipedia and WordNet.

Once a knowledge base is ready, different NLP models use different ways to incorporate it in their pipelines. Options include:

Using entity type information present in the knowledge base to replace or augment entity occurrences in text
Using graph-based measures, like the distance between the nodes of entities mentioned in text, as features in a model[5]
More advanced methods that tie feature extraction from the graph into the backpropagation and loss optimization loop through techniques like graph embeddings and LSTM-based feature extraction[6]
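
As an illustration of the second option, the sketch below computes the hop distance between two entities in a small undirected graph using breadth-first search and exposes it as a numeric feature. The edges are hypothetical, loosely based on the examples in this post, and the choice of -1 for unconnected entities is an assumption of this sketch rather than something taken from the cited work.

```python
from collections import deque

def build_adjacency(edges):
    """Turn a list of (a, b) pairs into an undirected adjacency map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def shortest_path_length(adjacency, start, goal):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical edges, loosely based on the examples in this post.
edges = [
    ("1003", "Document"),
    ("Closing Disclosure", "Document"),
    ("Closing Disclosure", "Lender Package"),
    ("Notary", "Actor"),
]
adjacency = build_adjacency(edges)

def distance_feature(entity_a, entity_b):
    """Numeric feature: hop distance, or -1 when the entities are not connected."""
    d = shortest_path_length(adjacency, entity_a, entity_b)
    return -1 if d is None else d

print(distance_feature("1003", "Lender Package"))  # -> 3
print(distance_feature("1003", "Notary"))          # -> -1
```

A downstream model could consume this value alongside its text features, which is the kind of extra plumbing the next paragraph refers to.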

These methods are non-trivial and require significant effort to invent and incorporate in training and application phases.

Based on the above, it is clear that knowledge bases, in the form of graphs and ontologies, are costly in terms of time, effort, and money, especially when human experts are involved.

Evaluating if Transformer-based language models carry built-in domain knowledge bases

Motivation

Since creating and incorporating ontology-based knowledge bases is costly, a better alternative is always welcomed.

In recent years, Transformer-based language models like BERT and GPT-2 have dominated the leaderboards across NLP competitions, tasks, and benchmarks. Impressively, they have achieved this with:

A single model forming the end-to-end pipeline (in contrast to multi-model text processing pipelines)
Minimal fine-tuning of pre-trained models
Use of a common base model across multiple tasks

Given such success and ease of use, it would be ideal if a fine-tuned version of these models incorporated domain knowledge within them. If so, one can:

Skip the high costs of creating a knowledge base
Drop the separate knowledge base component and simplify the multi-model text processing pipeline
Retain the benefits of having a knowledge base, as knowledge captured within the language model will automatically come into play while using the model for downstream tasks
Get an additional boost from the broader world knowledge built into the language model during its pre-training phase

Evaluation methodology

To evaluate, we fine-tuned a roberta-base and a gpt2-medium model (both from Hugging Face) on internal company data and explored the knowledge captured in them. The methodology used to evaluate the models was:

1. Decide on a type of knowledge that needs to be evaluated. Example: is-a-type-of relationships

2. Generate specific instances of the knowledge type to test for the presence of knowledge. Example: 1003 is a type of document.

3. Modify the knowledge instance by either masking a word or adding a fill-in-the-blank question at the end of the instance. Example: 1003 is a type of _______.

4. Ask the model to fill in the blank and evaluate the answer.
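
The four steps above can be sketched as a small probing harness. This is not the code used for the evaluation in this post: `make_probe`, `evaluate_probes`, and the stub model are names invented for this example, and the top-k hit rate is an assumed scoring rule. The optional adapter shows how the Hugging Face `fill-mask` pipeline could be plugged in; it is not executed here because it downloads a model.

```python
def make_probe(statement, answer, mask_token="<mask>"):
    """Step 3: replace the answer word in a knowledge statement with a mask token."""
    if answer not in statement:
        raise ValueError(f"{answer!r} does not occur in {statement!r}")
    return statement.replace(answer, mask_token, 1)

def evaluate_probes(probes, fill_mask, top_k=5):
    """Step 4: fraction of probes whose expected answer is among the top-k guesses.

    fill_mask is any callable taking (text, top_k) and returning a list of
    candidate strings, best first.
    """
    hits = 0
    for statement, answer in probes:
        guesses = fill_mask(make_probe(statement, answer), top_k)
        hits += answer.lower() in [g.strip().lower() for g in guesses]
    return hits / len(probes)

def huggingface_fill_mask(model_name="roberta-base"):
    """Optional adapter around the Hugging Face fill-mask pipeline (not run here)."""
    from transformers import pipeline  # imported lazily; requires `transformers`
    fill = pipeline("fill-mask", model=model_name)
    return lambda text, top_k: [r["token_str"] for r in fill(text, top_k=top_k)]

def stub_fill_mask(text, top_k):
    """Stand-in model with canned guesses so the sketch runs offline."""
    canned = {"1003 is a type of <mask>.": ["document", "form", "loan"]}
    return canned.get(text, [])[:top_k]

# Steps 1 and 2: one instance of the is-type-of knowledge type.
probes = [("1003 is a type of document.", "document")]
score = evaluate_probes(probes, stub_fill_mask)
print(score)  # -> 1.0
```

Passing `huggingface_fill_mask()` in place of `stub_fill_mask` would run the same probes against a pre-trained RoBERTa model; a fine-tuned checkpoint could be supplied through `model_name`.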

Note, there are other methodologies that can be used depending on the type of knowledge that needs to be tested.

Sample model inputs and outputs

Select findings and analysis

RoBERTa captured multiple types of relationships and was able to automatically extend learned concepts to new, unseen entity instances

Is-type-of relationship:

4. Pre-fine-tuning: Citibank is a type of bank.

5. Post-fine-tuning: Citibank is a type of lender.

Note, Citibank was not present in the dataset at all. It seems the model learned that banks play the role of lenders in the title insurance and settlement business. It then applied that learning to banks it knew from its pre-training.

Entity-attribute relationship:

6. Post-fine-tuning: Notary’s _____ is _____

Top values suggested for the first blank: name, information, email, contact, info, address, confirmation, schedule, number

Entity-associated-with-action relationship:

7. ________ needs to be changed

Values suggested for the blank: password, title, this, nothing, CD, amount, fee, it, name

While the model's results were less convincing for a few examples, for most they were as convincing as those above. It seems the language model captures is-type-of, entity-attribute, and entity-associated-with-action relationships well.

Knowledge output by the model, while mostly sensible, was not always informative, useful, or what was ideally desired

Part-of relationship:

8. Lender package is made up of pages.

Other top answers were: two, documents, trust, boxes

For the above example, while answers like pages and documents are not wrong, they are not very useful either. Answers like closing disclosure, 1003, or signature affidavit would have been more informative. Similarly, in example 7, answers like nothing and this are not informative. Such non-useful answers were found across various evaluations and could roughly be classified into the following categories: pronouns, interrogative words like ‘what’, punctuation marks, and words repeated with case variations.
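
A simple post-processing filter for these categories might look like the sketch below. It is not part of the evaluation described in this post; the word lists are small illustrative samples and the function name is an assumption. The input is the candidate list from example 7.

```python
import string

# Small illustrative word lists; a real filter would need fuller vocabularies.
PRONOUNS = {"it", "this", "that", "they", "he", "she", "we", "you", "nothing"}
INTERROGATIVES = {"what", "who", "which", "when", "where", "why", "how"}

def filter_candidates(candidates):
    """Drop pronouns, question words, punctuation, and case-variant repeats."""
    kept, seen = [], set()
    for cand in candidates:
        token = cand.strip()
        key = token.lower()
        # Skip empty strings and tokens made up only of punctuation.
        if not token or all(ch in string.punctuation for ch in token):
            continue
        if key in PRONOUNS or key in INTERROGATIVES:
            continue
        if key in seen:  # e.g. "Title" appearing after "title"
            continue
        seen.add(key)
        kept.append(token)
    return kept

example_7 = ["password", "title", "this", "nothing", "CD", "amount", "fee", "it", "name"]
print(filter_candidates(example_7))
# -> ['password', 'title', 'CD', 'amount', 'fee', 'name']
```

A filter like this removes the obvious noise, but it cannot tell a correct-but-generic answer such as pages from a more informative one such as closing disclosure.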

Models developed a good understanding of business process, but required intelligent external input to make use of that understanding

Business process understanding

Fine-tuned RoBERTa:

9.

Input: We are balanced.
Augmentation: Please provide ________
Model output (for next two words): payoff statement

Fine-tuned gpt2-medium:

10.

Input: Loan amount is wrong.
Augmentation: Please provide ________
Model output: updated fee sheet with correct loan amount

11.

Input: Loan amount is wrong.
Augmentation: Please obtain ________
Model output: deed correcting vesting

12.

Input: Notary is available.
Augmentation: Next step is ________
Model output: signing

Note that in examples 9, 10, and 12, when provided with a good augmentation, the model was able to deliver a sensible answer. In contrast, in example 11, where the augmentation was not typical, the model did not generate an acceptable answer. This may turn out to be an issue if downstream tasks do not have an easy way to provide the required text augmentation.
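
For illustration, the augmentation step in examples 9 to 12 amounts to appending a short cue to the input before handing it to the model. The helpers below are a sketch under assumptions: the function names are invented, and `<mask>` is used as the blank because it is RoBERTa's mask token. How the two-word blank in example 9 was actually filled is not described in this post.

```python
def augment_for_masked_lm(text, augmentation, n_blanks=1, mask_token="<mask>"):
    """Append an augmentation ending in one or more mask tokens (RoBERTa-style)."""
    blanks = " ".join([mask_token] * n_blanks)
    return f"{text} {augmentation} {blanks}"

def augment_for_causal_lm(text, augmentation):
    """Append an augmentation as a prefix for left-to-right generation (GPT-2-style)."""
    return f"{text} {augmentation}"

print(augment_for_masked_lm("We are balanced.", "Please provide", n_blanks=2))
# -> We are balanced. Please provide <mask> <mask>
print(augment_for_causal_lm("Loan amount is wrong.", "Please provide"))
# -> Loan amount is wrong. Please provide
```

The hard part is not the string handling but choosing the cue: "Please provide" and "Please obtain" led to very different output quality in examples 10 and 11.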

Beyond the language model’s built-in knowledge base

While this blog primarily focused on evaluating language models’ built-in knowledge bases, if it is required and not cost-prohibitive, one may want to consider an approach similar to COMeT[7]. COMeT fine-tunes a pre-trained language model on an existing knowledge base to predict known and new knowledge base relationships.

Training on an existing knowledge base enriches the model with wider and more accurate knowledge than what was contained in the original knowledge base or the pre-trained language model. While discussion of COMeT’s approach is out of scope for this blog, a demo of COMeT trained on ATOMIC[8] and ConceptNet[9] is available online.

Takeaway

Based on the above analysis, we believe that language models, once fine-tuned, do hold rich domain knowledge within them. However, making direct use of that knowledge poses challenges, like providing the right augmentation text and identifying the more informative model outputs.

The article first appeared on statestitle.com

[1] Petroni, Fabio, et al. “Language Models as Knowledge Bases?” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

[2] Davison, Joe, Joshua Feldman, and Alexander M. Rush. “Commonsense knowledge mining from pretrained models.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

[3] Princeton University. “About WordNet.” WordNet. Princeton University. 2010.

[4] Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum. “Yago: a core of semantic knowledge.” Proceedings of the 16th International Conference on World Wide Web. 2007.

[5] Xia, Jiangnan, Chen Wu, and Ming Yan. “Incorporating Relation Knowledge into Commonsense Reading Comprehension with Multi-task Learning.” Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019.

[6] Lin, Bill Yuchen, et al. “KagNet: Knowledge-aware graph networks for commonsense reasoning.” arXiv preprint arXiv:1909.02151. 2019.

[7] Bosselut, Antoine, et al. “COMET: Commonsense Transformers for Knowledge Graph Construction.” Association for Computational Linguistics (ACL). 2019.

[8] Sap, Maarten, et al. “Atomic: An atlas of machine commonsense for if-then reasoning.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.

[9] Speer, Robyn, Joshua Chin, and Catherine Havasi. “ConceptNet 5.5: an open multilingual graph of general knowledge.” Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 2017.

Translated from: https://towardsdatascience.com/neural-language-models-as-domain-specific-knowledge-bases-9b505b21de5
