Natural Language Processing: Measuring the Overlap Between Texts with the Dot Product

糯米君_ · Published 2020-12-20 22:04:41

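If we can measure the overlap between two texts, we get a good estimate of how similar their word usage is, and that in turn is a good estimate of how much they overlap in meaning.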

import numpy as np
import pandas as pd

sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())
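# pd.DataFrame.from_records() builds a DataFrame from records such as tuples and dicts;
# tokens missing from a sentence become NaN, which fillna(0) turns into 0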
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
print(df)

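# Transpose back so that each sentence is a column and each token is a row
df = df.T
# The dot product of two binary columns counts the tokens both sentences contain
print(df.sent0.dot(df.sent1))  # sent0 and sent1 share no tokens
print(df.sent0.dot(df.sent2))  # sent0 and sent2 share 'the'
print(df.sent0.dot(df.sent3))  # sent0 and sent3 share 'Monticello'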
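
Because every vector entry is 0 or 1, the dot product of two sentence vectors is simply the number of tokens the two sentences share. The sketch below is one way to check this with plain Python sets, assuming the same whitespace tokenization as the code above; the helper name `shared_tokens` is hypothetical and is not part of the original post.

```python
# The same four sentences as above, stored as a list for convenience
sentences = [
    "Thomas Jefferson began building Monticello at the age of 26.",
    "Construction was done mostly by local masons and carpenters.",
    "He moved into the South Pavilion in 1770.",
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.",
]


def shared_tokens(a, b):
    """Return the set of whitespace-separated tokens that occur in both sentences."""
    return set(a.split()) & set(b.split())


# Compare the first sentence with each of the others
for i in range(1, len(sentences)):
    common = shared_tokens(sentences[0], sentences[i])
    print('sent0 vs sent{}: {} shared token(s) {}'.format(i, len(common), sorted(common)))
```

Listing the shared tokens also shows the limits of this measure: with whitespace splitting, "Jefferson" and "Jefferson's" count as different tokens, so the overlap between the first and last sentences is lower than a reader would expect.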