MySQL similarity search: computing the cosine similarity of all possible text pairs retrieved from 4 MySQL tables


汪德福尔 · Published 2021-02-07 20:29:13

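Below is a minimal example of computing the pairwise cosine similarities between a set of documents (assuming you have already retrieved the titles and texts from the database).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity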

# Assume that's the data we have (4 short documents)
data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]

# Vectorise the data
vec = TfidfVectorizer()
# `X` is now a TF-IDF representation of the data; the first row of `X` corresponds to the first sentence in `data`
X = vec.fit_transform(data)

# Calculate the pairwise cosine similarities (this compares every document with every other one, so it can take a while on a large collection)
S = cosine_similarity(X)

'''
S looks as follows:

array([[1.        , 0.4078538 , 0.19297924, 0.        ],
       [0.4078538 , 1.        , 0.        , 0.        ],
       [0.19297924, 0.        , 1.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        ]])

The first row of `S` contains the cosine similarities of the first sentence to every sentence in `data` (including itself).
For example, the cosine similarity of the first sentence to the third sentence is ~0.193.
The similarity of every sentence/document to itself is 1, hence the diagonal of the similarity matrix is all ones.
Because the row and column indices of `S` match the indices of `data`, `S[i][j]` is the similarity between `data[i]` and `data[j]`, so it is straightforward to look up the sentences behind any score.
'''
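The index bookkeeping described above can be sketched as follows. This is a minimal illustration, not part of the original answer: the helper name `rank_pairs` is hypothetical, and `S` is typed in by hand from the matrix printed above so that the snippet runs without scikit-learn. In practice you would pass the array returned by `cosine_similarity(X)` directly. Since `S` is symmetric, only the pairs with `i < j` are visited, which gives each unordered pair exactly once.

```python
from itertools import combinations


def rank_pairs(S, docs):
    """Return (similarity, doc_i, doc_j) for every unordered pair, most similar first.

    `S` is a square similarity matrix (nested lists or a NumPy array) whose
    row/column indices line up with the indices of `docs`.
    """
    pairs = [
        (float(S[i][j]), docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)  # i < j: skip diagonal and mirrored pairs
    ]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass',
]

# Values copied from the example output above (assumption: S was computed as shown there).
S = [
    [1.0, 0.4078538, 0.19297924, 0.0],
    [0.4078538, 1.0, 0.0, 0.0],
    [0.19297924, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

for score, a, b in rank_pairs(S, data):
    print(f"{score:.3f}  {a!r} <-> {b!r}")
```

If the texts come from several tables, `docs` could just as well be a list of identifiers such as `(table_name, row_id)` tuples kept in the same order as `data`; that layout is an assumption about how you track where each text came from, not something the original answer specifies.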
