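Unlock a new level of search: let semantic text matching help you find whatever you need (quick-start baseline)
This project implements a range of similarity-computation and matching-search algorithms for both text and images. It is developed in python3, installed with pip, and usable out of the box.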
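- Text similarity computation (text matching)
  - Cosine similarity: the cosine of the angle between two vectors
  - Dot product: the inner product of two vectors after normalization
  - Hamming distance, edit (Levenshtein) distance, Euclidean distance, Manhattan distance, and others
  - Semantic models
    - CoSENT text matching model (recommended)
    - BERT model (text vector representation)
    - SentenceBERT text matching model
  - Literal models
    - Word2Vec shallow semantic text representation (recommended)
    - Synonym thesaurus (Cilin, 同义词词林)
    - HowNet sememe matching
    - BM25, RankBM25
    - TFIDF
    - SimHash
- Image similarity computation (image matching)
  - Semantic models
    - CLIP (Contrastive Language-Image Pre-Training)
    - VGG (in progress)
    - ResNet (in progress)
  - Feature extraction
    - pHash (recommended), dHash, wHash, aHash
    - SIFT (Scale-Invariant Feature Transform)
    - SURF (Speeded-Up Robust Features) (in progress)
- Image-text similarity computation
  - CLIP (Contrastive Language-Image Pre-Training)
- Matching search
  - SemanticSearch: vector similarity retrieval using cosine similarity plus top-k for efficient computation, about an order of magnitude faster than one-to-one brute-force comparison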
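To make the measures named in the overview concrete, here is a minimal pure-Python sketch of each one. It is written for illustration only and is not taken from the similarities library; the function names are my own.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in [-1, 1]."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute
        prev = cur
    return prev[-1]
```

Note that the dot product of two normalized vectors equals their cosine similarity, which is why normalizing embeddings once up front makes repeated comparisons cheap.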
The project link is at the end of this article.
Environment setup:
!pip install --upgrade pip -i https://mirrors.cloud.tencent.com/pypi/simple
!pip install -U similarities -i https://mirrors.cloud.tencent.com/pypi/simple
# Install the dependencies
!pip install -r /home/mw/project/similarities-main/requirements.txt -i https://mirrors.cloud.tencent.com/pypi/simple
# Install a newer version of torch; restart the kernel once installation finishes
!pip install torch==1.12.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu -i https://mirrors.cloud.tencent.com/pypi/simple
1.1 Text semantic similarity computation
from similarities import Similarity
# The model files are under the input path; a Chinese model and a multilingual model are provided, choose whichever you need
m = Similarity(model_name_or_path="/home/mw/input/99556126636/zh_model/ 中文模型",max_seq_length=128)
r = m.similarity('明天的天气不错是晴天', '今天天气很好阳光明媚')
print(f"similarity score: {float(r)}")
2023-09-11 02:36:44.046 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
similarity score: 0.7727918028831482
Default signature of Similarity:
Similarity(corpus: Union[List[str], Dict[str, str]] = None,
model_name_or_path="shibing624/text2vec-base-chinese",
max_seq_length=128)
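- Return value: the cosine score, in the range [-1, 1]; the larger the value, the more similar the texts
- corpus: the document set used for search, needed only when searching; the input format is a list of sentences, List[str], or a {corpus_id: sentence} mapping, Dict[str, str]
- model_name_or_path: model name or model path; by default the Chinese semantic matching model shibing624/text2vec-base-chinese is downloaded from the HF model hub and used; for multilingual scenarios it can be replaced with the multilingual matching model shibing624/text2vec-base-multilingual
- max_seq_length: maximum length of an input sentence, capped at the maximum length supported by the matching model (512 for the BERT family)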
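1.2 Text semantic matching search
This typically means finding, within a candidate set of documents, the texts most similar to a query. It is commonly used for tasks such as matching similar questions in QA scenarios and retrieving similar texts.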
from similarities import Similarity
# 1.Compute cosine similarity between two sentences.
sentences = ['明天的天气不错是晴天',
'今天天气很好阳光明媚']
corpus = [
'今天天气很好阳光明媚',
'在好天气里,我喜爱去漫步',
'这本书太无聊了,我无奈读上来',
'我喜爱去海边度假,感触阳光和海风',
'在旅行途中,咱们遇到了许多乏味的人',
'我喜爱随时随地用耳机听音乐',
]
model = Similarity(model_name_or_path="/home/mw/input/99556126636/zh_model/ 中文模型")
print(model)
similarity_score = model.similarity(sentences[0], sentences[1])
print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
print('-' * 50 + '\n')
# 2.Compute similarity between two list
similarity_scores = model.similarity(sentences, corpus)
print(similarity_scores.numpy())
for i in range(len(sentences)):
    for j in range(len(corpus)):
        print(f"{sentences[i]} vs {corpus[j]}, score: {similarity_scores.numpy()[i][j]:.4f}")
print('-' * 50 + '\n')
# 3.Semantic Search
model.add_corpus(corpus)
res = model.most_similar(queries=sentences, topn=3)
print(res)
for q_id, c in res.items():
    print('query:', sentences[q_id])
    print("search top 3:")
    for corpus_id, s in c.items():
        print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
2023-09-11 02:43:19.744 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
Similarity: Similarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/ 中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
--------------------------------------------------
2023-09-11 02:43:22.348 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
[[0.77279186 0.377486 0.2831661 0.3328314 0.33157927 0.271398]
[1. 0.4531002 0.22196919 0.42843264 0.31628954 0.28194088]]
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
明天的天气不错是晴天 vs 在好天气里,我喜爱去漫步, score: 0.3775
明天的天气不错是晴天 vs 这本书太无聊了,我无奈读上来, score: 0.2832
明天的天气不错是晴天 vs 我喜爱去海边度假,感触阳光和海风, score: 0.3328
明天的天气不错是晴天 vs 在旅行途中,咱们遇到了许多乏味的人, score: 0.3316
明天的天气不错是晴天 vs 我喜爱随时随地用耳机听音乐, score: 0.2714
今天天气很好阳光明媚 vs 今天天气很好阳光明媚, score: 1.0000
今天天气很好阳光明媚 vs 在好天气里,我喜爱去漫步, score: 0.4531
今天天气很好阳光明媚 vs 这本书太无聊了,我无奈读上来, score: 0.2220
今天天气很好阳光明媚 vs 我喜爱去海边度假,感触阳光和海风, score: 0.4284
今天天气很好阳光明媚 vs 在旅行途中,咱们遇到了许多乏味的人, score: 0.3163
今天天气很好阳光明媚 vs 我喜爱随时随地用耳机听音乐, score: 0.2819
--------------------------------------------------
2023-09-11 02:43:23.945 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 6, emb len: 6
{0: {0: 0.772791862487793, 1: 0.377485990524292, 3: 0.33283141255378723}, 1: {0: 1.0, 1: 0.45310020446777344, 3: 0.4284326434135437}}
query: 明天的天气不错是晴天
search top 3:
今天天气很好阳光明媚: 0.7728
在好天气里,我喜爱去漫步: 0.3775
我喜爱去海边度假,感触阳光和海风: 0.3328
query: 今天天气很好阳光明媚
search top 3:
今天天气很好阳光明媚: 1.0000
在好天气里,我喜爱去漫步: 0.4531
我喜爱去海边度假,感触阳光和海风: 0.4284
The cosine score ranges over [-1, 1]; the larger the value, the more similar the query is to the corpus text.
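The idea behind cosine similarity plus top-k retrieval can be sketched in plain Python as follows. This is an illustrative sketch under my own assumptions, not the library's code: the helper names are hypothetical and the 2-D vectors are made up, standing in for real sentence embeddings. The returned mapping has the same {corpus_id: score} shape as the search results printed above.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals the cosine."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def build_index(corpus_vectors):
    """Normalize every corpus vector once, ahead of any query."""
    return [normalize(v) for v in corpus_vectors]


def search(query_vector, index, topn=3):
    """Return {corpus_id: cosine score} for the topn best matches, best first."""
    q = normalize(query_vector)
    scored = ((sum(a * b for a, b in zip(q, doc)), doc_id)
              for doc_id, doc in enumerate(index))
    # nlargest keeps only topn candidates instead of sorting the whole corpus.
    return {doc_id: score for score, doc_id in heapq.nlargest(topn, scored)}


# Toy 2-D "embeddings", made up for illustration only.
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
result = search([1.0, 0.1], index, topn=2)
print(result)
```

In a real system the per-document loop would be replaced by one matrix multiplication over all normalized embeddings, which is where most of the speed-up over pairwise comparison comes from.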
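1.3 Multilingual text semantic similarity computation and matching search
Multilingual: covers Chinese, English, Korean, Japanese, German, Italian, and other languages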
from similarities import Similarity
# Two lists of sentences
sentences1 = [
'The cat sits outside',
'A man is playing guitar',
'The new movie is awesome',
'花呗更改绑定银行卡',
'The quick brown fox jumps over the lazy dog.',
]
sentences2 = [
'The dog plays in the garden',
'A woman watches TV',
'The new movie is so great',
'如何更换花呗绑定银行卡',
'麻利的棕色狐狸跳过了懒狗',
]
model = Similarity(model_name_or_path="/home/mw/input/99556126636/mul_model/ 多语言模型")
# This uses the multilingual text matching model
scores = model.similarity(sentences1, sentences2)
print('1:use Similarity compute cos scores\n')
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], scores[i][j]))
print('-' * 50 + '\n')
print('2:search\n')
# 2.Semantic Search
corpus = [
'The cat sits outside',
'A man is playing guitar',
'I love pasta',
'The new movie is awesome',
'The cat plays in the garden',
'A woman watches TV',
'The new movie is so great',
'Do you like pizza?',
'如何更换花呗绑定银行卡',
'麻利的棕色狐狸跳过了懒狗',
'猫在窗外',
'电影很棒',
]
model.add_corpus(corpus)
model.save_index('en_corpus_emb.json')
res = model.most_similar(queries=sentences1, topn=3)
print(res)
del model
model = Similarity(model_name_or_path="/home/mw/input/99556126636/mul_model/ 多语言模型")
model.load_index('en_corpus_emb.json')
res = model.most_similar(queries=sentences1, topn=3)
print(res)
for q_id, c in res.items():
    print('query:', sentences1[q_id])
    print("search top 3:")
    for corpus_id, s in c.items():
        print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
2023-09-11 02:46:32.262 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 02:46:33.748 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 12
1:use Similarity compute cos scores
The cat sits outside The dog plays in the garden Score: 0.6211
The cat sits outside A woman watches TV Score: 0.4926
The cat sits outside The new movie is so great Score: 0.5312
The cat sits outside 如何更换花呗绑定银行卡 Score: 0.4604
The cat sits outside 麻利的棕色狐狸跳过了懒狗 Score: 0.4951
A man is playing guitar The dog plays in the garden Score: 0.6483
A man is playing guitar A woman watches TV Score: 0.5747
A man is playing guitar The new movie is so great Score: 0.5524
A man is playing guitar 如何更换花呗绑定银行卡 Score: 0.5098
A man is playing guitar 麻利的棕色狐狸跳过了懒狗 Score: 0.5210
The new movie is awesome The dog plays in the garden Score: 0.5940
The new movie is awesome A woman watches TV Score: 0.5510
The new movie is awesome The new movie is so great Score: 0.9822
The new movie is awesome 如何更换花呗绑定银行卡 Score: 0.4767
The new movie is awesome 麻利的棕色狐狸跳过了懒狗 Score: 0.5523
花呗更改绑定银行卡 The dog plays in the garden Score: 0.4788
花呗更改绑定银行卡 A woman watches TV Score: 0.3842
花呗更改绑定银行卡 The new movie is so great Score: 0.4845
花呗更改绑定银行卡 如何更换花呗绑定银行卡 Score: 0.9377
花呗更改绑定银行卡 麻利的棕色狐狸跳过了懒狗 Score: 0.4546
The quick brown fox jumps over the lazy dog. The dog plays in the garden Score: 0.7547
The quick brown fox jumps over the lazy dog. A woman watches TV Score: 0.4952
The quick brown fox jumps over the lazy dog. The new movie is so great Score: 0.5761
The quick brown fox jumps over the lazy dog. 如何更换花呗绑定银行卡 Score: 0.4426
The quick brown fox jumps over the lazy dog. 麻利的棕色狐狸跳过了懒狗 Score: 0.9290
--------------------------------------------------
2:search
2023-09-11 02:46:34.448 | INFO | similarities.similarity:add_corpus:155 - Add 12 docs, total: 12, emb len: 12
2023-09-11 02:46:34.468 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: en_corpus_emb.json.
{0: {0: 0.9999998807907104, 10: 0.819859504699707, 4: 0.8006516695022583}, 1: {1: 1.0000001192092896, 4: 0.5819121599197388, 5: 0.5746968388557434}, 2: {3: 1.0, 6: 0.982224702835083, 11: 0.8939364552497864}, 3: {8: 0.9376938343048096, 1: 0.5211056470870972, 0: 0.49192243814468384}, 4: {9: 0.9290249943733215, 4: 0.657951831817627, 10: 0.6018596887588501}}
2023-09-11 02:46:37.260 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
{0: {0: 0.9999998807907104, 10: 0.819859504699707, 4: 0.8006516695022583}, 1: {1: 1.0000001192092896, 4: 0.5819121599197388, 5: 0.5746968388557434}, 2: {3: 1.0, 6: 0.982224702835083, 11: 0.8939364552497864}, 3: {8: 0.9376938343048096, 1: 0.5211056470870972, 0: 0.49192243814468384}, 4: {9: 0.9290249943733215, 4: 0.657951831817627, 10: 0.6018596887588501}}
query: The cat sits outside
search top 3:
The cat sits outside: 1.0000
猫在窗外: 0.8199
The cat plays in the garden: 0.8007
query: A man is playing guitar
search top 3:
A man is playing guitar: 1.0000
The cat plays in the garden: 0.5819
A woman watches TV: 0.5747
query: The new movie is awesome
search top 3:
The new movie is awesome: 1.0000
The new movie is so great: 0.9822
电影很棒: 0.8939
query: 花呗更改绑定银行卡
search top 3:
如何更换花呗绑定银行卡: 0.9377
A man is playing guitar: 0.5211
The cat sits outside: 0.4919
query: The quick brown fox jumps over the lazy dog.
search top 3:
麻利的棕色狐狸跳过了懒狗: 0.9290
The cat plays in the garden: 0.6580
猫在窗外: 0.6019
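1.4 Fast approximate text semantic matching search (Annoy and Hnswlib: million-scale datasets)
Approximate semantic matching search with Annoy and Hnswlib is supported, and is commonly used for matching-search tasks on datasets with millions of entries.
For datasets at the scale of hundreds of millions, the Milvus vector database can be used, and retrieval is fast; the example below was tested on a dataset of 10 million entries.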
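- Results preview:
Performance comparison:
| Hardware | Vector store size | Feature extraction time | Milvus retrieval time | Ranking time | Total time |
| --- | --- | --- | --- | --- | --- |
| CPU, 12 cores, 2.5 GHz | 10 million entries, about 15 GB | 64.5 ms | 258.3 ms | 871.6 ms | 1.19 s |
| CPU + Tesla V100 32G | 10 million entries, about 15 GB | 10 ms | 213.6 ms | 24.1 ms | 0.25 s |
- Project column link (forks welcome):
Semantic retrieval system for academic literature based on Milvus + ERNIE + SimCSE + the In-batch Negatives sampling strategy
The demo follows: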
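Annoy and Hnswlib are two commonly used approximate nearest-neighbor search libraries; both can efficiently find the neighbors closest to a given vector.
- Annoy (Approximate Nearest Neighbors Oh Yeah):
  - Annoy is a tree-based approximate nearest-neighbor algorithm: it builds a forest of binary trees by recursively splitting the space with random hyperplanes. It supports several metrics, including angular (cosine-based), Euclidean, and Manhattan distance.
  - An Annoy index is static: items are added first, the index is built once, and no more items can be added after the build, so new data means rebuilding the index (as the demo below does). In return the index uses relatively little memory and search is fast, which suits approximate search over high-dimensional vectors.
  - Annoy can be used for a variety of tasks, such as recommender systems and image and text processing.
  - Annoy's interface is simple and easy to use, and it is available from several programming languages, such as Python and C++.
- Hnswlib (Hierarchical Navigable Small World Library):
  - Hnswlib is a graph-based, rather than tree-based, approximate nearest-neighbor algorithm that uses a data structure called the "hierarchical navigable small world" graph. By building a multi-layer index structure, it can quickly find the closest neighbors.
  - Hnswlib supports several distance metrics, including squared Euclidean distance, inner product, and cosine distance, which makes it suitable for different application scenarios.
  - Hnswlib's index can be updated incrementally: new vectors can be added after the index is built, and existing vectors can be marked as deleted.
  - Hnswlib provides multithreaded search, which improves search speed and allows efficient search on large-scale datasets.
  - Hnswlib is implemented in C++ and also provides Python bindings.
In summary, Annoy is a tree-based approximate nearest-neighbor algorithm suited to approximate search over high-dimensional vectors, while Hnswlib is based on the "hierarchical navigable small world" graph, supports several distance metrics, and is suited to efficient search on large-scale datasets.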
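Both libraries return distances rather than similarity scores. Annoy's "angular" metric is the Euclidean distance between the normalized vectors, sqrt(2 * (1 - cos)), and Hnswlib's "cosine" space returns 1 - cos. The sketch below shows how such distances map back to a cosine score; the function names are my own, and whether the wrapper classes in this project apply exactly this conversion is an assumption rather than something confirmed from their source.

```python
import math


def annoy_angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def hnswlib_cosine_to_similarity(distance):
    """Invert d = 1 - cos to recover the cosine similarity."""
    return 1.0 - distance


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


# Round-trip check on two made-up vectors.
u, v = [0.2, 0.9, 0.4], [0.1, 0.8, 0.7]
cos_uv = cosine(u, v)
angular = math.sqrt(2.0 * (1.0 - cos_uv))
print(abs(annoy_angular_to_cosine(angular) - cos_uv) < 1e-9)
print(abs(hnswlib_cosine_to_similarity(1.0 - cos_uv) - cos_uv) < 1e-9)
```

Converting back to a cosine score keeps results from the approximate indexes directly comparable with the exact search in the earlier sections.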
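Further reading:
[Recommender systems [9], technical details z1: how Elasticsearch performs fast retrieval (ES inverted index and tokenization principles) and how inverted indexes are applied in recall.](https://blog.csdn.net/sinat_39620217/article/details/129399015)
[Recommender systems [9], technical details z3: vector retrieval techniques and ANN search algorithms [KD tree, Annoy, LSH locality-sensitive hashing, PQ product quantization, IVFPQ inverted product quantization, HNSW hierarchical graph search, etc.], a detailed explanation of the underlying principles](https://blog.csdn.net/sinat_39620217/article/details/129410504)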
import os
import sys
# Add the project path so the classes below can be imported
sys.path.append("/home/mw/project/similarities-main/similarities")
from fastsim import AnnoySimilarity
from fastsim import HnswlibSimilarity
# Note: edit /home/mw/project/similarities-main/similarities/fastsim.py and change the model path there to model_name_or_path="/home/mw/input/99556126636/zh_model/ 中文模型",
sentences = ['明天的天气不错是晴天',
'今天天气很好阳光明媚']
corpus = [
'今天天气很好阳光明媚',
'在好天气里,我喜爱去漫步',
'这本书太无聊了,我无奈读上来',
'我喜爱去海边度假,感触阳光和海风',
'在旅行途中,咱们遇到了许多乏味的人',
'我喜爱随时随地用耳机听音乐',
]
def annoy_demo():
    corpus_new = [i + str(id) for id, i in enumerate(corpus * 10)]
    model = AnnoySimilarity(corpus=corpus_new)
    print(model)
    similarity_score = model.similarity(sentences[0], sentences[1])
    print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
    model.add_corpus(corpus)
    model.build_index()
    model.save_index('annoy_model.bin')
    print(model.most_similar("men 喜爱这首歌"))
    # Semantic Search batch
    del model
    model = AnnoySimilarity()
    model.load_index('annoy_model.bin')
    print(model.most_similar("men 喜爱这首歌"))
    queries = ["明天的天气不错是晴天", "men 喜爱这首歌"]
    res = model.most_similar(queries, topn=3)
    print(res)
    for q_id, c in res.items():
        print('query:', queries[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
    # os.remove('annoy_model.bin')
    print('-' * 50 + '\n')

def hnswlib_demo():
    corpus_new = [i + str(id) for id, i in enumerate(corpus * 10)]
    print(corpus_new)
    model = HnswlibSimilarity(corpus=corpus_new)
    print(model)
    similarity_score = model.similarity(sentences[0], sentences[1])
    print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
    model.add_corpus(corpus)
    model.build_index()
    model.save_index('hnsw_model.bin')
    print(model.most_similar("men 喜爱这首歌"))
    # Semantic Search batch
    del model
    model = HnswlibSimilarity()
    model.load_index('hnsw_model.bin')
    print(model.most_similar("men 喜爱这首歌"))
    queries = ["明天的天气不错是晴天", "men 喜爱这首歌"]
    res = model.most_similar(queries, topn=3)
    print(res)
    for q_id, c in res.items():
        print('query:', queries[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
    # os.remove('hnsw_model.bin')
    print('-' * 50 + '\n')

if __name__ == '__main__':
    annoy_demo()
    hnswlib_demo()
2023-09-11 03:09:23.344 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:23.348 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 60
2023-09-11 03:09:28.151 | INFO | similarities.similarity:add_corpus:155 - Add 60 docs, total: 60, emb len: 60
2023-09-11 03:09:28.241 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
2023-09-11 03:09:28.242 | DEBUG | fastsim:build_index:53 - Building index with 256 trees.
Similarity: AnnoySimilarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/ 中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>, corpus size: 60
2023-09-11 03:09:29.445 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
2023-09-11 03:09:30.544 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 66, emb len: 66
2023-09-11 03:09:30.544 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
2023-09-11 03:09:30.545 | DEBUG | fastsim:build_index:53 - Building index with 256 trees.
2023-09-11 03:09:30.669 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: annoy_model.bin.json.
2023-09-11 03:09:30.669 | INFO | fastsim:save_index:67 - Saving Annoy index to: annoy_model.bin, corpus embedding to: annoy_model.bin.json
{0: {59: 0.4495447165407924, 29: 0.44775851052770577, 5: 0.44683993510903264, 11: 0.44624594564543685, 35: 0.44612286870435014, 53: 0.44598170858875363, 41: 0.44574389155260974, 17: 0.44521147337774636, 23: 0.44469588530591864, 47: 0.444264516571927}}
2023-09-11 03:09:32.450 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:32.544 | INFO | fastsim:load_index:75 - Loading index from: annoy_model.bin, corpus embedding from: annoy_model.bin.json
2023-09-11 03:09:32.566 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
{0: {59: 0.4495447165407924, 29: 0.44775851052770577, 5: 0.44683993510903264, 11: 0.44624594564543685, 35: 0.44612286870435014, 53: 0.44598170858875363, 41: 0.44574389155260974, 17: 0.44521147337774636, 23: 0.44469588530591864, 47: 0.444264516571927}}
{0: {60: 0.7727918750653213, 6: 0.7462793254362339, 36: 0.7355303251593384}, 1: {59: 0.4495447165407924, 29: 0.44775863580996855, 5: 0.4468400604954468}}
query: 明天的天气不错是晴天
search top 3:
今天天气很好阳光明媚: 0.7728
今天天气很好阳光明媚 6: 0.7463
今天天气很好阳光明媚 36: 0.7355
query: men 喜爱这首歌
search top 3:
我喜爱随时随地用耳机听音乐 59: 0.4495
我喜爱随时随地用耳机听音乐 29: 0.4478
我喜爱随时随地用耳机听音乐 5: 0.4468
--------------------------------------------------
['今天天气很好阳光明媚 0', '在好天气里,我喜爱去漫步 1', '这本书太无聊了,我无奈读上来 2', '我喜爱去海边度假,感触阳光和海风 3', '在旅行途中,咱们遇到了许多乏味的人 4', '我喜爱随时随地用耳机听音乐 5', '今天天气很好阳光明媚 6', '在好天气里,我喜爱去漫步 7', '这本书太无聊了,我无奈读上来 8', '我喜爱去海边度假,感触阳光和海风 9', '在旅行途中,咱们遇到了许多乏味的人 10', '我喜爱随时随地用耳机听音乐 11', '今天天气很好阳光明媚 12', '在好天气里,我喜爱去漫步 13', '这本书太无聊了,我无奈读上来 14', '我喜爱去海边度假,感触阳光和海风 15', '在旅行途中,咱们遇到了许多乏味的人 16', '我喜爱随时随地用耳机听音乐 17', '今天天气很好阳光明媚 18', '在好天气里,我喜爱去漫步 19', '这本书太无聊了,我无奈读上来 20', '我喜爱去海边度假,感触阳光和海风 21', '在旅行途中,咱们遇到了许多乏味的人 22', '我喜爱随时随地用耳机听音乐 23', '今天天气很好阳光明媚 24', '在好天气里,我喜爱去漫步 25', '这本书太无聊了,我无奈读上来 26', '我喜爱去海边度假,感触阳光和海风 27', '在旅行途中,咱们遇到了许多乏味的人 28', '我喜爱随时随地用耳机听音乐 29', '今天天气很好阳光明媚 30', '在好天气里,我喜爱去漫步 31', '这本书太无聊了,我无奈读上来 32', '我喜爱去海边度假,感触阳光和海风 33', '在旅行途中,咱们遇到了许多乏味的人 34', '我喜爱随时随地用耳机听音乐 35', '今天天气很好阳光明媚 36', '在好天气里,我喜爱去漫步 37', '这本书太无聊了,我无奈读上来 38', '我喜爱去海边度假,感触阳光和海风 39', '在旅行途中,咱们遇到了许多乏味的人 40', '我喜爱随时随地用耳机听音乐 41', '今天天气很好阳光明媚 42', '在好天气里,我喜爱去漫步 43', '这本书太无聊了,我无奈读上来 44', '我喜爱去海边度假,感触阳光和海风 45', '在旅行途中,咱们遇到了许多乏味的人 46', '我喜爱随时随地用耳机听音乐 47', '今天天气很好阳光明媚 48', '在好天气里,我喜爱去漫步 49', '这本书太无聊了,我无奈读上来 50', '我喜爱去海边度假,感触阳光和海风 51', '在旅行途中,咱们遇到了许多乏味的人 52', '我喜爱随时随地用耳机听音乐 53', '今天天气很好阳光明媚 54', '在好天气里,我喜爱去漫步 55', '这本书太无聊了,我无奈读上来 56', '我喜爱去海边度假,感触阳光和海风 57', '在旅行途中,咱们遇到了许多乏味的人 58', '我喜爱随时随地用耳机听音乐 59']
2023-09-11 03:09:34.947 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:35.044 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 60
2023-09-11 03:09:40.349 | INFO | similarities.similarity:add_corpus:155 - Add 60 docs, total: 60, emb len: 60
2023-09-11 03:09:40.350 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
2023-09-11 03:09:40.351 | INFO | fastsim:build_index:156 - Building HNSWLIB index, max_elements: 60
2023-09-11 03:09:40.351 | DEBUG | fastsim:build_index:157 - Parameters Required: M: 64
2023-09-11 03:09:40.352 | DEBUG | fastsim:build_index:158 - Parameters Required: ef_construction: 400
2023-09-11 03:09:40.352 | DEBUG | fastsim:build_index:159 - Parameters Required: ef(>topn): 50
Similarity: HnswlibSimilarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/ 中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>, corpus size: 60
2023-09-11 03:09:41.345 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
2023-09-11 03:09:42.548 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 66, emb len: 66
2023-09-11 03:09:42.550 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
2023-09-11 03:09:42.551 | INFO | fastsim:build_index:156 - Building HNSWLIB index, max_elements: 66
2023-09-11 03:09:42.551 | DEBUG | fastsim:build_index:157 - Parameters Required: M: 64
2023-09-11 03:09:42.552 | DEBUG | fastsim:build_index:158 - Parameters Required: ef_construction: 400
2023-09-11 03:09:42.552 | DEBUG | fastsim:build_index:159 - Parameters Required: ef(>topn): 50
2023-09-11 03:09:42.717 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: hnsw_model.bin.json.
2023-09-11 03:09:42.718 | INFO | fastsim:save_index:172 - Saving hnswlib index to: hnsw_model.bin, corpus embedding to: hnsw_model.bin.json
{0: {59: 0.44954460859298706, 29: 0.44775843620300293, 5: 0.44683992862701416, 11: 0.44624578952789307, 35: 0.4461227059364319, 53: 0.4459817409515381, 41: 0.4457439184188843, 17: 0.44521135091781616, 23: 0.4446955919265747, 47: 0.44426441192626953}}
2023-09-11 03:09:44.547 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:44.643 | INFO | fastsim:load_index:180 - Loading index from: hnsw_model.bin, corpus embedding from: hnsw_model.bin.json
2023-09-11 03:09:44.665 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
Warning: Calling load_index for an already inited index. Old index is being deallocated.
{0: {59: 0.44954460859298706, 29: 0.44775843620300293, 5: 0.44683992862701416, 11: 0.44624578952789307, 35: 0.4461227059364319, 53: 0.4459817409515381, 41: 0.4457439184188843, 17: 0.44521135091781616, 23: 0.4446955919265747, 47: 0.44426441192626953}}
{0: {60: 0.7727917432785034, 6: 0.746279239654541, 36: 0.7355299592018127}, 1: {59: 0.4495447278022766, 29: 0.4477585554122925, 5: 0.44683998823165894}}
query: 明天的天气不错是晴天
search top 3:
今天天气很好阳光明媚: 0.7728
今天天气很好阳光明媚 6: 0.7463
今天天气很好阳光明媚 36: 0.7355
query: men 喜爱这首歌
search top 3:
我喜爱随时随地用耳机听音乐 59: 0.4495
我喜爱随时随地用耳机听音乐 29: 0.4478
我喜爱随时随地用耳机听音乐 5: 0.4468
--------------------------------------------------
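1.5 Literal text similarity computation and matching search
Similarity computation and literal matching search are supported with the synonym thesaurus (Cilin), HowNet, word embeddings (WordEmbedding), Tfidf, SimHash, BM25, and other algorithms. These are commonly used for cold-starting text matching.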
Loading the Cilin (词林) dictionary fails with the following error when the dictionary file is missing from the installed package:
--> 480 self.cilin_dict = self.load_cilin_dict(cilin_path) # Cilin(词林) semantic dictionary
481 self.corpus = {}
482
/opt/conda/lib/python3.7/site-packages/similarities/literalsim.py in load_cilin_dict(path)
522 """加载词林语义词典"""
523 sem_dict = {}
--> 524 for line in open(path, 'r', encoding='utf-8'):
525 line = line.strip()
526 terms = line.split(' ')
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.7/site-packages/similarities/data/cilin.txt'
To fix this, edit literalsim.py and point it at the dictionary file:
default_cilin_path='/home/mw/project/similarities-main/similarities/data/cilin.txt'
import sys
from loguru import logger
sys.path.append('/home/mw/project/similarities-main')
sys.path.append('/home/mw/project/similarities-main/similarities')
from similarities import (
SimHashSimilarity,
TfidfSimilarity,
BM25Similarity,
WordEmbeddingSimilarity,
CilinSimilarity,
HownetSimilarity,
SameCharsSimilarity,
SequenceMatcherSimilarity,
)
logger.remove()
logger.add(sys.stderr, level="INFO")
def sim_and_search(m):
    print(m)
    if 'BM25' not in str(m):
        sim_scores = m.similarity(text1, text2)
        print('sim scores:', sim_scores)
        for (idx, i), j in zip(enumerate(text1), text2):
            s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
            print(f"{i} vs {j}, score: {s:.4f}")
    m.add_corpus(corpus)
    res = m.most_similar(queries, topn=3)
    print('sim search:', res)
    for q_id, c in res.items():
        print('query:', queries[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{m.corpus[corpus_id]}: {s:.4f}')
    print('-' * 50 + '\n')

if __name__ == '__main__':
    text1 = [
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡'
    ]
    text2 = [
        '花呗更改绑定银行卡',
        '我什么时候开明了花呗',
    ]
    corpus = [
        '花呗更改绑定银行卡',
        '我什么时候开明了花呗',
        '俄罗斯正告乌克兰拥护欧盟协定',
        '暴风雨埋葬了东北部;新泽西 16 英寸的降雪',
        '地方情报局局长拜访以色列叙利亚谈判',
        '人在巴基斯坦基地的炸弹袭击中丧生',
    ]
    queries = [
        '我的花呗开明了?',
        '乌克兰被俄罗斯正告',
        '更改绑定银行卡',
    ]
    print('text1:', text1)
    print('text2:', text2)
    print('query:', queries)
    sim_and_search(SimHashSimilarity())
    sim_and_search(TfidfSimilarity())
    sim_and_search(BM25Similarity())
    sim_and_search(WordEmbeddingSimilarity())
    # sim_and_search(CilinSimilarity())  # the dictionary files live under /home/mw/project/similarities-main/similarities/data/; add them yourself
    # sim_and_search(HownetSimilarity())
    sim_and_search(SameCharsSimilarity())
    sim_and_search(SequenceMatcherSimilarity())
2023-09-11 03:36:11.670 | INFO | similarities.literalsim:add_corpus:75 - Start computing corpus embeddings, new docs: 6
text1: ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
text2: ['花呗更改绑定银行卡', '我什么时候开明了花呗']
query: ['我的花呗开明了?', '乌克兰被俄罗斯正告', '更改绑定银行卡']
Similarity: SimHashSimilarity, matching_model: SimHash
sim scores: [0.9375, 0.5]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.9375
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.5000
Computing corpus SimHash: 100%|██████████| 6/6 [00:00<00:00, 3128.91it/s]
2023-09-11 03:36:11.675 | INFO | similarities.literalsim:add_corpus:84 - Add 6 docs, total: 6, emb size: 6
sim search: {0: {3: 0.703125, 5: 0.5625, 1: 0.515625}, 1: {0: 0.78125, 1: 0.484375, 2: 0.484375}, 2: {4: 1.0, 1: 0.59375, 2: 0.59375}}
query: 我的花呗开明了?
search top 3:
我什么时候开明了花呗: 0.7031
暴风雨埋葬了东北部;新泽西 16 英寸的降雪: 0.5625
人在巴基斯坦基地的炸弹袭击中丧生: 0.5156
query: 乌克兰被俄罗斯正告
search top 3:
俄罗斯正告乌克兰拥护欧盟协定: 0.7812
人在巴基斯坦基地的炸弹袭击中丧生: 0.4844
地方情报局局长拜访以色列叙利亚谈判: 0.4844
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 1.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.5938
地方情报局局长拜访以色列叙利亚谈判: 0.5938
--------------------------------------------------
Similarity: TfidfSimilarity, matching_model: Tfidf
2023-09-11 03:36:12.649 | INFO | similarities.literalsim:add_corpus:238 - Start computing corpus embeddings, new docs: 6
sim scores: tensor([[0.7948, 0.4022],
[1.0000, 0.4048]], dtype=torch.float64)
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.7948
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.4048
Computing corpus TFIDF: 100%|██████████| 6/6 [00:00<00:00, 23.17it/s]
2023-09-11 03:36:12.911 | INFO | similarities.literalsim:add_corpus:247 - Add 6 docs, total: 6, emb size: 6
2023-09-11 03:36:13.461 | INFO | similarities.literalsim:add_corpus:334 - Start computing corpus embeddings, new docs: 6
2023-09-11 03:36:13.463 | INFO | similarities.literalsim:add_corpus:340 - Add 6 docs, total: 6
2023-09-11 03:36:13.465 | INFO | text2vec.word2vec:__init__:80 - Load pretrained model:w2v-light-tencent-chinese, path:/home/mw/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin
sim search: {0: {3: 0.921499490737915, 4: 0.43930041790008545, 5: 0.0}, 1: {0: 0.7380481958389282, 4: 0.0, 3: 0.0}, 2: {4: 0.8345502018928528, 3: 0.0, 5: 0.0}}
query: 我的花呗开明了?
search top 3:
我什么时候开明了花呗: 0.9215
花呗更改绑定银行卡: 0.4393
暴风雨埋葬了东北部;新泽西 16 英寸的降雪: 0.0000
query: 乌克兰被俄罗斯正告
search top 3:
俄罗斯正告乌克兰拥护欧盟协定: 0.7380
花呗更改绑定银行卡: 0.0000
我什么时候开明了花呗: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 0.8346
我什么时候开明了花呗: 0.0000
暴风雨埋葬了东北部;新泽西 16 英寸的降雪: 0.0000
--------------------------------------------------
Similarity: BM25Similarity, matching_model: BM25
sim search: {0: {3: 4.453010263817695, 4: 1.3720219233789517, 5: 1.010258330300517}, 1: {0: 4.245182027356298, 1: 0.0, 2: 0.0}, 2: {4: 4.549213631437518, 0: 0.0, 1: 0.0}}
query: 我的花呗开明了?
search top 3:
我什么时候开明了花呗: 4.4530
花呗更改绑定银行卡: 1.3720
暴风雨埋葬了东北部;新泽西 16 英寸的降雪: 1.0103
query: 乌克兰被俄罗斯正告
search top 3:
俄罗斯正告乌克兰拥护欧盟协定: 4.2452
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
地方情报局局长拜访以色列叙利亚谈判: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 4.5492
俄罗斯正告乌克兰拥护欧盟协定: 0.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
2023-09-11 03:36:14.743 | INFO | similarities.literalsim:add_corpus:424 - Start computing corpus embeddings, new docs: 6
Similarity: WordEmbeddingSimilarity, matching_model: Word2Vec
sim scores: tensor([[0.9812, 0.8195],
[1.0000, 0.8264]], dtype=torch.float64)
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.9812
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.8264
Word2Vec Embeddings: 100%|██████████| 6/6 [00:00<00:00, 7707.76it/s]
2023-09-11 03:36:14.746 | INFO | similarities.literalsim:add_corpus:431 - Add 6 docs, total: 6, emb size: 6
2023-09-11 03:36:14.756 | INFO | similarities.literalsim:add_corpus:804 - Start add new docs: 6
2023-09-11 03:36:14.757 | INFO | similarities.literalsim:add_corpus:805 - Add 6 docs, total: 6
2023-09-11 03:36:14.761 | INFO | similarities.literalsim:add_corpus:900 - Start add new docs: 6
2023-09-11 03:36:14.762 | INFO | similarities.literalsim:add_corpus:901 - Add 6 docs, total: 6
sim search: {0: {3: 0.8737779259681702, 4: 0.7954849004745483, 5: 0.713451623916626}, 1: {0: 0.9661487936973572, 1: 0.811479926109314, 2: 0.7922273278236389}, 2: {4: 0.9858745336532593, 2: 0.819598913192749, 5: 0.8008757829666138}}
query: 我的花呗开明了?
search top 3:
我什么时候开明了花呗: 0.8738
花呗更改绑定银行卡: 0.7955
暴风雨埋葬了东北部;新泽西 16 英寸的降雪: 0.7135
query: 乌克兰被俄罗斯正告
search top 3:
俄罗斯正告乌克兰拥护欧盟协定: 0.9661
人在巴基斯坦基地的炸弹袭击中丧生: 0.8115
地方情报局局长拜访以色列叙利亚谈判: 0.7922
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 0.9859
地方情报局局长拜访以色列叙利亚谈判: 0.8196
暴风雨埋葬了东北部;新泽西 16 英寸的降雪: 0.8009
--------------------------------------------------
Similarity: SameCharsSimilarity, matching_model: SameChars
sim scores: [0.8888888888888888, 0.2222222222222222]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.8889
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.2222
sim search: {0: {3: 0.75, 4: 0.25, 5: 0.25}, 1: {0: 0.8888888888888888, 1: 0.1111111111111111, 2: 0.0}, 2: {4: 1.0, 0: 0.0, 1: 0.0}}
query: 我的花呗开明了?
search top 3:
我什么时候开明了花呗: 0.7500
花呗更改绑定银行卡: 0.2500
暴风雨埋葬了东北部;新泽西 16 英寸的降雪: 0.2500
query: 乌克兰被俄罗斯正告
search top 3:
俄罗斯正告乌克兰拥护欧盟协定: 0.8889
人在巴基斯坦基地的炸弹袭击中丧生: 0.1111
地方情报局局长拜访以色列叙利亚谈判: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 1.0000
俄罗斯正告乌克兰拥护欧盟协定: 0.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
Similarity: SequenceMatcherSimilarity, matching_model: SequenceMatcher
sim scores: [0.5555555555555556, 0.2222222222222222]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.5556
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.2222
sim search: {0: {3: 0.375, 4: 0.25, 1: 0.125}, 1: {0: 0.5555555555555556, 1: 0.1111111111111111, 2: 0.0}, 2: {4: 1.0, 0: 0.0, 1: 0.0}}
query: 我的花呗开明了?
search top 3:
我什么时候开明了花呗: 0.3750
花呗更改绑定银行卡: 0.2500
人在巴基斯坦基地的炸弹袭击中丧生: 0.1250
query: 乌克兰被俄罗斯正告
search top 3:
俄罗斯正告乌克兰拥护欧盟协定: 0.5556
人在巴基斯坦基地的炸弹袭击中丧生: 0.1111
地方情报局局长拜访以色列叙利亚谈判: 0.0000
query: 更改绑定银行卡
search top 3:
花呗更改绑定银行卡: 1.0000
俄罗斯正告乌克兰拥护欧盟协定: 0.0000
人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
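The SimHash scores in the output above are all multiples of 1/64 (for example 0.9375 = 60/64), which is what a 64-bit fingerprint compared by Hamming distance would produce. The sketch below illustrates that general technique. It is not the library's implementation: the MD5-based token hash and the character-bigram tokenizer are assumptions chosen to keep the example dependency-free, so its scores will not match the numbers printed above.

```python
import hashlib


def char_bigrams(text):
    """Hypothetical tokenizer: overlapping character pairs."""
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def simhash(tokens, bits=64):
    """Build a fingerprint: each token votes +1 or -1 on every bit position."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        h = int.from_bytes(digest[:bits // 8], 'big')
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def simhash_similarity(fp1, fp2, bits=64):
    """1 minus the fraction of differing bits between two fingerprints."""
    return 1.0 - bin(fp1 ^ fp2).count('1') / bits


a = simhash(char_bigrams('花呗更改绑定银行卡'))
b = simhash(char_bigrams('如何更换花呗绑定银行卡'))
print(f"{simhash_similarity(a, b):.4f}")
```

Because similar texts share most of their tokens, most bit votes agree and the fingerprints differ in only a few positions, which makes this a cheap first-pass filter before a more expensive semantic comparison.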
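2. Image similarity computation and matching search
Image similarity computation and matching search are supported with CLIP, pHash, SIFT, and other algorithms.
To run this part, download the model OFA-Sys/chinese-clip-vit-base-patch16 from Hugging Face into the Heywhale (和鲸) community workspace yourself; it is not demonstrated in detail here.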
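For intuition about the hash-based methods (aHash is the simplest of the family that also includes pHash and dHash), here is a minimal average-hash sketch. It assumes the image has already been converted to a small grayscale matrix, skips the resizing step that real implementations perform, and is not the code behind ImageHashSimilarity; the function names and the tiny 2x2 "images" are made up for illustration.

```python
def average_hash(gray):
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hash_similarity(h1, h2):
    """1 minus the fraction of differing bits between two equal-length hashes."""
    differing = sum(a != b for a, b in zip(h1, h2))
    return 1.0 - differing / len(h1)


# Made-up 2x2 grayscale "images" with values in 0..255.
image = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]    # same pattern, uniformly brighter
inverted = [[200, 10], [30, 220]]    # opposite pattern

print(hash_similarity(average_hash(image), average_hash(brighter)))
print(hash_similarity(average_hash(image), average_hash(inverted)))
```

Because each bit only records whether a pixel is above the mean, a uniform brightness shift leaves the hash unchanged, which is the kind of robustness that makes these hashes useful for near-duplicate detection.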
# import glob
# import sys
# from PIL import Image
# sys.path.append('/home/mw/project/similarities-main/similarities')
# from imagesim import ImageHashSimilarity, SiftSimilarity,ClipSimilarity
# def sim_and_search(m):
# print(m)
# # similarity
# sim_scores = m.similarity(imgs1, imgs2)
# print('sim scores:', sim_scores)
# for (idx, i), j in zip(enumerate(image_fps1), image_fps2):
# s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
# print(f"{i} vs {j}, score: {s:.4f}")
# # search
# m.add_corpus(corpus_imgs)
# queries = imgs1
# res = m.most_similar(queries, topn=3)
# print('sim search:', res)
# for q_id, c in res.items():
# print('query:', image_fps1[q_id])
# print("search top 3:")
# for corpus_id, s in c.items():
# print(f'\t{m.corpus[corpus_id].filename}: {s:.4f}')
# print('-' * 50 + '\n')
# def clip_demo():
# m = ClipSimilarity()
# print(m)
# # similarity score between text and image
# image_fps = [
# '/home/mw/project/similarities-main/examples/data/image3.png', # yellow flower image
# '/home/mw/project/similarities-main/examples/data/image1.png', # tiger image
# ]
# texts = ['a yellow flower', '老虎', '一头狮子', '玩具车']
# imgs = [Image.open(i) for i in image_fps]
# sim_scores = m.similarity(imgs, texts)
# print('sim scores:', sim_scores)
# for idx, i in enumerate(image_fps):
# for idy, j in enumerate(texts):
# s = sim_scores[idx][idy]
# print(f"{i} vs {j}, score: {s:.4f}")
# print('-' * 50 + '\n')
# if __name__ == "__main__":
# image_fps1 = ['/home/mw/project/similarities-main/examples/data/image1.png', '/home/mw/project/similarities-main/examples/data/image3.png']
# image_fps2 = ['/home/mw/project/similarities-main/examples/data/image12-like-image1.png', '/home/mw/project/similarities-main/examples/data/image10.png']
# imgs1 = [Image.open(i) for i in image_fps1]
# imgs2 = [Image.open(i) for i in image_fps2]
# corpus_fps = glob.glob('data/*.jpg') + glob.glob('data/*.png')
# corpus_imgs = [Image.open(i) for i in corpus_fps]
# # 1. image and text similarity
# clip_demo()
# # 2. image and image similarity score
# sim_and_search(ClipSimilarity()) # the best result
# sim_and_search(ImageHashSimilarity(hash_function='phash'))
# sim_and_search(SiftSimilarity())
Similarity: ClipSimilarity, matching_model: CLIPModel
sim scores: tensor([[0.9580, 0.8654],
[0.6558, 0.6145]])
data/image1.png vs data/image12-like-image1.png, score: 0.9580
data/image3.png vs data/image10.png, score: 0.6145
sim search: {0: {6: 0.9999999403953552, 0: 0.9579654932022095, 4: 0.9326782822608948}, 1: {8: 0.9999997615814209, 4: 0.6729235649108887, 0: 0.6558331847190857}}
query: data/image1.png
search top 3:
data/image1.png: 1.0000
data/image12-like-image1.png: 0.9580
data/image8-like-image1.png: 0.9327
sim scores: tensor([[0.3220, 0.2409],
[0.1677, 0.2959]])
data/image3.png vs a yellow flower, score: 0.3220
data/image1.png vs 老虎, score: 0.2112
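For more content, follow the WeChat official account 汀丶人工智能, which shares related resources and articles that are free to read.
Project link: Semantic text matching search quick-start baseline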