Unlock a New Level of Search! Let Text Semantic Matching Help You Find Everything You Need (Quick-Start Baseline)
Implements a variety of similarity computation and matching/search algorithms, supporting both text and images. Developed in Python 3, installable via pip, ready to use out of the box.
Text Similarity Computation (Text Matching)
- Cosine Similarity: the cosine of the angle between two vectors
- Dot Product: the inner product of the two vectors after normalization
- Hamming Distance, Levenshtein (edit) Distance, Euclidean Distance, Manhattan Distance, etc. (see the sketch below)
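For intuition, here is a minimal sketch of these classical measures using only numpy and the standard library; the example vectors are made up, and this is not part of the library's API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors, in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def levenshtein(s: str, t: str) -> int:
    # Classic dynamic-programming edit distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))           # 1.0: same direction
print(float(np.linalg.norm(a - b)))      # Euclidean distance
print(float(np.abs(a - b).sum()))        # Manhattan distance
print(levenshtein("kitten", "sitting"))  # 3
```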
Semantic models
- [CoSENT text matching model] (recommended)
- BERT model (text vector representation)
- SentenceBERT text matching model

Literal models
- [Word2Vec shallow semantic text representation] (recommended)
- Cilin synonym thesaurus (同义词词林)
- HowNet sememe matching
- BM25, RankBM25
- TFIDF (a minimal sketch follows this list)
- SimHash
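To make the bag-of-words idea concrete, TF-IDF matching reduces to cosine similarity over sparse term-weight vectors. Below is a minimal sketch using scikit-learn rather than this library's own TfidfSimilarity; the documents are made-up English examples (for Chinese you would first segment the text with a tokenizer such as jieba):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "how to change the bank card bound to huabei",
    "huabei changed the bound bank card",
    "when did i open a huabei account",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)              # sparse document-term matrix
scores = cosine_similarity(tfidf[0], tfidf)  # compare doc 0 against all docs
print(scores)  # highest for doc 0 itself, then its close paraphrase
```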
Image Similarity Computation (Image Matching)
Semantic models
- [CLIP (Contrastive Language-Image Pre-Training)]
- VGG (in progress)
- ResNet (in progress)

Feature extraction
- [pHash] (recommended), dHash, wHash, aHash (see the hashing sketch after this list)
- SIFT (Scale-Invariant Feature Transform)
- SURF (Speeded Up Robust Features) (in progress)
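To illustrate perceptual hashing, here is a minimal sketch using the third-party Pillow and imagehash packages; the file names are placeholders, not files shipped with this project:

```python
from PIL import Image
import imagehash

# Perceptual hash: visually similar images yield hashes with a small Hamming distance
h1 = imagehash.phash(Image.open("image1.png"))
h2 = imagehash.phash(Image.open("image12-like-image1.png"))

# Subtracting two ImageHash objects returns the Hamming distance in bits
distance = h1 - h2
print(f"Hamming distance: {distance} (0 = identical, small = similar)")
```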
Image-Text Similarity Computation
- [CLIP (Contrastive Language-Image Pre-Training)]

Matching and Search
- [SemanticSearch]: vector similarity retrieval using Cosine Similarity + top-k for efficient computation, about an order of magnitude faster than brute-force one-to-one comparison
The project link is at the end of this article.
Environment setup:
```bash
!pip install --upgrade pip -i https://mirrors.cloud.tencent.com/pypi/simple
!pip install -U similarities -i https://mirrors.cloud.tencent.com/pypi/simple
# Install dependencies
!pip install -r /home/mw/project/similarities-main/requirements.txt -i https://mirrors.cloud.tencent.com/pypi/simple
# Install a newer torch; restart the kernel after installation
!pip install torch==1.12.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu -i https://mirrors.cloud.tencent.com/pypi/simple
```
1.1 Text Semantic Similarity Computation
```python
from similarities import Similarity

# The model files live under the input path; a Chinese and a multilingual model are provided, pick one
m = Similarity(model_name_or_path="/home/mw/input/99556126636/zh_model/中文模型", max_seq_length=128)
r = m.similarity('明天的天气不错是晴天', '今天天气很好阳光明媚')
print(f"similarity score: {float(r)}")
```

```
2023-09-11 02:36:44.046 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
similarity score: 0.7727918028831482
```
Default signature of Similarity:

```python
Similarity(corpus: Union[List[str], Dict[str, str]] = None, model_name_or_path="shibing624/text2vec-base-chinese", max_seq_length=128)
```

- Return value: a cosine `score` in the range [-1, 1]; the larger the value, the more similar the texts.
- `corpus`: the document collection to search over, only needed for search. Input format: a list of sentences `List[str]`, or a `Dict[str, str]` of `{corpus_id: sentence}` (illustrated in the sketch below).
- `model_name_or_path`: model name or local path. By default, the Chinese semantic matching model shibing624/text2vec-base-chinese is downloaded from the HF model hub; for multilingual scenarios, replace it with the multilingual matching model shibing624/text2vec-base-multilingual.
- `max_seq_length`: maximum input sentence length, capped by the matching model's limit (512 for the BERT family).
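A minimal sketch of the `{corpus_id: sentence}` corpus format, based on the signature above; the ids are invented, and the exact layout of the returned dict is an assumption, not verified output:

```python
from similarities import Similarity

# A dict corpus keeps your own document ids in the search results
corpus = {"doc-101": "今天天气很好阳光明媚", "doc-102": "我喜爱去海边度假"}
m = Similarity(corpus=corpus)  # downloads shibing624/text2vec-base-chinese on first use
res = m.most_similar("天气真好", topn=1)
print(res)  # expected shape: {0: {"doc-101": <score>}} -- an assumption, see above
```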
1.2 Text Semantic Matching and Search

This typically means finding the text most similar to a query within a candidate document set; it is commonly used for question similarity matching in QA scenarios, similar-text retrieval, and related tasks.
```python
from similarities import Similarity

# 1. Compute cosine similarity between two sentences.
sentences = ['明天的天气不错是晴天', '今天天气很好阳光明媚']
corpus = [
    '今天天气很好阳光明媚',
    '在好天气里,我喜爱去漫步',
    '这本书太无聊了,我无奈读上来',
    '我喜爱去海边度假,感触阳光和海风',
    '在旅行途中,咱们遇到了许多乏味的人',
    '我喜爱随时随地用耳机听音乐',
]
model = Similarity(model_name_or_path="/home/mw/input/99556126636/zh_model/中文模型")
print(model)
similarity_score = model.similarity(sentences[0], sentences[1])
print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
print('-' * 50 + '\n')

# 2. Compute similarity between two lists
similarity_scores = model.similarity(sentences, corpus)
print(similarity_scores.numpy())
for i in range(len(sentences)):
    for j in range(len(corpus)):
        print(f"{sentences[i]} vs {corpus[j]}, score: {similarity_scores.numpy()[i][j]:.4f}")
print('-' * 50 + '\n')

# 3. Semantic Search
model.add_corpus(corpus)
res = model.most_similar(queries=sentences, topn=3)
print(res)
for q_id, c in res.items():
    print('query:', sentences[q_id])
    print("search top 3:")
    for corpus_id, s in c.items():
        print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
```
```
2023-09-11 02:43:19.744 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
Similarity: Similarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
--------------------------------------------------
2023-09-11 02:43:22.348 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
[[0.77279186 0.377486   0.2831661  0.3328314  0.33157927 0.271398  ]
 [1.         0.4531002  0.22196919 0.42843264 0.31628954 0.28194088]]
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
明天的天气不错是晴天 vs 在好天气里,我喜爱去漫步, score: 0.3775
明天的天气不错是晴天 vs 这本书太无聊了,我无奈读上来, score: 0.2832
明天的天气不错是晴天 vs 我喜爱去海边度假,感触阳光和海风, score: 0.3328
明天的天气不错是晴天 vs 在旅行途中,咱们遇到了许多乏味的人, score: 0.3316
明天的天气不错是晴天 vs 我喜爱随时随地用耳机听音乐, score: 0.2714
今天天气很好阳光明媚 vs 今天天气很好阳光明媚, score: 1.0000
今天天气很好阳光明媚 vs 在好天气里,我喜爱去漫步, score: 0.4531
今天天气很好阳光明媚 vs 这本书太无聊了,我无奈读上来, score: 0.2220
今天天气很好阳光明媚 vs 我喜爱去海边度假,感触阳光和海风, score: 0.4284
今天天气很好阳光明媚 vs 在旅行途中,咱们遇到了许多乏味的人, score: 0.3163
今天天气很好阳光明媚 vs 我喜爱随时随地用耳机听音乐, score: 0.2819
--------------------------------------------------
2023-09-11 02:43:23.945 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 6, emb len: 6
{0: {0: 0.772791862487793, 1: 0.377485990524292, 3: 0.33283141255378723}, 1: {0: 1.0, 1: 0.45310020446777344, 3: 0.4284326434135437}}
query: 明天的天气不错是晴天
search top 3:
	今天天气很好阳光明媚: 0.7728
	在好天气里,我喜爱去漫步: 0.3775
	我喜爱去海边度假,感触阳光和海风: 0.3328
query: 今天天气很好阳光明媚
search top 3:
	今天天气很好阳光明媚: 1.0000
	在好天气里,我喜爱去漫步: 0.4531
	我喜爱去海边度假,感触阳光和海风: 0.4284
```
The cosine `score` ranges over [-1, 1]; the larger the value, the more similar the query is to the corpus text.
1.3 Multilingual Text Semantic Similarity Computation and Matching Search

Multilingual: covers Chinese, English, Korean, Japanese, German, Italian, and other languages.
```python
from similarities import Similarity

# Two lists of sentences
sentences1 = [
    'The cat sits outside',
    'A man is playing guitar',
    'The new movie is awesome',
    '花呗更改绑定银行卡',
    'The quick brown fox jumps over the lazy dog.',
]
sentences2 = [
    'The dog plays in the garden',
    'A woman watches TV',
    'The new movie is so great',
    '如何更换花呗绑定银行卡',
    '麻利的棕色狐狸跳过了懒狗',
]
model = Similarity(model_name_or_path="/home/mw/input/99556126636/mul_model/多语言模型")  # uses the multilingual text matching model
scores = model.similarity(sentences1, sentences2)
print('1:use Similarity compute cos scores\n')
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], scores[i][j]))
print('-' * 50 + '\n')
print('2:search\n')

# 2. Semantic Search
corpus = [
    'The cat sits outside',
    'A man is playing guitar',
    'I love pasta',
    'The new movie is awesome',
    'The cat plays in the garden',
    'A woman watches TV',
    'The new movie is so great',
    'Do you like pizza?',
    '如何更换花呗绑定银行卡',
    '麻利的棕色狐狸跳过了懒狗',
    '猫在窗外',
    '电影很棒',
]
model.add_corpus(corpus)
model.save_index('en_corpus_emb.json')
res = model.most_similar(queries=sentences1, topn=3)
print(res)

del model
model = Similarity(model_name_or_path="/home/mw/input/99556126636/mul_model/多语言模型")
model.load_index('en_corpus_emb.json')
res = model.most_similar(queries=sentences1, topn=3)
print(res)
for q_id, c in res.items():
    print('query:', sentences1[q_id])
    print("search top 3:")
    for corpus_id, s in c.items():
        print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
```
```
2023-09-11 02:46:32.262 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 02:46:33.748 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 12
1:use Similarity compute cos scores

The cat sits outside 		 The dog plays in the garden 		 Score: 0.6211
The cat sits outside 		 A woman watches TV 		 Score: 0.4926
The cat sits outside 		 The new movie is so great 		 Score: 0.5312
The cat sits outside 		 如何更换花呗绑定银行卡 		 Score: 0.4604
The cat sits outside 		 麻利的棕色狐狸跳过了懒狗 		 Score: 0.4951
A man is playing guitar 		 The dog plays in the garden 		 Score: 0.6483
A man is playing guitar 		 A woman watches TV 		 Score: 0.5747
A man is playing guitar 		 The new movie is so great 		 Score: 0.5524
A man is playing guitar 		 如何更换花呗绑定银行卡 		 Score: 0.5098
A man is playing guitar 		 麻利的棕色狐狸跳过了懒狗 		 Score: 0.5210
The new movie is awesome 		 The dog plays in the garden 		 Score: 0.5940
The new movie is awesome 		 A woman watches TV 		 Score: 0.5510
The new movie is awesome 		 The new movie is so great 		 Score: 0.9822
The new movie is awesome 		 如何更换花呗绑定银行卡 		 Score: 0.4767
The new movie is awesome 		 麻利的棕色狐狸跳过了懒狗 		 Score: 0.5523
花呗更改绑定银行卡 		 The dog plays in the garden 		 Score: 0.4788
花呗更改绑定银行卡 		 A woman watches TV 		 Score: 0.3842
花呗更改绑定银行卡 		 The new movie is so great 		 Score: 0.4845
花呗更改绑定银行卡 		 如何更换花呗绑定银行卡 		 Score: 0.9377
花呗更改绑定银行卡 		 麻利的棕色狐狸跳过了懒狗 		 Score: 0.4546
The quick brown fox jumps over the lazy dog. 		 The dog plays in the garden 		 Score: 0.7547
The quick brown fox jumps over the lazy dog. 		 A woman watches TV 		 Score: 0.4952
The quick brown fox jumps over the lazy dog. 		 The new movie is so great 		 Score: 0.5761
The quick brown fox jumps over the lazy dog. 		 如何更换花呗绑定银行卡 		 Score: 0.4426
The quick brown fox jumps over the lazy dog. 		 麻利的棕色狐狸跳过了懒狗 		 Score: 0.9290
--------------------------------------------------

2:search

2023-09-11 02:46:34.448 | INFO | similarities.similarity:add_corpus:155 - Add 12 docs, total: 12, emb len: 12
2023-09-11 02:46:34.468 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: en_corpus_emb.json.
{0: {0: 0.9999998807907104, 10: 0.819859504699707, 4: 0.8006516695022583}, 1: {1: 1.0000001192092896, 4: 0.5819121599197388, 5: 0.5746968388557434}, 2: {3: 1.0, 6: 0.982224702835083, 11: 0.8939364552497864}, 3: {8: 0.9376938343048096, 1: 0.5211056470870972, 0: 0.49192243814468384}, 4: {9: 0.9290249943733215, 4: 0.657951831817627, 10: 0.6018596887588501}}
2023-09-11 02:46:37.260 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
{0: {0: 0.9999998807907104, 10: 0.819859504699707, 4: 0.8006516695022583}, 1: {1: 1.0000001192092896, 4: 0.5819121599197388, 5: 0.5746968388557434}, 2: {3: 1.0, 6: 0.982224702835083, 11: 0.8939364552497864}, 3: {8: 0.9376938343048096, 1: 0.5211056470870972, 0: 0.49192243814468384}, 4: {9: 0.9290249943733215, 4: 0.657951831817627, 10: 0.6018596887588501}}
query: The cat sits outside
search top 3:
	The cat sits outside: 1.0000
	猫在窗外: 0.8199
	The cat plays in the garden: 0.8007
query: A man is playing guitar
search top 3:
	A man is playing guitar: 1.0000
	The cat plays in the garden: 0.5819
	A woman watches TV: 0.5747
query: The new movie is awesome
search top 3:
	The new movie is awesome: 1.0000
	The new movie is so great: 0.9822
	电影很棒: 0.8939
query: 花呗更改绑定银行卡
search top 3:
	如何更换花呗绑定银行卡: 0.9377
	A man is playing guitar: 0.5211
	The cat sits outside: 0.4919
query: The quick brown fox jumps over the lazy dog.
search top 3:
	麻利的棕色狐狸跳过了懒狗: 0.9290
	The cat plays in the garden: 0.6580
	猫在窗外: 0.6019
```
1.4 Fast Approximate Text Semantic Matching Search (Annoy and Hnswlib: Million-Scale Datasets)

Approximate semantic matching search with Annoy and Hnswlib is supported, commonly used for matching and search over datasets in the millions.

At the hundred-million scale, the Milvus vector database can be used; retrieval is very fast. The benchmark below was measured on a 10-million-record dataset.

- Results preview:

Performance comparison:
| Hardware | Vector store size | Feature extraction time | Milvus retrieval time | Ranking time | Total |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 12-core CPU @ 2.5 GHz | 10M records, ~15 GB | 64.5 ms | 258.3 ms | 871.6 ms | 1.19 s |
| CPU + Tesla V100 32G | 10M records, ~15 GB | 10 ms | 213.6 ms | 24.1 ms | 0.25 s |
- Related project column (forks welcome):

An academic-literature semantic retrieval system based on Milvus + ERNIE + SimCSE with an in-batch negatives sampling strategy

The demo follows:
Annoy and Hnswlib are two widely used approximate nearest-neighbor libraries; both can efficiently search for the neighbors closest to a given vector.

Annoy (Approximate Nearest Neighbors Oh Yeah):
- Annoy is a tree-based approximate nearest-neighbor algorithm in which the trees are built as a special kind of binary search tree. It supports several metrics, including angular (cosine-like) and Euclidean distance.
- Annoy builds its index quickly and keeps a small, memory-mapped footprint. Its search is fast, and it is especially well suited to approximate search over high-dimensional vectors.
- Annoy can be used for a variety of tasks, such as recommender systems and image and text processing.
- Annoy's interface is simple and easy to use, with bindings for multiple languages such as Python and C++.
Hnswlib (Hierarchical Navigable Small World Library):
- Hnswlib implements a graph-based approximate nearest-neighbor algorithm built on a data structure called the "Hierarchical Navigable Small World" (HNSW) graph. By building a multi-layer index structure, it can quickly find the nearest neighbors.
- Hnswlib supports several distance metrics, such as Euclidean (l2), inner product, and cosine, making it suitable for different application scenarios.
- The Hnswlib index can be updated online, supporting efficient insertion of new vectors.
- Hnswlib provides multithreaded search to increase throughput, and it searches efficiently on large-scale datasets.
- Hnswlib is implemented in C++ but also provides Python bindings.

In summary, Annoy is a tree-based approximate nearest-neighbor algorithm well suited to approximate search over high-dimensional vectors, while Hnswlib is a graph-based algorithm built on the "Hierarchical Navigable Small World" structure that supports multiple distance metrics and is suited to efficient search over large datasets. A raw-API sketch of both libraries follows.
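Before turning to this library's wrappers, here is a minimal sketch of the two underlying packages used directly on random vectors; the dimensions and parameter values are illustrative choices, not the defaults used by similarities:

```python
import numpy as np
from annoy import AnnoyIndex
import hnswlib

dim, n = 64, 1000
data = np.random.rand(n, dim).astype(np.float32)

# --- Annoy: a forest of random-projection trees (static once built) ---
annoy_index = AnnoyIndex(dim, 'angular')  # 'angular' approximates cosine distance
for i, v in enumerate(data):
    annoy_index.add_item(i, v)
annoy_index.build(10)  # more trees -> better recall, larger index
ids, dists = annoy_index.get_nns_by_vector(data[0], 5, include_distances=True)
print("Annoy top-5:", ids, dists)

# --- Hnswlib: a navigable small-world graph (supports incremental inserts) ---
hnsw_index = hnswlib.Index(space='cosine', dim=dim)
hnsw_index.init_index(max_elements=n, ef_construction=200, M=16)
hnsw_index.add_items(data, np.arange(n))
hnsw_index.set_ef(50)  # ef must exceed k; higher -> better recall, slower queries
labels, distances = hnsw_index.knn_query(data[0], k=5)
print("HNSW top-5:", labels[0], distances[0])
```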
Further reading:

[Recommender Systems [9], project technical details z1: how Elasticsearch performs fast retrieval (ES inverted index and tokenization principles), and how inverted indexes are used in recall](https://blog.csdn.net/sinat_39620217/article/details/129399015)

[Recommender Systems [9], project technical details z3: vector retrieval and ANN search algorithms (KD-tree, Annoy, LSH locality-sensitive hashing, PQ product quantization, IVFPQ inverted product quantization, HNSW hierarchical graph search, etc.), with detailed explanations of the underlying principles](https://blog.csdn.net/sinat_39620217/article/details/129410504)
```python
import os
import sys

# Add the project path so the local modules can be imported
sys.path.append("/home/mw/project/similarities-main/similarities")
from fastsim import AnnoySimilarity
from fastsim import HnswlibSimilarity

# Note: edit /home/mw/project/similarities-main/similarities/fastsim.py and change the model
# path to model_name_or_path="/home/mw/input/99556126636/zh_model/中文模型"

sentences = ['明天的天气不错是晴天', '今天天气很好阳光明媚']
corpus = [
    '今天天气很好阳光明媚',
    '在好天气里,我喜爱去漫步',
    '这本书太无聊了,我无奈读上来',
    '我喜爱去海边度假,感触阳光和海风',
    '在旅行途中,咱们遇到了许多乏味的人',
    '我喜爱随时随地用耳机听音乐',
]


def annoy_demo():
    corpus_new = [i + str(id) for id, i in enumerate(corpus * 10)]
    model = AnnoySimilarity(corpus=corpus_new)
    print(model)
    similarity_score = model.similarity(sentences[0], sentences[1])
    print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
    model.add_corpus(corpus)
    model.build_index()
    model.save_index('annoy_model.bin')
    print(model.most_similar("men喜爱这首歌"))
    # Semantic Search batch
    del model
    model = AnnoySimilarity()
    model.load_index('annoy_model.bin')
    print(model.most_similar("men喜爱这首歌"))
    queries = ["明天的天气不错是晴天", "men喜爱这首歌"]
    res = model.most_similar(queries, topn=3)
    print(res)
    for q_id, c in res.items():
        print('query:', queries[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
    # os.remove('annoy_model.bin')
    print('-' * 50 + '\n')


def hnswlib_demo():
    corpus_new = [i + str(id) for id, i in enumerate(corpus * 10)]
    print(corpus_new)
    model = HnswlibSimilarity(corpus=corpus_new)
    print(model)
    similarity_score = model.similarity(sentences[0], sentences[1])
    print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")
    model.add_corpus(corpus)
    model.build_index()
    model.save_index('hnsw_model.bin')
    print(model.most_similar("men喜爱这首歌"))
    # Semantic Search batch
    del model
    model = HnswlibSimilarity()
    model.load_index('hnsw_model.bin')
    print(model.most_similar("men喜爱这首歌"))
    queries = ["明天的天气不错是晴天", "men喜爱这首歌"]
    res = model.most_similar(queries, topn=3)
    print(res)
    for q_id, c in res.items():
        print('query:', queries[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{model.corpus[corpus_id]}: {s:.4f}')
    # os.remove('hnsw_model.bin')
    print('-' * 50 + '\n')


if __name__ == '__main__':
    annoy_demo()
    hnswlib_demo()
```
```
2023-09-11 03:09:23.344 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:23.348 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 60
2023-09-11 03:09:28.151 | INFO | similarities.similarity:add_corpus:155 - Add 60 docs, total: 60, emb len: 60
2023-09-11 03:09:28.241 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
2023-09-11 03:09:28.242 | DEBUG | fastsim:build_index:53 - Building index with 256 trees.
Similarity: AnnoySimilarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>, corpus size: 60
2023-09-11 03:09:29.445 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
2023-09-11 03:09:30.544 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 66, emb len: 66
2023-09-11 03:09:30.544 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
2023-09-11 03:09:30.545 | DEBUG | fastsim:build_index:53 - Building index with 256 trees.
2023-09-11 03:09:30.669 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: annoy_model.bin.json.
2023-09-11 03:09:30.669 | INFO | fastsim:save_index:67 - Saving Annoy index to: annoy_model.bin, corpus embedding to: annoy_model.bin.json
{0: {59: 0.4495447165407924, 29: 0.44775851052770577, 5: 0.44683993510903264, 11: 0.44624594564543685, 35: 0.44612286870435014, 53: 0.44598170858875363, 41: 0.44574389155260974, 17: 0.44521147337774636, 23: 0.44469588530591864, 47: 0.444264516571927}}
2023-09-11 03:09:32.450 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:32.544 | INFO | fastsim:load_index:75 - Loading index from: annoy_model.bin, corpus embedding from: annoy_model.bin.json
2023-09-11 03:09:32.566 | DEBUG | fastsim:create_index:48 - Init Annoy index, embedding_size: 768
{0: {59: 0.4495447165407924, 29: 0.44775851052770577, 5: 0.44683993510903264, 11: 0.44624594564543685, 35: 0.44612286870435014, 53: 0.44598170858875363, 41: 0.44574389155260974, 17: 0.44521147337774636, 23: 0.44469588530591864, 47: 0.444264516571927}}
{0: {60: 0.7727918750653213, 6: 0.7462793254362339, 36: 0.7355303251593384}, 1: {59: 0.4495447165407924, 29: 0.44775863580996855, 5: 0.4468400604954468}}
query: 明天的天气不错是晴天
search top 3:
	今天天气很好阳光明媚: 0.7728
	今天天气很好阳光明媚6: 0.7463
	今天天气很好阳光明媚36: 0.7355
query: men喜爱这首歌
search top 3:
	我喜爱随时随地用耳机听音乐59: 0.4495
	我喜爱随时随地用耳机听音乐29: 0.4478
	我喜爱随时随地用耳机听音乐5: 0.4468
--------------------------------------------------
['今天天气很好阳光明媚0', '在好天气里,我喜爱去漫步1', '这本书太无聊了,我无奈读上来2', '我喜爱去海边度假,感触阳光和海风3', '在旅行途中,咱们遇到了许多乏味的人4', '我喜爱随时随地用耳机听音乐5', '今天天气很好阳光明媚6', '在好天气里,我喜爱去漫步7', '这本书太无聊了,我无奈读上来8', '我喜爱去海边度假,感触阳光和海风9', '在旅行途中,咱们遇到了许多乏味的人10', '我喜爱随时随地用耳机听音乐11', '今天天气很好阳光明媚12', '在好天气里,我喜爱去漫步13', '这本书太无聊了,我无奈读上来14', '我喜爱去海边度假,感触阳光和海风15', '在旅行途中,咱们遇到了许多乏味的人16', '我喜爱随时随地用耳机听音乐17', '今天天气很好阳光明媚18', '在好天气里,我喜爱去漫步19', '这本书太无聊了,我无奈读上来20', '我喜爱去海边度假,感触阳光和海风21', '在旅行途中,咱们遇到了许多乏味的人22', '我喜爱随时随地用耳机听音乐23', '今天天气很好阳光明媚24', '在好天气里,我喜爱去漫步25', '这本书太无聊了,我无奈读上来26', '我喜爱去海边度假,感触阳光和海风27', '在旅行途中,咱们遇到了许多乏味的人28', '我喜爱随时随地用耳机听音乐29', '今天天气很好阳光明媚30', '在好天气里,我喜爱去漫步31', '这本书太无聊了,我无奈读上来32', '我喜爱去海边度假,感触阳光和海风33', '在旅行途中,咱们遇到了许多乏味的人34', '我喜爱随时随地用耳机听音乐35', '今天天气很好阳光明媚36', '在好天气里,我喜爱去漫步37', '这本书太无聊了,我无奈读上来38', '我喜爱去海边度假,感触阳光和海风39', '在旅行途中,咱们遇到了许多乏味的人40', '我喜爱随时随地用耳机听音乐41', '今天天气很好阳光明媚42', '在好天气里,我喜爱去漫步43', '这本书太无聊了,我无奈读上来44', '我喜爱去海边度假,感触阳光和海风45', '在旅行途中,咱们遇到了许多乏味的人46', '我喜爱随时随地用耳机听音乐47', '今天天气很好阳光明媚48', '在好天气里,我喜爱去漫步49', '这本书太无聊了,我无奈读上来50', '我喜爱去海边度假,感触阳光和海风51', '在旅行途中,咱们遇到了许多乏味的人52', '我喜爱随时随地用耳机听音乐53', '今天天气很好阳光明媚54', '在好天气里,我喜爱去漫步55', '这本书太无聊了,我无奈读上来56', '我喜爱去海边度假,感触阳光和海风57', '在旅行途中,咱们遇到了许多乏味的人58', '我喜爱随时随地用耳机听音乐59']
2023-09-11 03:09:34.947 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:35.044 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 60
2023-09-11 03:09:40.349 | INFO | similarities.similarity:add_corpus:155 - Add 60 docs, total: 60, emb len: 60
2023-09-11 03:09:40.350 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
2023-09-11 03:09:40.351 | INFO | fastsim:build_index:156 - Building HNSWLIB index, max_elements: 60
2023-09-11 03:09:40.351 | DEBUG | fastsim:build_index:157 - Parameters Required: M: 64
2023-09-11 03:09:40.352 | DEBUG | fastsim:build_index:158 - Parameters Required: ef_construction: 400
2023-09-11 03:09:40.352 | DEBUG | fastsim:build_index:159 - Parameters Required: ef(>topn): 50
Similarity: HnswlibSimilarity, matching_model: <SentenceModel: /home/mw/input/99556126636/zh_model/中文模型, encoder_type: MEAN, max_seq_length: 128, emb_dim: 768>, corpus size: 60
2023-09-11 03:09:41.345 | INFO | similarities.similarity:add_corpus:151 - Start computing corpus embeddings, new docs: 6
明天的天气不错是晴天 vs 今天天气很好阳光明媚, score: 0.7728
2023-09-11 03:09:42.548 | INFO | similarities.similarity:add_corpus:155 - Add 6 docs, total: 66, emb len: 66
2023-09-11 03:09:42.550 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
2023-09-11 03:09:42.551 | INFO | fastsim:build_index:156 - Building HNSWLIB index, max_elements: 66
2023-09-11 03:09:42.551 | DEBUG | fastsim:build_index:157 - Parameters Required: M: 64
2023-09-11 03:09:42.552 | DEBUG | fastsim:build_index:158 - Parameters Required: ef_construction: 400
2023-09-11 03:09:42.552 | DEBUG | fastsim:build_index:159 - Parameters Required: ef(>topn): 50
2023-09-11 03:09:42.717 | DEBUG | similarities.similarity:save_index:230 - Save corpus embeddings to file: hnsw_model.bin.json.
2023-09-11 03:09:42.718 | INFO | fastsim:save_index:172 - Saving hnswlib index to: hnsw_model.bin, corpus embedding to: hnsw_model.bin.json
{0: {59: 0.44954460859298706, 29: 0.44775843620300293, 5: 0.44683992862701416, 11: 0.44624578952789307, 35: 0.4461227059364319, 53: 0.4459817409515381, 41: 0.4457439184188843, 17: 0.44521135091781616, 23: 0.4446955919265747, 47: 0.44426441192626953}}
2023-09-11 03:09:44.547 | DEBUG | text2vec.sentence_model:__init__:76 - Use device: cpu
2023-09-11 03:09:44.643 | INFO | fastsim:load_index:180 - Loading index from: hnsw_model.bin, corpus embedding from: hnsw_model.bin.json
2023-09-11 03:09:44.665 | DEBUG | fastsim:create_index:150 - Init Hnswlib index, embedding_size: 768
Warning: Calling load_index for an already inited index. Old index is being deallocated.
{0: {59: 0.44954460859298706, 29: 0.44775843620300293, 5: 0.44683992862701416, 11: 0.44624578952789307, 35: 0.4461227059364319, 53: 0.4459817409515381, 41: 0.4457439184188843, 17: 0.44521135091781616, 23: 0.4446955919265747, 47: 0.44426441192626953}}
{0: {60: 0.7727917432785034, 6: 0.746279239654541, 36: 0.7355299592018127}, 1: {59: 0.4495447278022766, 29: 0.4477585554122925, 5: 0.44683998823165894}}
query: 明天的天气不错是晴天
search top 3:
	今天天气很好阳光明媚: 0.7728
	今天天气很好阳光明媚6: 0.7463
	今天天气很好阳光明媚36: 0.7355
query: men喜爱这首歌
search top 3:
	我喜爱随时随地用耳机听音乐59: 0.4495
	我喜爱随时随地用耳机听音乐29: 0.4478
	我喜爱随时随地用耳机听音乐5: 0.4468
--------------------------------------------------
```
1.5 Literal (Surface-Form) Text Similarity Computation and Matching Search

Similarity computation and literal matching search are supported with algorithms such as the Cilin synonym thesaurus, HowNet, word embeddings (WordEmbedding), Tfidf, SimHash, and BM25, commonly used to cold-start text matching.
Running CilinSimilarity out of the box raises an error, because the pip-installed package does not ship the Cilin dictionary file:

```
--> 480         self.cilin_dict = self.load_cilin_dict(cilin_path)  # Cilin semantic dictionary
    481         self.corpus = {}
    482

/opt/conda/lib/python3.7/site-packages/similarities/literalsim.py in load_cilin_dict(path)
    522     """加载词林语义词典"""
    523     sem_dict = {}
--> 524     for line in open(path, 'r', encoding='utf-8'):
    525         line = line.strip()
    526         terms = line.split(' ')

FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.7/site-packages/similarities/data/cilin.txt'
```
To fix it, edit literalsim.py and point the dictionary path at the bundled file:

```python
default_cilin_path = '/home/mw/project/similarities-main/similarities/data/cilin.txt'
```
```python
import sys

from loguru import logger

sys.path.append('/home/mw/project/similarities-main')
sys.path.append('/home/mw/project/similarities-main/similarities')
from similarities import (
    SimHashSimilarity,
    TfidfSimilarity,
    BM25Similarity,
    WordEmbeddingSimilarity,
    CilinSimilarity,
    HownetSimilarity,
    SameCharsSimilarity,
    SequenceMatcherSimilarity,
)

logger.remove()
logger.add(sys.stderr, level="INFO")


def sim_and_search(m):
    print(m)
    if 'BM25' not in str(m):
        sim_scores = m.similarity(text1, text2)
        print('sim scores: ', sim_scores)
        for (idx, i), j in zip(enumerate(text1), text2):
            s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
            print(f"{i} vs {j}, score: {s:.4f}")
    m.add_corpus(corpus)
    res = m.most_similar(queries, topn=3)
    print('sim search: ', res)
    for q_id, c in res.items():
        print('query:', queries[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{m.corpus[corpus_id]}: {s:.4f}')
    print('-' * 50 + '\n')


if __name__ == '__main__':
    text1 = [
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡'
    ]
    text2 = [
        '花呗更改绑定银行卡',
        '我什么时候开明了花呗',
    ]
    corpus = [
        '花呗更改绑定银行卡',
        '我什么时候开明了花呗',
        '俄罗斯正告乌克兰拥护欧盟协定',
        '暴风雨埋葬了东北部;新泽西16英寸的降雪',
        '地方情报局局长拜访以色列叙利亚谈判',
        '人在巴基斯坦基地的炸弹袭击中丧生',
    ]
    queries = [
        '我的花呗开明了?',
        '乌克兰被俄罗斯正告',
        '更改绑定银行卡',
    ]
    print('text1: ', text1)
    print('text2: ', text2)
    print('query: ', queries)
    sim_and_search(SimHashSimilarity())
    sim_and_search(TfidfSimilarity())
    sim_and_search(BM25Similarity())
    sim_and_search(WordEmbeddingSimilarity())
    # sim_and_search(CilinSimilarity())   # dictionary files go under /home/mw/project/similarities-main/similarities/data/; add them yourself
    # sim_and_search(HownetSimilarity())
    sim_and_search(SameCharsSimilarity())
    sim_and_search(SequenceMatcherSimilarity())
```
```
2023-09-11 03:36:11.670 | INFO | similarities.literalsim:add_corpus:75 - Start computing corpus embeddings, new docs: 6
text1:  ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
text2:  ['花呗更改绑定银行卡', '我什么时候开明了花呗']
query:  ['我的花呗开明了?', '乌克兰被俄罗斯正告', '更改绑定银行卡']
Similarity: SimHashSimilarity, matching_model: SimHash
sim scores:  [0.9375, 0.5]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.9375
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.5000
Computing corpus SimHash: 100%|██████████| 6/6 [00:00<00:00, 3128.91it/s]
2023-09-11 03:36:11.675 | INFO | similarities.literalsim:add_corpus:84 - Add 6 docs, total: 6, emb size: 6
sim search:  {0: {3: 0.703125, 5: 0.5625, 1: 0.515625}, 1: {0: 0.78125, 1: 0.484375, 2: 0.484375}, 2: {4: 1.0, 1: 0.59375, 2: 0.59375}}
query: 我的花呗开明了?
search top 3:
	我什么时候开明了花呗: 0.7031
	暴风雨埋葬了东北部;新泽西16英寸的降雪: 0.5625
	人在巴基斯坦基地的炸弹袭击中丧生: 0.5156
query: 乌克兰被俄罗斯正告
search top 3:
	俄罗斯正告乌克兰拥护欧盟协定: 0.7812
	人在巴基斯坦基地的炸弹袭击中丧生: 0.4844
	地方情报局局长拜访以色列叙利亚谈判: 0.4844
query: 更改绑定银行卡
search top 3:
	花呗更改绑定银行卡: 1.0000
	人在巴基斯坦基地的炸弹袭击中丧生: 0.5938
	地方情报局局长拜访以色列叙利亚谈判: 0.5938
--------------------------------------------------
Similarity: TfidfSimilarity, matching_model: Tfidf
2023-09-11 03:36:12.649 | INFO | similarities.literalsim:add_corpus:238 - Start computing corpus embeddings, new docs: 6
sim scores:  tensor([[0.7948, 0.4022],
        [1.0000, 0.4048]], dtype=torch.float64)
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.7948
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.4048
Computing corpus TFIDF: 100%|██████████| 6/6 [00:00<00:00, 23.17it/s]
2023-09-11 03:36:12.911 | INFO | similarities.literalsim:add_corpus:247 - Add 6 docs, total: 6, emb size: 6
2023-09-11 03:36:13.461 | INFO | similarities.literalsim:add_corpus:334 - Start computing corpus embeddings, new docs: 6
2023-09-11 03:36:13.463 | INFO | similarities.literalsim:add_corpus:340 - Add 6 docs, total: 6
2023-09-11 03:36:13.465 | INFO | text2vec.word2vec:__init__:80 - Load pretrained model:w2v-light-tencent-chinese, path:/home/mw/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin
sim search:  {0: {3: 0.921499490737915, 4: 0.43930041790008545, 5: 0.0}, 1: {0: 0.7380481958389282, 4: 0.0, 3: 0.0}, 2: {4: 0.8345502018928528, 3: 0.0, 5: 0.0}}
query: 我的花呗开明了?
search top 3:
	我什么时候开明了花呗: 0.9215
	花呗更改绑定银行卡: 0.4393
	暴风雨埋葬了东北部;新泽西16英寸的降雪: 0.0000
query: 乌克兰被俄罗斯正告
search top 3:
	俄罗斯正告乌克兰拥护欧盟协定: 0.7380
	花呗更改绑定银行卡: 0.0000
	我什么时候开明了花呗: 0.0000
query: 更改绑定银行卡
search top 3:
	花呗更改绑定银行卡: 0.8346
	我什么时候开明了花呗: 0.0000
	暴风雨埋葬了东北部;新泽西16英寸的降雪: 0.0000
--------------------------------------------------
Similarity: BM25Similarity, matching_model: BM25
sim search:  {0: {3: 4.453010263817695, 4: 1.3720219233789517, 5: 1.010258330300517}, 1: {0: 4.245182027356298, 1: 0.0, 2: 0.0}, 2: {4: 4.549213631437518, 0: 0.0, 1: 0.0}}
query: 我的花呗开明了?
search top 3:
	我什么时候开明了花呗: 4.4530
	花呗更改绑定银行卡: 1.3720
	暴风雨埋葬了东北部;新泽西16英寸的降雪: 1.0103
query: 乌克兰被俄罗斯正告
search top 3:
	俄罗斯正告乌克兰拥护欧盟协定: 4.2452
	人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
	地方情报局局长拜访以色列叙利亚谈判: 0.0000
query: 更改绑定银行卡
search top 3:
	花呗更改绑定银行卡: 4.5492
	俄罗斯正告乌克兰拥护欧盟协定: 0.0000
	人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
2023-09-11 03:36:14.743 | INFO | similarities.literalsim:add_corpus:424 - Start computing corpus embeddings, new docs: 6
Similarity: WordEmbeddingSimilarity, matching_model: Word2Vec
sim scores:  tensor([[0.9812, 0.8195],
        [1.0000, 0.8264]], dtype=torch.float64)
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.9812
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.8264
Word2Vec Embeddings: 100%|██████████| 6/6 [00:00<00:00, 7707.76it/s]
2023-09-11 03:36:14.746 | INFO | similarities.literalsim:add_corpus:431 - Add 6 docs, total: 6, emb size: 6
2023-09-11 03:36:14.756 | INFO | similarities.literalsim:add_corpus:804 - Start add new docs: 6
2023-09-11 03:36:14.757 | INFO | similarities.literalsim:add_corpus:805 - Add 6 docs, total: 6
2023-09-11 03:36:14.761 | INFO | similarities.literalsim:add_corpus:900 - Start add new docs: 6
2023-09-11 03:36:14.762 | INFO | similarities.literalsim:add_corpus:901 - Add 6 docs, total: 6
sim search:  {0: {3: 0.8737779259681702, 4: 0.7954849004745483, 5: 0.713451623916626}, 1: {0: 0.9661487936973572, 1: 0.811479926109314, 2: 0.7922273278236389}, 2: {4: 0.9858745336532593, 2: 0.819598913192749, 5: 0.8008757829666138}}
query: 我的花呗开明了?
search top 3:
	我什么时候开明了花呗: 0.8738
	花呗更改绑定银行卡: 0.7955
	暴风雨埋葬了东北部;新泽西16英寸的降雪: 0.7135
query: 乌克兰被俄罗斯正告
search top 3:
	俄罗斯正告乌克兰拥护欧盟协定: 0.9661
	人在巴基斯坦基地的炸弹袭击中丧生: 0.8115
	地方情报局局长拜访以色列叙利亚谈判: 0.7922
query: 更改绑定银行卡
search top 3:
	花呗更改绑定银行卡: 0.9859
	地方情报局局长拜访以色列叙利亚谈判: 0.8196
	暴风雨埋葬了东北部;新泽西16英寸的降雪: 0.8009
--------------------------------------------------
Similarity: SameCharsSimilarity, matching_model: SameChars
sim scores:  [0.8888888888888888, 0.2222222222222222]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.8889
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.2222
sim search:  {0: {3: 0.75, 4: 0.25, 5: 0.25}, 1: {0: 0.8888888888888888, 1: 0.1111111111111111, 2: 0.0}, 2: {4: 1.0, 0: 0.0, 1: 0.0}}
query: 我的花呗开明了?
search top 3:
	我什么时候开明了花呗: 0.7500
	花呗更改绑定银行卡: 0.2500
	暴风雨埋葬了东北部;新泽西16英寸的降雪: 0.2500
query: 乌克兰被俄罗斯正告
search top 3:
	俄罗斯正告乌克兰拥护欧盟协定: 0.8889
	人在巴基斯坦基地的炸弹袭击中丧生: 0.1111
	地方情报局局长拜访以色列叙利亚谈判: 0.0000
query: 更改绑定银行卡
search top 3:
	花呗更改绑定银行卡: 1.0000
	俄罗斯正告乌克兰拥护欧盟协定: 0.0000
	人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
Similarity: SequenceMatcherSimilarity, matching_model: SequenceMatcher
sim scores:  [0.5555555555555556, 0.2222222222222222]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.5556
花呗更改绑定银行卡 vs 我什么时候开明了花呗, score: 0.2222
sim search:  {0: {3: 0.375, 4: 0.25, 1: 0.125}, 1: {0: 0.5555555555555556, 1: 0.1111111111111111, 2: 0.0}, 2: {4: 1.0, 0: 0.0, 1: 0.0}}
query: 我的花呗开明了?
search top 3:
	我什么时候开明了花呗: 0.3750
	花呗更改绑定银行卡: 0.2500
	人在巴基斯坦基地的炸弹袭击中丧生: 0.1250
query: 乌克兰被俄罗斯正告
search top 3:
	俄罗斯正告乌克兰拥护欧盟协定: 0.5556
	人在巴基斯坦基地的炸弹袭击中丧生: 0.1111
	地方情报局局长拜访以色列叙利亚谈判: 0.0000
query: 更改绑定银行卡
search top 3:
	花呗更改绑定银行卡: 1.0000
	俄罗斯正告乌克兰拥护欧盟协定: 0.0000
	人在巴基斯坦基地的炸弹袭击中丧生: 0.0000
--------------------------------------------------
```
2. Image Similarity Computation and Matching Search

Image similarity computation and matching search are supported with algorithms such as CLIP, pHash, and SIFT.

Download the model OFA-Sys/chinese-clip-vit-base-patch16 from Hugging Face yourself and upload it to the Heywhale workspace; the full run is not shown here (a runnable sketch follows the demo output below).
```python
# import glob
# import sys
# from PIL import Image
# sys.path.append('/home/mw/project/similarities-main/similarities')
# from imagesim import ImageHashSimilarity, SiftSimilarity, ClipSimilarity

# def sim_and_search(m):
#     print(m)
#     # similarity
#     sim_scores = m.similarity(imgs1, imgs2)
#     print('sim scores: ', sim_scores)
#     for (idx, i), j in zip(enumerate(image_fps1), image_fps2):
#         s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
#         print(f"{i} vs {j}, score: {s:.4f}")
#     # search
#     m.add_corpus(corpus_imgs)
#     queries = imgs1
#     res = m.most_similar(queries, topn=3)
#     print('sim search: ', res)
#     for q_id, c in res.items():
#         print('query:', image_fps1[q_id])
#         print("search top 3:")
#         for corpus_id, s in c.items():
#             print(f'\t{m.corpus[corpus_id].filename}: {s:.4f}')
#     print('-' * 50 + '\n')

# def clip_demo():
#     m = ClipSimilarity()
#     print(m)
#     # similarity score between text and image
#     image_fps = [
#         '/home/mw/project/similarities-main/examples/data/image3.png',  # yellow flower image
#         '/home/mw/project/similarities-main/examples/data/image1.png',  # tiger image
#     ]
#     texts = ['a yellow flower', '老虎', '一头狮子', '玩具车']
#     imgs = [Image.open(i) for i in image_fps]
#     sim_scores = m.similarity(imgs, texts)
#     print('sim scores: ', sim_scores)
#     for idx, i in enumerate(image_fps):
#         for idy, j in enumerate(texts):
#             s = sim_scores[idx][idy]
#             print(f"{i} vs {j}, score: {s:.4f}")
#     print('-' * 50 + '\n')

# if __name__ == "__main__":
#     image_fps1 = ['/home/mw/project/similarities-main/examples/data/image1.png',
#                   '/home/mw/project/similarities-main/examples/data/image3.png']
#     image_fps2 = ['/home/mw/project/similarities-main/examples/data/image12-like-image1.png',
#                   '/home/mw/project/similarities-main/examples/data/image10.png']
#     imgs1 = [Image.open(i) for i in image_fps1]
#     imgs2 = [Image.open(i) for i in image_fps2]
#     corpus_fps = glob.glob('data/*.jpg') + glob.glob('data/*.png')
#     corpus_imgs = [Image.open(i) for i in corpus_fps]
#     # 1. image and text similarity
#     clip_demo()
#     # 2. image and image similarity score
#     sim_and_search(ClipSimilarity())  # the best result
#     sim_and_search(ImageHashSimilarity(hash_function='phash'))
#     sim_and_search(SiftSimilarity())
```
```
Similarity: ClipSimilarity, matching_model: CLIPModel
sim scores:  tensor([[0.9580, 0.8654],
        [0.6558, 0.6145]])
data/image1.png vs data/image12-like-image1.png, score: 0.9580
data/image3.png vs data/image10.png, score: 0.6145
sim search:  {0: {6: 0.9999999403953552, 0: 0.9579654932022095, 4: 0.9326782822608948}, 1: {8: 0.9999997615814209, 4: 0.6729235649108887, 0: 0.6558331847190857}}
query: data/image1.png
search top 3:
	data/image1.png: 1.0000
	data/image12-like-image1.png: 0.9580
	data/image8-like-image1.png: 0.9327

sim scores:  tensor([[0.3220, 0.2409],
        [0.1677, 0.2959]])
data/image3.png vs a yellow flower, score: 0.3220
data/image1.png vs 老虎, score: 0.2112
```
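Since the demo above is commented out, here is a minimal image-text similarity sketch that calls the Hugging Face transformers Chinese-CLIP classes directly, rather than this library's ClipSimilarity wrapper; the image path and text prompts are placeholders:

```python
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("image3.png")  # placeholder: e.g. a yellow flower image
texts = ["一朵黄色的花", "一只老虎", "一头狮子", "玩具车"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity logits; softmax gives probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.4f}")
```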
For more quality content, follow the WeChat official account 汀丶人工智能, which provides related resources and quality articles, free to read.

Project link: Text Semantic Matching Search Quick-Start Baseline