Deep Learning: First Lesson of the New Work Year, Building a Fashion Search Engine with DocArray

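DocArray is a library recently released by Jina AI for transferring nested and unstructured data. This article shows how to use DocArray to build a simple fashion search engine.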

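Happy first day back at work, everyone!

We have prepared an easy-to-follow demo and tools that work out of the box. In the new year, let's use this power-up to tackle a real headache: transferring unstructured data.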

DocArray: a must-have library for deep learning engineers

DocArray: The data structure for unstructured data.

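DocArray is an extensible data structure well suited to deep learning tasks. It is mainly used for transferring nested and unstructured data, and it supports data types including text, image, audio, video, and 3D mesh.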
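
To make "nested" concrete, here is a minimal plain-Python sketch of the idea: a document that carries content, an optional embedding, and child documents. `ToyDoc` and `flatten` are hypothetical names used for illustration only; this is not DocArray's implementation or its API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a nested document (not the DocArray class)."""
    text: str = ""
    uri: str = ""
    embedding: Optional[List[float]] = None
    chunks: List["ToyDoc"] = field(default_factory=list)


def flatten(doc: ToyDoc) -> Iterator[ToyDoc]:
    """Yield the document and all of its descendants, depth first."""
    yield doc
    for chunk in doc.chunks:
        yield from flatten(chunk)


# A product page that nests a caption and an image, which in turn nests a crop.
page = ToyDoc(
    text="product page",
    chunks=[
        ToyDoc(text="red summer dress"),
        ToyDoc(uri="front.jpg", chunks=[ToyDoc(uri="front_crop.jpg")]),
    ],
)
print([d.text or d.uri for d in flatten(page)])
```

Keeping text, images, and their sub-parts in one tree is what lets a single structure be processed, embedded, and transferred as a unit.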

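Compared with other data structures:

[Comparison table: DocArray versus other data structures]

✅ means full support, ✔ means partial support, ❌ means no support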

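With DocArray, deep learning engineers can use a Pythonic API to efficiently process, embed, search, recommend, store, and transfer data.

In the tutorial below, you will learn how to:

Build a simple fashion search system with DocArray;
Upload a clothing image and find similar matches in the dataset.

Note: all code for this tutorial can be downloaded from GitHub.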

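Building a fashion search system, step by step

Preparation: watch the DocArray video

Five minutes is a small price to pay: it clears up gaps in background knowledge and prepares you for the steps that follow.

Our volunteer subtitlers are working on a translation, and a version with Chinese subtitles is expected this week. For the English video, see Here.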

from IPython.display import YouTubeVideo
YouTubeVideo("Amo19S1SrhE", width=800, height=450)

Configuration: set the basic variables and adjust them for your project

DATA_DIR = "./data"
DATA_PATH = f"{DATA_DIR}/*.jpg"
MAX_DOCS = 1000
QUERY_IMAGE = "./query.jpg" # image we'll use to search with
PLOT_EMBEDDINGS = False # Really useful but have to manually stop it to progress to next cell

# Toy data - If data dir doesn't exist, we'll get data of ~800 fashion images from here
TOY_DATA_URL = "https://github.com/alexcg1/neural-search-notebooks/raw/main/fashion-search/data.zip?raw=true"

Setup

# We use "[full]" because we want to deal with more complex data like images (as opposed to text)
!pip install "docarray[full]==0.4.4"

from docarray import Document, DocumentArray

Load the images

# Download images if they don't exist
import os

if not os.path.isdir(DATA_DIR) and not os.path.islink(DATA_DIR):
    print(f"Can't find {DATA_DIR}. Downloading toy dataset")
    !wget "$TOY_DATA_URL" -O data.zip
    !unzip -q data.zip # Don't print out every darn filename
    !rm -f data.zip
else:
    print(f"Nothing to download. Using {DATA_DIR} for data")

# Use `.from_files` to quickly load them into a `DocumentArray`
docs = DocumentArray.from_files(DATA_PATH, size=MAX_DOCS)
print(f"{len(docs)} Documents in DocumentArray")

docs.plot_image_sprites() # Preview the images

Preprocess the images

from docarray import Document

# Convert to tensor, normalize so they're all similar enough
def preproc(d: Document):
    return (d.load_uri_to_image_tensor()  # load
             .set_image_tensor_shape((80, 60))  # ensure all images right size (dataset image size _should_ be (80, 60))
             .set_image_tensor_normalization()  # normalize color 
             .set_image_tensor_channel_axis(-1, 0))  # switch color axis for the PyTorch model later

# apply en masse
docs.apply(preproc)
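
For readers who want to see what the last two steps of `preproc` amount to, here is a minimal plain-Python sketch of per-channel normalization followed by moving the color axis to the front. The helper names `normalize_hwc` and `hwc_to_chw` are hypothetical, and the mean and standard deviation are the commonly used ImageNet statistics; whether DocArray's default normalization uses exactly these values is an assumption. Resizing is left out.

```python
from typing import List

# Commonly used ImageNet channel statistics. That DocArray's default
# normalization matches these values is an assumption of this sketch.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

Image = List[List[List[float]]]


def normalize_hwc(img: Image) -> Image:
    """Scale 0-255 pixels to 0-1, then standardize each color channel."""
    return [
        [[(v / 255.0 - MEAN[c]) / STD[c] for c, v in enumerate(px)] for px in row]
        for row in img
    ]


def hwc_to_chw(img: Image) -> Image:
    """Move the channel axis from last to first: (H, W, C) -> (C, H, W)."""
    height, width, channels = len(img), len(img[0]), len(img[0][0])
    return [
        [[img[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A tiny 2x1 RGB "image": one white pixel above one black pixel.
tiny = [[[255, 255, 255]], [[0, 0, 0]]]
chw = hwc_to_chw(normalize_hwc(tiny))
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width
```

The channel-first layout matters because PyTorch vision models expect input shaped (C, H, W), while image files are usually decoded as (H, W, C).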

Embed the images

!pip install torchvision==0.11.2

# Use GPU if available
import torch
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

import torchvision
model = torchvision.models.resnet50(pretrained=True)  # load ResNet50

docs.embed(model, device=device)

Visualize the embeddings

if PLOT_EMBEDDINGS:
    docs.plot_embeddings(image_sprites=True, image_source="uri")

Create the query Document

Here we use the first image in the dataset.

# Download query doc
!wget https://github.com/alexcg1/neural-search-notebooks/raw/main/fashion-search/1_build_basic_search/query.jpg -O query.jpg

query_doc = Document(uri=QUERY_IMAGE)
query_doc.display()

# Throw the one Document into a DocumentArray, since that's what we're matching against
query_docs = DocumentArray([query_doc])

# Apply same preprocessing
query_docs.apply(preproc)

# ...and create embedding just like we did with the dataset
query_docs.embed(model, device=device) # `device` was chosen above: "cuda" if available, otherwise "cpu"

Match

query_docs.match(docs, limit=9)
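
Conceptually, matching is a nearest-neighbor search over the embeddings. The sketch below shows that idea in plain Python, assuming cosine distance as the metric and non-zero vectors; `cosine_distance` and `top_matches` are hypothetical helpers for illustration, not DocArray functions, and the library's own implementation will differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_matches(
    query: Sequence[float],
    index: Sequence[Sequence[float]],
    limit: int = 9,
) -> List[Tuple[int, float]]:
    """Return (position, distance) for the `limit` closest index vectors."""
    scored = [(i, cosine_distance(query, vec)) for i, vec in enumerate(index)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:limit]


# Four toy 2-D "embeddings" standing in for the image embeddings.
index_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = top_matches([1.0, 0.05], index_vectors, limit=2)
print([i for i, _ in nearest])
```

A brute-force scan like this is fine for a dataset of about a thousand images; larger collections usually call for an approximate nearest-neighbor index.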

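View the results

The model matches on the whole input image, so the matches here even take the human model wearing the clothes into account.

We want the model to match on the clothing alone, which is what Finetuner, Jina AI's tool for tuning results, is used for.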

(DocumentArray(query_doc.matches, copy=True)
    .apply(lambda d: d.set_image_tensor_channel_axis(0, -1)
                      .set_image_tensor_inv_normalization())).plot_image_sprites()

if PLOT_EMBEDDINGS:
    query_doc.matches.plot_embeddings(image_sprites=True, image_source="uri")

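Coming up in the advanced tutorials

1. Fine-tune the model

In a later notebook, we will show how to improve the model's performance with Jina Finetuner.

2. Build an application

In a later tutorial, we will demonstrate how to use Jina's neural search framework and Jina Hub Executors to build and scale a search engine.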

Related links:

Jina Hub: https://hub.jina.ai/

Jina GitHub: https://github.com/jina-ai/jina/

Finetuner: https://finetuner.jina.ai/

Join Slack: https://slack.jina.ai/

View all of the code above in Colab:

https://reurl.cc/RjLy5z
