Another big bang? Meta AI open-sources DINOv2, a new self-supervised foundation model: no fine-tuning required, it sweeps the leaderboards on multiple downstream tasks and fills the gap SAM leaves outside the segmentation domain
Links🔗
Paper: DINOv2: Learning Robust Visual Features without Supervision (arxiv.org)
Code: facebookresearch/dinov2: PyTorch code and models for the DINOv2 self-supervised learning method. (github.com)
Demo: DINOv2 by Meta AI (metademolab.com)
Foreword📑
“DINOv2 complements our other recent computer vision research, including Segment Anything. Segment Anything is a promptable segmentation system focused on zero-shot generalization to diverse set of segmentation tasks. DINOv2 combines with simple linear classifiers to achieve strong results across multiple tasks beyond the segmentation sub-field, creating horizontal impact.”
A little while ago, Meta AI made a splash with Segment Anything (SAM). SAM generates masks quickly and interactively, segments images it was never trained on with impressive precision, and can pick out a specific object in an image from a text prompt or a user's click; that flexibility was a first in image segmentation.
However, at the end of the day SAM is a promptable segmentation system: it mainly serves segmentation tasks, and its benefit for other vision tasks (e.g. classification, retrieval, VQA…) is less direct.
So, following Segment Anything, Meta AI has released another heavyweight open-source project: DINOv2. DINOv2 extracts powerful image features and needs no fine-tuning on downstream tasks, which makes it well suited as a new backbone for many different applications.
DINOv2 delivers strong performance and does not require fine-tuning. This makes it suitable for use as a backbone for many different computer vision tasks.
DINOv2 is able to take a video and generate a higher-quality segmentation than the original DINO method. DINOv2 allows remarkable properties to emerge, such as a robust understanding of object parts, and robust semantic and low-level understanding of images.
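As the quote above notes, DINOv2 is designed to be used as a frozen backbone. Below is a minimal sketch of loading a checkpoint and extracting image features via the torch.hub entry points published in the facebookresearch/dinov2 repo (model name and output shape reflect the repo at the time of writing):

```python
import torch

# Load a distilled DINOv2 backbone through torch.hub
# (entry points are published in the facebookresearch/dinov2 repo).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

# The patch size is 14, so input height/width should be multiples of 14.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feats = model(x)  # global image embedding; (1, 384) for ViT-S/14

print(feats.shape)
```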
This DINOv2 release also had Zuckerberg himself fronting it, drawing plenty of attention right from the start. (He couldn't resist closing with a mention of his beloved Metaverse🤣)
With Zuckerberg handling the promotion personally, DINOv2 picked up 2k+ stars within a day of its release.
Key Features✨
Meta AI's official blog summarizes DINOv2's features as follows:
- Meta AI has built DINOv2, a new method for training high-performance computer vision models.
- DINOv2 delivers strong performance and does not require fine-tuning. This makes it suitable for use as a backbone for many different computer vision tasks.
- Because it uses self-supervision, DINOv2 can learn from any collection of images. It can also learn features, such as depth estimation, that the current standard approach cannot.
DINOv2 is a new method for training high-performance computer vision models, using self-supervised learning to match or exceed the standard approaches in the field. Like other self-supervised systems, models trained with the DINOv2 method can be trained on any collection of images without needing any associated metadata. This means it can learn from all the images it is given, not just those carrying a specific set of labels, alt text, or captions. DINOv2 provides high-performance features that can be fed directly into simple linear classifiers. That flexibility means DINOv2 can serve as a general-purpose backbone for many different computer vision tasks.
The experiments in the paper show DINOv2's excellent downstream abilities in areas such as classification, segmentation, and image retrieval. Most strikingly, on depth estimation DINOv2's results clearly outperform both the in-domain and the out-of-domain SOTA pipelines. The authors attribute this strong out-of-domain performance to combining self-supervised feature learning with lightweight task-specific modules such as linear classifiers.
Finally, since no fine-tuning is involved, the backbone stays general-purpose, and the same features can be reused across many different tasks at once.
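To make the "frozen features + simple linear classifier" recipe concrete, here is a minimal linear-probing sketch. The backbone loading follows the repo's torch.hub entry point; the dummy batch, feature dimension (384 for ViT-S/14), and training loop are illustrative assumptions, not the paper's exact evaluation protocol:

```python
import torch
import torch.nn as nn

# Frozen backbone: no gradients, no adaptation.
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Only this linear head is trained (384 = ViT-S/14 embedding dim).
num_classes = 10
head = nn.Linear(384, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Dummy batch standing in for a real labeled DataLoader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,)))]

for images, labels in loader:
    with torch.no_grad():
        feats = backbone(images)          # frozen DINOv2 features
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```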
Research Overview📖
We won't unpack DINOv2's algorithmic details here; instead, here is a brief look at what DINOv2 actually did (my own reading, discussion welcome):
Building a new high-quality dataset
Building a large, curated, and diverse dataset to train the models
In today's era of large models, pushing performance further usually means training bigger models on more data. Since no high-quality dataset large enough for DINOv2's training needs existed, Meta AI assembled a new one by retrieving, from a huge pool of uncurated data, images similar to those in several curated datasets.
This approach enabled us to produce a pretraining dataset totaling 142 million images out of the 1.2 billion source images.
Through this pipeline, Meta AI distilled 142 million curated images out of 1.2 billion source images, naming the result the LVD-142M dataset.
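The curation pipeline is built around embedding-based similarity search (the paper uses Faiss). The sketch below illustrates only the core retrieval step, with random stand-in embeddings; the array sizes, k, and normalization are illustrative assumptions, and the real pipeline also deduplicates and filters the pool:

```python
import numpy as np
import faiss

d = 384  # embedding dimension (stand-in value)
rng = np.random.default_rng(0)
curated = rng.standard_normal((1_000, d)).astype('float32')
uncurated = rng.standard_normal((100_000, d)).astype('float32')
faiss.normalize_L2(curated)
faiss.normalize_L2(uncurated)

# Inner product on L2-normalized vectors == cosine similarity.
index = faiss.IndexFlatIP(d)
index.add(uncurated)

# For each curated image, pull its k nearest uncurated neighbors
# into the training pool.
k = 4
_, neighbor_ids = index.search(curated, k)
selected = np.unique(neighbor_ids.ravel())
print(f'selected {selected.size} of {len(uncurated)} uncurated images')
```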
Distilling strong lightweight models
Large models are great, but their hardware and compute demands are steep (only big companies and big labs can afford to play🥲); we keep hoping for strong, lightweight models with a lower barrier to entry.
Meta AI therefore used model distillation to compress the large model's knowledge into smaller ones, so that follow-up researchers can cut inference costs dramatically at a minimal cost in accuracy. The resulting ViT-Small, ViT-Base, and ViT-Large models also generalize well on downstream tasks, as the experimental results later show.
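For intuition, here is a generic feature-distillation sketch with toy encoders. Note this is not the actual DINOv2 recipe (which reuses its self-supervised objective with a frozen ViT-g teacher); it only illustrates the frozen-teacher / trainable-student idea:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a "large" frozen teacher and a smaller trainable student.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1024)).eval()
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384))
proj = nn.Linear(384, 1024)  # align student features with teacher width

for p in teacher.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()),
                        lr=1e-4)

images = torch.randn(4, 3, 32, 32)
with torch.no_grad():
    t = F.normalize(teacher(images), dim=-1)  # frozen teacher features
s = F.normalize(proj(student(images)), dim=-1)

loss = (1 - (s * t).sum(dim=-1)).mean()       # cosine feature-matching loss
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```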
Releasing a family of high-performance pretrained models
Most importantly, Meta AI has released a family of DINOv2 pretrained models to the community:
We release DINOv2 pretrained models to the community with a matching stable, accurate, and scaled implementation: We share pretraining code and recipe for ViT-L/16 (300 M params) and ViT-g/14 (1.1 B params) architectures, as well as checkpoints for a range of pretrained models from the larger ViT-g/14 down to smaller distilled models (ViT-S/14, ViT-B/14 and ViT-L/14). The performance of our approach is competitive or better than the performance of text-image models such as CLIP and OpenCLIP on a wide array of tasks, some of which are illustrated in our demo. Don’t hesitate to play with it! Our features can be used out of the box for nearest neighbor classification or paired with linear classification, yielding strong performance. DINOv2 allows skipping the model adaptation phase (fine-tuning) — our linear evaluation performance is close to their fine-tuned counterpart (within 2 percent on ImageNet-1k) .
As a feature extractor, DINOv2 works out of the box: without any fine-tuning it achieves very good results on multiple downstream tasks (on ImageNet-1k, linear evaluation trails fine-tuning by less than 2%).
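The quote above also mentions that the features can be used "out of the box for nearest neighbor classification". A minimal k-NN sketch over precomputed embeddings (random stand-ins here; in practice they would come from the frozen backbone):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for frozen DINOv2 embeddings of a labeled support set.
support = F.normalize(torch.randn(100, 384), dim=-1)
labels = torch.randint(0, 10, (100,))
query = F.normalize(torch.randn(1, 384), dim=-1)

k = 5
sims = query @ support.T                   # cosine similarities
topk = sims.topk(k).indices.squeeze(0)     # k nearest neighbors
pred = labels[topk].mode().values          # majority vote
print(pred.item())
```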
Algorithmic and technical improvements
With more training data, larger models perform better than smaller ones, but their training poses two major challenges. First, increasing the model size makes the training more challenging because of potential instability. In DINOv2, we included additional regularization methods inspired by the similarity search and classification literature, making the training algorithm much more stable. Second, in order to remain tractable, larger models require more efficient implementations. The DINOv2 training code integrates the latest mixed-precision and distributed training implementations proposed in the cutting-edge PyTorch 2 (fully sharded data parallel), an efficient implementation of the stochastic depth technique, as well as the latest compute algorithm implementations of xFormers (in particular, variable-length memory-efficient attention). This allows faster and more efficient iteration cycles. Overall, with equivalent hardware, our code runs around twice as fast with only a third of the memory usage, allowing scaling in data, model size, and hardware.
By adopting PyTorch 2's fully sharded data parallel training, distributed training, and mixed-precision training, along with xFormers' variable-length memory-efficient attention, the new code runs roughly twice as fast on the same hardware while using only a third of the memory, letting DINOv2 scale more efficiently across data, model size, and hardware.
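As a small taste of these optimizations, here is a minimal mixed-precision training step with torch.autocast and a gradient scaler (the FSDP sharding and xFormers attention kernels mentioned above need a multi-GPU setup and are omitted from this sketch):

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = 'cuda' if use_cuda else 'cpu'

model = nn.Sequential(nn.Linear(384, 384), nn.GELU(),
                      nn.Linear(384, 10)).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op on CPU

x = torch.randn(8, 384, device=device)
y = torch.randint(0, 10, (8,), device=device)

# Forward pass in fp16 where safe; falls back to fp32 on CPU.
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=use_cuda):
    loss = nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # loss scaling avoids fp16 gradient underflow
scaler.step(opt)
scaler.update()
```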
Playing with the Demos😁
Meta has also put web demos for depth estimation, semantic segmentation, and instance retrieval on its site, usable immediately with no sign-up (now that's an "Open" AI).
Link: DINOv2 by Meta AI (metademolab.com)
Here are my own results from trying them out:
Depth Estimation
Pretrained models rarely showcase their depth-estimation abilities, which itself speaks to DINOv2's strong out-of-distribution performance.
Here I deliberately picked a night scene under non-natural lighting as the test, and the results were genuinely impressive!
Semantic Segmentation
DINOv2's frozen features can easily be put to work on semantic segmentation tasks.
This demo is plain semantic segmentation, though; it isn't as playable as SAM is for segmentation tasks.
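As a sketch of what "frozen features for segmentation" can look like: a linear head over frozen patch tokens, upsampled back to the input resolution. The forward_features call and the 'x_norm_patchtokens' key follow the repo's model code at the time of writing; the head and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

num_classes = 21                      # e.g. a VOC-style label set (illustrative)
head = nn.Linear(384, num_classes)    # per-patch linear classifier

x = torch.randn(1, 3, 224, 224)       # 224 / 14 = 16 patches per side
with torch.no_grad():
    tokens = backbone.forward_features(x)['x_norm_patchtokens']  # (1, 256, 384)

logits = head(tokens)                                   # (1, 256, num_classes)
seg = logits.permute(0, 2, 1).reshape(1, num_classes, 16, 16)
seg = nn.functional.interpolate(seg, size=(224, 224), mode='bilinear',
                                align_corners=False)
print(seg.shape)                                        # (1, 21, 224, 224)
```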
Instance Retrieval
This is the demo I find most interesting: it retrieves artworks similar to a given picture from a large collection of art images.
Here I uploaded a photo of the Yellow Crane Tower as the query:
And here are DINOv2's results; semantically they feel very close (each one has a tall tower or pavilion😂).
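Under the hood, this kind of retrieval boils down to cosine similarity between frozen embeddings. A minimal sketch (random tensors stand in for the query photo and the artwork gallery):

```python
import torch
import torch.nn.functional as F

backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

query = torch.randn(1, 3, 224, 224)     # stand-in for the query photo
gallery = torch.randn(16, 3, 224, 224)  # stand-in for the artwork collection

with torch.no_grad():
    q = F.normalize(backbone(query), dim=-1)
    g = F.normalize(backbone(gallery), dim=-1)

scores = (q @ g.T).squeeze(0)   # cosine similarity to each gallery image
top3 = scores.topk(3).indices   # indices of the closest artworks
print(top3.tolist())
```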
Future Directions🗒️
Meta AI also laid out the team's future research directions:
Going forward, the team plans to integrate this model, which can function as a building block, in a larger, more complex AI system that could interact with large language models. A visual backbone providing rich information on images will allow complex AI systems to reason on images in a deeper way than describing them with a single text sentence. Models trained with text supervision are ultimately limited by the image captions. With DINOv2, there is no such built-in limitation.
In short: combine DINOv2 with large language models (LLMs) and push toward general-purpose AI and complex AI systems. (Fun to watch, but that's work for the big labs after all🥲)
Closing Thoughts🔚
DINOv2 shows us a major step forward for self-supervised learning in CV, and across a range of tasks it demonstrates strong generalization as a general-purpose vision backbone. We can expect to see much more research built on DINOv2.
MMPreTrain
If you are interested in pretrained foundation models like DINOv2, consider checking out MMPreTrain, OpenMMLab's open-source deep-learning pre-training toolbox: open-mmlab/mmpretrain: OpenMMLab Pre-training Toolbox and Benchmark (github.com)
MMPreTrain covers a wide range of backbone networks and pretrained models and supports multiple training strategies (supervised, self-supervised, and more). The self-supervised algorithms it includes are the classic methods of the last couple of years, so we can hope to see DINOv2 land there as well😊
References
- Mark Zuckerberg – Continuing our work to open source more… | Facebook
- DINOv2: State-of-the-art computer vision models with self-supervised learning (facebook.com)