Generative AI New World: A Code Walkthrough of How Diffusion Models Work, Part 1 (Sampling)

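In the previous article we explored fine-tuning the Falcon 40B large language model on Amazon SageMaker Studio with quantization techniques such as QLoRA. Starting with this article, we will go a level deeper and keep exploring the fast-moving field of generative AI.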


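We currently plan to cover three broad directions:

In-depth code practice. For example, explaining end to end in code how diffusion models work, or walking through the complete Transformer architecture;
Model deployment and training optimization. For example, looking at the latest progress in LMI, DeepSpeed, Accelerate, FlashAttention and other optimization approaches;
Model quantization practice. For example, the principles and practice of recent quantization methods such as GPTQ and bitsandbytes.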

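In the twelve articles serialized so far, apart from introducing the main principles of generative AI and large language models (LLMs) through papers, the hands-on parts were mostly limited to loading pre-trained models, fine-tuning on top of them, and calling APIs. Many experienced researchers told us through various channels that this was not enough and that they hoped for more depth.

So this article opens the first series in the in-depth code practice direction, the "How Diffusion Models Work" code series. It tries to understand diffusion models from the ground up, entirely in code, rather than stopping at loading a pre-trained model or calling an API.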

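Overview of the diffusion model series

Large models built on diffusion models, such as Stable Diffusion, Midjourney and DALL-E, can generate images from nothing more than a text prompt. With this series we want to use code to explore and explain the algorithms behind these applications.

In this four-article series, we will:

Explore the frontier of diffusion-based generative AI and build our own diffusion model from scratch
Understand the diffusion process and the model that drives it, rather than only pre-built models and APIs
Gain practical coding skills by sampling, training a diffusion model, building a neural network for noise prediction, and adding context for personalized image generation
End the series with a model that can serve as a starting point for further work with diffusion models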

Over four installments I will build a diffusion model line by line. The four parts are:

Sampling
Training
Embedding & Adding Context
Fast Sampling

The complete code for all four parts is in my personal GitHub repository:

https://github.com/hanyun2019/difussion-model-code-implementation?trk=cndc-detail

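This article is the first part: Sampling.

The goal of a diffusion model

There is an old Chinese saying that every action begins with an intention. So before lifting the veil on diffusion models, we should first be clear about one question: what is a diffusion model for?

This chapter discusses that goal, and how a training set of game character images (sprites, in our case) teaches the model enough that it can then generate more game character images by itself (for example, sprites in a particular style).

Suppose the images below are the sprite datasets you already have (the FrootsnVeggies and kyrise sprite sets from ElvGames), and you want many more sprites that are not in these datasets. How would you accomplish this seemingly impossible task?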

《FrootsnVeggies》
《kyrise》

Source: Sprites by ElvGames

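This is where a diffusion model helps. You have a lot of training data, such as the game sprites shown here, and you want more sprites that are not in the training set. A neural network that follows the diffusion process can generate them for you, and that is the topic of this series.

Taking this sprite dataset as an example, a diffusion model can learn the general features of sprites, such as body outline, hair color, or even details like a belt.

What does it mean for a neural network to learn the concept of a sprite? It may be fine details, such as hair color or a belt; it may be the general outline, such as the shape of the head or the body; or it may be anything in between. One way to make the network attend to both the fine details and the outline is to add different levels of noise to the data. Adding noise to an image in this way is called the "noising process".

The idea is inspired by physics, and it resembles a drop of ink falling into a glass of clear water. At first we know exactly where the ink landed; as time passes, it diffuses through the water until it disappears (or rather, until it is completely mixed with the water).

As the figure below shows, we start from the leftmost image, "Bob the Sprite", and as noise is added it fades until we can no longer tell which sprite it is.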

Source: How Diffusion Models Work,https://learn.deeplearning.ai/diffusion-models/lesson/2/intuition?trk=cndc-detail ,by DeepLearning.AI

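Using Bob as the example, here is the whole process of adding noise to the sprite training data, stage by stage.

At the leftmost image, "Bob the Sprite!", we want the neural network to know: "This is Bob, and he is a sprite."

At "Probably Bob", we want the network to know that there is some noise here, but that from the remaining details it still looks like "Bob the Sprite!".

At "Well, Bob or Fred", only a blurry outline of a sprite is visible. It is probably a sprite, but it could be Bob, Fred, or Nance. Here we want the network to suggest more general details: given the outline, some details that would fit Bob, or some that would fit Fred.

At the last image, "No Idea", no feature of the image can be recognized any more, yet we still want the result to look more like a sprite. We want to tell the network: "Take this completely noisy image and make it more sprite-like by sketching the outline of what a sprite might look like."

That is the whole noising process: noise is added gradually over time, just as a drop of ink spreads completely through a glass of water. The neural network we need to train is one that can turn images at any of these noise levels into good-looking sprites. That is our goal, and the goal of a diffusion model.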
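The four stages above can be imitated with a few lines of plain Python. This is only a minimal sketch: the 4-pixel "sprite", the function name add_noise, and the four keep levels are hypothetical values chosen for illustration, not numbers from the course or the notebook. The blend keeps a fraction `keep` of the signal variance and fills the rest with Gaussian noise.

```python
import math
import random

def add_noise(pixels, keep, rng):
    """Blend an image with Gaussian noise.

    keep = 1.0 returns the clean image, keep = 0.0 returns pure noise.
    """
    signal_scale = math.sqrt(keep)
    noise_scale = math.sqrt(1.0 - keep)
    return [signal_scale * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]

rng = random.Random(0)
bob = [1.0, -1.0, 0.5, -0.5]  # hypothetical 4-pixel "sprite", values in [-1, 1]

# Hypothetical noise levels, loosely matching the four stages described above.
stages = {
    "Bob the Sprite!": 1.0,
    "Probably Bob": 0.7,
    "Well, Bob or Fred": 0.3,
    "No Idea": 0.0,
}
noised = {name: add_noise(bob, keep, rng) for name, keep in stages.items()}
for name, pixels in noised.items():
    print(f"{name:>18}: {[round(p, 2) for p in pixels]}")
```

At keep = 1.0 the output is the original image, and at keep = 0.0 nothing of the original is left, which is the "No Idea" stage.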

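For the network to do that, it has to learn to remove the added noise: starting from the "No Idea" image (pure noise), to something that looks like it might contain a sprite, to something that resembles Bob, and finally to Bob himself.

One point deserves emphasis: the noise in the "No Idea" image matters, because it is normally distributed. In other words, every pixel of that image is sampled from a normal distribution (also called a Gaussian distribution).

So when you want the network to generate a new sprite, say Fred, you sample noise from that normal distribution and let the network remove the noise step by step until a brand-new sprite appears. You can obtain many more sprites beyond the ones you trained on.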
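As a small illustration of "every pixel is sampled from a normal distribution", the sketch below draws a 3 x 16 x 16 noise image (the image size used later in this article) with the Python standard library and checks that its mean is close to 0 and its standard deviation close to 1. The helper name sample_noise_image is an assumption made for this example; the notebook itself uses torch.randn for the same purpose.

```python
import random
import statistics

def sample_noise_image(channels, height, width, rng):
    """Draw every pixel independently from the standard normal N(0, 1)."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]

rng = random.Random(42)
noise = sample_noise_image(3, 16, 16, rng)  # channels x height x width

# Flatten the nested lists to inspect the overall statistics.
flat = [p for channel in noise for row in channel for p in row]
print("pixels:", len(flat))
print("mean  :", round(statistics.mean(flat), 3))
print("std   :", round(statistics.pstdev(flat), 3))
```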

Source:How Diffusion Models Work, https://learn.deeplearning.ai/diffusion-models/lesson/2/intuition?trk=cndc-detail ,by DeepLearning.AI

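Congratulations, you now have a method, in theory, for generating any number of new sprites. Next comes the code.

In the next chapter we will show in code how extra noise is deliberately added at each iteration so that the sample stays normally distributed, and compare the output with that of a sampler that adds no extra noise. It is a fun and memorable way to see how diffusion models work.

Sampling: noise sampling in code

We start with sampling: what it involves in detail, and how it works across many iterations.

1. Create an Amazon SageMaker Notebook instance

To save space, this article does not repeat how to create an Amazon SageMaker Notebook instance.

For details, see the official documentation: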

https://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/gs-setup-working-env.html?trk=cndc-detail

2. About the code

The complete sample code for this experiment is at:

https://github.com/hanyun2019/difussion-model-code-implementation/blob/dm-project-haowen-mac/L1_Sampling.ipynb?trk=cndc-detail

The sample notebook was tested on Amazon SageMaker Notebook with the conda_pytorch_p310 kernel on an ml.g5.2xlarge instance, as shown in the figure below.

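3. The sampling process

Suppose you have a noise sample and feed it into a neural network that has already been trained. The network already knows what sprites look like, and its main job now is to predict noise. Note that it predicts the noise, not the sprite image. We then subtract the predicted noise from the noise sample to obtain an output that looks more like a sprite.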

Source: How Diffusion Models Work, https://learn.deeplearning.ai/diffusion-models/lesson/2/intuition?trk=cndc-detail ,by DeepLearning.AI

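Because this is only a prediction of the noise, it does not remove all of the noise in one go, so several steps are needed to obtain a high-quality sample. For example, after 500 such iterations we expect an output that looks very much like a sprite.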
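A one-pixel example shows why a single subtraction is fragile. The sketch assumes the usual forward relation x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise; the pixel value, the noise value, and alpha_bar = 0.01 are hypothetical numbers picked for illustration. With a perfect noise prediction one step would recover x0, but a prediction that is only 10% off leaves a large error when the input is almost pure noise.

```python
import math

def one_shot_estimate(x_t, pred_noise, alpha_bar):
    """Invert x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise for x0."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * pred_noise) / math.sqrt(alpha_bar)

x0 = 0.8            # hypothetical clean pixel value
true_noise = -1.3   # the noise that was actually added
alpha_bar = 0.01    # almost pure noise: only 1% of the signal variance is left

x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * true_noise

perfect = one_shot_estimate(x_t, true_noise, alpha_bar)
imperfect = one_shot_estimate(x_t, 0.9 * true_noise, alpha_bar)  # 10% prediction error

print(f"true pixel         : {x0:.3f}")
print(f"perfect prediction : {perfect:.3f}")
print(f"10% off prediction : {imperfect:.3f}")
```

Dividing by sqrt(alpha_bar) amplifies any prediction error at high noise levels, which is one way to see why the sampler removes noise gradually over many steps rather than all at once.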

Source: How Diffusion Models Work,https://learn.deeplearning.ai/diffusion-models/lesson/2/intuition?trk=cndc-detail ,by DeepLearning.AI

Let us first look at some pseudocode to get a bird's-eye view of the overall logic:

Source: How Diffusion Models Work, https://learn.deeplearning.ai/diffusion-models/lesson/2/intuition?trk=cndc-detail ,by DeepLearning.AI

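We start the journey by drawing a random noise sample.

If you have seen any time-travel films, the whole process feels much like travelling through time. Picture a glass of ink: we are stepping backwards in time, from ink that is fully diffused all the way back to the moment the first drop fell into clear water.

Then we sample some extra noise. Why extra noise is needed is an interesting topic, and we come back to it in detail later in this article.

Next, the current sample (initially the original noise) is passed to the neural network, which returns a predicted noise. This predicted noise is what the trained network wants to subtract from the sample so that the result looks more like a sprite.

Finally, the update uses a sampling algorithm called DDPM, which stands for Denoising Diffusion Probabilistic Models.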
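The steps described above can be sketched as a runnable loop in plain Python. This is a toy under stated assumptions, not the notebook's implementation: the "image" is a single number, and the trained network is replaced by a hypothetical closed-form predict_noise that is exact only when the whole dataset consists of the single value DATA_POINT. The schedule and the update formula mirror the ones used later in this article.

```python
import math
import random

TIMESTEPS = 500
BETA1, BETA2 = 1e-4, 0.02

# Linear noise schedule, indexed 0..TIMESTEPS.
b_t = [BETA1 + (BETA2 - BETA1) * i / TIMESTEPS for i in range(TIMESTEPS + 1)]
a_t = [1.0 - b for b in b_t]
ab_t = []
running = 1.0
for a in a_t:
    running *= a            # cumulative product of a_t
    ab_t.append(running)
ab_t[0] = 1.0

DATA_POINT = 0.5  # hypothetical one-value "dataset"

def predict_noise(x, t):
    """Stand-in for the trained network: exact only for a one-value dataset."""
    return (x - math.sqrt(ab_t[t]) * DATA_POINT) / math.sqrt(1.0 - ab_t[t])

def sample(rng):
    x = rng.gauss(0.0, 1.0)                        # start from pure noise
    for t in range(TIMESTEPS, 0, -1):              # walk backwards in time
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # extra noise, none on the last step
        eps = predict_noise(x, t)                  # predicted noise
        mean = (x - eps * (1.0 - a_t[t]) / math.sqrt(1.0 - ab_t[t])) / math.sqrt(a_t[t])
        x = mean + math.sqrt(b_t[t]) * z           # remove predicted noise, add extra noise
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(5)]
print([round(s, 3) for s in samples])
```

Every run starts from a different random number and ends close to the single training value, which is all this stand-in predictor can do. A real network trained on many sprites would instead land on different plausible sprites.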

4. Import the required libraries

Now we move on to the code. First we import PyTorch and some related utility libraries, along with helper functions that help us build the neural network.

from typing import Dict, Tuple
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import models, transforms
from torchvision.utils import save_image, make_grid
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter
import numpy as np
from IPython.display import HTML
from diffusion_utilities import *

5. Neural network architecture

Now we set up the neural network that we will use for sampling.

class ContextUnet(nn.Module):
    def __init__(self, in_channels, n_feat=256, n_cfeat=10, height=28):  # cfeat - context features
        super(ContextUnet, self).__init__()

        # number of input channels, number of intermediate feature maps and number of classes
        self.in_channels = in_channels
        self.n_feat = n_feat
        self.n_cfeat = n_cfeat
        self.h = height  # assume h == w. must be divisible by 4, so 28,24,20,16...

        # Initialize the initial convolutional layer
        self.init_conv = ResidualConvBlock(in_channels, n_feat, is_res=True)

        # Initialize the down-sampling path of the U-Net with two levels
        self.down1 = UnetDown(n_feat, n_feat)        # down1: (batch, n_feat, h/2, h/2)
        self.down2 = UnetDown(n_feat, 2 * n_feat)    # down2: (batch, 2*n_feat, h/4, h/4)

        # original: self.to_vec = nn.Sequential(nn.AvgPool2d(7), nn.GELU())
        self.to_vec = nn.Sequential(nn.AvgPool2d((4)), nn.GELU())

        # Embed the timestep and context labels with a one-layer fully connected neural network
        self.timeembed1 = EmbedFC(1, 2*n_feat)
        self.timeembed2 = EmbedFC(1, 1*n_feat)
        self.contextembed1 = EmbedFC(n_cfeat, 2*n_feat)
        self.contextembed2 = EmbedFC(n_cfeat, 1*n_feat)

        # Initialize the up-sampling path of the U-Net with three levels
        self.up0 = nn.Sequential(
            nn.ConvTranspose2d(2 * n_feat, 2 * n_feat, self.h//4, self.h//4),  # up-sample
            nn.GroupNorm(8, 2 * n_feat),  # normalize
            nn.ReLU(),
        )
        self.up1 = UnetUp(4 * n_feat, n_feat)
        self.up2 = UnetUp(2 * n_feat, n_feat)

        # Initialize the final convolutional layers to map to the same number of channels as the input image
        self.out = nn.Sequential(
            nn.Conv2d(2 * n_feat, n_feat, 3, 1, 1),  # reduce number of feature maps (kernel_size=3, stride=1, padding=1)
            nn.GroupNorm(8, n_feat),  # normalize
            nn.ReLU(),
            nn.Conv2d(n_feat, self.in_channels, 3, 1, 1),  # map to same number of channels as input
        )

    def forward(self, x, t, c=None):
        """
        x : (batch, in_channels, h, w) : input image
        t : time step, scaled to [0, 1]
        c : (batch, n_cfeat)           : context label (optional)
        """
        # pass the input image through the initial convolutional layer
        x = self.init_conv(x)
        # pass the result through the down-sampling path
        down1 = self.down1(x)       # (batch, n_feat, h/2, h/2)
        down2 = self.down2(down1)   # (batch, 2*n_feat, h/4, h/4)

        # convert the feature maps to a vector and apply an activation
        hiddenvec = self.to_vec(down2)

        # use an all-zero context when none is given
        if c is None:
            c = torch.zeros(x.shape[0], self.n_cfeat).to(x)

        # embed context and timestep
        cemb1 = self.contextembed1(c).view(-1, self.n_feat * 2, 1, 1)     # (batch, 2*n_feat, 1, 1)
        temb1 = self.timeembed1(t).view(-1, self.n_feat * 2, 1, 1)
        cemb2 = self.contextembed2(c).view(-1, self.n_feat, 1, 1)
        temb2 = self.timeembed2(t).view(-1, self.n_feat, 1, 1)
        #print(f"uunet forward: cemb1 {cemb1.shape}. temb1 {temb1.shape}, cemb2 {cemb2.shape}. temb2 {temb2.shape}")

        up1 = self.up0(hiddenvec)
        up2 = self.up1(cemb1*up1 + temb1, down2)  # add and multiply embeddings
        up3 = self.up2(cemb2*up2 + temb2, down1)
        out = self.out(torch.cat((up3, x), 1))
        return out
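To see why the height has to be divisible by 4, it can help to trace the side length of the feature map through the network. The sketch below does this with plain integers. It rests on assumptions about the helpers in diffusion_utilities that are not shown in this article: UnetDown is assumed to halve the resolution and UnetUp to double it, which is what the shape comments in the class suggest. The function name trace_spatial_size is made up for this example.

```python
def trace_spatial_size(height):
    """Return (stage name, side length) pairs for a square input of the given height."""
    if height % 4 != 0:
        raise ValueError("height must be divisible by 4")
    sizes = [("input / init_conv", height)]
    size = height // 2                    # down1: assumed to halve the resolution
    sizes.append(("down1", size))
    size = size // 2                      # down2: assumed to halve it again
    sizes.append(("down2", size))
    size = size // 4                      # to_vec: AvgPool2d(4), default stride 4
    sizes.append(("to_vec", size))
    kernel = stride = height // 4         # up0: ConvTranspose2d(kernel=h//4, stride=h//4)
    size = (size - 1) * stride + kernel   # transposed-convolution output size, no padding
    sizes.append(("up0", size))
    size = size * 2                       # up1: assumed to double the resolution
    sizes.append(("up1", size))
    size = size * 2                       # up2: assumed to double it again
    sizes.append(("up2 / out", size))
    return sizes

for name, size in trace_spatial_size(16):
    print(f"{name:>18}: {size} x {size}")
```

For the 16 x 16 sprites used here, the trace goes 16, 8, 4, 1, 4, 8, 16, so the skip connections from down2 and down1 line up with up0 and up1 before they are concatenated.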

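6. Set the model hyperparameters

Next we set the hyperparameters the model needs, including the number of timesteps and the image size.

The DDPM paper defines a noise schedule, which determines how much noise is applied to the image at a given timestep. This part of the code therefore just builds the DDPM parameters behind the scaling factors you may remember, S1, S2 and S3, which are computed from the noise schedule. It is called a "schedule" because it depends on the timestep.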

Source: How Diffusion Models Work, https://learn.deeplearning.ai/diffusion-models/lesson/2/intuition?trk=cndc-detail ,by DeepLearning.AI

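Hyperparameters:

beta1: DDPM hyperparameter, the noise level at the start of the schedule
beta2: DDPM hyperparameter, the noise level at the end of the schedule
height: the width and height of the (square) image
noise schedule: determines the noise level applied to the image at a given timestep
S1, S2, S3: the values of the scaling factors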
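For reference, the update that these scaling factors belong to can be written out as follows. The mapping of S1, S2 and S3 to the schedule is a reading of the denoise_add_noise helper used later in this article, not a quotation from the course slides; here beta_t corresponds to b_t in the code, alpha_t to a_t, and \bar{\alpha}_t to ab_t.

```latex
x_{t-1} = s_1 \left( x_t - s_2 \, \epsilon_\theta(x_t, t) \right) + s_3 \, z,
\qquad z \sim \mathcal{N}(0, I)

s_1 = \frac{1}{\sqrt{\alpha_t}}, \qquad
s_2 = \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}, \qquad
s_3 = \sqrt{\beta_t}

\alpha_t = 1 - \beta_t, \qquad
\bar{\alpha}_t = \prod_{s \le t} \alpha_s
```

Here \epsilon_\theta(x_t, t) is the noise predicted by the network and z is the extra noise, which is set to zero on the final step.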

As the code below shows, we set the number of timesteps to 500; the image size parameter height to 16, meaning a 16 x 16 square image; and the DDPM hyperparameters beta1 and beta2.

# hyperparameters

# diffusion hyperparameters
timesteps = 500
beta1 = 1e-4
beta2 = 0.02

# network hyperparameters
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
n_feat = 64  # 64 hidden dimension feature
n_cfeat = 5  # context vector is of size 5
height = 16  # 16x16 image
save_dir = './weights/'

Remember that you go through 500 steps, because you run the 500 iterations of slowly removing noise that you see here.

Source: How Diffusion Models Work, https://learn.deeplearning.ai/diffusion-models/lesson/2/intuition?trk=cndc-detail ,by DeepLearning.AI

The following code block builds the noise schedule defined in the DDPM paper:

# construct DDPM noise schedule
b_t = (beta2 - beta1) * torch.linspace(0, 1, timesteps + 1, device=device) + beta1
a_t = 1 - b_t
ab_t = torch.cumsum(a_t.log(), dim=0).exp()
ab_t[0] = 1
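To get a feel for the numbers, the same three quantities can be recomputed with plain Python lists and printed at a few timesteps. This is a sketch for inspection only: the names s1, s2 and s3 follow the scaling factors mentioned above, and their formulas are read off the denoise_add_noise helper that appears later in this article.

```python
import math

timesteps = 500
beta1, beta2 = 1e-4, 0.02

# Same three quantities as the PyTorch snippet above, as plain lists.
b_t = [beta1 + (beta2 - beta1) * i / timesteps for i in range(timesteps + 1)]
a_t = [1.0 - b for b in b_t]
ab_t = []
log_sum = 0.0
for a in a_t:
    log_sum += math.log(a)          # cumulative sum of logs ...
    ab_t.append(math.exp(log_sum))  # ... exponentiated gives the cumulative product
ab_t[0] = 1.0

for t in (1, 250, 500):
    s1 = 1.0 / math.sqrt(a_t[t])
    s2 = (1.0 - a_t[t]) / math.sqrt(1.0 - ab_t[t])
    s3 = math.sqrt(b_t[t])
    print(f"t={t:3d}  beta={b_t[t]:.5f}  alpha_bar={ab_t[t]:.5f}  "
          f"s1={s1:.4f}  s2={s2:.4f}  s3={s3:.4f}")
```

b_t rises linearly from beta1 to beta2, while ab_t falls from 1 to well below 0.01 at the last timestep, meaning almost none of the original image is left by then.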

Next, instantiate the model:

# construct model
nn_model = ContextUnet(in_channels=3, n_feat=n_feat, n_cfeat=n_cfeat, height=height).to(device)

7. Output test with extra noise added

The first test adds extra noise. Pay particular attention to the variable z.

After each iteration we add extra sampled noise by setting "z = torch.randn_like(x)", so that the sample fed back into the network stays normally distributed:

# helper function; removes the predicted noise (but adds some noise back in to avoid collapse)
def denoise_add_noise(x, t, pred_noise, z=None):
    if z is None:
        z = torch.randn_like(x)
    noise = b_t.sqrt()[t] * z
    mean = (x - pred_noise * ((1 - a_t[t]) / (1 - ab_t[t]).sqrt())) / a_t[t].sqrt()
    return mean + noise

Next, load the model weights:

# load in model weights and set to eval mode
nn_model.load_state_dict(torch.load(f"{save_dir}/model_trained.pth", map_location=device))
nn_model.eval()
print("Loaded in Model")

The following code implements the DDPM sampling algorithm introduced earlier:

# sample using standard algorithm
@torch.no_grad()
def sample_ddpm(n_sample, save_rate=20):
    # x_T ~ N(0, 1), sample initial noise
    samples = torch.randn(n_sample, 3, height, height).to(device)

    # array to keep track of generated steps for plotting
    intermediate = []
    for i in range(timesteps, 0, -1):
        print(f'sampling timestep {i:3d}', end='\r')

        # reshape time tensor
        t = torch.tensor([i / timesteps])[:, None, None, None].to(device)

        # sample some random noise to inject back in. For i = 1, don't add back in noise
        z = torch.randn_like(samples) if i > 1 else 0

        eps = nn_model(samples, t)    # predict noise e_(x_t,t)
        samples = denoise_add_noise(samples, i, eps, z)
        if i % save_rate == 0 or i == timesteps or i < 8:
            intermediate.append(samples.detach().cpu().numpy())

    intermediate = np.stack(intermediate)
    return samples, intermediate

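Run the model to obtain the predicted noise:

eps = nn_model(samples, t)    # predict noise e_(x_t,t)

Finally, denoise:

samples = denoise_add_noise(samples, i, eps, z)

Now let us visualize how the samples evolve over time. This may take a few minutes, depending on the hardware you run on. The fourth installment of this series covers a Fast Sampling technique in detail.

Click the start button to watch the sprites the model generates along the timeline, shown in the animation below.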

Source: Model output with Amazon SageMaker notebook instance

If the animation above does not display properly on a phone, see the three screenshots below, taken at different points on the timeline.

Source:Model output with Amazon SageMaker notebook instance

8. Output test without extra noise

For the test without extra noise, the code change is simple: set the variable z to zero and pass it in, as shown below.

# incorrectly sample without adding in noise
@torch.no_grad()
def sample_ddpm_incorrect(n_sample):
    # x_T ~ N(0, 1), sample initial noise
    samples = torch.randn(n_sample, 3, height, height).to(device)

    # array to keep track of generated steps for plotting
    intermediate = []
    for i in range(timesteps, 0, -1):
        print(f'sampling timestep {i:3d}', end='\r')

        # reshape time tensor
        t = torch.tensor([i / timesteps])[:, None, None, None].to(device)

        # don't add back in noise
        z = 0

        eps = nn_model(samples, t)    # predict noise e_(x_t,t)
        samples = denoise_add_noise(samples, i, eps, z)
        if i % 20 == 0 or i == timesteps or i < 8:
            intermediate.append(samples.detach().cpu().numpy())

    intermediate = np.stack(intermediate)
    return samples, intermediate

Let us look at the output when no noise is added, shown in the figure below: the output is distorted!

Source: Model output with Amazon SageMaker notebook instance

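This is clearly not what we want. In this design, adding extra noise at every iteration, so that the network's input stays normally distributed, turns out to be a key step.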
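The effect can be reproduced without a trained network in a one-dimensional toy. The sketch assumes a hypothetical dataset whose values follow N(0, 1); for such data the best possible noise prediction has the closed form sqrt(1 - ab_t[t]) * x, so predict_noise below is a stand-in, not the sprite model. The loop uses the same schedule and update as the article's code and is run once with and once without the extra noise.

```python
import math
import random
import statistics

TIMESTEPS = 500
BETA1, BETA2 = 1e-4, 0.02
b_t = [BETA1 + (BETA2 - BETA1) * i / TIMESTEPS for i in range(TIMESTEPS + 1)]
a_t = [1.0 - b for b in b_t]
ab_t = [1.0]
for a in a_t[1:]:
    ab_t.append(ab_t[-1] * a)   # cumulative product, with ab_t[0] = 1

def predict_noise(x, t):
    """Stand-in for the trained network: exact when the data follows N(0, 1)."""
    return math.sqrt(1.0 - ab_t[t]) * x

def sample(rng, add_noise):
    x = rng.gauss(0.0, 1.0)                    # start from pure noise
    for t in range(TIMESTEPS, 0, -1):
        z = rng.gauss(0.0, 1.0) if (add_noise and t > 1) else 0.0
        eps = predict_noise(x, t)
        mean = (x - eps * (1.0 - a_t[t]) / math.sqrt(1.0 - ab_t[t])) / math.sqrt(a_t[t])
        x = mean + math.sqrt(b_t[t]) * z
    return x

rng = random.Random(0)
with_noise = [sample(rng, add_noise=True) for _ in range(500)]
without_noise = [sample(rng, add_noise=False) for _ in range(500)]
print("std with extra noise   :", round(statistics.pstdev(with_noise), 3))
print("std without extra noise:", round(statistics.pstdev(without_noise), 3))
```

With the extra noise, the generated values keep roughly the spread of the data. Without it, every step shrinks the sample a little and the outputs pile up near the dataset mean of 0, a one-dimensional analogue of the distorted sprites. Real images are of course not Gaussian, so this only illustrates the mechanism.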

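Summary

As the first article in the "How Diffusion Models Work" code series, this article implemented two code blocks to compare two ways of sampling from a diffusion model:

Adding extra noise
Not adding extra noise

In short, the input to the diffusion model's neural network should be a normally distributed noise sample. During the iterations, the sample left after subtracting the predicted noise no longer follows that distribution, which easily leads to distorted output. So after each iteration we add extra sampled noise, scaled according to the current timestep, so that the input stays normally distributed. This keeps the sampling process stable and prevents the distortion that occurs when the model's predictions drift toward the mean of the dataset.

In the following articles we will keep digging into the diffusion process and the model that carries it out, and build a diffusion model from scratch ourselves rather than only loading a pre-trained model or calling an API, to understand more deeply how diffusion models are implemented. The next article puts diffusion model training into code. Stay tuned.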

References

DeepLearning.AI short course “How Diffusion Models Work”

Sprites by ElvGames, FrootsnVeggies and kyrise

Code reference, This code is modified from

DDPM & DDIM papers

Diffusion model is based on Denoising Diffusion Probabilistic Models and Denoising Diffusion Implicit Models

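Author: Haowen Huang (黄浩文)

Senior Developer Advocate at Amazon Web Services, focusing on AI/ML and data science. He has more than 20 years of experience in architecture design, technology, and startup management across the telecom, mobile internet, and cloud computing industries, and previously worked at Microsoft, Sun Microsystems, China Telecom and other companies. He focuses on consulting for enterprise customers in gaming, e-commerce, media and advertising on AI/ML, data analytics, and digital transformation solutions.

Source: https://dev.amazoncloud.cn/column/article/6504433963c37d5eec956a79?sc_medium=regulartraffic&sc_campaign=crossplatform&sc_channel=SF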