On Deep Learning: Implementing VAE and CVAE from Scratch

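A diffusion model can be viewed as a very deep VAE (variational autoencoder). The forward process gradually perturbs the data distribution by adding noise at multiple scales; the reverse process then learns how to restore the structure of the data. These destruction and restoration processes correspond to the encoding and decoding steps of a VAE. The VAE is therefore an important concept to master, and this article implements VAE and CVAE from scratch in Python to build a better understanding of both.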

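An autoencoder is a neural network architecture made up of two parts: an encoder and a decoder. The decoder follows the encoder, and between them sits a hidden layer that goes by many names: bottleneck layer, latent space, hidden layer, or coding layer. It looks like this: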

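Autoencoders can be used for many purposes. The most common is data compression: when the input signal passes through the encoder, the latent representation of the image is much smaller than the input. For example, in the figure above the input signal is represented by 8 values, while its compressed representation needs only 3.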

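Autoencoders can also be used for other purposes: data denoising, feature learning, anomaly detection, and the currently very popular Stable Diffusion models.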

We will use the MNIST dataset. To download MNIST to a local folder, run the following:

 import gzip
 from urllib import request
 
 import numpy as np
 
 # Download the files
 url = "http://yann.lecun.com/exdb/mnist/"
 filenames = ['train-images-idx3-ubyte.gz', 'train-labels-idx1-ubyte.gz',
              't10k-images-idx3-ubyte.gz', 't10k-labels-idx1-ubyte.gz']
 data = []
 for filename in filenames:
     print("Downloading", filename)
     request.urlretrieve(url + filename, filename)
     with gzip.open(filename, 'rb') as f:
         if 'labels' in filename:
             # Load the labels as a one-dimensional array of integers
             data.append(np.frombuffer(f.read(), np.uint8, offset=8))
         else:
             # Load the images as a two-dimensional array of pixels
             data.append(np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1,28*28))
 
 # Split into training and testing sets
 X_train, y_train, X_test, y_test = data
 
 # Normalize the pixel values
 X_train = X_train.astype(np.float32) / 255.0
 X_test = X_test.astype(np.float32) / 255.0
 
 # Convert labels to integers
 y_train = y_train.astype(np.int64)
 y_test = y_test.astype(np.int64)
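
The `offset=16` and `offset=8` arguments in the loader skip the header of the IDX format that the MNIST files use. As a minimal sketch of what that header contains, the snippet below parses it with the standard library only. The function name `parse_idx_header` is hypothetical, and the two headers are built by hand for illustration rather than read from the downloaded files.

```python
import struct

def parse_idx_header(raw: bytes):
    """Return (dtype_code, dims, header_size) for a gunzipped IDX byte string."""
    # Bytes 0-1 are zero, byte 2 is the data type code, byte 3 is the number of dimensions
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file")
    # Each dimension size is a big-endian unsigned 32-bit integer
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    header_size = 4 + 4 * ndim
    return dtype_code, dims, header_size

# Hand-built headers (assumption: they mimic the MNIST training files)
image_header = struct.pack(">HBBIII", 0, 0x08, 3, 60000, 28, 28)
label_header = struct.pack(">HBBI", 0, 0x08, 1, 60000)

print(parse_idx_header(image_header))  # header_size 16, matching offset=16
print(parse_idx_header(label_header))  # header_size 8, matching offset=8
```

The image file has three dimensions (count, rows, columns), so its header is 16 bytes long; the label file has one dimension, so its header is 8 bytes long.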

Now that we have the training and test sets, let's first take a look at the images:

 import matplotlib.pyplot as plt
 
 def show_images(images, labels):
     """
     Display a row of images and their labels using matplotlib.
     `images` is an array of flattened 28x28 images (one row of
     784 pixel values per image); `labels` holds one label per image.
     """
     # Reshape the flattened pixels into 28x28 arrays
     pixels = images.reshape(-1, 28, 28)
 
     # Create a figure with one subplot per image, laid out in a single row
     fig, axs = plt.subplots(ncols=len(images), nrows=1, figsize=(10, 3 * len(images)))
 
     # Loop over the images and display them with their labels
     for i in range(len(images)):
         # Display the image and its label
         axs[i].imshow(pixels[i], cmap="gray")
         axs[i].set_title("Label: {}".format(labels[i]))
 
         # Remove the tick marks and axis labels
         axs[i].set_xticks([])
         axs[i].set_yticks([])
         axs[i].set_xlabel("Index: {}".format(i))
 
     # Adjust the spacing between subplots
     fig.subplots_adjust(hspace=0.5)
 
     # Show the figure
     plt.show()

Because the data is fairly simple, we use plain linear layers here, which keeps the computation easy:

 import torch.nn as nn
 
 class AutoEncoder(nn.Module):
     def __init__(self):
         super().__init__()
         
         # Set the number of hidden units
         self.num_hidden = 8
         
         # Define the encoder part of the autoencoder
         self.encoder = nn.Sequential(
             nn.Linear(784, 256),  # input size: 784, output size: 256
             nn.ReLU(),  # apply the ReLU activation function
             nn.Linear(256, self.num_hidden),  # input size: 256, output size: num_hidden
             nn.ReLU(),  # apply the ReLU activation function
         )
         
         # Define the decoder part of the autoencoder
         self.decoder = nn.Sequential(nn.Linear(self.num_hidden, 256),  # input size: num_hidden, output size: 256
             nn.ReLU(),  # apply the ReLU activation function
             nn.Linear(256, 784),  # input size: 256, output size: 784
             nn.Sigmoid(),  # apply the sigmoid activation function to compress the output to a range of (0, 1)
         )
 
     def forward(self, x):
         # Pass the input through the encoder
         encoded = self.encoder(x)
         # Pass the encoded representation through the decoder
         decoded = self.decoder(encoded)
         # Return both the encoded representation and the reconstructed output
         return encoded, decoded

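We do not need the image labels during training, because this is an unsupervised method. We use a simple mean squared error loss here, because we want to reconstruct our images as accurately as possible. Let's do some preparation: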

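 import torch
 import torch.optim as optim
 
 # Hyperparameters (assumed here; they match the defaults of train_vae later in the article)
 learning_rate = 1e-3
 num_epochs = 10
 batch_size = 32
 
 # Use the GPU if available, otherwise the CPU
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
 # Convert the training data to PyTorch tensors
 X_train = torch.from_numpy(X_train)
 
 # Create the autoencoder model and optimizer
 model = AutoEncoder()
 optimizer = optim.Adam(model.parameters(), lr=learning_rate)
 
 # Define the loss function
 criterion = nn.MSELoss()
 
 # Move the model to the selected device
 model.to(device)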
 
 # Create a DataLoader to handle batching of the training data
 train_loader = torch.utils.data.DataLoader(X_train, batch_size=batch_size, shuffle=True)

Finally, the training loop is also quite standard:

 # Training loop
 for epoch in range(num_epochs):
     total_loss = 0.0
     for batch_idx, data in enumerate(train_loader):
         # Get a batch of training data and move it to the device
         data = data.to(device)
 
         # Forward pass
         encoded, decoded = model(data)
 
         # Compute the loss and perform backpropagation
         loss = criterion(decoded, data)
         optimizer.zero_grad()
         loss.backward()
         optimizer.step()
 
         # Update the running loss
         total_loss += loss.item() * data.size(0)
 
     # Print the epoch loss
     epoch_loss = total_loss / len(train_loader.dataset)
     print("Epoch {}/{}: loss={:.4f}".format(epoch + 1, num_epochs, epoch_loss))

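To compute the loss, we simply compare the input image directly with the reconstructed image. Training is fast, and you can even run it on a CPU. Once training is done, compare the output images with the input images: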

The top row shows the original images and the bottom row shows the reconstructions.
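
The comparison between input and reconstruction that serves as the loss can be written out by hand. This is a dependency-free sketch of the mean squared error, averaged over all pixels in the same way `nn.MSELoss()` does with its default reduction. The function name `mse` and the toy pixel values are illustrative assumptions.

```python
def mse(reconstruction, target):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(reconstruction) != len(target):
        raise ValueError("inputs must have the same length")
    squared_errors = [(r - t) ** 2 for r, t in zip(reconstruction, target)]
    return sum(squared_errors) / len(squared_errors)

# Toy 4-pixel "images" (made up for illustration)
original = [0.0, 0.5, 1.0, 0.25]
perfect = list(original)
blurry = [0.25, 0.5, 0.75, 0.25]

print(mse(perfect, original))  # 0.0
print(mse(blurry, original))   # 0.03125
```

A perfect reconstruction gives a loss of zero, and any smoothing of sharp pixel values shows up as a positive error.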

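There are a few issues here:

1. The reconstructed images are blurry. This is because the reconstruction is not perfect.

2. This kind of compression is not free: it comes at the cost of problems in the decoding process.

3. We use only 8 hidden units. Increasing the number of hidden units improves image quality, while reducing it makes the blur worse.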

Here is the result with 32 hidden units:

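Autoencoders do reasonably well on existing data, but their biggest problem is that generating new data is very hard. If we drop the encoder and start from the latent layer alone, we should be able to obtain a meaningful image. For a plain autoencoder, however, there is no meaningful way to sample the latent space, that is, no reliable sampling strategy that guarantees the output images are readable and also show some variation.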

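What we will do now is generate a batch of samples from this latent space. Sampling from the actual distribution of the latent space is a bit difficult, so we sample from a rectangle instead. This is the result: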
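
The article does not say how the rectangle is defined. One plausible reading, taken here as an assumption, is the axis-aligned bounding box of the latent codes of the training data, sampled uniformly. A minimal sketch with a hypothetical helper `sample_from_bounding_box` and made-up 2-dimensional codes:

```python
import random

def sample_from_bounding_box(codes, num_samples, rng):
    """Sample uniformly from the axis-aligned box that encloses `codes`."""
    dims = len(codes[0])
    # Per-dimension minimum and maximum of the observed latent codes
    lows = [min(code[d] for code in codes) for d in range(dims)]
    highs = [max(code[d] for code in codes) for d in range(dims)]
    return [
        [rng.uniform(lows[d], highs[d]) for d in range(dims)]
        for _ in range(num_samples)
    ]

# Toy latent codes (assumption: stand-ins for encoder outputs)
toy_codes = [[0.0, 2.0], [1.5, 0.5], [3.0, 4.0]]
rng = random.Random(0)
samples = sample_from_bounding_box(toy_codes, 5, rng)
print(samples)
```

Each sampled vector would then be passed through the decoder. Nothing guarantees that a point inside the box lies near any real latent code, which is why some decoded samples look fine and others do not.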

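Although some of the samples look fine, sampling the space in a meaningful way becomes harder as its dimensionality grows. For example, if we increase the dimension to 32, we get the following: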

The digits are no longer recognizable. Is there a better approach?

The paper on variational autoencoders (VAEs) is titled "Auto-Encoding Variational Bayes" and was published by Diederik P. Kingma and Max Welling in 2014.

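VAEs give us a more flexible way to learn a latent representation of the domain. The basic idea is simple: we learn the parameters of the latent space distribution rather than concrete values. When producing a latent variable, we do not take it directly from the encoder output; instead we use the learned distribution parameters to generate it.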

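We use the input data to learn a vector of means and a vector of variances, so that we can later use them to sample the latent variable that the decoder will decode.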

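However, the sampling operation is not differentiable, so a technique called the reparameterization trick is used. It works like this: instead of sampling directly from the distribution block, we explicitly learn two vectors, one for the mean and one for the variance, and keep a separate, independent noise block that is the only thing we sample from. Gradients can then flow through the mean and variance, because the randomness sits in an input that has no learned parameters.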

So instead of sampling from these distributions directly, we do the following:
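
Written in the notation of the `reparameterize` method in the VAE code, and assuming the standard form of the trick, the operation is:

$$
\tilde{L} = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \sigma = \exp\!\left(\tfrac{1}{2}\log \sigma^{2}\right)
$$

Here $\mu$ and $\log \sigma^{2}$ are the outputs of the `mu` and `log_var` layers, $\epsilon$ corresponds to `eps`, and $\odot$ is the elementwise product.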

L̃ denotes a component of the latent representation. The model now looks like this:

The model code above changes accordingly:

 class VAE(AutoEncoder):
     def __init__(self):
         super().__init__()
         # Add mu and log_var layers for reparameterization
         self.mu = nn.Linear(self.num_hidden, self.num_hidden)
         self.log_var = nn.Linear(self.num_hidden, self.num_hidden)
 
     def reparameterize(self, mu, log_var):
         # Compute the standard deviation from the log variance
         std = torch.exp(0.5 * log_var)
         # Generate random noise using the same shape as std
         eps = torch.randn_like(std)
         # Return the reparameterized sample
         return mu + eps * std
 
     def forward(self, x):
         # Pass the input through the encoder
         encoded = self.encoder(x)
         # Compute the mean and log variance vectors
         mu = self.mu(encoded)
         log_var = self.log_var(encoded)
         # Reparameterize the latent variable
         z = self.reparameterize(mu, log_var)
         # Pass the latent variable through the decoder
         decoded = self.decoder(z)
         # Return the encoded output, decoded output, mean, and log variance
         return encoded, decoded, mu, log_var
 
     def sample(self, num_samples):
         with torch.no_grad():
             # Generate random noise
             z = torch.randn(num_samples, self.num_hidden).to(device)
             # Pass the noise through the decoder to generate samples
             samples = self.decoder(z)
         # Return the generated samples
         return samples
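
To see that `reparameterize` really produces draws with the requested mean and variance, the same arithmetic can be checked on scalars with the standard library. This is a sketch under stated assumptions: the values of `mu` and `log_var` are made up, and the function takes an explicit random generator, which the class method does not.

```python
import math
import random
import statistics

def reparameterize(mu, log_var, rng):
    """Scalar version of mu + eps * std with eps drawn from N(0, 1)."""
    std = math.exp(0.5 * log_var)  # standard deviation from the log variance
    eps = rng.gauss(0.0, 1.0)      # parameter-free noise
    return mu + eps * std

rng = random.Random(0)
mu, log_var = 2.0, math.log(0.25)  # variance 0.25, so std 0.5

draws = [reparameterize(mu, log_var, rng) for _ in range(20000)]
sample_mean = statistics.mean(draws)
sample_std = statistics.pstdev(draws)
print(sample_mean, sample_std)  # close to 2.0 and 0.5
```

The noise `eps` does not depend on `mu` or `log_var`, so the output is a deterministic, differentiable function of both once `eps` is fixed.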

How do we train this model? Let's first define the loss function:

 import torch.nn.functional as F
 
 # Define a loss function that combines binary cross-entropy and Kullback-Leibler divergence
 def loss_function(recon_x, x, mu, logvar):
     # Compute the binary cross-entropy loss between the reconstructed output and the input data
     BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")
     # Compute the Kullback-Leibler divergence between the learned latent variable distribution and a standard Gaussian distribution
     KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
     # Combine the two losses by adding them together and return the result
     return BCE + KLD
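
The `KLD` line is the closed-form KL divergence between a diagonal Gaussian and the standard normal distribution. As a small numeric check, here is the same quantity in plain Python, written in the algebraically equivalent form 0.5 * sum(mu^2 + exp(log_var) - log_var - 1). The function name `kl_to_standard_normal` and the test values are illustrative assumptions.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(
        m ** 2 + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )

# Identical to the prior: no penalty
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Mean shifted by 1 in one dimension, unit variance
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
# Zero mean but variance 4 in one dimension: also penalized
wide = kl_to_standard_normal([0.0], [math.log(4.0)])
print(wide)
```

The penalty is zero only when the learned distribution matches the prior exactly, and it grows both when the mean moves away from zero and when the variance moves away from one.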

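The first component is already familiar: it is just the reconstruction error. The second introduces a penalty for the learned distribution drifting too far from the prior. Recall the KL divergence, which measures how much one probability distribution differs from another; here we compare our learned distribution with the standard normal distribution. With this in place we can train the model. Note that the training function below computes the loss inline, using a summed MSE reconstruction term plus the KL term weighted by 3, rather than calling loss_function: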

 def train_vae(X_train, learning_rate=1e-3, num_epochs=10, batch_size=32):
     # Convert the training data to PyTorch tensors
     X_train = torch.from_numpy(X_train).to(device)
 
     # Create the autoencoder model and optimizer
     model = VAE()
     optimizer = optim.Adam(model.parameters(), lr=learning_rate)
 
     # Define the loss function
     criterion = nn.MSELoss(reduction="sum")
 
     # Set the device to GPU if available, otherwise use CPU
     model.to(device)
 
     # Create a DataLoader to handle batching of the training data
     train_loader = torch.utils.data.DataLoader(X_train, batch_size=batch_size, shuffle=True)
 
     # Training loop
     for epoch in range(num_epochs):
         total_loss = 0.0
         for batch_idx, data in enumerate(train_loader):
             # Get a batch of training data and move it to the device
             data = data.to(device)
 
             # Forward pass
             encoded, decoded, mu, log_var = model(data)
 
             # Compute the loss and perform backpropagation
             KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
             loss = criterion(decoded, data) + 3 * KLD
             optimizer.zero_grad()
             loss.backward()
             optimizer.step()
 
             # Update the running loss (the loss is already summed over the batch)
             total_loss += loss.item()
 
         # Print the average per-sample loss for the epoch
         epoch_loss = total_loss / len(train_loader.dataset)
         print("Epoch {}/{}: loss={:.4f}".format(epoch + 1, num_epochs, epoch_loss))
 
     # Return the trained model
     return model

Now let's look at the images we generate:

The images still look blurry. This is because we use MSE for the reconstruction term; other losses may give better results.

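One last interesting experiment is to generate a random vector and then gradually change one of its dimensions while keeping the others fixed. This lets us see how the decoder output changes. Here are some examples: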

By changing a single component we move from 0 to 9, and then on to 1 or 7.
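
The traversal described above boils down to building a list of latent vectors that differ in a single coordinate. A minimal sketch, assuming a hypothetical helper `traverse_dimension` and made-up values for the base vector and the steps:

```python
def traverse_dimension(z, dim, values):
    """Return copies of z where only z[dim] is replaced by each value in turn."""
    if not 0 <= dim < len(z):
        raise IndexError("dim out of range")
    variants = []
    for value in values:
        variant = list(z)      # copy so the base vector stays untouched
        variant[dim] = value   # change a single latent coordinate
        variants.append(variant)
    return variants

base = [0.3, -1.2, 0.8]        # stand-in for a random latent vector
steps = [-2.0, 0.0, 2.0]       # values to sweep through
grid = traverse_dimension(base, 1, steps)
for row in grid:
    print(row)
```

With the trained model, each row would be converted to a tensor and passed through `model.decoder`, and the decoded images shown side by side.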

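To generate images with a specific label, the decoder needs to learn how to decode latent variables when it is given a hint. In other words, we constrain the model with some extra information. How can we do that? An obvious idea is to pass an encoded digit label to the decoder, so that it can learn to take the label into account while decoding. It looks like this: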

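The decoder now has an additional source of information. Why do we apply a linear projection first and then sum? The projection makes the label information match the size of the code layer: we need to project the label to the same dimension as the latent space before the two can be added. We could also take the mean, multiply elementwise, or simply concatenate the vectors. Any similar method works; here we just add them. Look at this diagram: doesn't it resemble a diffusion model a little (CVAE, after all, was published back in 2016)?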
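
The alternatives listed above can be compared on plain lists. This is an illustrative sketch only: the function `combine` and its `mode` argument are hypothetical names, and the article's model uses the sum.

```python
def combine(z, label_vec, mode="sum"):
    """Merge a latent vector with a projected label vector."""
    if mode == "concat":
        # Concatenation does not need matching sizes, but it widens the decoder input
        return list(z) + list(label_vec)
    if len(z) != len(label_vec):
        raise ValueError("sum, mean and product need vectors of equal length")
    if mode == "sum":
        return [a + b for a, b in zip(z, label_vec)]
    if mode == "mean":
        return [(a + b) / 2 for a, b in zip(z, label_vec)]
    if mode == "product":
        return [a * b for a, b in zip(z, label_vec)]
    raise ValueError("unknown mode: " + mode)

z = [1.0, 2.0]
projected_label = [3.0, 4.0]
print(combine(z, projected_label, "sum"))      # [4.0, 6.0]
print(combine(z, projected_label, "mean"))     # [2.0, 3.0]
print(combine(z, projected_label, "product"))  # [3.0, 8.0]
print(combine(z, projected_label, "concat"))   # [1.0, 2.0, 3.0, 4.0]
```

Sum, mean and product keep the decoder input size unchanged, which is why the projection to the latent dimension is needed. Concatenation would instead require a decoder whose first layer accepts the longer vector.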

At inference time, all we have to do is pass in the label of the digit we want to generate. The code becomes:

 class ConditionalVAE(VAE):
     # Builds on the VAE class defined earlier in this article
     def __init__(self, num_classes):
         super().__init__()
         # Add a linear layer for the class label
         self.label_projector = nn.Sequential(
             nn.Linear(num_classes, self.num_hidden),
             nn.ReLU(),
         )
 
     def condition_on_label(self, z, y):
         projected_label = self.label_projector(y.float())
         return z + projected_label
 
     def forward(self, x, y):
         # Pass the input through the encoder
         encoded = self.encoder(x)
         # Compute the mean and log variance vectors
         mu = self.mu(encoded)
         log_var = self.log_var(encoded)
         # Reparameterize the latent variable
         z = self.reparameterize(mu, log_var)
         # Pass the latent variable through the decoder
         decoded = self.decoder(self.condition_on_label(z, y))
         # Return the encoded output, decoded output, mean, and log variance
         return encoded, decoded, mu, log_var
 
     def sample(self, num_samples, y):
         with torch.no_grad():
             # Generate random noise
             z = torch.randn(num_samples, self.num_hidden).to(device)
             # Pass the noise through the decoder to generate samples
             samples = self.decoder(self.condition_on_label(z, y))
         # Return the generated samples
         return samples

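There is a new layer called label_projector, which performs the linear projection. In both the forward pass and sampling, the label passes through this layer and the result is added to the latent vector before decoding.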

The CVAE loss is the same as the VAE loss and training is essentially identical, so here we only look at the results:

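 from torch.nn.functional import one_hot
 
 # `cvae` is assumed to be a trained ConditionalVAE(num_classes=10) instance
 num_samples = 10
 random_labels = [8] * num_samples
 labels_one_hot = one_hot(torch.LongTensor(random_labels), num_classes=10).to(device)
 show_images(
     cvae.sample(num_samples, labels_one_hot).cpu().detach().numpy(),
     labels=random_labels,
 )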

As we can see, the generated images are essentially fixed to the requested digit, and no other digits appear.
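
The label reaches the model as a one-hot vector. For readers unfamiliar with that encoding, here is a dependency-free sketch of what it looks like. The helper name `one_hot_rows` is hypothetical and is not the PyTorch function used in the article's code.

```python
def one_hot_rows(labels, num_classes):
    """Encode each integer label as a list with a single 1 at the label's index."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError("label out of range: {}".format(label))
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows

encoded = one_hot_rows([8, 0], num_classes=10)
print(encoded[0])  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(encoded[1])  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Feeding such a vector into a linear layer selects one column of the layer's weight matrix (plus the bias), so each digit effectively gets its own learned offset in latent space.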

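Autoencoders are fundamental to understanding unsupervised learning and data compression. Simple autoencoders can reconstruct images, but they struggle to generate new data. Variational autoencoders (VAEs) offer a more flexible approach: they generate new data by learning the parameters of a latent distribution that can be sampled, and the reparameterization trick makes the sampling step differentiable. CVAE then adds support for conditioning. Studying these models gives a solid theoretical basis for understanding Stable Diffusion models.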

https://avoid.overfit.cn/post/57bd9ac6acbb4fe0987bdfc1819d1c59

Author: Konstantin Sofeikov
