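A diffusion model can be viewed as a very deep VAE (variational autoencoder). The forward process gradually perturbs the data distribution by adding noise at multiple scales; the reverse process then learns to restore the structure of the data. These destruction and restoration processes correspond to the encoding and decoding steps of a VAE. The VAE is therefore an important concept to master. This article implements a VAE and a CVAE from scratch in Python (with PyTorch) to deepen our understanding of them.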
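What is an autoencoder, and what is it for?
An autoencoder is a neural network architecture made up of two parts, an encoder and a decoder. The decoder follows the encoder, and between them sits the so-called hidden layer, which goes by several names: bottleneck layer, latent space, hidden layer, encoding layer, or coding layer. It looks like this:
Autoencoders can be used for many purposes. The most common is data compression: once the input signal has passed through the encoder, the latent representation of the image is much smaller. In the figure above, for example, the input signal is represented by 8 values, but its compressed representation needs only 3.
Autoencoders can also serve other purposes: data denoising, feature learning, anomaly detection, and the now hugely popular Stable Diffusion models.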
Implementing an autoencoder
We will use the MNIST dataset. To download MNIST to a local folder, run the following:
```python
import gzip
from urllib import request

import numpy as np

# Download the files
url = "http://yann.lecun.com/exdb/mnist/"
filenames = ['train-images-idx3-ubyte.gz',
             'train-labels-idx1-ubyte.gz',
             't10k-images-idx3-ubyte.gz',
             't10k-labels-idx1-ubyte.gz']

data = []
for filename in filenames:
    print("Downloading", filename)
    request.urlretrieve(url + filename, filename)
    with gzip.open(filename, 'rb') as f:
        if 'labels' in filename:
            # Load the labels as a one-dimensional array of integers
            data.append(np.frombuffer(f.read(), np.uint8, offset=8))
        else:
            # Load the images as a two-dimensional array of pixels
            data.append(np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28 * 28))

# Split into training and testing sets (same order as `filenames`)
X_train, y_train, X_test, y_test = data

# Normalize the pixel values to [0, 1]
X_train = X_train.astype(np.float32) / 255.0
X_test = X_test.astype(np.float32) / 255.0

# Convert labels to 64-bit integers
y_train = y_train.astype(np.int64)
y_test = y_test.astype(np.int64)
```
Now that we have training and test sets, let's first take a look at the images:
```python
import matplotlib.pyplot as plt


def show_images(images, labels):
    """
    Display a row of images and their labels using matplotlib.

    `images` holds one flattened 28x28 image per row (shape: n x 784),
    and `labels` holds the matching label for each image.
    Expects at least two images, since the axes are indexed as an array.
    """
    # Reshape the flattened pixels into 28x28 images
    pixels = images.reshape(-1, 28, 28)

    # Create a figure with one subplot per image, all in a single row
    fig, axs = plt.subplots(ncols=len(images), nrows=1, figsize=(10, 3))

    # Loop over the images and display them with their labels
    for i in range(len(images)):
        # Display the image and its label
        axs[i].imshow(pixels[i], cmap="gray")
        axs[i].set_title("Label: {}".format(labels[i]))

        # Remove the tick marks and show the position as the axis label
        axs[i].set_xticks([])
        axs[i].set_yticks([])
        axs[i].set_xlabel("Index: {}".format(i))

    # Adjust the spacing between subplots
    fig.subplots_adjust(hspace=0.5)

    # Show the figure
    plt.show()
```
Because the data is fairly simple, we use plain linear layers here, which keeps the computation easy to follow:
```python
import torch.nn as nn


class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()

        # Set the number of hidden units
        self.num_hidden = 8

        # Define the encoder part of the autoencoder
        self.encoder = nn.Sequential(
            nn.Linear(784, 256),              # input size: 784, output size: 256
            nn.ReLU(),                        # apply the ReLU activation function
            nn.Linear(256, self.num_hidden),  # input size: 256, output size: num_hidden
            nn.ReLU(),                        # apply the ReLU activation function
        )

        # Define the decoder part of the autoencoder
        self.decoder = nn.Sequential(
            nn.Linear(self.num_hidden, 256),  # input size: num_hidden, output size: 256
            nn.ReLU(),                        # apply the ReLU activation function
            nn.Linear(256, 784),              # input size: 256, output size: 784
            nn.Sigmoid(),                     # squash the output into the range (0, 1)
        )

    def forward(self, x):
        # Pass the input through the encoder
        encoded = self.encoder(x)
        # Pass the encoded representation through the decoder
        decoded = self.decoder(encoded)
        # Return both the encoded representation and the reconstructed output
        return encoded, decoded
```
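As a quick sanity check on the architecture above, the layer sizes can be turned into a compression ratio and a parameter count. This is a back-of-the-envelope sketch in plain Python rather than part of the original code. The helper name `linear_params` is our own, and it assumes each `nn.Linear` layer holds one weight matrix plus one bias vector.

```python
def linear_params(n_in, n_out):
    """Parameters of one fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


input_size, hidden_size, num_hidden = 784, 256, 8

# Encoder: 784 -> 256 -> 8, decoder: 8 -> 256 -> 784
encoder_params = linear_params(input_size, hidden_size) + linear_params(hidden_size, num_hidden)
decoder_params = linear_params(num_hidden, hidden_size) + linear_params(hidden_size, input_size)
total_params = encoder_params + decoder_params

# How many input values are represented by each latent value
compression_ratio = input_size / num_hidden

print(f"encoder: {encoder_params}, decoder: {decoder_params}, total: {total_params}")
print(f"each {input_size}-pixel image is squeezed into {num_hidden} values "
      f"({compression_ratio:.0f}x fewer)")
```

The bottleneck is severe: every image has to be described by only 8 numbers, which helps explain why the reconstructions discussed later come out blurry.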
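We do not need the image labels for training, because this is an unsupervised method. We use a simple mean squared error loss, since we want to reconstruct our images as accurately as possible. Let's do some preparation: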
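```python
import torch
import torch.optim as optim

# Hyperparameters: not stated at this point in the original; these values
# mirror the defaults of train_vae later in the article
learning_rate = 1e-3
batch_size = 32
num_epochs = 10

# Use the GPU if available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convert the training data to PyTorch tensors
X_train = torch.from_numpy(X_train)

# Create the autoencoder model and optimizer
model = AutoEncoder()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Define the loss function
criterion = nn.MSELoss()

# Move the model to the selected device
model.to(device)

# Create a DataLoader to handle batching of the training data
train_loader = torch.utils.data.DataLoader(
    X_train, batch_size=batch_size, shuffle=True
)
```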
Finally, the training loop is also quite standard:
```python
# Training loop
for epoch in range(num_epochs):
    total_loss = 0.0
    for batch_idx, data in enumerate(train_loader):
        # Get a batch of training data and move it to the device
        data = data.to(device)

        # Forward pass
        encoded, decoded = model(data)

        # Compute the loss and perform backpropagation
        loss = criterion(decoded, data)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update the running loss (batch mean times batch size)
        total_loss += loss.item() * data.size(0)

    # Print the epoch loss
    epoch_loss = total_loss / len(train_loader.dataset)
    print(
        "Epoch {}/{}: loss={:.4f}".format(epoch + 1, num_epochs, epoch_loss)
    )
```
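To compute the loss, we simply compare the input image directly with its reconstruction. Training is fast, and you can even run it on a CPU. Once training is done, compare the input and output images:
The top row shows the original images and the bottom row shows the reconstructions.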
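There are a few issues here:
1. The reconstructed images are blurry, because the reconstruction is not perfect.
2. This kind of compression is not free: it comes at the cost of errors introduced during decoding.
3. We use only 8 hidden units. Increasing the number of hidden units improves image quality, while reducing it makes the blur worse.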
Here is the result with 32 hidden units:
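Autoencoders do quite well on existing data, but they have one major problem: generating new data is very hard. If we drop the encoder and start from the latent layer, we should in principle be able to obtain a meaningful image. With a plain autoencoder, however, there is no meaningful way to sample the latent space, that is, no reliable sampling strategy that guarantees the output images are legible while still showing some variation.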
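What we do now is generate a batch of samples from this latent space. Sampling from the actual distribution of the latent space is somewhat difficult, so we sample from a rectangle instead. This is the result:
Although some samples look fine, sampling the same space in a meaningful way becomes harder as its dimensionality grows. For example, if we increase the dimension to 32, the result is as follows:
The digits are no longer recognizable. Is there a better way?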
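The text does not spell out how the rectangle is constructed. One plausible reading, assumed here, is the axis-aligned bounding box of the latent codes produced by the encoder. The sketch below uses plain Python lists and made-up latent codes. The function name `sample_from_bounding_box` is hypothetical; in practice the codes would come from something like `model.encoder(X_train)`.

```python
import random


def sample_from_bounding_box(codes, num_samples, seed=0):
    """Draw latent vectors uniformly from the axis-aligned box spanned by `codes`."""
    rng = random.Random(seed)
    num_dims = len(codes[0])

    # Per-dimension minimum and maximum over all observed latent codes
    lows = [min(code[d] for code in codes) for d in range(num_dims)]
    highs = [max(code[d] for code in codes) for d in range(num_dims)]

    # Each coordinate is drawn independently and uniformly within its range
    samples = [
        [rng.uniform(lows[d], highs[d]) for d in range(num_dims)]
        for _ in range(num_samples)
    ]
    return samples, lows, highs


# Stand-in latent codes (hypothetical values, 3 dimensions for readability)
codes = [[0.0, 2.5, 1.0], [3.0, 0.5, 4.0], [1.5, 1.0, 0.0]]
samples, lows, highs = sample_from_bounding_box(codes, num_samples=5)

for s in samples:
    print([round(v, 2) for v in s])
```

Each sampled vector would then be converted to a tensor and passed through the decoder. Nothing forces a point inside the box to lie near any real code, and the empty fraction of the box tends to grow with the number of dimensions, which is consistent with the degradation described above for 32 hidden units.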
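Variational autoencoder (VAE)
The paper on variational autoencoders (VAEs), "Auto-Encoding Variational Bayes", was published by Diederik P. Kingma and Max Welling in 2014.
VAEs give us a more flexible way to learn a latent representation of the domain. The basic idea is simple: we learn the parameters of the latent space distribution rather than specific values. When producing a latent variable, we do not take it directly from the latent representation; instead we use the learned distribution parameters to generate it.
We use the input data to learn a vector of means and a vector of variances, and later use them to sample the latent variable that the decoder will decode.
However, the sampling operation is not differentiable, so a technique called the reparameterization trick is used. It works like this: instead of drawing samples from that block directly, we explicitly learn two vectors, the mean and the variance, and keep a separate, independent block that is the only place where random sampling happens.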
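So instead of sampling from these distributions directly, we do the following:
$$z_i = \mu_i + \sigma_i \, \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1)$$
Here $z_i$ is one component of the latent representation. The model now looks like this: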
The model code above changes accordingly:
```python
class VAE(AutoEncoder):
    def __init__(self):
        super().__init__()
        # Add mu and log_var layers for reparameterization
        self.mu = nn.Linear(self.num_hidden, self.num_hidden)
        self.log_var = nn.Linear(self.num_hidden, self.num_hidden)

    def reparameterize(self, mu, log_var):
        # Compute the standard deviation from the log variance
        std = torch.exp(0.5 * log_var)
        # Generate random noise using the same shape as std
        eps = torch.randn_like(std)
        # Return the reparameterized sample
        return mu + eps * std

    def forward(self, x):
        # Pass the input through the encoder
        encoded = self.encoder(x)
        # Compute the mean and log variance vectors
        mu = self.mu(encoded)
        log_var = self.log_var(encoded)
        # Reparameterize the latent variable
        z = self.reparameterize(mu, log_var)
        # Pass the latent variable through the decoder
        decoded = self.decoder(z)
        # Return the encoded output, decoded output, mean, and log variance
        return encoded, decoded, mu, log_var

    def sample(self, num_samples):
        with torch.no_grad():
            # Generate random noise from the standard normal prior
            z = torch.randn(num_samples, self.num_hidden).to(device)
            # Pass the noise through the decoder to generate samples
            samples = self.decoder(z)
        # Return the generated samples
        return samples
```
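To see why `reparameterize` keeps the sampling step differentiable, here is a minimal standalone sketch. The parameter values are made up for illustration. All randomness is isolated in `eps`, so autograd can propagate gradients to `mu` and `log_var` as if `eps` were a constant.

```python
import torch

torch.manual_seed(0)

# Hypothetical distribution parameters for a 3-dimensional latent space
mu = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
log_var = torch.tensor([0.0, -2.0, 1.0], requires_grad=True)

# Reparameterization: the noise eps carries no gradient of its own
std = torch.exp(0.5 * log_var)
eps = torch.randn_like(std)
z = mu + eps * std

# Backpropagate through the sample, treating z.sum() as a stand-in loss
z.sum().backward()

print("dz/dmu      =", mu.grad)       # 1 for every component
print("dz/dlog_var =", log_var.grad)  # 0.5 * std * eps
```

If `z` were instead drawn directly from a sampler that does not support gradients, there would be no path from the loss back to the layers that produce `mu` and `log_var`.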
How do we train this model? First, let's define the loss function:
```python
import torch.nn.functional as F


# Define a loss function that combines binary cross-entropy and Kullback-Leibler divergence
def loss_function(recon_x, x, mu, logvar):
    # Compute the binary cross-entropy loss between the reconstructed output and the input data
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")
    # Compute the Kullback-Leibler divergence between the learned latent
    # distribution and a standard Gaussian distribution
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Combine the two losses by adding them together and return the result
    return BCE + KLD
```
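The first component is already familiar: it is simply the reconstruction error. The second introduces a penalty for the learned distribution drifting too far from the prior. Recall the KL divergence, which measures how much one probability distribution differs from another; here we compare our learned distribution with the standard normal distribution. Note that the training function below does not call `loss_function` directly: it uses a summed MSE as the reconstruction term instead of BCE and weights the KL term by a factor of 3. Training then looks like this: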
```python
def train_vae(X_train, learning_rate=1e-3, num_epochs=10, batch_size=32):
    # Convert the training data to a PyTorch tensor
    # (as_tensor accepts both NumPy arrays and existing tensors)
    X_train = torch.as_tensor(X_train).to(device)

    # Create the VAE model and optimizer
    model = VAE()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Define the reconstruction loss, summed over all pixels in the batch
    criterion = nn.MSELoss(reduction="sum")

    # Move the model to the selected device
    model.to(device)

    # Create a DataLoader to handle batching of the training data
    train_loader = torch.utils.data.DataLoader(
        X_train, batch_size=batch_size, shuffle=True
    )

    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0.0
        for batch_idx, data in enumerate(train_loader):
            # Get a batch of training data and move it to the device
            data = data.to(device)

            # Forward pass
            encoded, decoded, mu, log_var = model(data)

            # Compute the loss and perform backpropagation
            KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
            loss = criterion(decoded, data) + 3 * KLD
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Update the running loss; it is already summed over the batch,
            # so it must not be multiplied by the batch size again
            total_loss += loss.item()

        # Print the average loss per training sample
        epoch_loss = total_loss / len(train_loader.dataset)
        print(
            "Epoch {}/{}: loss={:.4f}".format(epoch + 1, num_epochs, epoch_loss)
        )

    # Return the trained model
    return model
```
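The `KLD` expression used in the loss is the closed-form KL divergence between a diagonal Gaussian and the standard normal. As a sanity check, the sketch below compares that closed form with a brute-force numerical integral for a single dimension, using only the standard library. The function names and the test values (mean 1.0, variance 0.25) are our own choices for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, exp(log_var)) || N(0, 1) ), summed over dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def kl_numeric_1d(mu, log_var, half_width=12.0, steps=40001):
    """Riemann-sum estimate of the same KL divergence for one dimension."""
    sigma = math.exp(0.5 * log_var)
    lo = mu - half_width * sigma
    dx = 2 * half_width * sigma / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = lo + i * dx
        # Log densities of q = N(mu, sigma^2) and p = N(0, 1)
        log_q = -0.5 * math.log(2 * math.pi) - 0.5 * log_var - (x - mu) ** 2 / (2 * sigma ** 2)
        log_p = -0.5 * math.log(2 * math.pi) - 0.5 * x ** 2
        total += math.exp(log_q) * (log_q - log_p) * dx
    return total


closed = kl_to_standard_normal([1.0], [math.log(0.25)])
numeric = kl_numeric_1d(1.0, math.log(0.25))
print("closed form:", closed)
print("numerical:  ", numeric)

# The penalty vanishes exactly when the learned distribution equals the prior
at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
print("at the prior:", at_prior)
```

The two estimates agree closely, and the penalty is zero only when the mean is 0 and the log variance is 0, which is what pulls the learned latent distribution toward something we can sample with `torch.randn`.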
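Let's look at the images we generate:
The images still look blurry. This is because we use MSE as the reconstruction term; other losses would give better results.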
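One last interesting experiment is to generate a random vector and then gradually change one of its dimensions while keeping the others fixed. This lets us see how the decoder output changes. Here are some examples:
By changing a single component we move from a 0 to a 9, and then on to a 1 or a 7.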
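The traversal described above can be sketched as follows. This is a minimal illustration: the helper name `traverse_dimension`, the chosen dimension, and the sweep range of -3 to 3 are all our own assumptions, and the decoding step is left as a comment because it needs a trained model.

```python
import torch


def traverse_dimension(z, dim, values):
    """Return one latent vector per entry of `values`, differing from z only in `dim`."""
    batch = z.repeat(len(values), 1)  # shape: (len(values), num_hidden)
    batch[:, dim] = values            # overwrite a single coordinate
    return batch


torch.manual_seed(0)
num_hidden = 8                               # matches self.num_hidden in the model
z = torch.randn(num_hidden)                  # one random latent vector
values = torch.linspace(-3.0, 3.0, steps=7)  # sweep range is an arbitrary choice
latents = traverse_dimension(z, dim=2, values=values)

# With a trained VAE called `model` (hypothetical variable name), decode with:
# with torch.no_grad():
#     images = model.decoder(latents.to(device))
print(latents.shape)
```

Passing the decoded batch to `show_images` would then display the images side by side, one per value of the varied coordinate.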
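Generating images with a specific label: CVAE
To generate images with a specific label, the decoder needs to learn how to decode the latent variable when given a hint. In other words, we constrain the model with some extra information. How can we do that? An obvious idea is to pass an encoded digit label to the decoder so that it can learn a label-aware decoding process. It looks like this: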
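The decoder now has an additional source of information. Why do we apply a linear projection first and then sum? The label encoding and the code layer do not have matching sizes, so we project the label to the same dimension as the latent space and then add the two together. We could also take the mean, multiply pointwise, or simply concatenate the vectors; any similar method works, and here we just add them. Look at the diagram: doesn't it resemble a diffusion model a little? (And the CVAE idea dates back to the mid-2010s.)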
At inference time, all we need to do is pass in the label of the digit we want to generate. The code becomes:
```python
class ConditionalVAE(VAE):
    # Extends the VAE class defined above
    def __init__(self, num_classes):
        super().__init__()
        # Add a linear layer that projects the class label to the latent size
        self.label_projector = nn.Sequential(
            nn.Linear(num_classes, self.num_hidden),
            nn.ReLU(),
        )

    def condition_on_label(self, z, y):
        # Project the one-hot label and add it to the latent vector
        projected_label = self.label_projector(y.float())
        return z + projected_label

    def forward(self, x, y):
        # Pass the input through the encoder
        encoded = self.encoder(x)
        # Compute the mean and log variance vectors
        mu = self.mu(encoded)
        log_var = self.log_var(encoded)
        # Reparameterize the latent variable
        z = self.reparameterize(mu, log_var)
        # Pass the label-conditioned latent variable through the decoder
        decoded = self.decoder(self.condition_on_label(z, y))
        # Return the encoded output, decoded output, mean, and log variance
        return encoded, decoded, mu, log_var

    def sample(self, num_samples, y):
        with torch.no_grad():
            # Generate random noise
            z = torch.randn(num_samples, self.num_hidden).to(device)
            # Pass the label-conditioned noise through the decoder to generate samples
            samples = self.decoder(self.condition_on_label(z, y))
        # Return the generated samples
        return samples
```
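There is a new layer here called label_projector, which performs the linear projection (followed by a ReLU). In both the forward pass and sampling, the label is passed through this layer and the result is added to the latent vector before decoding.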
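The text mentions that summing is only one way to combine the label with the latent vector. The sketch below lays out the alternatives side by side on random stand-in tensors so the shapes are visible. The variable names are ours, and the `projector` here is a bare `nn.Linear` without the ReLU used in `label_projector`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_hidden, num_classes, batch_size = 8, 10, 4

z = torch.randn(batch_size, num_hidden)                 # stand-in latent samples
labels = torch.tensor([8, 8, 3, 0])                     # example digit labels
y = F.one_hot(labels, num_classes=num_classes).float()  # shape: (4, 10)

projector = nn.Linear(num_classes, num_hidden)          # 10 -> 8, so shapes match z
projected = projector(y)                                # shape: (4, 8)

summed = z + projected                   # the approach taken in condition_on_label
averaged = (z + projected) / 2           # same shape, smaller magnitude
multiplied = z * projected               # pointwise product
concatenated = torch.cat([z, y], dim=1)  # no projection needed; shape: (4, 18)

for name, t in [("sum", summed), ("mean", averaged),
                ("product", multiplied), ("concat", concatenated)]:
    print(name, tuple(t.shape))
```

The first three options keep the decoder input at `num_hidden` values, so the decoder can stay unchanged. Concatenation widens the input to `num_hidden + num_classes`, which would require resizing the first decoder layer.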
The CVAE loss is the same as the VAE loss, and training is essentially the same, so here we only look at the results:
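```python
from torch.nn.functional import one_hot

# `cvae` is assumed to be a trained ConditionalVAE instance
num_samples = 10
random_labels = [8] * num_samples
show_images(
    cvae.sample(
        num_samples,
        one_hot(torch.LongTensor(random_labels), num_classes=10).to(device),
    )
    .cpu()
    .detach()
    .numpy(),
    labels=random_labels,
)
```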
As we can see, the generated images are now essentially fixed to the requested digit, and no other digits appear.
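Since the CVAE training loop is skipped in the text, here is a sketch of the one part that has to differ from `train_vae`: each batch must carry its labels alongside the images. Random tensors stand in for MNIST, and the use of `TensorDataset` is our assumption rather than something the original shows.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-ins with the same shapes as the MNIST X_train / y_train
X = torch.rand(64, 784)
y = torch.randint(0, 10, (64,))

# Pair each image with its label so both arrive in the same batch
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for data, labels in loader:
    y_onehot = F.one_hot(labels, num_classes=10)  # shape: (batch, 10)
    # With a ConditionalVAE instance `model` (hypothetical variable name):
    # encoded, decoded, mu, log_var = model(data, y_onehot)
    # ...then compute the same reconstruction + KL loss as in train_vae
    print(tuple(data.shape), tuple(y_onehot.shape))
```

Everything else (optimizer, loss, backpropagation) can follow the VAE training function unchanged.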
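Summary
Autoencoders are a foundation for understanding unsupervised learning and data compression. Simple autoencoders can reconstruct images, but they struggle to generate new data. Variational autoencoders (VAEs) offer a more flexible approach: they learn the parameters of a latent space distribution that can be sampled to generate new data, and the reparameterization trick makes the sampling step differentiable. CVAEs in turn add support for conditioning, so studying these models provides a solid theoretical basis for understanding Stable Diffusion models.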
https://avoid.overfit.cn/post/57bd9ac6acbb4fe0987bdfc1819d1c59
Author: Konstantin Sofeikov