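In this article we implement DiffEdit, a paper recently published by researchers at Meta AI and Sorbonne Université. It should be useful if you are familiar with the Stable Diffusion process or want to understand how DiffEdit works.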
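What is DiffEdit?
Put simply, DiffEdit can be seen as a more controlled version of image-to-image generation. DiffEdit takes three inputs:
- An input image
- A caption: text describing the input image
- A target query text: text describing the new image you want to generate
The model generates a modified version of the original image according to the query text. DiffEdit is very effective when you want to make a small adjustment to a real image without changing it completely.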
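In the figure taken from the paper, the authors create a mask from the input image that identifies the part of the image containing fruit (shown in orange), and then run masked diffusion to replace the fruit with pears. The figure gives a good visual overview of the whole DiffEdit process.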
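In this paper, generating the mask appears to be the most important step; the rest is conditioning the diffusion process on text. Conditioning the image on a mask is similar in idea to the inpainting implementation in Hugging Face. As the authors describe it, the DiffEdit process has three steps:
Step 1: Add noise to the input image and denoise it: once conditioned on the query text and once conditioned on the reference text (or unconditionally, that is, without any text). Derive a mask from the difference between the two denoising results.
Step 2: Encode the input image with DDIM to estimate the latents corresponding to the input image.
Step 3: Perform DDIM decoding conditioned on the text query, using the inferred mask to replace the background with pixel values coming from the encoding process at the corresponding timestep.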
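To make Steps 2 and 3 more concrete, here is a minimal NumPy sketch of a deterministic DDIM encode/decode round trip. It is a toy under stated assumptions, not the paper's implementation: the noise schedule `alphas_cumprod` is made up, and the U-Net is replaced by an oracle that always returns the same noise `eps`, which is what makes the round trip exact here. With a real U-Net the noise prediction changes with the sample and the timestep, so the inversion is only approximate.

```python
import numpy as np

# Toy cumulative-alpha schedule (assumed values): index 0 is almost clean,
# the last index is heavily noised.
alphas_cumprod = np.linspace(0.999, 0.05, 11)

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update from noise level a_from to noise level a_to."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

rng = np.random.default_rng(0)
x_start = rng.normal(size=(4, 8, 8))   # stand-in for image latents
eps = rng.normal(size=(4, 8, 8))       # oracle "U-Net" output (assumption)

# Step 2 (encode): walk towards higher noise levels.
x = x_start.copy()
for i in range(len(alphas_cumprod) - 1):
    x = ddim_move(x, eps, alphas_cumprod[i], alphas_cumprod[i + 1])
x_encoded = x

# Step 3 (decode): walk back along the same path.
for i in range(len(alphas_cumprod) - 1, 0, -1):
    x = ddim_move(x, eps, alphas_cumprod[i], alphas_cumprod[i - 1])
x_decoded = x

roundtrip_error = float(np.abs(x_decoded - x_start).max())
print(roundtrip_error)
```

Because the update is deterministic, decoding retraces the encoding path, which is why the encoded latents can be reused for the background in Step 3.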
import torch, logging
## disable warnings
logging.disable(logging.WARNING)
## Imaging library
from PIL import Image
from torchvision import transforms as tfms
## Basic libraries
from fastdownload import FastDownload
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
import shutil
import os
## For video display
from IPython.display import HTML
from base64 import b64encode
## Import the CLIP artifacts
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
## Helper functions
def load_artifacts():
    '''A function to load all diffusion artifacts'''
    vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", torch_dtype=torch.float16).to("cuda")
    unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16)
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")
    scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
    return vae, unet, tokenizer, text_encoder, scheduler

def load_image(p):
    '''Function to load images from a defined path'''
    return Image.open(p).convert('RGB').resize((512,512))

def pil_to_latents(image):
    '''Function to convert image to latents'''
    # Scale pixels from [0, 1] to [-1, 1] before encoding
    init_image = tfms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0
    init_image = init_image.to(device="cuda", dtype=torch.float16)
    init_latent_dist = vae.encode(init_image).latent_dist.sample() * 0.18215
    return init_latent_dist

def latents_to_pil(latents):
    '''Function to convert latents to images'''
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    # Scale back from [-1, 1] to [0, 1], then to uint8 pixels
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images

def text_enc(prompts, maxlen=None):
    '''A function to take a textual prompt and convert it into embeddings'''
    if maxlen is None: maxlen = tokenizer.model_max_length
    inp = tokenizer(prompts, padding="max_length", max_length=maxlen, truncation=True, return_tensors="pt")
    return text_encoder(inp.input_ids.to("cuda"))[0].half()
vae, unet, tokenizer, text_encoder, scheduler = load_artifacts()
p = FastDownload().download('https://images.pexels.com/photos/1996333/pexels-photo-1996333.jpeg?cs=srgb&dl=pexels-helena-lopes-1996333.jpg&fm=jpg&_gl=1*1pc0nw8*_ga*OTk4MTI0MzE4LjE2NjY1NDQwMjE.*_ga_8JE65Q40S6*MTY2Njc1MjIwMC4yLjEuMTY2Njc1MjIwMS4wLjAuMA..')
init_img = load_image(p)
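The helpers above move images between pixel space and the value range the VAE expects. As a rough illustration of just that normalization, the NumPy sketch below mirrors the `* 2.0 - 1.0` step in `pil_to_latents` and the `/ 2 + 0.5` step in `latents_to_pil`. The function names `to_model_range` and `to_pixel_range` are hypothetical, and the VAE itself (with its 0.18215 latent scaling factor) is left out.

```python
import numpy as np

def to_model_range(pixels_uint8):
    """uint8 HxWxC in [0, 255] -> float CxHxW in [-1, 1]."""
    x = pixels_uint8.astype(np.float32) / 255.0   # what ToTensor() does to the values
    return np.transpose(x, (2, 0, 1)) * 2.0 - 1.0

def to_pixel_range(x):
    """float CxHxW in [-1, 1] -> uint8 HxWxC in [0, 255]."""
    x = np.clip(x / 2.0 + 0.5, 0.0, 1.0)
    return (np.transpose(x, (1, 2, 0)) * 255).round().astype("uint8")

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # tiny random "image"
x = to_model_range(img)
restored = to_pixel_range(x)
print(x.shape, restored.shape)
```

The round trip recovers the original pixels, which is a quick sanity check that the two scaling steps are inverses of each other.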
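Code implementation of DiffEdit
1. Mask creation: the first step of the DiffEdit process
The paper explains the first step in more detail; here we only look at the key points:
- Denoise the image under different text conditions (reference text and query text) and take the difference between the results. The idea is that the regions that need to change differ more between the two results, while the background changes little.
- Repeat this differencing process 10 times.
- Average the differences and binarize the result.
Note that the third part of mask creation (averaging and binarization) is not explained clearly in the paper, and it took me a lot of experimentation to get it right.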
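One detail that is easy to miss in the averaging step is the absolute value. The toy NumPy sketch below uses made-up difference maps (not real latents) in which the edited region flips sign from run to run: a plain average cancels out, while the average of absolute differences keeps the signal.

```python
import numpy as np

n, h, w = 10, 8, 8
# Hypothetical per-run differences: zero in the background, +1 or -1
# (alternating between runs) inside a 3x3 "edited" region.
diffs = np.zeros((n, h, w))
signs = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
diffs[:, 2:5, 2:5] = signs[:, None, None]

signed_mean = diffs.mean(axis=0)       # signs cancel: the region disappears
abs_mean = np.abs(diffs).mean(axis=0)  # magnitudes accumulate: the region stands out

print(signed_mean[3, 3], abs_mean[3, 3])
```

Real differences will not cancel this perfectly, but the same effect weakens the mask whenever the sign of the difference varies between runs.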
The prompt_2_img_i2i function below returns the latents rather than the rescaled and decoded denoised image.
def prompt_2_img_i2i(prompts, init_img, neg_prompts=None, g=7.5, seed=100, strength =0.8, steps=50, dim=512):
    """Diffusion process to convert prompt to image"""
    # Converting textual prompts to embedding
    text = text_enc(prompts)
    # Adding an unconditional prompt, helps in the generation process
    if not neg_prompts: uncond = text_enc([""], text.shape[1])
    else: uncond = text_enc(neg_prompts, text.shape[1])
    emb = torch.cat([uncond, text])
    # Setting the seed (0 is a valid seed, so compare against None)
    if seed is not None: torch.manual_seed(seed)
    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)
    # Convert the seed image to latent
    init_latents = pil_to_latents(init_img)
    # Figuring initial time step based on strength
    init_timestep = int(steps * strength)
    timesteps = scheduler.timesteps[-init_timestep]
    timesteps = torch.tensor([timesteps], device="cuda")
    # Adding noise to the latents
    noise = torch.randn(init_latents.shape, generator=None, device="cuda", dtype=init_latents.dtype)
    init_latents = scheduler.add_noise(init_latents, noise, timesteps)
    latents = init_latents
    # Computing the timestep to start the diffusion loop
    t_start = max(steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start:].to("cuda")
    # Iterating through defined steps
    for i,ts in enumerate(tqdm(timesteps)):
        # We need to scale the i/p latents to match the variance
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)
        # Predicting noise residual using U-Net
        with torch.no_grad(): u,t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)
        # Performing Guidance
        pred = u + g*(t-u)
        # Conditioning the latents
        #latents = scheduler.step(pred, ts, latents).pred_original_sample
        latents = scheduler.step(pred, ts, latents).prev_sample
    # Returning the latents (shape 1x4x64x64); callers index [0] to get 4x64x64
    return latents.detach().cpu()
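The way `strength` and `steps` pick the starting noise level in the function above can be checked without a GPU. The sketch below is only an approximation: `ddim_timesteps` is a hypothetical helper that assumes 1000 training timesteps and evenly spaced inference timesteps with no offset, which may differ from what a given diffusers version produces.

```python
import numpy as np

def ddim_timesteps(steps, num_train_timesteps=1000):
    """Evenly spaced timesteps, highest noise first (assumed spacing, no offset)."""
    step_ratio = num_train_timesteps // steps
    return (np.arange(steps) * step_ratio)[::-1]

def start_schedule(steps, strength):
    """Return the timestep used for add_noise and the timesteps left to denoise."""
    timesteps = ddim_timesteps(steps)
    init_timestep = int(steps * strength)
    noise_level = int(timesteps[-init_timestep])
    t_start = max(steps - init_timestep, 0)
    return noise_level, timesteps[t_start:]

noise_level_08, remaining_08 = start_schedule(steps=50, strength=0.8)
noise_level_05, remaining_05 = start_schedule(steps=50, strength=0.5)
print(noise_level_08, len(remaining_08))
print(noise_level_05, len(remaining_05))
```

A higher strength means the image is noised further and more denoising steps are run, so the result can drift further from the input.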
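The next step is the create_mask function. Its arguments are the initial image, the reference prompt, the query prompt, and the number of times the process should be repeated. The authors report that n=10 and a strength of 0.5 worked well in their experiments, so these are the function's defaults. create_mask performs the following steps:
- Create two denoised latents, one conditioned on the reference text and one conditioned on the query text, and take their difference
- Repeat this step n times
- Take the average of these differences and standardize it
- Binarize with a threshold to create the mask (0.5 is the threshold named for this step; note that the code standardizes the values and thresholds them at 0, that is, at the mean)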
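How the averaged differences are normalized before thresholding changes which pixels end up in the mask. The NumPy sketch below compares two options on a made-up 3x3 difference map: standardizing and keeping values above the mean, and a hypothetical alternative that rescales to [0, 1] and thresholds at 0.5. Neither function is taken from the paper's code; they only illustrate the difference.

```python
import numpy as np

def binarize_zscore(m):
    """Standardize, then keep everything above the mean."""
    z = (m - m.mean()) / m.std()
    return (z > 0).astype("uint8")

def binarize_minmax(m, threshold=0.5):
    """Rescale to [0, 1], then apply a fixed threshold (hypothetical alternative)."""
    scaled = (m - m.min()) / (m.max() - m.min())
    return (scaled > threshold).astype("uint8")

# Made-up averaged absolute differences; 0.9 is a borderline value.
m = np.array([[0.1, 0.2, 0.1],
              [0.2, 2.0, 1.8],
              [0.1, 1.9, 0.9]])

mask_z = binarize_zscore(m)
mask_mm = binarize_minmax(m)
print(mask_z)
print(mask_mm)
```

In this example the mean-based rule also keeps the borderline pixel, so it yields a slightly larger mask than the fixed 0.5 threshold.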
def create_mask(init_img, rp, qp, n=10, s=0.5):
    ## Initialize a dictionary to save n iterations
    diff = {}
    ## Repeating the difference process n times
    for idx in range(n):
        ## Both calls share a seed, so only the text conditioning differs
        ## Creating denoised sample using reference / original text
        orig_noise = prompt_2_img_i2i(prompts=rp, init_img=init_img, strength=s, seed = 100*idx)[0]
        ## Creating denoised sample using query / target text
        query_noise = prompt_2_img_i2i(prompts=qp, init_img=init_img, strength=s, seed = 100*idx)[0]
        ## Taking the difference
        diff[idx] = (np.array(orig_noise)-np.array(query_noise))
    ## Creating a mask placeholder
    mask = np.zeros_like(diff[0])
    ## Summing the absolute differences over the n iterations
    ## (the normalization below makes dividing by n unnecessary)
    for idx in range(n):
        ## Note np.abs is a key step
        mask += np.abs(diff[idx])
    ## Averaging multiple channels
    mask = mask.mean(0)
    ## Normalizing
    mask = (mask - mask.mean()) / np.std(mask)
    ## Binarizing and returning the mask object
    return (mask > 0).astype("uint8")
mask = create_mask(init_img=init_img, rp=["a horse image"], qp=["a zebra image"], n=10)
plt.imshow(np.array(init_img), cmap='gray') # I would add interpolation='none'
plt.imshow(
    Image.fromarray(mask).resize((512,512)), ## Scaling the mask to original size
    alpha=0.5*(np.array(Image.fromarray(mask*255).resize((512,512))) > 0)
)
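2. Masked diffusion: steps 2 and 3 of the DiffEdit paper
Steps 2 and 3 need to be implemented in the same loop because, as the authors describe it, the non-masked part is conditioned on the reference text and the masked part on the query text. The two parts are combined into a single latent with a simple formula: latents = mask * qlatents + (1 - mask) * rlatents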
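The blending of the two sets of latents can be tried in isolation. In the NumPy sketch below, random arrays stand in for the query-conditioned and reference-conditioned latents, and a square stands in for the mask; the shapes (1x4x64x64 latents, 64x64 mask) are assumptions matching the latent size used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
rlatents = rng.normal(size=(1, 4, 64, 64))  # stand-in: reference-conditioned latents
qlatents = rng.normal(size=(1, 4, 64, 64))  # stand-in: query-conditioned latents

mask = np.zeros((64, 64), dtype="uint8")
mask[16:48, 16:48] = 1                      # 1 = region to edit

# Add batch and channel axes so the mask broadcasts over all 4 latent channels.
mask4d = mask[None, None].astype(rlatents.dtype)
latents = mask4d * qlatents + (1 - mask4d) * rlatents

print(latents.shape)
```

Inside the mask the result equals the query latents and outside it equals the reference latents, so the background is carried over unchanged at every step.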
def prompt_2_img_diffedit(rp, qp, init_img, mask, g=7.5, seed=100, strength =0.7, steps=70, dim=512):
    """Masked diffusion process: edits the masked region according to the query prompt"""
    # Converting textual prompts to embedding
    rtext = text_enc(rp)
    qtext = text_enc(qp)
    # Adding an unconditional prompt, helps in the generation process
    uncond = text_enc([""], rtext.shape[1])
    emb = torch.cat([uncond, rtext, qtext])
    # Setting the seed (0 is a valid seed, so compare against None)
    if seed is not None: torch.manual_seed(seed)
    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)
    # Convert the seed image to latent
    init_latents = pil_to_latents(init_img)
    # Figuring initial time step based on strength
    init_timestep = int(steps * strength)
    timesteps = scheduler.timesteps[-init_timestep]
    timesteps = torch.tensor([timesteps], device="cuda")
    # Adding noise to the latents
    noise = torch.randn(init_latents.shape, generator=None, device="cuda", dtype=init_latents.dtype)
    init_latents = scheduler.add_noise(init_latents, noise, timesteps)
    latents = init_latents
    # Computing the timestep to start the diffusion loop
    t_start = max(steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start:].to("cuda")
    # Converting mask to torch tensor of shape 1x1x64x64
    mask = torch.tensor(mask, dtype=unet.dtype).unsqueeze(0).unsqueeze(0).to("cuda")
    # Iterating through defined steps
    for i,ts in enumerate(tqdm(timesteps)):
        # We need to scale the i/p latents to match the variance
        inp = scheduler.scale_model_input(torch.cat([latents] * 3), ts)
        # Predicting noise residual using U-Net
        with torch.no_grad(): u, rt, qt = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(3)
        # Performing Guidance
        rpred = u + g*(rt-u)
        qpred = u + g*(qt-u)
        # Conditioning the latents
        rlatents = scheduler.step(rpred, ts, latents).prev_sample
        qlatents = scheduler.step(qpred, ts, latents).prev_sample
        # Query latents inside the mask, reference latents outside it
        latents = mask*qlatents + (1-mask)*rlatents
    # Decoding the final latents and returning a list of PIL images
    return latents_to_pil(latents)
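The guidance lines in the function above (`rpred = u + g*(rt-u)` and `qpred = u + g*(qt-u)`) apply classifier-free guidance: the unconditional prediction is pushed along the direction of the text-conditioned one. A small NumPy sketch with made-up three-element "predictions" shows how the scale `g` behaves; the helper name `guide` is hypothetical.

```python
import numpy as np

def guide(uncond, cond, g):
    """Classifier-free guidance: extrapolate from uncond towards cond by a factor g."""
    return uncond + g * (cond - uncond)

u = np.array([0.0, 1.0, -1.0])   # made-up unconditional prediction
t = np.array([1.0, 1.0, 0.0])    # made-up text-conditioned prediction

no_text = guide(u, t, 0.0)       # g = 0 ignores the text entirely
plain = guide(u, t, 1.0)         # g = 1 is the conditioned prediction itself
strong = guide(u, t, 7.5)        # g > 1 exaggerates the effect of the text
print(no_text, plain, strong)
```

Elements where the two predictions agree are unaffected by `g`; only the components that the text changes are amplified.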
output = prompt_2_img_diffedit(
    rp = ["a horse image"],
    qp=["a zebra image"],
    init_img=init_img,
    mask = mask,
    g=7.5, seed=100, strength =0.5, steps=70, dim=512)
## Plotting side by side
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
for c, img in enumerate([init_img, output[0]]):
    axs[c].imshow(img)
    if c == 0 : axs[c].set_title(f"Initial image")
    else: axs[c].set_title(f"DiffEdit output")
def diffEdit(init_img, rp , qp, g=7.5, seed=100, strength =0.7, steps=70, dim=512):
    ## Step 1: Create mask
    mask = create_mask(init_img=init_img, rp=rp, qp=qp)
    ## Step 2 and 3: Diffusion process using mask
    output = prompt_2_img_diffedit(
        rp = rp,
        qp = qp,
        init_img = init_img,
        mask = mask,
        g = g,
        seed = seed,
        strength =strength,
        steps = steps,
        dim = dim)
    return mask , output
We can also create a visualization function for DiffEdit that shows the original input image, the mask, and the final output image.
def plot_diffEdit(init_img, output, mask):
    ## Plotting side by side
    fig, axs = plt.subplots(1, 3, figsize=(12, 6))
    ## Visualizing initial image
    axs[0].imshow(init_img)
    axs[0].set_title(f"Initial image")
    ## Visualizing the final output (output is a list of PIL images)
    axs[2].imshow(output[0])
    axs[2].set_title(f"DiffEdit output")
    ## Visualizing the mask
    axs[1].imshow(np.array(init_img), cmap='gray')
    axs[1].imshow(
        Image.fromarray(mask).resize((512,512)), ## Scaling the mask to original size
        alpha=0.5*(np.array(Image.fromarray(mask*255).resize((512,512))) > 0)
    )
    axs[1].set_title(f"DiffEdit mask")
p = FastDownload().download('https://images.pexels.com/photos/1996333/pexels-photo-1996333.jpeg?cs=srgb&dl=pexels-helena-lopes-1996333.jpg&fm=jpg&_gl=1*1pc0nw8*_ga*OTk4MTI0MzE4LjE2NjY1NDQwMjE.*_ga_8JE65Q40S6*MTY2Njc1MjIwMC4yLjEuMTY2Njc1MjIwMS4wLjAuMA..')
init_img = load_image(p)
mask, output = diffEdit(
    init_img,
    rp = ["a horse image"],
    qp=["a zebra image"]
)
plot_diffEdit(init_img, output, mask)
p = FastDownload().download('https://raw.githubusercontent.com/johnrobinsn/diffusion_experiments/main/images/bowloberries_scaled.jpg')
init_img = load_image(p)
mask, output = diffEdit(
    init_img,
    rp = ['Bowl of Strawberries'],
    qp=['Bowl of Grapes']
)
plot_diffEdit(init_img, output, mask)
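FastDiffEdit: a faster DiffEdit implementation
We have now seen our own hand-written implementation, but it has not been optimized in any way. To get better speed and better results, we can make some improvements to the original DiffEdit process. We call these improvements FastDiffEdit.
1. Mask creation: the FastDiffEdit masking process
The biggest problem with mask creation is that it takes too long (about 50 seconds on an A4500 GPU). We probably do not need to run a full diffusion loop to denoise the image. Instead, we can use the U-Net's prediction of the original sample from a single step and increase the number of repetitions to 20. This reduces the computation from 10*25 = 250 steps to 20 steps (roughly 12 times fewer). Let's see whether this works in practice.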
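The step counts quoted above follow directly from the defaults used in this article (n=10 repetitions, 50 scheduler steps, strength 0.5 for the full version; n=20 single-step repetitions for the fast one). The short sketch below just spells out that arithmetic; the helper names are hypothetical, and the counts are per prompt, so running both the reference and the query prompt doubles both numbers without changing the ratio.

```python
def unet_steps_full(n, steps, strength):
    """Denoising steps for n repetitions of a partial diffusion loop."""
    return n * int(steps * strength)

def unet_steps_fast(n):
    """One U-Net prediction per repetition."""
    return n

full = unet_steps_full(n=10, steps=50, strength=0.5)
fast = unet_steps_fast(n=20)
ratio = full / fast
print(full, fast, ratio)
```

The measured speedup can be smaller than this ratio, since encoding the image and the prompts still has to be done in every repetition.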
def prompt_2_img_i2i_fast(prompts, init_img, g=7.5, seed=100, strength =0.5, steps=50, dim=512):
    """Single-step variant: noise the image once and predict the original sample"""
    # Converting textual prompts to embedding
    text = text_enc(prompts)
    # Adding an unconditional prompt, helps in the generation process
    uncond = text_enc([""], text.shape[1])
    emb = torch.cat([uncond, text])
    # Setting the seed (0 is a valid seed, so compare against None)
    if seed is not None: torch.manual_seed(seed)
    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)
    # Convert the seed image to latent
    init_latents = pil_to_latents(init_img)
    # Figuring initial time step based on strength
    init_timestep = int(steps * strength)
    timesteps = scheduler.timesteps[-init_timestep]
    timesteps = torch.tensor([timesteps], device="cuda")
    # Adding noise to the latents
    noise = torch.randn(init_latents.shape, generator=None, device="cuda", dtype=init_latents.dtype)
    init_latents = scheduler.add_noise(init_latents, noise, timesteps)
    latents = init_latents
    # We need to scale the i/p latents to match the variance
    inp = scheduler.scale_model_input(torch.cat([latents] * 2), timesteps)
    # Predicting noise residual using U-Net
    with torch.no_grad(): u,t = unet(inp, timesteps, encoder_hidden_states=emb).sample.chunk(2)
    # Performing Guidance
    pred = u + g*(t-u)
    # Zero shot prediction: estimate the original sample from a single step
    latents = scheduler.step(pred, timesteps, latents).pred_original_sample
    # Returning the latents (shape 1x4x64x64); callers index [0] to get 4x64x64
    return latents.detach().cpu()
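The single-step estimate used above (`pred_original_sample`) comes from inverting the noising formula. The NumPy sketch below illustrates the idea with a made-up noise level `alpha_bar` and random arrays standing in for latents: when the predicted noise is exact, the clean sample is recovered exactly, and any error in the prediction shows up in the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))    # stand-in for clean latents
eps = rng.normal(size=(4, 8, 8))   # the noise that was actually added
alpha_bar = 0.3                    # assumed cumulative alpha at the chosen timestep

# Forward noising: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

def predict_x0(x_t, eps_pred, alpha_bar):
    """Solve the noising formula for x0, given a noise prediction."""
    return (x_t - np.sqrt(1 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

x0_exact = predict_x0(x_t, eps, alpha_bar)                      # perfect noise prediction
eps_off = eps + 0.1 * rng.normal(size=eps.shape)                # imperfect prediction
x0_approx = predict_x0(x_t, eps_off, alpha_bar)

print(np.abs(x0_exact - x0).max(), np.abs(x0_approx - x0).max())
```

A real U-Net only approximates the noise, which is one reason the fast mask is less accurate than the one from the full loop and why more repetitions are averaged.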
Next we create a new masking function that uses prompt_2_img_i2i_fast.
def create_mask_fast(init_img, rp, qp, n=20, s=0.5):
    ## Initialize a dictionary to save n iterations
    diff = {}
    ## Repeating the difference process n times
    for idx in range(n):
        ## Both calls share a seed, so only the text conditioning differs
        ## Creating denoised sample using reference / original text
        orig_noise = prompt_2_img_i2i_fast(prompts=rp, init_img=init_img, strength=s, seed = 100*idx)[0]
        ## Creating denoised sample using query / target text
        query_noise = prompt_2_img_i2i_fast(prompts=qp, init_img=init_img, strength=s, seed = 100*idx)[0]
        ## Taking the difference
        diff[idx] = (np.array(orig_noise)-np.array(query_noise))
    ## Creating a mask placeholder
    mask = np.zeros_like(diff[0])
    ## Summing the absolute differences over the n iterations
    ## (the normalization below makes dividing by n unnecessary)
    for idx in range(n):
        ## Note np.abs is a key step
        mask += np.abs(diff[idx])
    ## Averaging multiple channels
    mask = mask.mean(0)
    ## Normalizing
    mask = (mask - mask.mean()) / np.std(mask)
    ## Binarizing and returning the mask object
    return (mask > 0).astype("uint8")
p = FastDownload().download('https://images.pexels.com/photos/1996333/pexels-photo-1996333.jpeg?cs=srgb&dl=pexels-helena-lopes-1996333.jpg&fm=jpg&_gl=1*1pc0nw8*_ga*OTk4MTI0MzE4LjE2NjY1NDQwMjE.*_ga_8JE65Q40S6*MTY2Njc1MjIwMC4yLjEuMTY2Njc1MjIwMS4wLjAuMA..')
init_img = load_image(p)
mask = create_mask_fast(init_img=init_img, rp=["a horse image"], qp=["a zebra image"], n=20)
plt.imshow(np.array(init_img), cmap='gray') # I would add interpolation='none'
plt.imshow(
    Image.fromarray(mask).resize((512,512)), ## Scaling the mask to original size
    alpha=0.5*(np.array(Image.fromarray(mask*255).resize((512,512))) > 0)
)
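The result is decent, although not as accurate as the full function, and the computation time on my machine drops from ~50 seconds to ~10 seconds (a 5x speedup). We can improve the result with some post-processing in cv2, which makes the mask a bit smoother.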
import cv2
def improve_mask(mask):
    ## Blur the 0/255 mask, then keep every pixel the blur touched
    mask = cv2.GaussianBlur(mask*255,(3,3),1) > 0
    return mask.astype('uint8')
mask = improve_mask(mask)
plt.imshow(np.array(init_img), cmap='gray') # I would add interpolation='none'
plt.imshow(
    Image.fromarray(mask).resize((512,512)), ## Scaling the mask to original size
    alpha=0.5*(np.array(Image.fromarray(mask*255).resize((512,512))) > 0)
)
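Blurring a 0/255 mask with a 3x3 kernel and then keeping every non-zero pixel effectively grows the mask by one pixel in each direction and fills one-pixel holes. The NumPy sketch below reproduces that effect without cv2 as a plain 3x3 dilation; `dilate3x3` is a hypothetical helper written for illustration, not a cv2 function.

```python
import numpy as np

def dilate3x3(mask):
    """Set a pixel to 1 if any pixel in its 3x3 neighbourhood is 1."""
    h, w = mask.shape
    padded = np.pad(mask, 1)          # zero padding around the border
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

# A single mask pixel grows into a 3x3 block.
single = np.zeros((7, 7), dtype="uint8")
single[3, 3] = 1
grown = dilate3x3(single)

# A one-pixel hole inside a filled region gets closed.
holed = np.ones((5, 5), dtype="uint8")
holed[2, 2] = 0
filled = dilate3x3(holed)

print(grown)
print(filled)
```

The trade-off is that the mask also expands slightly beyond the object, so a little more of the background is handed to the generation step.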
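2. Replacing the masked diffusion process with the 🤗 inpaint pipeline
The 🤗 diffusers library has a dedicated pipeline called the inpaint pipeline, so we can use it to perform the masked diffusion. It takes the query prompt, the initial image, and the generated mask, and returns the generated image.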
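from diffusers import StableDiffusionInpaintPipeline
## The arguments of this call are missing from the text. inpaint_model_id is a
## placeholder for the ID of a Stable Diffusion inpainting checkpoint; float16
## on CUDA is assumed, to match the rest of the code in this article.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    inpaint_model_id,
    torch_dtype=torch.float16,
).to("cuda")
Let's use the inpaint pipeline to improve the result.
## image and mask_image are filled in from the description of the pipeline's inputs above
pipe(prompt=["a zebra image"],
     image=init_img,
     mask_image=Image.fromarray(mask*255).resize((512,512)),
     num_inference_steps = 20
).images[0]
The inpaint pipeline creates a more realistic zebra image. Let's wrap the masking and the diffusion process in a simple function.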
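def fastDiffEdit(init_img, rp , qp, g=7.5, seed=100, strength =0.7, steps=20, dim=512):
    ## Step 1: Create mask
    mask = create_mask_fast(init_img=init_img, rp=rp, qp=qp, n=20)
    ## Improve masking using CV trick
    mask = improve_mask(mask)
    ## Step 2 and 3: Diffusion process using mask
    ## (prompt, image and mask_image are filled in from the description of the pipeline's inputs)
    output = pipe(
        prompt=qp,
        image=init_img,
        mask_image=Image.fromarray(mask*255).resize((512,512)),
        num_inference_steps = steps
    ).images
    return mask , output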
p = FastDownload().download('https://raw.githubusercontent.com/johnrobinsn/diffusion_experiments/main/images/bowloberries_scaled.jpg')
init_img = load_image(p)
mask, output = fastDiffEdit(init_img, rp = ['Bowl of Strawberries'], qp=['Bowl of Grapes'])
plot_diffEdit(init_img, output, mask)
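In this article we implemented the DiffEdit paper and then proposed a set of improvements, FastDiffEdit, which is about 5 times faster to compute, gives better results, and needs less code.
Author: Aayush Agrawal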