In this article we will implement DiffEdit, a paper recently published by researchers from Meta AI and Sorbonne Université. If you are familiar with the Stable Diffusion process, or simply want to understand how DiffEdit works, this post should be helpful.

What is DiffEdit?

Simply put, you can think of DiffEdit as a more controlled version of image-to-image generation. DiffEdit takes three inputs:

  1. An input image
  2. A caption describing the input image
  3. A target query text describing the new image we want to generate

The model then generates a modified version of the original image based on the query text. DiffEdit is particularly useful when you want to make a small tweak to an existing image without changing it completely. A minimal sketch of this interface is shown below.
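To make those three inputs concrete, here is a minimal usage sketch of the interface we will end up building in this post. The diffEdit and load_image helpers are defined later in the article, and the image path is only a placeholder, so treat this as a preview rather than a final API.

## Preview of the interface built later in this post.
## "horse.jpg" is a placeholder path; diffEdit and load_image are defined below.
init_img = load_image("horse.jpg")        # 1. input image
mask, output = diffEdit(
    init_img,
    rp=["a horse image"],                 # 2. caption describing the input image
    qp=["a zebra image"],                 # 3. query text describing the desired output
)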

As you can see in the image above, only the fruit portion of the image is replaced with pears. This is a remarkable result!

The authors explain that they achieve this by introducing a mask generation module that determines which part of the image should be edited, and then performing text-conditioned diffusion only on the masked region.

As shown in the figure taken from the paper, the authors create a mask from the input image that identifies the region where the fruit appears (shown in orange), and then run masked diffusion to replace the fruit with pears. The authors provide a nice visualization of the entire DiffEdit process.

In this paper, generating the mask appears to be the most important step; the rest is conditioning the diffusion process on text. The way the mask is used to condition the image is similar in spirit to Hugging Face's in-painting implementation. As the authors describe it, the DiffEdit process has three steps:

Step 1: Add noise to the input image and denoise it twice, once conditioned on the reference prompt text and once conditioned on the query text (or unconditionally, i.e. with no text at all), and derive a mask from the difference between the two denoising results.

Step 2: Encode the input image with DDIM to estimate the latents corresponding to the input image.

Step 3: Perform DDIM decoding conditioned on the query text, using the inferred mask to replace the background with pixel values coming from the encoding process at the corresponding timestep.

Below, we turn these ideas into actual code.

Let's start by importing the required libraries and defining some helper functions.

import torch, logging

## disable warnings
logging.disable(logging.WARNING)

## Imaging library
from PIL import Image
from torchvision import transforms as tfms

## Basic libraries
from fastdownload import FastDownload
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
import shutil
import os

## For video display
from IPython.display import HTML
from base64 import b64encode

## Import the CLIP artifacts
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

## Helper functions
def load_artifacts():
    '''
    A function to load all diffusion artifacts
    '''
    vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", torch_dtype=torch.float16).to("cuda")
    unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16)
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")
    scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
    return vae, unet, tokenizer, text_encoder, scheduler

def load_image(p):
    '''
    Function to load images from a defined path
    '''
    return Image.open(p).convert('RGB').resize((512, 512))

def pil_to_latents(image):
    '''
    Function to convert image to latents
    '''
    init_image = tfms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0
    init_image = init_image.to(device="cuda", dtype=torch.float16)
    init_latent_dist = vae.encode(init_image).latent_dist.sample() * 0.18215
    return init_latent_dist

def latents_to_pil(latents):
    '''
    Function to convert latents to images
    '''
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images

def text_enc(prompts, maxlen=None):
    '''
    A function to take a textual prompt and convert it into embeddings
    '''
    if maxlen is None: maxlen = tokenizer.model_max_length
    inp = tokenizer(prompts, padding="max_length", max_length=maxlen, truncation=True, return_tensors="pt")
    return text_encoder(inp.input_ids.to("cuda"))[0].half()

vae, unet, tokenizer, text_encoder, scheduler = load_artifacts()

Let's also pick an image that we will use throughout the implementation.

p = FastDownload().download('https://images.pexels.com/photos/1996333/pexels-photo-1996333.jpeg?cs=srgb&dl=pexels-helena-lopes-1996333.jpg&fm=jpg&_gl=1*1pc0nw8*_ga*OTk4MTI0MzE4LjE2NjY1NDQwMjE.*_ga_8JE65Q40S6*MTY2Njc1MjIwMC4yLjEuMTY2Njc1MjIwMS4wLjAuMA..')
init_img = load_image(p)
init_img

Implementing DiffEdit in code

Now let's implement the paper as the authors describe it.

1. Mask creation: the first step of the DiffEdit process

The paper explains the first step in much more detail; here we only look at the key points:

  1. Denoise the image under two different text conditions (the reference text and the query text) and take the difference of the results. The idea is that the region that needs to change varies the most between the two denoisings, while the image background barely changes.
  2. Repeat this differencing process 10 times
  3. Average these differences and binarize the result

Note that the third step of mask creation (averaging and binarizing) is not explained clearly in the paper, and it took me a lot of experimentation to get it right.
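As a reading aid, here is a tiny self-contained numpy sketch of the average / standardize / binarize convention that the create_mask function below uses. The random arrays are stand-ins for the real per-iteration latent differences.

## Toy illustration of the average -> standardize -> binarize convention used below.
## The random arrays stand in for the per-iteration latent differences (4 x 64 x 64 each).
import numpy as np

diffs = [np.random.randn(4, 64, 64) for _ in range(10)]   # fake differences for 10 repeats

mask = np.zeros_like(diffs[0])
for d in diffs:
    mask += np.abs(d)                            # accumulate where the two denoisings disagree

mask = mask.mean(0)                              # average over the 4 latent channels
mask = (mask - mask.mean()) / np.std(mask)       # standardize (zero mean, unit variance)
binary_mask = (mask > 0).astype("uint8")         # keep pixels that changed more than average
print(binary_mask.shape, binary_mask.dtype)      # (64, 64) uint8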

The prompt_2_img_i2i function below returns the latent representation of the image, rather than the rescaled and decoded denoised image.

def prompt_2_img_i2i(prompts, init_img, neg_prompts=None, g=7.5, seed=100, strength=0.8, steps=50, dim=512):
    """
    Diffusion process to convert prompt to image
    """
    # Converting textual prompts to embedding
    text = text_enc(prompts)

    # Adding an unconditional prompt, helps in the generation process
    if not neg_prompts: uncond = text_enc([""], text.shape[1])
    else: uncond = text_enc(neg_prompts, text.shape[1])
    emb = torch.cat([uncond, text])

    # Setting the seed
    if seed: torch.manual_seed(seed)

    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)

    # Convert the seed image to latent
    init_latents = pil_to_latents(init_img)

    # Figuring initial time step based on strength
    init_timestep = int(steps * strength)
    timesteps = scheduler.timesteps[-init_timestep]
    timesteps = torch.tensor([timesteps], device="cuda")

    # Adding noise to the latents
    noise = torch.randn(init_latents.shape, generator=None, device="cuda", dtype=init_latents.dtype)
    init_latents = scheduler.add_noise(init_latents, noise, timesteps)
    latents = init_latents

    # Computing the timestep to start the diffusion loop
    t_start = max(steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start:].to("cuda")

    # Iterating through defined steps
    for i, ts in enumerate(tqdm(timesteps)):
        # We need to scale the i/p latents to match the variance
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)

        # Predicting noise residual using U-Net
        with torch.no_grad(): u, t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)

        # Performing Guidance
        pred = u + g * (t - u)

        # Conditioning the latents
        latents = scheduler.step(pred, ts, latents).prev_sample

    # Returning the latent representation to output an array of 4x64x64
    return latents.detach().cpu()

The next step is to create the create_mask function. Its arguments are the initial image, the reference prompt, the query prompt, and the number of times these steps should be repeated. The authors found that n = 10 and a strength of 0.5 worked well in their experiments, so these are the function's default values. create_mask performs the following steps:

  1. Create two denoised latents, one conditioned on the reference text and the other conditioned on the query text, and take the difference between them
  2. Repeat this step n times
  3. Average these differences and standardize them
  4. Binarize with a 0.5 threshold to create the mask
def create_mask(init_img, rp, qp, n=10, s=0.5):
    ## Initialize a dictionary to save n iterations
    diff = {}

    ## Repeating the difference process n times
    for idx in range(n):
        ## Creating denoised sample using reference / original text
        orig_noise = prompt_2_img_i2i(prompts=rp, init_img=init_img, strength=s, seed=100*idx)[0]
        ## Creating denoised sample using query / target text
        query_noise = prompt_2_img_i2i(prompts=qp, init_img=init_img, strength=s, seed=100*idx)[0]
        ## Taking the difference
        diff[idx] = (np.array(orig_noise) - np.array(query_noise))

    ## Creating a mask placeholder
    mask = np.zeros_like(diff[0])

    ## Summing the absolute differences over the n iterations
    for idx in range(n):
        ## Note np.abs is a key step
        mask += np.abs(diff[idx])

    ## Averaging multiple channels
    mask = mask.mean(0)

    ## Normalizing
    mask = (mask - mask.mean()) / np.std(mask)

    ## Binarizing and returning the mask object
    return (mask > 0).astype("uint8")

mask = create_mask(init_img=init_img, rp=["a horse image"], qp=["a zebra image"], n=10)

Let's visualize the generated mask on top of the image.

plt.imshow(np.array(init_img), cmap='gray')  # I would add interpolation='none'
plt.imshow(
    Image.fromarray(mask).resize((512, 512)),  ## Scaling the mask to original size
    cmap='cividis',
    alpha=0.5 * (np.array(Image.fromarray(mask*255).resize((512, 512))) > 0)
)

As we can see above, the generated mask covers the horse, which is exactly what we want.

2. Masked diffusion: steps 2 and 3 of the DiffEdit paper

Steps 2 and 3 have to happen in the same loop, because the authors condition the unmasked region on the reference text and the masked region on the query text. At each step the two parts are combined into a single latent with a simple formula, illustrated below.
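To see what that combination does in isolation, here is a small stand-alone sketch of the per-step blend. Random tensors stand in for the real scheduler outputs; the actual function follows right after.

## Stand-alone illustration of the per-step blend used inside the loop below:
## the masked region follows the query-conditioned latents, the background follows the reference ones.
import torch

mask = torch.zeros(1, 1, 64, 64)
mask[..., 20:40, 20:40] = 1.0            # toy mask covering a square region

rlatents = torch.randn(1, 4, 64, 64)     # stand-in for the reference-conditioned step output
qlatents = torch.randn(1, 4, 64, 64)     # stand-in for the query-conditioned step output

latents = mask * qlatents + (1 - mask) * rlatents
print(latents.shape)                     # torch.Size([1, 4, 64, 64])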

def prompt_2_img_diffedit(rp, qp, init_img, mask, g=7.5, seed=100, strength=0.7, steps=70, dim=512):
    """
    Diffusion process to convert prompt to image
    """
    # Converting textual prompts to embedding
    rtext = text_enc(rp)
    qtext = text_enc(qp)

    # Adding an unconditional prompt, helps in the generation process
    uncond = text_enc([""], rtext.shape[1])
    emb = torch.cat([uncond, rtext, qtext])

    # Setting the seed
    if seed: torch.manual_seed(seed)

    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)

    # Convert the seed image to latent
    init_latents = pil_to_latents(init_img)

    # Figuring initial time step based on strength
    init_timestep = int(steps * strength)
    timesteps = scheduler.timesteps[-init_timestep]
    timesteps = torch.tensor([timesteps], device="cuda")

    # Adding noise to the latents
    noise = torch.randn(init_latents.shape, generator=None, device="cuda", dtype=init_latents.dtype)
    init_latents = scheduler.add_noise(init_latents, noise, timesteps)
    latents = init_latents

    # Computing the timestep to start the diffusion loop
    t_start = max(steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start:].to("cuda")

    # Converting mask to torch tensor
    mask = torch.tensor(mask, dtype=unet.dtype).unsqueeze(0).unsqueeze(0).to("cuda")

    # Iterating through defined steps
    for i, ts in enumerate(tqdm(timesteps)):
        # We need to scale the i/p latents to match the variance
        inp = scheduler.scale_model_input(torch.cat([latents] * 3), ts)

        # Predicting noise residual using U-Net
        with torch.no_grad(): u, rt, qt = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(3)

        # Performing Guidance
        rpred = u + g * (rt - u)
        qpred = u + g * (qt - u)

        # Conditioning the latents
        rlatents = scheduler.step(rpred, ts, latents).prev_sample
        qlatents = scheduler.step(qpred, ts, latents).prev_sample
        latents = mask * qlatents + (1 - mask) * rlatents

    # Returning the latent representation to output an array of 4x64x64
    return latents_to_pil(latents)

Let's visualize the generated image.

output = prompt_2_img_diffedit(
    rp=["a horse image"],
    qp=["a zebra image"],
    init_img=init_img,
    mask=mask,
    g=7.5, seed=100, strength=0.5, steps=70, dim=512)

## Plotting side by side
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
for c, img in enumerate([init_img, output[0]]):
    axs[c].imshow(img)
    if c == 0: axs[c].set_title(f"Initial image")
    else: axs[c].set_title(f"DiffEdit output")

Now let's wrap the mask creation and the masked diffusion into one simple function.

def diffEdit(init_img, rp, qp, g=7.5, seed=100, strength=0.7, steps=70, dim=512):

    ## Step 1: Create mask
    mask = create_mask(init_img=init_img, rp=rp, qp=qp)

    ## Step 2 and 3: Diffusion process using mask
    output = prompt_2_img_diffedit(
        rp=rp,
        qp=qp,
        init_img=init_img,
        mask=mask,
        g=g,
        seed=seed,
        strength=strength,
        steps=steps,
        dim=dim)
    return mask, output

We can also create a visualization function for DiffEdit that shows the original input image, the mask overlay, and the final output image.

def plot_diffEdit(init_img, output, mask):
    ## Plotting side by side
    fig, axs = plt.subplots(1, 3, figsize=(12, 6))

    ## Visualizing initial image
    axs[0].imshow(init_img)
    axs[0].set_title(f"Initial image")

    ## Visualizing the DiffEdit output
    axs[2].imshow(output[0])
    axs[2].set_title(f"DiffEdit output")

    ## Visualizing the mask
    axs[1].imshow(np.array(init_img), cmap='gray')
    axs[1].imshow(
        Image.fromarray(mask).resize((512, 512)),  ## Scaling the mask to original size
        cmap='cividis',
        alpha=0.5 * (np.array(Image.fromarray(mask*255).resize((512, 512))) > 0)
    )
    axs[1].set_title(f"DiffEdit mask")

Now let's test this function on a few images.

p = FastDownload().download('https://images.pexels.com/photos/1996333/pexels-photo-1996333.jpeg?cs=srgb&dl=pexels-helena-lopes-1996333.jpg&fm=jpg&_gl=1*1pc0nw8*_ga*OTk4MTI0MzE4LjE2NjY1NDQwMjE.*_ga_8JE65Q40S6*MTY2Njc1MjIwMC4yLjEuMTY2Njc1MjIwMS4wLjAuMA..')
init_img = load_image(p)
mask, output = diffEdit(
    init_img,
    rp=["a horse image"],
    qp=["a zebra image"]
)
plot_diffEdit(init_img, output, mask)

Not bad! Let's try another one.

p = FastDownload().download('https://raw.githubusercontent.com/johnrobinsn/diffusion_experiments/main/images/bowloberries_scaled.jpg')
init_img = load_image(p)
mask, output = diffEdit(
    init_img,
    rp=['Bowl of Strawberries'],
    qp=['Bowl of Grapes']
)
plot_diffEdit(init_img, output, mask)

FastDiffEdit: a faster DiffEdit implementation

We have now seen our own hand-written implementation, but it is not optimized at all. To do better in terms of both speed and results, we can make a few improvements to the original DiffEdit process. We'll call these improvements FastDiffEdit.

1. Mask creation: the FastDiffEdit mask process

The biggest problem with mask creation is that it takes too long (about 50 seconds on an A4500 GPU). We may not need to run a full diffusion loop to denoise the image: instead we can take the U-Net's one-shot prediction of the original sample, and increase the number of repetitions to 20. This brings the computation down from 10 * 25 = 250 U-Net steps to 20 steps (roughly 12x fewer). Let's see whether this works in practice.
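The key change is that instead of looping over scheduler steps and reading prev_sample each time, we call the U-Net once and read the scheduler's one-shot estimate of the clean latents, pred_original_sample. Here is a small sketch of that difference, using the DDIM scheduler loaded earlier and random tensors in place of the real latents and U-Net output; it only illustrates the two attributes, not the full pipeline.

## DDIMScheduler.step() exposes both the usual next-step latents (prev_sample) and a
## direct estimate of the clean latents (pred_original_sample). FastDiffEdit reads the
## latter after a single U-Net call. Random tensors stand in for the real inputs here.
import torch

scheduler.set_timesteps(50)
ts = scheduler.timesteps[25]                  # some intermediate timestep

latents = torch.randn(1, 4, 64, 64)           # stand-in for the noised image latents
noise_pred = torch.randn(1, 4, 64, 64)        # stand-in for the U-Net's noise prediction

out = scheduler.step(noise_pred, ts, latents)
print(out.prev_sample.shape)                  # one denoising step: used in the slow loop
print(out.pred_original_sample.shape)         # one-shot x0 estimate: used by FastDiffEdit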

def prompt_2_img_i2i_fast(prompts, init_img, g=7.5, seed=100, strength=0.5, steps=50, dim=512):
    """
    Diffusion process to convert prompt to image
    """
    # Converting textual prompts to embedding
    text = text_enc(prompts)

    # Adding an unconditional prompt, helps in the generation process
    uncond = text_enc([""], text.shape[1])
    emb = torch.cat([uncond, text])

    # Setting the seed
    if seed: torch.manual_seed(seed)

    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)

    # Convert the seed image to latent
    init_latents = pil_to_latents(init_img)

    # Figuring initial time step based on strength
    init_timestep = int(steps * strength)
    timesteps = scheduler.timesteps[-init_timestep]
    timesteps = torch.tensor([timesteps], device="cuda")

    # Adding noise to the latents
    noise = torch.randn(init_latents.shape, generator=None, device="cuda", dtype=init_latents.dtype)
    init_latents = scheduler.add_noise(init_latents, noise, timesteps)
    latents = init_latents

    # We need to scale the i/p latents to match the variance
    inp = scheduler.scale_model_input(torch.cat([latents] * 2), timesteps)

    # Predicting noise residual using U-Net
    with torch.no_grad(): u, t = unet(inp, timesteps, encoder_hidden_states=emb).sample.chunk(2)

    # Performing Guidance
    pred = u + g * (t - u)

    # Zero shot prediction
    latents = scheduler.step(pred, timesteps, latents).pred_original_sample

    # Returning the latent representation to output an array of 4x64x64
    return latents.detach().cpu()

Next, create a new mask function that uses prompt_2_img_i2i_fast.

def create_mask_fast(init_img, rp, qp, n=20, s=0.5):
    ## Initialize a dictionary to save n iterations
    diff = {}

    ## Repeating the difference process n times
    for idx in range(n):
        ## Creating denoised sample using reference / original text
        orig_noise = prompt_2_img_i2i_fast(prompts=rp, init_img=init_img, strength=s, seed=100*idx)[0]
        ## Creating denoised sample using query / target text
        query_noise = prompt_2_img_i2i_fast(prompts=qp, init_img=init_img, strength=s, seed=100*idx)[0]
        ## Taking the difference
        diff[idx] = (np.array(orig_noise) - np.array(query_noise))

    ## Creating a mask placeholder
    mask = np.zeros_like(diff[0])

    ## Summing the absolute differences over the n iterations
    for idx in range(n):
        ## Note np.abs is a key step
        mask += np.abs(diff[idx])

    ## Averaging multiple channels
    mask = mask.mean(0)

    ## Normalizing
    mask = (mask - mask.mean()) / np.std(mask)

    ## Binarizing and returning the mask object
    return (mask > 0).astype("uint8")

Let's see whether this new function still produces a good mask.

p = FastDownload().download('https://images.pexels.com/photos/1996333/pexels-photo-1996333.jpeg?cs=srgb&dl=pexels-helena-lopes-1996333.jpg&fm=jpg&_gl=1*1pc0nw8*_ga*OTk4MTI0MzE4LjE2NjY1NDQwMjE.*_ga_8JE65Q40S6*MTY2Njc1MjIwMC4yLjEuMTY2Njc1MjIwMS4wLjAuMA..')
init_img = load_image(p)
mask = create_mask_fast(init_img=init_img, rp=["a horse image"], qp=["a zebra image"], n=20)

plt.imshow(np.array(init_img), cmap='gray')  # I would add interpolation='none'
plt.imshow(
    Image.fromarray(mask).resize((512, 512)),  ## Scaling the mask to original size
    cmap='cividis',
    alpha=0.5 * (np.array(Image.fromarray(mask*255).resize((512, 512))) > 0)
)

The result is still decent. Although it is not as accurate as the full version, the computation time on my machine drops from ~50 seconds to ~10 seconds (a 5x speedup!). We can further improve the result with some cv2 post-processing, which makes the mask a little smoother.

import cv2

def improve_mask(mask):
    mask = cv2.GaussianBlur(mask*255, (3, 3), 1) > 0
    return mask.astype('uint8')

mask = improve_mask(mask)

plt.imshow(np.array(init_img), cmap='gray')  # I would add interpolation='none'
plt.imshow(
    Image.fromarray(mask).resize((512, 512)),  ## Scaling the mask to original size
    cmap='cividis',
    alpha=0.5 * (np.array(Image.fromarray(mask*255).resize((512, 512))) > 0)
)

The mask is now smoother and covers a larger area.

2. Replacing the masked-diffusion step with the inpainting pipeline

The diffusers library has a dedicated pipeline called the inpaint pipeline, so we can use it to perform the masked diffusion. It takes the query prompt, the initial image, and the generated mask, and returns the generated image.

from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    revision="fp16",
    torch_dtype=torch.float16,
).to("cuda")

Let's use the inpainting pipeline on our example.

image = pipe(
    prompt=["a zebra image"],
    image=init_img,
    mask_image=Image.fromarray(mask*255).resize((512, 512)),
    generator=torch.Generator("cuda").manual_seed(100),
    num_inference_steps=20
).images[0]
image

The inpaint pipeline produces a much more realistic zebra image. Let's wrap the mask creation and diffusion process into one simple function.

def fastDiffEdit(init_img, rp, qp, g=7.5, seed=100, strength=0.7, steps=20, dim=512):

    ## Step 1: Create mask
    mask = create_mask_fast(init_img=init_img, rp=rp, qp=qp, n=20)

    ## Improve masking using CV trick
    mask = improve_mask(mask)

    ## Step 2 and 3: Diffusion process using mask
    output = pipe(
        prompt=qp,
        image=init_img,
        mask_image=Image.fromarray(mask*255).resize((512, 512)),
        generator=torch.Generator("cuda").manual_seed(100),
        num_inference_steps=steps
    ).images
    return mask, output

Let's test this function on the same image as before.

p = FastDownload().download('https://raw.githubusercontent.com/johnrobinsn/diffusion_experiments/main/images/bowloberries_scaled.jpg')
init_img = load_image(p)
mask, output = fastDiffEdit(init_img, rp=['Bowl of Strawberries'], qp=['Bowl of Grapes'])
plot_diffEdit(init_img, output, mask)

The result is much better than our hand-written masked-diffusion loop.

Summary

In this article we implemented the DiffEdit paper and then proposed FastDiffEdit, an improved version that is about 5x faster to compute, produces better results, and requires less code.

https://avoid.overfit.cn/post/f0a8a7b6981a4962aae21e97d535ee41

Author: Aayush Agrawal