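About Alibaba Cloud: AI Recharge: Demystifying Large Language Models in Practice, Where Engineering Distributed Inference for Production Is the Key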

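On March 15, OpenAI announced GPT-4. Its striking performance on the bar exam and on programming tasks pushed enthusiasm for large models to a peak, and people began debating whether we have already entered the era of artificial general intelligence. At the same time, applications built on large language models have sprung up everywhere, bringing an unprecedentedly smooth experience to collaborative office work, customer-service dialogue, language translation, content generation, and more.

While we enjoy the broadly accessible AI capabilities that large language models bring, they also pose unprecedented challenges for developers. GPT-3 has 175 billion parameters, and even Alpaca, which targets academia and advanced users, has 7 billion. Single-node, multi-GPU distributed inference has therefore become the go-to approach for deploying large models.

Using the Bloom7B1 model as an example, this article shares hands-on practice for running distributed large language model inference on Alibaba Cloud Container Service for Kubernetes (ACK).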

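As more and more large language models are released, including many well-performing open-source ones that anyone can try, building your own application on an existing large language model is no longer out of reach. Unlike earlier models, however, a large language model may not fit in the memory of a single GPU. Model parallelism is therefore needed: the model is partitioned and inference runs across multiple GPUs. In this article, we use DeepSpeed Inference to deploy a distributed large language model inference service.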
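
To make the memory pressure concrete, the sketch below estimates how much GPU memory the weights alone occupy at a given precision. It is a back-of-the-envelope calculation, not a measurement: the figure of roughly 7.1 billion parameters for Bloom7B1 is an assumption read off the model name, the function names are made up for illustration, and activations, KV cache, and framework overhead are ignored.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Activations, KV cache, and framework overhead are NOT included.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gib(num_params: float, dtype: str = "fp16") -> float:
    """Return the memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def per_gpu_weight_gib(num_params: float, tp_degree: int, dtype: str = "fp16") -> float:
    """Weights per GPU if tensor parallelism split them perfectly evenly (idealized)."""
    return weight_memory_gib(num_params, dtype) / tp_degree


# 175e9 comes from the article; 7.1e9 is an assumed parameter count for Bloom7B1.
MODELS = {"GPT-3": 175e9, "Bloom7B1 (assumed ~7.1B)": 7.1e9}

for name, params in MODELS.items():
    print(f"{name}: {weight_memory_gib(params):.1f} GiB of fp16 weights")
print(f"Bloom7B1 per GPU at TP=2: {per_gpu_weight_gib(7.1e9, 2):.1f} GiB")
```

Under these assumptions, GPT-3's fp16 weights alone come to well over 300 GiB, far beyond any single GPU, while a 7B-class model needs roughly 13 GiB before any runtime overhead is counted.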

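DeepSpeed Inference is a distributed inference solution from Microsoft with good support for transformer-based large language models. It provides model parallelism for running inference on large models across multiple GPUs, using tensor parallelism to put several GPUs to work at once and improve inference performance. DeepSpeed also ships optimized custom inference kernels that raise GPU utilization and lower inference latency. For details, see DeepSpeed Inference [3].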
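
The idea behind tensor parallelism can be illustrated without any GPU: split a layer's weight matrix column-wise, let each worker multiply the same input by its own shard, then concatenate the partial outputs. The pure-Python sketch below is a toy illustration of that idea only. It is not DeepSpeed's implementation (real systems run the shards on separate GPUs and combine results with collective communication), and all function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k x n matrix w (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split matrix w column-wise into `parts` equally sized shards."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = n_cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts=2):
    """Compute x @ w by giving each simulated worker one column shard of w."""
    shards = split_columns(w, parts)
    # Each worker sees the full input but only its own slice of the weights.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenating the partial outputs plays the role of an all-gather.
    return [value for part in partials for value in part]


x = [1, 2, 3]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

Each shard holds only half of the weight columns, which is the property that lets a model too large for one device be spread over two.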

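Having a distributed inference solution for large models is only the start. Deploying large-model inference services efficiently in a Kubernetes cluster still raises many engineering challenges. How can large fleets of GPUs and other heterogeneous resources be managed, operated, and scheduled automatically and efficiently? How can an inference service be deployed quickly, and once it is live, how can resources keep up with fluctuating traffic? There is also a lack of suitable tools for monitoring key metrics such as inference latency, throughput, GPU utilization, and GPU memory usage, as well as a lack of sound model partitioning schemes and model version management.

This article uses the cloud-native AI suite of Alibaba Cloud Container Service for Kubernetes (ACK) to run DeepSpeed distributed inference. The suite makes it easy to manage large-scale heterogeneous resources, offers fine-grained GPU scheduling policies and rich GPU monitoring and alerting, and lets you use Arena to quickly submit and manage elastically scalable inference services, along with service-oriented operations.

This example uses the following components: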

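- Arena: Arena is a lightweight Kubernetes-based machine learning solution that supports the full lifecycle of data preparation, model development, model training, and model prediction, improving data scientists' productivity. It is also deeply integrated with Alibaba Cloud's basic cloud services, supports services such as GPU sharing and CPFS, and can run deep learning frameworks optimized by Alibaba Cloud to get the best performance and cost efficiency from Alibaba Cloud's heterogeneous devices. For more on Arena, see the Cloud-Native AI Suite Developer Guide [1].
- Ingress: In a Kubernetes cluster, Ingress is the entry point through which in-cluster services are exposed externally, and it carries almost all traffic to those services. Ingress is a Kubernetes resource object that manages how clients outside the cluster reach services inside it. You can configure different forwarding rules on an Ingress resource so that requests are routed, according to those rules, to the backend Pods of different Services in the cluster. For more on Ingress, see Ingress Overview [2].
- DeepSpeed Inference: a distributed inference solution from Microsoft that provides distributed inference optimizations for LLMs such as GPT and BLOOM. For details, see DeepSpeed Inference [3].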

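In the example below, we use Arena to deploy a single-node, multi-GPU distributed inference service based on the Bloom 7B1 model in a Kubernetes cluster, with DJLServing as the model serving framework. DJLServing is a high-performance, general-purpose model serving solution powered by the Deep Java Library (DJL). It supports DeepSpeed Inference directly and serves large-model inference over HTTP; for details, see DJLServing [4]. The inference job is submitted with Arena and runs in Kubernetes as a Deployment, loads the model and configuration files from shared OSS storage, and is exposed through a Service. This setup gives the inference service elastic scaling, shared GPU scheduling, performance monitoring, and cost analysis and optimization, reducing your operations cost.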

- Create a Kubernetes cluster with GPUs [5]
- Install the cloud-native AI suite [6]

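The following shows how to use the Arena command-line tool to submit a single-node, multi-GPU distributed inference job for the Bloom7B1 model in ACK, and how to configure Ingress to access the service.

The model configuration consists of two parts: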

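Configuration file: the serving.properties file in this example, which describes the model configuration. Two parameters deserve attention:

- tensor_parallel_degree: the tensor-parallel size. It is set to 2 in this example, meaning 2 GPUs are used for distributed inference.
- model_id: the model name, that is, the model's name on Hugging Face, or the path of an already downloaded model. In this example, the bloom7B1 model is downloaded to OSS and mounted into the container through a PVC, so the OSS-backed path is specified here.

Inference logic file: the model.py file in this example, which implements model loading and request handling:

- get_model function: loads the model and tokenizer, converts the model into one capable of distributed inference through deepspeed.init_inference, and finally builds the inference pipeline from the converted model.
- handle function: calls the pipeline produced by get_model to run the tokenize, forward, and detokenize steps.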

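The content of serving.properties is as follows:

Here model_id is set to the in-container path after the PVC is mounted. If the model has not been downloaded in advance, you can set it to bigscience/bloom-7b1 and the program will download it automatically (the model files total about 15 GB).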

engine=DeepSpeed
option.parallel_loading=true
option.tensor_parallel_degree=2
option.model_loading_timeout=600
option.model_id=model/LLM/bloom-7b1/deepspeed/bloom-7b1
option.data_type=fp16
option.max_new_tokens=100
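
The handler in model.py reads keys such as tensor_parallel_degree and model_id without the option. prefix, which suggests that the prefix is dropped before the properties reach the Python code. The sketch below mimics that behavior to sanity-check a serving.properties file before uploading it. The parser and the validate helper are hypothetical utilities for illustration, not part of DJLServing, and the prefix handling is an assumption inferred from the handler code.

```python
SAMPLE = """\
engine=DeepSpeed
option.parallel_loading=true
option.tensor_parallel_degree=2
option.model_loading_timeout=600
option.model_id=model/LLM/bloom-7b1/deepspeed/bloom-7b1
option.data_type=fp16
option.max_new_tokens=100
"""


def parse_serving_properties(text: str) -> dict:
    """Parse key=value lines; strip the 'option.' prefix (assumed behavior)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("option."):
            key = key[len("option."):]
        props[key] = value.strip()
    return props


def validate(props: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if "engine" not in props:
        problems.append("missing engine")
    tp = props.get("tensor_parallel_degree", "")
    if not tp.isdigit() or int(tp) < 1:
        problems.append("tensor_parallel_degree must be a positive integer")
    return problems


props = parse_serving_properties(SAMPLE)
print(props["tensor_parallel_degree"], validate(props))
```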

The content of model.py is as follows:

import os
import torch
from typing import Optional

import deepspeed
import logging
logging.basicConfig(format='[%(asctime)s] %(filename)s %(funcName)s():%(lineno)i [%(levelname)s] %(message)s', level=logging.DEBUG)
from djl_python.inputs import Input
from djl_python.outputs import Output
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None


def get_model(properties: dict):
    model_dir = properties.get("model_dir")
    model_id = properties.get("model_id")
    mp_size = int(properties.get("tensor_parallel_degree", "2"))
    local_rank = int(os.getenv('OMPI_COMM_WORLD_LOCAL_RANK', '0'))
    logging.info(f"process [{os.getpid()}  rank is [{local_rank}]]")
    if not model_id:
        model_id = model_dir
    logging.info(f"rank[{local_rank}] start load model")
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    logging.info(f"rank[{local_rank}] success load model")

    model = deepspeed.init_inference(model,
                                     mp_size=mp_size,
                                     dtype=torch.float16,
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)
    logging.info(f"rank[{local_rank}] success to convert model to deepspeed kernel")

    return pipeline(task='text-generation',
                    model=model,
                    tokenizer=tokenizer,
                    device=local_rank)


def handle(inputs: Input) -> Optional[Output]:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_string()
    output = Output()
    output.add_property("content-type", "application/json")
    result = predictor(data, do_sample=True, max_new_tokens=50)
    return output.add(result)

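Upload serving.properties, model.py, and (optionally) the model files to OSS. For details, see Upload files in the console [7].

After uploading to OSS, create a PV named bloom7b1-pv and a PVC named bloom7b1-pvc so that the inference service container can mount them. For details, see Use OSS statically provisioned volumes [8].

With the configuration files available through the PVC, start the inference service with the following arena command.

- --gpus: set to 2, meaning 2 GPUs are needed for distributed inference
- --data: bloom7b1-pvc is the PVC created in the previous step, and /model is the path where the PVC is mounted in the container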

arena serve custom \
    --name=bloom7b1-deepspeed \
    --gpus=2 \
    --version=alpha \
    --replicas=1 \
    --restful-port=8080 \
    --data=bloom7b1-pvc:/model \
    --image=ai-studio-registry-vpc.cn-beijing.cr.aliyuncs.com/kube-ai/djl-serving:2023-05-19 \
    "djl-serving -m"
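
When this submission is scripted, it can help to assemble the command from parameters and refuse to proceed if fewer GPUs are requested than the tensor-parallel degree in serving.properties. The sketch below only builds the argument list and does not run anything. The helper name and the GPU check are illustrative assumptions; the flags are exactly those used in the command above.

```python
import shlex


def build_arena_serve_args(name, gpus, tp_degree, pvc, mount_path, image,
                           command, version="alpha", replicas=1, port=8080):
    """Build the argv for `arena serve custom` using the flags shown above."""
    if gpus < tp_degree:
        # Assumption: each tensor-parallel rank needs its own GPU.
        raise ValueError(f"--gpus={gpus} is less than tensor_parallel_degree={tp_degree}")
    return [
        "arena", "serve", "custom",
        f"--name={name}",
        f"--gpus={gpus}",
        f"--version={version}",
        f"--replicas={replicas}",
        f"--restful-port={port}",
        f"--data={pvc}:{mount_path}",
        f"--image={image}",
        command,
    ]


args = build_arena_serve_args(
    name="bloom7b1-deepspeed",
    gpus=2,
    tp_degree=2,
    pvc="bloom7b1-pvc",
    mount_path="/model",
    image="ai-studio-registry-vpc.cn-beijing.cr.aliyuncs.com/kube-ai/djl-serving:2023-05-19",
    command="djl-serving -m",
)
print(shlex.join(args))  # printable form; pass `args` to subprocess.run to execute
```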

Check the status of the job.

$ kubectl get pod | grep bloom7b1-deepspeed-alpha-custom-serving
bloom7b1-deepspeed-alpha-custom-serving-766467967d-j8l2l    1/1     Running     0          8s

# View the startup logs
kubectl logs bloom7b1-deepspeed-alpha-custom-serving-766467967d-j8l2l -f

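The service startup logs are shown below. From the logs we can see that:

- Inference runs in distributed mode with a tensor parallel size of 2
- The service starts two processes with process IDs 92 and 93, whose rank IDs are 0 and 1 respectively
- rank 0 and rank 1 convert the kernels and load the model concurrently to carry out distributed inference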

INFO  ModelServer Starting model server ...
INFO  ModelServer Starting djl-serving: 0.23.0-SNAPSHOT ...
INFO  ModelServer
INFO  PyModel Loading model in MPI mode with TP: 2.
INFO  PyProcess [1,0]<stdout>:process [92  rank is [0]]
INFO  PyProcess [1,0]<stdout>:rank[0] start load model
INFO  PyProcess [1,1]<stdout>:process [93  rank is [1]]
INFO  PyProcess [1,1]<stdout>:rank[1] start load model
INFO  PyProcess [1,0]<stdout>:rank[0] success to convert model to deepspeed kernel
INFO  PyProcess [1,1]<stdout>:rank[1] success to convert model to deepspeed kernel
INFO  PyProcess [1,0]<stdout>:rank[0] success load model
INFO  PyProcess [1,1]<stdout>:rank[1] success load model
INFO  PyProcess Model [deepspeed] initialized.
INFO  PyProcess Model [deepspeed] initialized.
INFO  PyModel deepspeed model loaded in 297083 ms.
INFO  ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO  ModelServer BOTH API bind to: http://0.0.0.0:8080
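
In an automated rollout, the same check can be done programmatically rather than by eye. The sketch below scans log text for the per-rank messages that model.py emits and reports whether every expected rank has both loaded the model and finished the kernel conversion. The function name and the readiness rule are assumptions for illustration; the sample lines are copied from the log above.

```python
import re
from collections import defaultdict

# Messages logged by get_model() in model.py, keyed by rank.
RANK_EVENT = re.compile(
    r"rank\[(\d+)\] (success load model|success to convert model to deepspeed kernel)"
)
REQUIRED = {"success load model", "success to convert model to deepspeed kernel"}

SAMPLE_LOG = """\
INFO  PyProcess [1,0]<stdout>:rank[0] start load model
INFO  PyProcess [1,1]<stdout>:rank[1] start load model
INFO  PyProcess [1,0]<stdout>:rank[0] success to convert model to deepspeed kernel
INFO  PyProcess [1,1]<stdout>:rank[1] success to convert model to deepspeed kernel
INFO  PyProcess [1,0]<stdout>:rank[0] success load model
INFO  PyProcess [1,1]<stdout>:rank[1] success load model
"""


def ranks_ready(log_text: str, tp_degree: int) -> bool:
    """True if ranks 0..tp_degree-1 each logged both required messages."""
    seen = defaultdict(set)
    for match in RANK_EVENT.finditer(log_text):
        seen[int(match.group(1))].add(match.group(2))
    return all(seen[rank] >= REQUIRED for rank in range(tp_degree))


print(ranks_ready(SAMPLE_LOG, tp_degree=2))
```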

Here we start a port-forward for quick verification.

# Start port-forward with kubectl
kubectl -n default-group port-forward svc/bloom7b1-deepspeed-alpha 9090:8080

In another terminal, send a request to the service.

# Open a new terminal and run the following command
$ curl -X POST http://127.0.0.1:9090/predictions/deepspeed -H "Content-type: text/plain" -d "I'm very thirsty, I need"
[
  {"generated_text":"I'm very thirsty, I need some water.\nWhat are you?\n- I'm a witch.\n- I thought you'd say that.\nI know a great witch.\nShe's right in here.\n- You know where we can go?\n- That's right, in one moment.\n- You want to"}
]
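
The same request can be issued from application code. The sketch below is a minimal client using only the Python standard library. It assumes the endpoint path /predictions/deepspeed and the response shape (a JSON list of objects with a generated_text field) shown in the curl example above; the function names are illustrative.

```python
import json
import urllib.request


def build_request(base_url: str, prompt: str, model_name: str = "deepspeed"):
    """Build the POST request that mirrors the curl call above."""
    url = f"{base_url.rstrip('/')}/predictions/{model_name}"
    return urllib.request.Request(
        url,
        data=prompt.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def extract_texts(body: str) -> list:
    """Pull the generated_text fields out of the JSON response body."""
    return [item["generated_text"] for item in json.loads(body)]


def generate(base_url: str, prompt: str, timeout: float = 60.0) -> list:
    """Send the prompt to the service and return the generated texts."""
    request = build_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_texts(response.read().decode("utf-8"))


# Example (requires the port-forward above to be running):
#   generate("http://127.0.0.1:9090", "I'm very thirsty, I need")
```

Because the handler samples with do_sample=True, repeated calls with the same prompt will generally return different text.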

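We can configure Ingress to expose the model service externally, which lets us manage external traffic and helps keep the model available. The steps to configure Ingress for the service created above are as follows:

1. Log on to the Container Service console and select Clusters in the left-side navigation pane.
2. On the Clusters page, click the name of the target cluster, then choose Network > Ingresses in the left-side navigation pane.
3. On the Ingresses page, click Create Ingress and configure the route in the Create Ingress dialog box.

For more detailed Ingress configuration, see Create an Nginx Ingress [9].

Fill in the required routing information in the dialog box.

Once the Ingress is created, you can access the Bloom model through the domain name configured on the Ingress.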

% curl -X POST http://deepspeed-bloom7b1.c78d407e5fa034a5aa9ab10e577e75ae9.cn-beijing.alicontainer.com/predictions/deepspeed -H "Content-type: text/plain" -d "I'm very thirsty, I need"
[
  {"generated_text":"I'm very thirsty, I need to drink!\nI want more water.\nWhere is the water?\nLet me have the water, let me have the water...\nWait!\nYou're the father aren't you?\nDo you have water?\nAre you going to let me have some?\nGive me the"}
]

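In the example above, we showed how to use Arena to deploy a single-node, multi-GPU inference service for the Bloom7B1 model, using the model-parallel inference technology of DeepSpeed-Inference to run inference across multiple GPUs. Besides DeepSpeed-Inference, there are other distributed inference solutions for large models, such as FasterTransformer + Triton. We will keep exploring, and hope to combine the cloud-native AI suite with distributed inference solutions to support high-performance, low-latency, elastically scalable large-model inference services at lower cost.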

Related links:

[1] Cloud-Native AI Suite Developer Guide

https://help.aliyun.com/document_detail/336968.html?spm=a2c4g…

[2] Ingress Overview

https://help.aliyun.com/document_detail/198892.html?spm=a2c4g…

[3] DeepSpeed Inference

https://www.deepspeed.ai/tutorials/inference-tutorial/

[4] DJLServing

https://github.com/deepjavalibrary/djl-serving

[5] Create a managed GPU cluster

https://help.aliyun.com/document_detail/171074.html?spm=a2c4g.171073.0.0.4c78f95a00Mb5P

[6] Install the cloud-native AI suite

https://help.aliyun.com/document_detail/201997.html?spm=a2c4g.212117.0.0.115b1cb6yDEAjy

[7] Upload OSS files in the console

https://help.aliyun.com/document_detail/31886.htm?spm=a2c4g.2…

[8] Use OSS statically provisioned volumes

https://help.aliyun.com/document_detail/134903.html?spm=a2c4g…

[9] Create an Nginx Ingress

https://help.aliyun.com/document_detail/86536.html?spm=a2c4g.198892.0.0.3acd663fsFwQPY

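Distributed inference becomes the preferred approach for deploying large models

Engineering for production is the key to distributed large-model inference

Overview of the example

Example steps

Environment preparation

Large-model inference in practice

1. Write the model configuration

2. Start the service

3. Verify the service

4. Configure Ingress

Summary and outlook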