A Guide to Production Deployment with TensorRT-LLM

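TensorRT-LLM is an open-source framework designed by Nvidia to improve the performance of large language models in production. It builds, compiles, and executes computation graphs on top of the TensorRT deep learning compiler, borrows many efficient kernel implementations from FasterTransformer, and uses NCCL for communication between devices.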

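Frameworks like vLLM and TGI are a great starting point for accelerating inference, but they lack some optimizations, which makes them hard to scale in production. That is why Nvidia built TensorRT-LLM on top of TensorRT. Large companies such as Anthropic, OpenAI, and Anyscale are already using this framework to serve LLMs to millions of users.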

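Unlike other inference techniques, TensorRT-LLM does not serve the model from its raw weights. It compiles the model and optimizes the kernels so that it can be served efficiently on Nvidia GPUs. A compiled model runs far faster than the raw model, which is one of the main reasons TensorRT-LLM is so fast.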

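The raw model weights are passed to the compiler together with the optimization options (such as quantization level, tensor parallelism, and pipeline parallelism). The compiler takes that information and outputs a model binary optimized for the specific GPU.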

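Note that the whole compilation process must run on a GPU, and the resulting compiled model is optimized specifically for the GPU it was built on. For example, a model compiled on an A40 GPU may not run on an A100 GPU. So whichever GPU you use for compilation, you must use the same GPU type for inference.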

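TensorRT-LLM does not support every large language model out of the box, because each model architecture is different. However, the deep graph-level optimizations that TensorRT performs do cover most popular models, such as Mistral, Llama, and Qwen. For the full list of supported models, see the official TensorRT-LLM GitHub repository.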

The TensorRT-LLM Python package lets developers run LLMs at peak performance without knowing C++ or CUDA.

Paged Attention

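Large language models need a lot of memory to store the keys and values of every token. As the input sequence gets longer, this memory usage can become very large.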
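To put rough numbers on this, here is a back-of-the-envelope sketch. The helper name is purely illustrative, and the default dimensions (32 layers, 8 key/value heads, head dimension 128, float16 storage) are assumptions taken from the publicly available Mistral 7B configuration, not from this article.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a single sequence.

    The factor 2 accounts for storing both keys and values.
    Defaults are assumed Mistral 7B dimensions with float16 storage.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


if __name__ == "__main__":
    for tokens in (1, 1024, 32768):
        mib = kv_cache_bytes(tokens) / (1024 ** 2)
        print(f"{tokens:>6} tokens -> {mib:,.3f} MiB")
```

Under these assumptions each token costs 128 KiB, so a single full 32K-token sequence needs about 4 GiB for its cache alone.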

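Normally, the keys and values of a sequence must be stored contiguously. So even if you free up space in the middle of a sequence's memory allocation, you cannot use that space for other sequences. This leads to fragmentation and waste.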

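Paged attention splits the keys and values into non-contiguous pages that can be placed anywhere in memory. If you free some pages in the middle, that space can be reused by other sequences.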

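This prevents fragmentation and allows higher memory utilization. Pages can be allocated and freed dynamically as needed while the output sequence is generated.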
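The bookkeeping behind this idea can be sketched with a toy page table. This is only an illustration of the concept, not how TensorRT-LLM implements it. The class and method names are hypothetical, and the "pages" are plain integers rather than GPU memory blocks.

```python
class PagedKVAllocator:
    """Toy page table: maps each sequence to a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size                 # tokens per page
        self.free_pages = list(range(num_pages))   # ids of unused pages
        self.page_table = {}                       # seq_id -> list of page ids
        self.lengths = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new page if needed."""
        pages = self.page_table.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(pages) * self.page_size:  # all current pages are full
            if not self.free_pages:
                raise MemoryError("no free pages left")
            pages.append(self.free_pages.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_pages=4, page_size=2)
    for _ in range(3):
        alloc.append_token("a")    # sequence a takes pages 0 and 1
    for _ in range(2):
        alloc.append_token("b")    # sequence b takes page 2
    alloc.free("a")                # pages 0 and 1 become reusable
    for _ in range(5):
        alloc.append_token("c")    # sequence c takes pages 3, 0, 1
    print(alloc.page_table)
```

Sequence c ends up on pages 3, 0, and 1, which are not adjacent. With a contiguous allocator the same request could not be satisfied, because page 2 sits in the middle of the free space.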

Efficient KV Caching

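LLMs have billions of parameters, which makes inference slow and memory-hungry. KV caching helps by storing the attention keys and values that were already computed for earlier tokens, so they do not have to be recomputed at every generation step.

Here is how it works:

During inference, as the LLM executes each attention layer, the keys and values computed for each token are stored in the cache. When the next token is generated, the layer computes keys and values only for that new token and reads the entries for all earlier tokens from the cache instead of recomputing them. This avoids redundant computation and speeds up inference, in exchange for the memory the cache itself occupies.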
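The effect can be shown with a toy single-head attention in plain Python. This is a minimal sketch under simplifying assumptions: `project` is a hypothetical stand-in for the real key/value projections, the key doubles as the query, and a call counter makes the saved work visible.

```python
import math

calls = {"project": 0}


def project(token):
    """Hypothetical stand-in for the key/value projections of one token."""
    calls["project"] += 1
    return [token, token * 0.5], [token * 2.0, 1.0]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                              # for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def generate_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        kv = [project(tok) for tok in tokens[: t + 1]]
        keys = [k for k, _ in kv]
        values = [v for _, v in kv]
        outputs.append(attend(keys[-1], keys, values))
    return outputs


def generate_with_cache(tokens):
    """Compute keys and values once per token and keep them in a cache."""
    cached_keys, cached_values, outputs = [], [], []
    for tok in tokens:
        key, value = project(tok)                   # only the new token
        cached_keys.append(key)
        cached_values.append(value)
        outputs.append(attend(key, cached_keys, cached_values))
    return outputs


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0, 4.0]
    slow = generate_without_cache(tokens)
    print("projections without cache:", calls["project"])
    calls["project"] = 0
    fast = generate_with_cache(tokens)
    print("projections with cache:", calls["project"])
    print("same outputs:", slow == fast)
```

Both variants produce the same attention outputs, but the cached version calls `project` once per token, while the uncached version repeats it for the whole prefix at every step.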

Now let's deploy a model with TensorRT-LLM

The first step in deploying a model with TensorRT-LLM is to compile it. Here we will use Mistral 7B Instruct v0.2. Compilation requires a GPU, so for convenience we do it directly on Colab.

TensorRT-LLM mainly supports high-end Nvidia GPUs, so we chose an A100 40GB GPU on Colab.

Clone the TensorRT-LLM git repository. This repo contains all the modules and scripts needed to compile the model.

 !git clone https://github.com/NVIDIA/TensorRT-LLM.git
 %cd TensorRT-LLM/examples/llama

Then install the required packages:

 !pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
 !pip install huggingface_hub pynvml mpi4py
 !pip install -r requirements.txt

Download the model:

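 from huggingface_hub import snapshot_download
 from google.colab import userdata
 
 # Download the original Hugging Face weights into a local directory.
 # If the repository requires authentication, pass token=userdata.get("HF_TOKEN")
 # (this assumes a Colab secret named HF_TOKEN).
 snapshot_download(
     "mistralai/Mistral-7B-Instruct-v0.2",
     local_dir="tmp/hf_models/mistral-7b-instruct-v0.2",
     max_workers=4
 )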

After this step you can inspect the tmp/hf_models directory in Colab, where you should see the model weights.

Next, load the model and convert it to the TensorRT-LLM checkpoint format:

 !python convert_checkpoint.py --model_dir ./tmp/hf_models/mistral-7b-instruct-v0.2 \
                              --output_dir ./tmp/trt_engines/1-gpu/ \
                              --dtype float16

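The next step is to compile the model with the trtllm-build command. If you want quantization or other optimizations, you can specify the parameters here. For simplicity, I did not use any extra optimizations.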

 !trtllm-build --checkpoint_dir ./tmp/trt_engines/1-gpu/ \
             --output_dir ./tmp/trt_engines/compiled-model/ \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --max_input_len 32256

Mistral 7B Instruct v0.2 supports a 32K context length, so the context length is set here through the --max_input_len flag.

Compiling the model takes 15-30 minutes.
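If you run the build from a script rather than a notebook cell, the same invocation can be assembled in Python. This is a small sketch that uses only the flags shown above. The function name and its defaults are illustrative, not part of TensorRT-LLM.

```python
import shlex


def trtllm_build_command(checkpoint_dir, output_dir, dtype="float16", max_input_len=32256):
    """Assemble the trtllm-build invocation shown above as an argument list."""
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", output_dir,
        "--gpt_attention_plugin", dtype,
        "--gemm_plugin", dtype,
        "--max_input_len", str(max_input_len),
    ]


if __name__ == "__main__":
    cmd = trtllm_build_command(
        "./tmp/trt_engines/1-gpu/",
        "./tmp/trt_engines/compiled-model/",
    )
    print(shlex.join(cmd))
    # To actually run the build (needs tensorrt_llm installed and a GPU):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.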

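Once compiled, the model is ready to use. We also want to cover deployment. There are many ways to deploy the compiled model, from simple tools like FastAPI to more sophisticated ones like the Triton Inference Server.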

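With a tool like FastAPI, the developer has to set up the API server, write a Dockerfile, and configure CUDA correctly. That is a fair amount of backend work that we may not be familiar with, so here we introduce a simple open-source tool, Truss.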

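Truss lets developers package their models with GPU support and run them in any cloud environment. It has many features that make integrating a model straightforward. Its main benefit is that you can easily containerize a model with GPU support and deploy it to any cloud environment.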

Install:

 pip install --upgrade truss

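To create a Truss project from scratch, run the following command:

 truss init mistral-7b-tensorrt-llm-truss

mistral-7b-tensorrt-llm-truss is the name of our project and can be anything you like. Running the command above automatically generates the files needed to deploy a Truss.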

Below is the directory structure of mistral-7b-tensorrt-llm-truss:

 ├── mistral-7b-tensorrt-llm-truss
 │   ├── config.yaml
 │   ├── model
 │   │   ├── __init__.py
 │   │   ├── model.py
 │   │   └── utils.py
 │   └── requirements.txt

Here is a quick overview of the files above:

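1. config.yaml sets the various configurations for the model, including its resources, dependencies, and environment variables. Here we can specify the model name, the Python dependencies to install, and the system packages to install.

2. model/model.py is the core of the Truss. It contains the Python code that will run on the Truss server. model.py has two main methods: load() and predict().

The load method is where we download the compiled model from Hugging Face and initialize TensorRT-LLM. The predict method receives HTTP requests and calls the model.

3. model/utils.py contains some helper functions for model.py. We did not write utils.py ourselves; it can be taken directly from the TensorRT-LLM repository.

4. requirements.txt contains the Python dependencies needed to run the compiled model. Truss uses it to initialize our environment.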

model.py contains the main code to execute. Let's first look at the load function.

 import subprocess
 subprocess.run(["pip", "install", "tensorrt_llm", "-U", "--pre", "--extra-index-url", "https://pypi.nvidia.com"])
 
 import torch
 from model.utils import (DEFAULT_HF_MODEL_DIRS, DEFAULT_PROMPT_TEMPLATES,
                    load_tokenizer, read_model_name, throttle_generator)
 
 import tensorrt_llm
 import tensorrt_llm.profiler
 from tensorrt_llm.runtime import ModelRunnerCpp, ModelRunner
 from huggingface_hub import snapshot_download
 
 STOP_WORDS_LIST = None
 BAD_WORDS_LIST = None
 PROMPT_TEMPLATE = None
 
 class Model:
     def __init__(self, **kwargs):
         self.model = None
         self.tokenizer = None
         self.pad_id = None
         self.end_id = None
         self.runtime_rank = None
         self._data_dir = kwargs["data_dir"]
 
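     def load(self):
         # Download the pre-compiled engine from the Hugging Face Hub
         # into the Truss data directory.
         snapshot_download(
             "htrivedi99/mistral-7b-v0.2-trtllm",
             local_dir=self._data_dir,
             max_workers=4,
         )
 
         # MPI rank of this process (0 when running on a single GPU).
         self.runtime_rank = tensorrt_llm.mpi_rank()
 
         # Read the model name and version from the compiled engine directory.
         model_name, model_version = read_model_name(f"{self._data_dir}/compiled-model")
         tokenizer_dir = "mistralai/Mistral-7B-Instruct-v0.2"
 
         # Load the tokenizer together with the pad and end-of-sequence token ids.
         self.tokenizer, self.pad_id, self.end_id = load_tokenizer(
             tokenizer_dir=tokenizer_dir,
             vocab_file=None,
             model_name=model_name,
             model_version=model_version,
             tokenizer_type="llama",
         )
 
         # Load the compiled engine with the Python ModelRunner (no LoRA weights).
         runner_cls = ModelRunner
         runner_kwargs = dict(engine_dir=f"{self._data_dir}/compiled-model",
                              lora_dir=None,
                              rank=self.runtime_rank,
                              debug_mode=False,
                              lora_ckpt_source="hf",
                             )
 
         self.model = runner_cls.from_dir(**runner_kwargs)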

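At the top of the file we import the necessary modules, notably tensorrt_llm. In the load function we download the compiled model with snapshot_download, then download the model's tokenizer with the load_tokenizer function that ships with model/utils.py, and finally load the compiled model with TensorRT-LLM's ModelRunner class.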

Next is the predict function:

     def predict(self, request: dict):
 
         prompt = request.pop("prompt")
         max_new_tokens = request.pop("max_new_tokens", 2048)
         temperature = request.pop("temperature", 0.9)
         top_k = request.pop("top_k",1)
         top_p = request.pop("top_p", 0)
         streaming = request.pop("streaming", False)
         streaming_interval = request.pop("streaming_interval", 3)
 
         batch_input_ids = self.parse_input(tokenizer=self.tokenizer,
                                       input_text=[prompt],
                                       prompt_template=None,
                                       input_file=None,
                                       add_special_tokens=None,
                                       max_input_length=1028,
                                       pad_id=self.pad_id,
                                       )
         input_lengths = [x.size(0) for x in batch_input_ids]
 
         outputs = self.model.generate(
             batch_input_ids,
             max_new_tokens=max_new_tokens,
             max_attention_window_size=None,
             sink_token_length=None,
             end_id=self.end_id,
             pad_id=self.pad_id,
             temperature=temperature,
             top_k=top_k,
             top_p=top_p,
             num_beams=1,
             length_penalty=1,
             repetition_penalty=1,
             presence_penalty=0,
             frequency_penalty=0,
             stop_words_list=STOP_WORDS_LIST,
             bad_words_list=BAD_WORDS_LIST,
             lora_uids=None,
             streaming=streaming,
             output_sequence_lengths=True,
             return_dict=True)
 
         if streaming:
             streamer = throttle_generator(outputs, streaming_interval)
 
             def generator():
                 total_output = ""
                 for curr_outputs in streamer:
                     if self.runtime_rank == 0:
                         output_ids = curr_outputs['output_ids']
                         sequence_lengths = curr_outputs['sequence_lengths']
                         batch_size, num_beams, _ = output_ids.size()
                         for batch_idx in range(batch_size):
                             for beam in range(num_beams):
                                 output_begin = input_lengths[batch_idx]
                                 output_end = sequence_lengths[batch_idx][beam]
                                 outputs = output_ids[batch_idx][beam][output_begin:output_end].tolist()
                                 output_text = self.tokenizer.decode(outputs)
 
                                 current_length = len(total_output)
                                 total_output = output_text
                                 yield total_output[current_length:]
             return generator()
         else:
             if self.runtime_rank == 0:
                 output_ids = outputs['output_ids']
                 sequence_lengths = outputs['sequence_lengths']
                 batch_size, num_beams, _ = output_ids.size()
                 for batch_idx in range(batch_size):
                     for beam in range(num_beams):
                         output_begin = input_lengths[batch_idx]
                         output_end = sequence_lengths[batch_idx][beam]
                         outputs = output_ids[batch_idx][beam][output_begin:output_end].tolist()
                         output_text = self.tokenizer.decode(outputs)
                         return {"output": output_text}

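The predict function accepts several model inputs, such as prompt, max_new_tokens, and temperature. We extract all of these values from the request at the top of the function, and then call the LLM through self.model.generate to produce the output. The generate function accepts a variety of parameters that help control the LLM's output. Note that parse_input, which predict calls to tokenize the prompt, is a helper on the Model class that is not shown in this excerpt.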
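The streaming branch of predict combines two ideas: it forwards only every few intermediate results, and it yields only the text that is new since the previous yield. The sketch below isolates that logic on plain strings. Both functions are illustrative stand-ins written for this example, and `throttle` is not the actual throttle_generator from utils.py.

```python
def throttle(generator, interval):
    """Yield every `interval`-th item, plus the final item if it was skipped."""
    last = None
    emitted_last = True
    for index, item in enumerate(generator):
        last = item
        emitted_last = index % interval == 0
        if emitted_last:
            yield item
    if not emitted_last:
        yield last                      # never drop the final result


def stream_deltas(decoded_snapshots):
    """Turn growing full-text snapshots into the newly added pieces."""
    seen = 0
    for text in decoded_snapshots:
        yield text[seen:]               # only what was not sent before
        seen = len(text)


if __name__ == "__main__":
    snapshots = ["A", "A mis", "A mistral", "A mistral is", "A mistral is a wind"]
    for piece in stream_deltas(throttle(iter(snapshots), 2)):
        print(repr(piece))
```

Concatenating the yielded pieces always reproduces the final decoded text, regardless of the interval, because each delta starts where the previous one ended.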

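To run our model in the cloud, we also need to containerize it. Truss takes care of creating the Dockerfile and packaging everything for us, so there is not much we need to do.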

Outside the mistral-7b-tensorrt-llm-truss directory, create a file named main.py and paste the following code into it:

 import truss
 from pathlib import Path
 
 tr = truss.load("./mistral-7b-tensorrt-llm-truss")
 command = tr.docker_build_setup(build_dir=Path("./mistral-7b-tensorrt-llm-truss"))
 print(command)

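Run main.py and look inside the mistral-7b-tensorrt-llm-truss directory. You should see a number of automatically generated files. Now we can build the container with Docker. Run the following commands in order: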

 docker build mistral-7b-tensorrt-llm-truss -t mistral-7b-tensorrt-llm-truss:latest
 docker tag mistral-7b-tensorrt-llm-truss <docker_user_id>/mistral-7b-tensorrt-llm-truss
 docker push <docker_user_id>/mistral-7b-tensorrt-llm-truss

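These Docker configuration files were generated for us by Truss. Next we briefly cover a Kubernetes deployment. I will not go into how to set up a GKE cluster, as that is outside the scope of this article.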

Create the following Kubernetes deployment:

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: mistral-7b-v2-trt
   namespace: default
 spec:
   replicas: 1
   selector:
     matchLabels:
       component: mistral-7b-v2-trt-layer
   template:
     metadata:
       labels:
         component: mistral-7b-v2-trt-layer
     spec:
       containers:
       - name: mistral-container
         image: htrivedi05/mistral-7b-v0.2-trt:latest
         ports:
           - containerPort: 8080
         resources:
           limits:
             nvidia.com/gpu: 1
       nodeSelector:
         cloud.google.com/gke-accelerator: nvidia-tesla-a100
 ---
 apiVersion: v1
 kind: Service
 metadata:
   name: mistral-7b-v2-trt-service
   namespace: default
 spec:
   type: ClusterIP
   selector:
     component: mistral-7b-v2-trt-layer
   ports:
   - port: 8080
     protocol: TCP
     targetPort: 8080

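This is a standard Kubernetes deployment that runs a container with the image htrivedi05/mistral-7b-v0.2-trt:latest. If you pushed your own image in the previous step, use its name instead.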

Create the deployment by running:

 kubectl create -f mistral-deployment.yaml

It takes a few minutes for the Kubernetes pod to be provisioned. Once the pod is running, the load function we wrote earlier is executed.

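Once the model is loaded, the pod logs will show a message like "Completed model.load() execution in 449234 ms". To send requests to the model over HTTP, we need to port-forward the service. You can use the following command: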

 kubectl port-forward svc/mistral-7b-v2-trt-service 8080

Open any Python script and run the following code:

 import requests
 
 data = {"prompt": "What is a mistral?"}
 res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data)
 res = res.json()
 print(res)

You will see output like the following:

 {"output": "A Mistral is a strong, cold wind that originates in the Rhone Valley in France. It is named after the Mistral wind system, which is associated with the northern Mediterranean region. The Mistral is known for its consistency and strength, often blowing steadily for days at a time. It can reach speeds of up to 130 kilometers per hour (80 miles per hour), making it one of the strongest winds in Europe. The Mistral is also known for its clear, dry air and its role in shaping the landscape and climate of the Rhone Valley."}

With that, our inference service is deployed successfully.
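The predict function also reads optional fields such as max_new_tokens and temperature from the request body. A small client helper can make those easy to pass. This sketch uses only the standard library, and the helper names are hypothetical. It assumes the port-forward above is active and that the endpoint path is the one used in the request example.

```python
import json
import urllib.request

PREDICT_URL = "http://127.0.0.1:8080/v1/models/model:predict"


def build_request(prompt, url=PREDICT_URL, **options):
    """Create a POST request whose JSON body carries the fields predict() reads."""
    payload = {"prompt": prompt, **options}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120, **options):
    """Send the request and return the decoded JSON response."""
    request = build_request(prompt, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires the running service and the active port-forward):
# print(ask("What is a mistral?", max_new_tokens=256, temperature=0.7))
```

Any keyword that predict does not pop from the request is simply ignored by the code shown earlier, so unknown options are harmless in this setup.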

I ran some custom benchmarks and got the following results:

As you can see, TensorRT-LLM speeds up inference noticeably.

In this article, we showed how to use TensorRT-LLM to accelerate model inference, covering everything from compiling an LLM to deploying it in production.

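TensorRT-LLM is more complex than other inference optimizers, but the performance gain is substantial. The framework is still at an early stage, yet it already provides state-of-the-art LLM optimization. It is also fully open source and can be used commercially. I believe TensorRT-LLM will keep growing, since it is Nvidia's own product.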

TensorRT-LLM code:

https://avoid.overfit.cn/post/22b19ff044984de69da655a67721cff3

Author: Het Trivedi
