[Yolo on Triton] Deploying YOLOv8 on NVIDIA Triton Inference Server

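- Model export
  - Parameter reference
- Model output breakdown
- Model deployment
  - Setting up the model repository
  - Writing the model configuration file ()
  - Writing the Docker container launch script
- Client
  - Installing Triton Client
  - Writing the core code
    - Building the client
    - Preprocessing
    - Object initialization
    - Running the inference request
    - Postprocessing
  - Full code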

Model export

Once your YOLOv8 model is trained, be sure to export it with the batch=1 option so that the exported model has a fixed batch size of 1.

model.export(format="onnx",batch=1,simplify=True)

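Parameter reference

format: the format to export the model to; ONNX is used here.
batch: the batch size of the exported model.
simplify: whether to call onnx-simplifier to simplify the ONNX model.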

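Model output breakdown

Here is the original reply from Ultralytics developer glenn-jocher explaining the YOLOv8 model output:

@RosterMouch (my GitHub username) hello,
The output of the YOLOv8 model is a 1D array of length 22xN, where N is the number of detected objects. The first 4 values in the array represent the x, y, width, and height of the bounding box, followed by 18 values representing the confidence scores of each class.

In other words, each detection is described by 22 values: the first four are the x, y, width, and height of the bounding box, and the remaining 18 are the confidence scores of each class (my task has 18 classes, hence 4 + 18 = 22). Note that the exported ONNX model does not return a flattened array: its output has shape (1, 22, 8400), so N here is 8400 candidate boxes that still have to be filtered by confidence.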
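
The link between the class count and the output shape can be sketched in a few lines. This is an illustration, not Ultralytics code: the helper name yolov8_output_shape is made up here, and it assumes the standard YOLOv8 detection head with strides 8, 16, and 32 and a square input.

```python
def yolov8_output_shape(num_classes, img_size=640, strides=(8, 16, 32)):
    """Expected raw output shape (batch, 4 + classes, candidates) for batch=1."""
    # Each stride contributes one candidate per cell of its feature map.
    candidates = sum((img_size // s) ** 2 for s in strides)
    return (1, 4 + num_classes, candidates)


# 18 classes at 640x640: 80*80 + 40*40 + 20*20 = 8400 candidates
print(yolov8_output_shape(18))  # (1, 22, 8400)
```

If you train with a different number of classes, only the middle dimension changes, and the output dims in the Triton configuration have to change with it.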

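Model deployment

For how to deploy NVIDIA Triton Inference Server itself, please read this article; it is not repeated here.

Setting up the model repository

Triton requires the model repository to follow a strict layout: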

└── classnet
    ├── 1
    │   └── 
    └──

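Because I still have not tracked down the root cause of the "file not found" error with the TensorRT plan model, the ONNX model is used for deployment here.
The directory structure is explained below:

classnet: the name of your model.
1: the version of your model. If versions greater than 1 exist and no loading policy is specified in the model configuration, Triton loads only the latest version by default.
: the model configuration file.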
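
As a rough sketch, the skeleton can also be created from Python. The function name make_model_repository is hypothetical, and the file names are assumptions based on Triton's usual conventions (config.pbtxt for the configuration, model.onnx as the default ONNX file name); the files created here are empty placeholders to be replaced with your real model and configuration.

```python
import tempfile
from pathlib import Path


def make_model_repository(root, model_name="classnet", version="1"):
    """Create the directory skeleton Triton expects for one model version."""
    model_dir = Path(root) / model_name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    # Placeholders only: copy your exported model and real config here.
    (version_dir / "model.onnx").touch()  # assumed default ONNX file name
    (model_dir / "config.pbtxt").touch()  # assumed config file name
    return model_dir


with tempfile.TemporaryDirectory() as tmp:
    repo = make_model_repository(tmp)
    for path in sorted(repo.rglob("*")):
        print(path.relative_to(tmp))
```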

Writing the model configuration file ():

The commonly used configuration options are:

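name: "classnet" # model name; must match the name of your model folder
platform: "onnxruntime_onnx" # backend used to run the model
max_batch_size: 0 # 0 disables Triton batching, so the dims below must include the batch dimension
input [ # Triton input configuration
  {
    name: "images", # input name (leave as-is; this is the name defined by the YOLOv8 export)
    data_type: TYPE_FP32, # match the data type of your weights; this model is FP32
    dims: [ 1,3,640,640 ] # a batch of one 640x640-pixel RGB image
  }
]
output [ # Triton output configuration
  {
    name: "output0", # output name
    data_type: TYPE_FP32 # output data type
    dims: [ 1,22,8400 ] # batch of 1, 22 values per candidate (4 box values + 18 classes), 8400 candidate boxes (fewer for smaller inputs)
  }
]
# Version policy
# latest tells Triton to load the most recent versions of the model
# num_versions is how many of the latest versions to load
version_policy: { latest { num_versions: 1 } }
# instance_group: execution instances (devices) for the model
instance_group: [
  {
    count: 1 # number of instances
    kind: KIND_GPU # device kind
    gpus: [ 0 ] # with KIND_GPU, the indices of the visible CUDA devices on which to run the model
  }
]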
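
To make the version policy concrete, here is a small sketch of how the latest policy chooses among version directories. It only mimics the documented behaviour (numerically greatest versions win; directories that are not numeric or that start with zero are ignored) and is not Triton code; the name versions_to_load is made up for this example.

```python
def versions_to_load(version_dirs, num_versions=1):
    """Pick the highest-numbered versions, mimicking `latest { num_versions: N }`."""
    # Only numeric directory names that do not start with zero count as versions.
    numeric = sorted(
        int(v) for v in version_dirs if v.isdigit() and not v.startswith("0")
    )
    return numeric[-num_versions:] if num_versions > 0 else []


print(versions_to_load(["1", "2", "10"]))                  # [10]
print(versions_to_load(["1", "2", "10"], num_versions=2))  # [2, 10]
```

The comparison is numeric rather than alphabetical, which is why version 10 beats version 2.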

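Writing the Docker container launch script

For how to write the Docker launch script, refer to the launch options section of the Triton deployment article; it is not repeated here.

Client

Installing Triton Client

Before writing the client script, you need to install the Triton Client Python package: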

pip install tritonclient[all] -i /simple

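Writing the core code

Building the client

First, import the http module from tritonclient, along with the exception class that may be raised while handling an inference request: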

import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

Next, create a random hash to serve as the ID of the inference request:

import random
from hashlib import sha256

# Create a random hash for the request ID
requestID = random.randint(0, 100000)
requestID = sha256(str(requestID).encode('utf-8')).hexdigest()
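
One caveat: random.randint(0, 100000) can only produce 100001 distinct values, so hashing the number does not make collisions any less likely. If request IDs need to be unique across many requests, a UUID from the standard library is a possible alternative. The helper name make_request_id below is hypothetical.

```python
import uuid
from hashlib import sha256


def make_request_id():
    """Return a 64-character hex request ID derived from a random UUID."""
    return sha256(uuid.uuid4().bytes).hexdigest()


request_id = make_request_id()
print(len(request_id))  # 64
```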

Next, instantiate the Triton client and create empty lists for the model's inputs and outputs:

# Create an HTTP client for the inference server running at localhost:8000.
triton_client = httpclient.InferenceServerClient(
    url="localhost:8000",
)
inputs = []
outputs = []

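Preprocessing

OpenCV loads an image as a (height, width, 3) BGR array of 8-bit integers, whereas the model accepts a (1,3,640,640) floating-point tensor. The image therefore has to be resized to 640x640, converted to RGB, reordered to channels-first, scaled to the 0 to 1 range, and finally given a leading batch dimension with numpy.expand_dims():

imageData = cv.imread("./")
# resize to 640x640
imageData = cv.resize(imageData, (640, 640))
# BGR -> RGB
imageData = cv.cvtColor(imageData, cv.COLOR_BGR2RGB)
# HWC -> CHW, then convert to fp32 in [0, 1]
imageData = imageData.transpose(2, 0, 1).astype(np.float32) / 255.0
# add the batch dimension: (3,640,640) -> (1,3,640,640)
imageData = np.expand_dims(imageData, axis=0)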

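Object initialization

Next, create the input of the inference request:

inputs.append(httpclient.InferInput('images', imageData.shape, "FP32"))

The first argument is the name of your model's input, the second is the input shape (1,3,640,640), and the third is the data type.
At this point the input is still incomplete: it has been initialized but holds no data, so call the .set_data_from_numpy() method to attach the data:

inputs[0].set_data_from_numpy(imageData)

Correspondingly, create the requested output of the model:

outputs.append(httpclient.InferRequestedOutput('output0'))

Here output0 is the name of the model's output.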

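Running the inference request

Now we can send the request by calling the client's async_infer() method, which issues an asynchronous inference request:

try:
    resp = triton_client.async_infer(
        model_name="classnet",  # model name
        model_version="1",  # version number
        inputs=inputs,  # inputs
        outputs=outputs,  # outputs
        request_id=requestID  # request ID
    )
except InferenceServerException as e:
    print(e)

This sends an inference request to the Triton server.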

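Postprocessing

What do we do once we have a valid response?
As we know, the model output has shape (1, N, M), where N is 4 plus the number of classes and M is the number of candidate boxes. That layout is clearly inconvenient for later steps such as extracting results and drawing boxes.
So we take the first batch element and transpose it, which gives one row per candidate box:

result = resp.get_result().as_numpy('output0')[0].T

Here .get_result() waits for and returns the result of the request, .as_numpy() extracts the named output as a NumPy array, and output0 is the model's output name.
That gives us the result.
Remember the output layout described earlier? Based on it, we can write a postprocessing function: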

def extractFeature(inferenceResult:np.ndarray) -> np.ndarray:
    threshold = 0.5 # confidence threshold
    final = [] # final output list
    for i in inferenceResult: # iterate over the candidates
        temp = [] # temporary row for this candidate
        box = i[:4] # bounding box values
        confidence = i[4:] # confidences of all classes
        if np.max(confidence) < threshold:
            continue # discard the candidate if its best confidence is below the threshold
        else:
            for j in box: # unpack the bounding box values
                temp.append(j)
            # index of the class with the highest confidence
            classIndex = np.argmax(confidence)
            temp.append(classIndex)
            final.append(temp)
    final = np.array(final)
    # drop duplicate boxes
    # final = (final, axis=1)
    return final
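
The "drop duplicate boxes" step is left unfinished above. YOLOv8 usually returns several overlapping candidates for the same object, and the common remedy is non-maximum suppression (NMS). The following is a minimal pure-Python sketch, not the author's implementation. It assumes boxes are already in [x1, y1, x2, y2] form and that a score is kept for each box (extractFeature above discards it); the names iou and nms and the 0.45 threshold are illustrative choices, and the suppression is class-agnostic.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best box, drop others that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 260, 260]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The returned indices refer to the input lists, so they can be used to select the surviving rows before drawing.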

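Next, we need to convert the [x,y,w,h] data into [x1,y1,x2,y2] format. YOLO's default output gives the box center plus width and height, which does not fit OpenCV's rectangle() function, so the data has to be converted once:

def xywh2xyxy(box:np.ndarray) -> np.ndarray:
    x = box[0]
    y = box[1]
    w = box[2]
    h = box[3]
    x1 = x - w/2 # center x minus half the width
    y1 = y - h/2 # center y minus half the height
    x2 = x + w/2 # likewise, x2 and y2 go the opposite way from x1 and y1
    y2 = y + h/2
    return np.array([x1,y1,x2,y2])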
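
The converted coordinates are still expressed in the model's 640x640 input space. If you want to draw on the original image instead of a resized copy, the boxes have to be scaled back. A minimal sketch, assuming the image was stretched to 640x640 with a plain resize (no letterboxing); the function name scale_box_to_original is hypothetical.

```python
def scale_box_to_original(box_xyxy, orig_width, orig_height, model_size=640):
    """Map an [x1, y1, x2, y2] box from model coordinates to the original image."""
    sx = orig_width / model_size   # horizontal stretch factor
    sy = orig_height / model_size  # vertical stretch factor
    x1, y1, x2, y2 = box_xyxy
    return [x1 * sx, y1 * sy, x2 * sx, y2 * sy]


# A box in 640x640 space mapped onto a 1280x720 image
print(scale_box_to_original([160, 160, 480, 480], 1280, 720))
# [320.0, 180.0, 960.0, 540.0]
```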

Next comes the rendering function:

def drawBox(image:np.ndarray, result:np.ndarray) -> np.ndarray:
    for i in result:
        # convert [x,y,w,h] to [x1,y1,x2,y2]
        box = xywh2xyxy(i[:4])
        x1, y1 = int(box[0]), int(box[1])
        x2, y2 = int(box[2]), int(box[3])
        classIndex = str(int(i[4]))
        # random color
        color = (random.randint(0, 255), random.randint(0, 255), random.randint(0, 255))
        cv.rectangle(image, (x1, y1), (x2, y2), color, 2)
        cv.putText(image, classIndex, (x1, y1), cv.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
    return image
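
A random color per box means the same class is drawn differently on every run. If no color table is available, one option is to derive a stable color from the class index. This is only a sketch: the name class_color and the default of 18 classes are assumptions taken from this article's task.

```python
import colorsys


def class_color(class_index, num_classes=18):
    """Return a stable BGR color tuple for a class index."""
    hue = (class_index % num_classes) / num_classes
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    # OpenCV expects colors in BGR order with 0-255 integer channels.
    return (int(b * 255), int(g * 255), int(r * 255))


print(class_color(0))  # (0, 0, 255)
```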

Full code

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException
import random
from hashlib import sha256
import cv2 as cv
import json

# load the class name -> color mapping
with open("../config/") as f:
    classRGB = json.load(f)
# keys to list
className = list(classRGB.keys())
classColors = list(classRGB.values())

def extractFeature(inferenceResult:np.ndarray) -> np.ndarray:
    threshold = 0.95
    final = []
    for i in inferenceResult:
        temp = []
        box = i[:4]
        confidence = i[4:]
        if np.max(confidence) < threshold:
            continue
        else:
            for j in box:
                temp.append(j)
            classIndex = np.argmax(confidence)
            temp.append(classIndex)
            final.append(temp)
    final = np.array(final)
    # drop duplicate boxes
    # final = (final, axis=1)
    return final

def xywh2xyxy(box:np.ndarray) -> np.ndarray:
    x = box[0]
    y = box[1]
    w = box[2]
    h = box[3]
    x1 = x - w/2
    y1 = y - h/2
    x2 = x + w/2
    y2 = y + h/2
    return np.array([x1,y1,x2,y2])

def drawBox(image:np.ndarray, result:np.ndarray) -> np.ndarray:
    for i in result:
        # convert [x,y,w,h] to [x1,y1,x2,y2]
        box = xywh2xyxy(i[:4])
        classIndex = className[int(i[4])]
        classColor = classColors[int(i[4])]
        # color configured for this class
        color = (classColor[0], classColor[1], classColor[2])
        # draw box
        cv.rectangle(image, (int(box[0]), int(box[1])), (int(box[2]), int(box[3])), color, 2)
        # draw text
        cv.putText(image, classIndex, (int(box[0]), int(box[1])), cv.FONT_HERSHEY_SIMPLEX, 1, color, 2)
    return image

def main():
    # Create a random hash for the request ID
    requestID = random.randint(0, 100000)
    requestID = sha256(str(requestID).encode('utf-8')).hexdigest()
    # Create an HTTP client for the inference server running at localhost:8000.
    triton_client = httpclient.InferenceServerClient(
        url="localhost:8000",
    )
    inputs = []
    outputs = []
    imageData = cv.imread("./")
    # resize to 640x640 and convert BGR -> RGB
    imageData = cv.resize(imageData, (640, 640))
    imageData = cv.cvtColor(imageData, cv.COLOR_BGR2RGB)
    # HWC -> CHW, then convert to fp32 in [0, 1]
    imageData = imageData.transpose(2, 0, 1).astype(np.float32) / 255.0
    # add the batch dimension: (3,640,640) -> (1,3,640,640)
    imageData = np.expand_dims(imageData, axis=0)
    inputs.append(httpclient.InferInput('images', imageData.shape, "FP32"))
    inputs[0].set_data_from_numpy(imageData)
    outputs.append(httpclient.InferRequestedOutput('output0'))

    # Send the request to the inference server and fetch the output tensor.
    try:
        resp = triton_client.async_infer(
            model_name="classnet",
            model_version="1",
            inputs=inputs,
            outputs=outputs,
            request_id=requestID
        )
        result = resp.get_result().as_numpy('output0')[0].T
        result = np.squeeze(result)
        result = extractFeature(result)
        image = cv.imread("./")
        image = cv.resize(image, (640, 640))
        imageOutput = drawBox(image, result)
        cv.imwrite("", imageOutput)
    except InferenceServerException as e:
        print(e)

if __name__ == '__main__':
    main()
