A Python Script Example Adapted for Denglin Technology GPUs

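YOLOv8 object detection on a Denglin GPU

This article walks through the complete workflow for running YOLOv8 object detection on a Denglin Technology GPU. The main points are: 1) initializing the GPU inference engine through the dlnne library and configuring the weight-sharing mode; 2) preprocessing images with the aspect ratio preserved, then normalizing them and reordering their dimensions; 3) managing GPU memory with pycuda for efficient data transfer; 4) post-processing with confidence filtering and non-maximum suppression (NMS); 5) visualizing the output, with support for recognizing the 80 object classes of the COCO dataset.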

就这样吧！588 · Published 2025-08-05 10:16:28

Preface

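In YOLO, NMS (non-maximum suppression) is a widely used post-processing technique for object detection.

Purpose: remove redundant candidate boxes, which improves detection precision.

Anchor boxes, a related concept: an anchor box is one of a set of predefined reference boxes whose sizes and aspect ratios are designed from the statistics of the dataset, so that together they cover objects of different scales and shapes. Anchor boxes belong to the prediction stage rather than to NMS itself, and YOLOv8, the model used below, has an anchor-free detection head.

Core roles: narrow the range the model has to predict, improve detection precision, and enable multi-scale detection.

Anchor boxes are a cornerstone of anchor-based detection models, and their design directly affects both precision and speed. Understanding how they work and how to tune them helps when optimizing such models.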
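
To make the idea of a predefined set of reference boxes concrete, here is a minimal NumPy sketch that builds the anchor boxes for one location from a few scales and aspect ratios. The function name `make_anchors` and the chosen scales and ratios are illustrative assumptions, not values taken from any YOLO release, and YOLOv8 itself predicts boxes without anchors.

```python
import numpy as np

def make_anchors(cx, cy, scales, ratios):
    """Return anchors as (x1, y1, x2, y2) rows centred on (cx, cy).

    Each anchor has area scale**2 and a width/height ratio of `ratio`.
    (Hypothetical helper, for illustration only.)
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # width grows with the aspect ratio
            h = s / np.sqrt(r)   # height shrinks so that w * h == s**2
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors, dtype=np.float32)

anchors = make_anchors(cx=320, cy=320, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
print(anchors.shape)  # (6, 4): 2 scales x 3 ratios
```

Two scales combined with three aspect ratios give six reference boxes per location, which is how a small anchor set covers objects of different sizes and shapes.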

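"Weight configuration on a Denglin GPU" is a key step. It means using the software and hardware ecosystem provided by Denglin Technology to deploy the parameters of a trained deep learning model onto its GPU chips efficiently and correctly, so that the chip can run AI workloads. The process usually depends on Denglin-specific tooling (such as its compiler and runtime libraries) for optimization and adaptation.

An analogy: think of installing software on a new model of computer. The software itself (the model structure) has to be adapted to the operating system (Denglin's software stack), while the configuration files and data the software needs at run time (the weights) have to be correctly "installed" and "placed" (configured) at specific locations on the machine's disk (GPU memory), and the operating system has to know how to find and use them.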

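Example

The following walks through what each line of the code does, in execution order and grouped by logical module:

1. Importing the dependencies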

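import cv2                   # OpenCV: image reading, drawing, and format conversion
import numpy as np           # NumPy: array operations and numerical computation
import dlnne as nne          # Denglin GPU inference engine: model loading and inference
import pycuda.driver as cuda # CUDA driver API: GPU memory management and data transfer
import time                  # Used to time the inference call
from utils import nne_util   # Helper functions for dlnne (type conversion, weight-sharing config)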

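2. Configuration parameters

These parameters define the model path, the input/output configuration, and the post-processing thresholds; adjust them as needed.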

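MODEL_PATH = "yolov8n.onnx"  # Path to the ONNX model file (YOLOv8n, the lightweight variant)
IMAGE_PATH = "test.jpg"      # Path to the test image
CONF_THRESH = 0.5            # Confidence threshold (drops low-confidence detections)
IOU_THRESH = 0.45            # IoU threshold (NMS parameter, drops overlapping boxes)
INPUT_WIDTH = 640            # Width of the YOLOv8 model input
INPUT_HEIGHT = 640           # Height of the YOLOv8 model input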

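3. Initializing the Denglin GPU inference engine

This function is the core of the Denglin GPU adaptation: it initializes the hardware, parses the model, and builds the inference engine in preparation for inference.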

def init_dlnne_engine(model_path):
    # Initialize the CUDA driver so that the GPU can be used
    cuda.init()

    # Create the dlnne Builder and Parser
    with nne.Builder() as builder, nne.Parser() as parser:
        # Create the network object that will hold the model structure
        network = builder.create_network()

        # Configure the inference engine
        builder.config.max_batch_size = 1  # batch size 1 (single-image inference)
        # Fetch the weight-sharing configuration (specific to Denglin GPU hardware)
        weight_share = nne_util.get_weight_share('all')
        builder.config.ws_mode = weight_share['weight_mode']  # set the weight-sharing mode

        # Parse the ONNX model into the network object
        if not parser.parse(model_path, network):
            raise RuntimeError("Failed to parse the model")  # abort if parsing fails

        # Build the engine (compiles the model into a form the GPU can execute)
        engine = builder.build_engine(network)

        # Create the execution context used for the actual inference
        context = engine.create_execution_context(weight_share['cluster_cfg'])

        return engine, context  # hand back both the engine and the context

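4. Image preprocessing (adapted to YOLOv8)

Preprocessing is a key step of inference: it makes sure the input image has the same format the model saw during training.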

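def preprocess(image, target_size=640):
    # Convert BGR (OpenCV's default) to RGB (what the YOLO model expects)
    img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    h, w = img.shape[:2]  # height and width of the original image

    # Compute the scale factor (keeps the aspect ratio, so the image is not distorted)
    scale = min(target_size / h, target_size / w)
    nh, nw = int(h * scale), int(w * scale)  # height and width after scaling
    # Resize the image to the new size
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)

    # Create a fixed-size (640x640) canvas filled with gray (114), the padding value used in YOLO training
    canvas = np.full((target_size, target_size, 3), 114, dtype=np.uint8)
    canvas[:nh, :nw] = resized  # place the resized image in the top-left corner

    # Normalize pixel values from 0-255 to 0-1
    input_img = canvas.astype(np.float32) / 255.0
    # Reorder HWC (height, width, channel) to CHW and add a batch dimension (N=1).
    # transpose() only returns a strided view, so copy it into C-contiguous memory:
    # the host-to-device copy needs a contiguous buffer laid out in NCHW order.
    input_img = np.ascontiguousarray(input_img.transpose(2, 0, 1)[None])  # shape: (1, 3, 640, 640)

    return input_img, (w, h), scale  # processed input, original size, scale factor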
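
The geometry of this step is easy to check by hand. The sketch below repeats only the scale arithmetic from `preprocess` for a 1280x720 image and shows how a point in the padded model input maps back to the original image; the helper names `letterbox_geometry` and `to_original` are made up for this illustration.

```python
def letterbox_geometry(orig_w, orig_h, target_size=640):
    """Return (scale, new_w, new_h) for an aspect-preserving resize."""
    scale = min(target_size / orig_h, target_size / orig_w)
    return scale, int(orig_w * scale), int(orig_h * scale)

def to_original(x, y, scale):
    """Map a point from the padded model input back to the original image.

    No offset is needed because the resized image sits in the top-left
    corner of the canvas; only the scale has to be undone.
    """
    return x / scale, y / scale

scale, new_w, new_h = letterbox_geometry(1280, 720)
print(scale, new_w, new_h)           # 0.5 640 360
print(to_original(320, 180, scale))  # (640.0, 360.0)
```

The wider side fills the 640-pixel canvas, the remaining 280 rows at the bottom stay gray, and dividing by `scale` is all the post-processing step needs to recover original coordinates.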

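5. Non-maximum suppression (NMS)

NMS removes redundant overlapping detection boxes, keeping only the highest-confidence box in each group of overlapping boxes.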

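def _nms(dets, scores, thresh):
    # Extract the top-left and bottom-right coordinates of each box
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]

    # Area of each box (inclusive-pixel convention, hence the +1)
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    # Box indices sorted by confidence, highest first
    order = scores.argsort()[::-1]

    keep = []  # indices of the boxes that survive
    while order.size > 0:
        i = order[0]  # box with the highest remaining confidence
        keep.append(i)

        # Intersection rectangle between the current box and every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        # Width and height of the intersection (clamped to be non-negative)
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h  # intersection area

        # IoU: intersection / (area of current box + area of other box - intersection)
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep boxes whose IoU does not exceed the threshold (drops heavy overlaps)
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]  # +1 because ovr was computed against order[1:]

    return keep  # indices of the kept boxes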
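
A worked number helps when choosing the IoU threshold. The sketch below computes the IoU of two 10x10 boxes with the same inclusive-pixel (+1) convention as `_nms`; the function name `iou_inclusive` and the two boxes are illustrative assumptions.

```python
def iou_inclusive(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, inclusive-pixel (+1) convention."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)

box_a = (0, 0, 9, 9)      # 10 x 10 pixels
box_b = (5, 5, 14, 14)    # same size, shifted by 5 pixels in x and y
iou = iou_inclusive(box_a, box_b)
print(round(iou, 4))  # 0.1429, i.e. 25 / (100 + 100 - 25)
```

With the threshold of 0.45 used in this article, these two boxes would both be kept; only boxes that overlap far more heavily are treated as duplicates.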

6. Post-processing (parsing the model output)

Post-processing turns the raw model output into readable detections (class, confidence, bounding box).

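def postprocess(outputs, orig_size, scale, conf_thresh=0.25, iou_thresh=0.45):
    # YOLOv8 output layout: (1, 84, 8400) -> squeeze and transpose to (8400, 84)
    # (8400 candidate boxes, 84 values each: 4 box values + 80 class scores)
    predictions = np.squeeze(outputs).T

    # Split into boxes (first 4 values, xywh), confidence (max class score) and class ID
    boxes = predictions[:, :4]  # box coordinates (center x, center y, width, height)
    scores = np.max(predictions[:, 4:], axis=1)  # confidence: highest class score per box
    class_ids = np.argmax(predictions[:, 4:], axis=1)  # class ID: class with the highest score

    # Drop low-confidence detections (keep only boxes with confidence > threshold)
    mask = scores > conf_thresh
    boxes = boxes[mask]
    scores = scores[mask]
    class_ids = class_ids[mask]

    if len(scores) == 0:  # nothing passed the threshold
        return []

    # Convert xywh (center + size) to xyxy (top-left + bottom-right) and
    # scale back to the original image size
    x, y, w, h = boxes.T
    x1 = (x - w / 2) / scale  # top-left x
    y1 = (y - h / 2) / scale  # top-left y
    x2 = (x + w / 2) / scale  # bottom-right x
    y2 = (y + h / 2) / scale  # bottom-right y

    # Clip the boxes so they stay inside the original image; orig_size is (width, height)
    x1 = np.clip(x1, 0, orig_size[0])
    y1 = np.clip(y1, 0, orig_size[1])
    x2 = np.clip(x2, 0, orig_size[0])
    y2 = np.clip(y2, 0, orig_size[1])

    # Stack the coordinates into the layout NMS expects
    dets = np.column_stack((x1, y1, x2, y2))

    # Run non-maximum suppression to drop overlapping boxes
    # (class-agnostic: boxes of different classes can suppress each other)
    indices = _nms(dets, scores, iou_thresh)

    # Assemble the final result: class ID, confidence, box as (x, y, w, h) with x, y the top-left corner
    results = []
    for i in indices:
        results.append([
            int(class_ids[i]),
            float(scores[i]),
            [float(x1[i]), float(y1[i]), float(x2[i] - x1[i]), float(y2[i] - y1[i])]
        ])

    return results  # processed detections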
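
The decoding at the top of `postprocess` can be tried without a GPU. The sketch below feeds a synthetic tensor with the (1, 84, 8400) layout through the same squeeze, transpose, max, and argmax steps; the tensor contents (one candidate with a 0.9 score for class 16) are invented for the illustration.

```python
import numpy as np

# Synthetic tensor with the (1, 84, 8400) layout described above.
raw = np.zeros((1, 84, 8400), dtype=np.float32)
raw[0, :4, 10] = [320, 240, 100, 50]   # candidate 10: cx, cy, w, h
raw[0, 4 + 16, 10] = 0.9               # score of class 16 for candidate 10

pred = np.squeeze(raw).T               # -> (8400, 84), one row per candidate
scores = pred[:, 4:].max(axis=1)       # best class score per candidate
class_ids = pred[:, 4:].argmax(axis=1) # index of that best class
keep = scores > 0.5                    # confidence filter

cx, cy, w, h = pred[keep, :4].T
xyxy = np.column_stack((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
print(pred.shape, class_ids[keep], xyxy)
```

Only candidate 10 survives the filter, and its center-format box (320, 240, 100, 50) becomes the corner-format box (270, 215, 370, 265).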

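7. Main function (flow control)

The main function ties the whole inference pipeline together: initialize the engine → load the image → preprocess → prepare memory → run inference → post-process → visualize the result.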

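def main():
    # Initialize the Denglin GPU inference engine and its execution context
    engine, context = init_dlnne_engine(MODEL_PATH)
    num_bindings = engine.num_bindings  # number of input/output bindings

    # Load the original image
    orig_img = cv2.imread(IMAGE_PATH)
    if orig_img is None:  # check that the image was loaded
        print(f"Error: cannot load image {IMAGE_PATH}")
        return

    # Preprocess the image into the model's input format
    input_img, orig_size, scale = preprocess(orig_img, INPUT_WIDTH)

    # Prepare the GPU memory bindings for inputs and outputs
    bindings = []          # GPU buffers for all inputs and outputs, in binding order
    input_buffers = []     # references to input buffers (keeps them from being garbage collected)
    output_buffers = []    # references to output buffers
    output_shape, output_type = None, None  # shape and dtype of the first output binding

    for index in range(num_bindings):
        # Name, shape, and data type of this binding
        binding_name = engine.get_binding_name(index)
        binding_shape = engine.get_binding_shape(index)
        np_type = nne_util.nne_to_np_type(engine.get_binding_dtype(index))  # as a NumPy type

        # Bytes needed: element count x bytes per element.
        # int(...) because cuda.mem_alloc expects a plain Python int, not a NumPy integer.
        binding_size = int(np.prod(binding_shape)) * np_type(1).nbytes

        if engine.binding_is_input(index):  # input binding
            # Convert the input to the dtype the model expects, in contiguous memory
            input_data = np.ascontiguousarray(input_img, dtype=np_type)
            # Allocate a GPU buffer and copy the data from CPU to GPU
            input_buffer = cuda.to_device(input_data)
            bindings.append(input_buffer)
            input_buffers.append(input_buffer)
        else:  # output binding
            # Allocate GPU memory for the output
            output_buffer = cuda.mem_alloc(binding_size)
            bindings.append(output_buffer)
            output_buffers.append(output_buffer)
            if output_shape is None:  # remember the first output instead of assuming index 1
                output_shape, output_type = binding_shape, np_type

    # Run inference and time it
    start = time.perf_counter()
    context.execute(1, bindings)  # 1: batch size; bindings: input/output buffers
    inference_time = time.perf_counter() - start  # inference time in seconds

    # Copy the output from GPU to CPU
    output_data = np.empty(output_shape, dtype=output_type)  # host array that receives the result
    cuda.memcpy_dtoh(output_data, output_buffers[0])  # GPU -> CPU transfer

    # Post-process the output into the final detections
    detections = postprocess(output_data, orig_size, scale, CONF_THRESH, IOU_THRESH)

    # Print the inference result
    print(f"Inference time: {inference_time * 1000:.2f} ms")  # converted to milliseconds
    print(f"Detected {len(detections)} objects:")

    # Draw the detections on the original image
    for class_id, score, box in detections:
        x, y, w, h = box  # box coordinates (x, y: top-left corner; w, h: size)
        label = f"{class_id}: {score:.2f}"  # label: class ID + confidence
        # Draw the rectangle (green, line width 2)
        cv2.rectangle(orig_img, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)
        # Draw the label text (green, font scale 0.5, line width 2)
        cv2.putText(orig_img, label, (int(x), int(y) - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    # Save the annotated image to disk
    output_path = "result.jpg"
    cv2.imwrite(output_path, orig_img)
    print(f"Result saved to: {output_path}")


# Entry point: run main() when the script is executed directly
if __name__ == "__main__":
    main()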

The complete script is the seven blocks above joined in order. Note that the required environment must be active before running it: both the Denglin SDK and the matching Python environment have to be started.

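Extension

The code above labels each detection with a numeric class ID. The modified version below prints class names instead and processes a whole directory of images; when testing it yourself, change the paths to match your own setup.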

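# Import the required libraries
import cv2               # OpenCV, for image processing
import numpy as np       # NumPy, for numerical computation
import dlnne as nne      # Denglin Technology deep learning inference engine
import pycuda.driver as cuda  # CUDA driver API, for GPU memory operations
import time              # timing, for performance measurement
import os                # operating-system module, for file operations
import glob              # file path pattern matching
from utils import nne_util  # custom helper functions for model handling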

# COCO dataset class names (the 80 classes YOLOv8 is trained on by default)
COCO_CLASSES = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
    "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack",
    "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball",
    "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket",
    "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
    "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
    "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote",
    "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

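# Configuration parameters
MODEL_PATH = "yolov8n.onnx"  # path to the ONNX model file (YOLOv8 nano)
TEST_IMAGE_DIR = "data/images/testimg/"  # directory with the test images
CONF_THRESH = 0.5  # confidence threshold (detections below it are dropped)
IOU_THRESH = 0.45  # IoU threshold (NMS parameter)
INPUT_WIDTH = 640  # width of the model input
INPUT_HEIGHT = 640  # height of the model input
OUTPUT_DIR = "results/"  # directory for the output images
WARMUP_ITERATIONS = 5  # warm-up iterations (lets GPU performance settle after initialization)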


# Initialize the Denglin Technology GPU inference engine
def init_dlnne_engine(model_path):
    cuda.init()  # initialize the CUDA driver
    # Create the model builder and parser
    with nne.Builder() as builder, nne.Parser() as parser:
        network = builder.create_network()  # create the network
        builder.config.max_batch_size = 1   # maximum batch size of 1
        # Fetch the weight-sharing configuration
        weight_share = nne_util.get_weight_share('all')
        builder.config.ws_mode = weight_share['weight_mode']  # set the weight-sharing mode

        # Parse the ONNX model
        if not parser.parse(model_path, network):
            raise RuntimeError("Failed to parse the model")  # abort if parsing fails

        engine = builder.build_engine(network)  # build the inference engine
        # Create the execution context (using the weight-sharing configuration)
        context = engine.create_execution_context(weight_share['cluster_cfg'])
        return engine, context  # engine and context


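# Image preprocessing (matches the model's input requirements)
def preprocess(image, target_size=640):
    img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # BGR to RGB (OpenCV loads images as BGR)
    h, w = img.shape[:2]  # original image size

    # Compute the scale factor (keeps the aspect ratio)
    scale = min(target_size / h, target_size / w)
    nh, nw = int(h * scale), int(w * scale)  # size after scaling
    # Resize the image
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)

    # Create the canvas (filled with 114 as background, common practice in the YOLO family)
    canvas = np.full((target_size, target_size, 3), 114, dtype=np.uint8)
    canvas[:nh, :nw] = resized  # place the resized image on the canvas

    # Normalize and reorder to the model input layout [batch, channel, height, width].
    # transpose() only returns a view, so make the result C-contiguous for the GPU copy.
    input_img = canvas.astype(np.float32) / 255.0
    input_img = np.ascontiguousarray(input_img.transpose(2, 0, 1)[None])

    return input_img, (w, h), scale  # processed image, original size, scale factor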


# Non-maximum suppression (NMS): removes heavily overlapping detection boxes
def _nms(dets, scores, thresh):
    x1 = dets[:, 0]  # top-left x of each box
    y1 = dets[:, 1]  # top-left y of each box
    x2 = dets[:, 2]  # bottom-right x of each box
    y2 = dets[:, 3]  # bottom-right y of each box

    areas = (x2 - x1 + 1) * (y2 - y1 + 1)  # area of each box
    order = scores.argsort()[::-1]  # sort by confidence, highest first

    keep = []  # indices of the boxes that are kept
    while order.size > 0:
        i = order[0]  # box with the highest remaining confidence
        keep.append(i)

        # Intersection rectangle with every other remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        # Width and height of the intersection
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h  # intersection area

        # IoU (intersection over union)
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep boxes whose IoU does not exceed the threshold
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]  # +1 because ovr was computed against order[1:]

    return keep  # indices of the kept boxes
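
Note that `_nms` ignores class IDs, so two objects of different classes that occupy almost the same region can suppress each other. One common workaround, shown here as a sketch rather than as part of the original script, is to shift the boxes of each class into a separate coordinate range before running NMS once. The names `nms` and `class_aware_nms` are assumptions for this example, and this self-contained `nms` uses plain `(x2 - x1)` areas without the +1 term.

```python
import numpy as np

def nms(dets, scores, thresh):
    """Plain greedy NMS over (x1, y1, x2, y2) boxes; returns kept indices."""
    x1, y1, x2, y2 = dets.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        w = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        h = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= thresh]
    return keep

def class_aware_nms(dets, scores, class_ids, thresh):
    """Shift each class into its own coordinate range so that boxes of
    different classes never overlap, then run ordinary NMS once."""
    offset = class_ids.astype(dets.dtype)[:, None] * (dets.max() + 1.0)
    return nms(dets + offset, scores, thresh)

dets = np.array([[10, 10, 110, 110],
                 [12, 12, 112, 112],    # near-duplicate of box 0, same class
                 [11, 11, 111, 111]],   # same place, different class
                dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
class_ids = np.array([0, 0, 1])

print(nms(dets, scores, 0.45))                         # [0]
print(class_aware_nms(dets, scores, class_ids, 0.45))  # [0, 2]
```

The class-agnostic call keeps a single box, while the class-aware call also keeps the box of the second class. Which behavior is preferable depends on the application.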


# Post-processing: parse the model output, convert to image coordinates, drop low-confidence results
def postprocess(outputs, orig_size, scale, conf_thresh=0.25, iou_thresh=0.45):
    predictions = np.squeeze(outputs).T  # (1, 84, 8400) -> (8400, 84)

    boxes = predictions[:, :4]  # box coordinates (center x, center y, width w, height h)
    scores = np.max(predictions[:, 4:], axis=1)  # highest class score of each box
    class_ids = np.argmax(predictions[:, 4:], axis=1)  # class ID of each box

    # Drop low-confidence boxes
    mask = scores > conf_thresh
    boxes = boxes[mask]
    scores = scores[mask]
    class_ids = class_ids[mask]

    if len(scores) == 0:  # nothing passed the threshold
        return []

    # Convert center coordinates to top-left and bottom-right corners
    x, y, w, h = boxes.T
    x1 = (x - w / 2) / scale  # scaled back to the original image
    y1 = (y - h / 2) / scale
    x2 = (x + w / 2) / scale
    y2 = (y + h / 2) / scale

    # Keep the coordinates inside the image; orig_size is (width, height)
    x1 = np.clip(x1, 0, orig_size[0])
    y1 = np.clip(y1, 0, orig_size[1])
    x2 = np.clip(x2, 0, orig_size[0])
    y2 = np.clip(y2, 0, orig_size[1])

    # Apply non-maximum suppression
    dets = np.column_stack((x1, y1, x2, y2))
    indices = _nms(dets, scores, iou_thresh)

    # Assemble the result: class ID, confidence, box as (x, y, w, h)
    results = []
    for i in indices:
        results.append([
            int(class_ids[i]),
            float(scores[i]),
            [float(x1[i]), float(y1[i]), float(x2[i] - x1[i]), float(y2[i] - y1[i])]
        ])

    return results


# Process a single image: the complete inference flow
def process_image(engine, context, bindings, input_buffer, output_buffer, image_path, output_dir):
    # Load the original image
    orig_img = cv2.imread(image_path)
    if orig_img is None:
        print(f"Error: cannot load image {image_path}")
        return None, 0  # callers must check for None before using the time

    # Preprocess the image
    input_img, orig_size, scale = preprocess(orig_img, INPUT_WIDTH)

    # Copy the input data to GPU memory (the host array must be contiguous)
    input_data = np.ascontiguousarray(input_img, dtype=np.float32)
    cuda.memcpy_htod(input_buffer, input_data)  # host to device (CPU -> GPU)

    # Run inference and time it
    start_time = time.perf_counter()
    context.execute(1, bindings)  # run inference (batch size 1)
    inference_time = time.perf_counter() - start_time

    # Fetch the output from GPU memory
    output_shape = engine.get_binding_shape(1)  # output shape (assumes binding index 1 is the output)
    output_data = np.empty(output_shape, dtype=np.float32)
    cuda.memcpy_dtoh(output_data, output_buffer)  # device to host (GPU -> CPU)

    # Post-process to get the detections
    detections = postprocess(output_data, orig_size, scale, CONF_THRESH, IOU_THRESH)

    # Save the result (only when an output directory is given)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists
        # Draw the boxes and labels
        for class_id, score, box in detections:
            x, y, w, h = box
            # Look up the object name for this class ID in the COCO class table
            class_name = COCO_CLASSES[class_id] if 0 <= class_id < len(COCO_CLASSES) else f"class_{class_id}"
            label = f"{class_name}: {score:.2f}"  # show the object name instead of the numeric ID
            # Draw the rectangle
            cv2.rectangle(orig_img, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)
            # Draw the label
            cv2.putText(orig_img, label, (int(x), int(y) - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

        # Save the annotated image
        output_path = os.path.join(output_dir, os.path.basename(image_path))
        cv2.imwrite(output_path, orig_img)

    return inference_time, len(detections)  # inference time and number of detected objects


# Main function: program entry
def main():
    # Initialize the Denglin GPU inference engine
    print("Initializing the Denglin GPU inference engine...")
    start_init = time.time()
    engine, context = init_dlnne_engine(MODEL_PATH)
    init_time = time.time() - start_init
    print(f"Engine initialized in {init_time:.2f} s")

    # Prepare the input/output bindings
    num_bindings = engine.num_bindings  # number of bindings
    bindings = []  # all bindings, in order
    input_buffers = []  # input buffers
    output_buffers = []  # output buffers

    # Collect all test image paths (jpg, png, and jpeg are supported)
    image_paths = glob.glob(os.path.join(TEST_IMAGE_DIR, "*.jpg")) + \
                  glob.glob(os.path.join(TEST_IMAGE_DIR, "*.png")) + \
                  glob.glob(os.path.join(TEST_IMAGE_DIR, "*.jpeg"))

    if not image_paths:  # no images found
        print(f"No image files found in directory {TEST_IMAGE_DIR}")
        return

    print(f"Found {len(image_paths)} test images")

    # Create the input/output buffer bindings
    for index in range(num_bindings):
        binding_shape = engine.get_binding_shape(index)  # shape of this binding
        # Data type of this binding (converted to a NumPy type)
        np_type = nne_util.nne_to_np_type(engine.get_binding_dtype(index))
        # Size of the binding in bytes; int(...) because cuda.mem_alloc
        # expects a plain Python int, not a NumPy integer
        binding_size = int(np.prod(binding_shape)) * np_type(1).nbytes

        if engine.binding_is_input(index):  # input binding
            input_buffer = cuda.mem_alloc(binding_size)  # allocate the input buffer
            bindings.append(input_buffer)
            input_buffers.append(input_buffer)
        else:  # output binding
            output_buffer = cuda.mem_alloc(binding_size)  # allocate the output buffer
            bindings.append(output_buffer)
            output_buffers.append(output_buffer)

    # Warm up the GPU (so that the timings below are representative)
    print("Warming up the GPU...")
    for _ in range(WARMUP_ITERATIONS):
        process_image(engine, context, bindings, input_buffers[0], output_buffers[0],
                      image_paths[0], None)  # warm-up results are not saved
    print("Warm-up done")

    # Process all images and collect performance statistics
    total_inference_time = 0  # total inference time
    total_detections = 0      # total number of detected objects
    processed_images = 0      # number of images processed

    print(f"\nProcessing images ({len(image_paths)} in total)...")
    start_total = time.time()  # overall start time

    # Iterate over all images
    for i, image_path in enumerate(image_paths):
        # Process one image
        inference_time, detections = process_image(
            engine, context, bindings, input_buffers[0], output_buffers[0],
            image_path, OUTPUT_DIR
        )

        # process_image returns (None, 0) for an unreadable file; leave it out of the statistics
        if inference_time is not None:
            total_inference_time += inference_time
            total_detections += detections
            processed_images += 1

        # Show progress (every 10 images and at the last one)
        if (i + 1) % 10 == 0 or (i + 1) == len(image_paths):
            print(f"Processed {i + 1}/{len(image_paths)} images", end='\r')

    total_time = time.time() - start_total  # total elapsed time

    if processed_images == 0 or total_inference_time == 0:
        print("\nNo image was processed successfully; no statistics to report")
        return

    # Compute the performance metrics
    fps = processed_images / total_inference_time  # images per second
    avg_inference_time = total_inference_time / processed_images * 1000  # mean inference time (ms)
    avg_detections = total_detections / processed_images  # mean detections per image

    # Print the performance report
    print("\n\n===== Performance report =====")
    print(f"Images processed: {processed_images}")
    print(f"Total processing time: {total_time:.2f} s")
    print(f"Total inference time: {total_inference_time:.2f} s")
    print(f"Mean inference time per image: {avg_inference_time:.2f} ms")
    print(f"Denglin GPU throughput: {fps:.2f} FPS (images per second)")
    print(f"Mean detections per image: {avg_detections:.2f}")
    print(f"Results saved to: {OUTPUT_DIR}")


# Program entry point
if __name__ == "__main__":
    main()
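
The warm-up-then-measure pattern used in `main` can be isolated into a small helper. The sketch below is hardware-independent: `benchmark` is a hypothetical name, and the lambda is a stand-in for the real call, which in this script would be `context.execute(1, bindings)`. Whether that call returns only after the GPU has finished is an assumption worth checking against the Denglin SDK documentation before trusting the numbers.

```python
import time

def benchmark(fn, warmup=5, runs=20):
    """Call fn() `warmup` times untimed, then return the mean latency in
    milliseconds over `runs` timed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / runs * 1000.0

calls = []  # records every invocation so the call count can be checked
mean_ms = benchmark(lambda: calls.append(sum(range(1000))))
print(f"{mean_ms:.4f} ms per call over 20 runs, {len(calls)} calls in total")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock, which makes it better suited to short intervals than `time.time()`.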

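Summary
This code implements a YOLOv8 object detection pipeline on a Denglin Technology GPU. Its core logic is:
initializing the GPU inference engine with the dlnne library, adapted to the hardware;
preprocessing the input image to match the model's input requirements;
managing GPU memory through pycuda for data transfer and inference execution;
post-processing the model output (filtering, NMS) to obtain the final detections.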
