从UNet训练到TensorRT部署：完整语义分割模型部署优化指南(windows版本)

本文详细介绍了从UNet模型训练到TensorRT部署的完整流程，涵盖了数据准备、模型训练、ONNX导出、TensorRT优化以及Python/C++推理实现等关键环节。通过TensorRT部署，我们能够显著提升模型的推理速度，满足实际生产环境对实时性的要求

m0_37755875

2067人浏览 · 2025-04-25 10:38:56

m0_37755875 · 2025-04-25 10:38:56 发布

1. 引言

语义分割是计算机视觉领域的重要任务，它要求对图像中的每个像素进行分类。UNet作为一种经典的语义分割网络架构，在医学影像、自动驾驶、遥感图像分析等领域有着广泛应用。然而，在实际生产环境中，我们往往需要将训练好的模型部署到边缘设备或服务器上，这就涉及到模型优化和加速的问题。

TensorRT是NVIDIA推出的高性能深度学习推理库，能够显著提升模型在NVIDIA GPU上的推理速度。本文将详细介绍从UNet模型训练到TensorRT部署的完整流程，包括环境配置、数据准备、模型训练、ONNX导出、TensorRT转换以及C++/Python推理实现。

2. 环境准备

2.1 基础环境安装

前置环境：vs2022社区版，可以自行安装

在开始之前，我们需要配置好基础环境：

2.1.1 推荐版本组合

(可以先用deepseek搜，不然中途容易出错不兼容)

组件	推荐版本	关键说明
python	3.10	兼容tensorRT，vit等，更新的版本目前tensoRT不兼容
CUDA	11.8 (主版本必须匹配)	TensorRT 8.6.0.12 官方明确支持 CUDA 11.8
TensorRT	8.6.0.12
cuDNN	8.9.2.26	TensorRT 8.6.x 依赖 cuDNN 8.9.x
PyTorch	2.0.1+cu118	PyTorch 官方为 CUDA 11.8 提供预编译包（从 2.0 开始支持）
ONNX	1.14.0	兼容 PyTorch 2.0+ 和 TensorRT 8.6
ONNX Runtime	1.15.1	必须匹配 ONNX 版本

2.1.2 CUDA和cuDNN安装：

CUDA安装及环境配置——最新详细版-CSDN博客

2.1.3 pytorch安装

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 \
  -i https://mirrors.aliyun.com/pypi/simple/ \
  --trusted-host mirrors.aliyun.com

或者（更推荐）

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 \
  -f https://mirrors.aliyun.com/pytorch-wheels/torch_stable.html

验证安装

import torch
print(torch.__version__)          # 应输出 2.0.1+cu118
print(torch.cuda.is_available())  # 应输出 True
print(torch.cuda.get_device_name(0))  # 显示GPU型号

2.1.4 ONNX 和 ONNX Runtime安装

pip install onnx==1.14.0 onnxruntime-gpu==1.15.1 -i https://mirrors.aliyun.com/pypi/simple/

2.1.5 TensorRT 8.6.0.12 安装

详解Windows系统安装TensorRT_windows安装tensorrt-CSDN博客

注意最后要在python环境中安装whl来导入tensorRT,保证tensoRT的C++与Python环境一致

2.1.6 opencv安装

c++版本

https://blog.csdn.net/qq_41277822/article/details/104018866

https://blog.csdn.net/qq_27278957/article/details/108224325

python版本

pip install opencv-python==4.5.5.64  -i https://mirrors.aliyun.com/pypi/simple/

2.2 python其他依赖库安装

pip install albumentations pandas scikit-learn labelme -i https://mirrors.aliyun.com/pypi/simple/
pip install segmentation-models-pytorch -i https://mirrors.aliyun.com/pypi/simple/
pip install tensorboardx timm tqdm wandb -i https://mirrors.aliyun.com/pypi/simple/

3. 数据准备与标注

3.1 标注流程：

打开图像文件
使用多边形工具绘制目标区域
为每个区域指定类别标签
保存为JSON格式

{
  "version": "5.4.1",
  "flags": {},
  "shapes": [
    {
      "label": "road",
      "points": [
        [
          11.602409638554185,
          770.3949016703393
        ],
        [
          488.9156626506025,
          619.5180722891566
        ],
        [
          1858.7951807228915,
          652.0481927710844
        ],
        [
          1891.1204819277104,
          799.3105643209418
        ],
        [
          1919.0,
          1079.0
        ],
        [
          0.9999999999998863,
          1080.0
        ]
      ],
      "group_id": null,
      "description": null,
      "shape_type": "polygon",
      "flags": {},
      "mask": null
    }
  ],
  "imagePath": "0_0.jpg",
  "imageHeight": 1080,
  "imageWidth": 1920
}

3.2 JSON转Mask转换

LabelMe生成的JSON标注需要转换为二值Mask图像以供训练使用：



import json
import json
from PIL import Image, ImageDraw
import glob
import os

def points_to_mask(json_data, image_path):
    # 解析JSON数据
    data = json.loads(json_data)
    height = data["imageHeight"]
    width = data["imageWidth"]

    # 创建一个新的和原图同样大小的白色背景图像，用于存储mask
    mask = Image.new('1', (width, height), color=0)  # '1'表示1位像素（黑白）
    draw = ImageDraw.Draw(mask)

    # 遍历 shapes 中的多边形
    for shape in data["shapes"]:
        if shape["shape_type"] == "polygon":
            points = [(p[0], p[1]) for p in shape["points"]]
            draw.polygon(points, outline=1, fill=1)  # 填充前景为白色（值为1）

    # 将图像保存为二进制掩模文件（例如：a_mask.png）
    mask.save(f"{image_path.split('.')[0]}_mask.png", "PNG")


if __name__ == '__main__':
    directory = r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\dataset\images"
    for root, dirs, files in os.walk(directory):
        # 仅处理.json文件
        for file in files:
            if file.endswith(".json"):
                # 构建完整文件路径
                file_path = os.path.join(root, file)

                # 打开并读取json文件
                with open(file_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)

                height = data["imageHeight"]
                width = data["imageWidth"]

                # 创建一个新的和原图同样大小的白色背景图像，用于存储mask
                mask = Image.new('1', (width, height), color=0)  # '1'表示1位像素（黑白）
                draw = ImageDraw.Draw(mask)

                polyshape = data["shapes"]
                image_path = os.path.join(root, data["imagePath"])
                # 遍历 shapes 中的多边形

                for shape in data["shapes"]:
                    if shape["shape_type"] == "polygon":
                        points = [(p[0], p[1]) for p in shape["points"]]
                        draw.polygon(points, outline=1, fill=1)  # 填充前景为白色（值为1）

                # 将图像保存为二进制掩模文件（例如：a_mask.png）
                mask.save(f"{image_path.split('.')[0]}_mask.png", "PNG")

                # 处理数据（这里只是一个打印示例）
                print(f"Reading JSON file: {file_path}")
                # print(json.dumps(data, indent=2))

                # 在此处进行你对JSON数据的进一步处理或操作

4. UNet模型训练

4.1 模型架构定义

我们使用segmentation_models_pytorch库快速构建UNet模型：

model.py

import segmentation_models_pytorch as smp
from CFG import *
import torch

def build_model():
    model = smp.Unet(
        encoder_name=CFG.backbone,      # choose encoder, e.g. mobilenet_v2 or efficientnet-b7
        encoder_weights="imagenet",     # use `imagenet` pre-trained weights for encoder initialization
        in_channels=3,                  # model input channels (1 for gray-scale images, 3 for RGB, etc.)
        classes=CFG.num_classes,        # model output channels (number of classes in your dataset)
        activation=None,
    )
    model.to(CFG.device)
    return model

def load_model(path):
    model = build_model()
    model.load_state_dict(torch.load(path))
    model.eval()
    return model

4.2 训练配置

CFG.py

import torch
class CFG:
    seed          = 101
    debug         = False # set debug=False for Full Training
    exp_name      = '2.5D'
    comment       = 'unet-efficientnet_b0-160x192-ep=15'
    model_name    = 'Unet'
    backbone      = 'efficientnet-b0'
    train_bs      = 2
    valid_bs      = train_bs
    img_size      = [1088, 1920]

    epochs        = 15
    lr            = 2e-3
    scheduler     = 'CosineAnnealingLR'
    min_lr        = 1e-6
    T_max         = int(30000/train_bs*epochs)+50
    T_0           = 15
    warmup_epochs = 0
    wd            = 1e-6
    n_accumulate  = max(1, 32//train_bs)
    n_fold        = 5
    folds         = [0]
    num_classes   = 3
    device        = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

4.3 数据加载及训练和预测代码


from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"
import random
from glob import glob
import os, shutil
from tqdm import tqdm
tqdm.pandas()
import time
import copy
import joblib
from collections import defaultdict
import gc
# from IPython import display as ipd
# visualization
import cv2
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from CFG  import *
# Sklearn

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader
from torch.cuda import amp

# Albumentations for augmentations
import albumentations as A
from albumentations.pytorch import ToTensorV2

import rasterio
from joblib import Parallel, delayed
# For colored terminal text
from colorama import Fore, Back, Style
c_  = Fore.GREEN
sr_ = Style.RESET_ALL

import warnings
from loss import *
warnings.filterwarnings("ignore")

data_transforms = {
    "train": A.Compose([
        #         A.Resize(*CFG.img_size, interpolation=cv2.INTER_NEAREST),
        A.OneOf([
            A.HorizontalFlip(p=1),
            A.VerticalFlip(p=1),
        ], p=1),

        # A.RandomCrop(width=256, height=256),
        # A.Affine(scale=(0.9,1.1), rotate = (-15,15), shear = (-7,7), p = 0.5),

        A.RandomBrightnessContrast(brightness_limit=0.001, contrast_limit=0.001, p=0.5),

        A.GaussNoise(var_limit=0.00002, p=0.5),
        A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.05, rotate_limit=10, p=0.5),
        A.OneOf([
            A.GridDistortion(num_steps=5, distort_limit=0.05, p=1.0),
            # #             A.OpticalDistortion(distort_limit=0.05, shift_limit=0.05, p=1.0),
            A.ElasticTransform(alpha=1, sigma=50, alpha_affine=50, p=1.0)
        ], p=0.25),

        A.CoarseDropout(max_holes=8, max_height=CFG.img_size[0] // 20, max_width=CFG.img_size[1] // 20,
                        min_holes=5, fill_value=0, mask_fill_value=0, p=0.5),

        A.CenterCrop(width=1920, height=992),
    ], p=1.0),

    "valid": A.Compose([
        #         A.Resize(*CFG.img_size, interpolation=cv2.INTER_NEAREST),
        A.CenterCrop(width=1920, height=992),
    ], p=1.0)
}

def set_seed(seed=42):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    print('> SEEDING DONE')


# imgage
def load_img(path):
    img = cv2.imread(path)
    img = img.astype('float32')  # original is uint16
    if img is None:
        print(path)
    mx = np.max(img)
    if mx:
        img /= mx  # scale image to [0, 1]
    return img


def load_msk(path):
    msk = cv2.imread(path)
    msk = msk[:,:,0]
    if(msk is None):
        print(path)
    msk = msk.astype('float32')

    msk /= 255.0
    return msk


def show_img(img, mask=None):
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    #     img = clahe.apply(img)
    #     plt.figure(figsize=(10,10))
    plt.imshow(img, cmap='bone')

    if mask is not None:
        # plt.imshow(np.ma.masked_where(mask!=1, mask), alpha=0.5, cmap='autumn')
        plt.imshow(mask, alpha=0.5)
        handles = [Rectangle((0, 0), 1, 1, color=_c) for _c in
                   [(0.667, 0.0, 0.0), (0.0, 0.667, 0.0), (0.0, 0.0, 0.667)]]
        labels = ["Large Bowel", "Small Bowel", "Stomach"]
        plt.legend(handles, labels)
    plt.axis('off')


# rle
# ref: https://www.kaggle.com/paulorzp/run-length-encode-and-decode
def rle_decode(mask_rle, shape):
    '''
    mask_rle: run-length as string formated (start length)
    shape: (height,width) of array to return
    Returns numpy array, 1 - mask, 0 - background

    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape)  # Needed to align to RLE direction


# dataset


class BuildDataset(torch.utils.data.Dataset):
    def __init__(self,df, label=True, transforms=None):
        self.label = label
        self.img_paths = df["imgpath"]
        self.mask_paths = df["maskpath"]
        # for path in self.img_paths :
        #     new_path = path.replace(".jpg", "_mask.png")
        #     self.mask_paths.append(new_path)

        self.transforms = transforms

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, index):
        img_path = self.img_paths[index]

        img = load_img(img_path)

        if self.label:
            msk_path = self.mask_paths[index]
            msk = load_msk(msk_path)
            if self.transforms:
                data = self.transforms(image=img, mask=msk)
                img = data['image']
                msk = data['mask']
            img = np.transpose(img, (2, 0, 1))
            msk = np.transpose(msk, (2, 0, 1))
            return torch.tensor(img), torch.tensor(msk)
        else:
            if self.transforms:
                data = self.transforms(image=img)
                img = data['image']
            img = np.transpose(img, (2, 0, 1))
            return torch.tensor(img)


# train function
def train_one_epoch(model, optimizer, scheduler, dataloader, device, epoch):
    model.train()
    scaler = amp.GradScaler()

    dataset_size = 0
    running_loss = 0.0

    pbar = tqdm(enumerate(dataloader), total=len(dataloader), desc='Train ')
    for step, (images, masks) in pbar:
        images = images.to(device, dtype=torch.float)
        masks = masks.to(device, dtype=torch.float)

        batch_size = images.size(0)

        with amp.autocast(enabled=True):
            y_pred = model(images)
            loss = criterion(y_pred, masks)
            loss = loss / CFG.n_accumulate

        scaler.scale(loss).backward()

        if (step + 1) % CFG.n_accumulate == 0:
            scaler.step(optimizer)
            scaler.update()

            # zero the parameter gradients
            optimizer.zero_grad()

            if scheduler is not None:
                scheduler.step()

        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size

        epoch_loss = running_loss / dataset_size

        mem = torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0
        current_lr = optimizer.param_groups[0]['lr']
        pbar.set_postfix(train_loss=f'{epoch_loss:0.4f}',
                         lr=f'{current_lr:0.5f}',
                         gpu_mem=f'{mem:0.2f} GB')
        torch.cuda.empty_cache()
        gc.collect()

    return epoch_loss


# valid

@torch.no_grad()
def valid_one_epoch(model, dataloader, device, optimizer,epoch):
    model.eval()

    dataset_size = 0
    running_loss = 0.0

    val_scores = []

    pbar = tqdm(enumerate(dataloader), total=len(dataloader), desc='Valid ')
    for step, (images, masks) in pbar:
        images = images.to(device, dtype=torch.float)
        masks = masks.to(device, dtype=torch.float)

        batch_size = images.size(0)

        y_pred = model(images)
        loss = criterion(y_pred, masks)

        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size

        epoch_loss = running_loss / dataset_size

        y_pred = nn.Sigmoid()(y_pred)
        val_dice = dice_coef(masks, y_pred).cpu().detach().numpy()
        val_jaccard = iou_coef(masks, y_pred).cpu().detach().numpy()
        val_scores.append([val_dice, val_jaccard])

        mem = torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0
        current_lr = optimizer.param_groups[0]['lr']
        pbar.set_postfix(valid_loss=f'{epoch_loss:0.4f}',
                         lr=f'{current_lr:0.5f}',
                         gpu_memory=f'{mem:0.2f} GB')
    val_scores = np.mean(val_scores, axis=0)
    torch.cuda.empty_cache()
    gc.collect()

    return epoch_loss, val_scores


def run_training(model, optimizer, scheduler, device, num_epochs,train_loader,valid_loader):
    # To automatically log gradients


    if torch.cuda.is_available():
        print("cuda: {}\n".format(torch.cuda.get_device_name()))

    start = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_dice = -np.inf
    best_epoch = -1
    history = defaultdict(list)

    for epoch in range(1, num_epochs + 1):
        gc.collect()
        print(f'Epoch {epoch}/{num_epochs}', end='')
        train_loss = train_one_epoch(model, optimizer, scheduler,
                                     dataloader=train_loader,
                                     device=CFG.device, epoch=epoch)

        val_loss, val_scores = valid_one_epoch(model, valid_loader,
                                               device=CFG.device,
                                               optimizer=optimizer,
                                               epoch=epoch)
        val_dice, val_jaccard = val_scores

        history['Train Loss'].append(train_loss)
        history['Valid Loss'].append(val_loss)
        history['Valid Dice'].append(val_dice)
        history['Valid Jaccard'].append(val_jaccard)

        # Log the metrics

        print(f'Valid Dice: {val_dice:0.4f} | Valid Jaccard: {val_jaccard:0.4f}')

        # deep copy the model
        if val_dice >= best_dice:
            print(f"{c_}Valid Score Improved ({best_dice:0.4f} ---> {val_dice:0.4f})")
            best_dice = val_dice
            best_jaccard = val_jaccard
            best_epoch = epoch

            best_model_wts = copy.deepcopy(model.state_dict())
            PATH = f"best_epoch-{0:02d}.bin"
            torch.save(model.state_dict(), PATH)
            # Save a model file from the current directory
            print(f"Model Saved{sr_}")

        last_model_wts = copy.deepcopy(model.state_dict())
        PATH = f"last_epoch-{0:02d}.bin"
        torch.save(model.state_dict(), PATH)

    end = time.time()
    time_elapsed = end - start
    print('Training complete in {:.0f}h {:.0f}m {:.0f}s'.format(
        time_elapsed // 3600, (time_elapsed % 3600) // 60, (time_elapsed % 3600) % 60))
    print("Best Score: {:.4f}".format(best_jaccard))

    # load best model weights
    model.load_state_dict(best_model_wts)

    return model, history


def fetch_scheduler(optimizer):
    if CFG.scheduler == 'CosineAnnealingLR':
        scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=CFG.T_max,
                                                   eta_min=CFG.min_lr)
    elif CFG.scheduler == 'CosineAnnealingWarmRestarts':
        scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=CFG.T_0,
                                                             eta_min=CFG.min_lr)
    elif CFG.scheduler == 'ReduceLROnPlateau':
        scheduler = lr_scheduler.ReduceLROnPlateau(optimizer,
                                                   mode='min',
                                                   factor=0.1,
                                                   patience=7,
                                                   threshold=0.0001,
                                                   min_lr=CFG.min_lr, )
    elif CFG.scheduer == 'ExponentialLR':
        scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.85)
    elif CFG.scheduler == None:
        return None

    return scheduler

def get_jpg_files(rootdir, format='.jpg'):
    jpg_files = []
    for root, dirs, files in os.walk(rootdir):
        for file in files:
            if file.endswith(format):
                jpg_files.append(os.path.join(root, file))

    return jpg_files
def prepare_loaders(train_df,valid_df, debug=False):

    train_dataset = BuildDataset(train_df, transforms=data_transforms['train'])
    valid_dataset = BuildDataset(valid_df, transforms=data_transforms['valid'])

    train_loader = DataLoader(train_dataset, batch_size=CFG.train_bs if not debug else 20,
                              num_workers=8, shuffle=True, pin_memory=True, drop_last=False)
    valid_loader = DataLoader(valid_dataset, batch_size=CFG.valid_bs if not debug else 20,
                              num_workers=8, shuffle=False, pin_memory=True)

    return train_loader, valid_loader

4.4 训练主循环


from segFunct import  *
from CFG import *
from model import *


if __name__ == '__main__':

    model = build_model()
    optimizer = optim.Adam(model.parameters(), lr=CFG.lr, weight_decay=CFG.wd)
    scheduler = fetch_scheduler(optimizer)

    imageList = get_jpg_files(r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\dataset\images")
    maskList = []
    for path in imageList :
        new_path = path.replace(".jpg", "_mask.png")
        maskList.append(new_path)

    train_df={}
    valid_df={}
    datasize =len(imageList)
    train_size = int(datasize*0.9)
    train_df["imgpath"] = imageList[0:train_size]
    train_df["maskpath"] = maskList[0:train_size]
    valid_df["imgpath"] = imageList[train_size:]
    valid_df["maskpath"] = maskList[train_size:]


    for fold in CFG.folds:

        train_loader, valid_loader = prepare_loaders(train_df, valid_df, debug=CFG.debug)
        model     = build_model()
        optimizer = optim.Adam(model.parameters(), lr=CFG.lr, weight_decay=CFG.wd)
        scheduler = fetch_scheduler(optimizer)
        model, history = run_training(model, optimizer, scheduler,
                                      device=CFG.device,
                                      num_epochs=CFG.epochs,
                                      train_loader=train_loader,
                                      valid_loader=valid_loader)
        
        
        # run.finish()
        # display(ipd.IFrame(run.url, width=1000, height=720))

5 模型推理和部署

5.1 Pytorch推理


import torch
import cv2
import numpy as np
from  segFunct import load_img,data_transforms
from model import  build_model
import time

if __name__ == '__main__':
    # img = cv2.imread(r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\Back\2\2024-02-18 162918\0_40.jpg")
    img_copy = cv2.imread(r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\Back\2\2024-02-18 162918\0_40.jpg")
    img_copy = data_transforms['valid'](image=img_copy)['image']
    img = load_img(r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\Back\2\2024-02-18 162918\0_40.jpg")
    img = data_transforms['valid'](image=img)
    img = img['image']
    img = np.transpose(img, (2, 0, 1))
    img = torch.tensor(img)
    img = torch.unsqueeze(img,dim = 0)
    img = img.cuda()
    model = build_model()
    model.eval()
    PATH = 'best_epoch-00.bin'
    start_time = time.time()
    # 使用torch.load()函数加载模型权重数据
    state_dict = torch.load(PATH, map_location='cuda:0')  # device可以是'cpu'或cuda:0等设备标识符
    # 加载模型权重
    model.load_state_dict(state_dict)
    # 将模型设置为评估模式（如果需要进行预测或者评估）
    with torch.no_grad():
        y_pred = model(img)
    numpy_array = y_pred.cpu().numpy()
    numpy_array = np.squeeze(numpy_array)
    image_hwc = numpy_array.transpose(1, 2, 0)
    image_hwc = np.mean(image_hwc,axis=2,keepdims=True)
    image_hwc = np.where(image_hwc > 0.5, 1, 0.3)
    image_hwc = (img_copy * image_hwc).astype(np.uint8)
    cv2.imwrite("abc.png",image_hwc)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Your function executed in: {execution_time} seconds")

分割结果

推理时间

5.2 PyTorch转ONNX


import torch.nn as nn
import cv2
import torch.nn.functional as F
import numpy as np
import onnx
from model import *
class Preprocess(nn.Module):
    def __init__(self,path= 'best_epoch-00.bin'):
        super().__init__()
        self.mean = 1
        self.var = 1
        self.model =     smp.Unet(
        encoder_name=CFG.backbone,      # choose encoder, e.g. mobilenet_v2 or efficientnet-b7
        encoder_weights="imagenet",     # use `imagenet` pre-trained weights for encoder initialization
        in_channels=3,                  # model input channels (1 for gray-scale images, 3 for RGB, etc.)
        classes=CFG.num_classes,        # model output channels (number of classes in your dataset)
        activation=None,
        )
        self.model.to(torch.device("cpu"))
        state_dict = torch.load(path, map_location='cpu')  # device可以是'cpu'或cuda:0等设备标识符
        # 加载模型权重
        self.model.load_state_dict(state_dict)


    def forward(self,x):

        x = x.to(torch.float32)
        mx = torch.max(x)
        x /= mx  # scale image to [0, 1]
        x = F.pad(x, pad=(0,0, 0, 0,4,4), mode='constant', value=0.0)
        x =  x.permute(2, 0, 1)
        x = torch.unsqueeze(x, dim=0)
        x = self.model(x)
        x = torch.squeeze(x)
        x = x.permute(1, 2, 0)
        x = torch.mean(x, dim=2, keepdims=True)
        return x

if __name__ == '__main__':
    img= cv2.imread(r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\Back\2\2024-02-18 162918\0_40.jpg")
    img_copy = img.copy()
    img = img.astype(np.float32)
    img = torch.from_numpy(img)#1080 1920 3，uint8
    preMode = Preprocess()
    # with torch.no_grad():
    #     img = preMode(img)
    #
    #
    # numpy_array =img.numpy()
    #
    # image_hwc = np.where(numpy_array > 0.5, 1, 0.3)
    #
    # image_hwc = image_hwc[4:-4,:,:]
    # image_hwc = (img_copy * image_hwc).astype(np.uint8)
    #
    # cv2.imwrite("abcd.png",image_hwc)

    torch.onnx.export(
        preMode,
        # 这里的args，是指输入给model的参数，需要传递tuple，因此用括号
        (img,),
        # 储存的文件路径
        "Preprocess.onnx",
        # 打印详细信息
        verbose=True,
        # 为输入和输出节点指定名称，方便后面查看或者操作
        input_names=["image"],
        output_names=["model_input"],
        # 这里的opset，指，各类算子以何种方式导出，对应于symbolic_opset11
        opset_version=11,

    )

注意：模型导出时并不是直接导出模型，而是进行了输入输出封装，以简化tensor推理在C++的前处理和后处理流程，另外涉及到一些并行运算，也可以提高处理速度

onnx运行验证

import onnx
import cv2
import numpy as np
import onnxruntime

if __name__ == '__main__':
    img= cv2.imread(r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\Back\2\2024-02-18 162918\0_40.jpg")
    img = img.astype(np.float32)
    ort_session = onnxruntime.InferenceSession("Preprocess.onnx")
    input_name = ort_session.get_inputs()[0].name  # 获取模型输入名
    img_input =aimg  # 调整维度到CHW（如果需要的话）
    model_output = ort_session.run(None, {input_name: img_input})[0]
    image_hwc = np.where(model_output > 0.5, 1, 0.3)
    image_hwc = image_hwc[4:-4,:,:]
    image_hwc = (img * image_hwc).astype(np.uint8)
    cv2.imwrite("abcde.png",image_hwc)
    a = 0

运行结果

5.3 TensorRT引擎生成和推理

可参考：https://blog.csdn.net/weixin_42492254/article/details/126028199

5.3.1 使用trtexec工具将ONNX转换为TensorRT引擎：

trtexec.exe --onnx=model.onnx --saveEngine=model.trt --fp16

5.3.2 Python tensorRT运行推理

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import cv2
import time
# 设置TensorRT日志级别
TRT_LOGGER = trt.Logger()


def get_engine(engine_path):
    """
    从文件加载序列化的TensorRT引擎
    """
    print(f"从文件 {engine_path} 加载引擎")
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


# 加载模型引擎（此处替换为实际的.engine文件路径）
engine = get_engine("model.trt")

# 打印引擎信息及输入输出绑定的相关数据
for binding in engine:
    # 计算当前绑定的内存大小（volume）并获取维度信息
    size = trt.volume(engine.get_binding_shape(binding)) * 1
    dims = engine.get_binding_shape(binding)
    print(size)
    print(dims)
    print(binding)
    # 检查当前binding是否为输入
    print(engine.binding_is_input(binding))
    # 设定输入输出数据类型（假设为float32）
    dtype = np.float32
    print(f"data type: {dtype}")

# 创建执行上下文
context = engine.create_execution_context()


# def preprocess_image(image_path):
#     """
#     预处理图像，将其转换为适合模型输入的格式
#     """
#     img = cv2.imread(image_path)
#     # 这里可能需要对图像进行缩放、归一化等预处理操作以适应模型要求
#     # 示例中未提供具体的预处理步骤，请根据模型需求自行添加
#     return img


def get_landmarks(image):
    """
    使用TensorRT引擎进行推理，并返回预测结果
    """
    # 将输入图像转换为浮点型numpy数组，并准备CUDA pinned内存
    img_in = image
    h_input = cuda.pagelocked_empty(trt.volume(context.get_binding_shape(0)), dtype=np.float32)

    # 分配GPU内存并创建CUDA流
    d_input = cuda.mem_alloc(h_input.nbytes)
    stream = cuda.Stream()

    # 将输入数据复制到CUDA pinned内存
    np.copyto(h_input, img_in.ravel())

    # 异步将输入数据从CPU拷贝到GPU
    cuda.memcpy_htod_async(d_input, h_input, stream)

    # 准备GPU内存用于存储输出结果
    h_output = cuda.pagelocked_empty(trt.volume(context.get_binding_shape(1)), dtype=np.float32)
    d_output = cuda.mem_alloc(h_output.nbytes)

    # 执行推理
    bindings = [int(d_input), int(d_output)]
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

    # 异步将输出数据从GPU拷贝回CPU
    cuda.memcpy_dtoh_async(h_output, d_output, stream)

    # 等待所有CUDA操作完成
    stream.synchronize()

    # 返回处理后的输出结果
    return h_output.reshape(1088, 1920, 1)


# 测试函数性能
for i in range(0, 1000):
    start_time = time.time()
    img1 = cv2.imread(r"C:\Users\lanqi\Desktop\aiDebug\segment-anything-main\Back\2\2024-02-18 162918\0_40.jpg")
    output = get_landmarks(img1)

    # 对输出结果进行后处理（这里仅为示例，具体逻辑请按实际情况调整）
    image_hwc = np.where(output > 0.5, 1, 0.3)
    image_hwc = image_hwc[4:-4, :]
    image_hwc = (img1 * image_hwc).astype(np.uint8)

    end_time = time.time()
    execution_time = end_time - start_time
    cv2.imwrite("trt.png", image_hwc)
    print(f"您的函数执行耗时: {execution_time} 秒")

运行结果

推理时间

5.4 C++ tensorRT运行推理

5.4.1 Visual studio 配置

kernel32.lib
user32.lib
gdi32.lib
winspool.lib
comdlg32.lib
advapi32.lib
shell32.lib
ole32.lib
oleaut32.lib
uuid.lib
odbc32.lib
odbccp32.lib
%(AdditionalDependencies)
nvinfer.lib
nvinfer_plugin.lib
nvonnxparser.lib
nvparsers.lib
cudnn.lib
cublas.lib
cudart.lib
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64\*.lib
C:\tensorRT\TensorRT-8.5.1.7\lib\*.lib
opencv_world3410.lib

5.4.2 C++代码

#include <fstream>
#include <iostream> 

#include <NvInfer.h> 
#include "logger.h"
#include <opencv2/core/core.hpp> 
#include <opencv2/highgui/highgui.hpp> 
#include <opencv2/imgproc.hpp>
#include <chrono>
#define CHECK(status) do{auto ret = (status); if (ret != 0){std::cerr << "Cuda failure: " << ret << std::endl; abort(); }} while (0)

using namespace nvinfer1;
using namespace sample;
using namespace std;
const char* IN_NAME = "image";
const char* OUT_NAME = "model_input";
static const int IN_H = 1080;
static const int IN_W = 1920;
static const int BATCH_SIZE = 1;
static const int EXPLICIT_BATCH = 1 << (int)(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);

class MyLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) noexcept override
    {
        // You can add your custom logging or store the error message in a variable.
        if (severity == nvinfer1::ILogger::Severity::kERROR)
        {
            std::cerr << "TensorRT Error: " << msg << std::endl;
            // Store the error message in a variable if needed.
        }
    }
};



void doInference(IExecutionContext & context, float* input, float* output, int batchSize)
{

    const ICudaEngine& engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine. 
    // Engine requires exactly IEngine::getNbBindings() number of buffers. 
    assert(engine.getNbBindings() == 2);
    void* buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors. 
    // Note that indices are guaranteed to be less than IEngine::getNbBindings() 
    const int inputIndex = engine.getBindingIndex(IN_NAME);
    const int outputIndex = engine.getBindingIndex(OUT_NAME);
 


    // Create stream 
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    // Create GPU buffers on device 
    CHECK(cudaMalloc(&buffers[inputIndex],  3 * IN_H * IN_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], 1088 * IN_W * sizeof(float)));
    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host 


    if (cudaMemcpyAsync(buffers[inputIndex], input, 3 * IN_H * IN_W * sizeof(float), cudaMemcpyHostToDevice, stream) != cudaSuccess)
    {
        std::cerr << "cudaMemcpyAsync from host to device failed: " << cudaGetErrorString(cudaGetLastError()) << std::endl;

        // Proper error handling and cleanup
        cudaFree(buffers[inputIndex]);
        cudaFree(buffers[outputIndex]);
        cudaStreamDestroy(stream);
        abort();
    }

    // Synchronize stream to ensure input data transfer is complete
    cudaStreamSynchronize(stream);

    // Run inference
    context.enqueue(batchSize, buffers, stream, nullptr);

    // Copy output data to host
    if (cudaMemcpyAsync(output, buffers[outputIndex], 1088 * IN_W * sizeof(float), cudaMemcpyDeviceToHost, stream) != cudaSuccess)
    {
        std::cerr << "cudaMemcpyAsync from device to host failed: " << cudaGetErrorString(cudaGetLastError()) << std::endl;

        // Proper error handling and cleanup
        cudaFree(buffers[inputIndex]);
        cudaFree(buffers[outputIndex]);
        cudaStreamDestroy(stream);
        abort();
    }

    // Synchronize stream to ensure output data transfer is complete
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[outputIndex]);
    
};
int main()
{
    MyLogger m_logger; 
    IExecutionContext* context = NULL;
    ICudaEngine* engine = NULL;
    IRuntime* runtime = NULL;
    char* trtModelStream{ nullptr };
    try {
        // create a model using the API directly and serialize it to a stream

        size_t size{ 0 };

        std::ifstream file("model.trt", std::ios::binary);
        if (file.good()) {
            file.seekg(0, file.end);
            size = file.tellg();
            file.seekg(0, file.beg);
            trtModelStream = new char[size];
            assert(trtModelStream);
            file.read(trtModelStream, size);
            file.close();
        }


        runtime = createInferRuntime(m_logger);
        assert(runtime != nullptr);


        engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);

        if (!engine)
        {
            throw std::runtime_error("Failed to create TensorRT engine.");
        }

        context = engine->createExecutionContext();
        if (!context)
        {
            throw std::runtime_error("Failed to create TensorRT execution context.");
        }

        // generate input data 
       // float data[BATCH_SIZE * 3 * IN_H * IN_W];
       //很重要，分配堆空间，不要在栈空间
        float* data = new float[BATCH_SIZE * 3 * IN_H * IN_W];
        float* prob = new float[1088 * IN_W];

        for(int i = 0;i<100;i++)
          {
            auto start_time = std::chrono::high_resolution_clock::now();
            cv::Mat img = cv::imread("0_40.jpg");

            if (img.empty())
            {
                throw std::runtime_error("Failed to read image.");
            }

            img.convertTo(img, CV_32FC3); // 将图像转换为float32并归一化

            // cv::Mat img_float32_flat;

            memcpy(data, img.data, sizeof(float) * IN_H * IN_W * 3);

            // Run inference 

            doInference(*context, data, prob, BATCH_SIZE);

            // Post-processing...
            cv::Mat output_mat(1088, IN_W, CV_32FC1, prob);
            cv::Mat binary_image;
            cv::threshold(output_mat, binary_image, 0.5, 1.0, cv::THRESH_BINARY);

            cv::Mat cropped_binary_image = binary_image(cv::Rect(0, 4, IN_W, 1080));

            cv::Mat img1_float32 = img; // 已经读取并转换为float32类型的图像
            cv::Mat cropped_color_image;
            cv::cvtColor(cropped_binary_image, cropped_color_image, cv::COLOR_GRAY2BGR);

            cv::Mat processed_img;

            cv::multiply(img1_float32, cropped_color_image, processed_img);
            processed_img.convertTo(processed_img, CV_8UC3);
            cv::imwrite("trt.png", processed_img);
            auto end_time = std::chrono::high_resolution_clock::now();

            auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

            // 输出程序运行时间
            std::cout << "程序运行时间: " << duration.count() << " 毫秒" << std::endl;
        }

        delete[] prob;
        delete[] data;
    }
    catch (const std::exception& e)
    {
        std::cerr << "Caught exception: " << e.what() << std::endl;
        return 1;
    }

    // Destroy the engine
    if (context)
    {
        context->destroy();
    }
    if (engine)
    {
        engine->destroy();
    }
    if (runtime)
    {
        runtime->destroy();
    }

    if (trtModelStream)
    {
        delete[] trtModelStream;
    }
    getchar();
    return 0;
}

5.4.3 推理结果

时间

6. 结论

本文详细介绍了从UNet模型训练到TensorRT部署的完整流程，涵盖了数据准备、模型训练、ONNX导出、TensorRT优化以及Python/C++推理实现等关键环节。通过TensorRT部署，我们能够显著提升模型的推理速度，满足实际生产环境对实时性的要求。

在实际应用中，还需要根据具体场景进行调整和优化，例如：

针对特定硬件平台进行调优
平衡精度和速度的关系
实现高效的流水线处理

希望本文能够为读者提供一条清晰的模型部署路径，帮助大家将训练好的语义分割模型高效地部署到生产环境中。

魔乐社区

魔乐社区（Modelers.cn) 是一个中立、公益的人工智能社区，提供人工智能工具、模型、数据的托管、展示与应用协同服务，为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作，由全产业链共同建设、共同运营、共同享有，推动国产AI生态繁荣发展。

更多推荐

替你试过了，消费级显卡可以跑的开源文生图SOTA模型，顶级渲染、高密度文本绘图

魔乐社区

量化挑战赛冠军专访：4小时啃下W4A8量化，我靠的是这些经验

魔乐社区

小参数・大码力・易部署 | Qwen3.6-27B上线魔乐社区，基于昇腾的部署教程来了

继一周前模型开源发布后，千问再度开源Qwen3.6-27B —— 一个拥有270亿参数的稠密多模态模型，也是社区呼声最高的模型规格。Qwen3.6-27B 依然支持多模态思考与非思考模式，在智能体编程方面达到了旗舰级表现，全面超越前代开源旗舰 Qwen3.5-397B-A17B（总参数397B / 激活参数17B的MoE模型）。作为稠密架构，它无需MoE路由即可部署，是开发者在实用、可广泛部署规模