After several days of hard work, I have finally finished going through the official TensorRT documentation and implemented a pycuda-based backend deployment. Since Python-oriented material on this is genuinely scarce, I want to share my experience. First, a summary of TensorRT's three main characteristics:
- Speed is justice: on GPU, inference is typically 50%+ faster than the regular framework path, and FP16/INT8 can double that again; in the official example, TensorRT 8 runs BERT-Large inference in just 1.2 ms! (TensorRT is GPU-only, because it relies on GPU-specific operator kernels; some articles misled me on this point for a long time. Corrections are welcome if I have anything wrong.)
- Generality: it supports multiple deep learning frameworks. The most popular route at the moment is torch -> onnx -> tensorRT (a minimal export sketch follows this list); TensorFlow has native support via pb -> tf-trt, where operators TensorRT lacks fall back to TensorFlow's own kernels, which is somewhat slower than pure TRT, and tf2onnx is another way to get TensorFlow models into ONNX.
- Specificity: TensorRT builds an engine optimized for the specific GPU it runs on. Engines are only barely portable between GPUs of the same compute capability, and even the same GPU model with different memory or core clock frequencies will show different performance.
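For the torch -> onnx step, here is a minimal export sketch; the model, input size, and tensor names are just placeholders:

```python
import torch
import torchvision

# placeholder model and dummy input; dynamic_axes marks the batch dimension as dynamic,
# which shows up as a -1 dimension once TensorRT parses the ONNX model
model = torchvision.models.resnet18(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "model.onnx",
    opset_version=11,
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```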
1. TensorRT workflow
The core of the TensorRT runtime API is the TRT engine and the TRT context (not to be confused with CUDA context management, discussed later).
The official workflow and the practical coding flow both come down to:
Builder -> Network (Parser / self-defined layers) -> Engine -> Context -> CUDA coding
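As a rough orientation, here is a minimal sketch of how those stages map onto the Python API; the ONNX path is a placeholder, and section 3 below gives the full, runnable version:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger()

builder = trt.Builder(TRT_LOGGER)                                  # Builder
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))    # Network
parser = trt.OnnxParser(network, TRT_LOGGER)                       # Parser
parser.parse_from_file("model.onnx")                               # placeholder path
config = builder.create_builder_config()
engine = builder.build_engine(network, config)                     # Engine
context = engine.create_execution_context()                        # Context
# CUDA coding: allocate device buffers with pycuda and run context.execute_async_v2
```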
2. TensorRT (7+) Dynamic Shapes
TensorRT 6+ supports dynamic batch, 7+ supports dynamic batch plus dynamic spatial dimensions, and 8+ supports more ONNX operators (such as GatherElements). Like ONNX itself, its backward compatibility is handled fairly well.
- explicit batch dimension

```python
create_network(1 << int(tensorrt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
```
- runtime dimension: dynamic dimensions are written as -1; a network parsed from ONNX infers the dynamic dimensions automatically.
- buildtime dimension: when building the engine, specify optimization profiles to set the allowed range of each dynamic tensor's shape.
- use engine (a minimal sketch of these steps follows this list):
a. Create an execution context from the engine, the same as without dynamic shapes.
b. Specify one of the optimization profiles defined at build time that covers the input dimensions.
c. Specify the input dimensions for the execution context.
d. Enqueue work.
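A minimal sketch of steps a-d; the profile index, input shape, and the `bindings`/`stream` objects are assumptions borrowed from the inference code in section 3:

```python
with engine.create_execution_context() as context:           # a. create an execution context
    context.active_optimization_profile = 0                  # b. pick an optimization profile
    context.set_binding_shape(0, (3, 150, 250))              # c. specify the runtime input dimensions
    context.execute_async_v2(bindings=bindings,
                             stream_handle=stream.handle)    # d. enqueue work
```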
runtime dimension
- if you define a network from scratch, input and output shapes should be specified explicitly
- if the network is parsed from ONNX or another format, shapes can be inferred automatically

```python
# when defining a network yourself
network_definition.add_input("foo", trt.float32, (3, -1, -1))
# at runtime
context.set_binding_shape(0, (3, 150, 250))
engine.get_binding_shape(0)   # returns (3, -1, -1)
context.get_binding_shape(0)  # returns (3, 150, 250)
```
buildtime dimension

```python
# at least one profile should be specified:
# minimum size (3, 100, 200), optimum size (3, 150, 250), maximum size (3, 200, 300)
profile = builder.create_optimization_profile()
profile.set_shape("foo", (3, 100, 200), (3, 150, 250), (3, 200, 300))
config.add_optimization_profile(profile)
```
3. Engine and context
Docker environment: cuda10.2-python3.6-trt8

```dockerfile
FROM nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

ENV TRT_VERSION=8.0.1.6 \
    TRT_HOME=/usr/local/Tensorrt \
    CUDA_HOME=/usr/local/cuda \
    LIBRARY_PATH=/usr/local/cuda/lib64 \
    LD_LIBRARY_PATH=/usr/local/Tensorrt/lib:/usr/local/cuda/lib64

# the CUDA devel image ships without python/pip; ubuntu 18.04's default python3 is 3.6
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# download the TensorRT tar package of the matching version and extract it into the current directory first
COPY . /usr/local/Tensorrt-$TRT_VERSION/

RUN ln -s /usr/local/Tensorrt-$TRT_VERSION /usr/local/Tensorrt \
    && pip3 install $TRT_HOME/python/tensorrt-$TRT_VERSION-cp36-none-linux_x86_64.whl

# build and start
# docker build -t cuda10.2-python3.6-trt8 .
# docker run -it --gpus all cuda10.2-python3.6-trt8 bash
```
trt engine

```python
import os
import time

import numpy as np
import pycuda.autoinit  # noqa: F401  creates a default CUDA context for this standalone script
import pycuda.driver as cuda
import tensorrt as trt


def build_or_create_engine(onnx_path, trt_path, input_optimizations=None, max_batch_size=1, save_engine=1,
                           re_build=0):
    TRT_LOGGER = trt.Logger()
    EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    # reuse a previously serialized engine unless a rebuild is forced
    if os.path.exists(trt_path) and not re_build:
        with trt.Runtime(TRT_LOGGER) as runtime, open(trt_path, 'rb') as f:
            engine = runtime.deserialize_cuda_engine(f.read())
            print('deserialized trt engine from file')
            return engine
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, \
            builder.create_builder_config() as config:
        if not os.path.exists(onnx_path):
            quit('onnx file {} not found'.format(onnx_path))
        with trt.OnnxParser(network, TRT_LOGGER) as parser, open(onnx_path, 'rb') as model:
            print('parsing onnx file {}'.format(onnx_path))
            if not parser.parse(model.read()):
                print('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return
        builder.max_batch_size = max_batch_size
        config.max_workspace_size = 1 << 30  # 1 GB workspace for tactic selection
        if input_optimizations:
            # one optimization profile covering every dynamic input: (name, min, opt, max) tuples
            profile = builder.create_optimization_profile()
            for s in input_optimizations:
                profile.set_shape(*s)
            config.add_optimization_profile(profile)
        print('building an engine from file {}; this may take a while...'.format(onnx_path))
        engine = builder.build_engine(network, config)
        if not engine:
            print("tensorRT engine build failed")
        else:
            print("tensorRT engine built successfully")
            if save_engine:
                with open(trt_path, "wb") as f:
                    f.write(engine.serialize())
                print('tensorRT engine saved')
        return engine
```
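A hypothetical call, assuming an ONNX model with a single dynamic input binding named "input"; the paths and shapes are placeholders:

```python
# (name, min, opt, max) shapes for the dynamic "input" binding -- adjust to your own model
profiles = [("input", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224))]
engine = build_or_create_engine("model.onnx", "model.trt",
                                input_optimizations=profiles, max_batch_size=8)
```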
trt context

```python
def allocate_buffers_and_inference(context, engine, trt_inputs: list = None):
    """
    allocate cpu and gpu buffers for inputs and outputs,
    init the gpu memory bindings and the cuda stream, then run inference
    :param context: trt context
    :param engine: trt engine
    :param trt_inputs: a list of real input arrays (numpy), one per input binding
    :return: real batch size, list of output host buffers
    """
    if trt_inputs and trt_inputs[0] is not None:
        # clip every real input to the engine's max batch size
        real_size = min(engine.max_batch_size, len(trt_inputs[0]))
        trt_inputs = [inp[:real_size] for inp in trt_inputs]
    else:
        real_size = engine.max_batch_size
    print('batch size %d' % real_size)
    inputs = []
    outputs = []
    bindings = []
    for i, binding in enumerate(engine):
        print(binding, engine.get_binding_shape(binding))
        # dynamic dimensions show up as -1, so take abs() and scale by the real batch size
        size = abs(trt.volume(engine.get_binding_shape(binding))) * real_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        is_input = engine.binding_is_input(binding)
        # Allocate host and device buffers; real inputs are used directly as host buffers
        if is_input and trt_inputs:
            host_mem = np.ascontiguousarray(trt_inputs[i], dtype=dtype)
            shape = host_mem.shape
        else:
            host_mem = cuda.pagelocked_empty(size, dtype)
            shape = (real_size, *engine.get_binding_shape(binding)[1:])
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if is_input:
            inputs.append({'host': host_mem, 'device': device_mem, 'shape': shape})
        else:
            outputs.append({'host': host_mem, 'device': device_mem, 'shape': shape})
    stream = cuda.Stream()
    for i in range(len(inputs)):
        context.set_binding_shape(i, inputs[i]['shape'])
    [cuda.memcpy_htod_async(inp['device'], inp['host'], stream) for inp in inputs]
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    [cuda.memcpy_dtoh_async(out['host'], out['device'], stream) for out in outputs]
    stream.synchronize()
    print(outputs)
    return real_size, [out['host'] for out in outputs]


with engine.create_execution_context() as context:
    t = time.time()
    batch_size, trt_outputs = allocate_buffers_and_inference(context, engine, trt_inputs=None)
    print('trt inference average %d ms' % ((time.time() - t) * 1000 / batch_size))
```
4. PyCUDA programming and Django uWSGI deployment
The PyCUDA side of the inference function above is just the stream, the host/device copies, and the synchronization:

```python
stream = cuda.Stream()
for i in range(len(inputs)):
    context.set_binding_shape(i, inputs[i]['shape'])
# host -> device copies (pycuda)
[cuda.memcpy_htod_async(inp['device'], inp['host'], stream) for inp in inputs]
# trt context runs the actual inference
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
# device -> host copies (pycuda)
[cuda.memcpy_dtoh_async(out['host'], out['device'], stream) for out in outputs]
stream.synchronize()
```
CUDA also has the notion of a context, which is responsible for GPU computation and for data transfer between GPU and CPU. Creating a new CUDA context is expensive, so in a backend service you create it once and use the pop and push methods to deactivate and activate it.
```python
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()  # make_context also activates (pushes) the new context
# call pop and push on the context to deactivate and activate it
ctx.pop()   # deactivate
ctx.push()  # activate again before doing GPU work
```
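For the Django + uWSGI case, here is a minimal sketch of how this can be organized: one CUDA context per worker process, pushed and popped around each request. The module layout, paths, and profile below are assumptions; it reuses build_or_create_engine / allocate_buffers_and_inference from above, and the context is created explicitly instead of importing pycuda.autoinit:

```python
# service.py -- imported once per uwsgi worker process
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()            # one context per worker, created once (expensive)
engine = build_or_create_engine(
    "model.onnx", "model.trt",                 # placeholder paths
    input_optimizations=[("input", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224))])
exec_context = engine.create_execution_context()
ctx.pop()                                      # deactivate until a request arrives


def infer(batch_arrays):
    """Called from a Django view: activate the context, run inference, deactivate."""
    ctx.push()
    try:
        _, outputs = allocate_buffers_and_inference(exec_context, engine, trt_inputs=batch_arrays)
    finally:
        ctx.pop()                              # always deactivate, even if inference raises
    return outputs
```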
5. Closing
In my tests, TensorRT inference took about 20 ms versus about 30 ms for the regular GPU path, roughly a 50% speedup. On professional cards such as the V100 or A100 the gain would be even larger.