Python Builds the Magpie Bridge: CUDA and TensorRT Finally Meet

After several days of work, I have finally finished working through the official TensorRT documentation and implemented a PyCUDA-based backend deployment.

Since Python material on this topic is genuinely scarce, I want to share my experience. First, a summary of three characteristics of TensorRT:

  • Speed is justice: relative to plain GPU inference, TensorRT gives a speedup of 50% or more, and FP16 / INT8 can roughly double that again; in the official TensorRT 8 showcase, BERT-Large inference takes only 1.2 ms! (GPU only, because execution relies on GPU-specific operator kernels. Some articles misled me on this point for a long time; corrections to any remaining mistakes are welcome.)
  • Generality: multiple deep-learning frameworks are supported. The most popular path today is torch -> onnx -> tensorRT; TensorFlow is supported natively (pb -> tf-trt, where operators TRT lacks fall back to TensorFlow's own kernels, which is somewhat slower than pure TRT), and TensorFlow models can also be converted through tf2onnx.
  • Customization: TensorRT builds an optimized engine for one specific GPU. Engines are only barely compatible across cards with the same compute capability, and even on the same card model, differences in memory size and GPU core clock frequency affect performance.

1. TensorRT workflow

The core of the TensorRT runtime API is the TRT engine and the TRT context (not to be confused with CUDA context management, which section 4 covers).

  • Official workflow
  • Practical coding flow: Builder -> Network -> Parser / self-defined layers -> Engine -> Context -> CUDA coding (a compact sketch follows)
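
A minimal sketch of that flow, assuming an ONNX model as input; the file name is a placeholder and error handling is omitted (the fully worked version appears in section 3):

import tensorrt as trt

TRT_LOGGER = trt.Logger()
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

# Builder -> Network -> Parser -> Engine
with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(EXPLICIT_BATCH) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser, \
        builder.create_builder_config() as config, \
        open('model.onnx', 'rb') as f:  # placeholder path
    parser.parse(f.read())
    engine = builder.build_engine(network, config)

# Engine -> Context -> CUDA coding (buffer allocation and copies via pycuda, see section 3)
with engine.create_execution_context() as context:
    pass  # allocate buffers, copy inputs, execute_async_v2, copy outputs back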

2. TensorRT (7+) Dynamic Shapes

Version 6+ supports explicit batch, 7+ supports explicit batch plus dynamic shapes, and 8+ supports more ONNX operators (such as GatherElements). As with ONNX itself, backward compatibility is handled fairly well.

  1. explicit batch dimension
    • Python
      create_network(1 << int(tensorrt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
  2. runtime dimension: dynamic dimensions are written as -1. A network parsed from ONNX infers its dynamic dimensions automatically.
  3. buildtime dimension: when building the engine, specify optimization profiles that define the allowed range of each dynamic tensor shape.
  4. use engine (a minimal sketch follows this list)
    a. Create an execution context from the engine, the same as without dynamic shapes.
    b. Specify one of the optimization profiles from step 3 that covers the input dimensions.
    c. Specify the input dimensions for the execution context.
    d. Enqueue work.
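
A minimal sketch of steps a-d, assuming `engine` has already been built with one optimization profile and a single float32 input of shape (3, -1, -1); the shapes and binding indices here are placeholders:

import numpy as np
import pycuda.autoinit  # noqa: F401 - creates a default CUDA context
import pycuda.driver as cuda

with engine.create_execution_context() as context:          # a. create the context
    context.active_optimization_profile = 0                  # b. pick a profile from step 3
    context.set_binding_shape(0, (3, 150, 250))              # c. concrete input dimensions
    inp = np.random.rand(3, 150, 250).astype(np.float32)
    out = np.empty(tuple(context.get_binding_shape(1)), dtype=np.float32)  # output shape is resolved now
    d_in, d_out = cuda.mem_alloc(inp.nbytes), cuda.mem_alloc(out.nbytes)
    stream = cuda.Stream()
    cuda.memcpy_htod_async(d_in, inp, stream)
    context.execute_async_v2(bindings=[int(d_in), int(d_out)],
                             stream_handle=stream.handle)    # d. enqueue work
    cuda.memcpy_dtoh_async(out, d_out, stream)
    stream.synchronize()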

runtime dimension

  • If you define a network from scratch, the input and output shapes must be specified explicitly.
  • If the network is parsed from ONNX or another format, the shapes can be inferred automatically.
  • python
      # when defining a network yourself
      network_definition.add_input("foo", trt.float32,(3, -1, -1))
      # at runtime
      context.set_binding_shape(0, (3, 150, 250))
      engine.get_binding_shape(0) # returns (3, -1, -1).
      context.get_binding_shape(0) # returns (3, 150, 250).

buildtime dimension

  • python
      # at least one optimization profile must be specified
      # minimum shape (3, 100, 200), optimal shape (3, 150, 250), maximum shape (3, 200, 300)
      profile = builder.create_optimization_profile()
      profile.set_shape("foo", (3, 100, 200), (3, 150, 250), (3, 200, 300))
      config.add_optimization_profile(profile)

3. Engine and context

Docker environment setup: cuda10.2-python3.6-trt8

FROM nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

ENV TRT_VERSION=8.0.1.6 \
TRT_HOME=/usr/local/Tensorrt \
CUDA_HOME=/usr/local/cuda \
LIBRARY_PATH=/usr/local/cuda/lib64 \
LD_LIBRARY_PATH=/usr/local/Tensorrt/lib:/usr/local/cuda/lib64

# the base image ships without Python, so install python3.6 plus pip (and build tools for pycuda)
RUN apt-get update && apt-get install -y python3 python3-dev python3-pip build-essential \
&& rm -rf /var/lib/apt/lists/*

# download the matching TensorRT tarball and extract it into the current directory beforehand
COPY . /usr/local/Tensorrt-$TRT_VERSION/
RUN ln -s /usr/local/Tensorrt-$TRT_VERSION /usr/local/Tensorrt \
&& pip3 install $TRT_HOME/python/tensorrt-$TRT_VERSION-cp36-none-linux_x86_64.whl \
&& pip3 install numpy pycuda
# build and start
# docker build -t cuda10.2-python3.6-trt8 .
# docker run -it --gpus all cuda10.2-python3.6-trt8 bash

trt engine

import os

import tensorrt as trt


def build_or_create_engine(onnx_path, trt_path, input_optimizations=None, max_batch_size=1, save_engine=1,
                           re_build=0):
    TRT_LOGGER = trt.Logger()
    EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    # reuse a serialized engine if one exists and a rebuild is not forced
    if os.path.exists(trt_path) and not re_build:
        with trt.Runtime(TRT_LOGGER) as runtime, open(trt_path, 'rb') as f:
            engine = runtime.deserialize_cuda_engine(f.read())
            print('deserialized trt engine from file')
            return engine
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, \
            builder.create_builder_config() as config:
        if not os.path.exists(onnx_path):
            quit('onnx file {} not found'.format(onnx_path))
        with trt.OnnxParser(network, TRT_LOGGER) as parser, open(onnx_path, 'rb') as model:
            print('parsing onnx file {}'.format(onnx_path))
            if not parser.parse(model.read()):
                print('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return
        builder.max_batch_size = max_batch_size
        config.max_workspace_size = 1 << 30  # 1 GB of builder workspace
        # one optimization profile covering the (min, opt, max) shape of every dynamic input
        if input_optimizations:
            profile = builder.create_optimization_profile()
            for s in input_optimizations:
                profile.set_shape(*s)
            config.add_optimization_profile(profile)
        print('building an engine from file {}; this may take a while...'.format(onnx_path))
        engine = builder.build_engine(network, config)
        if not engine:
            print("tensorRT engine build failed")
        else:
            print("tensorRT engine built successfully")
            if save_engine:
                with open(trt_path, "wb") as f:
                    f.write(engine.serialize())
                    print('tensorRT engine saved')
        return engine
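
A usage sketch with placeholder file paths and input tensor name; each profile entry is the (name, min, opt, max) tuple passed straight to profile.set_shape:

input_optimizations = [('input', (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224))]
engine = build_or_create_engine('model.onnx', 'model.trt',
                                input_optimizations=input_optimizations,
                                max_batch_size=8)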

trt context

import pycuda.autoinit  # noqa: F401 - creates a default CUDA context (the section-4 backend manages it manually)
import pycuda.driver as cuda
import tensorrt as trt


def allocate_buffers_and_inference(context, engine, trt_inputs: list = None):
    """
    allocate cpu and gpu buffers for inputs and outputs,
    init gpu memory bindings and the cuda stream, then run inference
    :param context: trt context
    :param engine: trt engine
    :param trt_inputs: a list of real inputs, one numpy array per input binding
    :return: real batch size, list of host output buffers
    """
    if trt_inputs and trt_inputs[0] is not None:
        # cap the batch dimension of every real input at the engine's max batch size
        real_size = min(engine.max_batch_size, len(trt_inputs[0]))
        trt_inputs = [inp[:real_size] for inp in trt_inputs]
    else:
        real_size = engine.max_batch_size
    print('batch size %d' % real_size)
    inputs = []
    outputs = []
    bindings = []
    # assumes input bindings come before output bindings (the usual case)
    for i, binding in enumerate(engine):
        print(binding, engine.get_binding_shape(binding))
        is_input = engine.binding_is_input(binding)
        # dynamic dims are -1, so take abs() and scale by the real batch size
        size = abs(trt.volume(engine.get_binding_shape(binding))) * real_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers; real inputs are used directly as host buffers
        if is_input and trt_inputs:
            host_mem = trt_inputs[i]
            shape = trt_inputs[i].shape
        else:
            host_mem = cuda.pagelocked_empty(size, dtype)
            shape = (real_size, *engine.get_binding_shape(binding)[1:])
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if is_input:
            inputs.append({'host': host_mem, 'device': device_mem, 'shape': shape})
        else:
            outputs.append({'host': host_mem, 'device': device_mem, 'shape': shape})
    stream = cuda.Stream()
    # resolve the dynamic input shapes on the context before enqueueing
    for i in range(len(inputs)):
        context.set_binding_shape(i, inputs[i]['shape'])
    [cuda.memcpy_htod_async(inp['device'], inp['host'], stream) for inp in inputs]
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    [cuda.memcpy_dtoh_async(out['host'], out['device'], stream) for out in outputs]
    stream.synchronize()
    print(outputs)
    return real_size, [out['host'] for out in outputs]

import time

with engine.create_execution_context() as context:
    t = time.time()
    batch_size, trt_outputs = allocate_buffers_and_inference(context, engine, trt_inputs=None)
    print('trt inference average %dms' % ((time.time() - t) * 1000 / batch_size))
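
The host buffers returned above are flat 1-D arrays. While the execution context from the with block is still alive, they can be reshaped with the shapes the context resolved; a small sketch, assuming the usual binding order of inputs before outputs:

    # still inside `with engine.create_execution_context() as context:`
    import numpy as np
    out_shapes = [tuple(context.get_binding_shape(i))
                  for i in range(engine.num_bindings)
                  if not engine.binding_is_input(i)]
    # buffers may be over-allocated for the max batch, so keep only the valid elements
    trt_outputs = [np.asarray(buf).ravel()[:int(np.prod(s))].reshape(s)
                   for buf, s in zip(trt_outputs, out_shapes)]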

4. PyCUDA programming and Django uWSGI deployment

The core PyCUDA calls from the inference function above: create a stream, resolve the dynamic binding shapes on the context, copy host -> device asynchronously, enqueue the TRT execution on the same stream, copy device -> host, then synchronize.

    stream = cuda.Stream()
    for i in range(len(inputs)):
        context.set_binding_shape(i, inputs[i]['shape'])
    # host -> device copies on the stream
    [cuda.memcpy_htod_async(inp['device'], inp['host'], stream) for inp in inputs]
    # enqueue trt context execution on the same stream
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # device -> host copies, then wait for the stream to finish
    [cuda.memcpy_dtoh_async(out['host'], out['device'], stream) for out in outputs]
    stream.synchronize()

CUDA also has the concept of a context; it owns GPU computation and the data transfers between GPU and CPU.

Creating a new CUDA context is fairly expensive, so in a backend service the context is created once and then deactivated and activated with the pop and push methods:

import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()  # creates the context and pushes it onto this thread
# deactivate and re-activate it with pop and push
ctx.pop()   # detach the context from the current thread
ctx.push()  # make it current again before issuing CUDA work
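
A minimal sketch of how this fits into a Django view served by uWSGI: each worker process creates one CUDA context and one engine at startup, then pushes and pops the context around every inference call. The view name, paths, and lock strategy are hypothetical and only illustrate the idea:

import threading
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()            # one context per worker process, created once
engine = build_or_create_engine('model.onnx', 'model.trt')  # hypothetical paths
ctx.pop()                                      # keep it inactive between requests
lock = threading.Lock()                        # serialize requests that share the context

def infer_view(request):                       # hypothetical Django view
    with lock:
        ctx.push()                             # activate the context for this thread
        try:
            with engine.create_execution_context() as context:
                _, outs = allocate_buffers_and_inference(context, engine, trt_inputs=None)
        finally:
            ctx.pop()                          # always deactivate, even on errors
    return outs                                # wrap in an HttpResponse in real code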

5. Conclusion

In my tests, TensorRT inference cost 20 ms against 30 ms for the plain GPU path, a 50% speedup. On data-center cards such as the V100 or A100 the improvement would be even larger.

