TVM Compiler中文教程：TVM使用autotvm调优NVIDIA GPU上的高性能卷积

川长思鸟来 2022-01-16 06:55 691阅读 0赞

### 文章目录 ###

*  TVM使用autotvm调优NVIDIA GPU上的高性能卷积
 *   *  加载依赖库
     *  步骤一：定义搜索空间
     *  步骤二：搜索空间

# TVM使用autotvm调优NVIDIA GPU上的高性能卷积 #

这是为NVIDIA GPU编写高性能可调模板的高级教程。通过在此模板上运行自动调优器，在许多情况下，我们可以胜过供应商提供的CuDNN库。

## 加载依赖库 ##

安装步骤与[TVM Compiler中文教程：使用TVM编写可调模板和使用自动调优器][TVM Compiler_TVM]中步骤一样，安装相应的软件包即可。

python代码中导入包：

import logging
    import sys
    import numpy as np
    import tvm
    import topi
    from topi.testing import conv2d_nchw_python
    
    from tvm import autotvm

## 步骤一：定义搜索空间 ##

tvm中有很多有用的调度原语。还可以找到一些更详细描述它们的教程，例如（1）[TVM Compiler中文教程：TVM如何生成优化GPU卷积][TVM Compiler_TVM_GPU]（2）[NVIDIA GPU上优化DepthwiseConv][NVIDIA GPU_DepthwiseConv]

但是，它们的实现是针对某些特殊输入形状手动调整的。在本节中，我们构建了足够大的空间来涵盖这些教程中使用的技术。然后，我们依靠高效的自动调优器来搜索这个空间，并选择一些好的配置。

如果你熟悉编写cuda调度，可以发现以下模板非常通用。实际上，这个模板可以很容易地修改，来调优其他算子，如深度卷积和通用矩阵乘法。为了完全理解此模板，您应该熟悉调度原语和自动调优API。你可以参考上面的教程和[autotvm教程][autotvm]。

值得注意的是，conv2d算子的搜索空间可能非常大（某些输入形状的级别为10 ^ 9）。

@autotvm.template
    def conv2d_no_batching(N,H,W,CO,CI,KH,KW,stride,padding):
        assert N == 1, "Only consider batch_size = 1 in this template"
        
        #定义计算
        data = tvm.placeholder((N,CI,H,W), name='data')
        kernel = tvm.placeholder((CO,CI,KH,KW), name='kernel')
        conv = topi.nn.conv2d_nchw(data,kernel,stride,padding,dilation=1,out_dtype='float32')
        s = tvm.create_schedule([conv.op])
        
        #定义搜索空间
        n, f, y, x = s[conv].op.axis #获取轴
        rc, ry, rx = s[conv].op.reduce_axis#获取reduce轴
        
        cfg = autotvm.get_config()
        cfg.define_split("tile_f", f, num_outputs=4)
        cfg.define_split("tile_y", y, num_outputs=4)
        cfg.define_split("tile_x", x, num_outputs=4)
        cfg.define_split("tile_rc", rc, num_outputs=3)
        cfg.define_split("tile_ry", ry, num_outputs=3)
        cfg.define_split("tile_rx", rx, num_outputs=3)
        cfg.define_knob("auto_unroll_max_step", [0, 512, 1500])
        cfg.define_knob("unroll_explicit",[0, 1])
        
        #内联padding
        pad_data = s[conv].op.input_tensors[0]
        s[pad_data].compute_inline()
        data, raw_data = pad_data, data
        
        output = conv
        OL = s.cache_write(conv, 'local')
        
        #创建cache，输入数据用shared内存，然后在放入线程的local cache
        AA = s.cache_read(data, 'shared', [OL])
        WW = s.cache_read(kernel, 'shared', [OL])
        AL = s.cache_read(AA, 'local', [OL])
        WL = s.cache_read(WW, 'local', [OL])
        
        #平铺并绑定空间轴
        n, f, y, x = s[output].op.axis
        bf, vf, tf, fi = cfg["tile_f"].apply(s, output, f)
        by, vy, ty, yi = cfg["tile_y"].apply(s, output, y)
        bx, vx, tx, xi = cfg["tile_x"].apply(s, output, x)
        kernel_scope = n #内核函数中附加全局配置cfg的范围
        
        #观察最外层、中间层、次内层循环分别对应blockIdx，vthread，threadIdx
            s[output].bind(bf, tvm.thread_axis("blockIdx.z"))
        s[output].bind(by, tvm.thread_axis("blockIdx.y"))
        s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
        s[output].bind(vf, tvm.thread_axis("vthread"))
        s[output].bind(vy, tvm.thread_axis("vthread"))
        s[output].bind(vx, tvm.thread_axis("vthread"))
        s[output].bind(tf, tvm.thread_axis("threadIdx.z"))
        s[output].bind(ty, tvm.thread_axis("threadIdx.y"))
        s[output].bind(tx, tvm.thread_axis("threadIdx.x"))
        s[output].reorder(n, bf, by, bx, vf, vy, vx, tf, ty, tx, fi, yi, xi)
        s[OL].compute_at(s[output], tx)
        
        #平铺缩减轴
        n, f, y, x = s[OL].op.axis
        rc, ry, rx = s[OL].op.reduce_axis
        rco, rcm, rci = cfg['tile_rc'].apply(s, OL, rc)
        ryo, rym, ryi = cfg['tile_rx'].apply(s, OL, ry)
        rxo, rxm, rxi = cfg['tile_ry'].apply(s, OL, rx)
        s[OL].reorder(rco, ryo, rxo, rcm, rym, rxm, rci, ryi, rxi, n, f, y, x)
        
        s[AA].compute_at(s[OL], rxo)
        s[WW].compute_at(s[OL], rxo)
        s[AL].compute_at(s[OL], rxm)
        s[WL].compute_at(s[OL], rxm)
        
        #并行取数据
        for load in [AA, WW]:
            n, f, y, x = s[load].op.axis
            fused = s[load].fuse(n, f, y, x)
            tz, fused = s[load].split(fused, nparts=cfg["tile_f"].size[2])
            ty, fused = s[load].split(fused, nparts=cfg["tile_y"].size[2])
            tx, fused = s[load].split(fused, nparts=cfg["tile_x"].size[2])
            s[load].bind(tz, tvm.thread_axis("threadIdx.z"))
            s[load].bind(ty, tvm.thread_axis("threadIdx.y"))
            s[load].bind(tx, tvm.thread_axis("threadIdx.x"))
    	
        #调优展开unroll
        s[output].pragma(kernel_scope, 'auto_unroll_max_step', cfg['auto_unroll_max_step'].val)
        s[output].pragma(kernel_scope, 'unroll_explicit', cfg['unroll_explicit'].val)

## 步骤二：搜索空间 ##

我们选择resnet的最后一层作为测试用例。由于我们的空间非常大，XGBoostTuner最适合我们的情况。在这里，我们只进行了20次试验。在实践中，进行1000次试验通常可以为此模板找到一些好的内核。

# logging config (for printing tuning log to screen)
    logging.getLogger('autotvm').setLevel(logging.DEBUG)
    logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))
    
    # the last layer in resnet
    N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
    task = autotvm.task.create(conv2d_no_batching,
                               args=(N, H, W, CO, CI, KH, KW, strides, padding),
                               target='cuda')
    print(task.config_space)
    
    # Use local gpu, measure 10 times for every config to reduce variance
    # The timeout of compiling a program is 10 seconds, the timeout for running is 4 seconds
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=100, timeout=4)
    )
    
    # Begin tuning, log records to file `conv2d.log`
    # During tuning we will also try many invalid configs, so you are expected to
    # see many error reports. As long as you can see non-zero GFLOPS, it is okay.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=20,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('conv2d.log')])

输出：

ConfigSpace (len=10454400, space_map=
       0 tile_f: Split(policy=all, product=512, num_outputs=4) len=220
       1 tile_y: Split(policy=all, product=7, num_outputs=4) len=4
       2 tile_x: Split(policy=all, product=7, num_outputs=4) len=4
       3 tile_rc: Split(policy=all, product=512, num_outputs=3) len=55
       4 tile_ry: Split(policy=all, product=3, num_outputs=3) len=3
       5 tile_rx: Split(policy=all, product=3, num_outputs=3) len=3
       6 auto_unroll_max_step: OtherOption([0, 512, 1500]) len=3
       7 unroll_explicit: OtherOption([0, 1]) len=2
    )
    Get devices for measurement successfully!
    No: 1   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (3) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (2) /workspace/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::RPCModuleNode::WrapRemote(void*)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x3b) [0x7fb247697fbb]\n  [bt] (1) /workspace/build/libtvm.so(tvm::runtime::RPCSession::CallFunc(void*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, tvm::runtime::PackedFunc const*)+0x154) [0x7fb247686814]\n  [bt] (0) /workspace/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fb246e41432]\n  File "/workspace/src/runtime/rpc/rpc_session.cc", line 962\nTVMError: Check failed: code == RPCCode: :kReturn: code=4',),), error_no=4, all_cost=12.47830605506897, timestamp=1560554988.6336548)        [('tile_f', [1, 128, 2, 2]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 7, 1, 1]), ('tile_rc', [256, 2, 1]), ('tile_ry', [1, 3, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],,None,1359891
    No: 2   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.18091177940368652, timestamp=1560554980.3693938) [('tile_f', [8, 2, 32, 1]), ('tile_y', [1, 1, 1, 7]), ('tile_x', [1, 1, 7, 1]), ('tile_rc', [4, 2, 64]), ('tile_ry', [3, 1, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,8295581
    No: 3   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.3263852596282959, timestamp=1560554980.369509)   [('tile_f', [16, 1, 2, 16]), ('tile_y', [1, 1, 7, 1]), ('tile_x', [1, 1, 7, 1]), ('tile_rc', [1, 128, 4]), ('tile_ry', [1, 3, 1]), ('tile_rx', [1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,7837890
    No: 4   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.10707521438598633, timestamp=1560554980.3695998) [('tile_f', [4, 128, 1, 1]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [7, 1, 1, 1]), ('tile_rc', [2, 64, 4]), ('tile_ry', [1, 1, 3]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,6864007
    No: 5   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.10010313987731934, timestamp=1560554990.5786412) [('tile_f', [8, 16, 1, 4]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 1, 1, 7]), ('tile_rc', [16, 2, 16]), ('tile_ry', [1, 1, 3]), ('tile_rx', [1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,8063764
    No: 6   GFLOPS: 91.91/91.91     result: MeasureResult(costs=(0.0025187342295081966,), error_no=0, all_cost=3.649610757827759, timestamp=1560554994.2921977)     [('tile_f', [8, 1, 64, 1]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 7, 1, 1]), ('tile_rc', [64, 2, 4]), ('tile_ry', [1, 1, 3]), ('tile_rx', [1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,9751545
    No: 7   GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.1447432041168213, timestamp=1560554993.0343776)  [('tile_f', [4, 32, 1, 4]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [1, 1, 1, 7]), ('tile_rc', [8, 2, 32]), ('tile_ry', [1, 3, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,8471865
    No: 8   GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.12843632698059082, timestamp=1560554993.0344782) [('tile_f', [4, 4, 32, 1]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 7, 1, 1]), ('tile_rc', [2, 32, 8]), ('tile_ry', [1, 1, 3]), ('tile_rx', [3, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],,None,500982
    No: 9   GFLOPS: 1.53/91.91      result: MeasureResult(costs=(0.15157555475,), error_no=0, all_cost=8.876856088638306, timestamp=1560555008.2197192)     [('tile_f', [8, 32, 1, 2]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [1, 1, 7, 1]), ('tile_rc', [16, 2, 16]), ('tile_ry', [1, 3, 1]), ('tile_rx', [3, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],,None,2061020
    No: 10  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1560555005.6728013)   [('tile_f', [16, 1, 8, 4]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 1, 1, 7]), ('tile_rc', [16, 16, 2]), ('tile_ry', [1, 3, 1]), ('tile_rx', [1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,9538661
    No: 11  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.1292111873626709, timestamp=1560555005.673)      [('tile_f', [64, 2, 4, 1]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 1, 1, 7]), ('tile_rc', [1, 128, 4]), ('tile_ry', [3, 1, 1]), ('tile_rx', [3, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,5321600
    No: 12  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.10695314407348633, timestamp=1560555005.6730995) [('tile_f', [2, 16, 4, 4]), ('tile_y', [1, 1, 1, 7]), ('tile_x', [7, 1, 1, 1]), ('tile_rc', [16, 4, 8]), ('tile_ry', [3, 1, 1]), ('tile_rx', [1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],,None,4168459
    No: 13  GFLOPS: 2.44/91.91      result: MeasureResult(costs=(0.09480758825,), error_no=0, all_cost=5.516159534454346, timestamp=1560555015.663064)      [('tile_f', [16, 8, 1, 4]), ('tile_y', [1, 1, 1, 7]), ('tile_x', [7, 1, 1, 1]), ('tile_rc', [32, 8, 2]), ('tile_ry', [1, 1, 3]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,6822523
    No: 14  GFLOPS: 9.88/91.91      result: MeasureResult(costs=(0.023441233833333335,), error_no=0, all_cost=4.780797004699707, timestamp=1560555016.906882)       [('tile_f', [4, 16, 2, 4]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [7, 1, 1, 1]), ('tile_rc', [128, 4, 1]), ('tile_ry', [3, 1, 1]), ('tile_rx', [3, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],,None,1749552
    No: 15  GFLOPS: 6.26/91.91      result: MeasureResult(costs=(0.03700436825,), error_no=0, all_cost=3.7564845085144043, timestamp=1560555018.090892)     [('tile_f', [64, 1, 1, 8]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [1, 7, 1, 1]), ('tile_rc', [128, 2, 2]), ('tile_ry', [1, 3, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,6622136
    No: 16  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.138352632522583, timestamp=1560555013.9409487)   [('tile_f', [1, 4, 128, 1]), ('tile_y', [1, 1, 7, 1]), ('tile_x', [1, 1, 1, 7]), ('tile_rc', [32, 1, 16]), ('tile_ry', [1, 1, 3]), ('tile_rx', [1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],,None,2833211
    No: 17  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.10079503059387207, timestamp=1560555019.544184)  [('tile_f', [16, 16, 1, 2]), ('tile_y', [1, 1, 7, 1]), ('tile_x', [7, 1, 1, 1]), ('tile_rc', [1, 2, 256]), ('tile_ry', [3, 1, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],,None,1348659
    No: 18  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.12068891525268555, timestamp=1560555019.7290518) [('tile_f', [32, 2, 1, 8]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 1, 1, 7]), ('tile_rc', [1, 128, 4]), ('tile_ry', [3, 1, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],,None,4740917
    No: 19  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (0) /workspace/build/libtvm.so(+0xc3bb1b) [0x7fb247666b1b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 596, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.21637392044067383, timestamp=1560555019.8936267) [('tile_f', [4, 8, 2, 8]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 7, 1, 1]), ('tile_rc', [4, 128, 1]), ('tile_ry', [1, 1, 3]), ('tile_rx', [3, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,7382686
    No: 20  GFLOPS: 0.00/91.91      result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (3) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fb24766b8f1]\n  [bt] (2) /workspace/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::RPCModuleNode::WrapRemote(void*)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x3b) [0x7fb247697fbb]\n  [bt] (1) /workspace/build/libtvm.so(tvm::runtime::RPCSession::CallFunc(void*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, tvm::runtime::PackedFunc const*)+0x154) [0x7fb247686814]\n  [bt] (0) /workspace/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fb246e41432]\n  File "/workspace/src/runtime/rpc/rpc_session.cc", line 962\nTVMError: Check failed: code == RPCCode: :kReturn: code=4',),), error_no=4, all_cost=6.365118026733398, timestamp=1560555027.0861742)        [('tile_f', [512, 1, 1, 1]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [1, 1, 1, 7]), ('tile_rc', [1, 8, 64]), ('tile_ry', [1, 1, 3]), ('tile_rx', [3, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,7528400

最后，我们可以从log文件中检查最佳配置，检查正确性并测量运行时间。

# inspect the best config
    dispatch_context = autotvm.apply_history_best("conv2d.log")
    best_config = dispatch_context.query(task.target, task.workload)
    print("\nBest config:")
    print(best_config)
    
    # apply history best from log file
    with autotvm.apply_history_best('conv2d.log'):
        with tvm.target.create("cuda"):
            s, arg_bufs = conv2d_no_batching(N, H, W, CO, CI, KH, KW, strides, padding)
            func = tvm.build(s, arg_bufs)
    
    # check correctness
    a_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
    w_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
    c_np = conv2d_nchw_python(a_np, w_np, strides, padding)
    
    ctx = tvm.gpu()
    a_tvm = tvm.nd.array(a_np, ctx=ctx)
    w_tvm = tvm.nd.array(w_np, ctx=ctx)
    c_tvm = tvm.nd.empty(c_np.shape, ctx=ctx)
    func(a_tvm, w_tvm, c_tvm)
    
    tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-2)
    
    # Evaluate running time. Here we choose a large repeat number (400) to reduce the noise
    # and the overhead of kernel launch. You can also use nvprof to validate the result.
    evaluator = func.time_evaluator(func.entry_name, ctx, number=400)
    print('Time cost of this operator: %f' % evaluator(a_tvm, w_tvm, c_tvm).mean)

输出：

Best config:
    [('tile_f', [8, 1, 64, 1]), ('tile_y', [1, 7, 1, 1]), ('tile_x', [1, 7, 1, 1]), ('tile_rc', [64, 2, 4]), ('tile_ry', [1, 1, 3]), ('tile_rx', [1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,9751545
    Time cost of this operator: 0.002476

[TVM Compiler_TVM]: https://blog.csdn.net/hw5226349/article/details/92019491
[TVM Compiler_TVM_GPU]: https://blog.csdn.net/hw5226349/article/details/91647225
[NVIDIA GPU_DepthwiseConv]: https://tvm.ai/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
[autotvm]: https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html