TVM Target Codegen

Table of Contents

1. TVM Target Codegen

Target Codegen 分为以下几类:

  1. 基于 LLVM

    包括 llvm, hexagon, arm, x86 等

  2. 基于源码 (source)

    这类 codegen 会输入源码

    包括:c, opencl, cuda, metal 等

Target Codegen 与 BYOC Codegen 有些相似的地方, 以 opencl target 为例:

  1. 通过 target.build.opencl 找到 opencl codegen
  2. 编译生成的 symbol 和 opencl kernel 同样通过 SaveToBinary 保存在 elf 中
  3. 运行时也是通过 GetFunction dispatch 到 opencl_module, 后者会通过 opencl 调用对应的 kernel

1.1. Opencl Codegen Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# 2021-08-03 11:11
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 1000), dtype="float32")
y = relay.add(x, x)
func = relay.Function([x], y)
mod = tvm.IRModule.from_expr(func)
print(mod)

with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build(
        mod, target="opencl", params=None, target_host="llvm"
    )

print(graph)
print(lib.imported_modules[0].get_source())
lib.export_library("/tmp/a.elf")
def @main(%x: Tensor[(1, 1000), float32]) {
  add(%x, %x)
}

{
  "nodes": [
    {
      "op": "null",
      "name": "x",
      "inputs": []
    },
    {
      "op": "tvm_op",
      "name": "tvmgen_default_fused_add",
      "attrs": {
        "num_inputs": "1",
        "num_outputs": "1",
        "hash": "e28096c8374f91b0",
        "flatten_data": "0",
        "func_name": "tvmgen_default_fused_add"
      },
      "inputs": [
        [
          0,
          0,
          0
        ]
      ]
    }
  ],
  "arg_nodes": [0],
  "heads": [
    [
      1,
      0,
      0
    ]
  ],
  "attrs": {
    "dltype": [
      "list_str",
      [
        "float32",
        "float32"
      ]
    ],
    "shape": [
      "list_shape",
      [
        [1, 1000],
        [1, 1000]
      ]
    ],
    "storage_id": [
      "list_int",
      [0, 1]
    ]
  },
  "node_row_ptr": [0, 1, 2]
}
// Function: tvmgen_default_fused_add_kernel0
__kernel void tvmgen_default_fused_add_kernel0(__global float* restrict T_add, __global float* restrict placeholder) {
  if (((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))) < 1000) {
    T_add[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] = (placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] + placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))]);
  }
}
readelf -p .rodata /tmp/a.elf
objdump -d /tmp/a.elf
String dump of section '.rodata':
...

  [ ab7] // Function: tvmgen_default_fused_add_kernel0^J__kernel void
tvmgen_default_fused_add_kernel0(__global float* restrict T_add, __global float*
restrict placeholder) {^J if (((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))) < 1000) {^J T_add[(((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))))] = (placeholder[(((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))))] + placeholder[(((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))))]);^J }^J}^J^J^L ...


/tmp/a.elf:     file format elf64-x86-64


...
0000000000001100 <tvmgen_default_fused_add>:
    1100:	41 57                	push   %r15
    1102:	41 56                	push   %r14
    1104:	53                   	push   %rbx
    1105:	48 83 ec 50          	sub    $0x50,%rsp
    1109:	83 fa 02             	cmp    $0x2,%edx
    110c:	0f 85 ce 01 00 00    	jne    12e0 <tvmgen_default_fused_add+0x1e0>
...
    12aa:	e8 91 01 00 00       	callq  1440 <tvmgen_default_fused_add_compute_>
...

0000000000001440 <tvmgen_default_fused_add_compute_>:
...
    1466:	48 8b 3d e3 2b 00 00 	mov    0x2be3(%rip),%rdi        # 4050 <.tvm_func.tvmgen_default_fused_add_kernel0>
    146d:	48 85 ff             	test   %rdi,%rdi
    1470:	74 24                	je     1496 <tvmgen_default_fused_add_compute_+0x56>
    1472:	48 8b 05 67 2b 00 00 	mov    0x2b67(%rip),%rax        # 3fe0 <__TVMFuncCall-0x50>
...
    14b1:	ff 10                	callq  *(%rax)
    14b3:	85 c0                	test   %eax,%eax
    14b5:	75 d7                	jne    148e <tvmgen_default_fused_add_compute_+0x4e>
    14b7:	48 8b 3c 24          	mov    (%rsp),%rdi
    14bb:	48 89 3d 8e 2b 00 00 	mov    %rdi,0x2b8e(%rip)        # 4050 <.tvm_func.tvmgen_default_fused_add_kernel0>
    14c2:	eb ae                	jmp    1472 <tvmgen_default_fused_add_compute_+0x32>

从上面输出的信息可以得到以下结论:

  1. llvm 生成了 host 端的 tvmgen_default_fused_add 和 tvmgen_default_fused_add_compute_
  2. opencl runtime 会通过 tvmgen_default_fused_add_kernel0 这个 symbol 执行 opencl kernel
  3. tvmgen_default_fused_add_compute_ 通过 __TVMFuncCall 来调用 tvmgen_default_fused_add_kernel0, 后者由 opencl runtime 来响应, 最终调用到 opencl kernel

1.2. Target Codegen Impl

1.2.1. runtime

target runtime 的逻辑 与 BYOC Runtime 完全相同:

  1. 根据 elf 中的 __tvm_dev_mblob 加载 runtime.module.loadbinarytkey, 由后者生成对应的 runtime (或者叫 ModuleNode)

    tvm/src/runtime/opencl/opencl_module.cc::TVM_REGISTER_GLOBAL("runtime.module.loadbinary_opencl").set_body_typed(OpenCLModuleLoadBinary);

  2. runtime 的 GetFunction 负责真正的调用, 以 opencl 为例, 负责通过 opencl 启动对应的 kernel

    tvm/src/runtime/opencl/opencl_module.cc::PackedFunc OpenCLModuleNode::GetFunction(const std::string& name,

1.2.2. Codegen

codegen 与 Relay Codegen 相比更简化一些, 因为它不需要 annotate. 其它的步骤比如

  • 注册 codegen

    tvm/src/target/source/codegen_opencl.cc::TVM_REGISTER_GLOBAL("target.build.opencl").set_body_typed(BuildOpenCL);

  • 编译

    tvm/src/target/codegen.cc::runtime::Module Build(IRModule mod, Target target) {

  • SaveToBinary

与 Relay Codegen 基本类似

1.2.3. What about the __TVMFuncCall

以 opencl 为例, 除了通过 opencl codegen 生成了一些 symbol (例如 tvmgen_default_fused_add_kernel0) 外, 还有一些 host 端代码通过 __TVMFuncCall 来调用这些 symbol.

1.2.3.1. __TVMFuncCall 在哪儿
readelf -a /tmp/a.elf|grep __TVMFuncCall -B 20
objdump -D /tmp/a.elf|grep 4030 -B 20

 0x0000000000000005 (STRTAB)             0x3a8
 0x0000000000000006 (SYMTAB)             0x2a0
 0x000000000000000a (STRSZ)              205 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x4000
 0x0000000000000007 (RELA)               0x478
 0x0000000000000008 (RELASZ)             264 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

Relocation section '.rela.dyn' at offset 0x478 contains 11 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003e60  000000000008 R_X86_64_RELATIVE                    10f0
000000003e68  000000000008 R_X86_64_RELATIVE                    10b0
000000004018  000000000008 R_X86_64_RELATIVE                    4018
000000003fc0  000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0
000000003fc8  000600000006 R_X86_64_GLOB_DAT 0000000000004028 __tvm_module_ctx + 0
000000003fd0  000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCloneTa + 0
000000003fd8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTMClone + 0
000000003fe0  000a00000006 R_X86_64_GLOB_DAT 0000000000004030 __TVMFuncCall + 0
000000003fe8  000700000006 R_X86_64_GLOB_DAT 0000000000004040 __TVMAPISetLastError + 0
000000003ff0  000900000006 R_X86_64_GLOB_DAT 0000000000004038 __TVMBackendGetFuncFro + 0
000000003ff8  000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

Symbol table '.dynsym' contains 11 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __cxa_finalize
     2: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     3: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
     4: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     5: 0000000000001100   822 FUNC    GLOBAL DEFAULT    9 tvmgen_default_fused_add
     6: 0000000000004028     8 OBJECT  WEAK   DEFAULT   20 __tvm_module_ctx
     7: 0000000000004040     8 OBJECT  WEAK   DEFAULT   20 __TVMAPISetLastError
     8: 0000000000002a06   715 OBJECT  GLOBAL DEFAULT   11 __tvm_dev_mblob
     9: 0000000000004038     8 OBJECT  WEAK   DEFAULT   20 __TVMBackendGetFuncFromEn
    10: 0000000000004030     8 OBJECT  WEAK   DEFAULT   20 __TVMFuncCall
--
    37: 0000000000004048     8 OBJECT  LOCAL  DEFAULT   20 .tvm_func.__tvm_set_devic
    38: 0000000000004050     8 OBJECT  LOCAL  DEFAULT   20 .tvm_func.tvmgen_default_
    39: 0000000000001440   132 FUNC    LOCAL  DEFAULT    9 tvmgen_default_fused_add_
    40: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS devc
    41: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS crtstuff.c
    42: 0000000000002db8     0 OBJECT  LOCAL  DEFAULT   13 __FRAME_END__
    43: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS
    44: 0000000000003e70     0 OBJECT  LOCAL  DEFAULT   16 _DYNAMIC
    45: 0000000000004020     0 OBJECT  LOCAL  DEFAULT   19 __TMC_END__
    46: 0000000000004018     0 OBJECT  LOCAL  DEFAULT   19 __dso_handle
    47: 0000000000001000     0 FUNC    LOCAL  DEFAULT    6 _init
    48: 0000000000002cd4     0 NOTYPE  LOCAL  DEFAULT   12 __GNU_EH_FRAME_HDR
    49: 00000000000014c4     0 FUNC    LOCAL  DEFAULT   10 _fini
    50: 0000000000004000     0 OBJECT  LOCAL  DEFAULT   18 _GLOBAL_OFFSET_TABLE_
    51: 0000000000001100   822 FUNC    GLOBAL DEFAULT    9 tvmgen_default_fused_add
    52: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __cxa_finalize
    53: 0000000000004028     8 OBJECT  WEAK   DEFAULT   20 __tvm_module_ctx
    54: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
    55: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
    56: 0000000000002a06   715 OBJECT  GLOBAL DEFAULT   11 __tvm_dev_mblob
    57: 0000000000004030     8 OBJECT  WEAK   DEFAULT   20 __TVMFuncCall
0000000000004000 <_GLOBAL_OFFSET_TABLE_>:
    4000:	70 3e                	jo     4040 <__TVMAPISetLastError>
    ...

Disassembly of section .data:

0000000000004018 <__dso_handle>:
    4018:	18 40 00             	sbb    %al,0x0(%rax)
    401b:	00 00                	add    %al,(%rax)
    401d:	00 00                	add    %al,(%rax)
    ...

Disassembly of section .bss:

0000000000004020 <completed.8060>:
    ...

0000000000004028 <__tvm_module_ctx>:
    ...

0000000000004030 <__TVMFuncCall>:

可见 __TVMFuncCall 是一个 dynsym (类似于 libc 中的 print), 调用这个符号时最终会用 .bss 中 0x4030 处的值做为函数的地址.

实现这一目标分为两步:

  1. llvm 初始化一个名为 __TVMFuncCall 的全局变量, 并生成了对这个全局变量对应的地址的调用

    tvm/src/target/llvm/codegen_cpu.cc::void CodeGenCPU::InitGlobalContext(bool dynamic_lookup) {

    gv_tvm_func_call_ = InitContextPtr(ftype_tvm_func_call_->getPointerTo(), "__TVMFuncCall");
      llvm::GlobalVariable* gv = new llvm::GlobalVariable(
        *module_, p_type, false, llvm::GlobalValue::LinkOnceAnyLinkage, nullptr, name);
      gv->setInitializer(llvm::Constant::getNullValue(p_type));
    
    
    llvm::BasicBlock* CodeGenCPU::MakeCallPacked(const Array<PrimExpr>& args, llvm::Value** rvalue,
                                                 llvm::Value** ret_tcode, const DataType& r_type,
                                                 const int64_t begin, const int64_t end) {
      auto call_callee = RuntimeTVMFuncCall();
      builder_->CreateCall(
          call_callee, {handle, arg_value, arg_tcode, ConstInt32(nargs), ret_value, *ret_tcode})
      ...
    
  2. tvm runtime 会负责在运行时用真正的 TVMFuncCall 来初始化 __TVMFuncCall

    tvm/src/runtime/library_module.cc::void InitContextFunctions(std::function<void*(const char*)> fgetsymbol) {

    TVM_INIT_CONTEXT_FUNC(TVMFuncCall);
      // fgetsymbol 在这里实际就是 dlsym
      if (auto* fp = reinterpret_cast<decltype(&FuncName)*>(fgetsymbol("__" #FuncName))) { \
        *fp = FuncName;                                                                    \
      }
    

Author: [email protected]
Date: 2021-08-04 Wed 00:00
Last updated: 2023-12-01 Fri 18:28

知识共享许可协议