TVM Target Codegen
Table of Contents
1. TVM Target Codegen
Target Codegen 分为以下几类:
基于 LLVM
包括 llvm, hexagon, arm, x86 等
基于源码 (source)
这类 codegen 会输入源码
包括:c, opencl, cuda, metal 等
Target Codegen 与 BYOC Codegen 有些相似的地方, 以 opencl target 为例:
- 通过 target.build.opencl 找到 opencl codegen
- 编译生成的 symbol 和 opencl kernel 同样通过 SaveToBinary 保存在 elf 中
- 运行时也是通过 GetFunction dispatch 到 opencl_module, 后者会通过 opencl 调用对应的 kernel
1.1. Opencl Codegen Example
#!/usr/bin/env python3 # -*- coding: utf-8 -*- # 2021-08-03 11:11 import tvm from tvm import relay x = relay.var("x", shape=(1, 1000), dtype="float32") y = relay.add(x, x) func = relay.Function([x], y) mod = tvm.IRModule.from_expr(func) print(mod) with tvm.transform.PassContext(opt_level=3): graph, lib, params = relay.build( mod, target="opencl", params=None, target_host="llvm" ) print(graph) print(lib.imported_modules[0].get_source()) lib.export_library("/tmp/a.elf")
def @main(%x: Tensor[(1, 1000), float32]) { add(%x, %x) } { "nodes": [ { "op": "null", "name": "x", "inputs": [] }, { "op": "tvm_op", "name": "tvmgen_default_fused_add", "attrs": { "num_inputs": "1", "num_outputs": "1", "hash": "e28096c8374f91b0", "flatten_data": "0", "func_name": "tvmgen_default_fused_add" }, "inputs": [ [ 0, 0, 0 ] ] } ], "arg_nodes": [0], "heads": [ [ 1, 0, 0 ] ], "attrs": { "dltype": [ "list_str", [ "float32", "float32" ] ], "shape": [ "list_shape", [ [1, 1000], [1, 1000] ] ], "storage_id": [ "list_int", [0, 1] ] }, "node_row_ptr": [0, 1, 2] } // Function: tvmgen_default_fused_add_kernel0 __kernel void tvmgen_default_fused_add_kernel0(__global float* restrict T_add, __global float* restrict placeholder) { if (((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))) < 1000) { T_add[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] = (placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] + placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))]); } }
readelf -p .rodata /tmp/a.elf objdump -d /tmp/a.elf
String dump of section '.rodata': ... [ ab7] // Function: tvmgen_default_fused_add_kernel0^J__kernel void tvmgen_default_fused_add_kernel0(__global float* restrict T_add, __global float* restrict placeholder) {^J if (((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))) < 1000) {^J T_add[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] = (placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] + placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))]);^J }^J}^J^J^L ... /tmp/a.elf: file format elf64-x86-64 ... 0000000000001100 <tvmgen_default_fused_add>: 1100: 41 57 push %r15 1102: 41 56 push %r14 1104: 53 push %rbx 1105: 48 83 ec 50 sub $0x50,%rsp 1109: 83 fa 02 cmp $0x2,%edx 110c: 0f 85 ce 01 00 00 jne 12e0 <tvmgen_default_fused_add+0x1e0> ... 12aa: e8 91 01 00 00 callq 1440 <tvmgen_default_fused_add_compute_> ... 0000000000001440 <tvmgen_default_fused_add_compute_>: ... 1466: 48 8b 3d e3 2b 00 00 mov 0x2be3(%rip),%rdi # 4050 <.tvm_func.tvmgen_default_fused_add_kernel0> 146d: 48 85 ff test %rdi,%rdi 1470: 74 24 je 1496 <tvmgen_default_fused_add_compute_+0x56> 1472: 48 8b 05 67 2b 00 00 mov 0x2b67(%rip),%rax # 3fe0 <__TVMFuncCall-0x50> ... 14b1: ff 10 callq *(%rax) 14b3: 85 c0 test %eax,%eax 14b5: 75 d7 jne 148e <tvmgen_default_fused_add_compute_+0x4e> 14b7: 48 8b 3c 24 mov (%rsp),%rdi 14bb: 48 89 3d 8e 2b 00 00 mov %rdi,0x2b8e(%rip) # 4050 <.tvm_func.tvmgen_default_fused_add_kernel0> 14c2: eb ae jmp 1472 <tvmgen_default_fused_add_compute_+0x32>
从上面输出的信息可以得到以下结论:
- llvm 生成了 host 端的 tvmgen_default_fused_add 和 tvmgen_default_fused_add_compute_
- opencl runtime 会通过 tvmgen_default_fused_add_kernel0 这个 symbol 执行 opencl kernel
- tvmgen_default_fused_add_compute_ 通过 __TVMFuncCall 来调用 tvmgen_default_fused_add_kernel0, 后者由 opencl runtime 来响应, 最终调用到 opencl kernel
1.2. Target Codegen Impl
1.2.1. runtime
target runtime 的逻辑 与 BYOC Runtime 完全相同:
根据 elf 中的 __tvm_dev_mblob 加载 runtime.module.loadbinarytkey, 由后者生成对应的 runtime (或者叫 ModuleNode)
tvm/src/runtime/opencl/opencl_module.cc::TVM_REGISTER_GLOBAL("runtime.module.loadbinary_opencl").set_body_typed(OpenCLModuleLoadBinary);
runtime 的 GetFunction 负责真正的调用, 以 opencl 为例, 负责通过 opencl 启动对应的 kernel
tvm/src/runtime/opencl/opencl_module.cc::PackedFunc OpenCLModuleNode::GetFunction(const std::string& name,
1.2.2. Codegen
codegen 与 Relay Codegen 相比更简化一些, 因为它不需要 annotate. 其它的步骤比如
注册 codegen
tvm/src/target/source/codegen_opencl.cc::TVM_REGISTER_GLOBAL("target.build.opencl").set_body_typed(BuildOpenCL);
编译
tvm/src/target/codegen.cc::runtime::Module Build(IRModule mod, Target target) {
- SaveToBinary
与 Relay Codegen 基本类似
1.2.3. What about the __TVMFuncCall
以 opencl 为例, 除了通过 opencl codegen 生成了一些 symbol (例如 tvmgen_default_fused_add_kernel0) 外, 还有一些 host 端代码通过 __TVMFuncCall 来调用这些 symbol.
1.2.3.1. __TVMFuncCall 在哪儿
readelf -a /tmp/a.elf|grep __TVMFuncCall -B 20 objdump -D /tmp/a.elf|grep 4030 -B 20
0x0000000000000005 (STRTAB) 0x3a8 0x0000000000000006 (SYMTAB) 0x2a0 0x000000000000000a (STRSZ) 205 (bytes) 0x000000000000000b (SYMENT) 24 (bytes) 0x0000000000000003 (PLTGOT) 0x4000 0x0000000000000007 (RELA) 0x478 0x0000000000000008 (RELASZ) 264 (bytes) 0x0000000000000009 (RELAENT) 24 (bytes) 0x000000006ffffff9 (RELACOUNT) 3 0x0000000000000000 (NULL) 0x0 Relocation section '.rela.dyn' at offset 0x478 contains 11 entries: Offset Info Type Sym. Value Sym. Name + Addend 000000003e60 000000000008 R_X86_64_RELATIVE 10f0 000000003e68 000000000008 R_X86_64_RELATIVE 10b0 000000004018 000000000008 R_X86_64_RELATIVE 4018 000000003fc0 000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0 000000003fc8 000600000006 R_X86_64_GLOB_DAT 0000000000004028 __tvm_module_ctx + 0 000000003fd0 000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCloneTa + 0 000000003fd8 000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTMClone + 0 000000003fe0 000a00000006 R_X86_64_GLOB_DAT 0000000000004030 __TVMFuncCall + 0 000000003fe8 000700000006 R_X86_64_GLOB_DAT 0000000000004040 __TVMAPISetLastError + 0 000000003ff0 000900000006 R_X86_64_GLOB_DAT 0000000000004038 __TVMBackendGetFuncFro + 0 000000003ff8 000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0 The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported. Symbol table '.dynsym' contains 11 entries: Num: Value Size Type Bind Vis Ndx Name 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND 1: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __cxa_finalize 2: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMCloneTable 3: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab 4: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__ 5: 0000000000001100 822 FUNC GLOBAL DEFAULT 9 tvmgen_default_fused_add 6: 0000000000004028 8 OBJECT WEAK DEFAULT 20 __tvm_module_ctx 7: 0000000000004040 8 OBJECT WEAK DEFAULT 20 __TVMAPISetLastError 8: 0000000000002a06 715 OBJECT GLOBAL DEFAULT 11 __tvm_dev_mblob 9: 0000000000004038 8 OBJECT WEAK DEFAULT 20 __TVMBackendGetFuncFromEn 10: 0000000000004030 8 OBJECT WEAK DEFAULT 20 __TVMFuncCall -- 37: 0000000000004048 8 OBJECT LOCAL DEFAULT 20 .tvm_func.__tvm_set_devic 38: 0000000000004050 8 OBJECT LOCAL DEFAULT 20 .tvm_func.tvmgen_default_ 39: 0000000000001440 132 FUNC LOCAL DEFAULT 9 tvmgen_default_fused_add_ 40: 0000000000000000 0 FILE LOCAL DEFAULT ABS devc 41: 0000000000000000 0 FILE LOCAL DEFAULT ABS crtstuff.c 42: 0000000000002db8 0 OBJECT LOCAL DEFAULT 13 __FRAME_END__ 43: 0000000000000000 0 FILE LOCAL DEFAULT ABS 44: 0000000000003e70 0 OBJECT LOCAL DEFAULT 16 _DYNAMIC 45: 0000000000004020 0 OBJECT LOCAL DEFAULT 19 __TMC_END__ 46: 0000000000004018 0 OBJECT LOCAL DEFAULT 19 __dso_handle 47: 0000000000001000 0 FUNC LOCAL DEFAULT 6 _init 48: 0000000000002cd4 0 NOTYPE LOCAL DEFAULT 12 __GNU_EH_FRAME_HDR 49: 00000000000014c4 0 FUNC LOCAL DEFAULT 10 _fini 50: 0000000000004000 0 OBJECT LOCAL DEFAULT 18 _GLOBAL_OFFSET_TABLE_ 51: 0000000000001100 822 FUNC GLOBAL DEFAULT 9 tvmgen_default_fused_add 52: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __cxa_finalize 53: 0000000000004028 8 OBJECT WEAK DEFAULT 20 __tvm_module_ctx 54: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMCloneTable 55: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab 56: 0000000000002a06 715 OBJECT GLOBAL DEFAULT 11 __tvm_dev_mblob 57: 0000000000004030 8 OBJECT WEAK DEFAULT 20 __TVMFuncCall 0000000000004000 <_GLOBAL_OFFSET_TABLE_>: 4000: 70 3e jo 4040 <__TVMAPISetLastError> ... Disassembly of section .data: 0000000000004018 <__dso_handle>: 4018: 18 40 00 sbb %al,0x0(%rax) 401b: 00 00 add %al,(%rax) 401d: 00 00 add %al,(%rax) ... Disassembly of section .bss: 0000000000004020 <completed.8060>: ... 0000000000004028 <__tvm_module_ctx>: ... 0000000000004030 <__TVMFuncCall>:
可见 __TVMFuncCall 是一个 dynsym (类似于 libc 中的 print), 调用这个符号时最终会用 .bss 中 0x4030 处的值做为函数的地址.
实现这一目标分为两步:
llvm 初始化一个名为 __TVMFuncCall 的全局变量, 并生成了对这个全局变量对应的地址的调用
tvm/src/target/llvm/codegen_cpu.cc::void CodeGenCPU::InitGlobalContext(bool dynamic_lookup) {
gv_tvm_func_call_ = InitContextPtr(ftype_tvm_func_call_->getPointerTo(), "__TVMFuncCall"); llvm::GlobalVariable* gv = new llvm::GlobalVariable( *module_, p_type, false, llvm::GlobalValue::LinkOnceAnyLinkage, nullptr, name); gv->setInitializer(llvm::Constant::getNullValue(p_type)); llvm::BasicBlock* CodeGenCPU::MakeCallPacked(const Array<PrimExpr>& args, llvm::Value** rvalue, llvm::Value** ret_tcode, const DataType& r_type, const int64_t begin, const int64_t end) { auto call_callee = RuntimeTVMFuncCall(); builder_->CreateCall( call_callee, {handle, arg_value, arg_tcode, ConstInt32(nargs), ret_value, *ret_tcode}) ...
tvm runtime 会负责在运行时用真正的 TVMFuncCall 来初始化 __TVMFuncCall
tvm/src/runtime/library_module.cc::void InitContextFunctions(std::function<void*(const char*)> fgetsymbol) {
TVM_INIT_CONTEXT_FUNC(TVMFuncCall); // fgetsymbol 在这里实际就是 dlsym if (auto* fp = reinterpret_cast<decltype(&FuncName)*>(fgetsymbol("__" #FuncName))) { \ *fp = FuncName; \ }