TVM Target Codegen
Table of Contents
1. TVM Target Codegen
Target Codegen 分为以下几类:
基于 LLVM
包括 llvm, hexagon, arm, x86 等
基于源码 (source)
这类 codegen 会输入源码
包括:c, opencl, cuda, metal 等
Target Codegen 与 BYOC Codegen 有些相似的地方, 以 opencl target 为例:
- 通过 target.build.opencl 找到 opencl codegen
- 编译生成的 symbol 和 opencl kernel 同样通过 SaveToBinary 保存在 elf 中
- 运行时也是通过 GetFunction dispatch 到 opencl_module, 后者会通过 opencl 调用对应的 kernel
1.1. Opencl Codegen Example
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# 2021-08-03 11:11
import tvm
from tvm import relay
x = relay.var("x", shape=(1, 1000), dtype="float32")
y = relay.add(x, x)
func = relay.Function([x], y)
mod = tvm.IRModule.from_expr(func)
print(mod)
with tvm.transform.PassContext(opt_level=3):
graph, lib, params = relay.build(
mod, target="opencl", params=None, target_host="llvm"
)
print(graph)
print(lib.imported_modules[0].get_source())
lib.export_library("/tmp/a.elf")
def @main(%x: Tensor[(1, 1000), float32]) {
add(%x, %x)
}
{
"nodes": [
{
"op": "null",
"name": "x",
"inputs": []
},
{
"op": "tvm_op",
"name": "tvmgen_default_fused_add",
"attrs": {
"num_inputs": "1",
"num_outputs": "1",
"hash": "e28096c8374f91b0",
"flatten_data": "0",
"func_name": "tvmgen_default_fused_add"
},
"inputs": [
[
0,
0,
0
]
]
}
],
"arg_nodes": [0],
"heads": [
[
1,
0,
0
]
],
"attrs": {
"dltype": [
"list_str",
[
"float32",
"float32"
]
],
"shape": [
"list_shape",
[
[1, 1000],
[1, 1000]
]
],
"storage_id": [
"list_int",
[0, 1]
]
},
"node_row_ptr": [0, 1, 2]
}
// Function: tvmgen_default_fused_add_kernel0
__kernel void tvmgen_default_fused_add_kernel0(__global float* restrict T_add, __global float* restrict placeholder) {
if (((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))) < 1000) {
T_add[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] = (placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))] + placeholder[(((((int)get_group_id(0)) * 256) + ((int)get_local_id(0))))]);
}
}
readelf -p .rodata /tmp/a.elf objdump -d /tmp/a.elf
String dump of section '.rodata':
...
[ ab7] // Function: tvmgen_default_fused_add_kernel0^J__kernel void
tvmgen_default_fused_add_kernel0(__global float* restrict T_add, __global float*
restrict placeholder) {^J if (((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))) < 1000) {^J T_add[(((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))))] = (placeholder[(((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))))] + placeholder[(((((int)get_group_id(0)) * 256) +
((int)get_local_id(0))))]);^J }^J}^J^J^L ...
/tmp/a.elf: file format elf64-x86-64
...
0000000000001100 <tvmgen_default_fused_add>:
1100: 41 57 push %r15
1102: 41 56 push %r14
1104: 53 push %rbx
1105: 48 83 ec 50 sub $0x50,%rsp
1109: 83 fa 02 cmp $0x2,%edx
110c: 0f 85 ce 01 00 00 jne 12e0 <tvmgen_default_fused_add+0x1e0>
...
12aa: e8 91 01 00 00 callq 1440 <tvmgen_default_fused_add_compute_>
...
0000000000001440 <tvmgen_default_fused_add_compute_>:
...
1466: 48 8b 3d e3 2b 00 00 mov 0x2be3(%rip),%rdi # 4050 <.tvm_func.tvmgen_default_fused_add_kernel0>
146d: 48 85 ff test %rdi,%rdi
1470: 74 24 je 1496 <tvmgen_default_fused_add_compute_+0x56>
1472: 48 8b 05 67 2b 00 00 mov 0x2b67(%rip),%rax # 3fe0 <__TVMFuncCall-0x50>
...
14b1: ff 10 callq *(%rax)
14b3: 85 c0 test %eax,%eax
14b5: 75 d7 jne 148e <tvmgen_default_fused_add_compute_+0x4e>
14b7: 48 8b 3c 24 mov (%rsp),%rdi
14bb: 48 89 3d 8e 2b 00 00 mov %rdi,0x2b8e(%rip) # 4050 <.tvm_func.tvmgen_default_fused_add_kernel0>
14c2: eb ae jmp 1472 <tvmgen_default_fused_add_compute_+0x32>
从上面输出的信息可以得到以下结论:
- llvm 生成了 host 端的 tvmgen_default_fused_add 和 tvmgen_default_fused_add_compute_
- opencl runtime 会通过 tvmgen_default_fused_add_kernel0 这个 symbol 执行 opencl kernel
- tvmgen_default_fused_add_compute_ 通过 __TVMFuncCall 来调用 tvmgen_default_fused_add_kernel0, 后者由 opencl runtime 来响应, 最终调用到 opencl kernel
1.2. Target Codegen Impl
1.2.1. runtime
target runtime 的逻辑 与 BYOC Runtime 完全相同:
根据 elf 中的 __tvm_dev_mblob 加载 runtime.module.loadbinarytkey, 由后者生成对应的 runtime (或者叫 ModuleNode)
tvm/src/runtime/opencl/opencl_module.cc::TVM_REGISTER_GLOBAL("runtime.module.loadbinary_opencl").set_body_typed(OpenCLModuleLoadBinary);runtime 的 GetFunction 负责真正的调用, 以 opencl 为例, 负责通过 opencl 启动对应的 kernel
tvm/src/runtime/opencl/opencl_module.cc::PackedFunc OpenCLModuleNode::GetFunction(const std::string& name,
1.2.2. Codegen
codegen 与 Relay Codegen 相比更简化一些, 因为它不需要 annotate. 其它的步骤比如
注册 codegen
tvm/src/target/source/codegen_opencl.cc::TVM_REGISTER_GLOBAL("target.build.opencl").set_body_typed(BuildOpenCL);编译
tvm/src/target/codegen.cc::runtime::Module Build(IRModule mod, Target target) {- SaveToBinary
与 Relay Codegen 基本类似
1.2.3. What about the __TVMFuncCall
以 opencl 为例, 除了通过 opencl codegen 生成了一些 symbol (例如 tvmgen_default_fused_add_kernel0) 外, 还有一些 host 端代码通过 __TVMFuncCall 来调用这些 symbol.
1.2.3.1. __TVMFuncCall 在哪儿
readelf -a /tmp/a.elf|grep __TVMFuncCall -B 20 objdump -D /tmp/a.elf|grep 4030 -B 20
0x0000000000000005 (STRTAB) 0x3a8
0x0000000000000006 (SYMTAB) 0x2a0
0x000000000000000a (STRSZ) 205 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x4000
0x0000000000000007 (RELA) 0x478
0x0000000000000008 (RELASZ) 264 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffff9 (RELACOUNT) 3
0x0000000000000000 (NULL) 0x0
Relocation section '.rela.dyn' at offset 0x478 contains 11 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000003e60 000000000008 R_X86_64_RELATIVE 10f0
000000003e68 000000000008 R_X86_64_RELATIVE 10b0
000000004018 000000000008 R_X86_64_RELATIVE 4018
000000003fc0 000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0
000000003fc8 000600000006 R_X86_64_GLOB_DAT 0000000000004028 __tvm_module_ctx + 0
000000003fd0 000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCloneTa + 0
000000003fd8 000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTMClone + 0
000000003fe0 000a00000006 R_X86_64_GLOB_DAT 0000000000004030 __TVMFuncCall + 0
000000003fe8 000700000006 R_X86_64_GLOB_DAT 0000000000004040 __TVMAPISetLastError + 0
000000003ff0 000900000006 R_X86_64_GLOB_DAT 0000000000004038 __TVMBackendGetFuncFro + 0
000000003ff8 000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.
Symbol table '.dynsym' contains 11 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __cxa_finalize
2: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMCloneTable
3: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab
4: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__
5: 0000000000001100 822 FUNC GLOBAL DEFAULT 9 tvmgen_default_fused_add
6: 0000000000004028 8 OBJECT WEAK DEFAULT 20 __tvm_module_ctx
7: 0000000000004040 8 OBJECT WEAK DEFAULT 20 __TVMAPISetLastError
8: 0000000000002a06 715 OBJECT GLOBAL DEFAULT 11 __tvm_dev_mblob
9: 0000000000004038 8 OBJECT WEAK DEFAULT 20 __TVMBackendGetFuncFromEn
10: 0000000000004030 8 OBJECT WEAK DEFAULT 20 __TVMFuncCall
--
37: 0000000000004048 8 OBJECT LOCAL DEFAULT 20 .tvm_func.__tvm_set_devic
38: 0000000000004050 8 OBJECT LOCAL DEFAULT 20 .tvm_func.tvmgen_default_
39: 0000000000001440 132 FUNC LOCAL DEFAULT 9 tvmgen_default_fused_add_
40: 0000000000000000 0 FILE LOCAL DEFAULT ABS devc
41: 0000000000000000 0 FILE LOCAL DEFAULT ABS crtstuff.c
42: 0000000000002db8 0 OBJECT LOCAL DEFAULT 13 __FRAME_END__
43: 0000000000000000 0 FILE LOCAL DEFAULT ABS
44: 0000000000003e70 0 OBJECT LOCAL DEFAULT 16 _DYNAMIC
45: 0000000000004020 0 OBJECT LOCAL DEFAULT 19 __TMC_END__
46: 0000000000004018 0 OBJECT LOCAL DEFAULT 19 __dso_handle
47: 0000000000001000 0 FUNC LOCAL DEFAULT 6 _init
48: 0000000000002cd4 0 NOTYPE LOCAL DEFAULT 12 __GNU_EH_FRAME_HDR
49: 00000000000014c4 0 FUNC LOCAL DEFAULT 10 _fini
50: 0000000000004000 0 OBJECT LOCAL DEFAULT 18 _GLOBAL_OFFSET_TABLE_
51: 0000000000001100 822 FUNC GLOBAL DEFAULT 9 tvmgen_default_fused_add
52: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __cxa_finalize
53: 0000000000004028 8 OBJECT WEAK DEFAULT 20 __tvm_module_ctx
54: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMCloneTable
55: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab
56: 0000000000002a06 715 OBJECT GLOBAL DEFAULT 11 __tvm_dev_mblob
57: 0000000000004030 8 OBJECT WEAK DEFAULT 20 __TVMFuncCall
0000000000004000 <_GLOBAL_OFFSET_TABLE_>:
4000: 70 3e jo 4040 <__TVMAPISetLastError>
...
Disassembly of section .data:
0000000000004018 <__dso_handle>:
4018: 18 40 00 sbb %al,0x0(%rax)
401b: 00 00 add %al,(%rax)
401d: 00 00 add %al,(%rax)
...
Disassembly of section .bss:
0000000000004020 <completed.8060>:
...
0000000000004028 <__tvm_module_ctx>:
...
0000000000004030 <__TVMFuncCall>:
可见 __TVMFuncCall 是一个 dynsym (类似于 libc 中的 print), 调用这个符号时最终会用 .bss 中 0x4030 处的值做为函数的地址.
实现这一目标分为两步:
llvm 初始化一个名为 __TVMFuncCall 的全局变量, 并生成了对这个全局变量对应的地址的调用
tvm/src/target/llvm/codegen_cpu.cc::void CodeGenCPU::InitGlobalContext(bool dynamic_lookup) {gv_tvm_func_call_ = InitContextPtr(ftype_tvm_func_call_->getPointerTo(), "__TVMFuncCall"); llvm::GlobalVariable* gv = new llvm::GlobalVariable( *module_, p_type, false, llvm::GlobalValue::LinkOnceAnyLinkage, nullptr, name); gv->setInitializer(llvm::Constant::getNullValue(p_type)); llvm::BasicBlock* CodeGenCPU::MakeCallPacked(const Array<PrimExpr>& args, llvm::Value** rvalue, llvm::Value** ret_tcode, const DataType& r_type, const int64_t begin, const int64_t end) { auto call_callee = RuntimeTVMFuncCall(); builder_->CreateCall( call_callee, {handle, arg_value, arg_tcode, ConstInt32(nargs), ret_value, *ret_tcode}) ...
tvm runtime 会负责在运行时用真正的 TVMFuncCall 来初始化 __TVMFuncCall
tvm/src/runtime/library_module.cc::void InitContextFunctions(std::function<void*(const char*)> fgetsymbol) {TVM_INIT_CONTEXT_FUNC(TVMFuncCall); // fgetsymbol 在这里实际就是 dlsym if (auto* fp = reinterpret_cast<decltype(&FuncName)*>(fgetsymbol("__" #FuncName))) { \ *fp = FuncName; \ }
