RFC 2603: Rust Symbol Name Mangling v0

作者:王江桐、周紫鹏


RFC 2603 更新了编译器生成组名的规则,使其更一致并且更易解码。在之后,这一规则可能会成为语言规范或/和 Rust 稳定 ABI 的一部分,不过目前这一规则的实现只保留在 Rust 编译器的逻辑里。

目前的组名规则存在以下问题:

  • 组名过程中,存在泛型参数以及其他信息的丢失,无法从符号名称中辨认出原先的类型参数;
  • 现有的规则并不一致,大部分遵从 Itanium ABI 规则,但是有一些并不是这样;
  • 有些符号名字会包含符号 .,但是这一符号并不被所有平台支持;
  • 目前符号的生成与编译器内部逻辑强耦合,生成的符号名字无法被其他编译器实现或外部工具复现。
    • 目前的组名会包含一个哈希值来确保不同的函数最后会映射不同的符号名,哈希值由编译器的内部逻辑生成,这导致了组名无法被快速复现。

新的规则一一修复了这些问题,并去除了哈希值。

组名规则与例子

Unicode 参数名

Rust 允许 Unicode 参数名,不过会使用 Punycode 编码将其转换为 ASCII 字符串。编码举例如下:

OriginalPunycodePunycode + Encoding
føøf-5gaaf_5gaa
α_ω_-ylb7e__ylb7e
铁锈n84amfn84amf
🤦fq9hfq9h
ρυστ2xaedc2xaedc

对于以下模块:


#![allow(unused)]
fn main() {
mod gödel {
  mod escher {
    fn bach() {}
  }
}
}

组名为:

_RNvNtNtC7mycrateu8gdel_5qa6escher4bach
                 <-------->
              Unicode component

转换组名时,在编码之后,会在开头添加 u 表示这是 Unicode 部分。

无泛型参数的函数

对于以下可以通过绝对路径确定的函数 bar


#![allow(unused)]
fn main() {
mod foo {
  fn bar() {}
}
}

其对应组名可能为 NNC7mycrate3foo3bar。字母 N 表示开始标记(Start-Tag)。为了区分不同 crate 中的同名函数,组名需要包含 crate-id,crate-id 以 C 作为开始标记。数字 3 来自游程编码,形似于 Itanium C++ ABI 用于标识符的编码格式。游程编码(Run-Length Encoding,RLE),又称行程长度编码或变动长度编码法,是一种与资料性质无关的无损数据压缩技术,基于“使用变动长度的码来取代连续重复出现的原始资料来实现压缩。举例来说,一组资料串"AAAABBBCCDEEEE",由4个A、3个B、2个C、1个D、4个E组成,经过变动长度编码法可将资料压缩为4A3B2C1D4E(由14个单位转成10个单位)。

对于同名但是类型不同的两个函数:


#![allow(unused)]
fn main() {
fn foo() {
    fn bar() {}
}

mod foo {
    fn bar() {}
}
}

前者组名为 NvNvC7mycrate3foo3bar ,后者组名为 NvNtC7mycrate3foo3bar ,来区分前者,即函数 foo ,定义于值命名空间,而后者,即模块 foo,定义于类型命名空间。

在使用宏的情况下,在同一上下文中,实例可能会含有重名的情况。例如:


#![allow(unused)]
fn main() {
mod foo {
    fn bar'0() {}
    // The second `bar` function was introduced by macro expansion.
    fn bar'1() {}
}
}

编译器会自动增加数字索引来区分这两者,生成的组名分别为 NvNtC7mycrate3foo3bar NvNtC7mycrate3foos_3bar s_ 代表该同名函数的第二种情况。这一区分规则也同样应用于区分同名 crate。

Rust 不允许函数重载,即函数同名,但是参数不同的情况。组名不需要对此种情况进行区分。

最后,完整生成的组名可能会包含一个 _R 前缀。举例如下:

_RNvNtCs1234_7mycrate3foo3bar
<>^^^^^<----><------><--><-->
||||||   |      |     |   |
||||||   |      |     |   +--- "bar" identifier
||||||   |      |     +------- "foo" identifier
||||||   |      +------------- "mycrate" identifier
||||||   +-------------------- disambiguator for "mycrate"
|||||+------------------------ start-tag for "mycrate"
||||+------------------------- namespace tag for "foo"
|||+-------------------------- start-tag for "foo"
||+--------------------------- namespace tag for "bar"
|+---------------------------- start-tag for "bar"
+----------------------------- common Rust symbol prefix

泛型函数

泛型函数的具体参数会添加到组名中,以 I 作为开始标记,来区分不同的泛型函数。例如:


#![allow(unused)]
fn main() {
std::mem::align_of::<f64>
}

组名为:

_RINvNtC3std3mem8align_ofdE
  ^                      ^^
  |                      ||
  |                      |+--- end of argument list
  |                      +--- f64
  +--- start-tag

原始类型会遵从 Itanium 规范,编码为单一的小写字母,其他的具体类型会以它们的绝对路径编码,形式类似于编码无反省参数的函数。引用、元组、以及函数指针则以如下例子中的规则编码:

std::mem::align_of::<usize>: _RINvNtC3std3mem8align_ofjE
std::mem::align_of::<&char>: _RINvNtC3std3mem8align_ofRcE
std::mem::align_of::<std::mem::Discriminant>: _RINvNtC3std3mem8align_ofNtNtC3std3mem12DiscriminantE
std::mem::align_of::<&mut (&str,())>: _RINvNtC3std3mem8align_ofQTReuEE

同样,为了区分不同上下文(例如不同 crate)中的同名泛型函数,编译器会在后缀添加 crate-id:

_RINvNtC3std3mem8align_ofjEC3foo (for crate "foo")
_RINvNtC3std3mem8align_ofjEC3bar (for crate "bar")

闭包

编译器会为匿名函数闭包生成名字,这个名字会用于编译组名。例如:


#![allow(unused)]
fn main() {
mod foo {
  fn bar(x: u32) {
    let a = |x| { x + 1 }; // local name: NC<...>0
    let b = |x| { x + 2 }; // local name: NC<...>s_0

    a(b(x))
  }
}
}

闭包的局部名称为 NC<...>0C 是命名域标记,<...> 部分则表示父路径,数字 0 代表它们的名字长度为0。因此,这两个闭包的组名分别为 NCNvNtC7mycrate3foo3bar0 以及 NCNvNtC7mycrate3foo3bars_0

方法

方法的组名规则如下:

  • 对于内部 impl 块:e.g. <Foo<u32, char>>::some_method
  • 对于 trait 方法:e.g. <Foo<u32, char> as SomeTrait<Quux>>::some_method

父路径不仅包含 crate-id,而是以 impl 块开头,impl 块包含自身的类型以及实现的 trait 名。扩展后的父路径包含如下语法:

<path> = C <identifier>                      // crate-id root
       | M <type>                            // inherent impl root
       | X <type> <path>                     // trait impl root
       | N <namespace> <path> <identifier>   // nested path
       | I <path> {<generic-arg>} E          // generic arguments
       
<impl-path> = [<disambiguator>] <path>

实际例子:

<mycrate::Foo<u32>>::foo => _RNvMINtC7mycrate3FoomE3foo
<u32 as mycrate::Foo>::foo => _RNvXmNtC7mycrate3Foo3foo
<mycrate::Foo<u32> as mycrate::Bar<u64>>::foo => _RNvXINtC7mycrate3FoomEINtC7mycrate3BaryE3foo

泛型 trait 中的类型

在如下例子中:

#![allow(unused)]
fn main() {
    struct Foo<T>(T);

    impl<T> From<T> for Foo<T> {
      default fn from(x: T) -> Self {
        static MSG: &str = "...";
        panic!("{}", MSG)
      }
    }

    impl<T: Default> From<T> for Foo<T> {
      fn from(x: T) -> Self {
        static MSG: &str = "123";
        panic!("{}", MSG)
      }
    }
}

MSG 的绝对路径为 <Foo<_> as From<_>>::foo::MSG。两个 trait 都是 From,因此需要在组名中添加父路径来区分这两个 trait。

对应的组名为:

_RNvNvXs2_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
_RNvNvXs3_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
       <----------><----------------><----------------------->
        impl-path      self-type            trait-name

压缩和替代

在一些情况下,Rust 泛型会导致参数的绝对名字非常长,而参数的名字长度会影响编译器、链接器、等其它程序的性能。例如,对于如下函数:

#![allow(unused)]
fn main() {
    fn quux(s: Vec<u32>) -> impl Iterator<Item = (u32, usize)> {
        s.into_iter()
         .map(|x| x + 1)
         .filter(|&x| x > 10)
         .zip(0..)
         .chain(iter::once((0, 0)))
    }
}

它的返回值为:

std::iter::Chain<
  std::iter::Zip<
    std::iter::Filter<
      std::iter::Map<
        std::vec::IntoIter<u32>,
        [closure@src/main.rs:16:11: 16:18]>,
      [closure@src/main.rs:17:14: 17:25]>,
    std::ops::RangeFrom<usize>>,
  std::iter::Once<(u32, usize)>>

对于此种情况,如果定义中某一部分出现了多次,那么编译器会采用压缩算法,使用某一个标记来替代这一重复元素。粗略来说,这一算法为:当组名开始时,对于后续出现的重复部分,以反向引用代替。反向引用的形式为 B<base-62-number>_

可以替代的部分有:

  • 所有路径的前缀,以及重复路径本身
  • 除了基础类型以外的所有类型
  • 类型级常量

例如,对于以下参数名:


#![allow(unused)]
fn main() {
std::iter::Chain<std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>>
}

不使用压缩的情况下,完整的组名为:

  0         10        20        30        40        50        60        70        80        90
_RINtNtC3std4iter5ChainINtNtC3std4iter3ZipINtNtC3std3vec8IntoItermEINtNtC3std3vec8IntoItermEEE
                            5----              5----                    5----
                          3-----------                                43---------
                                                                    41--------------------
                                                                   40-----------------------

压缩之后的组名为:

_RINtNtC3std4iter5ChainINtB2_3ZipINtNtB4_3vec8IntoItermEBt_EE
                          ^^^         ^^^               ^^^     back references

完整的组名语法集

  • 非终端相关参数由尖括号表示(e.g. <path>
  • 终端相关以双引号表示(e.g. "_R"
  • 可选项以方括号表示(e.g. [<disambiguator>]
  • 重复项(0或多次)由花括号表示(e.g. {<type>}
  • 备注由 // 表示
// The <decimal-number> specifies the encoding version.
<symbol-name> =
    "_R" [<decimal-number>] <path> [<instantiating-crate>] [<vendor-specific-suffix>]

<path> = "C" <identifier>                    // crate root
       | "M" <impl-path> <type>              // <T> (inherent impl)
       | "X" <impl-path> <type> <path>       // <T as Trait> (trait impl)
       | "Y" <type> <path>                   // <T as Trait> (trait definition)
       | "N" <namespace> <path> <identifier> // ...::ident (nested path)
       | "I" <path> {<generic-arg>} "E"      // ...<T, U> (generic args)
       | <backref>

// Path to an impl (without the Self type or the trait).
// The <path> is the parent, while the <disambiguator> distinguishes
// between impls in that same parent (e.g. multiple impls in a mod).
// This exists as a simple way of ensure uniqueness, and demanglers
// don't need to show it (unless the location of the impl is desired).
<impl-path> = [<disambiguator>] <path>

// The <decimal-number> is the length of the identifier in bytes.
// <bytes> is the identifier itself, and it's optionally preceded by "_",
// to separate it from its length - this "_" is mandatory if the <bytes>
// starts with a decimal digit, or "_", in order to keep it unambiguous.
// If the "u" is present then <bytes> is Punycode-encoded.
<identifier> = [<disambiguator>] <undisambiguated-identifier>
<disambiguator> = "s" <base-62-number>
<undisambiguated-identifier> = ["u"] <decimal-number> ["_"] <bytes>

// Namespace of the identifier in a (nested) path.
// It's an a-zA-Z character, with a-z reserved for implementation-internal
// disambiguation categories (and demanglers should never show them), while
// A-Z are used for special namespaces (e.g. closures), which the demangler
// can show in a special way (e.g. `NC...` as `...::{closure}`), or just
// default to showing the uppercase character.
<namespace> = "C"      // closure
            | "S"      // shim
            | <A-Z>    // other special namespaces
            | <a-z>    // internal namespaces

<generic-arg> = <lifetime>
              | <type>
              | "K" <const> // forward-compat for const generics

// An anonymous (numbered) lifetime, either erased or higher-ranked.
// Index 0 is always erased (can show as '_, if at all), while indices
// starting from 1 refer (as de Bruijn indices) to a higher-ranked
// lifetime bound by one of the enclosing <binder>s.
<lifetime> = "L" <base-62-number>

// Specify the number of higher-ranked (for<...>) lifetimes to bound.
// <lifetime> can then later refer to them, with lowest indices for
// innermost lifetimes, e.g. in `for<'a, 'b> fn(for<'c> fn(...))`,
// any <lifetime>s in ... (but not inside more binders) will observe
// the indices 1, 2, and 3 refer to 'c, 'b, and 'a, respectively.
// The number of bound lifetimes is value of <base-62-number> + 1.
<binder> = "G" <base-62-number>

<type> = <basic-type>
       | <path>                      // named type
       | "A" <type> <const>          // [T; N]
       | "S" <type>                  // [T]
       | "T" {<type>} "E"            // (T1, T2, T3, ...)
       | "R" [<lifetime>] <type>     // &T
       | "Q" [<lifetime>] <type>     // &mut T
       | "P" <type>                  // *const T
       | "O" <type>                  // *mut T
       | "F" <fn-sig>                // fn(...) -> ...
       | "D" <dyn-bounds> <lifetime> // dyn Trait<Assoc = X> + Send + 'a
       | <backref>

<basic-type> = "a"      // i8
             | "b"      // bool
             | "c"      // char
             | "d"      // f64
             | "e"      // str
             | "f"      // f32
             | "h"      // u8
             | "i"      // isize
             | "j"      // usize
             | "l"      // i32
             | "m"      // u32
             | "n"      // i128
             | "o"      // u128
             | "s"      // i16
             | "t"      // u16
             | "u"      // ()
             | "v"      // ...
             | "x"      // i64
             | "y"      // u64
             | "z"      // !
             | "p"      // placeholder (e.g. for generic params), shown as _

// If the "U" is present then the function is `unsafe`.
// The return type is always present, but demanglers can
// choose to omit the ` -> ()` by special-casing "u".
<fn-sig> = [<binder>] ["U"] ["K" <abi>] {<type>} "E" <type>

<abi> = "C"
      | <undisambiguated-identifier>

<dyn-bounds> = [<binder>] {<dyn-trait>} "E"
<dyn-trait> = <path> {<dyn-trait-assoc-binding>}
<dyn-trait-assoc-binding> = "p" <undisambiguated-identifier> <type>
<const> = <type> <const-data>
        | "p" // placeholder, shown as _
        | <backref>

// The encoding of a constant depends on its type. Integers use their value,
// in base 16 (0-9a-f), not their memory representation. Negative integer
// values are preceded with "n". The bool value false is encoded as `0_`, true
// value as `1_`. The char constants are encoded using their Unicode scalar
// value.
<const-data> = ["n"] {<hex-digit>} "_"

// <base-62-number> uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and
// 'Z' is decimal 61.
// "_" with no digits indicates the value 0, while any other value is offset
// by 1, e.g. "0_" is 1, "Z_" is 62, "10_" is 63, etc.
<base-62-number> = {<0-9a-zA-Z>} "_"

<backref> = "B" <base-62-number>

// We use <path> here, so that we don't have to add a special rule for
// compression. In practice, only a crate root is expected.
<instantiating-crate> = <path>

// There are no restrictions on the characters that may be used
// in the suffix following the `.` or `$`.
<vendor-specific-suffix> = ("." | "$") <suffix>