
zld 0.0.2

@kubkon released this 26 Oct 09:44

The major achievement of this release is the rewrite of the majority of the Mach-O linker in the spirit of data-oriented design, which led to

  • significantly reduced link times, to the point where, dare I say, we start to become competitive with lld and ld64 - you can find some numbers below, and
  • reduced memory usage, by avoiding unnecessary allocations and instead re-parsing data when it is actually needed.

Some benchmarks

zld refers to our linker as a standalone binary, lld to LLVM's linker, and ld64 to Apple's linker. I should point out that we are still missing a number of optimisations in the linker, such as cstring deduplication, compression of the dynamic linker's relocations, and synthesis of the unwind info section, so the difference between us and the other linkers will most likely shrink a little.

Results on M1Pro

  • linking redis-server binary
❯ hyperfine ./zld ./lld ./ld64 --warmup 60
Benchmark 1: ./zld
  Time (mean ± σ):      35.6 ms ±   0.4 ms    [User: 35.9 ms, System: 10.4 ms]
  Range (min … max):    34.8 ms …  36.4 ms    79 runs
 
Benchmark 2: ./lld
  Time (mean ± σ):      49.2 ms ±   0.8 ms    [User: 42.6 ms, System: 17.6 ms]
  Range (min … max):    48.0 ms …  51.2 ms    59 runs
 
Benchmark 3: ./ld64
  Time (mean ± σ):      47.2 ms ±   0.5 ms    [User: 60.1 ms, System: 14.4 ms]
  Range (min … max):    46.2 ms …  48.1 ms    61 runs
 
Summary
  './zld' ran
    1.32 ± 0.02 times faster than './ld64'
    1.38 ± 0.03 times faster than './lld'
  • linking Zig's stage3 compiler
❯ hyperfine ./zld ./lld ./ld64 --warmup 5
Benchmark 1: ./zld
  Time (mean ± σ):      1.934 s ±  0.012 s    [User: 2.870 s, System: 0.468 s]
  Range (min … max):    1.923 s …  1.962 s    10 runs
 
Benchmark 2: ./lld
  Time (mean ± σ):      1.153 s ±  0.014 s    [User: 1.289 s, System: 0.230 s]
  Range (min … max):    1.141 s …  1.179 s    10 runs
 
Benchmark 3: ./ld64
  Time (mean ± σ):      2.349 s ±  0.006 s    [User: 3.875 s, System: 0.218 s]
  Range (min … max):    2.341 s …  2.357 s    10 runs
 
Summary
  './lld' ran
    1.68 ± 0.02 times faster than './zld'
    2.04 ± 0.02 times faster than './ld64'

Results on Intel i9

  • linking Zig's stage3 compiler
❯ hyperfine ./zld ./lld ./ld64 --warmup 5
Benchmark 1: ./zld
  Time (mean ± σ):      3.039 s ±  0.018 s    [User: 2.339 s, System: 0.671 s]
  Range (min … max):    3.000 s …  3.064 s    10 runs
 
Benchmark 2: ./lld
  Time (mean ± σ):      1.383 s ±  0.015 s    [User: 1.393 s, System: 0.483 s]
  Range (min … max):    1.363 s …  1.416 s    10 runs
 
Benchmark 3: ./ld64
  Time (mean ± σ):      2.090 s ±  0.018 s    [User: 3.000 s, System: 0.620 s]
  Range (min … max):    2.066 s …  2.126 s    10 runs
 
Summary
  './lld' ran
    1.51 ± 0.02 times faster than './ld64'
    2.20 ± 0.03 times faster than './zld'

Detailed overview of major changes

No relocs/code pre-parsing per Atom

Prior to this rewrite, we would pre-parse the code and relocs for each Atom, i.e., a subsection of an input section in a relocatable object file, and store the results on the heap. This is not only slow but also completely unnecessary: we can delay the work until we actually need it. This approach is now followed throughout.
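As a rough illustration of the idea, here is a minimal sketch in Zig; the type, field and function names below are hypothetical and are not zld's actual definitions. The point is that an atom records only where its bytes live in the input object file and re-reads them on demand.

```zig
const std = @import("std");

// Hypothetical sketch: an Atom stores only indices/offsets into its input
// object file; its code (and, in the same spirit, its relocs) is re-sliced
// out of the mapped file when needed instead of being copied to the heap.
const Atom = struct {
    /// Index of the input section this atom was carved out of.
    sect_index: u8,
    /// Offset and size of the atom within that input section.
    off: usize,
    size: usize,

    /// Re-read the atom's code straight from the object file contents;
    /// nothing is cached per atom.
    fn getCode(atom: Atom, object: *const Object) []const u8 {
        const sect = object.sections[atom.sect_index];
        return object.data[sect.offset + atom.off ..][0..atom.size];
    }
};

const Object = struct {
    /// Raw (e.g. memory-mapped) contents of the relocatable object file.
    data: []const u8,
    sections: []const std.macho.section_64,
};
```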

Linker now follows standard stages

Like lld, mold and ld64, we now also implement linking in stages: first comes symbol resolution, then we parse input sections into atoms, then we do dead code stripping (if desired), then we create synthetic atoms such as GOT cells, then we create thunks if required, and so on. This significantly simplified the entire linker, as each stage does one very specialised piece of work and nothing more.
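To make the ordering concrete, here is an illustrative sketch of such a pipeline; every name below is made up for the example and does not reflect zld's actual internals.

```zig
// Illustrative sketch only: each stage does one specialised job over the
// whole input set before the next stage runs.
const MachO = struct {
    dead_strip: bool = false,

    fn flush(self: *MachO) !void {
        try self.resolveSymbols(); // 1. symbol resolution across all inputs
        try self.parseIntoAtoms(); // 2. split input sections into atoms
        if (self.dead_strip) try self.gcAtoms(); // 3. dead code stripping (optional)
        try self.createSyntheticAtoms(); // 4. GOT cells, stubs, etc.
        try self.createThunks(); // 5. range-extending thunks, if required
        try self.writeOutput(); // 6. generate code for synthetics, relocate, write
    }

    // Empty stubs so the sketch compiles; the real stages live in zld.
    fn resolveSymbols(_: *MachO) !void {}
    fn parseIntoAtoms(_: *MachO) !void {}
    fn gcAtoms(_: *MachO) !void {}
    fn createSyntheticAtoms(_: *MachO) !void {}
    fn createThunks(_: *MachO) !void {}
    fn writeOutput(_: *MachO) !void {}
};
```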

We do not store any code or relocs per synthetic atom

Instead of generating the code and relocs for each synthetic atom (GOT cells, stubs, etc.), we only track their counts, VM addresses and targets, and we generate the code and apply relocations when writing the final image. In fact, we do not even need to track the addresses beyond the start and size of each synthetic section; I will refactor this in the future as well.
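As a hedged sketch of what this means in practice (the names and types are illustrative, not zld's actual code), a synthetic section such as the GOT can get away with tracking only which symbols need a cell; cell addresses are derived from the section start, and the cell contents are materialised at write time from the resolved symbol addresses.

```zig
const std = @import("std");

// Illustrative sketch: nothing is stored per synthetic atom beyond the
// target it points at; code is produced when writing the output image.
const GotSection = struct {
    /// Symbol indices that were referenced via the GOT; one 8-byte cell each.
    entries: std.ArrayListUnmanaged(u32) = .{},
    /// Start VM address of the section, assigned during layout.
    addr: u64 = 0,

    fn size(got: GotSection) u64 {
        return got.entries.items.len * @sizeOf(u64);
    }

    /// VM address of the i-th GOT cell, derived on the fly from the section
    /// start; no per-entry address is stored.
    fn entryAddress(got: GotSection, index: usize) u64 {
        return got.addr + index * @sizeOf(u64);
    }

    /// Materialise the cells into the output buffer at write time: each cell
    /// is simply the resolved address of its target symbol.
    fn write(got: GotSection, symbol_addrs: []const u64, out: []u8) void {
        for (got.entries.items, 0..) |sym_index, i| {
            const cell = out[i * 8 ..][0..8];
            std.mem.writeInt(u64, cell, symbol_addrs[sym_index], .little);
        }
    }
};
```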

Thunks

While at it, I also went ahead and implemented range-extending thunks, which means we can now link larger programs on arm64 without the linker erroring out. One word of explanation: contrary to what the issue suggested, we extend the jump range via thunks rather than branch islands. For those unfamiliar, both methods extend the range of a jump on a RISC ISA; however, a thunk uses a scratch register, loading the unreachable target's address into the scratch register and then branching via that register. As such, a thunk is 12 bytes on arm64. Branch islands, on the other hand, are 4 bytes, as they are simple bl #next_label instructions. Branch islands are thus short-range extenders: in order to jump further in the file, we chain the jumps, hopping between islands until we reach the actual target.
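For readers who want to picture the 12 bytes, here is a hedged sketch; the exact instruction sequence shown is one common way to materialise a far target via a scratch register and is only an assumption about the shape of such a thunk, not a dump of zld's generated code.

```zig
// Assumed shape of an arm64 thunk: three 4-byte instructions using x16 as
// the scratch register to reach a target outside the +/-128 MiB range of a
// direct branch, e.g.:
//
//     adrp x16, target@page         // load the target's page address
//     add  x16, x16, target@pageoff // add the offset within the page
//     br   x16                      // branch via the scratch register
//
// A branch island, by contrast, is a single 4-byte branch, so distant
// targets may require chaining several islands.
const thunk_size: u64 = 3 * @sizeOf(u32); // 12 bytes on arm64

/// Illustrative helper: a direct arm64 b/bl can only encode a signed 26-bit
/// word displacement (+/-128 MiB); anything further must go through a thunk.
fn needsThunk(source_addr: u64, target_addr: u64) bool {
    const disp = @as(i64, @intCast(target_addr)) - @as(i64, @intCast(source_addr));
    const max_distance: i64 = 0x8000000; // 128 MiB
    return disp < -max_distance or disp >= max_distance;
}
```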

What's Changed

  • macho: improve linking speed, reduce memory usage by @kubkon in #9
  • elf: simplify like macho linker by @kubkon in #10
  • fixes: stage3 the new default by @kubkon in #11

Full Changelog: v0.0.1...v0.0.2