Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize representation of runfiles in compact execution log #23321

Closed
wants to merge 51 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Aug 16, 2024

Runfiles trees are now represented with a custom RunfilesTree message in the compact execution log. This allows using InputSets to representing all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an InputSet. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields. By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. Spawns), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on #18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

@fmeum fmeum force-pushed the runfiles-spawn-log branch from 1dfb461 to 0275370 Compare August 18, 2024 11:41
@fmeum fmeum changed the title Runfiles spawn log Optimize representation of runfiles in compact execution log Aug 19, 2024
@fmeum fmeum force-pushed the runfiles-spawn-log branch from 0275370 to f71f8ad Compare August 21, 2024 10:43
@fmeum fmeum marked this pull request as ready for review August 21, 2024 10:50
@fmeum fmeum requested review from a team as code owners August 21, 2024 10:50
@fmeum fmeum requested review from aranguyen and removed request for a team and aranguyen August 21, 2024 10:50
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Aug 21, 2024
@fmeum fmeum removed team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Aug 21, 2024
@fmeum fmeum requested a review from tjgq August 21, 2024 10:59
@fmeum fmeum force-pushed the runfiles-spawn-log branch from 4304e70 to 7a3e2dd Compare August 22, 2024 15:06
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 22, 2024

I resolved the conflict and subsequently fixed the existing tests after the merge of the SymlinkAction PR.

src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Show resolved Hide resolved
@fmeum fmeum force-pushed the runfiles-spawn-log branch 2 times, most recently from 94f75fc to dfe1deb Compare August 30, 2024 13:12
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 30, 2024

While adding more tests, I noticed that there is a pathological case that the new representation currently doesn't handle: If multiple artifacts at canonical locations collide, then the last in nested set order wins. Runfiles always use postorder, which I can almost emulate, but the distinction by type in InputSet loses some information about the order. For example, a directory might have come before a file, but InputSet doesn't allow me to see that.

@tjgq Since entry IDs are globally unique anyway, what do you think of merging file_ids, directory_ids and unresolved_symlink_ids into a single field that preserves the order?

@tjgq
Copy link
Contributor

tjgq commented Aug 30, 2024

While adding more tests, I noticed that there is a pathological case that the new representation currently doesn't handle: If multiple artifacts at canonical locations collide, then the last in nested set order wins. Runfiles always use postorder, which I can almost emulate, but the distinction by type in InputSet loses some information about the order. For example, a directory might have come before a file, but InputSet doesn't allow me to see that.

@tjgq Since entry IDs are globally unique anyway, what do you think of merging file_ids, directory_ids and unresolved_symlink_ids into a single field that preserves the order?

Oof, it's a disruptive change for any tools already consuming the compact format, but that's exactly why I wanted the format to remain experimental until Bazel 8 is released. Let's do it.

@fmeum fmeum force-pushed the runfiles-spawn-log branch 2 times, most recently from 7cb21e0 to 147ffc2 Compare September 2, 2024 08:45
@fmeum fmeum requested a review from tjgq September 2, 2024 12:55
@fmeum
Copy link
Collaborator Author

fmeum commented Sep 2, 2024

@tjgq I added a bunch of tests and addressed your comments. Since we were already making breaking changes to the format, I slightly tweaked how we represent IDs also in other parts of the protocol. I can revert that part of the change if you prefer.

@fmeum fmeum force-pushed the runfiles-spawn-log branch from a962ffa to 786440e Compare September 17, 2024 22:46
@fmeum fmeum requested a review from tjgq September 17, 2024 22:49
Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this amazing contribution. I'll get started on the import.

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 19, 2024

@bazel-io fork 7.4.0

@tjgq
Copy link
Contributor

tjgq commented Sep 19, 2024

I'm making the following changes on import:

  • Use new field ID for InputSet.input_ids and Output.output_id instead of reusing an existing ID (makes it easier to write a polyglot tool, as discussed out of band)
  • Replace occurrences of bazel in tests with TestConstants.PRODUCT_NAME (which is blaze internally)

@fmeum fmeum deleted the runfiles-spawn-log branch September 20, 2024 10:26
iancha1992 pushed a commit that referenced this pull request Sep 20, 2024
Runfiles trees are now represented with a custom `RunfilesTree` message in the compact execution log. This allows using `InputSet`s to representing all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an `InputSet`. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields. By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. `Spawn`s), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on #18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

Closes #23321.

PiperOrigin-RevId: 676773599
Change-Id: I010653681ffa44557142bf25009e9178b5d68515
fmeum added a commit to fmeum/bazel that referenced this pull request Sep 23, 2024
Runfiles trees are now represented with a custom `RunfilesTree` message in the compact execution log. This allows using `InputSet`s to representing all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an `InputSet`. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields. By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. `Spawn`s), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on bazelbuild#18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

Closes bazelbuild#23321.

PiperOrigin-RevId: 676773599
Change-Id: I010653681ffa44557142bf25009e9178b5d68515
(cherry picked from commit c2f539c)
fmeum added a commit to fmeum/bazel that referenced this pull request Sep 23, 2024
Runfiles trees are now represented with a custom `RunfilesTree` message in the compact execution log. This allows using `InputSet`s to representing all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an `InputSet`. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields. By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. `Spawn`s), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on bazelbuild#18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

Closes bazelbuild#23321.

PiperOrigin-RevId: 676773599
Change-Id: I010653681ffa44557142bf25009e9178b5d68515
(cherry picked from commit c2f539c)
fmeum added a commit to fmeum/bazel that referenced this pull request Sep 27, 2024
Runfiles trees are now represented with a custom `RunfilesTree` message in the compact execution log. This allows using `InputSet`s to representing all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an `InputSet`. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields. By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. `Spawn`s), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on bazelbuild#18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

Closes bazelbuild#23321.

PiperOrigin-RevId: 676773599
Change-Id: I010653681ffa44557142bf25009e9178b5d68515
(cherry picked from commit c2f539c)
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Sep 30, 2024
Cherry-picks the following changes:

* Optimize representation of runfiles in compact execution log (bazelbuild#23321)
* Keep runfiles tree IDs in memory for multiple test attempts (bazelbuild#23703)
* Fix naming inconsistency in `spawn.proto` (bazelbuild#23706)
* Mark tool runfiles as such in expanded execution log (bazelbuild#23702)

The cherry-picks required introducing a `Map<Artifact, RunfilesTree>`
shim to `RunfilesSupplier` that matches the Bazel 8 way of obtaining a
`RunfilesTree` from a runfiles middleman via `InputMetadataProvider`.

Closes bazelbuild#23683 
Closes bazelbuild#23710
Closes bazelbuild#23711
Closes bazelbuild#23734
@iancha1992
Copy link
Member

The changes in this PR have been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author team-Performance Issues for Performance teams
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants