Implement a way to be able to debug cache misses that can be always on without too much overhead #18643
Our exec logs turn out to be ~120GB (and ~150GB once run through the parser), so we'd appreciate anything that can significantly reduce the amount of output while still letting us answer questions about which actions needed a rebuild (rather than being reusable from cache) and which dependencies drove that need.
cc @ramymedhat
Ok, I did look into this a bit and found some issues and some avenues.

**Reducing the size**

There are a couple of things we could do to reduce the size of the output:
I tried zstd compression outside of Bazel (but this can easily be done within Bazel as well if we decide this is the way to go) and also did some dirty hacks to understand the benefits from artifact indexing and a leaner proto.

Large build:

Medium build:
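As an illustration of the artifact-indexing idea above, here is a minimal sketch (not Bazel's actual implementation; all names are made up): each path is interned once and every subsequent log entry references a small integer ID instead of repeating the full string.

```python
# Hypothetical sketch of path indexing: intern each path once and
# refer to it by a small integer ID in every subsequent log entry.
class PathIndex:
    def __init__(self):
        self._ids = {}    # path -> id
        self._paths = []  # id -> path

    def intern(self, path):
        """Return the ID for `path`, assigning a new one on first use."""
        if path not in self._ids:
            self._ids[path] = len(self._paths)
            self._paths.append(path)
        return self._ids[path]

    def lookup(self, path_id):
        """Recover the original path from its ID."""
        return self._paths[path_id]

idx = PathIndex()
a = idx.intern("bazel-out/k8-fastbuild/bin/foo/foo.o")
b = idx.intern("bazel-out/k8-fastbuild/bin/foo/foo.o")  # same ID again
assert a == b == 0
```

Because the ID table round-trips to the original strings, an indexed log carries the same information as the unindexed one, just deduplicated.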
The path indexing contains the exact same information as before, but needs to be re-constructed to compare. The lean format does leave out information, but most of it is probably not relevant; one exception is command lines, where I would store one high-level hash. You can still see that the command line has changed, and you can reconstruct it, e.g. using `aquery`.

**Performance overhead**

I noticed that action execution was quite a bit slower with the execution log turned on (regardless of which variant). I added more profiling to the JSON trace profile to better understand it (and will send a PR to add this permanently). For the large build from above, with the exec log it took ~75mins, without ~7mins. I think we should move to a producer/consumer pattern where we put spawns to be processed in a queue and work on writing the file in a separate thread. We have to see how much overhead this pattern is memory-wise. I am not sure whether that is enough to reduce the overhead to acceptable levels.

**Different formats**

We should use the opportunity to clean up the different logging options anyway. Currently we have:
Perhaps it would also make sense to add an option to only write the exec log for specific actions (basically filtering on mnemonics and targets, see also #12075) and for the other actions only include high-level information? This would not only help address size issues but also performance concerns.

**How to compare protos**

For large exec logs, a tool to compare them is essential. @coeuvre created https://github.com/coeuvre/bazel-find-non-deterministic-actions; it helps find non-deterministic actions, but is built on top of the JSON output. The tool needs to be changed to support the lean proto format. Then we could also let the tool spit out other helpful commands.

**Related issues**
**Next steps**

I am aware that I am abusing this feature request a little bit to take one step back and understand the bigger picture. I want to give people some time to voice their opinions here before proceeding, so please let me know what you think.
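The producer/consumer pattern proposed above could look roughly like this (a sketch with illustrative names, not the actual Bazel code): action threads enqueue finished spawns, and a single writer thread drains the queue, so serialization and I/O no longer happen on the critical path. The bounded queue is what caps the memory overhead.

```python
import queue
import threading

_SENTINEL = object()  # signals the writer thread to stop

class AsyncLogWriter:
    """Sketch: action threads call log_spawn(); one thread does the writing."""

    def __init__(self, out, maxsize=1000):
        self._out = out
        # Bounded queue: producers block when the writer falls behind,
        # which limits the memory overhead of buffered spawns.
        self._queue = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log_spawn(self, spawn):
        self._queue.put(spawn)  # called from action threads

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is _SENTINEL:
                return
            self._out.append(item)  # stand-in for serialize-and-write

    def close(self):
        self._queue.put(_SENTINEL)
        self._thread.join()

out = []
w = AsyncLogWriter(out)
for i in range(100):
    w.log_spawn({"id": i})
w.close()
assert len(out) == 100
```

Since there is a single consumer on a FIFO queue, the log preserves enqueue order, which matters if the on-disk format is order-sensitive.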
This might help to understand when it is costly, see also #18643 PiperOrigin-RevId: 542194475 Change-Id: I85932d8a5d7a832ee4675722642deb675f7db81f
@meisterT thanks for that detailed outline, it aligns with what I see too: massive exec logs (50GB+), and the build is slower if the exec log is on. I'd prefer to have exec logging always on, but this makes it unworkable.
we would prefer this data be visible through BES somehow, be it as a cas artifact (like the profile) or via special messages...the BES also prints actions (generally on failure only) but it can emit all actions in a slightly different format...it's annoying the BES actions and exec log spawns don't match since exec log seems richer...
Are exec logs written in the threads running the actions? Seems like that would possibly have huge contention issues.
@meisterT +1 on the thanks for the detailed analysis.

**Reducing the size**

For our use case, most of the content is inputs and actual_outputs, so artifact indexing seems most promising for our needs. Leaner fields could also be useful, but we'd see the most benefit if you focus on inputs/outputs vs. per-action outputs. Hashing the command line and env vars could save some space, but reconstructing those things may be somewhat annoying (the command line requires you to check out exactly that source code to reconstruct via aquery) or hard (env vars on a CI bot that's been deleted are going to take some non-trivial work to recreate, probably more than they're worth). For my $0.02 I'd just leave them alone. And I find that many of those one-per-action lines are useful for understanding the context ("did we rebuild this action despite no changed inputs?").

Compression SGTM.

**Performance overhead**

Anything to improve the overhead of creating this content is welcome. Having less content to create, and compressing that content, might help if the bottleneck is the writes to the disk.

**Different formats**

Your observation that we could have a single (efficient to create and store) output format that then gets post-processed into other formats (including stable ones) - whether for comparison against another log or just for reading to understand what happened in the current execution - seems right to me.

**How to compare protos**

I agree that a tool is essential. The one you referenced is a good start, but would benefit from enhancements such as analyzing which changed inputs are outputs from another action in the current run. An action for which all changed inputs are outputs from other actions is far less interesting when you're trying to understand why you rebuilt something you didn't expect to, so the tool would benefit from the ability to 1) focus on the non-transitive changes, and 2) walk easily down through the transitive changes to find interesting non-transitive changes.
The ability to filter in/out branches of that graph would also be helpful; I could imagine saying "I want to see all non-transitive changes, but not the ones that chain up into dependency B, or only the ones that chain up into A and C." I could also imagine that it might be valuable to understand cases where a deeply nested dependency was rebuilt due to changed inputs but produced unmodified output, and so didn't contribute to rebuilds of artifacts higher in the dependency graph; the tool might help show that there are multiple distinct subsets of the graph.

I'm not sure I agree that the tool shouldn't work from the JSON content. Having the analysis tool highlight aspects of the content in human-readable formats, and then being able to go find the raw content in human-readable format, seems like a better workflow than having the tool be the only way to interact with the content. The tool should handle all the variations in the data model (no matter how lean/rich the user chooses to capture), but JSON seems like an OK format for the reasons you touched on in your "Different formats" section.

In analyzing execution logs this week, I've found some value in having a file per action, for easy comparison or examination of that particular action, vs. having to diff the entire file or open it and search down to the specific part I'm interested in. It probably wouldn't matter with a 5MB file, and it was annoying but not worse with a 1GB file, but it's high-friction with our full 100+GB files. Doing that could also help with the memory usage required for sorting (my experience was that I needed the JVM to have an Xmx around 100GB for a 100GB sorted exec log, due I believe to the sort at the end), since sorting would be just a matter of reading an index to determine how to concatenate the individual files rather than something that has to happen in memory.
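The transitive-vs-non-transitive distinction discussed in this thread can be sketched as follows (a hypothetical data model, assuming both logs have already been parsed into plain dictionaries): a changed input that is itself the output of another action in the build is transitive noise, while a changed input that no action produces is a root cause worth examining.

```python
# Sketch of separating "root cause" changes from transitive ones.
# Inputs: changed input paths per action, and a map from output path
# to the action that produces it. All names are illustrative, not
# Bazel's actual data model.
def root_causes(changed_inputs_by_action, producer_of):
    """Return, per action, the changed inputs NOT produced by another
    action in this build, i.e. the non-transitive changes."""
    result = {}
    for action, changed in changed_inputs_by_action.items():
        direct = [p for p in changed if p not in producer_of]
        if direct:
            result[action] = direct
    return result

changed = {
    "link app": ["bin/foo.o"],      # only a transitive change
    "compile foo": ["src/foo.cc"],  # a genuine source change
}
producers = {"bin/foo.o": "compile foo"}
assert root_causes(changed, producers) == {"compile foo": ["src/foo.cc"]}
```

Walking from `link app` down through `producer_of` to `compile foo` is the "walk easily down through the transitive changes" step; inverting the map gives the graph needed for the branch filtering described above.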
+1 on data available through BES, or streaming through some remote protocol. There's already a good deal of challenge around getting the bazel client to perform well (in the face of memory issues, startup times, dependency fetching, etc.) - only offering this info via client-generated local files could very well mean that users need to supply:
The various streaming protocols that bazel supports (BES, RBE) have been a godsend for doing this sort of analysis/reporting across builds - I'm sure that if this info were available in the build event stream (maybe with verbosity flags/options as needed) then the community would write the backends to store/aggregate the data appropriately.
While trying to debug what changed between two builds, I discovered something else that makes the current implementation hard to use. When an action is not a cache hit, we get information about the spawn. I'm fine with having differences in the log in that case.
We were discussing today at BazelCon Community Day an idea to be able to test cache determinism in every pull request. Leaving it here in case it is useful. Assuming remote cache is enabled, and that it contains artifacts for the latest commit in the main git branch (the usual):
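A minimal sketch of such a determinism check (illustrative only; it assumes the outputs of the two builds have been materialized into two directories): hash every output file from both builds and flag the paths whose digests differ.

```python
import hashlib
import pathlib

def digest_tree(root):
    """Map each file under `root` (relative path) to its SHA-256 digest."""
    root = pathlib.Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def nondeterministic_outputs(build_a_dir, build_b_dir):
    """Return the output paths present in both builds whose contents differ."""
    a, b = digest_tree(build_a_dir), digest_tree(build_b_dir)
    return sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
```

In a real PR check, the CI job would fail (or comment) when `nondeterministic_outputs` is non-empty, since those artifacts would never get remote cache hits across machines.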
…optimize creation of unsorted logs.

Since we're about to introduce a new log format (see #18643), let's take this opportunity to clean up the command line API. Avoid a pointless copy of the execution log, if no sorting is requested, by writing directly into the requested output binary path. If a temporary file is necessary, write it to a well-known path in the output base, and delete it after conversions are successful; this ensures that large temporary files don't accumulate in /tmp.

RELNOTES[INC]: Delete the --experimental_execution_log_file flag. Use --execution_log_binary_file in conjunction with --noexecution_log_sort instead.

PiperOrigin-RevId: 580132026
Change-Id: Icca827260176b3ea804384d2bed4a56ce85cfc14
They're fairly verbose and represent a significant portion of the log, much more so once we switch to a compact format. It should be possible to uniquely identify spawns through a combination of mnemonic, target label and output path. Progress on bazelbuild/bazel#18643. PiperOrigin-RevId: 582273738
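The identity scheme described in this change could be sketched as follows (field names are illustrative, not the actual proto fields): a tuple of mnemonic, target label, and primary output path serves as a stable key for matching spawns across two logs.

```python
# Sketch: a spawn identity key usable for cross-log comparison, built
# from the three fields named in the commit message. The dictionary
# field names here are made up for illustration.
def spawn_key(spawn):
    return (spawn["mnemonic"], spawn["label"], spawn["primary_output"])

s = {
    "mnemonic": "CppCompile",
    "label": "//foo:foo",
    "primary_output": "bin/foo/foo.o",
}
assert spawn_key(s) == ("CppCompile", "//foo:foo", "bin/foo/foo.o")
```

A comparison tool can then index each log as `{spawn_key(s): s}` and join the two maps instead of relying on log order.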
We now unconditionally emit the full spawn metrics, which already include the wall time; the extra field is wasteful and potentially confusing. Progress on bazelbuild/bazel#18643. PiperOrigin-RevId: 582282615
… reflects its purpose. Also amend misleading or plainly incorrect comments. In particular, the runner field is always populated, even for cache hits, and spawns are only missing when the persistent action cache is hit ("local" is ambiguous, as we also use it in certain places to refer to the disk cache). Note that, while the runner and cache_hit fields currently overlap in purpose, their separate existence is better aligned with potential future plans to promote cache lookups to a strategy (see bazelbuild/bazel#19904). Progress on bazelbuild/bazel#18643. PiperOrigin-RevId: 582294420
Status update: Bazel 7.1.0 will introduce an `--experimental_execution_log_compact_file` flag that produces the execution log in a new compact format.

We designed this new format so that the log can be produced with minimal runtime overhead (aiming for 2-3% at most, often less) and have a much smaller file size (a reduction of 100x relative to the old format can be expected, sometimes more). As a drawback, the new format is not designed with human readability in mind, and is expected to be analyzed with the aid of specialized tools. As a starting point, 7.1.0 will also provide an offline conversion tool.

We look forward to your feedback regarding this new format during the 7.x cycle, and might still make some minor adjustments in response to it (hence the experimental nature), although we'll endeavor to minimize disruption to early adopters. We expect Bazel 8.0.0 to declare the new format stable and to deprecate the older formats.
We have seen this pattern with recent log files. They are written as protobuf, so they're not friendly for human consumption. What if we include a way to render them in a human-readable form? For reference, the JSON output is a lot more user-friendly.

Separate question: given that this is supposed to be "compact", could we include this log in BES?
I also want to ask what Son asked at the end: can we get this referenced in the BES and uploaded to the remote cache? Similar to the execution deps graph when that is enabled?
Yes, we can include it in the BES. I'll prepare a PR.
@tjgq I started to play with the new compact exec log today and it's very nice. One small piece of feedback is that
+1 to also adding a way to upload this by default without providing a value to the flag.
@tjgq and I discussed this, and he's not sure it would ever be fine for this to always be collected by default. But there could be a flag, like
Since this is an experimental flag, I think we can just change it to a tri-state flag:
The default value could still be the current one. From my observation, this flag benefits folks who rely on Bazel's incremental state tracking a lot more. For folks who rely on a separate query system (i.e. bazel-diff or target-determinator) to decide what to build and test, the number of actions is usually higher and thus the log size is also a bit higher (1-2 GBs for some runs). To be clear, I don't think the increase in action count is caused by using a separate query system, but there usually is some correlation there. Therefore, I do agree that we should not enable the compact log by default, at least not yet.
There would be no way to unset it then in a different
🤔 perhaps you could explicitly unset it with the
Runfiles trees are now represented with a custom `RunfilesTree` message in the compact execution log. This allows using `InputSet`s to represent all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an `InputSet`. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields.

By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. `Spawn`s), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on #18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

Closes #23321.

PiperOrigin-RevId: 676773599
Change-Id: I010653681ffa44557142bf25009e9178b5d68515
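The array-based bookkeeping this change enables could be sketched as follows (a hypothetical reconstruction of the consumer side; the entry shapes are illustrative, not the real proto messages): since IDs are consecutive, a plain list indexed by ID suffices, and nested `InputSet`s can be flattened by recursive, memoized expansion that preserves order and skips entries already seen.

```python
# Sketch of a consumer flattening nested InputSet-like entries. Because
# every entry's ID equals its position in the stream, `entries` is just
# a list. The "kind"/"input_ids" shapes below are made up for illustration.
def expand_input_set(entries, set_id, memo=None):
    """Flatten an input set into an ordered, de-duplicated list of file IDs."""
    if memo is None:
        memo = {}
    if set_id in memo:
        return memo[set_id]
    out, seen = [], set()
    for child_id in entries[set_id].get("input_ids", []):
        child = entries[child_id]
        sub = ([child_id] if child["kind"] == "file"
               else expand_input_set(entries, child_id, memo))
        for f in sub:
            if f not in seen:  # first occurrence wins, order preserved
                seen.add(f)
                out.append(f)
    memo[set_id] = out
    return out

entries = [
    {"kind": "file", "path": "a.txt"},        # ID 0
    {"kind": "file", "path": "b.txt"},        # ID 1
    {"kind": "set", "input_ids": [0]},        # ID 2
    {"kind": "set", "input_ids": [2, 1, 0]},  # ID 3 (0 appears twice)
]
assert expand_input_set(entries, 3) == [0, 1]
```

The memo makes repeated expansion of shared nested sets cheap, which matters for runfiles trees reused across many test spawns.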
To anyone already using the compact execution log: please note that (in c2f539c) we had to make backwards-incompatible changes to the format in order to fix a flaw in the original design that made logging of large runfiles trees inefficient. The conversion tool has been updated to match.

I intend to declare the format stable (i.e., remove the `experimental` prefix).
Fixes bazelbuild#18643. PiperOrigin-RevId: 678723961 Change-Id: I3da378d2009de2632228d1e914ae451c96cf5b75
A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
@tjgq Was the change for including the compact exec log in BES merged? I assumed it was available and tried to find it in our BES stream, but couldn't. Is that not the case?
It should be added to the BES as long as you explicitly enable it via
Description of the feature request:
Sometimes, it's useful to be able to dig into why a particular action was not cached by remote execution.
The usual culprits are a changed input file, different input files or a different command line. Bazel provides ample tools to query what it would run (`aquery`) and what it did (`--execution_log_*_file` or the BEP), but neither of these is amenable to post-facto debugging: `aquery` requires extra conscious effort.

So it would be nice to be able to do some sort of logging to answer questions like "why did this build done two days ago not reuse the cached actions from that other build done 30 minutes before it".
This doesn't necessarily need to be done within Bazel (although that's the particular hammer I have); I could also imagine logging on the remote execution side, where it is logged which actions are done by which Bazel invocations.
What underlying problem are you trying to solve with this feature?
Being able to tell why a particular action was not cached so that future cache hit rates are higher.
Which operating system are you running Bazel on?
No response
What is the output of `bazel info release`?
No response
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response