GH-43693: [C++][Acero] Support AVX2 swiss join decoding #43832
Conversation
@github-actions crossbow submit -g cpp
@ursabot please benchmark
Benchmark runs are scheduled for commit e2af277. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit e2af277. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Before formalizing this PR, I think I can use some help from @pitrou @cyb70289 @wgtmac @mapleFU @felipecrv. I'm benchmarking, and the result surprises me. The scalar version:
The AVX2 version:
Take
What I need help with: Is my assumption reasonable? Or is it just the case on my hardware (I temporarily have difficulty accessing other Intel machines)? Or could the AVX2 code I wrote be improved further? Or is it simply a problem with the benchmark itself? BTW, if benchmarking using this PR as-is (the slow AVX2 paths are intentionally avoided), we get positive results: about 50% improvement for the AVX2 version. Thanks in advance!
On my AMD 3800X, without the ARROW_USER_SIMD_LEVEL=NONE flag:
With the ARROW_USER_SIMD_LEVEL=NONE flag:
I'm too sleepy to test it on Icelake... will do it tomorrow.
Thanks a lot for running it @mapleFU! Much appreciated! I think we still see certain cases where the scalar code beats the vectorized one (though not by as much as on my machine).
This is on my other desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake), with similar symptoms (possibly because it is also Coffee Lake, like my MBP). The scalar version:
The AVX2 version:
Why does this benchmark use OpenMP?
It seems that this benchmark exercises implementation details that are not visible through Acero's public APIs, hence it has to implement its own threading code. And (I guess) the original author then used OpenMP to avoid some of that boring (and possibly long) threading code.
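As a rough illustration (purely hypothetical, not the benchmark's actual code; `DecodeBatch` and `DecodeAllBatches` are made-up names), driving an internal decoding routine across threads takes a single OpenMP pragma, versus hand-written thread management:

```cpp
#include <cstdint>

// Hypothetical stand-in for an internal, non-public decoding routine.
void DecodeBatch(int64_t batch_index) {
  // ... decode the rows of batch `batch_index` ...
  (void)batch_index;
}

void DecodeAllBatches(int64_t num_batches) {
  // One pragma distributes the iterations across threads, avoiding explicit
  // thread creation, work partitioning, and joining (compile with -fopenmp).
  #pragma omp parallel for
  for (int64_t i = 0; i < num_batches; ++i) {
    DecodeBatch(i);
  }
}
```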
OK, I've got something new. The bad AVX2 gather performance seems strongly related to the "Gather Data Sampling" vulnerability [1] (CVE-2022-40982, aka "Downfall") and its mitigation [2]. The CPU in question is apparently in the affected model list, and the mitigation updates the microcode, causing a significant performance drop, especially for AVX2 gathers. Luckily, this mitigation can be easily disabled. The benchmark results show that gather performance without the mitigation is much better, and almost always beats the scalar version:
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html
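For context, here is a minimal sketch (illustrative only, not Arrow's actual code; the helper names are made up) contrasting the AVX2 gather instruction class penalized by this mitigation with its scalar equivalent:

```cpp
#include <immintrin.h>

#include <cstdint>

// One vpgatherdd: loads base[idx[0]] .. base[idx[7]] in a single instruction.
// This is the instruction family slowed down by the GDS microcode mitigation.
inline __m256i GatherEight(const int32_t* base, const int32_t* idx) {
  __m256i vindex = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
  return _mm256_i32gather_epi32(base, vindex, /*scale=*/4);
}

// Scalar equivalent: eight independent loads that the CPU can pipeline freely,
// which is why it can win on mitigated models despite being "unvectorized".
inline void GatherEightScalar(const int32_t* base, const int32_t* idx,
                              int32_t* out) {
  for (int i = 0; i < 8; ++i) out[i] = base[idx[i]];
}
```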
Nice finding! |
// Benchmarking shows that when the data element width is <= 8, the scalar code
// almost always outperforms the vectorized code - about 2X~3X faster when the
// whole data set falls into L1~LLC, with the ratio going down to about 1:1 as
// the data size increases and most of the accesses hit main memory. This is
// possibly because decoding is mostly copying scattered pieces of data, and
// there is not enough computation to pay off the cost of the heavy gather
// instructions.
// For fixed length 0 (boolean column), the vectorized code wins by batching 8
// bits into a single byte instead of modifying the same byte 8 times in the
// scalar code.
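To make the boolean case concrete, here is a minimal sketch (a hypothetical helper, not the PR's actual code; shown at SSE2 level for brevity, though the PR targets AVX2) of packing 8 byte-sized flags into one output byte with a movemask, rather than read-modify-writing the destination byte eight times:

```cpp
#include <emmintrin.h>  // SSE2

#include <cstdint>

// Pack 8 byte-sized flags (each 0 or 1) into a single byte, one bit per flag.
inline uint8_t PackBools8(const uint8_t* flags) {
  // Load the 8 flags into the low 64 bits of a vector register.
  __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(flags));
  // Bytes equal to 1 become 0xFF, zeros stay 0x00.
  __m128i nonzero = _mm_cmpgt_epi8(v, _mm_setzero_si128());
  // movemask collects the top bit of each byte: one output bit per flag.
  return static_cast<uint8_t>(_mm_movemask_epi8(nonzero) & 0xFF);
}
```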
As my comment #43832 (comment) says, it is a microcode update from a vulnerability mitigation that causes this unexpected performance issue. Given that it seems only very recent (post-2022) Intel models can get away with it, most legacy models will suffer. So I think I'll just keep using the scalar version for fixed lengths between 1 and 8 - of course, I can reorganize the code a bit and update the comment to explain the reason.
@pitrou @mapleFU If you are still following, what do you think?
Given that it doesn't affect AMD processors nor recent Intel processors, we could simply keep the optimization enabled.
OK, that's good for me too. I've removed the fixed-length-specific check (and the comment).
I'm getting the final benchmarking numbers and will un-draft the PR afterwards.
+1 from me. Should this be noted in the release notes?
Nice reminder!
@raulcd I think I can put something like "the performance improvement on specific CPU models (blablabla) may not be as significant as expected due to blablabla" in the PR description. Is there something we should do to ensure that it appears in the coming release notes?
Thanks.
I've put something in the "Are there any user-facing changes?" section of the PR description. Quote:
No changes other than a positive performance improvement. Users can expect such improvement for hash-join-related workloads. Nevertheless, the degree of improvement depends heavily not only on the workload but also on the CPU model. For Intel CPUs from Skylake to Icelake/Tigerlake, which suffer degraded AVX2 gather performance because of an Intel vulnerability mitigation (detailed in #43832 (comment)), the improvement is less significant - single-digit percentages. Other models, e.g. AMD and the most recent Intel CPUs, can achieve improvements of up to 30%.
// without checking buffer bounds (useful with SIMD or fixed size memory loads
// and stores).
//
static int NumRowsToSkip(const RowTableImpl& rows, int column_id, int num_rows,
This is not used anywhere so far, and it doesn't seem useful in the future either.
These are the final benchmark numbers, obtained on my desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake).
BM_RowArray_DecodeScalar
AVX2 w/ Mitigation
AVX2 w/o Mitigation
BM_HashJoinBasic_HeavyBuildPayloadScalar
AVX2 w/ Mitigation
AVX2 w/o Mitigation
I've cleared up all the confusion and gotten the code/benchmarks ready. @pitrou @mapleFU @felipecrv @cyb70289 @wgtmac Would you please help review? Thanks!
Can we make sure this benchmark doesn't run for too long?
Done. Reduced the max row size of that benchmark to 512k. The max execution time is now within 1 second. |
uint64_t total_rows = 0;
for (auto _ : st) {
  st.PauseTiming();
Is it actually relevant to pause timing here?
I guess not; the code in between is trivial. I'll remove them.
Updated.
        default_memory_pool()));
    total_rows += batch.length;
  }
  st.counters["rows/sec"] = benchmark::Counter(total_rows, benchmark::Counter::kIsRate);
Should this simply be `state.SetItemsProcessed`?
Thank you! Updated.
  st.counters["rows/sec"] = benchmark::Counter(total_rows, benchmark::Counter::kIsRate);
}
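For reference, a minimal sketch of the suggested pattern, assuming the google/benchmark API (the benchmark name `BM_Decode` and the batch size are placeholders, not the PR's actual code): `SetItemsProcessed` lets the framework report an items-per-second rate, replacing the hand-rolled `kIsRate` counter.

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

static void BM_Decode(benchmark::State& st) {
  int64_t total_rows = 0;
  for (auto _ : st) {
    // ... decode one batch here; suppose it yields `batch_length` rows ...
    int64_t batch_length = 1024;  // placeholder for the decoded batch size
    benchmark::DoNotOptimize(batch_length);
    total_rows += batch_length;
  }
  // Reported as items/sec in the output, equivalent to the manual counter.
  st.SetItemsProcessed(total_rows);
}
BENCHMARK(BM_Decode);
BENCHMARK_MAIN();
```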
template <typename... Args>
What are the `Args` for?
Nothing. Removed. Thanks for pointing this out.
@github-actions crossbow submit -g cpp
Revision: 1be8e4a Submitted crossbow builds: ursacomputing/crossbow @ actions-71c56e957f
The CI failures are unrelated; I'll merge. Thank you @zanmato1984!
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2bd2e35. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
You can find the background in #43693.
By looking at how `Visit_avx2`/`VisitNulls_avx2`'s non-SIMD counterparts (`Visit`/`VisitNulls`) are used, I found they are solely for decoding rows from the build side of the join. So I added AVX2 versions of those decoding methods and wired up `Visit_avx2`/`VisitNulls_avx2`.
What changes are included in this PR?
- Added `Visit*_avx2` functions to decode the fixed-length/offsets/var-length/nulls parts of the row table.
- Wired the row-decoding paths to the `Visit*_avx2` functions (see the sketch below).
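A minimal sketch of the dispatch shape this implies (the enum, names, and signatures are illustrative assumptions, not Arrow's exact internals):

```cpp
#include <cstdint>

enum class SimdLevel { kNone, kAvx2 };

// Illustrative scalar and AVX2 row-decoding entry points (hypothetical names).
void Visit(const uint8_t* rows, int64_t num_rows) { /* scalar decode */ }
void Visit_avx2(const uint8_t* rows, int64_t num_rows) { /* AVX2 decode */ }

// The AVX2 path is taken only when the detected (or user-capped, e.g. via
// ARROW_USER_SIMD_LEVEL) SIMD level allows it; otherwise fall back to scalar.
void Decode(const uint8_t* rows, int64_t num_rows, SimdLevel level) {
  if (level == SimdLevel::kAvx2) {
    Visit_avx2(rows, num_rows);
  } else {
    Visit(rows, num_rows);
  }
}
```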
Are these changes tested?
No new tests needed.
The benchmarking results are a bit complicated; I put them in comment #43832 (comment).
Are there any user-facing changes?
No changes other than a positive performance improvement. Users can expect such improvement for hash-join-related workloads. Nevertheless, the degree of improvement depends heavily not only on the workload but also on the CPU model. For Intel CPUs from Skylake to Icelake/Tigerlake, which suffer degraded AVX2 gather performance because of an Intel vulnerability mitigation (detailed in #43832 (comment)), the improvement is less significant - single-digit percentages. Other models, e.g. AMD and the most recent Intel CPUs, can achieve improvements of up to 30%.