GH-43693: [C++][Acero] Support AVX2 swiss join decoding #43832
Conversation
@github-actions crossbow submit -g cpp
@ursabot please benchmark
Benchmark runs are scheduled for commit e2af277. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit e2af277. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Before formalizing this PR, I think I can use some help from @pitrou @cyb70289 @wgtmac @mapleFU @felipecrv. I'm benchmarking, and the result surprises me. The scalar version:
The AVX2 version:
Take
What I need help with: Is my assumption reasonable? Or is it just the case on my hardware (I temporarily have difficulty accessing other Intel machines)? Or could the AVX2 code I wrote be improved further? Or is it simply a problem with the benchmark itself? BTW, if benchmarking using this PR as-is (the slow AVX2 paths are intentionally avoided), we get positive results: about 50% improvement for the AVX2 version. Thanks in advance!
On my AMD 3800X, without the ARROW_USER_SIMD_LEVEL=NONE flag:
With the ARROW_USER_SIMD_LEVEL=NONE flag:
I'm too sleepy to test it on Icelake... will do it tomorrow.
Thanks a lot for running it @mapleFU! Much appreciated! I think we still see certain cases where the scalar code beats the vectorized one (though not by as much as on my machine).
This is on my other desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake), with similar symptoms (possibly because it is also Coffee Lake, like my MBP). The scalar version:
The AVX2 version:
Why does this benchmark use OpenMP?
It seems that this benchmark exercises implementation details that are not visible through Acero's public APIs, hence it has to implement its own threading code. And (I guess) the original author then used OpenMP to avoid some of that boring (and possibly long) threading code.
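As a rough illustration (purely hypothetical, not the benchmark's actual code; `DecodeBatch` and `DecodeAllBatches` are made-up names), driving an internal decoding routine across threads takes a single OpenMP pragma, versus hand-written thread management:

```cpp
#include <cstdint>

// Hypothetical stand-in for an internal, non-public decoding routine.
void DecodeBatch(int64_t batch_index) {
  // ... decode the rows of batch `batch_index` ...
  (void)batch_index;
}

void DecodeAllBatches(int64_t num_batches) {
  // One pragma distributes the iterations across threads, avoiding explicit
  // thread creation, work partitioning, and joining (compile with -fopenmp).
  #pragma omp parallel for
  for (int64_t i = 0; i < num_batches; ++i) {
    DecodeBatch(i);
  }
}
```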
OK, I've got something new. The bad AVX2 gather performance seems strongly related to the "Gather Data Sampling" vulnerability [1] (CVE-2022-40982, aka "Downfall") and its mitigation [2]. The CPU in question is apparently in the affected model list, and the mitigation updates the microcode, causing a significant performance drop, especially for AVX2 gathers. Luckily, this mitigation can be easily disabled. The benchmark results show that gather performance without the mitigation is much better, and almost always beats the scalar version:
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html
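For context, here is a minimal sketch (illustrative only, not Arrow's actual code; the helper names are made up) contrasting the AVX2 gather instruction class penalized by this mitigation with its scalar equivalent:

```cpp
#include <immintrin.h>

#include <cstdint>

// One vpgatherdd: loads base[idx[0]] .. base[idx[7]] in a single instruction.
// This is the instruction family slowed down by the GDS microcode mitigation.
inline __m256i GatherEight(const int32_t* base, const int32_t* idx) {
  __m256i vindex = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
  return _mm256_i32gather_epi32(base, vindex, /*scale=*/4);
}

// Scalar equivalent: eight independent loads that the CPU can pipeline freely,
// which is why it can win on mitigated models despite being "unvectorized".
inline void GatherEightScalar(const int32_t* base, const int32_t* idx,
                              int32_t* out) {
  for (int i = 0; i < 8; ++i) out[i] = base[idx[i]];
}
```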
Nice finding! |
// Benchmarking shows that when the data element width is <= 8, the scalar code
// almost always outperforms the vectorized code - about 2X~3X faster when the
// whole data set falls into L1~LLC, with the ratio going down to about 1:1 as
// the data size increases and most of the accesses hit main memory. This is
// possibly because decoding is mostly copying scattered pieces of data, and
// there is not enough computation to pay off the cost of the heavy gather
// instructions.
// For fixed length 0 (boolean column), the vectorized code wins by batching 8
// bits into a single byte instead of modifying the same byte 8 times in the
// scalar code.
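To make the boolean case concrete, here is a minimal sketch (a hypothetical helper, not the PR's actual code; shown at SSE2 level for brevity, though the PR targets AVX2) of packing 8 byte-sized flags into one output byte with a movemask, rather than read-modify-writing the destination byte eight times:

```cpp
#include <emmintrin.h>  // SSE2

#include <cstdint>

// Pack 8 byte-sized flags (each 0 or 1) into a single byte, one bit per flag.
inline uint8_t PackBools8(const uint8_t* flags) {
  // Load the 8 flags into the low 64 bits of a vector register.
  __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(flags));
  // Bytes equal to 1 become 0xFF, zeros stay 0x00.
  __m128i nonzero = _mm_cmpgt_epi8(v, _mm_setzero_si128());
  // movemask collects the top bit of each byte: one output bit per flag.
  return static_cast<uint8_t>(_mm_movemask_epi8(nonzero) & 0xFF);
}
```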
As my comment #43832 (comment) says, it is a microcode update from a vulnerability mitigation that causes this unexpected performance issue. Given that it seems only very recent (post-2022) Intel models can get away with it, most legacy models will suffer. So I think I'll just keep using the scalar version for fixed lengths between 1 and 8 - of course, I can reorganize the code a bit and update the comment to explain the reason.
@pitrou @mapleFU If you are still following, what do you think?
Given that it doesn't affect AMD processors nor recent Intel processors, we could simply keep the optimization enabled.
OK, that's good for me too. I've removed the fixed-length-specific check (and the comment).
I'm getting the final benchmarking numbers and will un-draft the PR afterwards.
+1 from me. Should this be noted in the release notes?
Nice reminder!
@raulcd I think I can put something like "the performance improvement on specific CPU models (blablabla) may not be as significant as expected due to blablabla" in the PR description. Is there something we should do to ensure that it appears in the coming release notes?
Thanks.
I've put something in the "Are there any user-facing changes?" section of the PR description. Quote:
No changes other than a positive performance improvement. Users can expect such improvement for hash-join-related workloads. Nevertheless, the degree of improvement depends heavily not only on the workload but also on the CPU model. For Intel CPUs from Skylake to Icelake/Tigerlake, which suffer degraded AVX2 gather performance because of an Intel vulnerability mitigation (detailed in #43832 (comment)), the improvement is less significant - single-digit percentages. Other models, e.g. AMD and the most recent Intel CPUs, can achieve improvements of up to 30%.
// without checking buffer bounds (useful with SIMD or fixed size memory loads
// and stores).
//
static int NumRowsToSkip(const RowTableImpl& rows, int column_id, int num_rows,
This is not used anywhere so far, and it doesn't seem useful in the future either.
These are the final benchmark numbers, obtained on my desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake).
BM_RowArray_DecodeScalar
AVX2 w/ Mitigation
AVX2 w/o Mitigation
BM_HashJoinBasic_HeavyBuildPayloadScalar
AVX2 w/ Mitigation
AVX2 w/o Mitigation
I've cleared up all the confusion and gotten the code/benchmarks ready. @pitrou @mapleFU @felipecrv @cyb70289 @wgtmac Would you please help review? Thanks!
Can we make sure this benchmark doesn't run for too long?
Done. Reduced the max row size of that benchmark to 512k. The max execution time is now within 1 second. |
uint64_t total_rows = 0;
for (auto _ : st) {
  st.PauseTiming();
Is it actually relevant to pause timing here?
I guess not; the code in between is trivial. I'll remove them.
Updated.
        default_memory_pool()));
    total_rows += batch.length;
  }
  st.counters["rows/sec"] = benchmark::Counter(total_rows, benchmark::Counter::kIsRate);
Should this simply be `state.SetItemsProcessed`?
Thank you! Updated.
  st.counters["rows/sec"] = benchmark::Counter(total_rows, benchmark::Counter::kIsRate);
}
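For reference, a minimal sketch of the suggested pattern, assuming the google/benchmark API (the benchmark name `BM_Decode` and the batch size are placeholders, not the PR's actual code): `SetItemsProcessed` lets the framework report an items-per-second rate, replacing the hand-rolled `kIsRate` counter.

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

static void BM_Decode(benchmark::State& st) {
  int64_t total_rows = 0;
  for (auto _ : st) {
    // ... decode one batch here; suppose it yields `batch_length` rows ...
    int64_t batch_length = 1024;  // placeholder for the decoded batch size
    benchmark::DoNotOptimize(batch_length);
    total_rows += batch_length;
  }
  // Reported as items/sec in the output, equivalent to the manual counter.
  st.SetItemsProcessed(total_rows);
}
BENCHMARK(BM_Decode);
BENCHMARK_MAIN();
```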
template <typename... Args>
What are the `Args` for?
Nothing. Removed. Thanks for pointing this out.
@github-actions crossbow submit -g cpp
Revision: 1be8e4a Submitted crossbow builds: ursacomputing/crossbow @ actions-71c56e957f
The CI failures are unrelated; I'll merge. Thank you @zanmato1984!
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2bd2e35. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
You can find the background in #43693.
By looking at how `Visit_avx2`/`VisitNulls_avx2`'s non-SIMD counterparts (`Visit`/`VisitNulls`) are used, I found they are solely for decoding rows from the build side of the join. So I added AVX2 versions of those decoding methods and wired up `Visit_avx2`/`VisitNulls_avx2`.
What changes are included in this PR?
- Added `Visit*_avx2` functions to decode the fixed-length/offsets/var-length/nulls parts of the row table.
- Wired the row-decoding paths to the `Visit*_avx2` functions (see the sketch below).
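A minimal sketch of the dispatch shape this implies (the enum, names, and signatures are illustrative assumptions, not Arrow's exact internals):

```cpp
#include <cstdint>

enum class SimdLevel { kNone, kAvx2 };

// Illustrative scalar and AVX2 row-decoding entry points (hypothetical names).
void Visit(const uint8_t* rows, int64_t num_rows) { /* scalar decode */ }
void Visit_avx2(const uint8_t* rows, int64_t num_rows) { /* AVX2 decode */ }

// The AVX2 path is taken only when the detected (or user-capped, e.g. via
// ARROW_USER_SIMD_LEVEL) SIMD level allows it; otherwise fall back to scalar.
void Decode(const uint8_t* rows, int64_t num_rows, SimdLevel level) {
  if (level == SimdLevel::kAvx2) {
    Visit_avx2(rows, num_rows);
  } else {
    Visit(rows, num_rows);
  }
}
```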
Are these changes tested?
No new tests needed.
The benchmarking results are a bit complicated; I put them in comment #43832 (comment).
Are there any user-facing changes?
No changes other than a positive performance improvement. Users can expect such improvement for hash-join-related workloads. Nevertheless, the degree of improvement depends heavily not only on the workload but also on the CPU model. For Intel CPUs from Skylake to Icelake/Tigerlake, which suffer degraded AVX2 gather performance because of an Intel vulnerability mitigation (detailed in #43832 (comment)), the improvement is less significant - single-digit percentages. Other models, e.g. AMD and the most recent Intel CPUs, can achieve improvements of up to 30%.