GH-43693: [C++][Acero] Support AVX2 swiss join decoding #43832

Merged (45 commits) on Dec 17, 2024


zanmato1984
Contributor

@zanmato1984 zanmato1984 commented Aug 26, 2024

Rationale for this change

You can find the background in #43693.

By looking at how the non-SIMD counterparts of Visit_avx2/VisitNulls_avx2 (Visit/VisitNulls) are used, I found that they serve solely to decode rows from the build side of the join. So I added AVX2 versions of those decoding methods and wired up Visit_avx2/VisitNulls_avx2.

What changes are included in this PR?

  1. Split the decoding methods into smaller pieces so that each of them can cooperate with its AVX2 version.
  2. Add concrete AVX2-specialized functions that use the Visit*_avx2 functions to decode the fixed-length/offsets/var-length/nulls parts of the row table.
  3. Fix some bugs in the original Visit*_avx2 functions.
  4. Add related benchmarks.
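
To make the "visit, then decode" structure in the list above concrete, here is a minimal scalar sketch. It assumes a toy row table in which every row has the same byte width; `ToyRowTable`, `VisitRows`, and `DecodeFixedLengthColumn` are hypothetical names for illustration and are not Arrow's actual row table or Visit API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified row table: all rows have the same width in bytes
// and are stored back to back. This is NOT Arrow's real row table class.
struct ToyRowTable {
  std::size_t row_width;       // bytes per row
  std::vector<uint8_t> bytes;  // row-major storage

  const uint8_t* row(uint32_t row_id) const {
    assert((static_cast<std::size_t>(row_id) + 1) * row_width <= bytes.size());
    return bytes.data() + static_cast<std::size_t>(row_id) * row_width;
  }
};

// Scalar "visit": call fn(i, pointer to the selected row) for each row id.
template <typename Fn>
void VisitRows(const ToyRowTable& table, const std::vector<uint32_t>& row_ids,
               Fn&& fn) {
  for (std::size_t i = 0; i < row_ids.size(); ++i) {
    fn(i, table.row(row_ids[i]));
  }
}

// Decode one fixed-length column: copy `field_width` bytes located at
// `field_offset` inside each selected row into a contiguous output buffer.
std::vector<uint8_t> DecodeFixedLengthColumn(const ToyRowTable& table,
                                             const std::vector<uint32_t>& row_ids,
                                             std::size_t field_offset,
                                             std::size_t field_width) {
  assert(field_offset + field_width <= table.row_width);
  std::vector<uint8_t> out(row_ids.size() * field_width);
  VisitRows(table, row_ids, [&](std::size_t i, const uint8_t* row) {
    std::memcpy(out.data() + i * field_width, row + field_offset, field_width);
  });
  return out;
}
```

Keeping the visiting loop separate from the per-row copy is what would let a vectorized visitor be swapped in without touching the decoder logic; how the PR actually splits these pieces is best read from the diff itself.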

Are these changes tested?

No new tests needed.

The benchmarking results are a bit complicated, so I put them in a comment: #43832 (comment).

Are there any user-facing changes?

No changes other than a performance improvement. Users can expect this improvement for hash-join-related workloads. However, the degree of improvement depends heavily not only on the workload but also on the CPU model. On Intel CPUs from Skylake to Ice Lake/Tiger Lake, whose AVX2 gather performance is degraded by one of Intel's vulnerability mitigations (detailed in #43832 (comment)), the improvement is less significant: single-digit percent. Other models, e.g. AMD and the most recent Intel CPUs, can see improvements of up to 30%.

@zanmato1984
Contributor Author

@github-actions crossbow submit -g cpp

@zanmato1984
Contributor Author

@ursabot please benchmark

@ursabot

ursabot commented Aug 26, 2024

Benchmark runs are scheduled for commit e2af277. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

This comment was marked as outdated.


Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit e2af277.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@zanmato1984 zanmato1984 changed the title GH-43693: [C++][Acero] Support AVX swiss join decoding GH-43693: [C++][Acero] Support AVX2 swiss join decoding Aug 29, 2024
@zanmato1984
Contributor Author

zanmato1984 commented Aug 30, 2024

Before formalizing this PR, I could use some help from @pitrou @cyb70289 @wgtmac @mapleFU @felipecrv.

I'm benchmarking BM_RowArray_Decode* (see the PR), using my 2019 Intel MBP (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz, Coffee Lake), on my other branch https://github.com/zanmato1984/arrow/tree/swiss-join-avx2-for-maple. This branch is based on the PR branch, with some code paths commented out in order to solely compare the performance between DecodeFixedLength + Visit (the scalar version) and DecodeFixedLength_avx2 + Visit_avx2 (the AVX2 version).

The result surprises me. The scalar version:

ARROW_USER_SIMD_LEVEL=NONE ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
The number of inputs is very large. BM_HashJoinBasic_PayloadSize will be repeated at least 125 times.
2024-08-30T16:44:41+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB
Load Average: 2.24, 2.35, 2.37
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         335756 ns       335771 ns         2088 rows/sec=195.178M/s
BM_RowArray_Decode/"int8"                                                                            222946 ns       222964 ns         3218 rows/sec=293.926M/s
BM_RowArray_Decode/"int16"                                                                           206334 ns       206310 ns         3450 rows/sec=317.653M/s
BM_RowArray_Decode/"int32"                                                                           212043 ns       211908 ns         3243 rows/sec=309.262M/s
BM_RowArray_Decode/"int64"                                                                           214047 ns       213696 ns         3204 rows/sec=306.674M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       325797 ns       325116 ns         2116 rows/sec=201.574M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       332284 ns       331637 ns         2106 rows/sec=197.611M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       332328 ns       331591 ns         2140 rows/sec=197.638M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       331975 ns       331820 ns         2150 rows/sec=197.501M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       367805 ns       367605 ns         1915 rows/sec=178.275M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      368041 ns       367993 ns         1905 rows/sec=178.088M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      527913 ns       527701 ns         1342 rows/sec=124.19M/s
BM_RowArray_DecodeBinary/max_length:32                                                              1645646 ns      1498623 ns          469 rows/sec=43.7302M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1751193 ns      1750436 ns          408 rows/sec=37.4392M/s
BM_RowArray_DecodeBinary/max_length:128                                                             2076064 ns      2073780 ns          350 rows/sec=31.6017M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     353249 ns       353196 ns         1985 rows/sec=185.549M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     226431 ns       226484 ns         3076 rows/sec=289.359M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     583054 ns       583129 ns         1233 rows/sec=112.385M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   455519 ns       455603 ns         1507 rows/sec=143.842M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   288495 ns       288417 ns         2450 rows/sec=227.223M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1335445 ns      1335260 ns          530 rows/sec=49.0803M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1403463 ns      1403317 ns          502 rows/sec=46.7001M/s

The AVX2 version:

./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
The number of inputs is very large. BM_HashJoinBasic_PayloadSize will be repeated at least 125 times.
2024-08-30T16:45:10+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB
Load Average: 3.41, 2.64, 2.47
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         275226 ns       275132 ns         2561 rows/sec=238.195M/s
BM_RowArray_Decode/"int8"                                                                            268566 ns       268506 ns         2654 rows/sec=244.072M/s
BM_RowArray_Decode/"int16"                                                                           265198 ns       265165 ns         2557 rows/sec=247.148M/s
BM_RowArray_Decode/"int32"                                                                           262050 ns       261924 ns         2709 rows/sec=250.207M/s
BM_RowArray_Decode/"int64"                                                                           260563 ns       260506 ns         2699 rows/sec=251.568M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       355187 ns       348711 ns         2019 rows/sec=187.935M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       350424 ns       350330 ns         2005 rows/sec=187.066M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       350453 ns       350352 ns         2024 rows/sec=187.055M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       349830 ns       349612 ns         1992 rows/sec=187.451M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       349569 ns       349349 ns         2002 rows/sec=187.592M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      351250 ns       351061 ns         1996 rows/sec=186.677M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      421615 ns       421506 ns         1633 rows/sec=155.478M/s
BM_RowArray_DecodeBinary/max_length:32                                                              1029484 ns      1029140 ns          685 rows/sec=63.6794M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1406671 ns      1406103 ns          495 rows/sec=46.6075M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1657173 ns      1655898 ns          432 rows/sec=39.5767M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     276695 ns       276781 ns         2546 rows/sec=236.775M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     263849 ns       263910 ns         2543 rows/sec=248.323M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     473701 ns       473962 ns         1532 rows/sec=138.271M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   466305 ns       466162 ns         1447 rows/sec=140.584M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   471840 ns       471721 ns         1496 rows/sec=138.927M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1065689 ns      1065185 ns          661 rows/sec=61.5245M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1214506 ns      1213996 ns          562 rows/sec=53.9829M/s

Take int16 for example: the scalar version is about 30% faster than the vectorized version. The same holds for the other fixed-length types narrower than 8 bytes. So I benchmarked (on the same machine) just the memory access pattern (gather several integers from scattered addresses, then store them together), scalar versus AVX2, using a much more compact benchmark (it only exists locally and is not published in any branch). I put my conclusion, which I'm not at all confident in, in the code comment quoted here:

  // Benchmarking shows that when the data element width is <= 8, the scalar code almost
  // always outperforms the vectorized code - about 2X~3X faster when the whole data set
  // falls into L1~LLC, and the ratio goes down to about 1:1 as the data size increases
  // when most of the accesses hit the main memory. This is possibly because that decoding
  // is mostly copying scattered pieces of data and there is not enough computation to pay
  // off the cost of the heavy gather instructions.
  // For fixed length 0 (boolean column), the vectorized code wins by batching 8 bits into
  // a single byte instead of modifying the same byte 8 times in the scalar code.
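
The boolean case described in the last two comment lines can be illustrated with a portable sketch. It only mimics the store pattern (one read-modify-write per bit versus one store per 8 bits) and is not the PR's code; `PackBitsOneAtATime` and `PackBitsBatched` are hypothetical names, and the bitmap is assumed to be least-significant-bit first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One-at-a-time style: every set bit does a read-modify-write on the output
// byte, so the same byte may be touched up to 8 times.
std::vector<uint8_t> PackBitsOneAtATime(const std::vector<uint8_t>& bools) {
  std::vector<uint8_t> out((bools.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < bools.size(); ++i) {
    if (bools[i] != 0) {
      out[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return out;
}

// Batched style: assemble 8 bits in a local variable, then store the byte once.
std::vector<uint8_t> PackBitsBatched(const std::vector<uint8_t>& bools) {
  std::vector<uint8_t> out((bools.size() + 7) / 8, 0);
  std::size_t i = 0;
  for (; i + 8 <= bools.size(); i += 8) {
    unsigned byte = 0;
    for (unsigned bit = 0; bit < 8; ++bit) {
      byte |= static_cast<unsigned>(bools[i + bit] != 0) << bit;
    }
    out[i / 8] = static_cast<uint8_t>(byte);
  }
  // Tail: fewer than 8 values remain, fall back to per-bit updates.
  for (; i < bools.size(); ++i) {
    if (bools[i] != 0) {
      out[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return out;
}
```

Both functions produce the same bitmap; only the number of writes to each output byte differs.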

What I need help with is this: Is my assumption reasonable? Or does it only hold on my hardware (I temporarily have difficulty accessing other Intel machines)? Or can the AVX2 code I wrote be improved further? Or is it simply a problem with the benchmark itself?
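
For anyone who wants to try the comparison on their own hardware, the access pattern in question (gather integers from scattered addresses, then store them contiguously) can be sketched as below. This is not the unpublished local benchmark mentioned above, just a minimal illustration with hypothetical function names. The AVX2 path is compiled only when the compiler defines `__AVX2__` (e.g. with `-mavx2`); otherwise the second function falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar gather: out[i] = base[indices[i]].
void GatherScalar(const int32_t* base, const int32_t* indices, std::size_t n,
                  int32_t* out) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = base[indices[i]];
  }
}

// Gathers 8 values per iteration with the AVX2 gather intrinsic when AVX2 is
// enabled at compile time; the remainder (or everything, without AVX2) is
// handled by the scalar loop.
void GatherMaybeAvx2(const int32_t* base, const int32_t* indices, std::size_t n,
                     int32_t* out) {
  std::size_t i = 0;
#if defined(__AVX2__)
  for (; i + 8 <= n; i += 8) {
    __m256i idx =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(indices + i));
    // Scale 4: indices count 32-bit elements, so byte offset = index * 4.
    __m256i vals = _mm256_i32gather_epi32(base, idx, 4);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), vals);
  }
#endif
  GatherScalar(base, indices + i, n - i, out + i);
}
```

Timing both functions over index sets that fit in L1, in the last-level cache, and in main memory would be one way to check whether the ratios quoted in the code comment reproduce on a given CPU.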

BTW, benchmarking with this PR as-is (where the slow AVX2 paths are intentionally avoided) gives positive results: about a 50% improvement for the AVX2 version.

Thanks in advance!

@mapleFU
Member

mapleFU commented Aug 30, 2024

On my AMD 3800X:

Without the ARROW_USER_SIMD_LEVEL=NONE flag:

BM_RowArray_Decode/"boolean"                                                                         226161 ns       226368 ns         3010 rows/sec=289.507M/s
BM_RowArray_Decode/"int8"                                                                            213483 ns       213692 ns         3251 rows/sec=306.68M/s
BM_RowArray_Decode/"int16"                                                                           216579 ns       216796 ns         3303 rows/sec=302.289M/s
BM_RowArray_Decode/"int32"                                                                           216269 ns       216471 ns         3369 rows/sec=302.742M/s
BM_RowArray_Decode/"int64"                                                                           222836 ns       223107 ns         3267 rows/sec=293.738M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       280235 ns       280491 ns         2543 rows/sec=233.644M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       385640 ns       386004 ns         1888 rows/sec=169.778M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       283350 ns       283581 ns         2481 rows/sec=231.098M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       275834 ns       276064 ns         2477 rows/sec=237.391M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       382956 ns       383228 ns         1797 rows/sec=171.008M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      360731 ns       360978 ns         2061 rows/sec=181.549M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      405461 ns       405901 ns         1729 rows/sec=161.456M/s
BM_RowArray_DecodeBinary/max_length:32                                                               870678 ns       871002 ns          801 rows/sec=75.2409M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1225297 ns      1225776 ns          552 rows/sec=53.4641M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1543498 ns      1544551 ns          454 rows/sec=42.4298M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     241292 ns       241565 ns         2919 rows/sec=271.293M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     222509 ns       222788 ns         3062 rows/sec=294.159M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     453397 ns       453855 ns         1587 rows/sec=144.396M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   320415 ns       320641 ns         2157 rows/sec=204.387M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   311626 ns       311862 ns         2221 rows/sec=210.141M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                   865312 ns       865693 ns          801 rows/sec=75.7023M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                   966808 ns       967218 ns          736 rows/sec=67.7562M/s

With the ARROW_USER_SIMD_LEVEL=NONE flag:

BM_RowArray_Decode/"boolean"                                                                         309518 ns       309748 ns         2336 rows/sec=211.575M/s
BM_RowArray_Decode/"int8"                                                                            219132 ns       219335 ns         3205 rows/sec=298.79M/s
BM_RowArray_Decode/"int16"                                                                           206815 ns       207000 ns         3375 rows/sec=316.594M/s
BM_RowArray_Decode/"int32"                                                                           211850 ns       212051 ns         3261 rows/sec=309.053M/s
BM_RowArray_Decode/"int64"                                                                           209969 ns       210150 ns         3284 rows/sec=311.849M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       307361 ns       307555 ns         2192 rows/sec=213.084M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       317631 ns       317849 ns         2211 rows/sec=206.183M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       314159 ns       314372 ns         2218 rows/sec=208.463M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       313868 ns       314084 ns         2162 rows/sec=208.655M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       373098 ns       373331 ns         1839 rows/sec=175.541M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      371984 ns       372230 ns         1830 rows/sec=176.061M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      481453 ns       481768 ns         1440 rows/sec=136.03M/s
BM_RowArray_DecodeBinary/max_length:32                                                              1311100 ns      1311531 ns          534 rows/sec=49.9683M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1788057 ns      1788600 ns          389 rows/sec=36.6404M/s
BM_RowArray_DecodeBinary/max_length:128                                                             2120766 ns      2121693 ns          308 rows/sec=30.8881M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     330122 ns       330365 ns         2142 rows/sec=198.372M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     215677 ns       215314 ns         3202 rows/sec=304.37M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     498039 ns       498533 ns         1431 rows/sec=131.456M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   433924 ns       434161 ns         1595 rows/sec=150.946M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   267960 ns       268163 ns         2638 rows/sec=244.385M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1214087 ns      1214602 ns          589 rows/sec=53.9559M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1335344 ns      1335821 ns          508 rows/sec=49.0597M/s

@mapleFU
Member

mapleFU commented Aug 30, 2024

I'm too sleepy to test it on Ice Lake. I'll do it tomorrow.

@zanmato1984
Contributor Author

Thanks a lot for running it @mapleFU ! Much appreciated!

I think we still see certain cases where the scalar code beats the vectorized code (though not by as much as on my machine).

@zanmato1984
Contributor Author

This is on my other desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake). It shows a similar symptom (possibly because it is also Coffee Lake, like my MBP).

The scalar version:

ARROW_USER_SIMD_LEVEL=NONE ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
2024-09-01T00:32:49+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (8 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 12288 KiB (x1)
Load Average: 0.46, 3.08, 2.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         345809 ns       345761 ns         1896 rows/sec=189.538M/s
BM_RowArray_Decode/"int8"                                                                            267577 ns       267553 ns         2678 rows/sec=244.942M/s
BM_RowArray_Decode/"int16"                                                                           237106 ns       237094 ns         2872 rows/sec=276.409M/s
BM_RowArray_Decode/"int32"                                                                           243701 ns       243697 ns         2874 rows/sec=268.92M/s
BM_RowArray_Decode/"int64"                                                                           239891 ns       239886 ns         2709 rows/sec=273.192M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       316511 ns       316471 ns         2260 rows/sec=207.081M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       310797 ns       310759 ns         2165 rows/sec=210.887M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       324059 ns       324020 ns         2251 rows/sec=202.256M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       311799 ns       311753 ns         2244 rows/sec=210.214M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       364401 ns       364346 ns         2016 rows/sec=179.87M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      349918 ns       349868 ns         1997 rows/sec=187.313M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      507058 ns       506962 ns         1427 rows/sec=129.27M/s
BM_RowArray_DecodeBinary/max_length:32                                                              1261872 ns      1261465 ns          554 rows/sec=51.9515M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1585243 ns      1584698 ns          462 rows/sec=41.3549M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1822727 ns      1822343 ns          384 rows/sec=35.962M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     379210 ns       379150 ns         1843 rows/sec=172.847M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     275680 ns       275657 ns         2693 rows/sec=237.741M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     599291 ns       599291 ns         1257 rows/sec=109.354M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   506824 ns       506710 ns         1376 rows/sec=129.334M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   360611 ns       360579 ns         2123 rows/sec=181.75M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1182248 ns      1181939 ns          603 rows/sec=55.447M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1395220 ns      1394817 ns          529 rows/sec=46.9847M/s

The AVX2 version:

./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
2024-09-01T00:33:14+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (8 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 12288 KiB (x1)
Load Average: 0.64, 2.91, 2.31
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         262395 ns       262341 ns         2665 rows/sec=249.808M/s
BM_RowArray_Decode/"int8"                                                                            263405 ns       263397 ns         2716 rows/sec=248.807M/s
BM_RowArray_Decode/"int16"                                                                           248155 ns       248106 ns         2821 rows/sec=264.141M/s
BM_RowArray_Decode/"int32"                                                                           257523 ns       257519 ns         2825 rows/sec=254.486M/s
BM_RowArray_Decode/"int64"                                                                           245070 ns       245020 ns         2824 rows/sec=267.468M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       330801 ns       330759 ns         1980 rows/sec=198.135M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       327874 ns       327839 ns         2134 rows/sec=199.9M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       331278 ns       331242 ns         1947 rows/sec=197.846M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       328647 ns       328611 ns         2112 rows/sec=199.43M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       335129 ns       335101 ns         1937 rows/sec=195.568M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      347641 ns       347601 ns         2097 rows/sec=188.535M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      408356 ns       408265 ns         1731 rows/sec=160.521M/s
BM_RowArray_DecodeBinary/max_length:32                                                               985453 ns       985190 ns          716 rows/sec=66.5202M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1250078 ns      1249727 ns          560 rows/sec=52.4394M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1467264 ns      1466902 ns          474 rows/sec=44.6758M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     266468 ns       266456 ns         2365 rows/sec=245.95M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     246552 ns       246557 ns         2803 rows/sec=265.8M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     437251 ns       437236 ns         1504 rows/sec=149.885M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   455065 ns       455005 ns         1603 rows/sec=144.031M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   445927 ns       445798 ns         1560 rows/sec=147.006M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1033287 ns      1032913 ns          702 rows/sec=63.4468M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1193991 ns      1193373 ns          544 rows/sec=54.9158M/s

@pitrou
Member

pitrou commented Sep 2, 2024

Why does arrow-acero-hash-join-benchmark require OpenMP?

@zanmato1984
Contributor Author

Why does arrow-acero-hash-join-benchmark require OpenMP?

It seems that this benchmark exercises some implementation details that are not exposed through Acero's public APIs, so it has to implement its own threading code. My guess is that the original author used OpenMP to avoid writing that boring (and possibly long) threading code by hand.

@zanmato1984
Contributor Author

zanmato1984 commented Sep 3, 2024

This is on my other desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake). It shows a similar symptom (possibly because it is also Coffee Lake, like my MBP).

The scalar version:

ARROW_USER_SIMD_LEVEL=NONE ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
2024-09-01T00:32:49+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (8 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 12288 KiB (x1)
Load Average: 0.46, 3.08, 2.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         345809 ns       345761 ns         1896 rows/sec=189.538M/s
BM_RowArray_Decode/"int8"                                                                            267577 ns       267553 ns         2678 rows/sec=244.942M/s
BM_RowArray_Decode/"int16"                                                                           237106 ns       237094 ns         2872 rows/sec=276.409M/s
BM_RowArray_Decode/"int32"                                                                           243701 ns       243697 ns         2874 rows/sec=268.92M/s
BM_RowArray_Decode/"int64"                                                                           239891 ns       239886 ns         2709 rows/sec=273.192M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       316511 ns       316471 ns         2260 rows/sec=207.081M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       310797 ns       310759 ns         2165 rows/sec=210.887M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       324059 ns       324020 ns         2251 rows/sec=202.256M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       311799 ns       311753 ns         2244 rows/sec=210.214M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       364401 ns       364346 ns         2016 rows/sec=179.87M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      349918 ns       349868 ns         1997 rows/sec=187.313M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      507058 ns       506962 ns         1427 rows/sec=129.27M/s
BM_RowArray_DecodeBinary/max_length:32                                                              1261872 ns      1261465 ns          554 rows/sec=51.9515M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1585243 ns      1584698 ns          462 rows/sec=41.3549M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1822727 ns      1822343 ns          384 rows/sec=35.962M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     379210 ns       379150 ns         1843 rows/sec=172.847M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     275680 ns       275657 ns         2693 rows/sec=237.741M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     599291 ns       599291 ns         1257 rows/sec=109.354M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   506824 ns       506710 ns         1376 rows/sec=129.334M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   360611 ns       360579 ns         2123 rows/sec=181.75M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1182248 ns      1181939 ns          603 rows/sec=55.447M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1395220 ns      1394817 ns          529 rows/sec=46.9847M/s

The AVX2 version:

./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
2024-09-01T00:33:14+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (8 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 12288 KiB (x1)
Load Average: 0.64, 2.91, 2.31
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         262395 ns       262341 ns         2665 rows/sec=249.808M/s
BM_RowArray_Decode/"int8"                                                                            263405 ns       263397 ns         2716 rows/sec=248.807M/s
BM_RowArray_Decode/"int16"                                                                           248155 ns       248106 ns         2821 rows/sec=264.141M/s
BM_RowArray_Decode/"int32"                                                                           257523 ns       257519 ns         2825 rows/sec=254.486M/s
BM_RowArray_Decode/"int64"                                                                           245070 ns       245020 ns         2824 rows/sec=267.468M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       330801 ns       330759 ns         1980 rows/sec=198.135M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       327874 ns       327839 ns         2134 rows/sec=199.9M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       331278 ns       331242 ns         1947 rows/sec=197.846M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       328647 ns       328611 ns         2112 rows/sec=199.43M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       335129 ns       335101 ns         1937 rows/sec=195.568M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      347641 ns       347601 ns         2097 rows/sec=188.535M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      408356 ns       408265 ns         1731 rows/sec=160.521M/s
BM_RowArray_DecodeBinary/max_length:32                                                               985453 ns       985190 ns          716 rows/sec=66.5202M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1250078 ns      1249727 ns          560 rows/sec=52.4394M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1467264 ns      1466902 ns          474 rows/sec=44.6758M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     266468 ns       266456 ns         2365 rows/sec=245.95M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     246552 ns       246557 ns         2803 rows/sec=265.8M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     437251 ns       437236 ns         1504 rows/sec=149.885M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   455065 ns       455005 ns         1603 rows/sec=144.031M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   445927 ns       445798 ns         1560 rows/sec=147.006M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1033287 ns      1032913 ns          702 rows/sec=63.4468M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1193991 ns      1193373 ns          544 rows/sec=54.9158M/s

OK, got something new.

The poor AVX2 gather performance seems strongly related to the mitigation [2] for the "Gather Data Sampling" vulnerability [1] (CVE-2022-40982, a.k.a. "Downfall").

The CPU quoted above is apparently on the affected model list. For these models, the mitigation updates the microcode and causes a significant slowdown, especially for AVX2 gather. Luckily, the mitigation can be disabled easily. With it disabled, the benchmark shows that gather performance is much better and beats the scalar version almost always:

 ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
2024-09-03T21:52:08+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (8 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 12288 KiB (x1)
Load Average: 0.56, 0.22, 0.08
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         204759 ns       204759 ns         3398 rows/sec=320.059M/s
BM_RowArray_Decode/"int8"                                                                            198094 ns       198094 ns         3476 rows/sec=330.827M/s
BM_RowArray_Decode/"int16"                                                                           199424 ns       199445 ns         3490 rows/sec=328.587M/s
BM_RowArray_Decode/"int32"                                                                           201338 ns       201351 ns         3476 rows/sec=325.477M/s
BM_RowArray_Decode/"int64"                                                                           207006 ns       207010 ns         3406 rows/sec=316.579M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       329304 ns       329258 ns         2137 rows/sec=199.038M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       328043 ns       327986 ns         2116 rows/sec=199.811M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       327691 ns       327650 ns         2137 rows/sec=200.015M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       329935 ns       329892 ns         2133 rows/sec=198.656M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       337341 ns       337283 ns         2085 rows/sec=194.302M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      335654 ns       335592 ns         2066 rows/sec=195.282M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      412375 ns       412278 ns         1698 rows/sec=158.958M/s
BM_RowArray_DecodeBinary/max_length:32                                                               859282 ns       858982 ns          815 rows/sec=76.2938M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1126945 ns      1126548 ns          620 rows/sec=58.1733M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1346772 ns      1346336 ns          521 rows/sec=48.6766M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     225688 ns       225646 ns         3105 rows/sec=290.433M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     222248 ns       222233 ns         3148 rows/sec=294.894M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     448432 ns       448380 ns         1564 rows/sec=146.159M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   289385 ns       289347 ns         2420 rows/sec=226.493M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   289905 ns       289839 ns         2413 rows/sec=226.109M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                   874143 ns       873785 ns          801 rows/sec=75.0013M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1037058 ns      1036678 ns          674 rows/sec=63.2164M/s

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html
[2] https://access.redhat.com/solutions/7027704
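When interpreting numbers like the ones above, it helps to know whether the machine is running with the mitigation. Below is a minimal sketch, assuming a Linux kernel recent enough to expose the `gather_data_sampling` entry under `/sys/devices/system/cpu/vulnerabilities/`; the enum and function names are hypothetical, and the prefix matching is based on my understanding of the status strings the kernel reports ("Not affected", "Mitigation: ...", "Vulnerable...").

```cpp
#include <fstream>
#include <string>

// Hypothetical classification of the kernel's one-line status report.
enum class GdsStatus { kNotAffected, kMitigated, kVulnerable, kUnknown };

// Classifies a status line by its leading keyword.
GdsStatus ClassifyGdsStatus(const std::string& line) {
  if (line.rfind("Not affected", 0) == 0) return GdsStatus::kNotAffected;
  if (line.rfind("Mitigation", 0) == 0) return GdsStatus::kMitigated;
  if (line.rfind("Vulnerable", 0) == 0) return GdsStatus::kVulnerable;
  return GdsStatus::kUnknown;
}

// Reads the sysfs entry; returns kUnknown when the file is missing
// (older kernel, non-Linux system) or empty.
GdsStatus ReadGdsStatus(
    const std::string& path =
        "/sys/devices/system/cpu/vulnerabilities/gather_data_sampling") {
  std::ifstream in(path);
  std::string line;
  if (!in || !std::getline(in, line)) return GdsStatus::kUnknown;
  return ClassifyGdsStatus(line);
}
```

A result of `kMitigated` would correspond to the slower AVX2 gather numbers, while `kNotAffected` or `kVulnerable` would correspond to the faster ones.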

@mapleFU
Member

mapleFU commented Sep 3, 2024

Nice finding!

Comment on lines 400 to 407
// Benchmarking shows that when the data element width is <= 8, the scalar code almost
// always outperforms the vectorized code - about 2X~3X faster when the whole data set
// falls into L1~LLC, and the ratio goes down to about 1:1 as the data size increases
// when most of the accesses hit the main memory. This is possibly because that decoding
// is mostly copying scattered pieces of data and there is not enough computation to pay
// off the cost of the heavy gather instructions.
// For fixed length 0 (boolean column), the vectorized code wins by batching 8 bits into
// a single byte instead of modifying the same byte 8 times in the scalar code.
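The boolean case described in the comment above can be sketched in plain C++ as follows. This is only a stand-in for the idea (one store per output byte instead of eight read-modify-write updates), not the AVX2 code in this PR; the function names are hypothetical, and the least-significant-bit-first order is assumed to match Arrow's bitmap convention.

```cpp
#include <cstdint>

// Updates the output byte once per input bit: up to eight read-modify-write
// operations on the same byte. Assumes `out` is zero-initialized.
void PackBitsOneByOne(const uint8_t* bits, int num_bits, uint8_t* out) {
  for (int i = 0; i < num_bits; ++i) {
    if (bits[i] & 1u) {
      out[i / 8] = static_cast<uint8_t>(out[i / 8] | (1u << (i % 8)));
    }
  }
}

// Accumulates up to eight bits in a local variable and stores the byte once.
void PackBitsBatched(const uint8_t* bits, int num_bits, uint8_t* out) {
  for (int i = 0; i < num_bits; i += 8) {
    const int count = (num_bits - i < 8) ? (num_bits - i) : 8;
    uint8_t byte = 0;
    for (int j = 0; j < count; ++j) {
      byte = static_cast<uint8_t>(byte | ((bits[i + j] & 1u) << j));
    }
    out[i / 8] = byte;
  }
}
```

Both functions produce the same bitmap; the batched form is the shape that maps naturally onto producing eight results per SIMD iteration.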
Contributor Author

As I noted in #43832 (comment), this unexpected performance issue is caused by a microcode update that mitigates a vulnerability. Since it seems that only very recent (after 2022) Intel models are unaffected, most legacy models will suffer. So I think I'll just keep using the scalar version for fixed lengths between 1 and 8 - of course, I can reorganize the code a bit and update the stated reason.

@pitrou @mapleFU If you are still following, what do you think?

Member

Given that it affects neither AMD processors nor recent Intel processors, we could simply keep the optimization enabled.

Contributor Author

OK, that works for me too. I've removed the fixed-length-specific check (and the comment).

I'm getting the final benchmarking numbers and will un-draft the PR afterwards.

Member

+1 from me. Should this be noted in the release notes?

Contributor Author

Nice reminder!

@raulcd I think I can put something like "the performance improvement on specific CPU models (blablabla) may not be as significant as expected due to blablabla" in the PR description. Is there something we should do to ensure that will appear in the coming release notes?

Thanks.

Contributor Author

I've put something in the "Are there any user-facing changes?" section of the PR description. Quote:

No changes other than a performance improvement. Users can expect this improvement for hash-join-related workloads. However, the degree of improvement depends heavily not only on the workload but also on the CPU model. For Intel CPUs from Skylake to Icelake/Tigerlake, which suffer degraded AVX2 gather performance because of an Intel vulnerability mitigation (detailed in #43832 (comment)), the improvement is less significant - single digit percent. Other models, e.g. AMD and the most recent Intel, can achieve a better improvement of up to 30%.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Sep 3, 2024
// without checking buffer bounds (useful with SIMD or fixed size memory loads
// and stores).
//
static int NumRowsToSkip(const RowTableImpl& rows, int column_id, int num_rows,
Contributor Author

Not used anywhere so far, and it doesn't seem likely to be useful in the future either.

@zanmato1984
Contributor Author

zanmato1984 commented Sep 4, 2024

These are the final benchmark numbers, obtained on my desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake).

  1. A mitigation for an Intel CPU vulnerability severely impacts AVX2 gather performance (detailed in #43832 (comment)), so I show the AVX2 results both with and without this mitigation, each compared to the scalar version.
  2. The result with the mitigation (a smaller, but still positive, improvement) is what we can expect on Intel CPU models in the affected model list (Skylake to Icelake/Tigerlake), whereas the result without the mitigation (a larger improvement) is what we can expect on AMD and the most recent Intel CPU models (post-Icelake/Tigerlake).
  3. The BM_RowArray_Decode* benchmarks measure only the performance of decoding the row table in swiss join. The results show that the AVX2 version is about 1.23X ~ 1.34X faster than the scalar version with the mitigation, and 2.48X ~ 3.73X faster without it. They also show how badly the mitigation affects gather performance.
  4. In most existing end-to-end hash join benchmarks, decoding takes only a trivial portion of the whole computation, so they show no significant improvement. I therefore created a dedicated benchmark, BM_HashJoinBasic_HeavyBuildPayload*, to demonstrate a workload that is dominated by decoding. In this end-to-end workload, the AVX2 version is about 1.08X ~ 1.11X faster than the scalar version with the mitigation, and 1.25X ~ 1.38X faster without it. Note that this comparison is between "all parts of the hash join using AVX2" and "only decoding using scalar code, with the remaining parts using AVX2" (by hacking the code), to isolate how much the AVX2 decoding alone improves this particular workload.
  5. The complete numbers are listed below.

BM_RowArray_Decode

Scalar

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         352029 ns       343594 ns         1815 rows/sec=190.734M/s
BM_RowArray_Decode/"int8"                                                                            265491 ns       259906 ns         2696 rows/sec=252.149M/s
BM_RowArray_Decode/"int16"                                                                           249665 ns       245055 ns         2913 rows/sec=267.43M/s
BM_RowArray_Decode/"int32"                                                                           245361 ns       241379 ns         2905 rows/sec=271.503M/s
BM_RowArray_Decode/"int64"                                                                           241913 ns       238468 ns         2943 rows/sec=274.817M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       311951 ns       308030 ns         2284 rows/sec=212.755M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       318551 ns       315073 ns         2178 rows/sec=207.999M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       310632 ns       307678 ns         2263 rows/sec=212.999M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       311063 ns       308484 ns         2055 rows/sec=212.442M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       344859 ns       342367 ns         1883 rows/sec=191.418M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      344291 ns       342134 ns         1872 rows/sec=191.548M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      492296 ns       489691 ns         1490 rows/sec=133.829M/s
BM_RowArray_DecodeBinary/max_length:32                                                              1352644 ns      1346006 ns          520 rows/sec=48.6885M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1640900 ns      1633598 ns          431 rows/sec=40.117M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1913187 ns      1905681 ns          338 rows/sec=34.3893M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     365524 ns       364301 ns         1969 rows/sec=179.892M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     253069 ns       252345 ns         2787 rows/sec=259.704M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     536577 ns       535134 ns         1304 rows/sec=122.465M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   502928 ns       501643 ns         1376 rows/sec=130.641M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   317641 ns       316956 ns         2208 rows/sec=206.764M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1297863 ns      1295153 ns          565 rows/sec=50.6002M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1317590 ns      1315064 ns          533 rows/sec=49.8341M/s

AVX2 w/ Mitigation

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                         201413 ns       201738 ns         3299 rows/sec=324.853M/s
BM_RowArray_Decode/"int8"                                                                            187945 ns       188248 ns         3726 rows/sec=348.132M/s
BM_RowArray_Decode/"int16"                                                                           190845 ns       191145 ns         3832 rows/sec=342.854M/s
BM_RowArray_Decode/"int32"                                                                           179556 ns       179823 ns         3904 rows/sec=364.442M/s
BM_RowArray_Decode/"int64"                                                                           187440 ns       187723 ns         3897 rows/sec=349.104M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       250259 ns       250605 ns         2299 rows/sec=261.508M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       278307 ns       278680 ns         2512 rows/sec=235.162M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       261561 ns       261912 ns         2785 rows/sec=250.218M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       261767 ns       262106 ns         2786 rows/sec=250.033M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       255717 ns       256038 ns         2738 rows/sec=255.958M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      268011 ns       268339 ns         2676 rows/sec=244.225M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      313420 ns       313799 ns         2231 rows/sec=208.844M/s
BM_RowArray_DecodeBinary/max_length:32                                                               906652 ns       907516 ns          803 rows/sec=72.2136M/s
BM_RowArray_DecodeBinary/max_length:64                                                              1114609 ns      1115591 ns          629 rows/sec=58.7446M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1373871 ns      1375226 ns          512 rows/sec=47.654M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     198752 ns       198992 ns         3222 rows/sec=329.335M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     181090 ns       181312 ns         3859 rows/sec=361.448M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     358132 ns       358548 ns         1800 rows/sec=182.779M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   371824 ns       372187 ns         1949 rows/sec=176.081M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   357208 ns       357540 ns         1956 rows/sec=183.294M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                   923022 ns       923784 ns          780 rows/sec=70.9419M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1048616 ns      1049434 ns          669 rows/sec=62.4479M/s

AVX2 w/o Mitigation

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                          92114 ns        92105 ns         7569 rows/sec=711.523M/s
BM_RowArray_Decode/"int8"                                                                             90194 ns        90195 ns         8095 rows/sec=726.593M/s
BM_RowArray_Decode/"int16"                                                                            87798 ns        87810 ns         8006 rows/sec=746.33M/s
BM_RowArray_Decode/"int32"                                                                            88995 ns        89094 ns         7867 rows/sec=735.571M/s
BM_RowArray_Decode/"int64"                                                                            94848 ns        96306 ns         7533 rows/sec=680.489M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       205153 ns       207921 ns         3369 rows/sec=315.192M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       215245 ns       217836 ns         3354 rows/sec=300.846M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       207253 ns       209473 ns         3333 rows/sec=312.857M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       216148 ns       218212 ns         3335 rows/sec=300.327M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       212980 ns       214786 ns         3255 rows/sec=305.117M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      215780 ns       217417 ns         3232 rows/sec=301.426M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      270711 ns       272527 ns         2567 rows/sec=240.471M/s
BM_RowArray_DecodeBinary/max_length:32                                                               694130 ns       698295 ns         1022 rows/sec=93.8501M/s
BM_RowArray_DecodeBinary/max_length:64                                                               924115 ns       929161 ns          707 rows/sec=70.5314M/s
BM_RowArray_DecodeBinary/max_length:128                                                             1133587 ns      1139122 ns          610 rows/sec=57.5311M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     110807 ns       111357 ns         6546 rows/sec=588.514M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     106122 ns       106605 ns         6568 rows/sec=614.746M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     323609 ns       324828 ns         2231 rows/sec=201.753M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   162969 ns       163526 ns         4278 rows/sec=400.762M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   171423 ns       171962 ns         4251 rows/sec=381.102M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                   707237 ns       709097 ns          980 rows/sec=92.4203M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                   868808 ns       870891 ns          807 rows/sec=75.2505M/s

BM_HashJoinBasic_HeavyBuildPayload

Scalar

------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:1        410398 ns       410328 ns         1746 rows/sec=2.49557M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:8       3337775 ns      3337484 ns          216 rows/sec=2.45454M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:64     33379106 ns     33374225 ns           21 rows/sec=1.96367M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:512   491662619 ns    491562474 ns            2 rows/sec=1.06657M/s

AVX2 w/ Mitigation

------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:1        371253 ns       371414 ns         1740 rows/sec=2.75703M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:8       3115015 ns      3115910 ns          230 rows/sec=2.62909M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:64     30945790 ns     30952150 ns           23 rows/sec=2.11733M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:512   439258398 ns    439322419 ns            2 rows/sec=1.1934M/s

AVX2 w/o Mitigation

------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:1        296886 ns       296747 ns         2378 rows/sec=3.45075M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:8       2425116 ns      2424949 ns          289 rows/sec=3.37822M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:64     25787602 ns     25783258 ns           27 rows/sec=2.5418M/s
BM_HashJoinBasic_HeavyBuildPayload/HashTable krows:512   388581838 ns    388508014 ns            2 rows/sec=1.34949M/s
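
To make the three tables easier to compare, here is a small sketch that turns the Time column into a percentage reduction relative to the scalar run. The numbers are copied from the tables above; the `Row` struct and the helper names `TimeReductionPercent` and `PrintReductions` are made up for this illustration and are not part of the PR.

```cpp
#include <cassert>
#include <cstdio>

// "Time" column in nanoseconds, copied from the three
// BM_HashJoinBasic_HeavyBuildPayload tables above.
// (Hypothetical struct, used only for this illustration.)
struct Row {
  int krows;
  double scalar_ns;
  double avx2_mitigated_ns;
  double avx2_unmitigated_ns;
};

constexpr Row kRows[] = {
    {1, 410398.0, 371253.0, 296886.0},
    {8, 3337775.0, 3115015.0, 2425116.0},
    {64, 33379106.0, 30945790.0, 25787602.0},
    {512, 491662619.0, 439258398.0, 388581838.0},
};

// Percentage by which new_ns is shorter than base_ns.
double TimeReductionPercent(double base_ns, double new_ns) {
  assert(base_ns > 0.0);
  return 100.0 * (base_ns - new_ns) / base_ns;
}

// Prints one line per hash table size.
void PrintReductions() {
  for (const Row& r : kRows) {
    std::printf("krows:%d  w/ mitigation: %.1f%%  w/o mitigation: %.1f%%\n",
                r.krows,
                TimeReductionPercent(r.scalar_ns, r.avx2_mitigated_ns),
                TimeReductionPercent(r.scalar_ns, r.avx2_unmitigated_ns));
  }
}
```

Calling `PrintReductions()` from a `main` prints one line per hash table size. With these figures the reduction comes out at roughly 7% to 11% with the mitigation enabled and roughly 21% to 28% without it.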

zanmato1984 marked this pull request as ready for review September 4, 2024 08:30
@zanmato1984
Contributor Author

I've cleared up all the confusion, and the code and benchmarks are ready. @pitrou @mapleFU @felipecrv @cyb70289 @wgtmac Could you please help review? Thanks!

@pitrou
Member

pitrou commented Sep 4, 2024

Can we make sure BM_HashJoinBasic_HeavyBuildPayload doesn't take too much time? It seems one of the parameterizations has a 5 second runtime.

@zanmato1984
Contributor Author

Can we make sure BM_HashJoinBasic_HeavyBuildPayload doesn't take too much time? It seems one of the parameterizations has a 5 second runtime.

Done. I reduced the maximum number of rows in that benchmark to 512k. The longest run now finishes within 1 second.


uint64_t total_rows = 0;
for (auto _ : st) {
  st.PauseTiming();
Member

Is it actually relevant to pause timing here?

Contributor Author

I guess not; the code in between is trivial. I'll remove the timing pauses.

Contributor Author

Updated.

default_memory_pool()));
total_rows += batch.length;
}
st.counters["rows/sec"] = benchmark::Counter(total_rows, benchmark::Counter::kIsRate);
Member

Should this simply be state.SetItemsProcessed?

Contributor Author

Thank you! Updated.
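
As background for the counter discussed in this thread: a rate counter reports an accumulated count divided by elapsed time, so multiplying a reported rows/sec value by the per-iteration time recovers the number of rows each iteration processed. The sketch below does that for the Scalar BM_HashJoinBasic_HeavyBuildPayload results quoted earlier. The `Sample` struct and the `RowsPerIteration` and `RelativeError` helpers are hypothetical names, and the reading that each iteration handles krows × 1024 rows is an inference from the numbers, not something stated in the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// One row of the "Scalar" table: CPU time per iteration (ns) and the
// reported rows/sec counter. (Hypothetical struct for illustration.)
struct Sample {
  int krows;
  double cpu_ns;
  double rows_per_sec;
};

constexpr Sample kScalar[] = {
    {1, 410328.0, 2.49557e6},
    {8, 3337484.0, 2.45454e6},
    {64, 33374225.0, 1.96367e6},
    {512, 491562474.0, 1.06657e6},
};

// rate = rows / seconds, hence rows = rate * seconds.
double RowsPerIteration(const Sample& s) {
  return s.rows_per_sec * s.cpu_ns * 1e-9;
}

// Relative deviation of `actual` from a non-zero `expected` value.
double RelativeError(double actual, double expected) {
  assert(expected != 0.0);
  return std::fabs(actual - expected) / std::fabs(expected);
}

// Prints the recovered row count next to the assumed krows * 1024.
void PrintRecoveredRows() {
  for (const Sample& s : kScalar) {
    std::printf("krows:%d  recovered rows: %.0f  assumed: %d\n", s.krows,
                RowsPerIteration(s), s.krows * 1024);
  }
}
```

With the CPU column as the time base, the recovered counts land within a small fraction of a percent of krows × 1024 for all four sizes. Google Benchmark's `State::SetItemsProcessed` reports the same kind of figure under its built-in `items_per_second` counter, which is presumably the motivation for the suggestion.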

st.counters["rows/sec"] = benchmark::Counter(total_rows, benchmark::Counter::kIsRate);
}

template <typename... Args>
Member

What are the Args for?

Contributor Author

Nothing. Removed. Thanks for pointing this out.

@pitrou
Member

pitrou commented Dec 17, 2024

@github-actions crossbow submit -g cpp


Revision: 1be8e4a

Submitted crossbow builds: ursacomputing/crossbow @ actions-71c56e957f

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou
Member

pitrou commented Dec 17, 2024

The CI failures are unrelated, so I'll merge. Thank you @zanmato1984!

pitrou merged commit 2bd2e35 into apache:main Dec 17, 2024
36 of 37 checks passed
pitrou removed the awaiting committer review label Dec 17, 2024

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2bd2e35.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.
