fix: Resolve S3 crash issue #2120

Open
wants to merge 4 commits into
base: master

dante-lee

Related issue: #1912

Starting with v0.35.0, a change in the build toolchain appears to have introduced the problem described in #1912 (a "pure virtual method called" abort). After investigating, I believe I have identified the root cause.

While using the S3 filesystem, I captured the program's backtrace (shown below).

#0  0x00007f92169f700b in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007f92169d6859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007f9215c2b8d1 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#3  0x00007f9215c3737c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#4  0x00007f9215c373e7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#5  0x00007f9215c38145 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#6  0x00007f9134d4b920 in Aws::Http::CurlHandleContainer::~CurlHandleContainer (this=0x33c734b8, __in_chrg=<optimized out>) at external/aws-sdk-cpp/aws-cpp-sdk-core/source/http/curl/CurlHandleContainer.cpp:27
        logSystem = 0x33a128e0
        logSystem = <optimized out>
        logStream = <optimized out>
        handle = <optimized out>
        __for_range = <optimized out>
        __for_begin = <optimized out>
        __for_end = <optimized out>
        logSystem = <optimized out>
        logStream = <optimized out>
#7  0x00007f9134cede6c in Aws::Http::CurlHttpClient::~CurlHttpClient (this=0x33c73450, __in_chrg=<optimized out>) at external/aws-sdk-cpp/aws-cpp-sdk-core/include/aws/core/http/curl/CurlHttpClient.h:26
No locals.
#8  0x00007f9134ce5a71 in __gnu_cxx::new_allocator<Aws::Http::CurlHttpClient>::destroy<Aws::Http::CurlHttpClient> (this=0x33c73450, __p=0x33c73450) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/ext/new_allocator.h:153
No locals.
#9  0x00007f9134ce59ff in std::allocator_traits<Aws::Allocator<Aws::Http::CurlHttpClient> >::_S_destroy<Aws::Allocator<Aws::Http::CurlHttpClient>, Aws::Http::CurlHttpClient> (__a=..., __p=0x33c73450) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/alloc_traits.h:260
No locals.
#10 0x00007f9134ce595a in std::allocator_traits<Aws::Allocator<Aws::Http::CurlHttpClient> >::destroy<Aws::Http::CurlHttpClient> (__a=..., __p=0x33c73450) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/alloc_traits.h:364
No locals.
#11 0x00007f9134ce57b5 in std::_Sp_counted_ptr_inplace<Aws::Http::CurlHttpClient, Aws::Allocator<Aws::Http::CurlHttpClient>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x33c73440) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:557
No locals.
#12 0x00007f913477b77a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x33c73440) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:155
No locals.
#13 0x00007f9134778de5 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x3397ab90, __in_chrg=<optimized out>) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:730
No locals.
#14 0x00007f9134b69422 in std::__shared_ptr<Aws::Http::HttpClient, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x3397ab88, __in_chrg=<optimized out>) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:1169
No locals.
#15 0x00007f9134b6943e in std::shared_ptr<Aws::Http::HttpClient>::~shared_ptr (this=0x3397ab88, __in_chrg=<optimized out>) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr.h:103
No locals.
#16 0x00007f9134cf0218 in Aws::Internal::AWSHttpResourceClient::~AWSHttpResourceClient (this=0x3397ab50, __in_chrg=<optimized out>) at external/aws-sdk-cpp/aws-cpp-sdk-core/source/internal/AWSHttpResourceClient.cpp:98
No locals.
#17 0x00007f9134cf1344 in Aws::Internal::EC2MetadataClient::~EC2MetadataClient (this=0x3397ab50, __in_chrg=<optimized out>) at external/aws-sdk-cpp/aws-cpp-sdk-core/source/internal/AWSHttpResourceClient.cpp:183
No locals.
#18 0x00007f9134cdd7c5 in __gnu_cxx::new_allocator<Aws::Internal::EC2MetadataClient>::destroy<Aws::Internal::EC2MetadataClient> (this=0x3397ab50, __p=0x3397ab50) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/ext/new_allocator.h:153
No locals.
#19 0x00007f9134cdd79f in std::allocator_traits<Aws::Allocator<Aws::Internal::EC2MetadataClient> >::_S_destroy<Aws::Allocator<Aws::Internal::EC2MetadataClient>, Aws::Internal::EC2MetadataClient> (__a=..., __p=0x3397ab50) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/alloc_traits.h:260
No locals.
#20 0x00007f9134cdd768 in std::allocator_traits<Aws::Allocator<Aws::Internal::EC2MetadataClient> >::destroy<Aws::Internal::EC2MetadataClient> (__a=..., __p=0x3397ab50) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/alloc_traits.h:364
No locals.
#21 0x00007f9134cdd64f in std::_Sp_counted_ptr_inplace<Aws::Internal::EC2MetadataClient, Aws::Allocator<Aws::Internal::EC2MetadataClient>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x3397ab40) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:557
No locals.
#22 0x00007f913477b77a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x3397ab40) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:155
No locals.
#23 0x00007f9134778de5 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f913586a078 <Aws::Internal::s_ec2metadataClient+8>, __in_chrg=<optimized out>) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:730
No locals.
#24 0x00007f9134ca8fc6 in std::__shared_ptr<Aws::Internal::EC2MetadataClient, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f913586a070 <Aws::Internal::s_ec2metadataClient>, __in_chrg=<optimized out>) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:1169
No locals.
#25 0x00007f9134ca8fe2 in std::shared_ptr<Aws::Internal::EC2MetadataClient>::~shared_ptr (this=0x7f913586a070 <Aws::Internal::s_ec2metadataClient>, __in_chrg=<optimized out>) at /dt9/usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr.h:103
No locals.
#26 0x00007f92169fa8a7 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#27 0x00007f92169faa60 in exit () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#28 0x00007f92169d808a in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#29 0x0000000000630e2e in _start ()

Upon inspection of CurlHandleContainer.cpp:27, I found that it uses a logging macro defined here. This macro depends on static variables, which can be unsafe to access during program termination. Since the "pure virtual method called" abort occurs during program exit, it is likely caused by the destruction order of static variables: the logging-related statics are destroyed before the CurlHandleContainer destructor runs.
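To illustrate the suspected ordering hazard, here is a minimal sketch. It does not use aws-sdk-cpp: `Logger`, `Client`, and `DestructionTrace` are hypothetical stand-ins, and automatic objects in a scope are used instead of real statics so the order can be observed before the program exits. Both kinds of object are destroyed in reverse order of construction, so an object constructed before the logger is destroyed after it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a logger held in a static variable.
struct Logger {
    Logger(std::vector<std::string>& events, bool& alive)
        : events_(events), alive_(alive) { alive_ = true; }
    ~Logger() {
        alive_ = false;
        events_.push_back("logger: destroyed");
    }
    std::vector<std::string>& events_;
    bool& alive_;
};

// Hypothetical stand-in for an object whose destructor wants to log.
struct Client {
    Client(std::vector<std::string>& events, const bool& loggerAlive)
        : events_(events), loggerAlive_(loggerAlive) {}
    ~Client() {
        // A real destructor would call into the logger here; this model only
        // records whether the logger still exists at that point.
        events_.push_back(loggerAlive_ ? "client: logged"
                                       : "client: logger already gone");
    }
    std::vector<std::string>& events_;
    const bool& loggerAlive_;
};

// Builds both objects in the requested order and returns what their
// destructors recorded when the scope ended.
std::vector<std::string> DestructionTrace(bool loggerConstructedFirst) {
    std::vector<std::string> events;
    bool loggerAlive = false;
    if (loggerConstructedFirst) {
        Logger logger(events, loggerAlive);
        Client client(events, loggerAlive);
    } else {
        Client client(events, loggerAlive);
        Logger logger(events, loggerAlive);
    }
    assert(events.size() == 2);  // both destructors have run
    return events;
}
```

Calling `DestructionTrace(false)` models the failing case: the logger goes away first and the client's destructor then finds nothing to log to. In the real crash there is no such check, so the destructor would be calling through an already-destroyed object.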

As a temporary fix, I've removed the logging macros from CurlHandleContainer's destructor using Bazel's patch_cmds. This resolves the immediate issue, but it may not be the best long-term solution. I'd appreciate a review of this approach, given our dependency on tensorflow==2.16 and the S3 filesystem functionality.

Regarding the build.Dockerfile modification: the installed tensorflow version should match the one specified in tensorflow_io/python/ops/version_ops.py. The original script installed the latest tensorflow version, so I changed it to install the specific version defined in version_ops.py.

@dante-lee
Author

@mihaimaruseac @yongtang

Could you review this PR? I closed #2119 and opened this one in its place.

@dante-lee
Author

Sorry for the frequent updates. I just updated AWSLogSystem's destructor to call ShutdownAWSLogging(). The current aws-sdk-cpp still uses static variables for its logger, and calling ShutdownAWSLogging() also resolves the issue.
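The idea behind that change can be sketched as a logger that unregisters itself when it is destroyed, so later call sites see an empty slot instead of a dangling object. This is a self-contained model under stated assumptions, not the aws-sdk-cpp API: `LogSystem`, `InitializeLogging`, `ShutdownLogging`, `TryLog`, and `SelfUnregisteringLogSystem` are all hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical minimal logging interface; not the aws-sdk-cpp one.
class LogSystem {
public:
    virtual ~LogSystem() = default;
    virtual void Log(const std::string& message) = 0;
};

// Plays the role of a static logger slot (an assumption of this model).
static LogSystem* g_activeLogSystem = nullptr;

void InitializeLogging(LogSystem* logSystem) { g_activeLogSystem = logSystem; }

// Safe to call repeatedly; only the registered logger clears the slot.
void ShutdownLogging(LogSystem* logSystem) {
    if (g_activeLogSystem == logSystem) g_activeLogSystem = nullptr;
}

// Models a call site that checks for a registered logger first.
// Returns false when the message was skipped.
bool TryLog(const std::string& message) {
    if (g_activeLogSystem == nullptr) return false;
    g_activeLogSystem->Log(message);
    return true;
}

class SelfUnregisteringLogSystem : public LogSystem {
public:
    explicit SelfUnregisteringLogSystem(std::vector<std::string>& lines)
        : lines_(lines) {}
    // The destructor clears the global slot before the object goes away.
    ~SelfUnregisteringLogSystem() override { ShutdownLogging(this); }
    void Log(const std::string& message) override { lines_.push_back(message); }

private:
    std::vector<std::string>& lines_;
};

// Logs once while the logger is alive and once after it has been destroyed.
std::vector<std::string> LogAroundShutdown() {
    std::vector<std::string> lines;
    auto logSystem = std::make_unique<SelfUnregisteringLogSystem>(lines);
    InitializeLogging(logSystem.get());
    bool before = TryLog("before shutdown");
    logSystem.reset();  // runs the destructor, which unregisters the logger
    bool after = TryLog("after shutdown");
    assert(before && !after);
    return lines;
}
```

In this model the second `TryLog` call is skipped rather than reaching a destroyed object. Whether the real SDK behaves the same way depends on how its logging macros consult the static logger, which this sketch does not verify.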

@corona10

@mihaimaruseac
FYI, we are trying to upgrade our internal tensorflow version to 2.16.x, and that upgrade involves S3 workloads.
It would be great for our company too if this fix could be released as soon as possible.

@mihaimaruseac

I can't do much, sadly. I left Google's TF team years ago and now only help with reviews here and there (mostly OSS builds and security-related changes).

TF IO was never maintained by Google, so this is even harder to land.
