Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Try 3.12 again [WIP] #2692

Closed
wants to merge 3 commits into from
Closed

[python] Try 3.12 again [WIP] #2692

wants to merge 3 commits into from

Conversation

@johnkerl
Copy link
Member Author

johnkerl commented Jun 6, 2024

Copy link

codecov bot commented Jun 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.19%. Comparing base (6d6c90e) to head (3b2f36a).
Report is 157 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2692   +/-   ##
=======================================
  Coverage   90.19%   90.19%           
=======================================
  Files          37       37           
  Lines        4019     4019           
=======================================
  Hits         3625     3625           
  Misses        394      394           
Flag Coverage Δ
python 90.19% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 90.19% <ø> (ø)
libtiledbsoma ∅ <ø> (∅)

@johnkerl
Copy link
Member Author

johnkerl commented Jun 6, 2024

Here's the problem:

See also

This PR full CI is red across the board for all Python versions on MacOS:
https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407659739

And as of today this still reproduces on my Mac:

$ pip list | grep arrow
pyarrow                       12.0.1
$ pip install pyarrow==14.0
Collecting pyarrow==14.0
...
Successfully installed pyarrow-14.0.0
johnkerl@exade[prod][HEAD][tiledb]$ python
Python 3.9.13 (v3.9.13:6de2ca5339, May 17 2022, 11:23:25)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import tiledb
>>> tiledb.open('s3://anything/at/all')
Fatal error condition occurred in /Users/runner/work/crossbow/crossbow/vcpkg/buildtrees/aws-c-common/src/v0.8.9-fed0b55d0f.clean/source/allocator.c:121: allocator != ((void*)0)
Exiting Application
################################################################################
Stack trace:
################################################################################
1   libarrow.1400.dylib                 0x000000010faf4fe6 aws_fatal_assert + 70
2   libarrow.1400.dylib                 0x000000010faf3e57 aws_mem_acquire + 55
3   libarrow.1400.dylib                 0x000000010fb07392 aws_string_new_from_cursor + 50
4   libarrow.1400.dylib                 0x000000010fb01328 aws_json_value_get_from_object + 40
...

Same problem on pip install pyarrow==13. We need to go down to pyarrow==12 to not have this nasty crash.

And yet Python 3.12 doesn't have pyarrow 12 or below.

cc @ihnorton

@ivirshup
Copy link
Contributor

ivirshup commented Jun 6, 2024

Yeah, I am seeing this now too. It looks like it is specific to when pyarrow is imported before tiledb, so CI and tests ran fine.

@johnkerl
Copy link
Member Author

johnkerl commented Jun 6, 2024

Needs discussion on whether we:

  • Say we don't do Python 3.12 (I'm not advocating for that -- it's an option, and a bad one)
  • Say we don't do Python 3.12 on MacOS
  • Say we do Python 3.12 but with a heavy caveat "watch your import ordering due to this issue" ...

None of these feels ideal.

@ivirshup
Copy link
Contributor

ivirshup commented Jun 6, 2024

Personally, I think the problems caused by upper bounding the allowed versions of arrow is more of an issue that python=3.12 support specifically. And the macOS support in python 3.12 is more of a symptom.

Is there an issue tracking this for tiledb? And is it python specific? It looks like tiledb and arrow are both trying to do something with AWS auth at the C level that's causing the issue.

@ivirshup
Copy link
Contributor

ivirshup commented Jun 6, 2024

I guess in tiledbsoma you have the option of making sure tiledb is imported before pyarrow on import tiledbsoma but you gotta hope people didn't import pyarrow first. I guess you could probably check for that though.

Also, if this behavior is good enough for tiledb-py, surely it is good enough for tiledbsoma?

Which ends up meaning, I'd go for tiledbsoma doing:

Say we do Python 3.12 but with a heavy caveat "watch your import ordering due to this issue"

And have tiledb/ tiledb-py take ownership of the issue and fix it.

@johnkerl
Copy link
Member Author

johnkerl commented Jun 7, 2024

And have tiledb/ tiledb-py take ownership of the issue and fix it.

We had extensive internal discussions a few months ago and determined it wasn't our bug.

I'll resuscitate those discussions freshly. cc @ihnorton .

If we were mistaken and it is a bug of ours, that will be a learning; if we're not, there may still be mitigation options we can take to work around the pyarrow issue.

@ivirshup
Copy link
Contributor

ivirshup commented Jun 7, 2024

We had extensive internal discussions a few months ago and determined it wasn't our bug.

Ah, sorry, I wasn't making a value judgement about whose fault this is. It's more that import pyarrow, tiledb; tiledb.open("s3://...") is really a issue/ bug on the tiledb side and not the responsibility of tiledbsoma.

My suggested short term mitigation would be:

  • Unpin pyarrow here
  • Have tiledb-py throw a warning on import if pyarrow has already been imported. This warning should tell people about the error, and that they need to import tiledb first or downgrade pyarrow.

Has an issue been opened over on the arrow side? It seems like the underlying cause could be an issue with how they do some aws configuration on import, so maybe this could even be patched on their side?

@johnkerl
Copy link
Member Author

Good news -- I found "one more import" (#2734) which helps ...

@johnkerl
Copy link
Member Author

johnkerl commented Jun 14, 2024

Some status:

Repro

#!/usr/bin/env python

import pyarrow
import tiledb
tiledb.open("s3://anything/at/all")

fails with

Fatal error condition occurred in /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/vcpkg/buildtrees/aws-c-common/src/v0.8.9-fed0b55d0f.clean/source/allocator.c:121: allocator != ((void*)0)
Exiting Application
################################################################################
Stack trace:
################################################################################
1   libarrow.1400.dylib                 0x0000000106f90150 aws_fatal_assert + 80
2   libarrow.1400.dylib                 0x0000000106f8f528 aws_mem_acquire + 64
3   libarrow.1400.dylib                 0x0000000106fa1cc4 aws_string_new_from_cursor + 76
4   libarrow.1400.dylib                 0x0000000106f9bb94 aws_json_value_get_from_object + 44
5   libarrow.1400.dylib                 0x0000000106f8830c aws_endpoints_ruleset_new_from_string + 120
6   libarrow.1400.dylib                 0x0000000106f22430 _ZN3Aws3Crt9Endpoints10RuleEngineC2ERK15aws_byte_cursorS5_P13aws_allocator + 48
7   libarrow.1400.dylib                 0x00000001066cb820 _ZN3Aws8Endpoint23DefaultEndpointProviderINS_2S321S3ClientConfigurationENS2_8Endpoint19S3BuiltInParametersENS4_25S3ClientContextParametersEEC2EPKcm + 120
8   libtiledb.dylib                     0x0000000150982484 _ZN3Aws2S38S3ClientC2ERKNS_6Client19ClientConfigurationENS2_15AWSAuthV4Signer20PayloadSigningPolicyEbNS0_34US_EAST_1_REGIONAL_ENDPOINT_OPTIONE + 900
9   libtiledb.dylib                     0x0000000150151178 _ZN6tiledb6common11make_sharedINS_2sm14TileDBS3ClientELi66EJRKNS2_12S3ParametersERN3Aws6Client19ClientConfigurationENS8_15AWSAuthV4Signer20PayloadSigningPolicyERKbEEENSt3__110shared_ptrIT_EERAT0__KcDpOT1_ + 92
10  libtiledb.dylib                     0x000000015013b7b8 _ZNK6tiledb2sm2S311init_clientEv + 3840
11  libtiledb.dylib                     0x0000000150144d8c _ZNK6tiledb2sm2S313ls_with_sizesERKNS0_3URIERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEi + 76
12  libtiledb.dylib                     0x0000000150144cf4 _ZNK6tiledb2sm2S313ls_with_sizesERKNS0_3URIE + 44
13  libtiledb.dylib                     0x000000015016dbf4 _ZNK6tiledb2sm3VFS13ls_with_sizesERKNS0_3URIE + 264
14  libtiledb.dylib                     0x000000015002ce18 _ZNK6tiledb2sm14ArrayDirectory2lsERKNS0_3URIE + 60
15  libtiledb.dylib                     0x000000015002ea54 _ZN6tiledb2sm14ArrayDirectory20load_array_meta_urisEv + 144
16  libtiledb.dylib                     0x00000001500312dc _ZNSt3__120__packaged_task_funcIZN6tiledb6common10ThreadPool5asyncIZNS1_2sm14ArrayDirectory4loadEvE3$_3JEEEDaOT_DpOT0_EUlvE_NS_9allocatorISE_EEFNS2_6StatusEvEEclEv + 24
17  libtiledb.dylib                     0x000000015086594c _ZNSt3__113packaged_taskIFN6tiledb6common6StatusEvEEclEv + 80
18  libtiledb.dylib                     0x00000001508655a8 _ZN6tiledb6common10ThreadPool6workerEv + 76
19  libtiledb.dylib                     0x0000000150866adc _ZNSt3__114__thread_proxyB8ue170006INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN6tiledb6common10ThreadPoolEFvvEPS9_EEEEEPvSE_ + 72
20  libsystem_pthread.dylib             0x0000000190c46f94 _pthread_start + 136
21  libsystem_pthread.dylib             0x0000000190c41d34 thread_start + 8
Abort trap: 6

on MacOS with pyarrow >= 13. Does not fail with pyarrow < 13. 100% reproducible, and immediate. The x86 and arm64 families are equally affected.

The above script passes after pip install pyarrow==12 and fails after pip install pyarrow==13. This process can be repeated, reproducing on demand the appearance of the issue in pyarrow 13 and above.

Workaround on this PR

Simply try importing tiledb before pyarrow. I'd tried this before a few months ago & it didn't work -- I found at the time adding import tiledb at the very top of apis/python/src/tiledbsoma/__init__.py did not suffice. However I found today that also adding it at the top of apis/python/tests/__init__.py does allow unit tests to pass on my MacOS laptop.

That's the good news.

The bad news is that this is now segfaulting in CI:
#2734 (comment)

Current experiment: maybe the segfault has to do with today's main. So I've applied this PR's patch to a few branches:

I'd love to reproduce this segfault and get a stack trace but sadly this command from the CI run

python -m pytest --cov=apis/python/src --cov-report=xml apis/python/tests -v --durations=20

does not segfault on my Mac. :(

Upstream

@ihnorton and I believe that this is a problem with pyarrow:

https://github.com/apache/arrow/blob/fe4d04f081e55ca2de7b1b67b10ad7dca96cfd9e/cpp/thirdparty/versions.txt#L54

ARROW_AWSSDK_BUILD_VERSION=1.10.55

and we'll file a ticket there.

@johnkerl
Copy link
Member Author

Cloaing in favor of #3001.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants