String dtype: disallow specifying the 'str' dtype with storage in [..] in string alias #60661

Open
wants to merge 3 commits into
base: main

jorisvandenbossche
Member

The intention was for the new default "str" dtype not to include the storage in its string alias, and therefore not to allow constructing it that way either (this is discussed in the PDEP).

This is also implemented this way, as you can see when directly calling the extension dtype API:

# for "string" this is allowed
>>> pd.StringDtype.construct_from_string("string[pyarrow]")
string[pyarrow]

# but for "str" not
>>> pd.StringDtype.construct_from_string("str[pyarrow]")
...
TypeError: Cannot construct a 'StringDtype' from 'str[pyarrow]'
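
To make the rule in the two calls above concrete, here is a minimal pure-Python sketch of the alias check. It is not the pandas implementation: `parse_string_alias` is a hypothetical helper, and the set of accepted storages (`python`, `pyarrow`) is an assumption made for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, NOT part of pandas: it only mirrors the rule that
# "string" may carry a storage suffix while "str" may not.
_ALIAS_PATTERN = re.compile(r"^(?P<name>string|str)(?:\[(?P<storage>\w+)\])?$")

# Assumption for this sketch: these are the storages the "string" alias accepts.
_KNOWN_STORAGES = ("python", "pyarrow")


def parse_string_alias(alias: str) -> Tuple[str, Optional[str]]:
    """Split an alias into (name, storage), rejecting 'str[<storage>]'."""
    error = TypeError(f"Cannot construct a 'StringDtype' from '{alias}'")
    match = _ALIAS_PATTERN.match(alias)
    if match is None:
        raise error
    name, storage = match.group("name"), match.group("storage")
    if name == "str" and storage is not None:
        # The new default dtype is only ever spelled as plain "str".
        raise error
    if storage is not None and storage not in _KNOWN_STORAGES:
        raise error
    return name, storage


print(parse_string_alias("string[pyarrow]"))
print(parse_string_alias("str"))
try:
    parse_string_alias("str[pyarrow]")
except TypeError as exc:
    print(f"TypeError: {exc}")
```

The sketch raises the same kind of error for `str[python]` and `str[pyarrow]`, which is the symmetric behavior the extension dtype API already has.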

However, when specifying this as a dtype argument, e.g. in constructors (which go through pandas_dtype(...) and from there through the extension dtype registry), this "accidentally" kind of works, but gives an unexpected result:

# this raises as expected
>>> pd.Series(["a"], dtype="str[python]")
TypeError: data type 'str[python]' not understood

# but this gives a result
>>> pd.Series(["a"], dtype="str[pyarrow]")
0    a
dtype: string[pyarrow]
>>> type(pd.Series(["a"], dtype="str[pyarrow]").dtype)
pandas.ArrowDtype

I think it is confusing that it does work in case of the pyarrow storage, but then does give a different dtype than what you would typically expect.
So I would rather just disallow this case (which is what this PR does), although this is a small breaking change for people currently using dtype="str[pyarrow]" to get the ArrowDtype.
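
For readers unfamiliar with the registry lookup, the fall-through can be modelled with a toy example. This is a sketch under stated assumptions, not pandas code: `ToyStringDtype`, `ToyArrowDtype`, `REGISTRY` and `toy_pandas_dtype` are hypothetical names, and the model assumes that the registry tries each registered dtype in turn and that the Arrow-backed dtype accepts any `<type>[pyarrow]` alias whose base names an Arrow type, with `str` being such a name.

```python
class ToyStringDtype:
    """Toy stand-in for the string dtype: accepts a fixed set of aliases."""

    _aliases = ("str", "string", "string[python]", "string[pyarrow]")

    @classmethod
    def construct_from_string(cls, alias: str) -> str:
        if alias in cls._aliases:
            return f"ToyStringDtype({alias!r})"
        raise TypeError(f"Cannot construct a 'ToyStringDtype' from '{alias}'")


class ToyArrowDtype:
    """Toy stand-in for the Arrow-backed dtype."""

    # Assumption: "str" is accepted as a name for Arrow's string type.
    _arrow_type_names = {"str": "string", "string": "string", "int64": "int64"}
    _suffix = "[pyarrow]"

    @classmethod
    def construct_from_string(cls, alias: str) -> str:
        if not alias.endswith(cls._suffix):
            raise TypeError(f"Cannot construct a 'ToyArrowDtype' from '{alias}'")
        base = alias[: -len(cls._suffix)]
        if base not in cls._arrow_type_names:
            raise TypeError(f"Cannot construct a 'ToyArrowDtype' from '{alias}'")
        return f"ToyArrowDtype({cls._arrow_type_names[base]!r})"


# Assumption: dtypes are tried in registration order, first match wins.
REGISTRY = [ToyStringDtype, ToyArrowDtype]


def toy_pandas_dtype(alias: str) -> str:
    for dtype_cls in REGISTRY:
        try:
            return dtype_cls.construct_from_string(alias)
        except TypeError:
            continue
    raise TypeError(f"data type '{alias}' not understood")


print(toy_pandas_dtype("string[pyarrow]"))  # handled by the string dtype
print(toy_pandas_dtype("str[pyarrow]"))     # falls through to the Arrow dtype
try:
    toy_pandas_dtype("str[python]")         # nothing accepts this alias
except TypeError as exc:
    print(f"TypeError: {exc}")
```

In this model `str[pyarrow]` is rejected by the string dtype but then picked up by the Arrow-backed one, while `str[python]` is rejected by both, which matches the asymmetry shown above.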

jorisvandenbossche added the "Strings" (String extension data type and string data) and "API - Consistency" (Internal Consistency of API/Behavior) labels on Jan 5, 2025
jorisvandenbossche added this to the 2.3 milestone on Jan 5, 2025
@rhshadrach
Member

> I think it is confusing that it does work in case of the pyarrow storage, but then does give a different dtype than what you would typically expect. So I would rather just disallow this case (which is what this PR does), although this is a small breaking change for people currently using dtype="str[pyarrow]" to get the ArrowDtype.

This gives a definitively wrong result and so I think that puts it solidly in the bugfix camp, not a breaking change.

@WillAyd
Member

WillAyd commented Jan 5, 2025

If we really cared, we could convert that during the pickle read, although I don't think that would be a blocker. Generally, using pickle to move from one environment to another is discouraged.
