
Apparent cache refresh bug when a FileExistsError is triggered #9982

Open
5 tasks done
nicos68 opened this issue Jan 24, 2025 · 1 comment
Labels
bug, needs triage (Issue that has not been reviewed by xarray team member)

Comments


nicos68 commented Jan 24, 2025

What happened?

  1. call to_zarr on a dataset with a string s3 url as the store argument, writing the dataset to the bucket
  2. call to_zarr again on the same dataset with the same url as store argument => a FileExistsError is triggered
  3. catch the error
  4. delete the dataset just written from the s3 bucket, using s3fs for example (but the result is the same when using boto)
  5. call to_zarr again on the same dataset with the same url as store argument => a FileExistsError is triggered again

What did you expect to happen?

  1. As the dataset has been deleted, there should be no error and the dataset should be rewritten to the s3 bucket.
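The issue title points at a listing-cache refresh problem: fsspec filesystems keep a directory-listing cache, and if that cache is not invalidated after the delete, an emptiness check can keep reporting the old listing. Below is a stdlib-only toy model of that failure mode; the class and method names are hypothetical illustrations, not the real s3fs/fsspec code:

```python
class TinyListingCache:
    """Toy model of a directory-listing cache, in the spirit of
    fsspec's dircache. Hypothetical illustration only."""

    def __init__(self, backend):
        self.backend = backend        # dict: path -> list of object keys
        self._cache = {}              # path -> cached listing

    def ls(self, path):
        if path not in self._cache:   # only hit the backend on a cache miss
            self._cache[path] = list(self.backend.get(path, []))
        return self._cache[path]

    def rm(self, path, key):
        self.backend[path].remove(key)
        # Note: the cached listing for `path` is *not* invalidated here.

    def invalidate_cache(self, path=None):
        if path is None:
            self._cache.clear()
        else:
            self._cache.pop(path, None)

    def is_empty(self, path):
        return self.ls(path) == []


backend = {"bucket/data": ["zarr.json"]}
fs = TinyListingCache(backend)

assert not fs.is_empty("bucket/data")  # listing gets cached here
fs.rm("bucket/data", "zarr.json")      # the object really is gone...
stale = fs.is_empty("bucket/data")     # ...but the cached listing says otherwise
fs.invalidate_cache("bucket/data")
fresh = fs.is_empty("bucket/data")     # fresh listing after invalidation
print(stale, fresh)                    # False True
```

If this is the mechanism, calling `invalidate_cache()` on the `S3FileSystem` instance (a method fsspec filesystems do provide) before the final `to_zarr` call might serve as a workaround, though that is untested here.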

Minimal Complete Verifiable Example

from s3fs import S3FileSystem
from xarray import open_dataset
import os
from dotenv import load_dotenv
from testcontainers.localstack import LocalStackContainer
import pytest


load_dotenv(".test.env")
BUCKET_NAME = os.environ["S3_BUCKET"]


@pytest.fixture
def s3_container():
    port = 8566
    container = (
        LocalStackContainer().with_env("SERVICES", "s3").with_bind_ports(4566, port)
    )
    with container as container:
        region = "us-east-1"
        client = container.get_client("s3", region_name=region)
        client.create_bucket(Bucket=BUCKET_NAME)
        yield container


@pytest.mark.usefixtures("s3_container")
def test_rewrite_after_delete_after_error():
    xarray_dataset = open_dataset("data.nc")
    xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data")
    s3fs = S3FileSystem()

    # bucket/data is *not* empty, data has been written successfully
    assert s3fs.find(BUCKET_NAME) != []
    assert s3fs.find(f"{BUCKET_NAME}/data") != []

    try:
        # Write the same data again, to trigger a FileExistsError
        xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data")
    except FileExistsError:
        # we just silence the error and carry on.
        pass

    # Erase all the data from the bucket. The data should be writable again.
    for path in s3fs.ls(f"s3://{os.environ['S3_BUCKET']}", detail=False):
        s3fs.rm(path, recursive=True)

    # The bucket *is* empty. This can be checked with aws cli.
    assert s3fs.find(BUCKET_NAME) == []

    # The data can be written to data2
    xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data2")

    # bucket/data2 is not empty, data has been written successfully
    assert s3fs.find(f"{BUCKET_NAME}/data2") != []

    # But a FileExistsError is triggered when writing to bucket/data again,
    # even though the directory is empty
    xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data")

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

===================================================================================================== FAILURES =====================================================================================================
______________________________________________________________________________________ test_rewrite_after_delete_after_error _______________________________________________________________________________________

    @pytest.mark.usefixtures("s3_container")
    def test_rewrite_after_delete_after_error():
        xarray_dataset = open_dataset("data.nc")
        xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data")
        s3fs = S3FileSystem()

        # bucket/data is *not* empty, data has been written successfully
        assert s3fs.find(BUCKET_NAME) != []
        assert s3fs.find(f"{BUCKET_NAME}/data") != []

        try:
            # Write the same data again, to trigger a FileExistsError
            xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data")
        except FileExistsError:
            # we just silence the error and carry on.
            pass

        # Erase all the data from the bucket. The data should be writable again.
        for path in s3fs.ls(f"s3://{os.environ['S3_BUCKET']}", detail=False):
            s3fs.rm(path, recursive=True)

        # The bucket *is* empty. This can be checked with aws cli.
        assert s3fs.find(BUCKET_NAME) == []

        # The data can be written to data2
        xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data2")

        # bucket/data2 is not empty, data has been written successfully
        assert s3fs.find(f"{BUCKET_NAME}/data2") != []

        # But a FileExistsError is triggered when writing to bucket/data again,
        # even though the directory is empty
>       xarray_dataset.to_zarr(f"s3://{BUCKET_NAME}/data")

test_xarray.py:82:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/xarray/core/dataset.py:2622: in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
.venv/lib/python3.12/site-packages/xarray/backends/api.py:2194: in to_zarr
    zstore = backends.ZarrStore.open_group(
.venv/lib/python3.12/site-packages/xarray/backends/zarr.py:703: in open_group
    ) = _get_open_params(
.venv/lib/python3.12/site-packages/xarray/backends/zarr.py:1792: in _get_open_params
    zarr_group = zarr.open_group(store, **open_kwargs)
.venv/lib/python3.12/site-packages/zarr/_compat.py:43: in inner_f
    return f(*args, **kwargs)
.venv/lib/python3.12/site-packages/zarr/api/synchronous.py:524: in open_group
    sync(
.venv/lib/python3.12/site-packages/zarr/core/sync.py:142: in sync
    raise return_result
.venv/lib/python3.12/site-packages/zarr/core/sync.py:98: in _runner
    return await coro
.venv/lib/python3.12/site-packages/zarr/api/asynchronous.py:800: in open_group
    store_path = await make_store_path(store, mode=mode, storage_options=storage_options, path=path)
.venv/lib/python3.12/site-packages/zarr/storage/_common.py:318: in make_store_path
    result = await StorePath.open(store, path=path_normalized, mode=mode)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'zarr.storage._common.StorePath'>, store = <FsspecStore(S3FileSystem, bucket/data)>, path = '', mode = 'w-'

    @classmethod
    async def open(
        cls, store: Store, path: str, mode: AccessModeLiteral | None = None
    ) -> StorePath:
        """
        Open StorePath based on the provided mode.

        * If the mode is 'w-' and the StorePath contains keys, raise a FileExistsError.
        * If the mode is 'w', delete all keys nested within the StorePath
        * If the mode is 'a', 'r', or 'r+', do nothing

        Parameters
        ----------
        mode : AccessModeLiteral
            The mode to use when initializing the store path.

        Raises
        ------
        FileExistsError
            If the mode is 'w-' and the store path already exists.
        """

        await store._ensure_open()
        self = cls(store, path)

        # fastpath if mode is None
        if mode is None:
            return self

        if store.read_only and mode != "r":
            raise ValueError(f"Store is read-only but mode is '{mode}'")

        match mode:
            case "w-":
                if not await self.is_empty():
                    msg = (
                        f"{self} is not empty, but `mode` is set to 'w-'."
                        "Either remove the existing objects in storage,"
                        "or set `mode` to a value that handles pre-existing objects"
                        "in storage, like `a` or `w`."
                    )
>                   raise FileExistsError(msg)
E                   FileExistsError: <FsspecStore(S3FileSystem, bucket/data)> is not empty, but `mode` is set to 'w-'.Either remove the existing objects in storage,or set `mode` to a value that handles pre-existing objectsin storage, like `a` or `w`.

.venv/lib/python3.12/site-packages/zarr/storage/_common.py:91: FileExistsError
---------------------------------------------------------------------------------------------- Captured stderr setup -----------------------------------------------------------------------------------------------
Pulling image localstack/localstack:2.0.1
Container started: e9d5537e47e1
Waiting for container <Container: e9d5537e47e1> with image localstack/localstack:2.0.1 to be ready ...
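The docstring shown in the traceback defines the mode contract: `w-` raises if the path already has keys, `w` deletes them, and `a`/`r`/`r+` leave them alone. A stdlib-only sketch of that contract (a hypothetical helper, not the actual zarr code):

```python
def open_path(store: dict, path: str, mode: str) -> None:
    """Toy sketch of the 'w-' / 'w' semantics from StorePath.open.
    Hypothetical illustration only, not the real zarr implementation."""
    keys = [k for k in store if k.startswith(path + "/")]
    if mode == "w-" and keys:
        # 'w-' refuses to touch a non-empty path
        raise FileExistsError(f"{path} is not empty, but `mode` is set to 'w-'.")
    if mode == "w":
        # 'w' clears any pre-existing keys instead of raising
        for k in keys:
            del store[k]


store = {"bucket/data/zarr.json": b"{}"}
try:
    open_path(store, "bucket/data", "w-")
    raised = False
except FileExistsError:
    raised = True

open_path(store, "bucket/data", "w")
print(raised, store)   # True {}
```

So, independent of the caching question, passing `mode="w"` to `to_zarr` should bypass the emptiness check entirely, as the error message itself suggests.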

Anything else we need to know?

I put together a full MRE project, which can be found here. It can be run by

  1. cloning the repo
  2. installing the dependencies (I personally used uv for that, but the pyproject.toml file should be standard enough)
  3. creating an aws profile for the testcontainers s3 instance, e.g.:
[profile testcontainers]
region=us-east-1
output=json
endpoint_url = http://localhost:8566
  4. running pytest

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.7 (main, Oct 16 2024, 04:37:19) [Clang 18.1.8 ]
python-bits: 64
OS: Linux
OS-release: 6.12.9-1-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.4-development

xarray: 2025.1.1
pandas: 2.2.3
numpy: 2.1.3
scipy: 1.15.1
netCDF4: 1.7.2
pydap: None
h5netcdf: 1.4.1
h5py: 3.12.1
zarr: 3.0.1
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.1.0
distributed: 2025.1.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: 0.8.2
fsspec: 2024.12.0
cupy: None
pint: None
sparse: None
flox: 0.9.15
numpy_groupies: 0.11.2
setuptools: None
pip: None
conda: None
pytest: 8.3.4
mypy: None
IPython: None
sphinx: None

@nicos68 nicos68 added the bug and needs triage labels Jan 24, 2025

welcome bot commented Jan 24, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@nicos68 nicos68 changed the title Apparent cache poisoning when a FileExistsError is triggered Apparent cache refresh bug when a FileExistsError is triggered Jan 24, 2025