Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move vendored modules into a vendor directory #129222

Open
picnixz opened this issue Jan 23, 2025 · 24 comments
Open

Move vendored modules into a vendor directory #129222

picnixz opened this issue Jan 23, 2025 · 24 comments
Labels
build The build process and cross-build type-feature A feature request or enhancement

Comments

@picnixz
Copy link
Member

picnixz commented Jan 23, 2025

Feature or enhancement

Proposal:

In CPython, we have some vendored libraries namely libmpdec, hacl, and expat. Those libraries are meant to be clone of their upstream (think of them as git submodules) and many times have I been exluding them from code search as they usually have non-CPython code involved.

We have a plan to remove vendored libmpdec (#115119) and we're almost here. In the long term, we could also try to unvendor expat (probably not hacl as it's used to implement hash functions fallbacks when OpenSSL is not present).

Affected modules:

  • libmpdec
  • expat
  • _hacl

Some advantages:

  • Easy exclusion of vendored modules from code search.
  • Easy addition and deletion of vendored modules.
  • We don't expect redistributors to edit vendored code (especially not hacl).
  • Hopefully less work than the Python/Programs split
  • Could be using git submodules (though this requires an internet connection, so maybe not)

Some inconvenients:

  • Quite painful refactoring since it would affect existing PRs.
  • Quite painful refactoring since it would affect distributors editing those files.

We can start with some modules that should be kept untouched such as HACL* sources and progressively move the others to reduce the work and conflicts. It doesn't need to happen in one go (for instance, we may well ignore the libmpdec case if we manage to make it unvendored before). I don't think we have much open PRs with expat (by the way, we could have a refresh script for expat to ease maintenance like #126623).

Now, the question is how this could affect downstream redistributors. I'm asking first on Github since I don't know whether they are active on Discourse or not. If everyone tells "it's fine", then I'll ask on Discourse to see if there are more redistributors that could be concerned.

cc

I don't know how moving mimalloc related stuff would affect the free-threaded build in particular, so I'm also going to ask @kumaraditya303 and @colesbury about it. EDIT: Turns out it's a no go for mimalloc as there is some CPython dedicated stuff, so we can put it out of the list (see #129222 (comment)).

For the hacl includes, I can take care of it.

@picnixz picnixz added build The build process and cross-build type-feature A feature request or enhancement labels Jan 23, 2025
@mhsmith
Copy link
Member

mhsmith commented Jan 23, 2025

Android doesn't do anything special with these libraries, so it shouldn't be affected.

@mgorny
Copy link
Contributor

mgorny commented Jan 23, 2025

Gentoo used to remove these libraries entirely, but we don't do that anymore. Which reminds me that I was supposed to fix the issue that prevented us from doing so. Also, I was supposed to see if we can remove them. Days are too short.

@ned-deily
Copy link
Member

s/nedbat/ned-deily/
s/universal macOS builds/relocatable build/

Such a change should have no impact on macOS builds, macOS installer builds, or iOS builds.

@picnixz
Copy link
Member Author

picnixz commented Jan 23, 2025

Oups, sorry for mistagging you and for the miswording (forgive my lack of macOS knowledge).

@picnixz
Copy link
Member Author

picnixz commented Jan 23, 2025

Er... this is funny but... who should I ask for Ubuntu actually?

@hugovk
Copy link
Member

hugovk commented Jan 23, 2025

Bikeshedding: call it _vendor instead of vendor?

For example, pip and setuptools have _vendor directories.

@colesbury
Copy link
Contributor

colesbury commented Jan 23, 2025

Mimalloc has substantial modifications from upstream. Daan Leijen was working on changes to mimalloc to support the features needed by free-threaded CPython upstream, but it's probably going to take a while. Even then, it will require CPython specific build configuration -- it's never going to be something where you can just drop in a shared library provided by the OS.

@ned-deily
Copy link
Member

@doko42 had/has been the Debian/Ubuntu contact

@mgorny
Copy link
Contributor

mgorny commented Jan 23, 2025

Bikeshedding: call it _vendor instead of vendor?

For example, pip and setuptools have _vendor directories.

I think the point of _vendor is that it's installed as a Python package but not meant as a public API, so it's not really applicable here.

@picnixz
Copy link
Member Author

picnixz commented Jan 23, 2025

Mimalloc has substantial modifications from upstream. Daan Leijen was working on changes to mimalloc to support the features needed by free-threaded CPython upstream, but it's probably going to take a while. Even then, it will require CPython specific build configuration -- it's never going to be something where you can just drop in a shared library provided by the OS.

Thank you, I wasn't aware of it! I've removed from the list of possible vendored modules.

Bikeshedding: call it _vendor instead of vendor?

I don't mind. At least, vendoring modules could be included last (or first) and be well-separated from other includes thanks to their name starting with _. Note that we might have some additional files next to the vendored modules to ease our life (e.g., for HACL, we redefine some names to avoid name clashes, I don't know how it is for other modules).

@colesbury
Copy link
Contributor

Thank you, I wasn't aware of it! I've removed from the list of possible vendored modules.

It may still make sense to include it in the vendored modules directory. I don't have a strong opinion one way or the other. It depends on what you are primarily trying to communicate with the vendored directory. For example, it's third-party code in terms of license, attribution, and code style. But Linux distributions should probably treat it more like internal CPython code. For example, they should not substitute their own mimalloc package.

@zanieb
Copy link
Contributor

zanieb commented Jan 23, 2025

Thanks for the ping!

I think there would be some minor pain downstream because we have a weird custom build system for the extension modules, but it's fine.

What about changes like #127932?

@picnixz
Copy link
Member Author

picnixz commented Jan 23, 2025

It depends on what you are primarily trying to communicate with the vendored directory

For me vendored was in the sense of something that is not ours but just a bundled copy of a project to avoid installing separate packages. If it's internal CPython code, even if it heavily relies on an upstream package, I think we can keep it where it is now. Since mimalloc is primarily for free-threading-related stuff, I think it's fine to keep it where it is (it would also avoid the needs of changing the configure script I think, so I don't mind leaving mimalloc alone).

What about changes like #127932?

I'm not sure I understand your question but I'll try to reply with what I understood. If the upstream has issues, we can't really change the vendored copy even if we want to because we have some SBOM checks. So we can't just "hotfix" HACL* when we want to (we need to wait for upstream updates). So whether we move it to vendor or not, we would have the same issues.

@ned-deily
Copy link
Member

Historically, the cpython copies of some of the long-time "vendored" modules have been patched and modified in the cpython repo without regard to upstream. That has lead, of course, to maintenance issues when new upstream versions are released. I think you need to clarify here whether your intent is to move to a model of strictly vanilla vendored versions with separate necessary patch files in the repo so that SBOMs for the vendored versions could be used. That's potentially a much bigger change, i.e. reviewing all the cpython copies and extracting our changes into separate patches, updating the devguide to explain the process, etc etc, and one that needs to be carefully done. But it would likely be of long-term benefit to do so.

@zooba
Copy link
Member

zooba commented Jan 23, 2025

On Windows (and it's open to other platforms) we already debundle sources into https://github.com/python/cpython-source-deps and pre-built binaries into https://github.com/python/cpython-bin-deps. They are referenced by build script, rather than submodules, but at least it means our only build-time network access is to the same service that holds the rest of our code (i.e. if it's down, our build is failing already). We do also patch these from time to time, but usually in individual commits to preserve the history.

Are you proposing to move them there? Or just to another directory in the current source tree?

@zanieb
Copy link
Contributor

zanieb commented Jan 23, 2025

If the upstream has issues, we can't really change the vendored copy even if we want to because we have some SBOM checks. So we can't just "hotfix" HACL* when we want to (we need to wait for upstream updates). So whether we move it to vendor or not, we would have the same issues.

My point is that we already did apply a patch to HACL* — is the idea here that we would revert that change?

I think the subsequent comment from @ned-deily (at #129222 (comment)) captures the broader concern.

@picnixz
Copy link
Member Author

picnixz commented Jan 23, 2025

I think you need to clarify here whether your intent is to move to a model of strictly vanilla vendored versions with separate necessary patch files in the repo so that SBOMs for the vendored versions could be used. That's potentially a much bigger change, i.e. reviewing all the cpython copies and extracting our changes into separate patches, updating the devguide to explain the process, etc etc, and one that needs to be carefully done. But it would likely be of long-term benefit to do so.

That's my main intent. Namely, try to separate what's being vendored (and can be updated if the vendor's upstream has changed) from what's being "our" code. I guess we can also allow separate patches to be applied on our vendored copies if needs arise (e.g., if the vendored copy has issues and the upstream takes a long time for making the change or does not want to make the changes).

Are you proposing to move them there? Or just to another directory in the current source tree?

I'm just proposing to moving it in another directory.

My point is that we already did apply a patch to HACL* — is the idea here that we would revert that change?

Err, no? we did apply a patch to HACL* using the upstream HACL* or am I wrong? I mean, what we did is just update the bundled HACL* library once the necessary patch has been written out there or am I missing something?

@eli-schwartz
Copy link
Contributor

Could be using git submodules (though this requires an internet connection, so maybe not)

Cloning the cpython git repo already requires an internet connection. Doing a recursive clone doesn't require more of one.

The release tarballs should absolutely include the contents of a submodule as well -- in fact you have to since you can't later clone the submodules, Internet connection or not, since there is no submodule metadata.

The advantage of using submodules is that it becomes very very very clear what is vendored versus what is patched.

The Meson build system does something similar using "wrap files" built into the core build system, that contain an URL download location for the upstream release tarballs and a checksum, which works even without git metadata. This also means that Meson can distribute your choice of "fat" release tarballs with vendored dependency fallbacks, or "thin" release tarballs with just the wrap definition, and at build time it will download and checksum the vendored dependency only if you don't have a system copy, or conditional on configure options such as "force vendor dependencies for standalone builds" or "forbid vendor dependencies for distro policy". The code to handle this robustly is not trivial, which is why we built it into the core so that it only has to be maintained in a single place. You probably want to stick with submodules.

@hroncok
Copy link
Contributor

hroncok commented Jan 23, 2025

Moving the vendored projects to an explicit directory would not impact Fedora/RHEL. We would need to change the paths for those that we rm, but that should be it.

@gpshead
Copy link
Member

gpshead commented Jan 23, 2025

If you're bikeshedding the name, github code search already excludes directorys named third_party/ as that is a common industry norm place to put... third party software that you've vendored.

HACL* belongs specifically with hashlib extension modules so colocating it in a sub-tree under that could would make sense to me, but I'm not picky, it's just a matter of updating include paths and our script that updates the specific version we use.

Ideal: We'd make a top level third_party/ tree to contain all of these vendored bits in their own subdirectories.

Doing that would be good for project health IMNSHO rather than scattering stuff around as we do today. re-packagers should appreciate it as well as they each have their own ideas of what they do and don't want to include vs pull from external sources and it makes it more obvious what we've got that could be considered for that.

Non-goal 1: Vendored things each need their own management in terms of how we get them, update them, and patch them if necessary for our repo. Don't try for a single approach there, it won't fit everything. There may be some commonalities on how that gets done but that kind of thing can be worked out on a case by case basis later.

Non-goal 2: Vendored things in our tree do NOT need to match upstream. applying patches is part of vendoring. the important thing is to track those patches and automate recreating such changes when updates are needed rather than expecting the next person doing it to fully understand.

@gpshead
Copy link
Member

gpshead commented Jan 23, 2025

BTW, the reason pip and setuptools put a _ prefix on their vendor tree is likely to make it clear that it is private in a Python naming sense. The things we vendor within CPython are not Python code. We shouldn't need to do that.

@eli-schwartz
Copy link
Contributor

If you're bikeshedding the name, github code search already excludes directorys named third_party/ as that is a common industry norm place to put... third party software that you've vendored.

... or thirdparty/ or third-party/, the latter of which I personally feel I've seen more often than the version with an underscore. :D I wonder if github excludes those too?

@picnixz
Copy link
Member Author

picnixz commented Jan 23, 2025

If you're bikeshedding the name, github code search already excludes directorys named third_party/ as that is a common industry norm place to put... third party software that you've vendored.

Oh... then maybe we can use that name :') If it's supported by Github, I would vote for 'third-party' because I don't need to press shift for _ (and it reads better than thirdparty).

Non-goal 1: Vendored things each need their own management in terms of how we get them, update them, and patch them if necessary for our repo. Don't try for a single approach there, it won't fit everything. There may be some commonalities on how that gets done but that kind of thing can be worked out on a case by case basis later.

Yes, that's what I think as well. I don't think we can have a unified approach.

Non-goal 2: Vendored things in our tree do NOT need to match upstream. applying patches is part of vendoring. the important thing is to track those patches and automate recreating such changes when updates are needed rather than expecting the next person doing it to fully understand.

I'm ok with this as well. I don't mind having each vendored module with its own patching system (I mean, we're already doing it somehow).


Now, AFAICT, there doesn't seem to many issues about just moving files which was my primary concern here. I'm no expert in redistribution but since I raised the issue on Discord, I thought it was also good to actually first ask whether this is something that would be interesting for CPython in the long term. And I'd like to thank everyone involved in this (constructive) discussion.

@eli-schwartz
Copy link
Contributor

eli-schwartz commented Jan 23, 2025

Ideal: We'd make a top level third_party/ tree to contain all of these vendored bits in their own subdirectories.

Doing that would be good for project health IMNSHO rather than scattering stuff around as we do today. re-packagers should appreciate it as well as they each have their own ideas of what they do and don't want to include vs pull from external sources and it makes it more obvious what we've got that could be considered for that.

By the way, Meson explicitly mandates that all "subprojects" (thirdparty code with their own build system that can be autoconfigured) must exist in a single top-level directory (default: subprojects/ but you can define it in your meson.build file if you use a different directory instead) for precisely this reason. It is extremely nice to be able to tell what code is third-party and what code is first-party.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build The build process and cross-build type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests