experimental_index_url fetching of large dep graphs is slow #2014

Open
aignas opened this issue Jun 26, 2024 · 1 comment

Comments

@aignas
Collaborator

aignas commented Jun 26, 2024

For context see: https://bazelbuild.slack.com/archives/CA306CEV6/p1719321260320259?thread_ts=1715871259.645269&cid=CA306CEV6

Some of our users have a lot of unique packages (e.g. 700+) and fetching the PyPI index metadata for them takes a long time. This has to happen at some point: if users rely on MODULE.bazel.lock, it is done once when the lock file is updated and not again afterwards. The problem appears when users need to regenerate the lock file or the rules_python version changes - then the PyPI index data for all of the 700+ packages has to be refetched, because we do not cache it. We avoid caching because the index data may change at any time and because we want to call the index once per package when evaluating the extension, reusing the results across the hubs.

The reuse across hubs versus caching between runs is a trade-off we have to make here, and I'd be willing to bet that we should instead isolate the calls and make one call to PyPI per package per hub, using the requirement line and the index URL as a canonical id so that we can store the result of module_ctx.download in the Bazel downloader cache. This means that when users update rules_python or their lock files they would not need to refetch the PyPI index data, which would be much faster. People who are not yet using the lock file but have persistent workers would also benefit, because the extension evaluation would speed up drastically. What is more, the rules_python code handling all of this could be simplified.
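
To illustrate the idea, here is a minimal Starlark sketch of a per-package metadata fetch that sets canonical_id. The function name, the way the package name is derived, and the exact id format are assumptions for illustration, not the final implementation:

```starlark
# Sketch only: _fetch_index_metadata and the id format are made up for illustration.
def _fetch_index_metadata(module_ctx, requirement_line, index_url, output):
    # Derive the package name from the requirement line, e.g. "flask==3.0.2" -> "flask".
    # Real requirement parsing is more involved; this is simplified here.
    package = requirement_line.partition("==")[0].strip()

    # Using the requirement line plus the index URL as the canonical id lets the
    # Bazel downloader cache serve the result on later runs, so updating
    # rules_python or regenerating the lock file does not force a refetch as
    # long as the pinned requirement and the index URL stay the same.
    return module_ctx.download(
        url = ["{}/{}/".format(index_url.rstrip("/"), package)],
        output = output,
        canonical_id = "{} {}".format(index_url, requirement_line),
    )
```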

Alternatives that I thought about and discarded:

  • Make it non-eager - fetch the PyPI data for each package inside whl_library. We still need to get the distribution filenames from somewhere, and since they go into the lock file, this could only work with lock file formats similar to poetry.lock or pdm.lock, which record the whl filenames but not the URLs. Since having the URLs in the lock file is useful, it is better to retrieve them as part of the pip extension evaluation.
  • Have a separate repository rule that only fetches the PyPI metadata, and then pass a label to a file containing the URL instead of the URL itself. This would make the fetching of PyPI metadata lazy, but it suffers from the same problem - we still need the filenames of the distributions we are working with - and it makes the code more complex just to avoid relying on MODULE.bazel.lock.

So the plan of attack:

  • Rework the pip.parse extension to collect all of the requirements from each invocation of pip.parse (see the sketch after this list).
  • Rework the simpleapi_download function to create a canonical id from the requirement_line (or a list of them) and the URL that we are calling. Document that users would have to run bazel clean --expunge --async if they want to clear the cached PyPI metadata.
  • Ask users to test to see if that makes a difference.
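
A minimal sketch of the first step, assuming hypothetical helper names; the real pip extension already iterates module_ctx.modules, so this only shows the shape of collecting requirement lines across all pip.parse invocations:

```starlark
# Sketch only: reading and splitting the requirements file is simplified here.
def _collect_requirements(module_ctx):
    requirements = {}
    for mod in module_ctx.modules:
        for parse in mod.tags.parse:
            # requirements_lock is the existing pip.parse attribute; keep every
            # non-comment line as a requirement to fetch metadata for.
            for line in module_ctx.read(parse.requirements_lock).splitlines():
                line = line.strip()
                if line and not line.startswith("#"):
                    requirements[line] = None
    # The dict keys deduplicate requirements shared between hubs.
    return requirements.keys()
```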

cc: @elviswianda

@aignas
Collaborator Author

aignas commented Jun 28, 2024

I have a branch, and it seems that using the canonical_id does not yield any significant benefits for large requirements files. The biggest easy wins are achieved by:

  • Use MODULE.bazel.lock - this means that the extension is not re-evaluated if nothing changes.
  • Use Bazel 7.1 or above - there we default to fetching metadata from PyPI in parallel.
  • Use experimental_index_url_overrides so that we make fewer calls to PyPI. With experimental_extra_index_urls we call every listed index for each package; with the overrides parameter we make a single call per package (see the example after this list).
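
For illustration, a MODULE.bazel snippet showing the third recommendation; the hub name, index URLs, and package name are made up, but the attributes are the existing pip.parse ones:

```starlark
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pypi",
    python_version = "3.11",
    requirements_lock = "//:requirements.txt",
    # One base index that serves most packages.
    experimental_index_url = "https://pypi.org/simple",
    # Route specific packages to another index instead of querying every
    # extra index for every package via experimental_extra_index_urls.
    experimental_index_url_overrides = {
        "some_internal_package": "https://example.com/simple",
    },
)
use_repo(pip, "pypi")
```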

For a 900-package requirements.txt, the evaluation of the extension took 9 seconds without canonical_id and 8 seconds with the canonical id and the items already in the cache. With Bazel 7.0.0 the numbers would be much higher (closer to a minute), but that is still much faster than running uv pip compile on such a big requirements file. Since the extension is only evaluated once in a while, I personally think the performance hit is OK, or at least there is nothing we can do about it. I am also not sure we should use the canonical_id here, because that would put more things in the cache and may lead to surprising behaviour. If we are not caching, we can be more clever in other places.

Maybe the only action item from this issue is to update documentation with the recommendations above.
