perf(localSearch): add concurrency pooling, cleanup logic, improve performance #3374
Conversation
Force pushed to keep up with main branch (use pMap instead).
LGTM, I'll merge this after testing once with unocss docs or something.
Sure, I will create another PR to port the JSDOM-based section splitter back after this one gets merged. After that, all users should receive the performance benefit as well as enhanced indexing robustness.
fixes #3377
Highlights
This PR enables the local search plugin to work in true parallel. While keeping its API backward compatible, it now allows asynchronous user-provided callbacks for both HTML generation and section splitting (the latter being very expensive).
In addition, by refactoring the task-dispatching logic, the plugin no longer accumulates content for all pages and locales before indexing. This change therefore benefits all local-search users out of the box.
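The dispatching idea can be sketched as a small concurrency pool. The PR itself relies on the `p-map` package (see "use pMap instead" above); the hand-rolled `mapWithConcurrency` below is only an illustration of the same pattern, not the PR's code:

```typescript
// Minimal sketch of concurrency pooling: run `mapper` over `items` with at
// most `concurrency` tasks in flight, preserving result order.
async function mapWithConcurrency<T, R>(
  items: T[],
  mapper: (item: T) => Promise<R>,
  concurrency: number
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0
  // Each worker repeatedly claims the next unprocessed index. Claiming is
  // safe without locks because JS is single-threaded between awaits.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++
      results[i] = await mapper(items[i])
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker))
  return results
}
```

With a pool like this, each page can be indexed as soon as its HTML is ready instead of waiting for the whole site to be collected first.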
What's next
I've implemented a multiprocessing version that works on top of this PR. That implementation is not yet included here because (1) I am not sure whether adding a new package (JSDOM) is acceptable, and (2) I do not know if people would agree with the idea of multiprocessing.
As you can see in this commit, the new `miniSearch.splitIntoSections()` config option allows an AsyncGenerator as its return value, and that AsyncGenerator can be used to proxy a worker thread via the JS event model.

One important reason I chose the local-search plugin to start experimenting with is data isolation. The splitting task only needs the HTML string, so it is easy to strip out of the build context. In my project, the IPC overhead is minimal because the main process only sends a path, not the entire content.
However, if done correctly, many other build steps can also be refactored to run in parallel. For example, `renderPage()` looks very promising: it currently takes relatively long to complete, especially for large volumes of content. If we can make that part parallel, I believe almost everyone would see a noticeable speedup.

The speedup is crazy
Both tasks were performed on an Apple M1 Max, building this project.
Problem to be fixed
The default section splitting function `splitPageIntoSections()` uses two regular expressions to detect and strip headers from HTML sources. However, it assumes that each `<h*>` element has an `<a>` element as its first child. This only holds for HTML generated from pure markdown sources. Markdown also allows embedded HTML elements (vitepress even makes them reactive), and users would expect their own headers, which may not match the RegExp patterns, to be detected normally.

In addition, the original splitter used a sparse array to track the current header hierarchy, but it fails in some very simple cases. For example:
All these problems have been fixed in the `worker` implementation from my project. I can spend some time porting them back (with or without multiprocessing).