-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API docs are not being properly indexed for search #1432
Comments
We recently reached our first funding goal, and therefore released the While we investigate this issue, you can set |
Okay, so the symbol type code span is being included in the index. I failed to recognize that as the body text of the code span wasn't included in the search result. But now I realize that the body text is added by CSS so it isn't in the HTML. Looks like the |
We could also try to somehow unescape the title before adding it to the page. IIUC escaping is done in the function formatResult (location, title, summary) {
return '<article><h3><a href="' + joinUrl(base_url, location) + '">'+ escapeHtml(title) + '</a></h3><p>' + escapeHtml(summary) +'</p></article>';
} That would at least fix the rendering issue in the search page. |
Actually, this setting makes no difference. To be clear, changing the setting does remove the symbol type from the documentation, but it does not remove it from the search results. I expect that should be a clue as to where the problem is. While testing this, I also noticed that any search result which returns the module (top heading in the page which included the module level docstring) also has the entire header (name of module) in escaped code span (for example, |
Oh, yes, when we don't separate signatures from the headings, then headings are the signatures, and they're wrapped in code tags. I suppose we could wrap them in spans instead (as you suggested in previous discussions), with the right CSS styling, but they would be escaped as About symbols still appearing in the search results, try also setting |
That fixes the issue. The However, we still have the issue with the module titles. I should point out that all of the headings (for functions, classes, attributes, methods, etc) have code spans in them, which get stripped. But only the module titles do not get their tags stripped. The offending code span here is not the symbol type, but the code span which contains the module name. |
I just deployed the docs with the changes in #1434 and search results are better (they are actually usable). However, we still have the issue with the module names not being striped of tags before being indexed. Therefore, this issue is still unresolved. |
Upon further investigation, it appears that this issue is a symptom of mkdocs/mkdocs#3357. MkDocs' search plugin gets its entries from 2 sources: (1) a page and (2) the sections of a page. The sections are obtained by iterating over the TOC for a page. The TOC already sanitizes section headings and therefore we don't have an issue with them (for the most part). However, until mkdocs/mkdocs#3357 is resolved, there is no guarantee that a page title is sanitized. Unfortunately, the search plugin assumes that it is. I could submit a change to mkdocs which removes that assumption (by explicitly sanitizing the title within the search plugin before indexing), but it would be more sensible to just fix mkdocs/mkdocs#3357 directly. That said, there is still the issue with symbol types in the TOC. They should not be getting through the TOC sanitation. I wonder if there is a change to mkdocsstrings which could work around the issue entirely. That needs further investigation. |
After I posted that I remembered that the type symbols are added by mkdocsstrings after Markdown is done with the TOC. Therefore, changes to the TOC extension would have no effect. I wonder if perhaps mkdocsstrings needs to take a different approach here so as not to dirty the TOC and page title. And that is what I want to investigate. |
After some investigation, it is now clear that the as-yet unresolved problem lies with MkDocs itself. See mkdocs/mkdocs#3357, which is specifically about ensuring page titles do not contain markup. When that is resolved, markup will not be passed to the search index for whole-page entries. Also of note is mkdocs/mkdocs#3560. If that (or something similar) gets merged, then we can turn |
Great investigation and PR explanation @waylan 👌
I'm open to any suggestion! However IIUC this will be solved thanks to the two PRs mentioned above?
The nav_parts = [f"<code>{part}</code>" for part in parts] You can try to remove the |
There are a few different reasons why I say that. First of all, we don't know yet what the fix for mkdocs/mkdocs#3357 will be. It could be complex and allow preservation of markup in a page title or it could simply strip all markup from the page title. If it is the later, then those code tags will get removed even in places where we want them to stay. And that leads me to the bigger issue. Where is the proper location to be wrapping content in tags? I think ideally the answer should be the HTML templates (I don't mean mkdocstrings-python's templates as they are rendering Markdown, not HTML). To be clear, the page body is understood to contain HTML when passed to a template. However, everything else is assumed to be plain text. In other words, when a page template is rendering the page's TOC, it is iterating over a collection of plain text strings and rendering each into a TOC item. This is where the various tags and symbols should get inserted (by the template), They should not be part of the text passed to the template as they are now. Of course, the problem is how to get a plugin to do that. Yes, plugins have access to and can modify the Jinja env, which would allow a plugin to override/replace/inherit a theme's templates. But that falls down when the user provides their own template overrides. And even if the user-templates-issue was resolved, a plugin would presumably need to provide a different set of templates for every supported theme out there, which is not practical. So, in the end, I don't know what the solution should be. I'm just pointing out that this will continue to be an issue to some degree or another so long as Python objects continue to get markup inserted into them by plugins.
Yes, that's the problem. I want the code tags where I want them, but not where I don't. The question is how to preserve them where we want them, but to exclude them where we don't. It all depends on what solution mkdocs takes. |
According to the direction that the issue is moving to... Putting html into the nav value may become unsupported https://github.com/mkdocs/regressions/actions/runs/7733622178/job/21087001738 But I'll need to keep thinking on this |
No actually the |
This is off-topic here but I don't know any other thread that talks about this It is currently dumping garbage such as the following into the HTML when using mkdocs-material: <a aria-label='Previous: <code class="doc-symbol doc-symbol-nav doc-symbol-module"></code> inventory' class="md-footer__link md-footer__link--prev" href="../inventory/"> And with upcoming changes to MkDocs it will dump slightly different garbage into there :D
|
Made a further post about it mkdocs/mkdocs#3357 (comment) |
I think we had a discussion about this with @squidfunk when I was implementing the feature in mkdocstrings. I think it led to the use of |
@oprypin, for context, this is all about Python-Markdown using mkdocstrings for its API documentation. The |
The default behavior is for search to use the heading of the relevant section as the title of the search result. However, the code spans within headings of the API docs are being HTML escaped in the results and many of the obvious headings are not being returned as results. I'm assuming the second issue is related to the first in that the correct text is not getting indexed. In other words, if only the plain text content of a heading was being indexed, then that would result in better search results.
Consider the following example.
ESCAPED_CHARS
is an instance attribute of theMarkdown
class. One might expect that its relevant section (ESCAPED_CHARS) would be returned in a search for the stringESCAPED_CHARS
, but it is not. However, the methodmarkdown.inlinepatterns.EscapeInlineProcessor.handleMatch
is in the results because it mentionsESCAPED_CHARS
in the body of its documentation. Yet, in the search result, the title is the text<code class="doc-symbol doc-symbol-toc doc-symbol-method"></code> handleMatch
nothandleMatch
ormarkdown.inlinepatterns.EscapeInlineProcessor.handleMatch
as one might expect. Frustratingly, the search termhandleMatch
does not return that result at all. Yet, it is in the results for the search termdoc-symbol-method
, which shouldn't even be indexed, as it is an HTML class assigned to the code span, not text.@pawamoy do you have any insight into this? I have not yet looked at the code and am not sure how the mkdocsstrings extension passes its generated pages to search for indexing.
The text was updated successfully, but these errors were encountered: