IVS in string searching #24

Open
wants to merge 5 commits into
base: gh-pages
Binary file added images/E0100.png
Binary file added images/E0101.png
35 changes: 34 additions & 1 deletion index.html
@@ -551,7 +551,40 @@ <h4>South Asian (Indic script) languages</h4>

</section>

</section>
<section id="cjk-ivs">
<h4>East Asian languages</h4>

<p>Text in the Chinese, Japanese, and Korean (CJK) writing systems can also be affected by this issue.</p>

<p>A single ideograph can have multiple glyph variants, selected by the variation selector (VS) that follows it. For example, <span class="codepoint" translate="no"><bdi lang="zh">&#x9F8D;</bdi><code class="uname">U+9F8D CJK UNIFIED IDEOGRAPH-9F8D</code></span>, when followed by different variation selectors such as <span class="codepoint" translate="no"><img alt="Variation Selector-17" src="images/E0100.png"><code class="uname">U+E0100 VARIATION SELECTOR-17</code></span> or <span class="codepoint" translate="no"><img alt="Variation Selector-18" src="images/E0101.png"><code class="uname">U+E0101 VARIATION SELECTOR-18</code></span>, may be displayed with distinct glyphs. String searching operations need to consider whether matches are performed on the ideograph alone or on the full Ideographic Variation Sequence (<abbr>IVS</abbr>).</p>
<p>When performing string searches, treat an IVS (e.g., <span class="codepoint" translate="no"><bdi lang="zh">&#x9F8D;&#xE0100;</bdi><code class="uname">U+9F8D CJK UNIFIED IDEOGRAPH-9F8D</code> + <img alt="Variation Selector-17" src="images/E0100.png"><code class="uname">U+E0100 VARIATION SELECTOR-17</code></span>) as a single unit, and use its first code point (the base ideograph) as the basis for comparison.</p>
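<aside class="example" title="Sketch: splitting text into IVS units">
<p>One way to treat an IVS as a single unit is to pair each character with the variation selector that follows it, if any. The following is a minimal sketch, not a definitive implementation. The names <code>IVS_SELECTORS</code> and <code>split_units</code> are hypothetical, and the sketch assumes that only VS17 to VS256 (U+E0100 to U+E01EF) need to be handled. It does not check that the base character is actually an ideograph.</p>

```python
# Variation selectors used in IVS: VS17..VS256 (U+E0100..U+E01EF).
IVS_SELECTORS = range(0xE0100, 0xE01F0)


def split_units(text: str) -> list[tuple[str, str]]:
    """Split text into (base, selector) pairs; selector is '' when absent.

    Assumption: a selector attaches to the character immediately before it.
    A selector with no usable base is kept as a unit of its own.
    """
    units: list[tuple[str, str]] = []
    for ch in text:
        can_attach = (
            ord(ch) in IVS_SELECTORS
            and units
            and units[-1][1] == ""                       # base has no selector yet
            and ord(units[-1][0]) not in IVS_SELECTORS   # base is not itself a selector
        )
        if can_attach:
            units[-1] = (units[-1][0], ch)
        else:
            units.append((ch, ""))
    return units
```

<p>With this split, the first element of each pair is the code point used as the basis for comparison, and the second element records which variant, if any, was requested.</p>
</aside>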

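<p>If possible, support both a "fuzzy" search mode, which ignores variation selectors and matches on the base ideograph, and a "precise" search mode, which matches only the exact sequence, to accommodate different user needs.</p>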

<aside class="example" title="Example of string searching and IVS">
<p>Here is an example string search on text with IVS:</p>
<table ><thead>
<tr>
<th>Search term</th>
<th>Text</th>
<th>Result (fuzzy mode)</th>
<th>Result (precise mode)</th>
</tr></thead>
<tbody>
<tr>
<td><span lang="zh">龍天</span> (U+9F8D U+5929)</td>
<td><span lang="zh">龍天</span> (U+9F8D U+5929)<br>
<span lang="zh">龍󠄀天</span> (U+9F8D U+E0100 U+5929)<br>
<span lang="zh">龍󠄁天</span> (U+9F8D U+E0101 U+5929)</td>
<td>Matches all three strings</td>
<td>Matches only <span lang="zh">龍天</span> (U+9F8D U+5929)</td>
</tr>
</tbody></table>
</aside>
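
<aside class="example" title="Sketch: fuzzy and precise matching with IVS">
<p>The two result columns in the table above can be reproduced with a short sketch. This is an illustration under stated assumptions rather than a recommended implementation: the function name <code>matches</code> and the mode strings are hypothetical, only VS17 to VS256 (U+E0100 to U+E01EF) are treated as IVS selectors, and fuzzy mode is assumed to ignore selectors in the search term as well as in the text.</p>

```python
import re

# Assumption: only VS17..VS256 (U+E0100..U+E01EF) are treated as IVS selectors.
_IVS = re.compile("[\U000E0100-\U000E01EF]")


def matches(term: str, text: str, mode: str) -> bool:
    """Report whether term occurs in text under the given (hypothetical) mode."""
    if mode == "fuzzy":
        # Ignore variation selectors on both sides; compare base characters only.
        return _IVS.sub("", term) in _IVS.sub("", text)
    if mode == "precise":
        # Exact code point match, but never split an IVS: a hit is rejected
        # when the next character in the text is a variation selector.
        start = text.find(term)
        while start != -1:
            end = start + len(term)
            if end == len(text) or not _IVS.match(text[end]):
                return True
            start = text.find(term, start + 1)
        return False
    raise ValueError(f"unknown mode: {mode!r}")
```

<p>For the search term U+9F8D U+5929, this sketch matches all three strings from the table in fuzzy mode and only the first in precise mode. In precise mode, a search for U+9F8D alone does not match U+9F8D U+E0100, because the match would end in the middle of an IVS.</p>
</aside>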

</section>

</section>

<section id="whitespaceNormalization">
<h4>Whitespace Normalization</h4>