Allow `[` and `]` as URL code-points
#753
Comments
Thus far validity has aimed to match RFC 3986 and RFC 3987. (I would be open to allowing all non-ASCII equally (minus surrogates), as it's converted to ASCII anyway. That would also bring it more in line with how we deal with non-ASCII generally.) This relates to #379 as well. I think your rationale is quite sound though, and I would be open to making this change. But I think this would be the first intentional deviation from RFC 3986 validity. Would love to hear from others as well.
I tested a range of parsers:
Aye, the host behaviour is a bit annoying.
Are all proxy servers aware of the mentioned RFC, and do they implement it correctly? I worked on a URL shortener that used '[' and ']' as part of a shortened URL, and some mobile applications cut off the URL at those characters.
Background
RFC-3986 reserves two kinds of delimiters: `gen-delims`, which are used by the URL syntax itself (things like `?` and `#`), and `sub-delims`, which are available for use within a URL component to mark out subcomponents.
For example, `&` and `=` are in the `sub-delims` set, and they are used by query strings to encode key-value pairs.
It's important that we have a set of known subcomponent delimiters, because clients need the assurance that these characters can be used without escaping. An escaped and unescaped subcomponent delimiter must not be equivalent - for example, percent-escaping a `&` in the query string would merge adjacent key-value pairs and corrupt its meaning.
RFC-3986 also includes `[` and `]` in the `gen-delims` set, and does not allow their use anywhere except IP addresses. That document does not explain why these particular characters are forbidden elsewhere.
Its predecessor, RFC-2396, includes these characters in the `unwise` character set, and possibly provides more insight as to why they must be escaped in URLs:
This escaping means they cannot be used as subcomponent delimiters.
Problem
Despite the above, the URL standard today allows `[` and `]` to be used unescaped in URL query strings and fragments. They are not URL code-points, but they are tolerated, and have been for such a long time that an ecosystem has emerged which depends on them being available as subcomponent delimiters.
An example of this is the JavaScript `qs` library (>250m downloads per month), used by popular frameworks such as Express.js. It uses square brackets to denote nesting and arrays in key-value pairs.
Query-strings created by this library will use percent-encoded brackets. This is apparently undesirable though, so they added an option to skip percent-encoding key names, and users unhappy with the escaping are encouraged to use it.
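
The tolerance described above can be observed with the WHATWG `URL` class built into Node.js and browsers. This is only a sketch: the `example.com` URL and the key names are hypothetical, and `encodeURIComponent` stands in for the kind of generic encoder a query-string library might apply to key names.

```javascript
// The WHATWG URL parser leaves square brackets in the query and
// fragment untouched (hypothetical URL, for illustration only).
const url = new URL("https://example.com/items?filter[tags][]=a#section[1]");
console.log(url.search); // "?filter[tags][]=a"
console.log(url.hash);   // "#section[1]"

// A generic component encoder, by contrast, percent-encodes them.
const encodedKey = encodeURIComponent("filter[tags][]");
console.log(encodedKey); // "filter%5Btags%5D%5B%5D"
```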
Moreover, the use of an unsanctioned character as a subcomponent delimiter means that brackets in key names are ambiguous:
And the only way to break this ambiguity would be to say that escaped and unescaped square brackets might not be equivalent, as is the case with all other subcomponent delimiters.
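
As a rough sketch of what that non-equivalence could look like, the hypothetical helper `splitKey` below treats only unescaped brackets as delimiters and decodes percent-escapes afterwards. It is not how `qs` or any other library is implemented, and it ignores empty segments such as the `[]` array marker.

```javascript
// Hypothetical helper: split a raw (still percent-encoded) key into
// path segments, treating only unescaped "[" and "]" as delimiters.
function splitKey(rawKey) {
  return rawKey
    .split(/[\[\]]/)               // split on literal, unescaped brackets
    .filter((segment) => segment !== "") // drop empty pieces between "][" and at the end
    .map(decodeURIComponent);      // decode escapes only after splitting
}

const nested = splitKey("a[b]");      // brackets used as delimiters
const literal = splitKey("a%5Bb%5D"); // brackets are part of the key name

console.log(nested);  // [ 'a', 'b' ]
console.log(literal); // [ 'a[b]' ]
```

Under this reading, `a[b]` addresses property `b` nested under `a`, while `a%5Bb%5D` names a single key whose text happens to contain brackets.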
Proposed Resolution
I believe we should accept that unescaped `[` and `]` are a de-facto part of the web at this point, and include them as valid URL code-points. We already allow them to be used without escaping, and developers have been using them unescaped for some time.
Historically, there has never been any conflict with other URL components (indeed, IPv6 addresses now use square brackets) - there was only some concern about colliding delimiters when embedding URLs, but that concern seems to apply equally to regular parentheses `()`, which are allowed (and are actually used specifically as URL delimiters in Markdown). Ultimately, the issue of colliding delimiters is an issue for the embedding document to solve, not for the embedded content to attempt to second-guess.
Therefore, IMO, the presence of an unescaped square bracket should not be grounds to call the URL non-valid.