Meta: UTS46 feedback #744

@annevk

Feedback I submitted to be considered for the Unicode April 2023 meeting.


Chromium will ship Nontransitional Processing soon: https://chromestatus.com/feature/5105856067141632. That covers all browser engines. I suggest taking that opportunity to simplify this document and its test suite and declare the transition period for which this conditional existed to be over.


Steps don't always consider that domain labels can be empty, e.g., when CheckBidi is true the first subrule of "The Bidi Rule" inspects the first character of a label. I think that might also apply to CheckJoiners and potentially other steps. (I initially thought the problem here was VerifyDnsLength not being considered, but that check happens much later on in the processing model so it's something more fundamental.)
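A minimal sketch of the empty-label problem, assuming an implementation that follows the first subrule of The Bidi Rule literally. The function names are hypothetical, and returning `None` for an empty label is only a placeholder: what the step should do in that case is exactly the open question.

```python
import unicodedata


def subrule1_naive(label: str) -> bool:
    # Reads label[0] unconditionally, so it raises IndexError when label == "".
    return unicodedata.bidirectional(label[0]) in {"L", "R", "AL"}


def subrule1_guarded(label: str):
    # Assumption: None signals "nothing to inspect"; the spec text does not
    # say what should happen for an empty label.
    if not label:
        return None
    return unicodedata.bidirectional(label[0]) in {"L", "R", "AL"}
```

Calling `subrule1_naive("")` fails with an `IndexError`, while `subrule1_guarded("")` makes the undefined case explicit.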


Please change U+2260 (≠), U+226E (≮), and U+226F (≯) from disallowed_STD3_valid to valid.

These code points are not decomposed so they can never conflict with =, <, and >. And they are not inherently more confusing than any of the other allowed code points, which include hieroglyphics and emoji. These code points also work as-is in all browser engines (while < and > are forbidden) and on balance preference ought to be given to retaining compatibility so end users are not prevented from visiting websites or seeing subresources that might use these code points in their domain for one reason or another.
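As a rough illustration of the "not decomposed" point, the check below assumes NFC is the normalization form that matters here and uses Python's `unicodedata`, whose data reflects whatever Unicode version ships with the interpreter. The helper name is hypothetical.

```python
import unicodedata

CODE_POINTS = (0x2260, 0x226E, 0x226F)  # ≠, ≮, ≯


def survives_nfc(ch: str) -> bool:
    # True when NFC leaves the character as a single, unchanged code point.
    return unicodedata.normalize("NFC", ch) == ch


for cp in CODE_POINTS:
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    # None of =, <, > should appear in the normalized output.
    has_ascii = any(c in "=<>" for c in nfc)
    print(f"U+{cp:04X} unchanged={survives_nfc(ch)} contains_ascii_sign={has_ascii}")
```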

For further background and discussion please see #733.

Thank you!

#733 (comment)


I have worked on importing IdnaTestV2.txt into web-platform-tests, the test framework used by all web browsers. The goal was to meet the requirements of the domain to ASCII algorithm specified at https://url.spec.whatwg.org/#idna with beStrict initialized to false.

As such, I attempted to filter out ToASCII statuses for UseSTD3ASCIIRules, CheckHyphens, and VerifyDnsLength, hoping that any remaining statuses would indicate a required failure.

You can find my work at web-platform-tests/wpt#38080.

I ran into the following issues. Most of them relate to status annotation. IPv4 address confusion was the one issue that did not relate to statuses.

  • VerifyDnsLength is not P4, but rather A4_1 and A4_2.
  • Tests that use trailing ASCII digit labels (or such a label followed by a dot) are not useful for browsers, as they trigger the IPv4 parser, which will then usually return failure because the input is not actually an IPv4 address string. This is a problem for a number of the A4_1 and A4_2 tests, and also for a large number of later tests, such as ToASCII("xn--gl0as212a.8.") or ToASCII("1.27"). I wrote a filter to exclude them, but it would be better if they were adjusted slightly (e.g., made to contain one non-EN code point) so that what they aim to test can also be tested in browsers. (Note that the IPv4 parser runs after domain to ASCII, but the web platform doesn't provide a way to invoke domain to ASCII on its own and probably never will.)
  • The test for ToASCII("$") is marked P1 and V6, not U1. This also affects numerous tests with <, >, and =. If they continue to have multiple statuses that will also make it impossible to filter them in an automated fashion. (This also applies to non-ASCII UseSTD3ASCIIRules code points, but I filed a separate request to remove those.)
  • NV8 is not used as a status.
  • A3 and X3 do not appear to be used as statuses. (These are presumably covered by P4.)
  • CheckBidi is not V8. V8 does not appear to be used. You'd have to filter out all B1-6 statuses instead.
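The two kinds of filtering described above could look roughly like the sketch below. It is not the filter from the linked pull request: the function names and the `IGNORED` set are hypothetical, the status codes are only those named in this issue (CheckHyphens codes are left out because they are not named here), and the digit check is simpler than the URL Standard's own ends-in-a-number check.

```python
# Statuses this issue associates with checks that are off when beStrict is false.
# Assumption: U1 stands for UseSTD3ASCIIRules; A4_1/A4_2 stand for VerifyDnsLength.
IGNORED = {"U1", "A4_1", "A4_2"}


def ends_in_ascii_digit_label(domain: str) -> bool:
    # True when the last label (ignoring one trailing dot) is all ASCII digits,
    # i.e. the input would be handed to the IPv4 parser in a browser.
    labels = domain.split(".")
    if len(labels) > 1 and labels[-1] == "":
        labels.pop()
    last = labels[-1]
    return last != "" and last.isascii() and last.isdigit()


def expects_failure(statuses) -> bool:
    # Any status left after dropping the ignored ones counts as a required failure.
    return bool(set(statuses) - IGNORED)


print(ends_in_ascii_digit_label("xn--gl0as212a.8."))  # excluded from the run
print(ends_in_ascii_digit_label("1.27"))              # excluded from the run
print(expects_failure(["A4_1"]))                      # filtered out cleanly
print(expects_failure(["P1", "V6"]))                  # "$" case: cannot be filtered by status
```

The last call shows why the `P1`/`V6` annotation on `ToASCII("$")` is a problem: with no `U1` status there is nothing for an automated filter to key on.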

An issue reported against the URL Standard indicated that the current CheckBidi handling from UTS 46 is rather strict: #543. Namely, domains containing RTL-labels cannot have labels consisting solely of ASCII digits preceding them (such labels are invalid per The Bidi Rule subrule 1). This ends up rejecting a number of domains in the wild and also seems unnecessarily restrictive for RTL users.

In that issue I worked with Harald Alvestrand (one of the editors of RFC 5893: Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA)) on a specific set of changes for UTS 46 that would remedy this issue, while still imposing the majority of Bidi-related requirements present in UTS 46 today.

The proposed changes are:

  1. Remove step 8 of https://unicode.org/reports/tr46/#Validity_Criteria as Validity Criteria only operates on a single label. (Although it somehow claims to have knowledge about the domain_name string as well...)
  2. Add a new step 5 to https://unicode.org/reports/tr46/#Processing. (Note that due to step 4 we will have U-labels.)

The new step 5 would read as follows:

  • If CheckBidi, and the domain_name string is a Bidi domain name, record there was an error if neither of the following conditions is true:
    • All labels in the domain_name string satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
    • RTL labels in the domain_name string are immediately followed by an LDH label whose first code point is not of class EN and all labels in the domain_name string are either LDH labels or satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
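One possible reading of the proposed step, as a sketch rather than a reference implementation. Assumptions: labels are non-empty and already U-labels; an RTL label is one containing an R, AL, or AN code point (RFC 5893); "immediately followed by" is read literally, so an RTL label in final position does not meet the second condition; the LDH check is simplified and ignores length limits. All function names are hypothetical.

```python
import re
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def bidi_classes(label):
    return [unicodedata.bidirectional(ch) for ch in label]


def is_rtl_label(label):
    return any(c in {"R", "AL", "AN"} for c in bidi_classes(label))


def is_ldh_label(label):
    # Simplified: ASCII letters, digits, hyphen; no leading or trailing hyphen.
    return LDH.fullmatch(label) is not None


def satisfies_bidi_rule(label):
    classes = bidi_classes(label)
    if not classes or classes[0] not in {"L", "R", "AL"}:  # subrule 1
        return False  # assumption: an empty label is treated as failing
    end = len(classes)
    while classes[end - 1] == "NSM":  # ignore trailing NSM for subrules 3 and 6
        end -= 1
    last = classes[end - 1]
    if classes[0] in {"R", "AL"}:
        return (set(classes) <= RTL_ALLOWED                      # subrule 2
                and last in {"R", "AL", "EN", "AN"}              # subrule 3
                and not ("EN" in classes and "AN" in classes))   # subrule 4
    return set(classes) <= LTR_ALLOWED and last in {"L", "EN"}   # subrules 5, 6


def passes_step5(domain_name):
    labels = domain_name.split(".")
    if not any(is_rtl_label(label) for label in labels):
        return True  # not a Bidi domain name, so the step does not apply
    if all(satisfies_bidi_rule(label) for label in labels):
        return True  # first condition
    for i, label in enumerate(labels):
        if is_rtl_label(label):
            nxt = labels[i + 1] if i + 1 < len(labels) else None
            if (nxt is None or not is_ldh_label(nxt)
                    or unicodedata.bidirectional(nxt[0]) == "EN"):
                return False
    # second condition, remaining half
    return all(is_ldh_label(l) or satisfies_bidi_rule(l) for l in labels)


print(passes_step5("123.\u05d0.com"))  # digit-only label before an RTL label
print(passes_step5("\u05d0.123"))      # RTL label followed by a digit-only label
```

Under this reading, a digit-only label before an RTL label no longer causes an error as long as the RTL label is followed by an LDH label that does not start with a digit.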

Thank you for your consideration. This is probably the final IDNA-related issue from the URL Standard. Once all of them have been resolved I’ll work with browser implementers to ensure the changes (if any) get implemented so we can finally declare victory on IDNA interoperability.

Labels: i18n-tracker, meta, topic: idna
