Change the script mixing policy to highly restrictive

The current script mixing policy (moderately restrictive) allows mixing of Latin-ASCII and one non-Latin script (unless the non-Latin script is Cyrillic or Greek).

This CL tightens up the policy to block mixing of Latin-ASCII and a non-Latin script unless the non-Latin script is Chinese (Hanzi, Bopomofo), Japanese (Kanji, Hiragana, Katakana) or Korean (Hangul, Hanja).

Major gTLDs (.net/.org/.com) do not allow the registration of a domain that has both Latin and a non-Latin script. The only exception is names with Latin + Chinese/Japanese/Korean scripts. The same is true of ccTLDs with IDNs.

Given the above registration rules of major gTLDs and ccTLDs, allowing mixing of Latin and a non-Latin script other than CJK has no practical benefit. Meanwhile, domain names in TLDs with a laxer policy on script mixing remain open to spoofing attempts under the current moderately restrictive policy. To protect users from those risks, a few ad-hoc rules are in place.

By switching to highly restrictive, those ad-hoc rules can be removed, simplifying the IDN display policy implementation a bit.

This is also coordinated with Mozilla. See https://bugzilla.mozilla.org/show_bug.cgi?id=1399939 .

BUG=726950, 756226, 756456, 756735, 770465
TEST=components_unittests --gtest_filter=*IDN*

Change-Id: Ib96d0d588f7fcda38ffa0ce59e98a5bd5b439116
Reviewed-on: https://chromium-review.googlesource.com/688825
Reviewed-by: Brett Wilson <brettw@chromium.org>
Reviewed-by: Lucas Garron <lgarron@chromium.org>
Commit-Queue: Jungshik Shin <jshin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#506561}
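
To make the policy change concrete, here is a minimal sketch of a "highly restrictive" check in the sense of UTS #39: a label passes if it uses a single script, or if its scripts are covered by Latin plus one logical CJK script. This is not the Chromium or ICU implementation. `ScriptOf`, `IsSubset` and `PassesHighlyRestrictive` are hypothetical names, and the code point ranges are coarse block-level approximations of the real Unicode script data, which ICU tracks per character.

```cpp
#include <cassert>
#include <set>
#include <string>

// Coarse script buckets for illustration only (hypothetical, not ICU's).
enum class Script {
  kCommon, kLatin, kGreek, kCyrillic, kArmenian, kDevanagari, kThai,
  kHiragana, kKatakana, kBopomofo, kHan, kHangul, kUnknown
};

// Approximates the script of a code point from its block range.
Script ScriptOf(char32_t c) {
  if ((c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z'))
    return Script::kLatin;
  if (c < 0x80) return Script::kCommon;  // ASCII digits, hyphen, etc.
  struct Range { char32_t lo, hi; Script script; };
  static const Range kRanges[] = {
      {0x00C0, 0x024F, Script::kLatin},     {0x0370, 0x03FF, Script::kGreek},
      {0x0400, 0x052F, Script::kCyrillic},  {0x0530, 0x058F, Script::kArmenian},
      {0x0900, 0x097F, Script::kDevanagari},{0x0E00, 0x0E7F, Script::kThai},
      {0x1100, 0x11FF, Script::kHangul},    {0x3040, 0x309F, Script::kHiragana},
      {0x30A0, 0x30FF, Script::kKatakana},  {0x3100, 0x312F, Script::kBopomofo},
      {0x3400, 0x4DBF, Script::kHan},       {0x4E00, 0x9FFF, Script::kHan},
      {0xAC00, 0xD7AF, Script::kHangul},
  };
  for (const Range& r : kRanges)
    if (c >= r.lo && c <= r.hi) return r.script;
  return Script::kUnknown;
}

// Returns true if every element of |a| is also in |b|.
bool IsSubset(const std::set<Script>& a, const std::set<Script>& b) {
  for (Script s : a)
    if (b.count(s) == 0) return false;
  return true;
}

// Single script, or Latin + one logical CJK script; everything else fails.
bool PassesHighlyRestrictive(const std::u32string& label) {
  std::set<Script> used;
  for (char32_t c : label) {
    Script s = ScriptOf(c);
    if (s == Script::kUnknown) return false;  // Conservative in this sketch.
    if (s != Script::kCommon) used.insert(s);
  }
  if (used.size() <= 1) return true;  // Single script (plus Common).
  static const std::set<Script> kJapanese = {
      Script::kLatin, Script::kHan, Script::kHiragana, Script::kKatakana};
  static const std::set<Script> kChinese = {
      Script::kLatin, Script::kHan, Script::kBopomofo};
  static const std::set<Script> kKorean = {
      Script::kLatin, Script::kHan, Script::kHangul};
  return IsSubset(used, kJapanese) || IsSubset(used, kChinese) ||
         IsSubset(used, kKorean);
}
```

Under these assumptions, `PassesHighlyRestrictive(U"ab\u0939\u093f\u0928\u094d\u0926\u0940")` (Latin + Devanagari) and `PassesHighlyRestrictive(U"b\u057ds")` (Latin + Armenian) return false, while a Katakana label followed by `b1` still passes because Latin + Japanese is one of the permitted combinations. A moderately restrictive check would additionally accept Latin plus any one other recommended script except Cyrillic and Greek, which is the case this CL removes.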

Commit fd34ee82 authored Oct 04, 2017 by Jungshik Shin, committed by Commit Bot Oct 04, 2017
Showing with 17 additions and 21 deletions

components/url_formatter/idn_spoof_checker.cc +7 -19

components/url_formatter/url_formatter_unittest.cc +10 -2

--- a/components/url_formatter/idn_spoof_checker.cc
+++ b/components/url_formatter/idn_spoof_checker.cc
@@ -64,13 +64,14 @@ IDNSpoofChecker::IDNSpoofChecker() {
  // MIXED_SCRIPT_CONFUSABLE, WHOLE_SCRIPT_CONFUSABLE, MIXED_NUMBERS, ANY_CASE})
  // This default configuration is adjusted below as necessary.

-  // Set the restriction level to moderate. It allows mixing Latin with another
-  // script (+ COMMON and INHERITED). Except for Chinese(Han + Bopomofo),
-  // Japanese(Hiragana + Katakana + Han), and Korean(Hangul + Han), only one
-  // script other than Common and Inherited can be mixed with Latin. Cyrillic
-  // and Greek are not allowed to mix with Latin.
+  // Set the restriction level to high. It allows mixing Latin with one logical
+  // CJK script (+ COMMON and INHERITED), but does not allow any other script
+  // mixing (e.g. Latin + Cyrillic, Latin + Armenian, Cyrillic + Greek). Note
+  // that each of {Han + Bopomofo} for Chinese, {Hiragana, Katakana, Han} for
+  // Japanese, and {Hangul, Han} for Korean is treated as a single logical
+  // script.
  // See http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
-  uspoof_setRestrictionLevel(checker_, USPOOF_MODERATELY_RESTRICTIVE);
+  uspoof_setRestrictionLevel(checker_, USPOOF_HIGHLY_RESTRICTIVE);

  // Sets allowed characters in IDN labels and turns on USPOOF_CHAR_LIMIT.
  SetAllowedUnicodeSet(&status);
@@ -234,14 +235,9 @@ bool IDNSpoofChecker::SafeToDisplayAsUnicode(base::StringPiece16 label,
    //   label otherwise entirely in Katakna or Hiragana.
    // - Disallow U+0585 (Armenian Small Letter Oh) and U+0581 (Armenian Small
    //   Letter Co) to be next to Latin.
-    // - Disallow Latin 'o' and 'g' next to Armenian.
-    // - Disalow mixing of Latin and Canadian Syllabary.
-    // - Disalow mixing of Latin and Tifinagh.
    // - Disallow combining diacritical mark (U+0300-U+0339) after a non-LGC
    //   character. Other combining diacritical marks are not in the allowed
    //   character set.
-    // - Disallow Arabic non-spacing marks after non-Arabic characters.
-    // - Disallow Hebrew non-spacing marks after non-Hebrew characters.
    // - Disallow U+0307 (dot above) after 'i', 'j', 'l' or dotless i (U+0131).
    //   Dotless j (U+0237) is not in the allowed set to begin with.
    dangerous_pattern = new icu::RegexMatcher(
@@ -254,15 +250,7 @@ bool IDNSpoofChecker::SafeToDisplayAsUnicode(base::StringPiece16 label,
            R"(^[\p{scx=kana}]+[\u3078-\u307a][\p{scx=kana}]+$|)"
            R"(^[\p{scx=hira}]+[\u30d8-\u30da][\p{scx=hira}]+$|)"
            R"([a-z]\u30fb|\u30fb[a-z]|)"
-            R"(^[\u0585\u0581]+[a-z]|[a-z][\u0585\u0581]+$|)"
-            R"([a-z][\u0585\u0581]+[a-z]|)"
-            R"(^[og]+[\p{scx=armn}]|[\p{scx=armn}][og]+$|)"
-            R"([\p{scx=armn}][og]+[\p{scx=armn}]|)"
-            R"([\p{sc=cans}].*[a-z]|[a-z].*[\p{sc=cans}]|)"
-            R"([\p{sc=tfng}].*[a-z]|[a-z].*[\p{sc=tfng}]|)"
            R"([^\p{scx=latn}\p{scx=grek}\p{scx=cyrl}][\u0300-\u0339]|)"
-            R"([^\p{scx=arab}][\u064b-\u0655\u0670]|)"
-            R"([^\p{scx=hebr}]\u05b4|)"
            R"([ijl\u0131]\u0307)",
            -1, US_INV),
        0, status);

--- a/components/url_formatter/url_formatter_unittest.cc
+++ b/components/url_formatter/url_formatter_unittest.cc
@@ -205,10 +205,18 @@ const IDNTestCase idn_cases[] = {
     false},
    // Devanagari + Latin
    {"xn--ab-3ofh8fqbj6h.in", L"ab\x0939\x093f\x0928\x094d\x0926\x0940.in",
-     true},
+     false},
    // Thai + Latin
    {"xn--ab-jsi9al4bxdb6n.th",
-     L"ab\x0e20\x0e32\x0e29\x0e32\x0e44\x0e17\x0e22.th", true},
+     L"ab\x0e20\x0e32\x0e29\x0e32\x0e44\x0e17\x0e22.th", false},
+    // Armenian + Latin
+    {"xn--bs-red.com", L"b\x057ds.com", false},
+    // Tibetan + Latin
+    {"xn--foo-vkm.com", L"foo\x0f37.com", false},
+    // Oriya + Latin
+    {"xn--fo-h3g.com", L"fo\x0b66.com", false},
+    // Gujarati + Latin
+    {"xn--fo-isg.com", L"fo\x0ae6.com", false},
    // <vitamin in Katakana>b1.com
    {"xn--b1-xi4a7cvc9f.com",
     L"\x30d3\x30bf\x30df\x30f3"