Commit fd34ee82 authored by Jungshik Shin's avatar Jungshik Shin Committed by Commit Bot

Change the script mixing policy to highly restrictive

The current script mixing policy (moderately restricitive) allows
mixing of Latin-ASCII and one non-Latin script (unless the non-Latin
script is Cyrillic or Greek).

This CL tightens up the policy to block mixing of Latin-ASCII and
a non-Latin script unless the non-Latin script is Chinese (Hanzi,
Bopomofo), Japanese (Kanji, Hiragana, Katakana) or Korean (Hangul,
Hanja).

Major gTLDs (.net/.org/.com) do not allow the registration of
a domain that has both Latin and a non-Latin script. The only
exception is names with Latin + Chinese/Japanese/Korean scripts.
The same is true of ccTLDs with IDNs.

Given the above registration rules of major gTLDs and ccTLDs, allowing
mixing of Latin and non-Latin other than CJK has no practical effect. In
the meantime, domain names in TLDs with a laxer policy on script mixing
would be subject to a potential spoofing attempt with the current
moderately restrictive script mixing policy. To protect users from those
risks, there are a few ad-hoc rules in place.

By switching to highly restrictive those ad-hoc rules can be removed
simplifying the IDN display policy implementation a bit.

This is also coordinated with Mozilla. See
https://bugzilla.mozilla.org/show_bug.cgi?id=1399939 .

BUG=726950, 756226, 756456, 756735, 770465
TEST=components_unittests --gtest_filter=*IDN*

Change-Id: Ib96d0d588f7fcda38ffa0ce59e98a5bd5b439116
Reviewed-on: https://chromium-review.googlesource.com/688825Reviewed-by: default avatarBrett Wilson <brettw@chromium.org>
Reviewed-by: default avatarLucas Garron <lgarron@chromium.org>
Commit-Queue: Jungshik Shin <jshin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#506561}
parent 4ea1bb80
......@@ -64,13 +64,14 @@ IDNSpoofChecker::IDNSpoofChecker() {
// MIXED_SCRIPT_CONFUSABLE, WHOLE_SCRIPT_CONFUSABLE, MIXED_NUMBERS, ANY_CASE})
// This default configuration is adjusted below as necessary.
// Set the restriction level to moderate. It allows mixing Latin with another
// script (+ COMMON and INHERITED). Except for Chinese(Han + Bopomofo),
// Japanese(Hiragana + Katakana + Han), and Korean(Hangul + Han), only one
// script other than Common and Inherited can be mixed with Latin. Cyrillic
// and Greek are not allowed to mix with Latin.
// Set the restriction level to high. It allows mixing Latin with one logical
// CJK script (+ COMMON and INHERITED), but does not allow any other script
// mixing (e.g. Latin + Cyrillic, Latin + Armenian, Cyrillic + Greek). Note
// that each of {Han + Bopomofo} for Chinese, {Hiragana, Katakana, Han} for
// Japanese, and {Hangul, Han} for Korean is treated as a single logical
// script.
// See http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
uspoof_setRestrictionLevel(checker_, USPOOF_MODERATELY_RESTRICTIVE);
uspoof_setRestrictionLevel(checker_, USPOOF_HIGHLY_RESTRICTIVE);
// Sets allowed characters in IDN labels and turns on USPOOF_CHAR_LIMIT.
SetAllowedUnicodeSet(&status);
......@@ -234,14 +235,9 @@ bool IDNSpoofChecker::SafeToDisplayAsUnicode(base::StringPiece16 label,
// label otherwise entirely in Katakna or Hiragana.
// - Disallow U+0585 (Armenian Small Letter Oh) and U+0581 (Armenian Small
// Letter Co) to be next to Latin.
// - Disallow Latin 'o' and 'g' next to Armenian.
// - Disalow mixing of Latin and Canadian Syllabary.
// - Disalow mixing of Latin and Tifinagh.
// - Disallow combining diacritical mark (U+0300-U+0339) after a non-LGC
// character. Other combining diacritical marks are not in the allowed
// character set.
// - Disallow Arabic non-spacing marks after non-Arabic characters.
// - Disallow Hebrew non-spacing marks after non-Hebrew characters.
// - Disallow U+0307 (dot above) after 'i', 'j', 'l' or dotless i (U+0131).
// Dotless j (U+0237) is not in the allowed set to begin with.
dangerous_pattern = new icu::RegexMatcher(
......@@ -254,15 +250,7 @@ bool IDNSpoofChecker::SafeToDisplayAsUnicode(base::StringPiece16 label,
R"(^[\p{scx=kana}]+[\u3078-\u307a][\p{scx=kana}]+$|)"
R"(^[\p{scx=hira}]+[\u30d8-\u30da][\p{scx=hira}]+$|)"
R"([a-z]\u30fb|\u30fb[a-z]|)"
R"(^[\u0585\u0581]+[a-z]|[a-z][\u0585\u0581]+$|)"
R"([a-z][\u0585\u0581]+[a-z]|)"
R"(^[og]+[\p{scx=armn}]|[\p{scx=armn}][og]+$|)"
R"([\p{scx=armn}][og]+[\p{scx=armn}]|)"
R"([\p{sc=cans}].*[a-z]|[a-z].*[\p{sc=cans}]|)"
R"([\p{sc=tfng}].*[a-z]|[a-z].*[\p{sc=tfng}]|)"
R"([^\p{scx=latn}\p{scx=grek}\p{scx=cyrl}][\u0300-\u0339]|)"
R"([^\p{scx=arab}][\u064b-\u0655\u0670]|)"
R"([^\p{scx=hebr}]\u05b4|)"
R"([ijl\u0131]\u0307)",
-1, US_INV),
0, status);
......
......@@ -205,10 +205,18 @@ const IDNTestCase idn_cases[] = {
false},
// Devanagari + Latin
{"xn--ab-3ofh8fqbj6h.in", L"ab\x0939\x093f\x0928\x094d\x0926\x0940.in",
true},
false},
// Thai + Latin
{"xn--ab-jsi9al4bxdb6n.th",
L"ab\x0e20\x0e32\x0e29\x0e32\x0e44\x0e17\x0e22.th", true},
L"ab\x0e20\x0e32\x0e29\x0e32\x0e44\x0e17\x0e22.th", false},
// Armenian + Latin
{"xn--bs-red.com", L"b\x057ds.com", false},
// Tibetan + Latin
{"xn--foo-vkm.com", L"foo\x0f37.com", false},
// Oriya + Latin
{"xn--fo-h3g.com", L"fo\x0b66.com", false},
// Gujarati + Latin
{"xn--fo-isg.com", L"fo\x0ae6.com", false},
// <vitamin in Katakana>b1.com
{"xn--b1-xi4a7cvc9f.com",
L"\x30d3\x30bf\x30df\x30f3"
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment