Commit bd51befe authored by Mustafa Emre Acer, committed by Commit Bot

Add a Readme for IDN display policy

This is the markdown version of https://www.chromium.org/developers/design-documents/idn-in-google-chrome
with cleanups and "Other Browsers" section removed. That section will need to be rewritten.

Change-Id: Ib7079412a6fcad91bfe963147e113788570e3265
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2108084
Commit-Queue: Mustafa Emre Acer <meacer@chromium.org>
Reviewed-by: Joe DeBlasio <jdeblasio@chromium.org>
Cr-Commit-Position: refs/heads/master@{#751443}
parent 1f326209
# Internationalized Domain Names (IDN) in Google Chrome
## Background
Many years ago, domains could only consist of the Latin letters A to Z, digits,
and a few other characters. [Internationalized Domain Names
(IDNs)](https://en.wikipedia.org/wiki/Internationalized_domain_name) were
created to better support non-Latin alphabets for web users around the globe.
Different characters from different (or even the same!) languages can look very
similar. We’ve seen
[reports](https://bugs.chromium.org/p/chromium/issues/detail?id=683314) of
proof-of-concept attacks that exploit this similarity. These are called [homograph
attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). For example, the
Latin "a" looks a lot like the Cyrillic "а", so someone could register
`http://ebаy.com` (using Cyrillic "`а`"), which could be confused for
`http://ebay.com`. This is a limitation of how URLs are displayed in browsers in
general, not a specific bug in Chrome.
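
To make the example concrete, here is a small Python sketch. It is not part of Chrome, and the helper name `differing_chars` is made up for illustration. It shows that the two hostnames differ in exactly one code point even though they render almost identically:

```python
import unicodedata

latin = "ebay.com"
spoof = "eb\u0430y.com"  # U+0430 CYRILLIC SMALL LETTER A replaces the Latin "a"


def differing_chars(a, b):
    """Return (index, name in a, name in b) for each position where the
    two equal-length strings contain different characters."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


print(latin == spoof)                 # False
print(differing_chars(latin, spoof))  # one differing position, index 2
```
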
In a perfect world, domain registrars would not allow these confusable domain
names to be registered. Some domain registrars do exactly that, mostly by
restricting the characters allowed, but many do not. To better protect against
these attacks, browsers display some domains in
[punycode](https://en.wikipedia.org/wiki/Punycode) (looks like `xn--...`)
instead of the original IDN, according to their own IDN policies.
This is a challenging problem space. Chrome has a global user base of billions
of people, many of whom do not view URLs in Latin letters. We want to prevent
confusion, while ensuring that users across languages have a great experience
in Chrome. Displaying either punycode or a visible security warning on too wide
a set of URLs would hurt web usability for people around the world.
Chrome and other browsers try to balance these needs by implementing IDN
policies in a way that allows IDN to be shown for valid domains, but protects
against confusable homograph attacks.
Chrome's IDN policy is one of several tools that aim to protect users.
[Google Safe Browsing](https://safebrowsing.google.com/) continues to help
protect over two billion devices every day by showing warnings to users when
they attempt to navigate to dangerous or deceptive sites or download dangerous
files. Password managers continue to remember which domain password logins are
for, and won’t automatically fill a password into a domain that is not an
exact match.
## How IDN works
IDNs were devised to support arbitrary Unicode characters in hostnames in a
backward-compatible way. This works by having user agents transform hostnames
containing non-ASCII Unicode characters into an ASCII-only hostname, which can
then be sent on to DNS servers. This is done by encoding each domain label into
its punycode representation. This representation consists of a four-character
prefix (`xn--`) followed by the Unicode label translated to ASCII Compatible
Encoding (ACE). For example, `http://öbb.at` is transformed to `http://xn--bb-eka.at`.
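
The transformation can be reproduced with Python's standard library, as a rough illustration only. The function names below are hypothetical. Python's built-in `idna` codec implements the older IDNA 2003 rules rather than the UTS 46 processing Chrome uses, so results can differ for some inputs:

```python
def label_to_ace(label):
    """Encode one domain label: ASCII labels pass through unchanged, others
    get the `xn--` prefix plus their punycode form. This skips the character
    mapping and validation steps a real IDNA implementation performs."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def hostname_to_ace(hostname):
    """Encode a whole hostname with Python's built-in IDNA 2003 codec."""
    return hostname.encode("idna").decode("ascii")


def hostname_to_unicode(hostname):
    """Decode an ACE hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


print(label_to_ace("öbb"))                   # xn--bb-eka
print(hostname_to_ace("öbb.at"))             # xn--bb-eka.at
print(hostname_to_unicode("xn--bb-eka.at"))  # öbb.at
```
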
## Google Chrome's IDN policy
Since version 51, Chrome has used an IDN display policy that does not take into
account the language settings (the Accept-Language list) of the browser. A
[similar strategy](https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm) is
used by Firefox.
Google Chrome decides if it should show Unicode or punycode for each domain
label (component) of a hostname separately. To decide if a component should be
shown in Unicode, Google Chrome uses the following algorithm:
1. Convert each component stored in the ACE to Unicode per [UTS 46 transitional
processing](http://unicode.org/reports/tr46/#Processing) (`ToUnicode`).
2. If there is an error in `ToUnicode` conversion (e.g. contains [disallowed
characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auts46%3Ddisallowed%3A%5D&abb=on&g=&i=),
[starts with a combining
mark](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da390a6b3d9844a1dcc1f99fb1ae478ecf),
or [violates BiDi
rules](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da8a9311811fb0f3db1644ac1a88056370)),
show punycode.
3. If there is a character in a label not belonging to [Characters allowed in
identifiers](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AIdentifierStatus%3DAllowed%3A&abb=on&g=&i=)
per [Unicode Technical Standard 39 (UTS
39)](http://www.unicode.org/reports/tr39/#Identifier_Status_and_Type), show
punycode.
4. If any character in a label belongs to [the disallowed
list](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu01CD-%5Cu01DC%5D+%5B%5Cu1c80-%5Cu1c8f%5D++%5B%5Cu1e90-%5Cu1e9b%5D++%5B%5Cu1f00-%5Cu1fff%5D++%5B%5Cua640-%5Cua69f%5D-%5B%5Cua720-%5Cua72f%5D+%5B%5Cu0338+%5Cu058a+%5Cu2010+%5Cu2019+%5Cu2027+%5Cu30a0+%5Cu02bb+%5Cu02bc+%5D&abb=on&g=&i=),
show punycode.
5. If the component uses characters drawn from multiple scripts, it is subject
to a script mixing check based on the ["Highly Restrictive" profile of UTS
39](http://www.unicode.org/reports/tr39/#Restriction_Level_Detection) with an
additional restriction on Latin. If the component fails the check, show the
component in punycode.
    - Latin, Cyrillic or Greek characters cannot be mixed with each other
    - Latin characters in the ASCII range can be mixed ONLY with Chinese (Han,
      Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja).
    - Han (CJK Ideographs) can be mixed with Bopomofo
    - Han can be mixed with Hiragana and Katakana
    - Han can be mixed with Korean Hangul
6. If two or more numbering systems (e.g. European digits + Bengali digits) are
mixed, show punycode.
7. If there are any invisible characters (e.g. a sequence of the same combining
mark or a sequence of Kana combining marks), show punycode.
8. Test the label for [mixed-script confusables per UTS
39](http://unicode.org/reports/tr39/#Mixed_Script_Confusables). If a mixed-script
confusable is detected, show punycode.
9. If a hostname belongs to a non-IDN TLD (top-level domain) such as 'com',
'net', or 'uk' and all the letters in a given label belong to [a set of Cyrillic
letters that look like Latin
letters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1%5D&g=gc&i=)
(e.g. [Cyrillic Small Letter
IE](http://unicode.org/cldr/utility/character.jsp?a=0435) - `е` ), show
punycode.
10. If the label matches a [dangerous
pattern](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&g=0&l=422),
show punycode.
11. If the [skeleton](http://unicode.org/reports/tr39/#def-skeleton) of the
registrable part of a hostname is identical to one of the top domains after
removing diacritic marks and mapping each character to its spoofing skeleton
(e.g. `www.googlé.com` with `é` in place of `e`), show punycode.
12. Otherwise, show Unicode.
This algorithm is implemented by `IDNToUnicodeOneComponent()` and
`IsIDNComponentSafe()` in
[`components/url_formatter/url_formatter.cc`](https://cs.chromium.org/search/?q=components/url_formatter/url_formatter.cc)
and the `IDNSpoofChecker` class in
[`components/url_formatter/spoof_checks/idn_spoof_checker.cc`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc).
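
As a rough illustration of a few of these checks, the Python sketch below approximates part of step 5 (Latin, Cyrillic and Greek only), step 6, step 9 and a much-simplified step 11. It is not Chrome's implementation, which is written in C++ on top of ICU. All function names, the `TOP_DOMAINS` set and the `NON_IDN_TLDS` set are assumptions made for this example. Scripts are guessed from Unicode character names because the Python standard library does not expose the Script property, and step 11 only strips diacritics instead of computing a full UTS 39 skeleton.

```python
import unicodedata
from urllib.parse import unquote


def rough_scripts(label):
    """Approximate each letter's script by the first word of its Unicode
    name (an assumption; real code uses the Unicode Script property)."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in label if ch.isalpha()}


def mixes_latin_cyrillic_greek(label):
    """Part of step 5: Latin, Cyrillic and Greek must not be mixed."""
    return len(rough_scripts(label) & {"LATIN", "CYRILLIC", "GREEK"}) > 1


def mixes_digit_systems(label):
    """Step 6: each decimal numbering system is identified by the code
    point of its zero digit; more than one zero means mixed systems."""
    zeros = {ord(ch) - unicodedata.decimal(ch)
             for ch in label if unicodedata.category(ch) == "Nd"}
    return len(zeros) > 1


# Step 9: the Latin-lookalike Cyrillic letters, decoded from the set
# expression in the step 9 link.
CYRILLIC_LATIN_LOOKALIKES = set(unquote(
    "%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B"
    "%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1"))
NON_IDN_TLDS = {"com", "net", "uk"}  # only the examples named in step 9


def is_cyrillic_lookalike_label(label, tld):
    """Step 9: every letter is a Latin-lookalike Cyrillic letter."""
    letters = [ch for ch in label if ch.isalpha()]
    return (tld in NON_IDN_TLDS and bool(letters)
            and all(ch in CYRILLIC_LATIN_LOOKALIKES for ch in letters))


TOP_DOMAINS = {"google.com"}  # hypothetical stand-in for the top-domain list


def strip_diacritics(text):
    """Decompose characters and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def resembles_top_domain(registrable):
    """Simplified step 11: diacritic removal only, no confusable mapping."""
    return (not registrable.isascii()
            and strip_diacritics(registrable).lower() in TOP_DOMAINS)


print(mixes_latin_cyrillic_greek("\u0430pple"))   # Cyrillic "а" + Latin "pple"
print(mixes_digit_systems("1\u09e7"))             # European + Bengali digit
print(is_cyrillic_lookalike_label("\u0430\u0440\u0440\u04cf\u0435", "com"))
print(resembles_top_domain("googl\u00e9.com"))
```

Each call above prints `True`, meaning the label would be shown in punycode under the corresponding rule. The checks are independent predicates, so a caller can evaluate them in the order given by the algorithm and stop at the first one that fires.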
## Reporting Security Bugs
We reward certain cases of IDN spoofs according to [Chrome's Vulnerability
Reward Program](https://www.google.com/about/appsecurity/chrome-rewards/index.html)
policies. Please see [this
document](https://docs.google.com/document/d/1_xJz3J9kkAPwk3pma6K3X12SyPTyyaJDSCxTfF8Y5sU/edit?usp=sharing)
before reporting a security bug.