[Lookalikes] Carve out common false positives from target embedding.

crrev.com/c/2437914 expanded what domains were captured as part of the
target embedding heuristic. However, it also introduced a large set of
new false positives that exceeds our ability to allowlist.

This CL makes three small changes to mitigate the impact of that change:
 1. It permits cross-TLD embeddings (e.g. google.com.mx no longer
    matches google.com).
 2. It adds additional words to our common words list that cause a
    number of high-profile false positives (e.g. "hotels").
 3. It adds a small set of domains that are important to protect from
    embedding, but use a common word in their name (e.g. "office.com").
    These domains are flagged if they're embedded, but not if the domain
    name ends with "-DOMAIN" (e.g. home-office.com).

These changes are small and strictly reduce the set of domains flagged
by target embedding, so they should be quite low risk. Alongside
proactive allowlisting, this should allow us to re-enable target
embedding.

Bug: 1150994
Change-Id: I1e0f562f0677e63a33068eed24ddca721d51aea3
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2568814
Commit-Queue: Joe DeBlasio <jdeblasio@chromium.org>
Reviewed-by: Mustafa Emre Acer <meacer@chromium.org>
Cr-Commit-Position: refs/heads/master@{#832634}
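
For illustration only, the three carve-outs described in this CL can be sketched as follows. This is not the Chromium implementation: the function names, the tiny suffix list, the target list, and the word lists are all hypothetical stand-ins, and the matching is deliberately simplified to two-label targets.

```python
import re

# All data below is hypothetical and only illustrates the three rules.
PUBLIC_SUFFIXES = {"com", "net", "org", "com.mx"}
TARGETS = ["google.com", "hotels.com", "office.com"]
COMMON_WORDS = {"hotels", "office"}
# Common-word domains that are still protected from embedding (rule 3).
PROTECTED_COMMON_WORD_TARGETS = {"office.com"}


def registrable_parts(host):
    """Return (label, suffix) using the longest suffix in PUBLIC_SUFFIXES."""
    labels = host.lower().split(".")
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            return labels[i - 1], suffix
    return None


def embedded_target(host):
    """Return the first target that `host` is flagged for embedding, else None."""
    parts = registrable_parts(host)
    if parts is None:
        return None
    label, suffix = parts
    registrable = label + "." + suffix
    # Embeddings may use either "." or "-" as a separator.
    tokens = re.split(r"[.-]", host.lower())

    for target in TARGETS:
        sld, tld = target.split(".")
        protected = target in PROTECTED_COMMON_WORD_TARGETS
        # Rule 2: common-word targets are skipped unless explicitly protected.
        if sld in COMMON_WORDS and not protected:
            continue
        # Look for the target's labels as consecutive tokens in the hostname.
        found = any(
            tokens[i] == sld and tokens[i + 1] == tld
            for i in range(len(tokens) - 1)
        )
        if not found:
            continue
        # Rule 1: same name on another TLD (google.com.mx) is not an embedding.
        if label == sld:
            continue
        # Rule 3: a protected target is not flagged if the domain ends in "-DOMAIN".
        if protected and registrable.endswith("-" + target):
            continue
        return target
    return None
```

Under these assumptions, `embedded_target("google.com.evil.com")` reports `google.com`, while `google.com.mx`, `hotels.com.evil.com`, and `home-office.com` are left unflagged.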