[Lookalikes] Carve out common false positives from target embedding.

crrev.com/c/2437914 expanded what domains were captured as part of the
target embedding heuristic. However, it also introduced a large number of
new false positives that exceed our ability to allowlist.

This CL makes three small changes to mitigate the impact of that change:
 1. It permits cross-TLD embeddings (e.g. google.com.mx no longer matches
    google.com).
 2. It adds words that cause a number of high-profile false positives
    (e.g. "hotels") to our common words list.
 3. It adds a small set of domains that are important to protect from
    embedding, but use a common word in their name (e.g. "office.com").
    These domains are flagged if they're embedded, but not if the
    embedding domain name ends with "-DOMAIN" (e.g. home-office.com).

These changes are pretty small, and strictly reduce the set of domains
flagged by target embedding, so should be quite low risk. Alongside
proactive allowlisting, this should allow us to re-enable target embedding.

Bug: 1150994
Change-Id: I1e0f562f0677e63a33068eed24ddca721d51aea3
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2568814
Commit-Queue: Joe DeBlasio <jdeblasio@chromium.org>
Reviewed-by: Mustafa Emre Acer <meacer@chromium.org>
Cr-Commit-Position: refs/heads/master@{#832634}

Commit 5a74962b, authored Dec 02, 2020 by Joe DeBlasio; committed by Chromium LUCI CQ on Dec 02, 2020
2 changed files
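
The two new predicates in this CL (the cross-TLD carve-out and the "-DOMAIN" ending carve-out) can be illustrated outside of Chromium with a minimal standalone sketch. This is not the code from the CL: `NaiveE2LD`, `IsCrossTLDMatchSketch`, `EndsWithPermittedDomainSketch`, and `CheckExamples` are hypothetical names, and the e2LD extraction here assumes the e2LD is simply the text before the first dot, whereas the real code relies on `url_formatter` and the registry (public suffix) data.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for registry-aware parsing: treats everything before
// the first '.' as the e2LD. Only valid for eTLD+1 inputs such as
// "google.com" or "google.com.mx"; real code consults registry data.
std::string_view NaiveE2LD(std::string_view etld_plus_one) {
  return etld_plus_one.substr(0, etld_plus_one.find('.'));
}

// Same three domains as kDomainsPermittedInEndEmbeddings in the diff.
constexpr std::array<std::string_view, 3> kPermittedEndings = {
    "office.com", "medium.com", "orange.fr"};

bool EndsWith(std::string_view text, std::string_view suffix) {
  return text.size() >= suffix.size() &&
         text.substr(text.size() - suffix.size()) == suffix;
}

// Change 1: the target and the embedding domain share an e2LD, so the
// "embedding" is just the same name under a different TLD.
bool IsCrossTLDMatchSketch(std::string_view target,
                           std::string_view embedding_domain) {
  return NaiveE2LD(target) == NaiveE2LD(embedding_domain);
}

// Change 3: the target is on the short list and the embedding domain ends
// with "-" followed by the target (e.g. "home-office.com").
bool EndsWithPermittedDomainSketch(std::string_view target,
                                   std::string_view embedding_domain) {
  for (std::string_view permitted : kPermittedEndings) {
    if (target == permitted &&
        EndsWith(embedding_domain, "-" + std::string(permitted))) {
      return true;
    }
  }
  return false;
}

// Mirrors the unit-test cases added in this CL.
void CheckExamples() {
  assert(IsCrossTLDMatchSketch("google.com", "google.com.mx"));
  assert(EndsWithPermittedDomainSketch("office.com", "example-office.com"));
  assert(!EndsWithPermittedDomainSketch("office.com", "office.com-foo.com"));
  assert(!EndsWithPermittedDomainSketch("google.com", "example-google.com"));
}
```

Under these assumptions, either predicate returning true means the embedding is allowed (not flagged), which matches how the CL ORs the new checks into `IsAllowedToBeEmbedded`.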
--- a/components/lookalikes/core/lookalike_url_util.cc
+++ b/components/lookalikes/core/lookalike_url_util.cc
@@ -15,6 +15,7 @@
 #include "base/memory/singleton.h"
 #include "base/metrics/field_trial_params.h"
 #include "base/metrics/histogram_macros.h"
+#include "base/strings/strcat.h"
 #include "base/strings/string_piece.h"
 #include "base/strings/string_split.h"
 #include "base/strings/string_util.h"
@@ -61,10 +62,18 @@ const base::FeatureParam<std::string> kAdditionalCommonWords{

 // We might not protect a domain whose e2LD is a common word in target embedding
 // based on the TLD that is paired with it.
-const char* kCommonWords[] = {"shop",  "jobs",     "live",   "info",  "study",
-                              "asahi", "weather",  "health", "forum", "radio",
-                              "ideal", "research", "france", "free",  "mobile",
-                              "sky",   "ask"};
+const char* kCommonWords[] = {
+    "shop",      "jobs",      "live",       "info",    "study",   "asahi",
+    "weather",   "health",    "forum",      "radio",   "ideal",   "research",
+    "france",    "free",      "mobile",     "sky",     "ask",     "booking",
+    "canada",    "dating",    "dictionary", "express", "hoteles", "hotels",
+    "investing", "jharkhand", "nifty"};
+
+// These domains are plausible lookalike targets, but they also use common words
+// in their names. Selectively prevent flagging embeddings where the embedder
+// ends in "-DOMAIN.TLD", since these tend to have higher false positive rates.
+const char* kDomainsPermittedInEndEmbeddings[] = {"office.com", "medium.com",
+                                                  "orange.fr"};

 // What separators can be used to separate tokens in target embedding spoofs?
 // e.g. www-google.com.example.com uses "-" (www-google) and "." (google.com).
@@ -258,8 +267,9 @@ bool DoesETLDPlus1MatchTopDomainOrEngagedSite(
  return false;
 }

-// Returns whether the provided token includes a common word, which is a common
-// indication of a likely false positive.
+// Returns whether the e2LD of the provided domain is a common word (e.g.
+// weather.com, ask.com). Target embeddings of these domains are often false
+// positives (e.g. "super-best-fancy-hotels.com" isn't spoofing "hotels.com").
 bool UsesCommonWord(const DomainInfo& domain) {
  std::vector<std::string> additional_common_words =
      base::SplitString(kAdditionalCommonWords.Get(), ",",
@@ -296,8 +306,36 @@ bool IsEmbeddingItself(const base::span<const base::StringPiece>& domain_labels,
  return false;
 }

+// Returns whether |embedded_target| and |embedding_domain| share the same e2LD,
+// (as in, e.g., google.com and google.org, or airbnb.com.br and airbnb.com).
+// Assumes |embedding_domain| is an eTLD+1.
+bool IsCrossTLDMatch(const DomainInfo& embedded_target,
+                     const std::string& embedding_domain) {
+  return (
+      embedded_target.domain_without_registry ==
+      url_formatter::top_domains::HostnameWithoutRegistry(embedding_domain));
+}
+
+// Returns whether |embedded_target| is one of kDomainsPermittedInEndEmbeddings
+// and |embedding_domain| ends with "-" followed by that domain (i.e. is of
+// the form "*-example.com" for an example.com in that list). For example,
+// returns true if |embedded_target| is "office.com" and |embedding_domain| is
+// "evil-office.com". Only impacts Target Embedding matches.
+bool EndsWithPermittedDomains(const DomainInfo& embedded_target,
+                              const std::string& embedding_domain) {
+  for (auto* permitted_ending : kDomainsPermittedInEndEmbeddings) {
+    if (embedded_target.domain_and_registry == permitted_ending &&
+        base::EndsWith(embedding_domain,
+                       base::StrCat({"-", permitted_ending}))) {
+      return true;
+    }
+  }
+  return false;
+}
+
 // A domain is allowed to be embedded if is embedding itself, if its e2LD is a
-// common word or any valid partial subdomain is allowlisted.
+// common word, any valid partial subdomain is allowlisted, or if it's a
+// cross-TLD match (e.g. google.com vs google.com.mx).
 bool IsAllowedToBeEmbedded(
    const DomainInfo& embedded_target,
    const base::span<const base::StringPiece>& subdomain_span,
@@ -305,7 +343,9 @@ bool IsAllowedToBeEmbedded(
    const std::string& embedding_domain) {
  return UsesCommonWord(embedded_target) ||
         ASubdomainIsAllowlisted(subdomain_span, in_target_allowlist) ||
-         IsEmbeddingItself(subdomain_span, embedding_domain);
+         IsEmbeddingItself(subdomain_span, embedding_domain) ||
+         IsCrossTLDMatch(embedded_target, embedding_domain) ||
+         EndsWithPermittedDomains(embedded_target, embedding_domain);
 }

 }  // namespace

--- a/components/lookalikes/core/lookalike_url_util_unittest.cc
+++ b/components/lookalikes/core/lookalike_url_util_unittest.cc
@@ -267,6 +267,17 @@ TEST(LookalikeUrlUtilTest, TargetEmbeddingTest) {
       TargetEmbeddingType::kInterstitial},
      {"google.com-google.com-google.com", "google.com",
       TargetEmbeddingType::kInterstitial},
+
+      // Ignore end-of-domain embeddings when they're also cross-TLD matches.
+      {"google.com.mx", "", TargetEmbeddingType::kNone},
+
+      // For a small set of high-value domains that are also common words (see
+      // kDomainsPermittedInEndEmbeddings), we block all embeddings except those
+      // at the very end of the domain (e.g. foo-{domain.com}). Ensure this
+      // works for domains on the list, but not for others.
+      {"office.com-foo.com", "office.com", TargetEmbeddingType::kInterstitial},
+      {"example-office.com", "", TargetEmbeddingType::kNone},
+      {"example-google.com", "google.com", TargetEmbeddingType::kInterstitial},
  };

  for (auto& test_case : kTestCases) {
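
For readers unfamiliar with the common-word carve-out that this CL extends, the following is a rough standalone sketch of the idea behind `UsesCommonWord`: a built-in word list plus a comma-separated list of extra words (supplied in Chromium through the `kAdditionalCommonWords` feature param). The body of the real function is only partly visible in the diff, so the details here are assumptions: `SplitOnComma` and `UsesCommonWordSketch` are hypothetical names, the built-in list is abbreviated, dropping empty entries is this sketch's own choice, and the words "travel" and "news" in the usage below are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Splits |csv| on ',' and drops empty pieces (a choice made for this sketch).
std::vector<std::string> SplitOnComma(std::string_view csv) {
  std::vector<std::string> pieces;
  std::size_t start = 0;
  while (start <= csv.size()) {
    std::size_t end = csv.find(',', start);
    if (end == std::string_view::npos) end = csv.size();
    if (end > start) pieces.emplace_back(csv.substr(start, end - start));
    start = end + 1;
  }
  return pieces;
}

// Returns true if |e2ld| exactly equals a built-in common word or one of the
// extra words in |additional_words_csv|.
bool UsesCommonWordSketch(std::string_view e2ld,
                          std::string_view additional_words_csv) {
  // Abbreviated subset of kCommonWords from the diff.
  static constexpr std::string_view kBuiltIn[] = {"shop", "weather", "ask",
                                                  "booking", "hotels"};
  if (std::find(std::begin(kBuiltIn), std::end(kBuiltIn), e2ld) !=
      std::end(kBuiltIn)) {
    return true;
  }
  const std::vector<std::string> extra = SplitOnComma(additional_words_csv);
  return std::find(extra.begin(), extra.end(), e2ld) != extra.end();
}
```

With this sketch, `UsesCommonWordSketch("hotels", "")` is true because "hotels" is built in, while `UsesCommonWordSketch("travel", "travel,news")` is true only because of the extra list.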