Commit 5a74962b authored by Joe DeBlasio, committed by Chromium LUCI CQ

[Lookalikes] Carve out common false positives from target embedding.

crrev.com/c/2437914 expanded which domains are captured by the target
embedding heuristic, but it also introduced a large number of new false
positives, more than we are able to allowlist.

This CL makes three small changes to mitigate the impact of that change:
 1. It permits cross-TLD embeddings (e.g. google.com.mx no longer
    matches google.com).
 2. It adds additional words to our common words list that cause a
    number of high-profile false positives (e.g. "hotels").
 3. It adds a small set of domains that are important to protect
    from embedding, but use a common word in their name (e.g.
    "office.com"). These domains are flagged if they're embedded, but
    not if the embedding domain ends with "-DOMAIN" (e.g.
    home-office.com).

These changes are pretty small, and strictly reduce the set of domains
flagged by target embedding, so should be quite low risk. Alongside
proactive allowlisting, this should allow us to re-enable target
embedding.

Bug: 1150994
Change-Id: I1e0f562f0677e63a33068eed24ddca721d51aea3
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2568814
Commit-Queue: Joe DeBlasio <jdeblasio@chromium.org>
Reviewed-by: Mustafa Emre Acer <meacer@chromium.org>
Cr-Commit-Position: refs/heads/master@{#832634}
parent 91a4557e
......@@ -15,6 +15,7 @@
#include "base/memory/singleton.h"
#include "base/metrics/field_trial_params.h"
#include "base/metrics/histogram_macros.h"
#include "base/strings/strcat.h"
#include "base/strings/string_piece.h"
#include "base/strings/string_split.h"
#include "base/strings/string_util.h"
......@@ -61,10 +62,18 @@ const base::FeatureParam<std::string> kAdditionalCommonWords{
// We might not protect a domain whose e2LD is a common word in target embedding
// based on the TLD that is paired with it.
const char* kCommonWords[] = {"shop", "jobs", "live", "info", "study",
"asahi", "weather", "health", "forum", "radio",
"ideal", "research", "france", "free", "mobile",
"sky", "ask"};
const char* kCommonWords[] = {
"shop", "jobs", "live", "info", "study", "asahi",
"weather", "health", "forum", "radio", "ideal", "research",
"france", "free", "mobile", "sky", "ask", "booking",
"canada", "dating", "dictionary", "express", "hoteles", "hotels",
"investing", "jharkhand", "nifty"};
// These domains are plausible lookalike targets, but they also use common words
// in their names. Selectively prevent flagging embeddings where the embedder
// ends in "-DOMAIN.TLD", since these tend to have higher false positive rates.
const char* kDomainsPermittedInEndEmbeddings[] = {"office.com", "medium.com",
"orange.fr"};
// What separators can be used to separate tokens in target embedding spoofs?
// e.g. www-google.com.example.com uses "-" (www-google) and "." (google.com).
......@@ -258,8 +267,9 @@ bool DoesETLDPlus1MatchTopDomainOrEngagedSite(
return false;
}
// Returns whether the provided token includes a common word, which is a common
// indication of a likely false positive.
// Returns whether the e2LD of the provided domain is a common word (e.g.
// weather.com, ask.com). Target embeddings of these domains are often false
// positives (e.g. "super-best-fancy-hotels.com" isn't spoofing "hotels.com").
bool UsesCommonWord(const DomainInfo& domain) {
std::vector<std::string> additional_common_words =
base::SplitString(kAdditionalCommonWords.Get(), ",",
......@@ -296,8 +306,36 @@ bool IsEmbeddingItself(const base::span<const base::StringPiece>& domain_labels,
return false;
}
// Returns whether |embedded_target| and |embedding_domain| share the same e2LD,
// (as in, e.g., google.com and google.org, or airbnb.com.br and airbnb.com).
// Assumes |embedding_domain| is an eTLD+1.
bool IsCrossTLDMatch(const DomainInfo& embedded_target,
const std::string& embedding_domain) {
return (
embedded_target.domain_without_registry ==
url_formatter::top_domains::HostnameWithoutRegistry(embedding_domain));
}
// Returns whether |embedded_target| is one of kDomainsPermittedInEndEmbeddings
// and |embedding_domain| ends with "-" followed by that domain (e.g. returns
// true if |embedded_target| is "office.com" and |embedding_domain| is
// "evil-office.com"). Only impacts Target Embedding matches.
bool EndsWithPermittedDomains(const DomainInfo& embedded_target,
const std::string& embedding_domain) {
for (auto* permitted_ending : kDomainsPermittedInEndEmbeddings) {
if (embedded_target.domain_and_registry == permitted_ending &&
base::EndsWith(embedding_domain,
base::StrCat({"-", permitted_ending}))) {
return true;
}
}
return false;
}
// A domain is allowed to be embedded if is embedding itself, if its e2LD is a
// common word or any valid partial subdomain is allowlisted.
// common word, any valid partial subdomain is allowlisted, or if it's a
// cross-TLD match (e.g. google.com vs google.com.mx).
bool IsAllowedToBeEmbedded(
const DomainInfo& embedded_target,
const base::span<const base::StringPiece>& subdomain_span,
......@@ -305,7 +343,9 @@ bool IsAllowedToBeEmbedded(
const std::string& embedding_domain) {
return UsesCommonWord(embedded_target) ||
ASubdomainIsAllowlisted(subdomain_span, in_target_allowlist) ||
IsEmbeddingItself(subdomain_span, embedding_domain);
IsEmbeddingItself(subdomain_span, embedding_domain) ||
IsCrossTLDMatch(embedded_target, embedding_domain) ||
EndsWithPermittedDomains(embedded_target, embedding_domain);
}
} // namespace
......
......@@ -267,6 +267,17 @@ TEST(LookalikeUrlUtilTest, TargetEmbeddingTest) {
TargetEmbeddingType::kInterstitial},
{"google.com-google.com-google.com", "google.com",
TargetEmbeddingType::kInterstitial},
// Ignore end-of-domain embeddings when they're also cross-TLD matches.
{"google.com.mx", "", TargetEmbeddingType::kNone},
// For a small set of high-value domains that are also common words (see
// kDomainsPermittedInEndEmbeddings), we block all embeddings except those
// at the very end of the domain (e.g. foo-{domain.com}). Ensure this
// works for domains on the list, but not for others.
{"office.com-foo.com", "office.com", TargetEmbeddingType::kInterstitial},
{"example-office.com", "", TargetEmbeddingType::kNone},
{"example-google.com", "google.com", TargetEmbeddingType::kInterstitial},
};
for (auto& test_case : kTestCases) {
......