Commit dca111a7 authored by liberato@chromium.org, committed by Commit Bot

Speed up nominal features in RandomTreeTrainer.

This CL turns off one-hot conversion for nominal features, and
instead adds a similar effect directly in RandomTreeTrainer.

Instead of converting each nominal feature of N values into N binary
features, have the tree builder pick uniformly at random one of the
N values to do the split.  This choice is made without regard to the
number of examples with a particular value, since that matches what
one-hot encoding would do.  For example, if 9 examples have value "A"
and 1 has value "B", then "A" and "B" each have a 50% chance of
being the split point.  For one-hot, there would be two binary
features, and the split selection would similarly pick between them
with equal probability.
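
As a minimal sketch of the value pick (illustrative only; the
function, types, and names below are not the actual RandomTreeTrainer
code), the choice amounts to:

    #include <iterator>
    #include <random>
    #include <set>

    // Illustrative sketch, not the actual Chromium code.
    // Pick one of the distinct nominal values seen at this node,
    // uniformly at random, ignoring how many examples carry each value.
    int PickNominalSplitValue(const std::set<int>& values,
                              std::mt19937* rng) {
      std::uniform_int_distribution<size_t> dist(0, values.size() - 1);
      auto it = values.begin();
      // std::set iterators aren't random access, so step to the chosen one.
      std::advance(it, dist(*rng));
      return *it;
    }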

There is one difference from one-hot conversion, however.  The new
system still first picks, uniformly, a set of features to split on,
then chooses (again uniformly) which value of the chosen feature to
split on.  That first pick is different, since one-hot would give
each value equal weight across multiple features.  For example,
if we have two features f1 and f2 with values {A,B} and {C,D,E,F},
then one-hot would pick a split uniformly over the resulting 6
features, each of which was a feature value in one of the original
nominals.  Now, we'll pick uniformly between f1 and f2, then
uniformly again from either {A,B} or {C,D,E,F}, depending on whether
we chose f1 or f2.

With one-hot encoding, we had twice the chance of picking an f2
value as an f1 value.  We could emulate this when choosing features,
but the simpler approach seems to work well enough.
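
Worked out for the f1/f2 example above (ignoring the random feature-
subset selection, which applies equally to both schemes):

    one-hot: 6 candidate features {A,B,C,D,E,F}, each picked with
             probability 1/6, so an f2 value is picked with
             probability 4/6 = 2/3 and an f1 value with 2/6 = 1/3.
    now:     P(A) = P(B) = (1/2)*(1/2) = 1/4 and
             P(C) = P(D) = P(E) = P(F) = (1/2)*(1/4) = 1/8,
             so f2 as a whole gets probability 1/2 rather than 2/3.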

The reason this is faster is that, for M features of N nominal
values each, one-hot would generate M*N features to search over at
each node.  Now, we have only M features, plus a search over N
values to pick each split.  This speedup could be preserved even if
we fixed the discrepancy above.
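
As a rough illustration with made-up numbers: for M = 4 nominal
features of N = 8 values each, one-hot conversion means scoring
4 * 8 = 32 candidate binary features at each node, while the new
scheme scores only the 4 original features, each using a single
uniform pick from at most 8 distinct values.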

For the FisherIrisDataset test with nominal features, this locally
reduces the test's runtime from ~325 msec to ~90 msec.

Change-Id: I8a43963db8be7e7eb6eb8bb5efc64aefdc6ca67d
Reviewed-on: https://chromium-review.googlesource.com/c/1478177
Commit-Queue: Frank Liberato <liberato@chromium.org>
Reviewed-by: Dan Sanders <sandersd@chromium.org>
Cr-Commit-Position: refs/heads/master@{#633864}
parent a98de0c2
@@ -101,24 +101,16 @@ struct COMPONENT_EXPORT(LEARNING_COMMON) LearningTask {
// the histogram name.
std::string uma_hacky_confusion_matrix;
// RandomTree parameters
// How RandomTree handles unknown feature values.
enum class RTUnknownValueHandling {
// Return an empty distribution as the prediction.
kEmptyDistribution,
// Return the sum of the traversal of all splits.
kUseAllSplits,
};
RTUnknownValueHandling rt_unknown_value_handling =
RTUnknownValueHandling::kUseAllSplits;
// RandomForest parameters
// Number of trees in the random forest.
size_t rf_number_of_trees = 100;
// Should ExtraTrees apply one-hot conversion automatically? RandomTree has
// been modified to support nominals directly, though it isn't exactly the
// same as one-hot conversion. It is, however, much faster.
bool use_one_hot_conversion = false;
// Reporting parameters
// This is a hack for the initial media capabilities investigation. It
@@ -34,11 +34,16 @@ void ExtraTreesTrainer::Train(const LearningTask& task,
if (!tree_trainer_)
tree_trainer_ = std::make_unique<RandomTreeTrainer>(rng());
// RandomTree requires one-hot vectors to properly choose split points the way
// that ExtraTrees require.
// TODO(liberato): Modify it not to need this. It's slow.
converter_ = std::make_unique<OneHotConverter>(task, training_data);
converted_training_data_ = converter_->Convert(training_data);
// We've modified RandomTree to handle nominals directly, so we normally don't
// need to do one-hot conversion, which is slow.  However, the changes to
// RandomTree are only approximately equivalent to one-hot conversion.
if (task_.use_one_hot_conversion) {
converter_ = std::make_unique<OneHotConverter>(task, training_data);
converted_training_data_ = converter_->Convert(training_data);
task_ = converter_->converted_task();
} else {
converted_training_data_ = training_data;
}
// Start training. Send in nullptr to start the process.
OnRandomTreeModel(std::move(model_cb), nullptr);
@@ -52,17 +57,22 @@ void ExtraTreesTrainer::OnRandomTreeModel(TrainedModelCB model_cb,
// If this is the last tree, then return the finished model.
if (trees_.size() == task_.rf_number_of_trees) {
std::move(model_cb).Run(std::make_unique<ConvertingModel>(
std::move(converter_),
std::make_unique<VotingEnsemble>(std::move(trees_))));
std::unique_ptr<Model> model =
std::make_unique<VotingEnsemble>(std::move(trees_));
// If we have a converter, then wrap everything in a ConvertingModel.
if (converter_) {
model = std::make_unique<ConvertingModel>(std::move(converter_),
std::move(model));
}
std::move(model_cb).Run(std::move(model));
return;
}
// Train the next tree.
auto cb = base::BindOnce(&ExtraTreesTrainer::OnRandomTreeModel, AsWeakPtr(),
std::move(model_cb));
tree_trainer_->Train(converter_->converted_task(), converted_training_data_,
std::move(cb));
tree_trainer_->Train(task_, converted_training_data_, std::move(cb));
}
} // namespace learning
@@ -36,7 +36,6 @@ struct InteriorNode : public Model {
int split_index,
FeatureValue split_point)
: split_index_(split_index),
rt_unknown_value_handling_(task.rt_unknown_value_handling),
ordering_(task.feature_descriptions[split_index].ordering),
split_point_(split_point) {}
@@ -47,8 +46,8 @@ struct InteriorNode : public Model {
FeatureValue f;
switch (ordering_) {
case LearningTask::Ordering::kUnordered:
// Use the nominal value directly.
f = features[split_index_];
// Use 0 for "!=" and 1 for "==".
f = FeatureValue(features[split_index_] == split_point_);
break;
case LearningTask::Ordering::kNumeric:
// Use 0 for "<=" and 1 for ">".
@@ -58,18 +57,9 @@ struct InteriorNode : public Model {
auto iter = children_.find(f);
// If we've never seen this feature value, then average all our branches.
// This is an attempt to mimic one-hot encoding, where we'll take the zero
// branch but it depends on the tree structure which of the one-hot values
// we're choosing.
if (iter == children_.end()) {
switch (rt_unknown_value_handling_) {
case LearningTask::RTUnknownValueHandling::kEmptyDistribution:
return TargetDistribution();
case LearningTask::RTUnknownValueHandling::kUseAllSplits:
return PredictDistributionWithMissingValues(features);
}
}
// If we've never seen this feature value, then return nothing.
if (iter == children_.end())
return TargetDistribution();
return iter->second->PredictDistribution(features);
}
@@ -98,9 +88,6 @@ struct InteriorNode : public Model {
int split_index_ = -1;
base::flat_map<FeatureValue, std::unique_ptr<Model>> children_;
// How we handle unknown values.
LearningTask::RTUnknownValueHandling rt_unknown_value_handling_;
// How is our feature value ordered?
LearningTask::Ordering ordering_;
@@ -227,6 +214,11 @@ std::unique_ptr<Model> RandomTreeTrainer::Build(
}
// Select the feature subset to consider at this leaf.
// TODO(liberato): For nominals, with one-hot encoding, we'd give an equal
// chance to each feature's value. For example, if F1 has {A, B} and F2 has
// {C,D,E,F}, then we would pick uniformly over {A,B,C,D,E,F}. However, now
// we pick between {F1, F2} then pick between either {A,B} or {C,D,E,F}. We
// do this because it's simpler and doesn't seem to hurt anything.
FeatureSet feature_candidates = new_unused_set;
// TODO(liberato): Let our caller override this.
const size_t features_per_split =
@@ -267,17 +259,6 @@ std::unique_ptr<Model> RandomTreeTrainer::Build(
std::unique_ptr<InteriorNode> node = std::make_unique<InteriorNode>(
task, best_potential_split.split_index, best_potential_split.split_point);
// Don't let the subtree use this feature if this is nominal split, since
// there's nothing left to split. For numeric splits, we might want to split
// it further. Note that if there is only one branch for this split, then
// we returned a leaf anyway.
if (task.feature_descriptions[best_potential_split.split_index].ordering ==
LearningTask::Ordering::kUnordered) {
DCHECK(new_unused_set.find(best_potential_split.split_index) !=
new_unused_set.end());
new_unused_set.erase(best_potential_split.split_index);
}
for (auto& branch_iter : best_potential_split.branch_infos) {
node->AddChild(branch_iter.first,
Build(task, training_data, branch_iter.second.training_idx,
@@ -297,18 +278,21 @@ RandomTreeTrainer::Split RandomTreeTrainer::ConstructSplit(
DCHECK_GT(training_idx.size(), 0u);
Split split(split_index);
base::Optional<FeatureValue> split_point;
bool is_numeric = task.feature_descriptions[split_index].ordering ==
LearningTask::Ordering::kNumeric;
// TODO(liberato): Consider removing nominal feature support and RF. That
// would make this code somewhat simpler.
// For a numeric split, find the split point. Otherwise, we'll split on every
// nominal value that this feature has in |training_data|.
if (task.feature_descriptions[split_index].ordering ==
LearningTask::Ordering::kNumeric) {
split_point =
FindNumericSplitPoint(split.split_index, training_data, training_idx);
split.split_point = *split_point;
if (is_numeric) {
split.split_point =
FindSplitPoint_Numeric(split.split_index, training_data, training_idx);
} else {
split.split_point =
FindSplitPoint_Nominal(split.split_index, training_data, training_idx);
}
// Find the split's feature values and construct the training set for each.
@@ -323,13 +307,14 @@ RandomTreeTrainer::Split RandomTreeTrainer::ConstructSplit(
FeatureValue v_i = example.features[split.split_index];
// Figure out what value this example would use for splitting. For nominal,
// it's just |v_i|. For numeric, it's whether |v_i| is <= the split point
// or not (0 for <=, 1 for >).
// it's 1 or 0, based on whether |v_i| is equal to the split or not. For
// numeric, it's whether |v_i| is <= the split point or not (0 for <=, and 1
// for >).
FeatureValue split_feature;
if (split_point)
split_feature = FeatureValue(v_i > *split_point);
if (is_numeric)
split_feature = FeatureValue(v_i > split.split_point);
else
split_feature = v_i;
split_feature = FeatureValue(v_i == split.split_point);
// Add |v_i| to the right training set. Remember that emplace will do
// nothing if the key already exists.
@@ -345,18 +330,18 @@ RandomTreeTrainer::Split RandomTreeTrainer::ConstructSplit(
// Figure out how good / bad this split is.
switch (task.target_description.ordering) {
case LearningTask::Ordering::kUnordered:
ComputeNominalSplitScore(&split, total_weight);
ComputeSplitScore_Nominal(&split, total_weight);
break;
case LearningTask::Ordering::kNumeric:
ComputeNumericSplitScore(&split, total_weight);
ComputeSplitScore_Numeric(&split, total_weight);
break;
}
return split;
}
void RandomTreeTrainer::ComputeNominalSplitScore(Split* split,
double total_weight) {
void RandomTreeTrainer::ComputeSplitScore_Nominal(Split* split,
double total_weight) {
// Compute the nats given that we're at this node.
split->nats_remaining = 0;
for (auto& info_iter : split->branch_infos) {
@@ -374,8 +359,8 @@ void RandomTreeTrainer::ComputeNominalSplitScore(Split* split,
}
}
void RandomTreeTrainer::ComputeNumericSplitScore(Split* split,
double total_weight) {
void RandomTreeTrainer::ComputeSplitScore_Numeric(Split* split,
double total_weight) {
// Compute the nats given that we're at this node.
split->nats_remaining = 0;
for (auto& info_iter : split->branch_infos) {
@@ -402,7 +387,7 @@ void RandomTreeTrainer::ComputeNumericSplitScore(Split* split,
}
}
FeatureValue RandomTreeTrainer::FindNumericSplitPoint(
FeatureValue RandomTreeTrainer::FindSplitPoint_Numeric(
size_t split_index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx) {
@@ -444,5 +429,42 @@ FeatureValue RandomTreeTrainer::FindNumericSplitPoint(
return v_split;
}
FeatureValue RandomTreeTrainer::FindSplitPoint_Nominal(
size_t split_index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx) {
// We should not be given a training set of size 0, since there's no need to
// check an empty split.
DCHECK_GT(training_idx.size(), 0u);
// Construct a set of all values for |training_idx|. We don't care about
// their relative frequency, since one-hot encoding doesn't.
// For example, if a feature has 10 "yes" instances and 1 "no" instance, then
// there's a 50% chance for each to be chosen here. This is because one-hot
// encoding would do roughly the same thing: when choosing features, the
// "is_yes" and "is_no" features that come out of one-hot encoding would be
// equally likely to be chosen.
//
// Important but subtle note: we can't choose a value that's been chosen
// before for this feature, since that would be like splitting on the same
// one-hot feature more than once. Luckily, we won't be asked to do that. If
// we choose "Yes" at some level in the tree, then the "==" branch will have
// trivial features which will be removed from consideration early (we never
// consider features with only one value), and the != branch won't have any
// "Yes" values for us to pick at a lower level.
std::set<FeatureValue> values;
for (size_t idx : training_idx) {
const LabelledExample& example = training_data[idx];
values.insert(example.features[split_index]);
}
// Select one uniformly at random.
size_t which = rng()->Generate(values.size());
auto it = values.begin();
for (; which > 0; it++, which--)
;
return *it;
}
} // namespace learning
} // namespace media
@@ -160,15 +160,20 @@ class COMPONENT_EXPORT(LEARNING_IMPL) RandomTreeTrainer
// Fill in |nats_remaining| for |split| for a nominal target. |total_weight|
// is the total weight of all instances coming into this split.
void ComputeNominalSplitScore(Split* split, double total_weight);
void ComputeSplitScore_Nominal(Split* split, double total_weight);
// Fill in |nats_remaining| for |split| for a numeric target.
void ComputeNumericSplitScore(Split* split, double total_weight);
void ComputeSplitScore_Numeric(Split* split, double total_weight);
// Compute the split point for |training_data| for a nominal feature.
FeatureValue FindSplitPoint_Nominal(size_t index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx);
// Compute the split point for |training_data| for a numeric feature.
FeatureValue FindNumericSplitPoint(size_t index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx);
FeatureValue FindSplitPoint_Numeric(size_t index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx);
DISALLOW_COPY_AND_ASSIGN(RandomTreeTrainer);
};
@@ -169,30 +169,17 @@ TEST_P(RandomTreeTest, UnknownFeatureValueHandling) {
training_data.push_back(example_1);
training_data.push_back(example_2);
task_.rt_unknown_value_handling =
LearningTask::RTUnknownValueHandling::kEmptyDistribution;
std::unique_ptr<Model> model = Train(task_, training_data);
TargetDistribution distribution =
auto model = Train(task_, training_data);
auto distribution =
model->PredictDistribution(FeatureVector({FeatureValue(789)}));
if (ordering_ == LearningTask::Ordering::kUnordered) {
// OOV data should return an empty distribution (nominal).
EXPECT_EQ(distribution.size(), 0u);
} else {
// OOV data should end up in the |example_2| bucket, since the feature is
// numerically higher.
// OOV data could be split on either feature first, so we don't really know
// which to expect. We assert that there should be exactly one example, but
// whether it's |example_1| or |example_2| isn't clear.
EXPECT_EQ(distribution.size(), 1u);
EXPECT_EQ(distribution[example_2.target_value], 1u);
}
task_.rt_unknown_value_handling =
LearningTask::RTUnknownValueHandling::kUseAllSplits;
model = Train(task_, training_data);
distribution = model->PredictDistribution(FeatureVector({FeatureValue(789)}));
if (ordering_ == LearningTask::Ordering::kUnordered) {
// OOV data should return with the sum of all splits.
EXPECT_EQ(distribution.size(), 2u);
EXPECT_EQ(distribution[example_1.target_value], 1u);
EXPECT_EQ(distribution[example_2.target_value], 1u);
EXPECT_EQ(distribution[example_1.target_value] +
distribution[example_2.target_value],
1u);
} else {
// The unknown feature is numerically higher than |example_2|, so we
// expect it to fall into that bucket.
@@ -212,8 +199,6 @@ TEST_P(RandomTreeTest, NumericFeaturesSplitMultipleTimes) {
training_data.push_back(example);
}
task_.rt_unknown_value_handling =
LearningTask::RTUnknownValueHandling::kEmptyDistribution;
std::unique_ptr<Model> model = Train(task_, training_data);
for (size_t i = 0; i < 4; i++) {
// Get a prediction for the |i|-th feature value.