Commit dca111a7 authored by liberato@chromium.org, committed by Commit Bot

Speed up nominal features in RandomTreeTrainer.

This CL turns off one-hot conversion for nominal features, and
instead adds a similar effect directly in RandomTreeTrainer.

Instead of converting each nominal feature of N values into N binary
features, have the tree builder pick uniformly at random one of the
N values to do the split.  This choice is made without regard to the
number of examples with a particular value, since that matches what
one-hot encoding would do.  For example, if 9 examples have value "A"
and 1 has value "B", then "A" and "B" each have a 50% chance of
being the split point.  For one-hot, there would be two binary
features, and the split selection would similarly pick between them
with equal probability.
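
As a minimal sketch of the value pick (illustrative only; the
function, types, and names below are not the actual RandomTreeTrainer
code), the choice amounts to:

    #include <iterator>
    #include <random>
    #include <set>

    // Illustrative sketch, not the actual Chromium code.
    // Pick one of the distinct nominal values seen at this node,
    // uniformly at random, ignoring how many examples carry each value.
    int PickNominalSplitValue(const std::set<int>& values,
                              std::mt19937* rng) {
      std::uniform_int_distribution<size_t> dist(0, values.size() - 1);
      auto it = values.begin();
      // std::set iterators aren't random access, so step to the chosen one.
      std::advance(it, dist(*rng));
      return *it;
    }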

There is one difference from one-hot conversion, however.  The new
system still first picks, uniformly, a set of features to split on,
then chooses (again uniformly) which value of the chosen feature to
split on.  That first pick is different, since one-hot would give
each value equal weight across multiple features.  For example,
if we have two features f1 and f2 with values {A,B} and {C,D,E,F},
then one-hot would pick a split uniformly over the resulting 6
features, each of which was a feature value in one of the original
nominals.  Now, we'll pick uniformly between f1 and f2, then
uniformly again from either {A,B} or {C,D,E,F}, depending on whether
we chose f1 or f2.

With one-hot encoding, we had twice the chance of picking an f2
value as an f1 value.  We could emulate this when choosing features,
but the simpler approach seems to work well enough.
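
Worked out for the f1/f2 example above (ignoring the random feature-
subset selection, which applies equally to both schemes):

    one-hot: 6 candidate features {A,B,C,D,E,F}, each picked with
             probability 1/6, so an f2 value is picked with
             probability 4/6 = 2/3 and an f1 value with 2/6 = 1/3.
    now:     P(A) = P(B) = (1/2)*(1/2) = 1/4 and
             P(C) = P(D) = P(E) = P(F) = (1/2)*(1/4) = 1/8,
             so f2 as a whole gets probability 1/2 rather than 2/3.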

The reason this is faster is that, for M features of N nominal
values each, one-hot would generate M*N features to search over at
each node.  Now, we have only M features, plus a search over N
values to pick each split.  This speedup could be preserved even if
we fixed the discrepancy above.
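
As a rough illustration with made-up numbers: for M = 4 nominal
features of N = 8 values each, one-hot conversion means scoring
4 * 8 = 32 candidate binary features at each node, while the new
scheme scores only the 4 original features, each using a single
uniform pick from at most 8 distinct values.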

For the FisherIrisDataset test with nominal features, this locally
reduces the test's runtime from ~325 msec to ~90 msec.

Change-Id: I8a43963db8be7e7eb6eb8bb5efc64aefdc6ca67d
Reviewed-on: https://chromium-review.googlesource.com/c/1478177
Commit-Queue: Frank Liberato <liberato@chromium.org>
Reviewed-by: Dan Sanders <sandersd@chromium.org>
Cr-Commit-Position: refs/heads/master@{#633864}
parent a98de0c2
@@ -101,24 +101,16 @@ struct COMPONENT_EXPORT(LEARNING_COMMON) LearningTask {
// the histogram name.
std::string uma_hacky_confusion_matrix;
// RandomTree parameters
// How RandomTree handles unknown feature values.
enum class RTUnknownValueHandling {
// Return an empty distribution as the prediction.
kEmptyDistribution,
// Return the sum of the traversal of all splits.
kUseAllSplits,
};
RTUnknownValueHandling rt_unknown_value_handling =
RTUnknownValueHandling::kUseAllSplits;
// RandomForest parameters
// Number of trees in the random forest.
size_t rf_number_of_trees = 100;
// Should ExtraTrees apply one-hot conversion automatically? RandomTree has
// been modified to support nominals directly, though it isn't exactly the
// same as one-hot conversion. It is, however, much faster.
bool use_one_hot_conversion = false;
// Reporting parameters
// This is a hack for the initial media capabilities investigation. It
@@ -34,11 +34,16 @@ void ExtraTreesTrainer::Train(const LearningTask& task,
if (!tree_trainer_)
tree_trainer_ = std::make_unique<RandomTreeTrainer>(rng());
// RandomTree requires one-hot vectors to properly choose split points the way
// that ExtraTrees require.
// TODO(liberato): Modify it not to need this. It's slow.
converter_ = std::make_unique<OneHotConverter>(task, training_data);
converted_training_data_ = converter_->Convert(training_data);
// We've modified RandomTree to handle nominals directly, so we normally don't
// need to do one-hot conversion, which is slow.  However, the changes to
// RandomTree are only approximately equivalent to one-hot conversion.
if (task_.use_one_hot_conversion) {
converter_ = std::make_unique<OneHotConverter>(task, training_data);
converted_training_data_ = converter_->Convert(training_data);
task_ = converter_->converted_task();
} else {
converted_training_data_ = training_data;
}
// Start training. Send in nullptr to start the process.
OnRandomTreeModel(std::move(model_cb), nullptr);
@@ -52,17 +57,22 @@ void ExtraTreesTrainer::OnRandomTreeModel(TrainedModelCB model_cb,
// If this is the last tree, then return the finished model.
if (trees_.size() == task_.rf_number_of_trees) {
std::move(model_cb).Run(std::make_unique<ConvertingModel>(
std::move(converter_),
std::make_unique<VotingEnsemble>(std::move(trees_))));
std::unique_ptr<Model> model =
std::make_unique<VotingEnsemble>(std::move(trees_));
// If we have a converter, then wrap everything in a ConvertingModel.
if (converter_) {
model = std::make_unique<ConvertingModel>(std::move(converter_),
std::move(model));
}
std::move(model_cb).Run(std::move(model));
return;
}
// Train the next tree.
auto cb = base::BindOnce(&ExtraTreesTrainer::OnRandomTreeModel, AsWeakPtr(),
std::move(model_cb));
tree_trainer_->Train(converter_->converted_task(), converted_training_data_,
std::move(cb));
tree_trainer_->Train(task_, converted_training_data_, std::move(cb));
}
} // namespace learning
@@ -36,7 +36,6 @@ struct InteriorNode : public Model {
int split_index,
FeatureValue split_point)
: split_index_(split_index),
rt_unknown_value_handling_(task.rt_unknown_value_handling),
ordering_(task.feature_descriptions[split_index].ordering),
split_point_(split_point) {}
@@ -47,8 +46,8 @@ struct InteriorNode : public Model {
FeatureValue f;
switch (ordering_) {
case LearningTask::Ordering::kUnordered:
// Use the nominal value directly.
f = features[split_index_];
// Use 0 for "!=" and 1 for "==".
f = FeatureValue(features[split_index_] == split_point_);
break;
case LearningTask::Ordering::kNumeric:
// Use 0 for "<=" and 1 for ">".
@@ -58,18 +57,9 @@ struct InteriorNode : public Model {
auto iter = children_.find(f);
// If we've never seen this feature value, then average all our branches.
// This is an attempt to mimic one-hot encoding, where we'll take the zero
// branch but it depends on the tree structure which of the one-hot values
// we're choosing.
if (iter == children_.end()) {
switch (rt_unknown_value_handling_) {
case LearningTask::RTUnknownValueHandling::kEmptyDistribution:
return TargetDistribution();
case LearningTask::RTUnknownValueHandling::kUseAllSplits:
return PredictDistributionWithMissingValues(features);
}
}
// If we've never seen this feature value, then return nothing.
if (iter == children_.end())
return TargetDistribution();
return iter->second->PredictDistribution(features);
}
@@ -98,9 +88,6 @@ struct InteriorNode : public Model {
int split_index_ = -1;
base::flat_map<FeatureValue, std::unique_ptr<Model>> children_;
// How we handle unknown values.
LearningTask::RTUnknownValueHandling rt_unknown_value_handling_;
// How is our feature value ordered?
LearningTask::Ordering ordering_;
@@ -227,6 +214,11 @@ std::unique_ptr<Model> RandomTreeTrainer::Build(
}
// Select the feature subset to consider at this leaf.
// TODO(liberato): For nominals, with one-hot encoding, we'd give an equal
// chance to each feature's value. For example, if F1 has {A, B} and F2 has
// {C,D,E,F}, then we would pick uniformly over {A,B,C,D,E,F}. However, now
// we pick between {F1, F2} then pick between either {A,B} or {C,D,E,F}. We
// do this because it's simpler and doesn't seem to hurt anything.
FeatureSet feature_candidates = new_unused_set;
// TODO(liberato): Let our caller override this.
const size_t features_per_split =
@@ -267,17 +259,6 @@ std::unique_ptr<Model> RandomTreeTrainer::Build(
std::unique_ptr<InteriorNode> node = std::make_unique<InteriorNode>(
task, best_potential_split.split_index, best_potential_split.split_point);
// Don't let the subtree use this feature if this is nominal split, since
// there's nothing left to split. For numeric splits, we might want to split
// it further. Note that if there is only one branch for this split, then
// we returned a leaf anyway.
if (task.feature_descriptions[best_potential_split.split_index].ordering ==
LearningTask::Ordering::kUnordered) {
DCHECK(new_unused_set.find(best_potential_split.split_index) !=
new_unused_set.end());
new_unused_set.erase(best_potential_split.split_index);
}
for (auto& branch_iter : best_potential_split.branch_infos) {
node->AddChild(branch_iter.first,
Build(task, training_data, branch_iter.second.training_idx,
@@ -297,18 +278,21 @@ RandomTreeTrainer::Split RandomTreeTrainer::ConstructSplit(
DCHECK_GT(training_idx.size(), 0u);
Split split(split_index);
base::Optional<FeatureValue> split_point;
bool is_numeric = task.feature_descriptions[split_index].ordering ==
LearningTask::Ordering::kNumeric;
// TODO(liberato): Consider removing nominal feature support and RF. That
// would make this code somewhat simpler.
// For a numeric split, find the split point. Otherwise, we'll split on every
// nominal value that this feature has in |training_data|.
if (task.feature_descriptions[split_index].ordering ==
LearningTask::Ordering::kNumeric) {
split_point =
FindNumericSplitPoint(split.split_index, training_data, training_idx);
split.split_point = *split_point;
if (is_numeric) {
split.split_point =
FindSplitPoint_Numeric(split.split_index, training_data, training_idx);
} else {
split.split_point =
FindSplitPoint_Nominal(split.split_index, training_data, training_idx);
}
// Find the split's feature values and construct the training set for each.
@@ -323,13 +307,14 @@ RandomTreeTrainer::Split RandomTreeTrainer::ConstructSplit(
FeatureValue v_i = example.features[split.split_index];
// Figure out what value this example would use for splitting. For nominal,
// it's just |v_i|. For numeric, it's whether |v_i| is <= the split point
// or not (0 for <=, 1 for >).
// it's 1 or 0, based on whether |v_i| is equal to the split or not. For
// numeric, it's whether |v_i| is <= the split point or not (0 for <=, and 1
// for >).
FeatureValue split_feature;
if (split_point)
split_feature = FeatureValue(v_i > *split_point);
if (is_numeric)
split_feature = FeatureValue(v_i > split.split_point);
else
split_feature = v_i;
split_feature = FeatureValue(v_i == split.split_point);
// Add |v_i| to the right training set. Remember that emplace will do
// nothing if the key already exists.
@@ -345,18 +330,18 @@ RandomTreeTrainer::Split RandomTreeTrainer::ConstructSplit(
// Figure out how good / bad this split is.
switch (task.target_description.ordering) {
case LearningTask::Ordering::kUnordered:
ComputeNominalSplitScore(&split, total_weight);
ComputeSplitScore_Nominal(&split, total_weight);
break;
case LearningTask::Ordering::kNumeric:
ComputeNumericSplitScore(&split, total_weight);
ComputeSplitScore_Numeric(&split, total_weight);
break;
}
return split;
}
void RandomTreeTrainer::ComputeNominalSplitScore(Split* split,
double total_weight) {
void RandomTreeTrainer::ComputeSplitScore_Nominal(Split* split,
double total_weight) {
// Compute the nats given that we're at this node.
split->nats_remaining = 0;
for (auto& info_iter : split->branch_infos) {
@@ -374,8 +359,8 @@ void RandomTreeTrainer::ComputeNominalSplitScore(Split* split,
}
}
void RandomTreeTrainer::ComputeNumericSplitScore(Split* split,
double total_weight) {
void RandomTreeTrainer::ComputeSplitScore_Numeric(Split* split,
double total_weight) {
// Compute the nats given that we're at this node.
split->nats_remaining = 0;
for (auto& info_iter : split->branch_infos) {
@@ -402,7 +387,7 @@ void RandomTreeTrainer::ComputeNumericSplitScore(Split* split,
}
}
FeatureValue RandomTreeTrainer::FindNumericSplitPoint(
FeatureValue RandomTreeTrainer::FindSplitPoint_Numeric(
size_t split_index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx) {
@@ -444,5 +429,42 @@ FeatureValue RandomTreeTrainer::FindNumericSplitPoint(
return v_split;
}
FeatureValue RandomTreeTrainer::FindSplitPoint_Nominal(
size_t split_index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx) {
// We should not be given a training set of size 0, since there's no need to
// check an empty split.
DCHECK_GT(training_idx.size(), 0u);
// Construct a set of all values for |training_idx|. We don't care about
// their relative frequency, since one-hot encoding doesn't.
// For example, if a feature has 10 "yes" instances and 1 "no" instance, then
// there's a 50% chance for each to be chosen here. This is because one-hot
// encoding would do roughly the same thing: when choosing features, the
// "is_yes" and "is_no" features that come out of one-hot encoding would be
// equally likely to be chosen.
//
// Important but subtle note: we can't choose a value that's been chosen
// before for this feature, since that would be like splitting on the same
// one-hot feature more than once. Luckily, we won't be asked to do that. If
// we choose "Yes" at some level in the tree, then the "==" branch will have
// trivial features which will be removed from consideration early (we never
// consider features with only one value), and the != branch won't have any
// "Yes" values for us to pick at a lower level.
std::set<FeatureValue> values;
for (size_t idx : training_idx) {
const LabelledExample& example = training_data[idx];
values.insert(example.features[split_index]);
}
// Select one uniformly at random.
size_t which = rng()->Generate(values.size());
auto it = values.begin();
for (; which > 0; it++, which--)
;
return *it;
}
} // namespace learning
} // namespace media
@@ -160,15 +160,20 @@ class COMPONENT_EXPORT(LEARNING_IMPL) RandomTreeTrainer
// Fill in |nats_remaining| for |split| for a nominal target. |total_weight|
// is the total weight of all instances coming into this split.
void ComputeNominalSplitScore(Split* split, double total_weight);
void ComputeSplitScore_Nominal(Split* split, double total_weight);
// Fill in |nats_remaining| for |split| for a numeric target.
void ComputeNumericSplitScore(Split* split, double total_weight);
void ComputeSplitScore_Numeric(Split* split, double total_weight);
// Compute the split point for |training_data| for a nominal feature.
FeatureValue FindSplitPoint_Nominal(size_t index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx);
// Compute the split point for |training_data| for a numeric feature.
FeatureValue FindNumericSplitPoint(size_t index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx);
FeatureValue FindSplitPoint_Numeric(size_t index,
const TrainingData& training_data,
const std::vector<size_t>& training_idx);
DISALLOW_COPY_AND_ASSIGN(RandomTreeTrainer);
};
@@ -169,30 +169,17 @@ TEST_P(RandomTreeTest, UnknownFeatureValueHandling) {
training_data.push_back(example_1);
training_data.push_back(example_2);
task_.rt_unknown_value_handling =
LearningTask::RTUnknownValueHandling::kEmptyDistribution;
std::unique_ptr<Model> model = Train(task_, training_data);
TargetDistribution distribution =
auto model = Train(task_, training_data);
auto distribution =
model->PredictDistribution(FeatureVector({FeatureValue(789)}));
if (ordering_ == LearningTask::Ordering::kUnordered) {
// OOV data should return an empty distribution (nominal).
EXPECT_EQ(distribution.size(), 0u);
} else {
// OOV data should end up in the |example_2| bucket, since the feature is
// numerically higher.
// OOV data could be split on either feature first, so we don't really know
// which to expect. We assert that there should be exactly one example, but
// whether it's |example_1| or |example_2| isn't clear.
EXPECT_EQ(distribution.size(), 1u);
EXPECT_EQ(distribution[example_2.target_value], 1u);
}
task_.rt_unknown_value_handling =
LearningTask::RTUnknownValueHandling::kUseAllSplits;
model = Train(task_, training_data);
distribution = model->PredictDistribution(FeatureVector({FeatureValue(789)}));
if (ordering_ == LearningTask::Ordering::kUnordered) {
// OOV data should return with the sum of all splits.
EXPECT_EQ(distribution.size(), 2u);
EXPECT_EQ(distribution[example_1.target_value], 1u);
EXPECT_EQ(distribution[example_2.target_value], 1u);
EXPECT_EQ(distribution[example_1.target_value] +
distribution[example_2.target_value],
1u);
} else {
// The unknown feature is numerically higher than |example_2|, so we
// expect it to fall into that bucket.
@@ -212,8 +199,6 @@ TEST_P(RandomTreeTest, NumericFeaturesSplitMultipleTimes) {
training_data.push_back(example);
}
task_.rt_unknown_value_handling =
LearningTask::RTUnknownValueHandling::kEmptyDistribution;
std::unique_ptr<Model> model = Train(task_, training_data);
for (size_t i = 0; i < 4; i++) {
// Get a prediction for the |i|-th feature value.