media/learning/impl/random_tree_trainer_unittest.cc · dca111a7573e0a32b89d24a6c8357091e4f7d0b8 · eriksson monteiro / tangled

This CL turns off one-hot conversion for nominal features, and instead adds a similar effect directly in RandomTreeTrainer. Instead of converting each nominal feature of N values into N binary features, have the tree builder pick uniformly at random one of the N values to do the split. This choice is made without regard to the number of examples with a particular value, since that's what one- hot encoding would do. For example, if 9 examples have value "A" and 1 has value "B", then each "A" and "B" has a 50% chance of being the split point. For one-hot, there would be two binary features, and the split selection would similarly pick between them with equal probability. There is one difference between one-hot values, however. The current system still picks first, uniformly, a set of features to split on, then chooses (again uniformly) which value to split on this time. The first pick is different, since one-hot would give each value equal weight across multiple features. For example, if we have two features f1 and f2 with values {A,B} and {C,D,E,F}, then one-hot would pick a split uniformly over the resulting 6 features, each of which was a feature value in the original nominal. Now, we'll pick uniformly between f1 and f2, then uniformly again from either {A,B} or {C,D,E,F}, depending on whether we chose f1 and f2. With one-hot encoding, we had twice the chance to pick a f2 value than f1. We could emulate this when choosing features, but it seemed to work okay and is simpler. The reason this is faster is that, for M features of N nominal values each, one-hot would generate M*N features that would be searched over at each node. Now, we have only M features and then a search over N to pick the split. We could preserve this even if we fixed the discrepancy above. For FisherIrisDataset test with nominal features, locally this reduces the runtime of the test from ~325 msec to ~90 msec. Change-Id: I8a43963db8be7e7eb6eb8bb5efc64aefdc6ca67d Reviewed-on: https://chromium-review.googlesource.com/c/1478177 Commit-Queue: Frank Liberato <liberato@chromium.org> Reviewed-by:

Dan Sanders <sandersd@chromium.org> Cr-Commit-Position: refs/heads/master@{#633864}

random_tree_trainer_unittest.cc 8.07 KB

Replace random_tree_trainer_unittest.cc