• liberato@chromium.org's avatar
    Speed up nominal features in RandomTreeTrainer. · dca111a7
    liberato@chromium.org authored
    This CL turns off one-hot conversion for nominal features, and
    instead adds a similar effect directly in RandomTreeTrainer.
    
    Instead of converting each nominal feature of N values into N binary
    features, have the tree builder pick uniformly at random one of the
    N values to do the split.  This choice is made without regard to the
    number of examples with a particular value, since that's what one-
    hot encoding would do.  For example, if 9 examples have value "A"
    and 1 has value "B", then each "A" and "B" has a 50% chance of
    being the split point.  For one-hot, there would be two binary
    features, and the split selection would similarly pick between them
    with equal probability.
    
    There is one difference between one-hot values, however.  The
    current system still picks first, uniformly, a set of features to
    split on, then chooses (again uniformly) which value to split on
    this time.  The first pick is different, since one-hot would give
    each value equal weight across multiple features.  For example,
    if we have two features f1 and f2 with values {A,B} and {C,D,E,F},
    then one-hot would pick a split uniformly over the resulting 6
    features, each of which was a feature value in the original
    nominal.  Now, we'll pick uniformly between f1 and f2, then
    uniformly again from either {A,B} or {C,D,E,F}, depending on whether
    we chose f1 and f2.
    
    With one-hot encoding, we had twice the chance to pick a f2 value
    than f1.  We could emulate this when choosing features, but it
    seemed to work okay and is simpler.
    
    The reason this is faster is that, for M features of N nominal
    values each, one-hot would generate M*N features that would be
    searched over at each node.  Now, we have only M features and
    then a search over N to pick the split.  We could preserve this
    even if we fixed the discrepancy above.
    
    For FisherIrisDataset test with nominal features, locally this
    reduces the runtime of the test from ~325 msec to ~90 msec.
    
    Change-Id: I8a43963db8be7e7eb6eb8bb5efc64aefdc6ca67d
    Reviewed-on: https://chromium-review.googlesource.com/c/1478177
    Commit-Queue: Frank Liberato <liberato@chromium.org>
    Reviewed-by: default avatarDan Sanders <sandersd@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#633864}
    dca111a7
random_tree_trainer_unittest.cc 8.07 KB