Clustering stories of the benchmarks

These scripts help with gathering and processing historical data to be used for clustering.

Bug: chromium:959971
Change-Id: I1b42612324be0e9ce2e74babe6cdfb52c92f5c33
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/1594779
Commit-Queue: Behdad Bakhshinategh <behdadb@chromium.org>
Reviewed-by: Caleb Rouleau <crouleau@chromium.org>
Reviewed-by: Sadrul Chowdhury <sadrul@chromium.org>
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#664800}

Commit 525a742c authored May 30, 2019 by behdad Committed by Commit Bot May 30, 2019
8 changed files
--- a/tools/perf/experimental/__init__.py
+++ b/tools/perf/experimental/__init__.py
--- a/tools/perf/experimental/story_clustering/OWNERS
+++ b/tools/perf/experimental/story_clustering/OWNERS
+behdadb@chromium.org
+sadrul@chromium.org
--- a/tools/perf/experimental/story_clustering/README.md
+++ b/tools/perf/experimental/story_clustering/README.md
+The code in this directory provides support for clustering and choosing
+representatives for benchmarks.
+The inputs needed for the clustering methods are:
+1. Benchmark name
+2. List of metrics to use.
+    Clustering will be done once for each metric.
+3. List of platforms to gather data from.
+4. List of test cases (story names) in the benchmark which should be clustered.
+    If some stories are recognized as outliers, they can be removed from this
+    list. The test cases can be provided as a text file with one story per line.
+5. Maximum number of clusters.
+    The actual number of clusters may be less than this number, as clusters
+    with only one member are not presented as a cluster.
+6. Number of days of data history to use for clustering.
+Examples of creating clusters:
+```shell
+python ./tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py \
+rendering.desktop \
+--metrics frame_times thread_total_all_cpu_time_per_frame \
+--platforms ChromiumPerf:mac-10_13_laptop_high_end-perf ChromiumPerf:mac-10_12_laptop_low_end-perf \
+--testcases-path //tmp/story_clustering/rendering.desktop/test_cases.txt \
+--days=100 \
+--normalize
+```
+```shell
+python ./tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py \
+rendering.desktop \
+--metrics frame_times thread_total_all_cpu_time_per_frame \
+--platforms 'ChromiumPerf:Win 7 Nvidia GPU Perf' 'ChromiumPerf:Win 7 Perf' ChromiumPerf:win-10-perf \
+--testcases-path //tmp/story_clustering/rendering.desktop/test_cases.txt \
+--days=100 \
+--normalize
+```
+```shell
+python ./tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py \
+rendering.mobile \
+--metrics frame_times thread_total_all_cpu_time_per_frame \
+--platforms 'ChromiumPerf:Android Nexus5 Perf' 'ChromiumPerf:Android Nexus5X WebView Perf' \
+'ChromiumPerf:Android Nexus6 WebView Perf' \
+--testcases-path //tmp/story_clustering/rendering.mobile/test_cases.txt \
+--days=100 \
+--normalize
+```
+Results of the clustering are written to the JSON file given by `--output-path`, which defaults to `//tmp/story_clustering/clusters.json`.
+[Method explanation](https://goto.google.com/chrome-benchmark-clustering)
\ No newline at end of file
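The output file can be read back with a few lines of Python. The sketch below assumes the layout produced by `Cluster.AsDict` and the final `json.dump` in this change: one key per metric, each holding a list of `{"members": [...], "representative": ...}` entries. The story names and the helper `representatives_by_metric` are hypothetical and only serve the illustration.

```python
import json

# Hypothetical sample shaped like the clusters file written by the script.
SAMPLE_CLUSTERS_JSON = '''
{
    "frame_times": [
        {"members": ["story_b", "story_c"], "representative": "story_a"}
    ]
}
'''


def representatives_by_metric(clusters_by_metric):
    """Maps each metric name to the list of its cluster representatives."""
    return {
        metric: [cluster['representative'] for cluster in clusters]
        for metric, clusters in clusters_by_metric.items()
    }


print(representatives_by_metric(json.loads(SAMPLE_CLUSTERS_JSON)))
```

Running only the representatives is the intended use of the clustering, so a helper of this kind is a natural first consumer of the file.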
--- a/tools/perf/experimental/story_clustering/__init__.py
+++ b/tools/perf/experimental/story_clustering/__init__.py
--- a/tools/perf/experimental/story_clustering/cluster_stories.py
+++ b/tools/perf/experimental/story_clustering/cluster_stories.py
+# Copyright 2019 The Chromium Authors. All rights reserved.
+# Use of this source code is governed by a BSD-style license that can be
+# found in the LICENSE file.
+from __future__ import division
+import heapq
+class Cluster(object):
+  def __init__(self, members):
+    """Initializes the cluster instance.
+    Args:
+      members: Set of story names which belong to this cluster.
+    """
+    self._members = frozenset(members)
+    self._representative = None
+  def __len__(self):
+    return len(self._members)
+  def GetDistanceFrom(self, other_cluster, distance_matrix):
+    """Calculates the distance of two clusters.
+    The maximum distance between any story of the first cluster and any story
+    of the second cluster is used as the distance between the clusters.
+    Args:
+      other_cluster: Cluster object to calculate distance from.
+      distance_matrix: A dataframe containing the distances between any
+      two stories.
+    Returns:
+      A float representing the distance between the two clusters.
+    """
+    matrix_slice = distance_matrix.loc[self.members, other_cluster.members]
+    return matrix_slice.max().max()
+  @property
+  def members(self):
+    return self._members
+  def GetRepresentative(self, distance_matrix=None):
+    """Finds and sets the representative of cluster.
+    The story whose summed distance to all members of the cluster is
+    smallest is used as the representative.
+    Args:
+      distance_matrix: A dataframe containing the distances between any
+      two stories.
+    Returns:
+      The story which is the representative of the cluster.
+    """
+    if self._representative:
+      return self._representative
+    if distance_matrix is None:
+      raise Exception('Distance matrix is not set.')
+    self._representative = distance_matrix.loc[
+      self._members, self._members].sum().idxmin()
+    return self._representative
+  def Merge(self, other_cluster):
+    """Merges two clusters.
+    Returns:
+      A new cluster object which is a result of merging two clusters.
+    """
+    return Cluster(self.members | other_cluster.members)
+  def AsDict(self):
+    """Creates a dictionary which describes cluster object.
+    Returns:
+      A dictionary containing the members of the cluster and its
+      representative. The representative will not be listed in members
+      list.
+    """
+    representative = self.GetRepresentative()
+    members_list = list(self.members.difference([representative]))
+    return {
+      'members': members_list,
+      'representative': self.GetRepresentative()
+    }
+def RunHierarchicalClustering(
+  distance_matrix,
+  max_cluster_count,
+  min_cluster_size):
+  """Clusters stories.
+  Runs a hierarchical clustering algorithm based on the similarity measures.
+  Args:
+    distance_matrix: A dataframe containing distance matrix of stories.
+    max_cluster_count: number representing the maximum number of clusters
+      needed per metric.
+    min_cluster_size: number representing the least number of members needed
+      to make the cluster valid.
+  Returns:
+    A tuple containing:
+    clusters: A list of cluster objects
+    coverage: Ratio(float) of stories covered using this clustering
+  """
+  stories = distance_matrix.index.values
+  remaining_clusters = set([])
+  for story in stories:
+    remaining_clusters.add(Cluster([story]))
+  # The hierarchical clustering relies on a heap of possible cluster merges
+  # ordered by the distance between the two clusters.
+  heap = []
+  # Monotonic counter used as a tie-breaker, so that entries with equal
+  # distances never fall back to comparing Cluster objects.
+  push_count = 0
+  # Initially each story is a cluster on its own, and every pair of stories
+  # is added as a possible merge.
+  for cluster1 in remaining_clusters:
+    for cluster2 in remaining_clusters:
+      if cluster1 == cluster2:
+        break
+      heapq.heappush(heap,
+        (cluster1.GetDistanceFrom(cluster2, distance_matrix),
+         push_count, cluster1, cluster2))
+      push_count += 1
+  # At each step the two closest remaining clusters are merged.
+  while (len(remaining_clusters) > max_cluster_count and len(heap) > 0):
+    _, _, cluster1, cluster2 = heapq.heappop(heap)
+    if (cluster1 not in remaining_clusters or
+      cluster2 not in remaining_clusters):
+      continue
+    new_cluster = cluster1.Merge(cluster2)
+    remaining_clusters.discard(cluster1)
+    remaining_clusters.discard(cluster2)
+    # Add all possible merges with the new cluster to the heap.
+    for cluster in remaining_clusters:
+      distance = new_cluster.GetDistanceFrom(cluster, distance_matrix)
+      heapq.heappush(heap, (distance, push_count, new_cluster, cluster))
+      push_count += 1
+    remaining_clusters.add(new_cluster)
+  final_clusters = []
+  number_of_stories_covered = 0
+  for cluster in remaining_clusters:
+    cluster.GetRepresentative(distance_matrix)
+    if len(cluster) >= min_cluster_size:
+      final_clusters.append(cluster)
+      number_of_stories_covered += len(cluster)
+  coverage = number_of_stories_covered / len(stories)
+  return final_clusters, coverage
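The merge loop in `cluster_stories.py` can be followed on a tiny input. The sketch below is not the code from this change: it is a standard-library-only illustration of the same idea (complete-linkage clustering driven by a heap, with stale entries skipped), using a plain dict instead of a dataframe. The function name `complete_linkage`, the story names, and the distances are all made up for the example.

```python
import heapq
import itertools


def complete_linkage(distances, max_cluster_count):
    """Merges the closest clusters until max_cluster_count clusters remain.

    distances maps frozenset({story_a, story_b}) to a float distance.
    Returns a set of frozensets, one per cluster.
    """
    stories = sorted({story for pair in distances for story in pair})
    singletons = [frozenset([story]) for story in stories]
    clusters = set(singletons)
    counter = itertools.count()  # tie-breaker for equal distances

    def cluster_distance(c1, c2):
        # Complete linkage: the largest pairwise distance between members.
        return max(distances[frozenset([a, b])] for a in c1 for b in c2)

    heap = [(cluster_distance(c1, c2), next(counter), c1, c2)
            for c1, c2 in itertools.combinations(singletons, 2)]
    heapq.heapify(heap)

    while len(clusters) > max_cluster_count and heap:
        _, _, c1, c2 = heapq.heappop(heap)
        if c1 not in clusters or c2 not in clusters:
            continue  # stale entry: one side was already merged away
        clusters -= {c1, c2}
        merged = c1 | c2
        for other in clusters:
            heapq.heappush(
                heap,
                (cluster_distance(merged, other), next(counter), merged, other))
        clusters.add(merged)
    return clusters


# Hypothetical distances between four stories.
EXAMPLE_DISTANCES = {
    frozenset(['a', 'b']): 0.1,
    frozenset(['c', 'd']): 0.2,
    frozenset(['a', 'c']): 0.9,
    frozenset(['a', 'd']): 0.8,
    frozenset(['b', 'c']): 0.7,
    frozenset(['b', 'd']): 0.95,
}
print(complete_linkage(EXAMPLE_DISTANCES, max_cluster_count=2))
```

With these numbers the pair `a`, `b` is merged first (distance 0.1), then `c`, `d` (distance 0.2), which leaves two clusters and ends the loop.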
--- a/tools/perf/experimental/story_clustering/create_soundwave_input.py
+++ b/tools/perf/experimental/story_clustering/create_soundwave_input.py
+# Copyright 2019 The Chromium Authors. All rights reserved.
+# Use of this source code is governed by a BSD-style license that can be
+# found in the LICENSE file.
+import argparse
+import json
+import sys
+def CreateInput(test_suite, platforms, metrics, test_cases_path, output_dir):
+  with open(test_cases_path, 'r') as test_case_file:
+    test_cases = [line.strip() for line in test_case_file]
+  json_data = []
+  for platform in platforms:
+    for metric in metrics:
+      for test_case in test_cases:
+        json_data.append({
+          'test_suite': test_suite,
+          'bot': platform,
+          'measurement': metric,
+          'test_case': test_case
+        })
+  with open(output_dir, 'w') as output:
+    json.dump(json_data, output)
+def Main(argv):
+  parser = argparse.ArgumentParser(
+    description=('Creates the JSON input needed by soundwave'))
+  parser.add_argument('test_suite', help=('Name of test_suite (example: "'
+            'rendering.desktop")'))
+  parser.add_argument('--platforms', help='Name of platform (example: '
+            '"ChromiumPerf:Win 7 Nvidia GPU Perf")', nargs='*')
+  parser.add_argument('--metrics', help='Name of measurement (example: '
+            '"frame_times")', nargs='*')
+  parser.add_argument('--test-cases-path', type=str,
+            help='Path for the file having test_cases')
+  parser.add_argument('--output-dir', type=str,
+            help='Path for the output file')
+  args = parser.parse_args(argv[1:])
+  return CreateInput(
+    args.test_suite,
+    args.platforms,
+    args.metrics,
+    args.test_cases_path,
+    args.output_dir)
+if __name__ == '__main__':
+  sys.exit(Main(sys.argv))
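`CreateInput` writes one record per combination of platform, metric, and test case. A compact way to picture that is a Cartesian product; the sketch below uses `itertools.product` and keeps the same nesting order as the three loops in the script. The helper name `build_soundwave_records` and the story name `example_story` are hypothetical; the platform and metric names are taken from the README examples.

```python
import itertools


def build_soundwave_records(test_suite, platforms, metrics, test_cases):
    """Returns one dict per (platform, metric, test_case) combination."""
    return [
        {
            'test_suite': test_suite,
            'bot': platform,
            'measurement': metric,
            'test_case': test_case,
        }
        for platform, metric, test_case in itertools.product(
            platforms, metrics, test_cases)
    ]


records = build_soundwave_records(
    'rendering.desktop',
    platforms=['ChromiumPerf:win-10-perf'],
    metrics=['frame_times', 'thread_total_all_cpu_time_per_frame'],
    test_cases=['example_story'])  # hypothetical story name
print(len(records), records[0])
```

The number of records grows as the product of the three list lengths, which is worth keeping in mind when choosing how many platforms and metrics to request.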
--- a/tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py
+++ b/tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py
+# Copyright 2019 The Chromium Authors. All rights reserved.
+# Use of this source code is governed by a BSD-style license that can be
+# found in the LICENSE file.
+from __future__ import print_function
+import argparse
+import json
+import logging
+import os
+import shutil
+import subprocess
+import sys
+import tempfile
+TOOLS_PERF_PATH = os.path.abspath(os.path.join(
+  os.path.dirname(__file__), '..', '..'))
+sys.path.insert(1, TOOLS_PERF_PATH)
+from experimental.story_clustering import similarity_calculator
+from experimental.story_clustering import cluster_stories
+from experimental.story_clustering import create_soundwave_input
+from core.external_modules import pandas
+def CalculateDistances(
+  all_bots_dataframe,
+  bots,
+  rolling_window,
+  metric_name,
+  normalize=False):
+  timeseries = []
+  for bot_name, bot_group in all_bots_dataframe.groupby(bots):
+    temp_dataframe = bot_group.pivot(index='test_case',
+      columns='commit_pos', values='value')
+    temp_dataframe_with_rolling_avg = temp_dataframe.rolling(
+      rolling_window,
+      min_periods=1,
+      axis=1
+    ).mean().stack().rename('value').reset_index()
+    temp_dataframe_with_rolling_avg['bot'] = bot_name
+    timeseries.append(temp_dataframe_with_rolling_avg)
+  all_bots = pandas.concat(timeseries)
+  distance_matrix = similarity_calculator.CalculateDistances(
+    all_bots,
+    metric=metric_name,
+    normalize=normalize,
+  )
+  print('Similarities are calculated for', metric_name)
+  return distance_matrix
+def Main(argv):
+  parser = argparse.ArgumentParser(
+    description=('Gathers the values of each metric and platform pair in a'
+    ' csv file to be used in clustering of stories.'))
+  parser.add_argument('benchmark', type=str, help='benchmark to be used')
+  parser.add_argument('--metrics', type=str, nargs='*',
+    help='List of metrics to use')
+  parser.add_argument('--platforms', type=str, nargs='*',
+    help='List of platforms to use')
+  parser.add_argument('--testcases-path', type=str, help=('Path to the file '
+    'containing a list of all test_cases in the benchmark that needs to '
+    'be clustered'))
+  parser.add_argument('--days', default=180, help=('Number of days to gather'
+    ' data about'))
+  parser.add_argument('--output-path', type=str, help='Output file',
+    default='//tmp/story_clustering/clusters.json')
+  parser.add_argument('--max-cluster-count', default='10',
+    help='Maximum number of clusters to produce')
+  parser.add_argument('--min-cluster-size', default='2', help=('Least number '
+            'of members in a cluster for the cluster to be valid'))
+  parser.add_argument('--rolling-window', default='1', help=('Number of '
+    'samples to take average from while calculating the moving average'))
+  parser.add_argument('--normalize', default=False,
+    help='Normalize timeseries to calculate similarity', action='store_true')
+  args = parser.parse_args(argv[1:])
+  temp_dir = tempfile.mkdtemp('telemetry')
+  startup_timeseries = os.path.join(temp_dir, 'startup_timeseries.json')
+  soundwave_output_path = os.path.join(temp_dir, 'data.csv')
+  soundwave_path = os.path.join(TOOLS_PERF_PATH, 'soundwave')
+  try:
+    output_dir = os.path.dirname(args.output_path)
+    clusters_json = {}
+    if not os.path.isdir(output_dir):
+      os.makedirs(output_dir)
+    # creating the json file needed for soundwave
+    create_soundwave_input.CreateInput(
+      test_suite=args.benchmark,
+      platforms=args.platforms,
+      metrics=args.metrics,
+      test_cases_path=args.testcases_path,
+      output_dir=startup_timeseries)
+    subprocess.call([
+      soundwave_path,
+      '-d', str(args.days),
+      'timeseries',
+      '-i', startup_timeseries,
+      '--output-csv', soundwave_output_path
+    ])
+    # Processing the data.
+    dataframe = pandas.read_csv(soundwave_output_path)
+    dataframe_per_metric = dataframe.groupby(dataframe['measurement'])
+    for metric_name, all_bots in list(dataframe_per_metric):
+      clusters_json[metric_name] = []
+      distance_matrix = CalculateDistances(
+        all_bots_dataframe=all_bots,
+        bots=dataframe['bot'],
+        rolling_window=int(args.rolling_window),
+        metric_name=metric_name,
+        normalize=args.normalize)
+      clusters, coverage = cluster_stories.RunHierarchicalClustering(
+        distance_matrix,
+        max_cluster_count=int(args.max_cluster_count),
+        min_cluster_size=int(args.min_cluster_size),
+      )
+      print()
+      print(metric_name, ':')
+      print(format(coverage * 100.0, '.1f'), 'percent coverage.')
+      print('Stories are grouped into', len(clusters), 'clusters.')
+      print('representatives:')
+      for cluster in clusters:
+        print(cluster.GetRepresentative())
+      print()
+      for cluster in clusters:
+        clusters_json[metric_name].append(cluster.AsDict())
+    with open(args.output_path, 'w') as outfile:
+      json.dump(
+        clusters_json,
+        outfile,
+        separators=(',', ': '),
+        indent=4,
+        sort_keys=True
+      )
+  except Exception:
+    logging.exception('The following exception may have prevented the code'
+      ' from clustering stories.')
+  finally:
+    shutil.rmtree(temp_dir, ignore_errors=True)
+if __name__ == '__main__':
+  sys.exit(Main(sys.argv))
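The `--rolling-window` option smooths each story's time series before distances are computed. As a rough illustration of what `rolling(window, min_periods=1).mean()` does for a series without missing values, the standard-library sketch below averages each point with up to `window - 1` of its predecessors. The function name `rolling_mean` and the sample values are hypothetical, and the handling of NaN values in pandas is not reproduced here.

```python
def rolling_mean(values, window):
    """Trailing moving average that accepts partial windows at the start."""
    result = []
    for i in range(len(values)):
        # Take at most `window` values ending at position i.
        chunk = values[max(0, i - window + 1):i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


print(rolling_mean([10.0, 20.0, 30.0, 40.0], window=2))
```

With the default window of 1 the series is left unchanged, so smoothing only takes effect when a larger window is requested.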
--- a/tools/perf/experimental/story_clustering/similarity_calculator.py
+++ b/tools/perf/experimental/story_clustering/similarity_calculator.py
+# Copyright 2019 The Chromium Authors. All rights reserved.
+# Use of this source code is governed by a BSD-style license that can be
+# found in the LICENSE file.
+import os
+from core.external_modules import pandas
+HIGHEST_VALID_NAN_RATIO = 0.5
+def CalculateDistances(
+  input_dataframe,
+  metric,
+  normalize=False,
+  output_path=None):
+  """Calculates the distances of stories.
+  If the normalize flag is set, the values of each (bot, story) pair are
+  first scaled by their value range. The distance between every two stories
+  is then 1 minus their Pearson correlation, summed over all bots.
+  Args:
+    input_dataframe: A dataframe containing a list of records
+    having (test_case, commit_pos, bot, value).
+    metric: String containing name of the metric.
+    normalize: A flag to determine if normalization is needed.
+    output_path: Path to write the calculated distances.
+  Returns:
+    A dataframe containing the distance matrix of the stories.
+  """
+  input_by_story = input_dataframe.groupby('test_case')['value']
+  total_values_per_story = input_by_story.size()
+  nan_values_per_story = input_by_story.apply(lambda s: s.isna().sum())
+  should_keep = nan_values_per_story < (
+    total_values_per_story * HIGHEST_VALID_NAN_RATIO)
+  valid_stories = total_values_per_story[should_keep].index
+  filtered_dataframe = input_dataframe[
+    input_dataframe['test_case'].isin(valid_stories)]
+  temp_df = filtered_dataframe.copy()
+  if normalize:
+    # Min-max style scaling: divide by the value range of each (bot, story)
+    # pair; the added 1 avoids division by zero.
+    grouped = temp_df.groupby(['bot', 'test_case'])['value']
+    min_value = grouped.transform('min')
+    max_value = grouped.transform('max')
+    temp_df['value'] = temp_df['value'] / (1 + max_value - min_value)
+  distances = pandas.DataFrame()
+  grouped_temp = temp_df.groupby(temp_df['bot'])
+  for _, group in grouped_temp:
+    sample_df = group.pivot(index='commit_pos', columns='test_case',
+      values='value')
+    if distances.empty:
+      distances = 1 - sample_df.corr(method='pearson')
+    else:
+      distances = distances.add(1 - sample_df.corr(method='pearson'),
+        fill_value=0)
+  if output_path is not None:
+    if not os.path.isdir(output_path):
+      os.makedirs(output_path)
+    distances.to_csv(
+      os.path.join(output_path, metric + '_distances.csv')
+    )
+  return distances
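The distance used in `similarity_calculator.py` is `1 - corr(method='pearson')` per bot. For two plain lists of equal length, that quantity can be sketched without pandas as follows. The function name `pearson_distance` and the sample series are hypothetical, and the sketch assumes both series have non-zero variance and no missing values.

```python
import math


def pearson_distance(xs, ys):
    """Returns 1 minus the Pearson correlation of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return 1 - covariance / math.sqrt(variance_x * variance_y)


# Stories that move together are close; stories that move in opposite
# directions are far apart.
print(pearson_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(pearson_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The distance ranges from 0 for perfectly correlated series to 2 for perfectly anti-correlated ones, and the script adds the per-bot distances together to form the final matrix.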