Commit 525a742c authored by behdad, committed by Commit Bot

Clustering stories of the benchmarks

These scripts help with gathering and processing historical data to be used for clustering.

Bug: chromium:959971
Change-Id: I1b42612324be0e9ce2e74babe6cdfb52c92f5c33
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/1594779
Commit-Queue: Behdad Bakhshinategh <behdadb@chromium.org>
Reviewed-by: Caleb Rouleau <crouleau@chromium.org>
Reviewed-by: Sadrul Chowdhury <sadrul@chromium.org>
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#664800}
parent 2a8efe44
behdadb@chromium.org
sadrul@chromium.org
The code in this directory provides support for clustering benchmark stories
and choosing a representative for each cluster.

The clustering methods need the following inputs:
1. Benchmark name.
2. List of metrics to use.
   Clustering is done once for each metric.
3. List of platforms to gather data from.
4. List of test cases (story names) in the benchmark that should be clustered.
   Stories that are recognized as outliers can be removed from this
   list. The test cases can be provided as a text file with one story per line.
5. Maximum number of clusters.
   The actual number of clusters may be lower, as clusters with fewer than
   `--min-cluster-size` members (2 by default) are not reported.
6. Number of days of historical data to use for clustering.

Examples of creating clusters:
```shell
python ./tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py \
rendering.desktop \
--metrics frame_times thread_total_all_cpu_time_per_frame \
--platforms ChromiumPerf:mac-10_13_laptop_high_end-perf ChromiumPerf:mac-10_12_laptop_low_end-perf \
--testcases-path //tmp/story_clustering/rendering.desktop/test_cases.txt \
--days=100 \
--normalize
```
```shell
python ./tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py \
rendering.desktop \
--metrics frame_times thread_total_all_cpu_time_per_frame \
--platforms 'ChromiumPerf:Win 7 Nvidia GPU Perf' 'ChromiumPerf:Win 7 Perf' ChromiumPerf:win-10-perf \
--testcases-path //tmp/story_clustering/rendering.desktop/test_cases.txt \
--days=100 \
--normalize
```
```shell
python ./tools/perf/experimental/story_clustering/gather_historical_records_and_cluster_stories.py \
rendering.mobile \
--metrics frame_times thread_total_all_cpu_time_per_frame \
--platforms 'ChromiumPerf:Android Nexus5 Perf' 'ChromiumPerf:Android Nexus5X WebView Perf' \
'ChromiumPerf:Android Nexus6 WebView Perf' \
--testcases-path //tmp/story_clustering/rendering.mobile/test_cases.txt \
--days=100 \
--normalize
```
The clustering results are written as JSON to the file given by `--output-path` (default: `//tmp/story_clustering/clusters.json`).
[Method explanation](https://goto.google.com/chrome-benchmark-clustering)
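
As a rough illustration of how the output might be consumed, the sketch below summarizes the parsed contents of a `clusters.json` file. The layout (a dictionary keyed by metric name, each entry holding a list of `members`/`representative` dictionaries) follows what `Cluster.AsDict` and the gathering script write. The helper name `SummarizeClusters` and the sample story names are made up for the example.

```python
import json


def SummarizeClusters(clusters_json):
  """Returns {metric: [(representative, cluster size), ...]}.

  Hypothetical helper. The cluster size counts the representative too,
  because AsDict() leaves it out of the 'members' list.
  """
  summary = {}
  for metric, clusters in clusters_json.items():
    summary[metric] = [
        (cluster['representative'], len(cluster['members']) + 1)
        for cluster in clusters]
  return summary


# Made-up sample in the shape written by the script.
SAMPLE = json.loads("""
{
    "frame_times": [
        {"members": ["story_b", "story_c"], "representative": "story_a"},
        {"members": ["story_e"], "representative": "story_d"}
    ]
}
""")

if __name__ == '__main__':
  for metric, clusters in sorted(SummarizeClusters(SAMPLE).items()):
    for representative, size in clusters:
      print(metric, representative, size)
```

For a real run, `SAMPLE` would be replaced by the result of `json.load` on the file passed as `--output-path`.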
# Copyright 2019 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
from __future__ import division

import heapq


class Cluster(object):
  def __init__(self, members):
    """Initializes the cluster instance.

    Args:
      members: Set of story names which belong to this cluster.
    """
    self._members = frozenset(members)
    self._representative = None

  def __len__(self):
    return len(self._members)

  def GetDistanceFrom(self, other_cluster, distance_matrix):
    """Calculates the distance between two clusters.

    The maximum distance between any story of the first cluster and any story
    of the second cluster is used as the distance between the clusters.

    Args:
      other_cluster: Cluster object to calculate distance from.
      distance_matrix: A dataframe containing the distances between any
        two stories.

    Returns:
      A float number representing the distance between two clusters.
    """
    # Lists are used as indexers because newer pandas versions reject sets.
    matrix_slice = distance_matrix.loc[
        list(self.members), list(other_cluster.members)]
    return matrix_slice.max().max()

  @property
  def members(self):
    return self._members

  def GetRepresentative(self, distance_matrix=None):
    """Finds and sets the representative of the cluster.

    The story whose summed distance to the members of the cluster is the
    smallest is used as the representative.

    Args:
      distance_matrix: A dataframe containing the distances between any
        two stories.

    Returns:
      A story which is the representative of the cluster.
    """
    if self._representative:
      return self._representative

    if distance_matrix is None:
      raise Exception('Distance matrix is not set.')

    # Lists are used as indexers because newer pandas versions reject sets.
    members = list(self._members)
    self._representative = distance_matrix.loc[members, members].sum().idxmin()
    return self._representative

  def Merge(self, other_cluster):
    """Merges two clusters.

    Returns:
      A new cluster object which is a result of merging two clusters.
    """
    return Cluster(self.members | other_cluster.members)

  def AsDict(self):
    """Creates a dictionary which describes the cluster object.

    Returns:
      A dictionary containing the members of the cluster and its
        representative. The representative will not be listed in the
        members list.
    """
    representative = self.GetRepresentative()
    members_list = list(self.members.difference([representative]))
    return {
        'members': members_list,
        'representative': representative
    }


def RunHierarchicalClustering(
    distance_matrix,
    max_cluster_count,
    min_cluster_size):
  """Clusters stories.

  Runs a hierarchical clustering algorithm based on the similarity measures.

  Args:
    distance_matrix: A dataframe containing distance matrix of stories.
    max_cluster_count: number representing the maximum number of clusters
      needed per metric.
    min_cluster_size: number representing the least number of members needed
      to make the cluster valid.

  Returns:
    A tuple containing:
      clusters: A list of cluster objects
      coverage: Ratio(float) of stories covered using this clustering
  """
  stories = distance_matrix.index.values
  remaining_clusters = set([])
  for story in stories:
    remaining_clusters.add(Cluster([story]))

  # The hierarchical clustering relies on a sorted list of possible
  # cluster merges ordered by the distance between them.
  heap = []

  # Initially each story is a cluster on its own, and every pair of stories
  # is added as a possible merge.
  for cluster1 in remaining_clusters:
    for cluster2 in remaining_clusters:
      if cluster1 == cluster2:
        break
      heapq.heappush(heap,
                     (cluster1.GetDistanceFrom(cluster2, distance_matrix),
                      cluster1, cluster2))

  # At each step the two closest remaining clusters are merged.
  while (len(remaining_clusters) > max_cluster_count and len(heap) > 0):
    _, cluster1, cluster2 = heapq.heappop(heap)
    # Skip stale entries whose clusters were already merged away.
    if (cluster1 not in remaining_clusters or
        cluster2 not in remaining_clusters):
      continue

    new_cluster = cluster1.Merge(cluster2)
    remaining_clusters.discard(cluster1)
    remaining_clusters.discard(cluster2)

    # Adding all possible merges to the heap
    for cluster in remaining_clusters:
      distance = new_cluster.GetDistanceFrom(cluster, distance_matrix)
      heapq.heappush(heap, (distance, new_cluster, cluster))

    remaining_clusters.add(new_cluster)

  final_clusters = []
  number_of_stories_covered = 0
  for cluster in remaining_clusters:
    cluster.GetRepresentative(distance_matrix)
    if len(cluster) >= min_cluster_size:
      final_clusters.append(cluster)
      number_of_stories_covered += len(cluster)
  coverage = number_of_stories_covered / len(stories)
  return final_clusters, coverage
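
The merge loop above can be tried out without pandas. The following is a minimal sketch of the same idea (complete-linkage distances, a heap of candidate merges, stale entries skipped), not the code from this change. It assumes distances are stored in a plain dictionary keyed by story pairs, and the names `CompleteLinkage`, `ClusterStories` and `DISTANCES` are illustrative. The counter used as a tie-breaker in the heap entries is an addition that the original code does not have.

```python
import heapq
import itertools


def CompleteLinkage(cluster_a, cluster_b, distances):
  """Largest pairwise distance between members of two disjoint clusters."""
  return max(distances[frozenset((a, b))]
             for a in cluster_a for b in cluster_b)


def ClusterStories(stories, distances, max_cluster_count):
  """Merges the closest clusters until at most max_cluster_count remain."""
  remaining = {frozenset([story]) for story in stories}
  counter = itertools.count()  # Unique tie-breaker for equal distances.
  heap = []
  for a, b in itertools.combinations(remaining, 2):
    heapq.heappush(
        heap, (CompleteLinkage(a, b, distances), next(counter), a, b))

  while len(remaining) > max_cluster_count and heap:
    _, _, a, b = heapq.heappop(heap)
    if a not in remaining or b not in remaining:
      continue  # Stale entry: one side was already merged away.
    remaining -= {a, b}
    merged = a | b
    for other in remaining:
      heapq.heappush(
          heap,
          (CompleteLinkage(merged, other, distances), next(counter),
           merged, other))
    remaining.add(merged)
  return remaining


# Made-up distances between four stories.
DISTANCES = {
    frozenset(('s1', 's2')): 0.1,
    frozenset(('s1', 's3')): 0.9,
    frozenset(('s1', 's4')): 0.8,
    frozenset(('s2', 's3')): 0.7,
    frozenset(('s2', 's4')): 0.95,
    frozenset(('s3', 's4')): 0.2,
}

if __name__ == '__main__':
  for cluster in ClusterStories(['s1', 's2', 's3', 's4'], DISTANCES, 2):
    print(sorted(cluster))
```

With these sample distances, `s1` and `s2` are merged first (distance 0.1), then `s3` and `s4` (distance 0.2), at which point two clusters remain and the loop stops.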
# Copyright 2019 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
import argparse
import json
import sys


def CreateInput(test_suite, platforms, metrics, test_cases_path, output_dir):
  # Note: despite its name, output_dir is the path of the JSON file to write.
  with open(test_cases_path, 'r') as test_case_file:
    test_cases = [line.strip() for line in test_case_file]

  json_data = []
  for platform in platforms:
    for metric in metrics:
      for test_case in test_cases:
        json_data.append({
            'test_suite': test_suite,
            'bot': platform,
            'measurement': metric,
            'test_case': test_case
        })

  with open(output_dir, 'w') as output:
    json.dump(json_data, output)


def Main(argv):
  parser = argparse.ArgumentParser(
      description=('Creates the JSON input needed by soundwave'))
  parser.add_argument('test_suite', help=('Name of test_suite (example: "'
                                          'rendering.desktop")'))
  parser.add_argument('--platforms', help='Name of platform (example: '
                      '"ChromiumPerf:Win 7 Nvidia GPU Perf")', nargs='*')
  parser.add_argument('--metrics', help='Name of measurement (example: '
                      '"frame_times")', nargs='*')
  parser.add_argument('--test-cases-path', type=str,
                      help='Path for the file having test_cases')
  parser.add_argument('--output-dir', type=str,
                      help='Path for the output file')

  args = parser.parse_args(argv[1:])

  return CreateInput(
      args.test_suite,
      args.platforms,
      args.metrics,
      args.test_cases_path,
      args.output_dir)


if __name__ == '__main__':
  sys.exit(Main(sys.argv))
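
To make the shape of the generated JSON concrete, here is a small sketch that builds the same records in memory: one record per combination of platform, metric and test case. `BuildSoundwaveRecords` is a hypothetical helper written for this example (it takes the test cases as a list instead of reading them from a file), and the story names are made up.

```python
import itertools
import json


def BuildSoundwaveRecords(test_suite, platforms, metrics, test_cases):
  """Returns one record per (platform, metric, test case) combination."""
  return [
      {
          'test_suite': test_suite,
          'bot': platform,
          'measurement': metric,
          'test_case': test_case,
      }
      for platform, metric, test_case in itertools.product(
          platforms, metrics, test_cases)
  ]


RECORDS = BuildSoundwaveRecords(
    'rendering.desktop',
    ['ChromiumPerf:win-10-perf'],
    ['frame_times', 'thread_total_all_cpu_time_per_frame'],
    ['story_a', 'story_b'])  # Made-up story names.

if __name__ == '__main__':
  print(json.dumps(RECORDS, indent=2))
```

One platform, two metrics and two stories give four records, so the size of the query grows with the product of the three lists.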
# Copyright 2019 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
from __future__ import print_function

import argparse
import json
import logging
import os
import shutil
import subprocess
import sys
import tempfile

TOOLS_PERF_PATH = os.path.abspath(os.path.join(
    os.path.dirname(__file__), '..', '..'))
sys.path.insert(1, TOOLS_PERF_PATH)

from experimental.story_clustering import similarity_calculator
from experimental.story_clustering import cluster_stories
from experimental.story_clustering import create_soundwave_input
from core.external_modules import pandas


def CalculateDistances(
    all_bots_dataframe,
    bots,
    rolling_window,
    metric_name,
    normalize=False):
  timeseries = []

  for bot_name, bot_group in all_bots_dataframe.groupby(bots):
    temp_dataframe = bot_group.pivot(index='test_case',
                                     columns='commit_pos', values='value')
    # Smooth each story's series with a moving average over commit positions.
    temp_dataframe_with_rolling_avg = temp_dataframe.rolling(
        rolling_window,
        min_periods=1,
        axis=1
    ).mean().stack().rename('value').reset_index()
    temp_dataframe_with_rolling_avg['bot'] = bot_name
    timeseries.append(temp_dataframe_with_rolling_avg)

  all_bots = pandas.concat(timeseries)
  distance_matrix = similarity_calculator.CalculateDistances(
      all_bots,
      metric=metric_name,
      normalize=normalize,
  )
  print('Similarities are calculated for', metric_name)

  return distance_matrix


def Main(argv):
  parser = argparse.ArgumentParser(
      description=('Gathers the values of each metric and platform pair in a'
                   ' csv file to be used in clustering of stories.'))
  parser.add_argument('benchmark', type=str, help='benchmark to be used')
  parser.add_argument('--metrics', type=str, nargs='*',
      help='List of metrics to use')
  parser.add_argument('--platforms', type=str, nargs='*',
      help='List of platforms to use')
  parser.add_argument('--testcases-path', type=str, help=('Path to the file '
      'containing a list of all test_cases in the benchmark that need to '
      'be clustered'))
  parser.add_argument('--days', default=180, help=('Number of days to gather'
      ' data about'))
  parser.add_argument('--output-path', type=str, help='Output file',
      default='//tmp/story_clustering/clusters.json')
  parser.add_argument('--max-cluster-count', default='10',
      help=('Maximum number of clusters, counted before clusters that are '
      'too small are dropped'))
  parser.add_argument('--min-cluster-size', default='2', help=('Least number '
      'of members in cluster, to make cluster valid'))
  parser.add_argument('--rolling-window', default='1', help=('Number of '
      'samples to take average from while calculating the moving average'))
  parser.add_argument('--normalize', default=False,
      help='Normalize timeseries to calculate similarity', action='store_true')
  args = parser.parse_args(argv[1:])

  temp_dir = tempfile.mkdtemp('telemetry')
  startup_timeseries = os.path.join(temp_dir, 'startup_timeseries.json')
  soundwave_output_path = os.path.join(temp_dir, 'data.csv')
  soundwave_path = os.path.join(TOOLS_PERF_PATH, 'soundwave')

  try:
    output_dir = os.path.dirname(args.output_path)
    clusters_json = {}

    if not os.path.isdir(output_dir):
      os.makedirs(output_dir)

    # creating the json file needed for soundwave
    create_soundwave_input.CreateInput(
        test_suite=args.benchmark,
        platforms=args.platforms,
        metrics=args.metrics,
        test_cases_path=args.testcases_path,
        output_dir=startup_timeseries)

    # The default of --days is an int, but subprocess arguments must be
    # strings.
    subprocess.call([
        soundwave_path,
        '-d', str(args.days),
        'timeseries',
        '-i', startup_timeseries,
        '--output-csv', soundwave_output_path
    ])

    # Processing the data.
    dataframe = pandas.read_csv(soundwave_output_path)
    dataframe_per_metric = dataframe.groupby(dataframe['measurement'])
    for metric_name, all_bots in list(dataframe_per_metric):
      clusters_json[metric_name] = []

      distance_matrix = CalculateDistances(
          all_bots_dataframe=all_bots,
          bots=dataframe['bot'],
          rolling_window=int(args.rolling_window),
          metric_name=metric_name,
          normalize=args.normalize)

      clusters, coverage = cluster_stories.RunHierarchicalClustering(
          distance_matrix,
          max_cluster_count=int(args.max_cluster_count),
          min_cluster_size=int(args.min_cluster_size),
      )
      print()
      print(metric_name, ':')
      print(format(coverage * 100.0, '.1f'), 'percent coverage.')
      print('Stories are grouped into', len(clusters), 'clusters.')
      print('representatives:')
      for cluster in clusters:
        print(cluster.GetRepresentative())
      print()

      for cluster in clusters:
        clusters_json[metric_name].append(cluster.AsDict())

    with open(args.output_path, 'w') as outfile:
      json.dump(
          clusters_json,
          outfile,
          separators=(',', ': '),
          indent=4,
          sort_keys=True
      )

  except Exception:
    logging.exception('The following exception may have prevented the code'
                      ' from clustering stories.')
  finally:
    shutil.rmtree(temp_dir, ignore_errors=True)


if __name__ == '__main__':
  sys.exit(Main(sys.argv))
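
The smoothing applied in `CalculateDistances` is a trailing rolling mean with `min_periods=1`, so the first samples are averaged over however many values are available. The sketch below shows that behaviour on a plain list. `RollingMean` is a hypothetical stand-in for the pandas call, and unlike pandas it does not handle missing (NaN) values.

```python
def RollingMean(values, window):
  """Trailing moving average that also averages the shorter leading windows."""
  averaged = []
  for end in range(1, len(values) + 1):
    chunk = values[max(0, end - window):end]
    averaged.append(sum(chunk) / float(len(chunk)))
  return averaged


if __name__ == '__main__':
  print(RollingMean([10, 20, 30, 40], 2))
  print(RollingMean([10, 20, 30, 40], 1))
```

A window of 2 turns `[10, 20, 30, 40]` into `[10.0, 15.0, 25.0, 35.0]`, while a window of 1 (the default of `--rolling-window`) leaves the values unchanged.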
# Copyright 2019 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
import os

from core.external_modules import pandas

HIGHEST_VALID_NAN_RATIO = 0.5


def CalculateDistances(
    input_dataframe,
    metric,
    normalize=False,
    output_path=None):
  """Calculates the distances of stories.

  If the normalize flag is set, the values of each (bot, story) series are
  first divided by (1 + max - min), a variant of min-max normalization.
  For each bot, the distance between every two stories is then calculated as
  1 minus their Pearson correlation, and the distances are summed over bots.

  Args:
    input_dataframe: A dataframe containing a list of records
      having (test_case, commit_pos, bot, value).
    metric: String containing name of the metric.
    normalize: A flag to determine if normalization is needed.
    output_path: Path to write the calculated distances.

  Returns:
    A dataframe containing the distance matrix of the stories.
  """
  # Drop stories for which at least half of the values are missing.
  input_by_story = input_dataframe.groupby('test_case')['value']
  total_values_per_story = input_by_story.size()
  nan_values_per_story = input_by_story.apply(lambda s: s.isna().sum())
  should_keep = nan_values_per_story < (
      total_values_per_story * HIGHEST_VALID_NAN_RATIO)
  valid_stories = total_values_per_story[should_keep].index

  filtered_dataframe = input_dataframe[
      input_dataframe['test_case'].isin(valid_stories)]

  temp_df = filtered_dataframe.copy()

  if normalize:
    # Min Max normalization
    grouped = temp_df.groupby(['bot', 'test_case'])['value']
    min_value = grouped.transform('min')
    max_value = grouped.transform('max')
    temp_df['value'] = temp_df['value'] / (1 + max_value - min_value)

  distances = pandas.DataFrame()
  grouped_temp = temp_df.groupby(temp_df['bot'])
  for _, group in grouped_temp:
    sample_df = group.pivot(index='commit_pos', columns='test_case',
                            values='value')

    if distances.empty:
      distances = 1 - sample_df.corr(method='pearson')
    else:
      distances = distances.add(1 - sample_df.corr(method='pearson'),
                                fill_value=0)

  if output_path is not None:
    if not os.path.isdir(output_path):
      os.makedirs(output_path)
    distances.to_csv(
        os.path.join(output_path, metric + '_distances.csv')
    )

  return distances
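
The distance measure can be illustrated without pandas. In the sketch below, each story is a dictionary mapping a bot name to its series of values, and the distance between two stories is the sum over bots of 1 minus the Pearson correlation, so each bot contributes between 0 (perfectly correlated) and 2 (perfectly anti-correlated). `PearsonCorrelation`, `StoryDistance` and the sample data are illustrative assumptions. Unlike `DataFrame.corr`, the sketch does not handle missing values or constant series.

```python
import math


def PearsonCorrelation(xs, ys):
  """Pearson correlation of two equally long, non-constant series."""
  n = len(xs)
  mean_x = sum(xs) / float(n)
  mean_y = sum(ys) / float(n)
  covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
  spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
  spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
  return covariance / (spread_x * spread_y)


def StoryDistance(series_per_bot_a, series_per_bot_b):
  """Sums 1 - correlation over the bots both stories were measured on."""
  return sum(
      1 - PearsonCorrelation(series_per_bot_a[bot], series_per_bot_b[bot])
      for bot in series_per_bot_a if bot in series_per_bot_b)


# Made-up series: the stories move together on bot1 and oppositely on bot2.
STORY_A = {'bot1': [1.0, 2.0, 3.0, 4.0], 'bot2': [2.0, 4.0, 6.0, 8.0]}
STORY_B = {'bot1': [10.0, 20.0, 30.0, 40.0], 'bot2': [8.0, 6.0, 4.0, 2.0]}

if __name__ == '__main__':
  print(StoryDistance(STORY_A, STORY_B))
```

For the sample data, bot1 contributes a distance of about 0 and bot2 a distance of about 2, so the stories end up roughly 2 apart.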