Commit 2cda996d authored by tdresser, committed by Commit Bot

Add documentation on authoring metrics.

BUG=None

Review-Url: https://codereview.chromium.org/2973213002
Cr-Commit-Position: refs/heads/master@{#488299}
parent af101146
# Making Metrics Actionable with Diagnostic Metrics
[TOC]
We want our metrics to be reflective of user experience, so we know we’re optimizing for the right thing. However, metrics which accurately reflect user experience are often so high level that they aren’t very actionable. Diagnostic metrics are submetrics which enable us to act on our high level user experience metrics. Also see the document on constructing a [good toplevel metric](good_toplevel_metrics.md) for guidance on constructing high quality user experience metrics.
There are three types of diagnostic metrics:
* Summations
* Slices
* Proxies
## Summation Diagnostics
We often notice that a number is Too Big. Whether it’s the time it took to generate a frame, or the time until a page was visible, the first thing we want to know is what’s contributing to the number.
Summations enable us to answer these questions. In a Summation diagnostic, the diagnostic metrics sum up to the higher level metric. For example, a Summation diagnostic for First Meaningful Paint (FMP) might be the durations the main thread spent on various kinds of work, such as Style, Layout, V8, and Idle, before FMP fired. These diagnostics often lead to hierarchies, where the top level metric, such as FMP, has a diagnostic metric, such as time spent in V8 before FMP, which has further diagnostic metrics, such as the time spent parsing, compiling, or executing JS. Summation breakdowns are implemented in telemetry as [Related Histogram Breakdowns](https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/value/diagnostics/related_histogram_breakdown.html?q=RelatedHistogramBreakdown&sq=package:chromium&l=18).
With Summation diagnostics, the top level metric equals the sum of all diagnostic metrics. It’s **extremely important** that you don’t leave things out of a Summation diagnostic. This can seem a little daunting - how are you going to account for everything that contributes to the top level metric?
The best way to do this is to start with something you can easily measure, and also report the "unexplained time".
Suppose we're creating a Summation diagnostic for TimeFromNavStartToInteractive, and suppose we can easily time Idle and Script. So, we report only those two (don’t do this!):
* TimeInScript: 800ms
* TimeInIdle: 300ms
You'd incorrectly conclude from this data that script is the problem, and focus on optimizing script. This would be a shame, because if you had reported unexplained time, the reality would become clearer:
* TimeInScript: 800ms
* TimeInIdle: 300ms
* Unexplained: 800ms
Here, it jumps out that there's time you haven't explained, and that you should explain it before you leap to conclusions.
So, start with a single pair of data:
1. a specific submetric that you're sure you can measure, and
2. a way to measure "the rest."
It might be that you start off just with:
1. Time in Script
2. Unexplained == TimeToInteractive - TimeInScript
But at least when you do this, your "unexplained time" is jumping out at you. From there, your goal is to drive that number downward to the 5%-ish range. Maybe on most pages, script is so huge that you get to 80% coverage. Great! Then, you study a few pages with high "unexplained" time and figure out, "aha, this has a lot of idle time." So you add idle to your diagnostics, and maybe that gets you to 90% coverage. Repeat until you're happy enough.
Diagnostics are imperfect - you'll always have some unexplained time. Tracking it will keep you honest and pointed in the right direction.
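As a minimal sketch of this bookkeeping (the metric names and values are hypothetical; real breakdowns are reported through telemetry's Related Histogram Breakdowns):
```python
# Minimal sketch of a Summation diagnostic with an explicit "unexplained"
# bucket. Metric names and values are hypothetical.

def summation_breakdown(top_level_ms, explained_ms):
    """Returns a breakdown whose values sum exactly to top_level_ms."""
    unexplained_ms = top_level_ms - sum(explained_ms.values())
    assert unexplained_ms >= 0, 'Explained time exceeds the top level metric'
    return dict(explained_ms, Unexplained=unexplained_ms)

breakdown = summation_breakdown(
    top_level_ms=1900,  # e.g. TimeFromNavStartToInteractive
    explained_ms={'TimeInScript': 800, 'TimeInIdle': 300})
for name, value in breakdown.items():
    # TimeInScript: 800ms (42%), TimeInIdle: 300ms (16%), Unexplained: 800ms (42%)
    print('%s: %dms (%.0f%%)' % (name, value, 100.0 * value / 1900))
```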
## Slicing Diagnostics
Slicing Diagnostics split up a metric based on its context. For example, we could split up Memory Use by whether a process has foreground tabs, or the number of tabs a user has open, or whether there’s a video playing. For each way we slice the metric, the higher level metric is a weighted average of the diagnostic metrics.
With Slicing diagnostics, the top level metric equals the weighted sum of all diagnostic metrics. In the examples above, the weight of each diagnostic is the fraction of the time spent in the given context. Slicing diagnostics are implemented in telemetry via [Related Histogram Maps](https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/value/diagnostics/related_histogram_map.html?q=RelatedHistogramMap&sq=package:chromium&l=16).
Just as a Summation Diagnostic must account for everything which contributes to the high level metric, a Slicing Diagnostic must not leave out any contexts. If you want to Slice a metric by the number of tabs a user has open, you shouldn’t just use a set of reasonable tab counts, say 1 to 8. You should also have an overflow context (9+), so we get the full picture.
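As an illustration (the tab buckets and numbers are made up), the weighted-sum relationship and the need for an overflow bucket might look like this:
```python
# Sketch of a Slicing diagnostic: memory use sliced by open tab count.
# Buckets and numbers are hypothetical; the point is that the slices,
# including the "9+" overflow bucket, cover every context, so the
# weighted sum reconstructs the top level metric.

slices = {
    # context: (fraction of samples in this context, mean memory use in MB)
    '1-2 tabs': (0.50, 400),
    '3-8 tabs': (0.35, 900),
    '9+ tabs':  (0.15, 2100),  # overflow context - don't leave it out
}

total_weight = sum(weight for weight, _ in slices.values())
assert abs(total_weight - 1.0) < 1e-9, 'Slices must cover all contexts'

top_level_mb = sum(weight * mb for weight, mb in slices.values())
print('Overall memory use: %.0f MB' % top_level_mb)  # 830 MB
```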
## Proxy Diagnostics
Some diagnostic metrics correlate with the higher level metric, but aren’t related in any precise way. For example, the top level FMP metric measures wall clock time. We could add a CPU time equivalent as a diagnostic metric, which is likely to have lower noise. In cases like this, we expect there to exist some monotonic function which approximately maps from the top level metric to the diagnostic metric, but this relationship could be quite rough.
Like Slicing diagnostics, Proxy diagnostics are implemented in telemetry via [Related Histogram Maps](https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/value/diagnostics/related_histogram_map.html?q=RelatedHistogramMap&sq=package:chromium&l=16).
## Composing Diagnostics
Many metrics will have multiple sets of diagnostics. For example, the set of FMP diagnostics shown below involves Slicing, Summation, and Proxy Diagnostics.
With these diagnostics, we can tell whether there’s a regression in JS times that’s specific to pages with a ServiceWorker, or whether there’s a reduction in idle time on pages loaded over metered connections.
![example of diagnostic metrics](images/diagnostic-metrics-example.png)
# Properties of a Good Top Level Metric
When defining a top level metric, there are several desirable properties which are frequently in tension. This document attempts to roughly outline the desirable properties we should keep in mind when defining a metric. Also see the document on improving the actionability of a top level metric via [diagnostic metrics](diagnostic_metrics.md).
[TOC]
## Representative
Top level metrics are how we understand our product’s high level behavior, and if they don’t correlate with user experience, our understanding is misaligned with the real world. However, measuring representativeness is costly. In the long term, we can use ablation studies (in the browser or in partnership with representative sites), or user studies to confirm representativeness. In the short term, we use our intuition in defining the metric, and carefully measure the metric implementation’s accuracy.
These metrics would ideally also correlate strongly with business value, making it easier to motivate site owners to optimize these metrics.
## Accurate
When we first come up with a metric, we have a concept in mind of what the metric is trying to measure. The accuracy of a metric implementation is how closely the metric implementation aligns to our conceptual model of what we’re trying to measure.
For example, First Contentful Paint was created to measure the first time we paint something the user might actually care about. Our current implementation looks at when the browser first painted any text, image, non-white canvas or SVG. The accuracy of this metric is determined by how often the first thing painted which the user cares about is text, image, canvas or SVG.
To evaluate how accurate a metric is, there’s no substitute for manual evaluation. Ideally, this evaluation phase would be performed by multiple people, with little knowledge of the metric in question.
To initially evaluate the accuracy of a point in time metric:
* Gather a bunch of samples of pages where we can compute our metric.
* Get a group of people unfamiliar with the proposed metric implementations to identify what they believe is the correct point in time for each sample.
* Measure the variability of the hand picked points in time. If this amount of variability is deemed too high, we’ll need to come up with a more specific metric, which is easier to hand evaluate.
* Measure the error between the implementation results and the hand picked results. Ideally, our error measurement would be more forgiving in cases where humans were unsure of the correct point in time. We don’t have a concrete plan here yet; a rough sketch of one possibility follows this list.
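One illustrative way to report that error, assuming per-page hand-picked times from several raters (reporting rater disagreement alongside the error is an assumption, not a settled plan):
```python
# Illustrative comparison of a metric implementation against hand-picked
# points in time. The exact error measure is deliberately left open in the
# text above; this just reports the error alongside rater disagreement.
import statistics

def accuracy_report(hand_picked_ms, implementation_ms):
    """hand_picked_ms: {page: [times chosen by each rater, in ms]}
       implementation_ms: {page: time reported by the metric, in ms}"""
    for page, rater_times in hand_picked_ms.items():
        consensus = statistics.median(rater_times)
        spread = statistics.pstdev(rater_times)  # rater disagreement
        error = abs(implementation_ms[page] - consensus)
        print('%s: error %.0fms (raters disagreed by +/- %.0fms)'
              % (page, error, spread))

accuracy_report(
    hand_picked_ms={'example.com': [1200, 1300, 1250]},
    implementation_ms={'example.com': 1600})
```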
To initially evaluate accuracy of a quality of experience metric, we rely heavily on human intuition:
* Gather a bunch of samples of pages where we can compute our metric.
* Get a group of people unfamiliar with the proposed metric implementations to sort the samples by their estimated quality of experience.
* Use [Spearman's rank-order correlation](https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php) to examine how well correlated the different orderings are. If they aren’t deemed consistent enough, we’ll need to come up with a more specific metric, which is easier to hand evaluate.
* Use the metric implementation to sort the samples.
* Use [Spearman's rank-order correlation](https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php) to evaluate how similar the metric implementation's ordering is to the hand ordering (see the sketch below).
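A minimal sketch of that comparison, assuming the rankings have already been collected (the data here is made up):
```python
# Compare a human quality-of-experience ordering with the ordering the
# metric implementation produces, using Spearman's rank-order correlation.
from scipy.stats import spearmanr

# Per-page rank from human raters and from the metric implementation
# (1 = best experience). Hypothetical values.
human_ranking  = [1, 2, 3, 4, 5, 6]
metric_ranking = [2, 1, 3, 4, 6, 5]

rho, p_value = spearmanr(human_ranking, metric_ranking)
print('Spearman rho = %.2f (p = %.3f)' % (rho, p_value))
```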
## Stable
A metric is stable if the result doesn’t vary much between successive runs on similar input. This can be quantitatively evaluated, ideally using Chrome Trace Processor and cluster telemetry on the top 10k sites. Eventually we hope to have a concrete threshold for a specific spread metric here, but for now, we gather the stability data, and analyze it by hand.
Different domains have different amounts of inherent instability - for example, when measuring page load performance using a real network, the network injects significant variability. We can’t avoid this, but we can try to implement metrics which minimize instability, and don’t exaggerate the instability inherent in the system.
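For example, one simple spread statistic to eyeball while analyzing stability by hand is the coefficient of variation across repeated runs (the choice of statistic and the numbers below are illustrative, not a documented standard):
```python
# Coefficient of variation (stddev / mean) across repeated runs of the
# same page under the same conditions. Numbers are hypothetical.
import statistics

def coefficient_of_variation(samples_ms):
    return statistics.pstdev(samples_ms) / statistics.mean(samples_ms)

runs_ms = [1820, 1905, 1780, 1950, 1840]  # five repeated loads
print('Coefficient of variation: %.1f%%'
      % (100 * coefficient_of_variation(runs_ms)))
```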
## Interpretable
A metric is interpretable if the numbers it produces are easy to understand, especially for individuals without strong domain knowledge. Point-in-time metrics tend to be easy to explain, even if their implementations are complicated (see "Simple" below): for example, it’s easy to communicate what First Meaningful Paint is, even if how we compute it is very complicated. Conversely, something like [SpeedIndex](https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index) is somewhat difficult to explain and [hard to reason about](https://docs.google.com/document/d/14K3HTKN7tyROlYQhSiFP89TT-Ddg2aId9uyEsWj5UAY/edit) - it’s the average time at which things were displayed on the page.
Metrics which are easy to interpret are often easier to evaluate. For example, First Meaningful Paint can be evaluated by comparing hand picked first meaningful paint times to the results of a given approach for computing first meaningful paint. SpeedIndex is more complicated to evaluate - we’d need to use the approach given [above](#Accurate) for quality of experience metrics.
## Simple
A metric is simple if the way it’s computed is easy to understand. There’s a strong correlation between being simple and being interpretable, but there are counterexamples, such as FMP being interpretable but not simple.
A simple metric is less likely to have been overfit during the metric development / evaluation phase, and has other obvious advantages (easier to maintain, often faster to execute, less likely to contain bugs).
One crude way of quantifying simplicity is to measure the number of tunable parameters. For example, we can look at two ways of aggregating Frame Throughput. We could look at the average Frame Throughput defined over all animations during the pageview. Alternatively, we could look for the 300ms window with the worst average Frame Throughput. The second approach has one additional parameter, and is thus strictly more complex.
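A sketch of the second aggregation (the frame timestamps are invented; a real implementation would read them from a trace, and the window anchoring here is simplified):
```python
# Find the 300ms window with the worst average frame throughput.
# For simplicity, only windows that start at a frame and fit within the
# animation are considered.

def worst_window_fps(frame_times_ms, window_ms=300):
    last = max(frame_times_ms)
    worst_fps = float('inf')
    for start in frame_times_ms:
        if start + window_ms > last:
            continue  # window would extend past the end of the animation
        frames_in_window = sum(
            1 for t in frame_times_ms if start <= t < start + window_ms)
        worst_fps = min(worst_fps, frames_in_window * 1000.0 / window_ms)
    return worst_fps

# A ~60fps animation that hitches for about 200ms partway through.
frames = list(range(0, 500, 16)) + list(range(700, 1000, 16))
print('Worst 300ms window: %.1f fps' % worst_window_fps(frames))  # ~23.3
```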
## Elastic
A good metric is [elastic](https://en.wikipedia.org/wiki/Elasticity_of_a_function), that is, a small change in the input (the page) results in a small change in the output.
In a continuous integration environment, you want to know whether or not a given code change resulted in metric improvements or regressions. Non-elastic metrics often obscure changes, making it hard to justify small but meaningful improvements, or allowing small but meaningful regressions to slip by. Elastic metrics also generally have lower variability.
This is frequently at odds with the interpretability requirement. For example, First Meaningful Paint is easier to interpret than SpeedIndex, but is non-elastic.
If your metric involves thresholds (such as the 50ms task length threshold in TTI), or heuristics (looking at the largest jump in the number of layout objects in FMP), it’s likely to be non-elastic.
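For reference, the linked definition of elasticity can be written as follows (applying it to thresholds this way is an interpretation of the paragraph above, not part of the original definition):
```latex
% Elasticity of a metric f with respect to its input x: the ratio of the
% relative change in output to the relative change in input.
E_f(x) = \frac{x\,f'(x)}{f(x)} \approx \frac{\Delta f / f(x)}{\Delta x / x}
% A threshold or heuristic makes f behave like a step function: near the
% threshold a tiny \Delta x produces a large \Delta f, so the metric is
% non-elastic there.
```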
## Realtime
We’d like to have metrics which we can compute in realtime. For example, if we’re measuring First Meaningful Paint, we’d like to know when First Meaningful Paint occurred *at the time it occurred*. This isn’t always attainable, but when possible, it avoids some classes of [survivorship bias](https://en.wikipedia.org/wiki/Survivorship_bias), which makes metrics easier to analyze.
# Example
[Time to Consistently Interactive](https://docs.google.com/document/d/1GGiI9-7KeY3TPqS3YT271upUVimo-XiL5mwWorDUD4c/edit):
* Representative
* We should eventually do an ablation study, similar to the page load ablation study [here](https://docs.google.com/document/d/1wpu8aqZIUVgjNm9zBP9gU_swx5ODleH1s2Kueo1pIfc/edit#).
* Accurate
* Summary [here](https://docs.google.com/document/d/1GGiI9-7KeY3TPqS3YT271upUVimo-XiL5mwWorDUD4c/edit#heading=h.iqlwzaf6lqrh), analysis [here](https://docs.google.com/document/d/1pZsTKqcBUb1pc49J89QbZDisCmHLpMyUqElOwYqTpSI/edit#bookmark=id.4euqu19nka18). Overall, based on manual investigation of 25 sites, our approach fired uncontroversially at the right time 64% of the time, and possibly too late the other 36% of the time. We split TTI in two to allow this metric to be quite pessimistic about when TTI fires, so we’re happy with when it fires for all 25 sites. A few issues with this research:
* Ideally someone less familiar with our approach would have performed the evaluation.
* Ideally we’d have looked at more than 25 sites.
* Stable
* Analysis [here](https://docs.google.com/document/d/1GGiI9-7KeY3TPqS3YT271upUVimo-XiL5mwWorDUD4c/edit#heading=h.27s41u6tkfzj).
* Interpretable
* Time to Consistently Interactive is easy to explain: we report the start of the first 5 second window where the network is roughly idle and no task is longer than 50ms (a simplified sketch of this window search follows this list).
* Elastic
* Time to Consistently Interactive is generally non-elastic. We’re investigating another metric which will quantify how busy the main thread is between FMP and TTI, which should be a nice elastic proxy metric for TTI.
* Simple
* Time To Consistently Interactive has a reasonable amount of complexity, but is much simpler than Time to First Interactive. Time to Consistently Interactive has 3 parameters:
* Number of allowable requests during network idle (currently 2).
* Length of allowable tasks during main thread idle (currently 50ms).
* Window length (currently 5 seconds).
* Realtime
* Time To Consistently Interactive is definitely not realtime, as it needs to wait until it’s seen 5 seconds of idle time before declaring that we became interactive at the start of the 5 second window.
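A much-simplified sketch of that window search, using the parameters listed above (the input format and function name are invented; the real definition has more nuance, for example around network quiet tracking and the FMP lower bound):
```python
# Find the start of the first 5 second window with no long main thread
# tasks and a roughly idle network. Inputs are assumed to be pre-filtered:
# long_tasks are main thread tasks longer than 50ms, and
# busy_network_spans are periods with more than 2 in-flight requests.

WINDOW_MS = 5000  # window length

def consistently_interactive_ms(long_tasks, busy_network_spans):
    """long_tasks, busy_network_spans: lists of (start_ms, end_ms) spans.
    Returns the start of the first quiet window of WINDOW_MS."""
    blockers = sorted(long_tasks + busy_network_spans)
    candidate = 0
    for start, end in blockers:
        if start - candidate >= WINDOW_MS:
            return candidate  # a full quiet window fits before this blocker
        candidate = max(candidate, end)
    return candidate  # quiet from here on (caller must see 5s more of trace)

print(consistently_interactive_ms(
    long_tasks=[(1200, 1400), (2600, 2700)],
    busy_network_spans=[(0, 900)]))  # -> 2700
```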