Clarifications to network bug triage process.

Specifically:
* Highlight top level responsibilities.
* Clarify detailed responsibilities around data gathering, filing crasher
  bugs, and monitoring Gasper/UMA.

BUG=None
R=mmenke@chromium.org

Review URL: https://codereview.chromium.org/970953002

Cr-Commit-Position: refs/heads/master@{#319903}

Commit 9ae51ac1 authored Mar 10, 2015 by rdsmith Committed by Commit bot Mar 10, 2015

Showing with 100 additions and 31 deletions

net/docs/bug-triage-suggested-workflow.txt +70 -19
net/docs/bug-triage.txt +30 -12

--- a/net/docs/bug-triage-suggested-workflow.txt
+++ b/net/docs/bug-triage-suggested-workflow.txt
+Look for new crashers:
+* Go to go/chromecrash.
+* For each platform, look through the releases for which releases to
+  investigate.  As per bug-triage.txt, this should be the
+  most recent canary, the previous canary (if the most recent is less
+  than a day old), and any of dev/beta/stable that were released in the
+  last couple of days.  
+* For each release, in the "Process Type" frame, click on "browser".
+* At the bottom of the "Magic Signature" frame,  click "limit 1000".
+  Reported crashers are sorted in decreasing order of the number of reports for
+  that crash signature.
+* Search the page for "net::".  
+* For each found signature:
+  * If there is a bug already filed, make sure it is correctly
+    describing the current bug (e.g. not closed, or not describing a
+    long-past issue), and make sure that if it is a net:: bug, that
+    it is labeled as such.  
+  * Ignore signatures that only occur once, as memory corruption can
+    easily cause one-off failures when the sample size is large
+    enough.
+  * Ignore signatures that only come from a single client ID, as
+    individual machine malware and breakage can also easily cause
+    one-off failures.  
+  * Click on the number of reports field to see details of
+    crash. Ignore it if it doesn't appear to be a network bug. 
+  * Otherwise, file a new bug directly from chromecrash.  Note that
+    this may result in filing bugs for low- and very-low- frequency
+    crashes.  That's ok; the bug tracker is a better tool to figure
+    out whether or not we put resources into those crashes than a snap
+    judgement when filing bugs.
+* For each bug you file, include the following information:
+  * The backtrace.  Note that the backtrace should not be added to the
+    bug if Restrict-View-Google isn't set on the bug as it may contain
+    PII.  Filing the bug from the crash reporter should do this
+    automatically, but check.
+  * The channel in which the bug is seen (canary/dev/beta/stable),
+    its frequency in that channel, and its rank among crashers in the channel.
+  * The frequency of this signature in recent releases.  This
+    information is available by:
+    * Clicking on the signature in the "Magic Signature" list
+    * Clicking "Edit" on the dremel query at the top of the page
+    * Removing the "product.version='X.Y.Z.W' AND" string and clicking
+      "Update".
+    * Clicking "Limit 1000" in the Product Version list in the
+      resulting page (without this, the listing will be restricted to
+      the releases in which the signature is most common, which will
+      often not include the canary/dev release being investigated).
+    * Choose some subset of that list, or all of it, to include in the
+      bug.  Make sure to indicate if there is a defined point in the
+      past before which the signature is not present.
 Identifying unlabeled network bugs on the tracker:
 * Look at new unconfirmed bugs since noon PST on the last triager's rotation:
    https://code.google.com/p/chromium/issues/list?can=2&q=status%3Aunconfirmed&sort=-id&num=1000
@@ -28,6 +79,9 @@ Investigating Cr-Internals-Network bugs:
 * Look through unconfirmed and untriaged Cr-Internals-Network bugs, prioritizing
    those updated within the last week:
    https://code.google.com/p/chromium/issues/list?can=2&q=Cr%3DInternals-Network+-status%3AAssigned+-status%3AStarted+-status%3AAvailable+&sort=-modified
+* If more information is needed from the reporter, ask for it and
+    add the 'Needs-Feedback' label.  If the reporter has answered an
+    earlier request for information, remove that label.
 * While investigating a new issue, change the status to Untriaged.
 * If a bug is a potential security issue (Allows for code execution from remote
    site, allows crossing security boundaries, unchecked array bounds, etc) mark
@@ -75,25 +129,22 @@ Investigating Cr-Internals-Network bugs:
    sublabel applies, or only the Cr-Internals-Network-HTTP sublabel applies,
    and there's no clear owner), try to figure out the exact cause.
-Look for new crashers:
-* Go to go/chromecrash.
-* For each platform, go to the latest canary.
-* In the "Process Type" frame, click on "browser".
-* At the bottom of the "Magic Signature" frame,  click "limit 1000".
-  Reported crashers are sorted in decreasing order of the number of reports for
-  that crash signature.
-* Search the page for "net::".  Ignore crashes that only occur once, as
-    memory corruption can easily cause one-off failures when the sample size is
-    large enough.
-* Click on the number of reports field to see details of crash. Look at the
-    stack trace to confirm it's a network bug.
-  * If it is, and there's no associated bug filed, file a new bug directly from
-      chromecrash, looking at earlier canaries to determine if it's a recent
-      regression.  Use the most specific label possible.
-* The most recent Canary may not yet have a full day of crashes, so it may be
-    worth looking at more than one version.
-* If there's been a dev, beta, or stable release in the last couple days, should
-    also look at those.
+Monitor UMA histograms and gasper alerts.  For each Gasper alert that
+fires, determine if it's a real alert and file a bug if so.
+* Don't file if the alert is coincident with a major volume change.
+  The volume at a particular date can be determined by hovering the
+  mouse over the appropriate location on the alert line.
+* Don't file if the alert is on a graph with very low volume (< ~200
+  data points); it's probably noise, and we probably don't care even
+  if it isn't.
+* Don't file if the graph is really noisy (but eyeball it to decide if
+  there is an underlying important shift under the noise).
+* Don't file if the alert is in the "Known Ignorable" list:
+  * SimpleCache on Windows
+  * DiskCache on Android.
+For each Gasper alert, respond to chrome-network-debugging@ with a
+summary of the action you've taken and why, including issue link if an
+issue was filed.
 Investigating crashers:
 * Only investigate crashers that are still occurring, as identified by above
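
The "Look for new crashers" steps added above can be sketched in code.
This is only an illustration of the selection rules, not part of the
commit: the Release and CrashSignature records, their fields, and both
function names are hypothetical (go/chromecrash is used by hand in the
workflow), and "the last couple of days" is read as two days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Release:
    # Hypothetical record; not a chromecrash data structure.
    channel: str          # "canary", "dev", "beta" or "stable"
    version: str
    released_at: datetime


@dataclass(frozen=True)
class CrashSignature:
    # Hypothetical record; not a chromecrash data structure.
    signature: str
    report_count: int
    client_count: int     # number of distinct client IDs reporting it


def releases_to_check(releases, now, recent=timedelta(days=2)):
    """Pick the releases the workflow says to look at for one platform."""
    canaries = sorted(
        (r for r in releases if r.channel == "canary"),
        key=lambda r: r.released_at,
        reverse=True,
    )
    picked = canaries[:1]
    # Include the previous canary when the newest is under a day old.
    if len(canaries) > 1 and now - canaries[0].released_at < timedelta(days=1):
        picked.append(canaries[1])
    # "Last couple of days" is read as two days here; that is an assumption.
    picked.extend(
        r for r in releases
        if r.channel in ("dev", "beta", "stable") and now - r.released_at <= recent
    )
    return picked


def signatures_to_review(signatures):
    """Keep net:: signatures seen more than once and from more than one client."""
    kept = [
        s for s in signatures
        if "net::" in s.signature and s.report_count > 1 and s.client_count > 1
    ]
    # Most-reported first, matching the ordering of the "Magic Signature" list.
    return sorted(kept, key=lambda s: s.report_count, reverse=True)
```

Signatures that survive this filter still need the manual steps from the
workflow: checking for an existing bug, and opening the report details
to confirm the crash is a network bug before filing.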

--- a/net/docs/bug-triage.txt
+++ b/net/docs/bug-triage.txt
@@ -4,15 +4,30 @@ label seems suitable.
 Responsibilities
-To be done on each rotation.  These responsibilities should be tracked, and
-anything left undone at the end of a rotation should be handed off to the next
-triager.  The downside to passing along bug investigations like this is each new
+Required:
+* Identify new crashers
+* Identify new network issues.
+* Request data about recent Cr-Internals-Network issue.
+* Investigate each recent Cr-Internals-Network issue.
+* Monitor UMA histograms and gasper alerts.
+Best effort:
+* Investigate unowned and owned-but-forgotten net/ crashers
+* Investigate old bugs
+* Close obsolete bugs.
+All of the above is to be done on each rotation.  These
+responsibilities should be tracked, and anything left undone at the
+end of a rotation should be handed off to the next triager.  The
+downside to passing along bug investigations like this is each new
 triager has to get back up to speed on bugs the previous triager was
-investigating.  The upside is that triagers don't get stuck investigating issues
-after their time after their rotation, and it results in a uniform, predictable
-two day commitment for all triagers.
+investigating.  The upside is that triagers don't get stuck
+investigating issues after their time after their rotation, and it
+results in a uniform, predictable two day commitment for all triagers.
+More detail:
-Primary Responsibilities:
+Required activities:
 * Identify new crashers that are potentially network related.  You should check
    the most recent canary, the previous canary (if the most recent less than a
    day old), and any of dev/beta/stable that were released in the last couple
@@ -24,10 +39,9 @@ Primary Responsibilities:
    responsible for looking at bugs reported from noon PST / 3:00 pm EST of the
    last day of the previous triager's rotation until the same time on the last
    day of their rotation.
-* Request data about recent unassigned Cr-Internals-Network bugs from reporters.
-    "Recent" means issues updated in the past week or so.
 * Investigate each recent (New comment within the past week or so)
-    Cr-Internals-Network issue until you can do one of the following:
+  Cr-Internals-Network issue, driving getting information from reporters as
+  needed, until you can do one of the following:
  * Mark it as WontFix (working as intended, obsolete issue) or a duplicate.
  * Mark it as a feature request.
  * Remove the Cr-Internals-Network label, replacing it with at least one more
@@ -43,9 +57,13 @@ Primary Responsibilities:
    Available.  Future triagers should ignore bugs with this status, unless
    investigating stale bugs.
 * Monitor UMA histograms and gasper alerts.
-    TODO (mmenke):  Add a suggested workflow.
+  * For each Gasper alert that fires, the triager should determine if
+    the alert is real (not due to noise), and file a bug with the
+    appropriate label if so.  Note that if no label more specific than
+    Cr-Internals-Network is appropriate, the responsibility remains
+    with the triager to continue investigating the bug, as above.
-Best Effort (As you time):
+Best Effort (As you have time):
 * Investigate unowned and owned but forgotten net/ crashers that are still
    occurring (As indicated by go/chromecrash), prioritizing frequent and long
    standing crashers.
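
The Gasper alert rules this commit adds (the "is the alert real" check
here, and the "Don't file if" list in the suggested workflow) can be
sketched as a small decision function. This is an illustration only,
not part of the commit: the GasperAlert record, its fields, and the
function name are hypothetical, and the 200-point cutoff stands in for
the "< ~200 data points" guidance, which is approximate in the text.

```python
from dataclasses import dataclass

# Taken from the "Known Ignorable" list in the suggested workflow.
KNOWN_IGNORABLE = {("SimpleCache", "Windows"), ("DiskCache", "Android")}

# The workflow says "< ~200 data points"; the exact cutoff is an assumption.
MIN_DATA_POINTS = 200


@dataclass(frozen=True)
class GasperAlert:
    # Hypothetical record; Gasper itself is inspected by hand in the workflow.
    component: str
    platform: str
    data_points: int
    coincides_with_volume_change: bool
    # The triager's judgement after eyeballing the graph for a real shift.
    graph_is_noisy: bool


def reason_not_to_file(alert):
    """Return why the alert should not become a bug, or None to file one."""
    if alert.coincides_with_volume_change:
        return "coincident with a major volume change"
    if alert.data_points < MIN_DATA_POINTS:
        return "very low volume"
    if alert.graph_is_noisy:
        return "noisy graph"
    if (alert.component, alert.platform) in KNOWN_IGNORABLE:
        return "known ignorable"
    return None
```

Either outcome still calls for a reply to chrome-network-debugging@
summarizing the action taken and why; the returned reason string could
serve as the starting point for that summary.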