Commit 9ae51ac1 authored by rdsmith, committed by Commit bot

Clarifications to network bug triage process.

Specifically:
* Highlight top level responsibilities.
* Clarify detailed responsibilities around data gathering, filing crasher bugs, and monitoring Gasper/UMA.

BUG=None
R=mmenke@chromium.org

Review URL: https://codereview.chromium.org/970953002

Cr-Commit-Position: refs/heads/master@{#319903}
parent a125b85c
+Look for new crashers:
+* Go to go/chromecrash.
+* For each platform, look through the releases to decide which ones to
+investigate. As per bug-triage.txt, this should be the
+most recent canary, the previous canary (if the most recent is less
+than a day old), and any of dev/beta/stable that were released in the
+last couple of days.
+* For each release, in the "Process Type" frame, click on "browser".
+* At the bottom of the "Magic Signature" frame, click "limit 1000".
+Reported crashers are sorted in decreasing order of the number of reports for
+that crash signature.
+* Search the page for "net::".
+* For each found signature:
+* If there is a bug already filed, make sure it correctly
+describes the current bug (e.g. it is not closed and does not describe a
+long-past issue), and make sure that if it is a net:: bug,
+it is labeled as such.
+* Ignore signatures that only occur once, as memory corruption can
+easily cause one-off failures when the sample size is large
+enough.
+* Ignore signatures that only come from a single client ID, as
+individual machine malware and breakage can also easily cause
+one-off failures.
+* Click on the number of reports field to see details of
+the crash. Ignore it if it doesn't appear to be a network bug.
+* Otherwise, file a new bug directly from chromecrash. Note that
+this may result in filing bugs for low- and very-low- frequency
+crashes. That's ok; the bug tracker is a better tool to figure
+out whether or not we put resources into those crashes than a snap
+judgement when filing bugs.
+* For each bug you file, include the following information:
+* The backtrace. Note that the backtrace should not be added to the
+bug if Restrict-View-Google isn't set on the bug as it may contain
+PII. Filing the bug from the crash reporter should do this
+automatically, but check.
+* The channel in which the bug is seen (canary/dev/beta/stable),
+its frequency in that channel, and its rank among crashers in the channel.
+* The frequency of this signature in recent releases. This
+information is available by:
+* Clicking on the signature in the "Magic Signature" list
+* Clicking "Edit" on the dremel query at the top of the page
+* Removing the "product.version='X.Y.Z.W' AND" string and clicking
+"Update".
+* Clicking "Limit 1000" in the Product Version list in the
+resulting page (without this, the listing will be restricted to
+the releases in which the signature is most common, which will
+often not include the canary/dev release being investigated).
+* Choose some subset of that list, or all of it, to include in the
+bug. Make sure to indicate if there is a defined point in the
+past before which the signature is not present.
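The per-signature rules above (search for "net::", skip one-off reports, skip signatures seen on a single client ID) can be sketched in Python. This is only an illustration, not part of the change: `CrashSignature`, its fields, and `signatures_to_review` are hypothetical names, and the sketch assumes the "Magic Signature" list has been copied by hand into rows, since no programmatic chromecrash interface is described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashSignature:
    """One row of a hand-copied "Magic Signature" list (hypothetical shape)."""
    signature: str         # crash signature text
    report_count: int      # number of reports with this signature
    client_ids: frozenset  # distinct client IDs that reported it


def signatures_to_review(rows):
    """Return net:: signatures worth opening, most frequent first."""
    keep = []
    for row in rows:
        if "net::" not in row.signature:
            continue  # not found by the page search for "net::"
        if row.report_count <= 1:
            continue  # one-off reports can be plain memory corruption
        if len(row.client_ids) <= 1:
            continue  # a single machine may be broken or infected
        keep.append(row)
    # chromecrash already sorts by report count; keep that ordering.
    return sorted(keep, key=lambda r: r.report_count, reverse=True)
```

Each signature that survives still needs a human to check for an existing bug and to read the crash details before anything is filed.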
 Identifying unlabeled network bugs on the tracker:
 * Look at new unconfirmed bugs since noon PST on the last triager's rotation:
 https://code.google.com/p/chromium/issues/list?can=2&q=status%3Aunconfirmed&sort=-id&num=1000
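The tracker link above is an ordinary URL-encoded query, so it can be rebuilt from its readable parts. A minimal sketch using only the Python standard library follows; `issue_list_url` is a hypothetical helper name, and the `can`, `q`, `sort` and `num` parameters are copied from the link itself rather than from any tracker documentation.

```python
from urllib.parse import urlencode

ISSUE_LIST = "https://code.google.com/p/chromium/issues/list"


def issue_list_url(query, sort, num=None):
    """Build an issue-list URL; parameter names are taken from the link above."""
    params = {"can": 2, "q": query, "sort": sort}
    if num is not None:
        params["num"] = num
    # urlencode percent-escapes ':' as %3A and turns spaces into '+'.
    return ISSUE_LIST + "?" + urlencode(params)


unconfirmed_url = issue_list_url("status:unconfirmed", "-id", num=1000)
```

Writing the query as `status:unconfirmed` makes it easier to adapt than editing the escaped form by hand.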
@@ -28,6 +79,9 @@ Investigating Cr-Internals-Network bugs:
 * Look through unconfirmed and untriaged Cr-Internals-Network bugs, prioritizing
 those updated within the last week:
 https://code.google.com/p/chromium/issues/list?can=2&q=Cr%3DInternals-Network+-status%3AAssigned+-status%3AStarted+-status%3AAvailable+&sort=-modified
+* If more information is needed from the reporter, ask for it and
+add the 'Needs-Feedback' label. If the reporter has answered an
+earlier request for information, remove that label.
 * While investigating a new issue, change the status to Untriaged.
 * If a bug is a potential security issue (Allows for code execution from remote
 site, allows crossing security boundaries, unchecked array bounds, etc) mark
@@ -75,25 +129,22 @@ Investigating Cr-Internals-Network bugs:
 sublabel applies, or only the Cr-Internals-Network-HTTP sublabel applies,
 and there's no clear owner), try to figure out the exact cause.
-Look for new crashers:
-* Go to go/chromecrash.
-* For each platform, go to the latest canary.
-* In the "Process Type" frame, click on "browser".
-* At the bottom of the "Magic Signature" frame, click "limit 1000".
-Reported crashers are sorted in decreasing order of the number of reports for
-that crash signature.
-* Search the page for "net::". Ignore crashes that only occur once, as
-memory corruption can easily cause one-off failures when the sample size is
-large enough.
-* Click on the number of reports field to see details of crash. Look at the
-stack trace to confirm it's a network bug.
-* If it is, and there's no associated bug filed, file a new bug directly from
-chromecrash, looking at earlier canaries to determine if it's a recent
-regression. Use the most specific label possible.
-* The most recent Canary may not yet have a full day of crashes, so it may be
-worth looking at more than one version.
-* If there's been a dev, beta, or stable release in the last couple days, should
-also look at those.
+Monitor UMA histograms and gasper alerts. For each Gasper alert that
+fires, determine if it's a real alert and file a bug if so.
+* Don't file if the alert is coincident with a major volume change.
+The volume at a particular date can be determined by hovering the
+mouse over the appropriate location on the alert line.
+* Don't file if the alert is on a graph with very low volume (< ~200
+data points); it's probably noise, and we probably don't care even
+if it isn't.
+* Don't file if the graph is really noisy (but eyeball it to decide if
+there is an underlying important shift under the noise).
+* Don't file if the alert is in the "Known Ignorable" list:
+* SimpleCache on Windows
+* DiskCache on Android.
+For each Gasper alert, respond to chrome-network-debugging@ with a
+summary of the action you've taken and why, including issue link if an
+issue was filed.
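The "don't file" rules for Gasper alerts added in this hunk amount to a short checklist, sketched below in Python. Everything here is an assumption made for illustration: `Alert`, its fields, and `reason_not_to_file` are hypothetical names, the 200-point cutoff stands in for the approximate "< ~200" figure, and the judgement calls (volume change, noisiness) are modelled as booleans the triager fills in after looking at the graph.

```python
from dataclasses import dataclass

# The "Known Ignorable" list from the text, as (area, platform) pairs.
KNOWN_IGNORABLE = {("SimpleCache", "Windows"), ("DiskCache", "Android")}
LOW_VOLUME_THRESHOLD = 200  # approximate; the text says "< ~200 data points"


@dataclass
class Alert:
    area: str
    platform: str
    data_points: int
    coincides_with_volume_change: bool
    graph_is_noisy: bool
    shift_visible_under_noise: bool = False


def reason_not_to_file(alert):
    """Return why no bug should be filed, or None if one should be."""
    if alert.coincides_with_volume_change:
        return "coincident with a major volume change"
    if alert.data_points < LOW_VOLUME_THRESHOLD:
        return "very low volume"
    if alert.graph_is_noisy and not alert.shift_visible_under_noise:
        return "noisy graph with no clear underlying shift"
    if (alert.area, alert.platform) in KNOWN_IGNORABLE:
        return "on the Known Ignorable list"
    return None
```

The returned reason doubles as the "why" for the summary sent to chrome-network-debugging@.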
 Investigating crashers:
 * Only investigate crashers that are still occurring, as identified by above
......
@@ -4,15 +4,30 @@ label seems suitable.
 Responsibilities
-To be done on each rotation. These responsibilities should be tracked, and
-anything left undone at the end of a rotation should be handed off to the next
-triager. The downside to passing along bug investigations like this is each new
+Required:
+* Identify new crashers
+* Identify new network issues.
+* Request data about recent Cr-Internals-Network issue.
+* Investigate each recent Cr-Internals-Network issue.
+* Monitor UMA histograms and gasper alerts.
+Best effort:
+* Investigate unowned and owned-but-forgotten net/ crashers
+* Investigate old bugs
+* Close obsolete bugs.
+All of the above is to be done on each rotation. These
+responsibilities should be tracked, and anything left undone at the
+end of a rotation should be handed off to the next triager. The
+downside to passing along bug investigations like this is each new
 triager has to get back up to speed on bugs the previous triager was
-investigating. The upside is that triagers don't get stuck investigating issues
-after their time after their rotation, and it results in a uniform, predictable
-two day commitment for all triagers.
+investigating. The upside is that triagers don't get stuck
+investigating issues after their time after their rotation, and it
+results in a uniform, predictable two day commitment for all triagers.
+More detail:
-Primary Responsibilities:
+Required activities:
 * Identify new crashers that are potentially network related. You should check
 the most recent canary, the previous canary (if the most recent is less than a
 day old), and any of dev/beta/stable that were released in the last couple
@@ -24,10 +39,9 @@ Primary Responsibilities:
 responsible for looking at bugs reported from noon PST / 3:00 pm EST of the
 last day of the previous triager's rotation until the same time on the last
 day of their rotation.
-* Request data about recent unassigned Cr-Internals-Network bugs from reporters.
-"Recent" means issues updated in the past week or so.
 * Investigate each recent (New comment within the past week or so)
-Cr-Internals-Network issue until you can do one of the following:
+Cr-Internals-Network issue, driving getting information from reporters as
+needed, until you can do one of the following:
 * Mark it as WontFix (working as intended, obsolete issue) or a duplicate.
 * Mark it as a feature request.
 * Remove the Cr-Internals-Network label, replacing it with at least one more
@@ -43,9 +57,13 @@ Primary Responsibilities:
 Available. Future triagers should ignore bugs with this status, unless
 investigating stale bugs.
 * Monitor UMA histograms and gasper alerts.
-TODO (mmenke): Add a suggested workflow.
+* For each Gasper alert that fires, the triager should determine if
+the alert is real (not due to noise), and file a bug with the
+appropriate label if so. Note that if no label more specific than
+Cr-Internals-Network is appropriate, the responsibility remains
+with the triager to continue investigating the bug, as above.
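The first required activity in this list names which releases to check for crashers: the most recent canary, the previous canary if the most recent is less than a day old, and any dev/beta/stable release from the last couple of days. A rough Python sketch of that rule follows. `Release`, its fields, and `releases_to_check` are hypothetical names, release dates are assumed to be known to the triager, and "a couple of days" is read as two days, which is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Release:
    channel: str       # "canary", "dev", "beta" or "stable"
    version: str
    released: datetime


def releases_to_check(releases, now, recent=timedelta(days=2)):
    """Pick the releases whose crash lists should be reviewed."""
    canaries = sorted((r for r in releases if r.channel == "canary"),
                      key=lambda r: r.released, reverse=True)
    picked = canaries[:1]  # always the most recent canary, if any
    if canaries and now - canaries[0].released < timedelta(days=1):
        picked += canaries[1:2]  # newest canary lacks a full day of data
    picked += [r for r in releases
               if r.channel in ("dev", "beta", "stable")
               and now - r.released <= recent]
    return picked
```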
-Best Effort (As you time):
+Best Effort (As you have time):
 * Investigate unowned and owned but forgotten net/ crashers that are still
 occurring (As indicated by go/chromecrash), prioritizing frequent and long
 standing crashers.
......