Commit 9ded41f6 authored by Joey Scarr, committed by Commit Bot

Replace most of sheriff.md with a link to internal docs.

I've left the section about sheriffing philosophy, since that's also useful for non-sheriffs.

Bug: 1095911
Change-Id: I9e97cc1951e5cc96fe9aa8ccc42e14c875f0f0b1
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2274481
Commit-Queue: Joey Scarr <jsca@chromium.org>
Reviewed-by: Elly Fong-Jones <ellyjones@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@google.com>
Cr-Commit-Position: refs/heads/master@{#784319}
parent 9eb8fd64
@@ -2,15 +2,6 @@
Author: ellyjones@
Audience: Chromium build sheriff rotation members
This document describes how to be a Chromium sheriff: what your responsibilities
are and how to go about them. It outlines a specific, opinionated view of how to
be an effective sheriff which is not universally practiced but may be useful for
you.
[TOC]
## Sheriffing Philosophy
Sheriffs have one overarching role: to ensure that the Chromium build
@@ -29,419 +20,12 @@ necessary authority to fulfill them. In particular, you have the authority to:
* Revert changes that you know or suspect are causing breakages
* Disable or otherwise mark misbehaving tests
* Use [TBRs] freely as part of your sheriffing duties
* Pull in any other engineer or team you need to help you do these duties
Do not be shy about asking for debugging help from subject-matter experts, and
do not hesitate to ask for guidance on Slack (see below) when you need it. In
particular, there are many experienced sheriffs, ops folks, and other helpful
pseudo-humans in [slack #sheriffing].
## Tools Of The Trade
### A Developer Checkout
Effective sheriffing requires an [up-to-date checkout][get-the-code], primarily
so that you can create CLs to mark tests, but also so you can attempt local
reproduction or debugging of failures if necessary. If you are a googler, you
will want to have [goma] working as well, since fast builds provide faster
turnaround times.
### Slack
Sheriffing is coordinated in the [slack #sheriffing] channel. If you don't yet
have [Slack] set up, it is worth setting it up for your sheriffing shift. If you don't want to
use Slack, you will at a minimum need a way to coordinate with your fellow
sheriffs and probably with the ops team.
These are important Slack channels for sheriffs:
* [#sheriffing][slack #sheriffing]: for sheriffing
tasks and work tracking
* [#ops][slack #ops]: for [infra
issues][contacting-troopers] with gerrit, git, the bots, monorail,
sheriff-o-matic, findit, ...
* [#halp][slack #halp]: for general Chromium
  build & development issues - "no question too newbie!"
A good way to use Slack for sheriffing is as follows: for each new task you do
(investigating a build failure, marking a specific test flaky/slow/etc, working
with a trooper, ...), post a new message in the #sheriffing channel, then
immediately start a thread based on that message. Post all your status updates
on that specific task in that thread, being verbose and using lots of
detail/links; be especially diligent about referencing bug numbers, CL numbers,
usernames, and so on, because these will be searchable later. For example, let's
suppose you see the mac-rel bot has gone red in browser tests:
    you: looking at mac-rel browser_tests failure
    [in thread]
    you: a handful of related-looking tests failed
    you: ah, looks like this was caused by CL 12345, reverting
    you: revert is CL 23456, TBR @othersheriff
    you: revert is in now, snoozing failure
Only use "also send to channel" when either the thread is old enough to have
scrolled off people's screens, or the message you are posting is important
enough to appear in the channel at the top level as well - this helps keep the
main channel clear of smaller status updates and makes it more like an index of
the threads.
### Monorail
[Monorail] is our bug tracker. Hopefully you are already familiar with it.
When sheriffing, you use Monorail to:
* Look for existing bugs
* File bugs when you revert CLs, explaining what happened and including any
stack traces or error messages
* File bugs when you disable tests, again with as much information as practical
### The Waterfall
The [CI console page], commonly referred to as "the waterfall" because of how it
looks, displays the current state of many of the Chromium bots. In the old days
before Sheriff-o-matic (see below) this was the main tool sheriffs used, and it
is still extremely useful. Especially note the "links" panel, which can take you
not just to other waterfalls but to other sheriffing tools.
### Sheriff-o-Matic
[Sheriff-o-matic] attempts to
aggregate build or test failures together into groups. How to use
sheriff-o-matic effectively is described below, but the key parts of the UI are:
* The bug queue: this is a dashboard of bugs that might be relevant to the
on-duty sheriffs, such as known infra issues, failures that are under active
investigation by engineering teams, and so on.
* The "consistent failures" list: these are failures that have happened more
than once. These are **sometimes** actual consistent failures, and sometimes
  the same suite or bot failing twice in two different ways, so this list is not
  always to be trusted, but it can be useful.
* The "new failures" list: these are singleton failures that have not yet become
"consistent". These tend to disappear if a bot or suite goes green on a
subsequent pass, so this section can be fairly noisy.
Sheriff-o-matic can automatically generate CLs to do some sheriffing tasks, like
disabling specific tests. You can access this feature via the "Layout Test
Expectations" link in the sidebar; it is called "TA/DA".
### Builder (or "Bot") Pages
Each builder page displays the history of that builder's most recent builds -
here is an example: [Win10 Tests
x64](https://ci.chromium.org/p/chromium/builders/ci/Win10%20Tests%20x64). You
can get more history on this page using the links at the bottom-left of the
build list, or by appending `?limit=n` (for example, `?limit=200`) to the builder page's URL.
If you click through to an individual build ([Win10 Tests x64 Build
36397](https://ci.chromium.org/p/chromium/builders/ci/Win10%20Tests%20x64/36397)
for example), the important features of the build page are:
1. The build revision, which is listed at the top - this is key for knowing if a
given build had a specific CL or not.
2. The blamelist (one of the tabs at the top) - this lists all CLs that were in
this build, but not in the previous build for this builder. Note that this
only shows **Chromium** changes; any changes that were incorporated via a
DEPS roll of another repo won't be listed here, and you will instead have to
look at the changelog for that DEP manually. There are usually links to views
of those changes in the commit messages of DEPS roll CLs.
3. The list of steps, each of which has a link to its output and metadata.
   Especially important is the "lookup builder gn args" step, which tells you
   how to replicate the builder's build config (although probably you will want
   to use your own value of `goma_dir`); a rough sketch of this workflow follows this list.
4. Within individual steps, the "stdout" link and sometimes also the "shard"
links, which let you read the builder's actual console logs during that step.
Note that the sharded tasks are actually executing on different machines and
are aggregated back into the builder's results.
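
A minimal sketch of using the "lookup builder gn args" step's output locally. The
output directory name and the `browser_tests` target are illustrative; paste the
GN args the step actually prints, swapping in your own `goma_dir`:

```sh
# Open args.gn for a fresh output directory in your editor; paste in the GN
# args printed by the "lookup builder gn args" step (adjusting goma_dir),
# then save and exit - GN regenerates the build files for that directory.
gn args out/Win10-Tests-x64

# Build the suite that failed using the builder's configuration.
autoninja -C out/Win10-Tests-x64 browser_tests
```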
#### CI<->Try Mapping
To translate CI builder names to try builder names, refer to the
[CI<->Try mapping](https://source.chromium.org/chromium/chromium/tools/build/+/master:scripts/slave/recipe_modules/chromium_tests/trybots.py).
### The Flake Portal (The "New Flakiness Dashboard")
The [new flakiness dashboard] is much faster than the old one but has a
different set of features. **Note that to effectively use this tool you must log
in** via the link in the top right.
To look at the history of a suite on a specific builder here:
1. In the "Flakes" view (the default), use Search By Tags
2. Add a filter for `builder` == (eg) `Win10 Tests x64`
3. Add a filter for `test_type` == (eg) `browser_tests`
To look at how flaky a specific test is across all builders:
1. In the "Flakes" view, use Search By Test
2. Fill in the test name into the default filter
This gives you a UI listing the failures for the named test broken down by
builder. At this point you should check to see where the flakes start
revision-wise. If that helps you identify a culprit CL, revert it and move on;
otherwise, [disable the test][test-disable].
For more advice on dealing with flaky tests, look at the "Test Failed" section
below under "Diagnosing Build Failures".
### The Old Flakiness Dashboard
The [old flakiness dashboard] is often **extremely slow** and has a tendency not
to provide full results, but it sometimes can diagnose kinds of flakiness that
are not as easy to see in the new dashboard. It is not generally used these
days, but if you need it, you use it like this:
1. Pick a builder - ideally one matching the test failure you're investigating,
or as similar to it as possible. After this, the dashboard will likely hang
for some seconds, or possibly crash, but reloading the page will load the
dashboard with the right builder set.
2. Pick the relevant test type. Do not pick a 'with patch' suite - these are
from people doing CQ runs with their changes.
3. Find-in-page to locate the test you have in mind.
If you instead want to see how flaky a given test is across *all* builders, like
if you're trying to diagnose whether a specific test flakes on macOS generally:
1. Type a glob pattern or substring of the test name in the "Show tests on all
platforms" field - this will clear out the "builder" dropdown.
2. Scroll down **past** the default results - the query results you asked for
are below them.
In either view, you are looking for grey or black cells, which indicate
flakiness. Clicking one of these cells will let you see the actual log of these
failures; you should eyeball a couple of these to make sure that it's the same
kind of flake. It's also a good idea to look through history (scroll right) to
see if the flakes started at a specific point, in which case you can look for
culprit CLs around there. In general, for a flaky test, you should either:
* Revert a culprit CL, if you can find it, or
* [Disable the test][test-disable] as narrowly as possible to fix the flake
  (eg, if the test is broken on Windows, only disable it on Windows), as
  sketched below
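
For Chromium gtest/browser-test suites, the usual way to disable narrowly is the
`MAYBE_` pattern, which maps the test name to `DISABLED_` only on the affected
platform. A minimal sketch with made-up fixture and test names (when doing this
for real, also reference the tracking bug in a comment next to the macro):

```cpp
// Hypothetical example: the test flakes only on Windows, so disable it there
// and leave it running everywhere else.
#if defined(OS_WIN)
#define MAYBE_FlakyFeature DISABLED_FlakyFeature
#else
#define MAYBE_FlakyFeature FlakyFeature
#endif
IN_PROC_BROWSER_TEST_F(ExampleBrowserTest, MAYBE_FlakyFeature) {
  // ... test body unchanged ...
}
```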
For more advice on dealing with flaky tests, look at the "Test Failed" section
below under "Diagnosing Build Failures".
### <a name="tree"></a>Tree Status Page
The [tree status page] tracks and lets you
set the state of the Chromium tree. You'll need this page to reopen the tree,
and sometimes to find out who previously closed it if it was manually closed.
The tree states are:
* "Open": green banner. The CQ runs and commits land as normal
* "Closed": red banner. The CQ will not run; commits can still land via direct
submit (which is how you'll land fix commits) or CQ bypass
* "Throttled": yellow banner. semi-synonymous with "closed". In theory this
means "ask the sheriff before landing a change", but in practice this state
is not used now and few people know what it means.
Note that various pieces of automation parse the tree state directly from the
textual message, and some pieces update it automatically as well - e.g., if the
tree is automatically closed by a builder, it will automatically reopen if that
builder goes green. To satisfy that automation, the message should always be
formatted as "Tree is $state ($details)", like:
* "Tree is open ('s all good!)"
* "Tree is closed (sheriff investigating win bot infra failure)"
Another key phrase is "channel is sheriff", which roughly means "nobody is
on-duty as sheriff right now"; you can use this if (eg) you are the only sheriff
on duty and you need to be away for more than 15min or so:
* "Tree is open (channel is sheriff, back at 1330 UTC)"
## The Sheriffing Loop
You are on duty during your normal work hours; don't worry about adjusting your
work hours for maximum coverage or anything like that.
Note that you are expected **not** to do any of your normal project work while
you are on sheriff duty - spend 100% of your work time on the sheriffing work
listed here.
While you are on duty and at your desk, you should be in the "sheriffing loop",
which goes like this:
### 1. Is The Tree Closed?
If yes:
* Start a new thread immediately in Slack - don't wait until you have started
investigating or "have something to say" to do this!
* Start figuring out what went wrong and working on fixing it
* Once it's fixed, or you're pretty confident it's fixed, reopen the tree via the [Tree Status Page](#tree)
Do not wait for a slow builder to cycle green before reopening if you are
reasonably confident you have landed a fix - "90% confidence" is an okay
threshold for a reopen.
### 2. Are there consistent failures in sheriff-o-matic?
If yes:
* Start a new thread in Slack for one of the failures
* Investigate the failure and figure out why it happened - see the "diagnosing
build failures" section below
* Fix what went wrong, remembering to file a bug to document both the failure
and your fix
* Snooze the failure, with a link to the bug if your investigation resulted in a
bug
### 3. Are there other red bots on the waterfall?
If yes:
* Start a new thread in Slack for one of the bots
* Investigate the bot - has it been red for a long time, or newly so? Is it an
important bot to someone?
* Fix, notify, or bother people as appropriate
### 4. Does anything in the sheriff-o-matic bug queue need action?
If yes:
* Find the relevant Slack thread or start a new one
* Investigate the bug and see what needs to happen - ping the owner, see if the
  priority & assignment are right, or try fixing it yourself. Check whether the
  bug still needs to be in the queue; if not, remove the `Sheriff-Chromium`
  label. Do this when:
  * the test has been disabled or marked flaky
  * the problem is no longer impacting any of the builders
### 5. All's well?
If none of the above conditions obtain, it's time to do some longer-term project
health work!
* Look for disabled/flaky tests and try to figure out how to re-enable them
* Look through test expectation files and see if expectations are obsolete or
can be removed
* Re-triage old or forgotten bugs
* Manually hunt through the flakiness dashboards for flaky tests and mark them
as such or file bugs for them
* Honestly, maybe just take a break, then restart the loop :)
**Don't go back to doing your regular project work** - sheriffing is a full-time
job while your shift is happening.
## Diagnosing Build Failures
This section discusses how to figure out what's wrong with a failed build.
Generally a build fails for one of four reasons, each of which has its own
section below.
### Compile Failed
You can spot this kind of failure because the "Compile" step is marked red on
the bot's page. For this kind of failure the cause is virtually always a CL in
the bot's blamelist; find that CL and revert it. If you can't find the CL, try
reproducing the build config for the bot locally and seeing if you can reproduce
the compile failure. If you don't have that build setup (eg, the broken bot is
an iOS bot and you are a Linux developer and thus unable to build for iOS), get
in touch with members of that team for help reproing / fixing the failure.
Remember that you are empowered to pull in other engineers to help you fix the
tree!
### Test Failed
You can spot this kind of failure by a test step being marked red on the bot's
page. Note that test steps with "(experimental)" at the end don't count - these
can fail without turning the entire bot red, and can usually be safely ignored.
Also, if a suite fails, it is usually best to focus on the first failed test
within the suite, since a failing test will sometimes disturb the state of the
test runner for the rest of the tests.
Take a look at the first red step. The buildbot page will probably say something
like "Deterministic failures: Foo.Bar", and then "Flaky failures: Baz.Quxx
(ignored)". You too can ignore the flaky failures for the moment - the
deterministic failures are the ones that actually made the step go red. Note
that here "deterministic" means "failed twice in a row" and "flaky" means
"failed once", so a deterministic failure can still be caused by a flake.
From here there are a couple of places to go:
* If a set of related-looking tests have all failed, probably something in the
blamelist caused it; the other option is that some framework or service that
all of them depend upon flaked at the same time. One example of this kind of
framework flake is the infamous [bug 869227](https://crbug.com/869227).
* If a single test failed, check the blamelist for any obvious culprits;
otherwise, check the flakiness dashboard for that test and see if the test
should be marked as flaky.
* If a LOT of tests failed, look for common symptoms in the output of a handful
of them at random; this will probably point you at either a CL in the
blamelist or some framework/service flake culprit. Also check the sheriff bug
queue to see if there are possible causes listed there.
When debugging a `layout_tests` failure: use the `layout_test_results` step link
from the bot page (usually found under "archive results"); this gives you a
useful UI for viewing image diffs and similar.
One thing to specifically look out for: if a test is often slow, it will
sometimes flakily time out on bots that are extra-slow, especially bots that run
sanitizers (MSAN/TSAN/ASAN) or debug bots. If you see flaky timeouts for a test,
but only on these bots, that test might just be slow. For Blink tests there is a
file called [SlowTests] that lists these tests and gives them more time to run;
for Chromium tests you can just mark these as `DISABLED_` in those configurations
if you want.
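
For the Blink case, a [SlowTests] entry is a single expectation line. A purely
illustrative sketch - the bug number and test path below are made up:

```
# Illustrative entry only - gives this web test extra time to run.
crbug.com/123456 external/wpt/example/feature-heavy-page.html [ Slow ]
```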
When marking a Blink test as slow/flaky/etc, it's often not clear who to send
the CL to, especially if the test is a WPT test imported from another repository
or hasn't been edited in a while. In those situations, or when you just can't
find the right owner for a test, feel free to TBR your CL to one of the other
sheriffs on your rotation and kick it into the Blink triage queue (ie: mark the
bug as Untriaged with component Blink).
#### Skia Gold Outage
This is a special case of a failed test. The following test suites rely on the
Skia Gold image diffing service:
* `*pixel_skia_gold_test`
* `*maps_pixel_test`
* `chrome_public_test_apk`
* `chrome_public_test_vr_apk`
* `pixel_browser_tests`
In the unlikely event of a Gold outage, **all** of those suites will begin
failing with errors related to Gold/`goldctl`. If this occurs, the best way to
prevent failures from breaking bots is to uncomment the
`--bypass-skia-gold-functionality` argument in the `skia_gold_test` mixin in
[`//testing/buildbot/mixins.pyl`][mixins].
This should only be used when you are sure that the failures are caused by an
outage, as it will cause all pixel tests that use Skia Gold to no longer
perform any actual pixel testing.
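
The change itself amounts to flipping one commented-out flag. A rough sketch of
what the `skia_gold_test` entry in `mixins.pyl` might look like - only the mixin
name and the flag come from this document; the surrounding keys are an
assumption about the file's structure:

```
# Rough sketch of the skia_gold_test mixin in //testing/buildbot/mixins.pyl;
# the exact structure may differ, only the flag name is taken from this doc.
'skia_gold_test': {
  '$mixin_append': {
    'args': [
      # Uncomment only during a confirmed Gold outage; while this is active,
      # the affected suites skip real pixel comparisons entirely.
      # '--bypass-skia-gold-functionality',
    ],
  },
},
```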
### Infra Breakage
If a bot turns purple rather than red, that indicates an infra failure of some
type. These can have unpredictable effects, and in particular, if a bot goes red
after a purple run, that can often be caused by corrupt disk state on the bot.
Ask a trooper for help with this via [go/bugatrooper][bug-a-trooper]. Do not
immediately ping in [slack #ops] unless you have just filed (or are about
to file) a Pri-0 infra bug, or you have a Pri-1 bug that has been ignored for >2
hours.
### Other Causes
There are many other things that can go wrong, which are too individually rare
and numerous to be listed here. Ask for help with diagnosis in
[slack #sheriffing] and hopefully someone else will be able to help you figure
it out.
## How to be a Sheriff
To be a sheriff, you must be both a Chromium committer and a Google employee.
For more detailed sheriffing instructions, please see the internal documentation
at [go/chrome-sheriffing-how-to](https://goto.google.com/chrome-sheriffing-how-to).
[CI console page]: https://ci.chromium.org/p/chromium/g/chromium/console
[SlowTests]: https://cs.chromium.org/chromium/src/third_party/blink/web_tests/SlowTests
[TBRs]: https://chromium.googlesource.com/chromium/src/+/master/docs/code_reviews.md#TBR-To-Be-Reviewed
[bug-a-trooper]: https://goto.google.com/bugatrooper
[contacting-troopers]: https://chromium.googlesource.com/infra/infra/+/master/doc/users/contacting_troopers.md
[get-the-code]: https://www.chromium.org/developers/how-tos/get-the-code
[goma]: http://shortn/_Iox00npQJW
[test-disable]: https://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-How-do-I-disable-a-flaky-test-
[new flakiness dashboard]: https://analysis.chromium.org/p/chromium/flake-portal
[old flakiness dashboard]: https://test-results.appspot.com/dashboards/flakiness_dashboard.html
[sheriff-o-matic]: https://sheriff-o-matic.appspot.com/chromium
[slack #halp]: https://chromium.slack.com/messages/CGGPN5GDT/
[slack #ops]: https://chromium.slack.com/messages/CGM8DQ3ST/
[slack #sheriffing]: https://chromium.slack.com/messages/CGJ5WKRUH/
[slack]: https://chromium.slack.com
[tree status page]: https://chromium-status.appspot.com/
[mixins]: https://source.chromium.org/chromium/chromium/src/+/master:testing/buildbot/mixins.pyl