Commit 9618adde authored by Kenneth Russell's avatar Kenneth Russell Committed by Commit Bot

Update documentation for deploying new GPU FYI test bots.

Worked with zmo@ to develop this documentation along with the example
CLs.

Also added top-level links to the GPU docs and cleaned up some other
now-incorrect documentation. More cleanups are still needed.

Bug: 838970
Tbr: zmo@chromium.org
Change-Id: I40e07adf864974a645490627f5a055ce686237a3
Reviewed-on: https://chromium-review.googlesource.com/1041273
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Reviewed-by: default avatarKenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#555661}
parent 60347d6c
...@@ -163,6 +163,16 @@ used when committed. ...@@ -163,6 +163,16 @@ used when committed.
built, with associated desciptions. built, with associated desciptions.
* [IPC Fuzzer](ipc_fuzzer.md) - Fuzz testing of Chromium IPC interfaces. * [IPC Fuzzer](ipc_fuzzer.md) - Fuzz testing of Chromium IPC interfaces.
### GPU-related docs
* [GPU Pixel Wrangling](gpu/pixel_wrangling.md) - Instructions for GPU
pixel wrangling (the GPU sheriffing rotation).
* [Debugging GPU related code](gpu/debugging_gpu_related_code.md) - Hints for
debugging GPU- and graphics-related code.
* [GPU Testing](gpu/gpu_testing.md) - Description of Chromium's GPU testing
infrastructure.
* [GPU Bot Details](gpu/gpu_testing_bot_details.md) - In-depth description of
how the bots are maintained.
### Misc Linux-Specific Docs ### Misc Linux-Specific Docs
* [Linux Proxy Config](linux_proxy_config.md) - Network proxy sources on Linux * [Linux Proxy Config](linux_proxy_config.md) - Network proxy sources on Linux
* [Debugging SSL on Linux](linux_debugging_ssl.md) - Tips on debugging SSL * [Debugging SSL on Linux](linux_debugging_ssl.md) - Tips on debugging SSL
......
# GPU Bot Details # GPU Bot Details
This PAGE describes in detail how the GPU bots are set up, which files affect This page describes in detail how the GPU bots are set up, which files affect
their configuration, and how to both modify their behavior and add new bots. their configuration, and how to both modify their behavior and add new bots.
[TOC] [TOC]
...@@ -20,15 +20,16 @@ browser. ...@@ -20,15 +20,16 @@ browser.
The GPU bots are utilized on the [chromium.gpu] and [chromium.gpu.fyi] The GPU bots are utilized on the [chromium.gpu] and [chromium.gpu.fyi]
waterfalls, and various tryservers, as described in [Using the GPU Bots]. waterfalls, and various tryservers, as described in [Using the GPU Bots].
[chromium.gpu]: https://build.chromium.org/p/chromium.gpu/console [chromium.gpu]: https://ci.chromium.org/p/chromium/g/chromium.gpu/console
[chromium.gpu.fyi]: https://build.chromium.org/p/chromium.gpu.fyi/console [chromium.gpu.fyi]: https://ci.chromium.org/p/chromium/g/chromium.gpu.fyi/console
[Using the GPU Bots]: gpu_testing.md#Using-the-GPU-Bots [Using the GPU Bots]: gpu_testing.md#Using-the-GPU-Bots
The vast majority of the hardware for the bots lives in the Chrome-GPU Swarming All of the physical hardware for the bots lives in the Swarming pool, and most
pool. The waterfall bots are simply virtual machines which spawn Swarming tasks of it in the Chrome-GPU Swarming pool. The waterfall bots are simply virtual
with the appropriate tags to get them to run on the desired GPU and operating machines which spawn Swarming tasks with the appropriate tags to get them to run
system type. So, for example, the [Win10 Release (NVIDIA)] bot is actually a on the desired GPU and operating system type. So, for example, the [Win10
virtual machine which spawns all of its jobs with the Swarming parameters: Release (NVIDIA)] bot is actually a virtual machine which spawns all of its jobs
with the Swarming parameters:
[Win10 Release (NVIDIA)]: https://ci.chromium.org/buildbot/chromium.gpu/Win10%20Release%20%28NVIDIA%29/?limit=200 [Win10 Release (NVIDIA)]: https://ci.chromium.org/buildbot/chromium.gpu/Win10%20Release%20%28NVIDIA%29/?limit=200
...@@ -239,44 +240,7 @@ This is described in [Adding new tests to the GPU bots]. ...@@ -239,44 +240,7 @@ This is described in [Adding new tests to the GPU bots].
### How to add a new bot ### How to add a new bot
The first decision point when adding a new GPU bot is whether it is a one-off #### How to add a new tester bot to the chromium.gpu.fyi waterfall
piece of hardware, or one which is expected to be scaled up at some point. If
it's a one-off piece of hardware, it can be added to the chromium.gpu.fyi
waterfall as a non-swarmed test machine. If it's expected to be scaled up at
some point, the hardware should be added to the swarming pool. These two
scenarios are described in more detail below.
#### How to add a new, non-swarmed, physical bot to the chromium.gpu.fyi waterfall
1. Work with the Chrome Infrastructure Labs team to get the hardware deployed
so it can talk to the chromium.gpu.fyi master.
1. Create a CL in the build workspace which:
1. Add the new machine to
[`masters/master.chromium.gpu.fyi/builders.pyl`][master.chromium.gpu.fyi/builders.pyl].
1. Add the new machine to
[`scripts/slave/recipe_modules/chromium_tests/chromium_gpu_fyi.py`][chromium_gpu_fyi.py].
Set the `enable_swarming` property to `False`.
1. Retrain recipe expectations
(`scripts/slave/recipes.py --use-bootstrap test train`) and add the
newly created JSON file(s) corresponding to the new machines to your CL.
1. Create a CL in the Chromium workspace to:
1. Add the new machine to
[`src/content/test/gpu/generate_buildbot_json.py`][generate_buildbot_json.py].
Make sure to set the `swarming` property to `False`.
1. If the machine runs GN, add a description to
[`src/tools/mb/mb_config.pyl`][mb_config.pyl].
1. Once the build workspace CL lands, use go/bug-a-trooper (or contact kbr@)
to schedule a restart of the chromium.gpu.fyi waterfall. This is only
necessary when modifying files under the masters/ directory. A reboot of
the machine may be needed once the waterfall has been restarted in order to
make it connect properly.
1. The CLs from (2) and (3) can land in either order, though it is preferable
to land the Chromium-side CL first so that the machine knows what tests to
run the first time it boots up.
[master.chromium.gpu.fyi/builders.pyl]: https://chromium.googlesource.com/chromium/tools/build/+/master/masters/master.chromium.gpu.fyi/builders.pyl
#### How to add a new swarmed bot to the chromium.gpu.fyi waterfall
When deploying a new GPU configuration, it should be added to the When deploying a new GPU configuration, it should be added to the
chromium.gpu.fyi waterfall first. The chromium.gpu waterfall should be reserved chromium.gpu.fyi waterfall first. The chromium.gpu waterfall should be reserved
...@@ -290,37 +254,63 @@ In order to add Release and Debug waterfall bots for a new configuration, ...@@ -290,37 +254,63 @@ In order to add Release and Debug waterfall bots for a new configuration,
experience has shown that at least 4 physical machines are needed in the experience has shown that at least 4 physical machines are needed in the
swarming pool. The reason is that the tests all run in parallel on the Swarming swarming pool. The reason is that the tests all run in parallel on the Swarming
cluster, so the load induced on the swarming bots is higher than it would be cluster, so the load induced on the swarming bots is higher than it would be
for a non-swarmed bot that executes its tests serially. if the tests were run strictly serially.
With these prerequisites, these are the steps to add a new swarmed bot. With these prerequisites, these are the steps to add a new (swarmed) tester bot.
(Actually, pair of bots -- Release and Debug.) (Actually, pair of bots -- Release and Debug. If deploying just one or the
other, ignore the other configuration.) These instructions assume that you are
reusing one of the existing builders, like [`GPU FYI Win Builder`][GPU FYI Win
Builder].
1. Work with the Chrome Infrastructure Labs team to get the (minimum 4) 1. Work with the Chrome Infrastructure Labs team to get the (minimum 4)
physical machines added to the Swarming pool. Use physical machines added to the Swarming pool. Use
[chromium-swarm.appspot.com] or `src/tools/swarming_client/swarming.py bots` [chromium-swarm.appspot.com] or `src/tools/swarming_client/swarming.py bots`
to determine the PCI IDs of the GPUs in the bots. (These instructions will to determine the PCI IDs of the GPUs in the bots. (These instructions will
need to be updated for Android bots which don't have PCI buses.) need to be updated for Android bots which don't have PCI buses.)
1. Make sure to add these new machines to the Chrome-GPU Swarming pool by 1. Make sure to add these new machines to the Chrome-GPU Swarming pool by
creating a CL against [`configs/chromium-swarm/bots.cfg`][bots.cfg] in creating a CL against [`configs/chromium-swarm/bots.cfg`][bots.cfg] in
the [infradata/config] workspace. the [infradata/config] (Google internal) workspace. Here is an [example
CL](https://chrome-internal-review.googlesource.com/524420).
1. File a Chrome Infrastructure Labs ticket requesting 2 virtual machines for 1. File a Chrome Infrastructure Labs ticket requesting 2 virtual machines for
the testers. These need to match the OS of the physical machines and the testers. These need to match the OS of the physical machines and
builders because of limitations in the scripts which transfer builds from builders because of limitations in the scripts which transfer builds from
the builder to the tester; see [this feature the builder to the tester; see [this feature
request](http://crbug.com/581953). For example, if you're adding a "Windows request](http://crbug.com/581953). For example, if you're adding a "Windows
7 CoolNewGPUType" tester, you'll need 2 Windows VMs. 7 CoolNewGPUType" tester, you'll need 2 Windows VMs.
1. Once the VMs are ready, create a CL in the build workspace which:
1. Adds the new VMs as the Release and Debug bots in 1. Once the VMs are ready, create a CL in the
[`master.chromium.gpu.fyi/builders.pyl`][master.chromium.gpu.fyi/builders.pyl]. [`infradata/config`][infradata/config] (Google internal) workspace which
1. Adds the new VMs to [`chromium_gpu_fyi.py`][chromium_gpu_fyi.py]. Make does the following. Here's an [example
sure to set the `enable_swarming` and `serialize_tests` properties to CL](https://chrome-internal-review.googlesource.com/619497).
`True`. Double-check the `parent_buildername` property for each. It 1. Adds two new "bot_group" blocks in the Chromium GPU FYI section of
must match the Release/Debug flavor of the builder. [`configs/chromium-swarm/bots.cfg`][bots.cfg], one for the Release bot
1. Retrain recipe expectations and one for the Debug bot. Copy the closest configuration you can find
(`scripts/slave/recipes.py --use-bootstrap test train`) and add the -- for example, Windows, Android, etc.
newly created JSON file(s) corresponding to the new machines to your CL. 1. Get this reviewed and landed. This step associates the VM with the bot's
1. Create a CL in the Chromium workspace which: name on the waterfall.
1. Adds the new machine to
1. Create a CL in the [`tools/build`][tools/build] workspace which does the
following. Here's an [example CL](https://chromium-review.googlesource.com/1041145).
1. Adds the new VMs to [`chromium_gpu_fyi.py`][chromium_gpu_fyi.py] in
`scripts/slave/recipe_modules/chromium_tests/`. Make sure to set the
`serialize_tests` property to `True`. This is specified for waterfall
bots, but not trybots, and helps avoid overloading the physical
hardware. Double-check the `BUILD_CONFIG` and `parent_buildername`
properties for each. They must match the Release/Debug flavor of the
builder, like `GPU FYI Win Builder` vs. `GPU FYI Win Builder (dbg)`.
1. Get this reviewed and landed. This step tells the Chromium recipe about
the newly-deployed waterfall bot. In the next CL we will generate the
JSON files which run tests on the bot.
1. It used to be necessary to retrain recipe expectations
(`scripts/slave/recipes.py --use-bootstrap test train`). This doesn't
appear to be necessary any more, but it's something to watch out for if
your CL fails presubmit for some reason.
1. Create a CL in the Chromium workspace which does the following. Here's an
[example CL](https://chromium-review.googlesource.com/1041164).
1. Adds the new machines to
`src/content/test/gpu/generate_buildbot_json.py`. `src/content/test/gpu/generate_buildbot_json.py`.
1. The swarming dimensions are crucial. These must match the GPU and 1. The swarming dimensions are crucial. These must match the GPU and
OS type of the physical hardware in the Swarming pool. This is what OS type of the physical hardware in the Swarming pool. This is what
...@@ -339,19 +329,42 @@ With these prerequisites, these are the steps to add a new swarmed bot. ...@@ -339,19 +329,42 @@ With these prerequisites, these are the steps to add a new swarmed bot.
`Windows-2008ServerR2-SP1` (the Win7-like flavor running in our `Windows-2008ServerR2-SP1` (the Win7-like flavor running in our
data center). Similarly, the Win8 bots had to have a very precise data center). Similarly, the Win8 bots had to have a very precise
OS description (`Windows-2012ServerR2-SP0`). OS description (`Windows-2012ServerR2-SP0`).
1. If the machine runs GN, adds a description to 1. If you're deploying a new bot that's similar to another existing
[`src/tools/mb/mb_config.pyl`][mb_config.pyl]. configuration, please search around in the file for references to
1. Once the tools/build CL lands, use go/bug-a-trooper (or contact kbr@) to the other bot's name and see if your new bot needs to be added to
schedule a restart of the chromium.gpu.fyi waterfall. This is only any exclusion lists. For example, some of the tests don't run on
necessary when modifying files under the masters/ directory. A reboot of certain Win bots because of missing OpenGL extensions.
the VMs may be needed once the waterfall has been restarted in order to 1. Run this script to regenerate
make them connect properly. `src/testing/buildbot/chromium.gpu.fyi.json`.
1. The CLs from (3) and (4) can land in either order, though it is preferable 1. Updates [`cr-buildbucket.cfg`][cr-buildbucket.cfg]:
to land the Chromium-side CL first so that the machine knows what tests to * Add the two new machines (Release and Debug) inside the
run the first time it boots up. luci.chromium.ci bucket. This sets up storage for the builds in the
system. Use the appropriate mixin; for example, "win-gpu-fyi-ci" has
already been set up for Windows GPU FYI bots on the waterfall.
1. Updates [`luci-scheduler.cfg`][luci-scheduler.cfg]:
* Add new "job" blocks for your new Release and Debug test bots. They
should go underneath the builder which triggers them (like "GPU Win
FYI Builder"), in alphabetical order. Make sure the "id" and
"builer" entries match. This job block should use the acl_sets
"triggered-by-parent-builders", because it's triggered by the
builder, and not by changes to the git repository.
1. Updates [`luci-milo.cfg`][luci-milo.cfg]:
* Add new "builders" blocks for your new testers (Release and Debug)
on the [`chromium.gpu.fyi`][chromium.gpu.fyi] console. Look at the
short names and categories and try to come up with a reasonable
organization.
1. If you were adding a new builder, you would need to also add the new
machine to [`src/tools/mb/mb_config.pyl`][mb_config.pyl].
1. After the Chromium-side CL lands it will take some time for all of the
configuration changes to be picked up by the system. The bot might be in a
purple state until that's happened, but it will recover.
[bots.cfg]: https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/bots.cfg [bots.cfg]: https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/bots.cfg
[infradata/config]: https://chrome-internal.googlesource.com/infradata/config/ [infradata/config]: https://chrome-internal.googlesource.com/infradata/config/
[cr-buildbucket.cfg]: https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/cr-buildbucket.cfg
[luci-milo.cfg]: https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/luci-milo.cfg
[luci-scheduler.cfg]: https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/luci-scheduler.cfg
[GPU FYI Win Builder]: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/GPU%20FYI%20Win%20Builder
#### How to start running tests on a new GPU type on an existing try bot #### How to start running tests on a new GPU type on an existing try bot
...@@ -444,34 +457,40 @@ to run the new driver on one of the waterfalls for a day or two to make sure ...@@ -444,34 +457,40 @@ to run the new driver on one of the waterfalls for a day or two to make sure
the tests are running reliably green before rolling out the driver update the tests are running reliably green before rolling out the driver update
everywhere. To do this: everywhere. To do this:
1. Work with the Chrome Infrastructure Labs team to deploy a single, 1. Make sure that all of the current Swarming jobs for this OS and GPU
non-swarmed, physical machine on the chromium.gpu.fyi waterfall running the configuration are targeted at the "stable" version of the driver in
new driver. The OS and GPU should exactly match the configuration you `src/testing/gpu/generate_buildbot_json.py`.
intend to upgrade. See 1. File a `Build Infrastructure` bug, component `Infra>Labs`, to have ~4 of the
[How to add a new, non-swarmed, physical bot to the chromium.gpu.fyi waterfall](#How-to-add-a-new_non-swarmed_physical-bot-to-the-chromium_gpu_fyi-waterfall). physical machines already in the Swarming pool upgraded to the new version
2. Hopefully, the new machine will pass the pixel tests. If it doesn't, then of the driver.
1. If an "experimental" version of this bot doesn't yet exist, follow the
instructions above for [How to add a new tester bot to the chromium.gpu.fyi
waterfall](#How-to-add-a-new-tester-bot-to-the-chromium_gpu_fyi-waterfall)
to deploy one.
1. Have this experimental bot target the new version of the driver in
`src/testing/gpu/generate_buildbot_json.py`.
1. Hopefully, the new machine will pass the pixel tests. If it doesn't, then
unfortunately, it'll be necessary to follow the instructions on unfortunately, it'll be necessary to follow the instructions on
[updating the pixel tests] to temporarily suppress the failures on this [updating the pixel tests] to temporarily suppress the failures on this
particular configuration. Keep the time window for these test suppressions particular configuration. Keep the time window for these test suppressions
as narrow as possible. as narrow as possible.
3. Watch the new machine for a day or two to make sure it's stable. 1. Watch the new machine for a day or two to make sure it's stable.
4. When it is, ask the Chrome Infrastructure Labs team to roll out the driver 1. When it is, update `src/testing/gpu/generate_buildbot_json.py` to use the
update across all of the similarly configured bots in the swarming pool. "gpu trigger script" functionality to select *either* the stable *or* the
5. If necessary, update pixel test expectations and remove the suppressions new driver version on the stable version of the bot. See [this
CL](https://chromium-review.googlesource.com/882344) for an example, though
that CL was targeting a different OS version rather than driver version.
1. After that lands, ask the Chrome Infrastructure Labs team to roll out the
driver update across all of the similarly configured bots in the swarming
pool.
1. If necessary, update pixel test expectations and remove the suppressions
added above. added above.
6. Prepare and land a CL removing the temporary machine from the 1. Remove the alternate swarming dimensions for the stable bot from
chromium.gpu.fyi waterfall. Request a waterfall restart. `generate_buildbot_json.py`, locking it to the new driver version.
7. File a ticket with the Chrome Infrastructure Labs team to reclaim the
temporary machine. Note that we leave the experimental bot in place. We could reclaim it, but it
seems worthwhile to continuously test the "next" version of graphics drivers as
Note that with recent improvements to Swarming, in particular [this well as the current stable ones.
RFE](https://github.com/luci/luci-py/issues/253) and others, these steps are no
longer strictly necessary – it's possible to target Swarming jobs at a
particular driver version. If
[`generate_buildbot_json.py`][generate_buildbot_json.py] were improved to be
more specific about the driver version on the various bots, then the machines
with the new drivers could simply be added to the Swarming pool, and this
process could be a lot simpler. Patches welcome. :)
[updating the pixel tests]: https://www.chromium.org/developers/testing/gpu-testing/#TOC-Updating-and-Adding-New-Pixel-Tests-to-the-GPU-Bots [updating the pixel tests]: https://www.chromium.org/developers/testing/gpu-testing/#TOC-Updating-and-Adding-New-Pixel-Tests-to-the-GPU-Bots
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment