Commit d5b7ee52 authored by David Munro's avatar David Munro Committed by Chromium LUCI CQ

crostini: Timeouts for every step

NEVER_FINISHED is one of the larger (and growing :'( ) error buckets.
Add timeouts to every step to ensure that hangs are caught, turned into
errors for the right state, and the user has a chance to retry.

To get there we:
1. Add some timeouts, initially set to quite generous values which we'll
   tighten in future CLs. We have crosbolt data, but would like some
   real-world timings before getting stricter.
2. Add a metric tracking the time  spent in each step, to give us a
   real-world breakdown to supplement what we have in crosbolt
3. Add a metric tracking how often we see invalid transitions. In theory
   a slow operation could send its response during the next restart
   attempt, and since there's no validation on the state transitions
   we'd end up in a weird state with unknown effects. There are options
   with varying thoroughness (and corresponding cost) to address this,
   but first let's see if it's actually a problem. This could already
   happen (though we haven't had reports) and there are some mitigations
   in place already so it may be a non-issue.

Bug: 1142319
Test: Unit tests, mannually run installation + app launch.
Change-Id: Iafd15cf8f659b9a29764ce399359a29711e35043
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2563042
Commit-Queue: David Munro <davidmunro@google.com>
Reviewed-by: default avatarMark Pearson <mpearson@chromium.org>
Reviewed-by: default avatarRyo Hashimoto <hashimoto@chromium.org>
Reviewed-by: default avatarDominick Ng <dominickn@chromium.org>
Reviewed-by: default avatarMichael Lippautz <mlippautz@chromium.org>
Reviewed-by: default avatarNicholas Verne <nverne@chromium.org>
Reviewed-by: default avatarFergus Dall <sidereal@google.com>
Cr-Commit-Position: refs/heads/master@{#836505}
parent c0b1e944
......@@ -645,6 +645,12 @@ class CrostiniManager : public KeyedService,
CrostiniMicSharingEnabledObserver* observer);
void RemoveCrostiniMicSharingEnabledObserver(
CrostiniMicSharingEnabledObserver* observer);
void CallRestarterStartLxdContainerFinishedForTesting(
CrostiniManager::RestartId id,
CrostiniResult result);
void SetInstallTerminaNeverCompletesForTesting(bool never_completes) {
install_termina_never_completes_ = never_completes;
}
private:
class CrostiniRestarter;
......@@ -930,6 +936,8 @@ class CrostiniManager : public KeyedService,
TerminaInstaller termina_installer_{};
bool install_termina_never_completes_ = false;
// Note: This should remain the last member so it'll be destroyed and
// invalidate its weak pointers before any other members are destroyed.
base::WeakPtrFactory<CrostiniManager> weak_ptr_factory_{this};
......
......@@ -79,7 +79,17 @@ enum class CrostiniResult {
NEVER_FINISHED = 49,
CONTAINER_SETUP_FAILED = 50,
START_LXD_FAILED = 51,
kMaxValue = START_LXD_FAILED,
INSTALL_IMAGE_LOADER_TIMED_OUT = 52,
CREATE_DISK_IMAGE_TIMED_OUT = 53,
START_TERMINA_VM_TIMED_OUT = 54,
START_LXD_TIMED_OUT = 55,
CREATE_CONTAINER_TIMED_OUT = 56,
SETUP_CONTAINER_TIMED_OUT = 57,
START_CONTAINER_TIMED_OUT = 58,
FETCH_SSH_KEYS_TIMED_OUT = 59,
MOUNT_CONTAINER_TIMED_OUT = 60,
UNKNOWN_STATE_TIMED_OUT = 61,
kMaxValue = UNKNOWN_STATE_TIMED_OUT,
// When adding a new value, check you've followed the steps in the comment at
// the top of this enum.
};
......
......@@ -4,18 +4,21 @@
module crostini.mojom;
// Keep this in sync with CrostiniInstallerState in enums.xml
// These values are persisted to logs. Entries should not be renumbered and
// numeric values should never be reused.
enum InstallerState {
kStart, // Just started installation
kInstallImageLoader, // Loading the Termina VM component.
kCreateDiskImage, // Creating the image for the Termina VM.
kStartTerminaVm, // Starting the Termina VM.
kStartLxd, // Starting LXD.
kCreateContainer, // Creating the container inside the Termina VM.
kSetupContainer, // Setting up the container inside the Termina VM.
kStartContainer, // Starting the container inside the Termina VM.
kFetchSshKeys, // Fetch ssh keys from concierge.
kMountContainer, // Do sshfs mount of container.
kConfigureContainer, // Configuring the container inside the Termina VM.
kStart = 0, // Just started installation
kInstallImageLoader = 1, // Loading the Termina VM component.
kCreateDiskImage = 2, // Creating the image for the Termina VM.
kStartTerminaVm = 3, // Starting the Termina VM.
kStartLxd = 4, // Starting LXD.
kCreateContainer = 5, // Creating the container inside the Termina VM.
kSetupContainer = 6, // Setting up the container inside the Termina VM.
kStartContainer = 7, // Starting the container inside the Termina VM.
kFetchSshKeys = 8, // Fetch ssh keys from concierge.
kMountContainer = 9, // Do sshfs mount of container.
kConfigureContainer = 10, // Configuring the container inside the Termina VM.
};
enum InstallerError {
......
......@@ -215,9 +215,10 @@ void FakeCiceroneClient::CreateLxdContainer(
const vm_tools::cicerone::CreateLxdContainerRequest& request,
DBusMethodCallback<vm_tools::cicerone::CreateLxdContainerResponse>
callback) {
base::ThreadTaskRunnerHandle::Get()->PostTask(
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(std::move(callback), create_lxd_container_response_));
base::BindOnce(std::move(callback), create_lxd_container_response_),
send_create_lxd_container_response_delay_);
// Trigger CiceroneClient::Observer::NotifyLxdContainerCreatedSignal.
vm_tools::cicerone::LxdContainerCreatedSignal signal;
......@@ -225,9 +226,11 @@ void FakeCiceroneClient::CreateLxdContainer(
signal.set_vm_name(request.vm_name());
signal.set_container_name(request.container_name());
signal.set_status(lxd_container_created_signal_status_);
base::ThreadTaskRunnerHandle::Get()->PostTask(
FROM_HERE, base::BindOnce(&FakeCiceroneClient::NotifyLxdContainerCreated,
weak_factory_.GetWeakPtr(), std::move(signal)));
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(&FakeCiceroneClient::NotifyLxdContainerCreated,
weak_factory_.GetWeakPtr(), std::move(signal)),
send_notify_lxd_container_created_signal_delay_);
}
void FakeCiceroneClient::DeleteLxdContainer(
......@@ -245,33 +248,35 @@ void FakeCiceroneClient::StartLxdContainer(
callback) {
start_lxd_container_response_.mutable_os_release()->CopyFrom(
lxd_container_os_release_);
base::ThreadTaskRunnerHandle::Get()->PostTask(
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(std::move(callback), start_lxd_container_response_));
base::BindOnce(std::move(callback), start_lxd_container_response_),
send_start_lxd_container_response_delay_);
// Trigger CiceroneClient::Observer::NotifyLxdContainerStartingSignal.
vm_tools::cicerone::LxdContainerStartingSignal signal;
signal.set_owner_id(request.owner_id());
signal.set_vm_name(request.vm_name());
signal.set_container_name(request.container_name());
signal.set_status(lxd_container_starting_signal_status_);
signal.mutable_os_release()->CopyFrom(lxd_container_os_release_);
base::ThreadTaskRunnerHandle::Get()->PostTask(
FROM_HERE, base::BindOnce(&FakeCiceroneClient::NotifyLxdContainerStarting,
weak_factory_.GetWeakPtr(), std::move(signal)));
if (send_container_started_signal_) {
// Trigger CiceroneClient::Observer::NotifyContainerStartedSignal.
vm_tools::cicerone::ContainerStartedSignal signal;
signal.set_owner_id(request.owner_id());
signal.set_vm_name(request.vm_name());
signal.set_container_name(request.container_name());
signal.set_container_username(last_container_username_);
base::ThreadTaskRunnerHandle::Get()->PostTask(
FROM_HERE,
base::BindOnce(&FakeCiceroneClient::NotifyContainerStarted,
weak_factory_.GetWeakPtr(), std::move(signal)));
}
vm_tools::cicerone::LxdContainerStartingSignal starting_signal;
starting_signal.set_owner_id(request.owner_id());
starting_signal.set_vm_name(request.vm_name());
starting_signal.set_container_name(request.container_name());
starting_signal.set_status(lxd_container_starting_signal_status_);
starting_signal.mutable_os_release()->CopyFrom(lxd_container_os_release_);
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(&FakeCiceroneClient::NotifyLxdContainerStarting,
weak_factory_.GetWeakPtr(), std::move(starting_signal)),
send_container_starting_signal_delay_);
// Trigger CiceroneClient::Observer::NotifyContainerStartedSignal.
vm_tools::cicerone::ContainerStartedSignal started_signal;
started_signal.set_owner_id(request.owner_id());
started_signal.set_vm_name(request.vm_name());
started_signal.set_container_name(request.container_name());
started_signal.set_container_username(last_container_username_);
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(&FakeCiceroneClient::NotifyContainerStarted,
weak_factory_.GetWeakPtr(), std::move(started_signal)),
send_container_started_signal_delay_);
}
void FakeCiceroneClient::GetLxdContainerUsername(
......@@ -289,9 +294,10 @@ void FakeCiceroneClient::SetUpLxdContainerUser(
DBusMethodCallback<vm_tools::cicerone::SetUpLxdContainerUserResponse>
callback) {
last_container_username_ = request.container_username();
base::ThreadTaskRunnerHandle::Get()->PostTask(
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(std::move(callback), setup_lxd_container_user_response_));
base::BindOnce(std::move(callback), setup_lxd_container_user_response_),
send_set_up_lxd_container_user_response_delay_);
}
void FakeCiceroneClient::ExportLxdContainer(
......@@ -369,8 +375,9 @@ void FakeCiceroneClient::CancelUpgradeContainer(
void FakeCiceroneClient::StartLxd(
const vm_tools::cicerone::StartLxdRequest& request,
DBusMethodCallback<vm_tools::cicerone::StartLxdResponse> callback) {
base::ThreadTaskRunnerHandle::Get()->PostTask(
FROM_HERE, base::BindOnce(std::move(callback), start_lxd_response_));
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE, base::BindOnce(std::move(callback), start_lxd_response_),
send_start_lxd_response_delay_);
}
void FakeCiceroneClient::AddFileWatch(
......
......@@ -7,6 +7,7 @@
#include "base/memory/weak_ptr.h"
#include "base/observer_list.h"
#include "base/time/time.h"
#include "chromeos/dbus/cicerone_client.h"
namespace chromeos {
......@@ -283,9 +284,6 @@ class COMPONENT_EXPORT(CHROMEOS_DBUS) FakeCiceroneClient
void set_lxd_container_os_release(vm_tools::cicerone::OsRelease os_release) {
lxd_container_os_release_ = std::move(os_release);
}
void set_send_container_started_signal(bool send) {
send_container_started_signal_ = send;
}
void set_apply_ansible_playbook_response(
const vm_tools::cicerone::ApplyAnsiblePlaybookResponse&
apply_ansible_playbook_response) {
......@@ -322,6 +320,30 @@ class COMPONENT_EXPORT(CHROMEOS_DBUS) FakeCiceroneClient
get_vsh_session_response_ = std::move(get_vsh_session_response);
}
void set_send_container_starting_signal_delay(base::TimeDelta delay) {
send_container_starting_signal_delay_ = delay;
}
void set_send_container_started_signal_delay(base::TimeDelta delay) {
send_container_started_signal_delay_ = delay;
}
void set_send_start_lxd_response_delay(base::TimeDelta delay) {
send_start_lxd_response_delay_ = delay;
}
void set_send_create_lxd_container_response_delay(base::TimeDelta delay) {
send_create_lxd_container_response_delay_ = delay;
}
void set_send_notify_lxd_container_created_signal_delay(
base::TimeDelta delay) {
send_notify_lxd_container_created_signal_delay_ = delay;
}
void set_send_set_up_lxd_container_user_response_delay(
base::TimeDelta delay) {
send_set_up_lxd_container_user_response_delay_ = delay;
}
void set_send_start_lxd_container_response_delay(base::TimeDelta delay) {
send_start_lxd_container_response_delay_ = delay;
}
// Returns true if the method has been invoked at least once, false otherwise.
bool configure_for_arc_sideload_called() {
return configure_for_arc_sideload_called_;
......@@ -381,7 +403,6 @@ class COMPONENT_EXPORT(CHROMEOS_DBUS) FakeCiceroneClient
bool is_file_watch_triggered_signal_connected_ = true;
std::string last_container_username_;
bool send_container_started_signal_ = true;
bool configure_for_arc_sideload_called_ = false;
......@@ -432,6 +453,14 @@ class COMPONENT_EXPORT(CHROMEOS_DBUS) FakeCiceroneClient
vm_tools::cicerone::RemoveFileWatchResponse remove_file_watch_response_;
vm_tools::cicerone::GetVshSessionResponse get_vsh_session_response_;
base::TimeDelta send_container_starting_signal_delay_;
base::TimeDelta send_container_started_signal_delay_;
base::TimeDelta send_start_lxd_response_delay_;
base::TimeDelta send_create_lxd_container_response_delay_;
base::TimeDelta send_notify_lxd_container_created_signal_delay_;
base::TimeDelta send_set_up_lxd_container_user_response_delay_;
base::TimeDelta send_start_lxd_container_response_delay_;
vm_tools::cicerone::OsRelease lxd_container_os_release_;
LaunchContainerApplicationCallback launch_container_application_callback_;
......
......@@ -81,9 +81,10 @@ void FakeConciergeClient::CreateDiskImage(
const vm_tools::concierge::CreateDiskImageRequest& request,
DBusMethodCallback<vm_tools::concierge::CreateDiskImageResponse> callback) {
create_disk_image_called_ = true;
base::ThreadTaskRunnerHandle::Get()->PostTask(
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(std::move(callback), create_disk_image_response_));
base::BindOnce(std::move(callback), create_disk_image_response_),
send_create_disk_image_response_delay_);
}
void FakeConciergeClient::CreateDiskImageWithFd(
......@@ -166,8 +167,9 @@ void FakeConciergeClient::StartTerminaVm(
const vm_tools::concierge::StartVmRequest& request,
DBusMethodCallback<vm_tools::concierge::StartVmResponse> callback) {
start_termina_vm_called_ = true;
base::ThreadTaskRunnerHandle::Get()->PostTask(
FROM_HERE, base::BindOnce(std::move(callback), start_vm_response_));
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE, base::BindOnce(std::move(callback), start_vm_response_),
send_start_vm_response_delay_);
if (!start_vm_response_ ||
start_vm_response_->status() != vm_tools::concierge::VM_STATUS_STARTING) {
......@@ -179,10 +181,12 @@ void FakeConciergeClient::StartTerminaVm(
vm_tools::cicerone::TremplinStartedSignal tremplin_started_signal;
tremplin_started_signal.set_vm_name(request.name());
tremplin_started_signal.set_owner_id(request.owner_id());
base::ThreadTaskRunnerHandle::Get()->PostTask(
FROM_HERE, base::BindOnce(&FakeConciergeClient::NotifyTremplinStarted,
weak_ptr_factory_.GetWeakPtr(),
std::move(tremplin_started_signal)));
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(&FakeConciergeClient::NotifyTremplinStarted,
weak_ptr_factory_.GetWeakPtr(),
std::move(tremplin_started_signal)),
send_tremplin_started_signal_delay_);
}
void FakeConciergeClient::NotifyTremplinStarted(
......@@ -263,9 +267,10 @@ void FakeConciergeClient::GetContainerSshKeys(
callback) {
get_container_ssh_keys_called_ = true;
base::ThreadTaskRunnerHandle::Get()->PostTask(
base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(std::move(callback), container_ssh_keys_response_));
base::BindOnce(std::move(callback), container_ssh_keys_response_),
send_get_container_ssh_keys_response_delay_);
}
void FakeConciergeClient::AttachUsbDevice(
......
......@@ -10,6 +10,7 @@
#include "base/memory/weak_ptr.h"
#include "base/observer_list.h"
#include "base/optional.h"
#include "base/time/time.h"
#include "chromeos/dbus/cicerone_client.h"
#include "chromeos/dbus/concierge_client.h"
......@@ -252,6 +253,19 @@ class COMPONENT_EXPORT(CHROMEOS_DBUS) FakeConciergeClient
resize_disk_image_response_ = resize_disk_image_response;
}
void set_send_create_disk_image_response_delay(base::TimeDelta delay) {
send_create_disk_image_response_delay_ = delay;
}
void set_send_start_vm_response_delay(base::TimeDelta delay) {
send_start_vm_response_delay_ = delay;
}
void set_send_tremplin_started_signal_delay(base::TimeDelta delay) {
send_tremplin_started_signal_delay_ = delay;
}
void send_get_container_ssh_keys_response_delay(base::TimeDelta delay) {
send_get_container_ssh_keys_response_delay_ = delay;
}
void NotifyVmStarted(const vm_tools::concierge::VmStartedSignal& signal);
void NotifyVmStopped(const vm_tools::concierge::VmStoppedSignal& signal);
bool HasVmObservers() const;
......@@ -328,6 +342,11 @@ class COMPONENT_EXPORT(CHROMEOS_DBUS) FakeConciergeClient
resize_disk_image_response_;
base::Optional<vm_tools::concierge::SetVmIdResponse> set_vm_id_response_;
base::TimeDelta send_create_disk_image_response_delay_;
base::TimeDelta send_start_vm_response_delay_;
base::TimeDelta send_tremplin_started_signal_delay_;
base::TimeDelta send_get_container_ssh_keys_response_delay_;
// Can be set to fake a series of disk image status signals.
std::vector<vm_tools::concierge::DiskImageStatusResponse>
disk_image_status_signals_;
......
......@@ -14641,6 +14641,22 @@ metrics consent we also won't be able to send UMA metrics. -->
<int value="5" label="Failed Insufficient Space"/>
</enum>
<enum name="CrostiniInstallerState">
<!-- Keep this in sync with InstallerState in crostini_types.mojom -->
<int value="0" label="Start"/>
<int value="1" label="InstallImageLoader"/>
<int value="2" label="CreateDiskImage"/>
<int value="3" label="StartTerminaVm"/>
<int value="4" label="StartLxd"/>
<int value="5" label="CreateContainer"/>
<int value="6" label="SetupContainer"/>
<int value="7" label="StartContainer"/>
<int value="8" label="FetchSshKeys"/>
<int value="9" label="MountContainer"/>
<int value="10" label="ConfigureContainer"/>
</enum>
<enum name="CrostiniResult">
<int value="0" label="SUCCESS"/>
<int value="4" label="CREATE_DISK_IMAGE_FAILED"/>
......@@ -14684,6 +14700,16 @@ metrics consent we also won't be able to send UMA metrics. -->
<int value="49" label="NEVER_FINISHED"/>
<int value="50" label="CONTAINER_SETUP_FAILED"/>
<int value="51" label="START_LXD_FAILED"/>
<int value="52" label="INSTALL_IMAGE_LOADER_TIMED_OUT"/>
<int value="53" label="CREATE_DISK_IMAGE_TIMED_OUT"/>
<int value="54" label="START_TERMINA_VM_TIMED_OUT"/>
<int value="55" label="START_LXD_TIMED_OUT"/>
<int value="56" label="CREATE_CONTAINER_TIMED_OUT"/>
<int value="57" label="SETUP_CONTAINER_TIMED_OUT"/>
<int value="58" label="START_CONTAINER_TIMED_OUT"/>
<int value="59" label="FETCH_SSH_KEYS_TIMED_OUT"/>
<int value="60" label="MOUNT_CONTAINER_TIMED_OUT"/>
<int value="61" label="UNKNOWN_STATE_TIMED_OUT"/>
</enum>
<enum name="CrostiniSettingsEvent">
......@@ -21,6 +21,19 @@ reviews. Googlers can read more about this at go/gwsq-gerrit.
<histograms>
<variants name="CrostiniState">
<variant name="CreateContainer" summary="creating container"/>
<variant name="CreateDiskImage" summary="creating disk image"/>
<variant name="FetchSshKeys" summary="fetching SSH keys"/>
<variant name="InstallImageLoader" summary="installing component or DLC"/>
<variant name="MountContainer" summary="creating SSHFS mounts"/>
<variant name="SetupContainer" summary="setting up container for use"/>
<variant name="Start" summary="initial"/>
<variant name="StartContainer" summary="starting container"/>
<variant name="StartLxd" summary="starting LXD"/>
<variant name="StartTerminaVm" summary="starting VM"/>
</variants>
<histogram name="Crostini.AppLaunch" enum="CrostiniAppLaunchAppType"
expires_after="2021-01-31">
<owner>clumptini@google.com</owner>
......@@ -238,6 +251,19 @@ reviews. Googlers can read more about this at go/gwsq-gerrit.
</summary>
</histogram>
<histogram name="Crostini.InvalidStateTransition" enum="CrostiniInstallerState"
expires_after="2021-04-04">
<owner>clumptini@google.com</owner>
<owner>tbuckley@chromium.org</owner>
<summary>
Emitted when an invalid request to transition states during the Crostini
restarter flow is received. For example, when a container start signal is
received when the restarter is still waiting for the vm to finish launching.
The value is the state the restarter should have been in for the transition
to be legal (in this example StartContainer).
</summary>
</histogram>
<histogram name="Crostini.RecoverySource" enum="CrostiniUISurface"
expires_after="2021-01-31">
<owner>clumptini@google.com</owner>
......@@ -268,6 +294,19 @@ reviews. Googlers can read more about this at go/gwsq-gerrit.
</summary>
</histogram>
<histogram name="Crostini.RestarterTimeInState.{state}" units="ms"
expires_after="2021-05-16">
<owner>clumptini@google.com</owner>
<owner>tbuckley@chromium.org</owner>
<summary>
Base histogram for measuring how much time the restarter flow spends in the
{state} state. Used to set timeouts. Note that this since this is for any
restarter run (no-op relaunch, installation, etc) the results are expected
to be multi-modal.
</summary>
<token key="state" variants="CrostiniState"/>
</histogram>
<histogram name="Crostini.Restore" enum="CrostiniImportContainerResult"
expires_after="2021-06-06">
<owner>clumptini@google.com</owner>
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment