Improve libFuzzer documentation based on fuzzathon feedback.

R=ochang@chromium.org,mmoroz@chromium.org

Bug:
Change-Id: Ie4a1446726e84b2cb3826515f06a9a9d74f3b0fb
Reviewed-on: https://chromium-review.googlesource.com/794593
Commit-Queue: Abhishek Arya <inferno@chromium.org>
Reviewed-by: Oliver Chang <ochang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#519889}

Commit 968d2f73 authored Nov 28, 2017 by Abhishek Arya Committed by Commit Bot Nov 28, 2017

Showing with 173 additions and 132 deletions

testing/libfuzzer/efficient_fuzzer.md +144 -112
testing/libfuzzer/getting_started.md +29 -20
--- a/testing/libfuzzer/efficient_fuzzer.md
+++ b/testing/libfuzzer/efficient_fuzzer.md
 # Efficient Fuzzer

-This document describes ways to determine your fuzzer efficiency and ways 
+This document describes ways to determine your fuzzer efficiency and ways
 to improve it.

 ## Overview

 Being a coverage-driven fuzzer, libFuzzer considers a certain input *interesting*
-if it results in new coverage. The set of all interesting inputs is called 
-*corpus*. 
-Items in corpus are constantly mutated in search of new interesting input.
-Corpus is usually maintained between multiple fuzzer runs.
+if it results in new code coverage. The set of all interesting inputs is called
+*corpus*.

-Following things can be so effective for a fuzzer, that we *strongly recommend*
-for any fuzzer:
+Items in corpus are constantly mutated in search of new interesting inputs.
+Corpus can be shared across fuzzer runs and grows over time as new code is
+reached.

-* [seed corpus](#Seed-Corpus) gives your fuzzer examples of input.
-* [fuzzer dictionary](#Fuzzer-Dictionary) improves fuzzer mutations by using
-  supplied dictionary.
+The following things are extremely effective for improving fuzzer efficiency, so we
+*strongly recommend* them for any fuzzer:
+
+* [seed corpus](#Seed-Corpus)
+* [fuzzer dictionary](#Fuzzer-Dictionary)

 There are several metrics you should look at to determine your fuzzer effectiveness:

-* [fuzzer speed](#Fuzzer-Speed) (exec/s)
+* [fuzzer speed](#Fuzzer-Speed)
 * [corpus size](#Corpus-Size)
 * [coverage](#Coverage)

 You can collect these metrics manually or take them from [ClusterFuzz status]
-pages.
+pages after a fuzzer is checked into the Chromium repository.

 ## Seed Corpus

-You can pass a corpus directory to a fuzzer that you run manually:
+Seed corpus is a set of *valid* and *interesting* inputs that serve as starting points
+for a fuzzer. If one is not provided, a fuzzer would have to guess these inputs
+from scratch, which can take an indefinite amount of time depending on the size
+of inputs.

-```
-./out/libfuzzer/my_fuzzer ~/tmp/my_fuzzer_corpus
-```
+Seed corpus works especially well for strictly defined file formats and data
+transmission protocols. 
+
+* For file format parsers, add valid files from your test suite.
+* For protocol parsers, add valid raw streams from test suite into separate files.

-The directory can initially be empty. The fuzzer would store all the interesting
-items it finds in the directory. You can help the fuzzer by "seeding" the corpus:
-simply copy interesting inputs for your function to the corpus directory before
-running. This works especially well for strictly defined file formats or data
-transmission protocols.
+Other examples include a graphics library seed corpus, which would be a variety of
+small PNG/JPG/GIF files.

-* For file-parsing functionality just use some valid files from your test suite.
+If you are running the fuzzer locally, you can pass a corpus directory as an argument:

-* For protocol processing targets put raw streams from test suite into separate
-files.
+```
+./out/libfuzzer/my_fuzzer ~/tmp/my_fuzzer_corpus
+```

+While libFuzzer can start with an empty corpus, most fuzzers require a seed corpus
+to be useful. The fuzzer would store all the interesting items it finds in that directory.

-ClusterFuzz uses seed corpus stored in Chromium repository. You need to add
-`seed_corpus` attribute to fuzzer target:
+ClusterFuzz uses seed corpus defined in Chromium source repository. You need to
+add a `seed_corpus` attribute to your fuzzer definition in BUILD.gn file:

 ```
 fuzzer_test("my_protocol_fuzzer") {
@@ -68,137 +74,161 @@ fuzzer_test("my_protocol_fuzzer") {
 }
 ```

-All files found in the directories and their subdirectories will be archived
-into `%YOUR_FUZZER_NAME%_seed_corpus.zip` output archive.
-
-If you don't want to store seed corpus in Chromium repository, you can upload
-corpus to Google Cloud Storage bucket used by ClusterFuzz:
+All files found in these directories and their subdirectories will be archived
+into a `<my_fuzzer_name>_seed_corpus.zip` output archive.

+If you can't store seed corpus in Chromium repository (e.g. it is too large, has
+licensing issues, etc), you can upload the corpus to Google Cloud Storage bucket
+used by ClusterFuzz:

 1) go to [Corpus GCS Bucket]
-
-2) open directory named `%YOUR_FUZZER_NAME%_static`
-
+2) open directory named `<my_fuzzer_name>_static`
 3) upload corpus files into the directory

-
-Alternative way is to use `gsutil` tool:
+An alternative and faster way is to use the [gsutil] command line tool:
 ```bash
-gsutil -m rsync <corpus_dir_on_disk> gs://clusterfuzz-corpus/libfuzzer/%YOUR_FUZZER_NAME%_static
+gsutil -m rsync <path_to_corpus> gs://clusterfuzz-corpus/libfuzzer/<my_fuzzer_name>_static
 ```

-### Corpus Minimization
-
-It's important to minimize seed corpus before uploading. The minimization can
-be done with `-merge=1` option of libFuzzer:
-
-```bash
-# Create an empty directory.
-mkdir seed_corpus_minimized
-# Run the fuzzer with -merge=1 flag.
-./my_fuzzer -merge=1 ./seed_corpus_minimized ./seed_corpus
-```
-
-After running the command above, `seed_corpus_minimized` directory will contain
-a minimized corpus that gives the same code coverage as the initial
-`seed_corpus` directory.
-
-
 ## Fuzzer Dictionary

-It is very useful to provide fuzzer a set of common words/values that you expect
-to find in the input. This greatly improves efficiency of finding new units and
-works especially well while fuzzing file format decoders.
+It is very useful to provide the fuzzer with a set of *common words or values* that you
+expect to find in the input. Adding a dictionary greatly improves the efficiency of
+finding new units and works especially well in certain use cases (e.g. fuzzing file
+format decoders).

-To add a dictionary, first create a dictionary file.
-Dictionary syntax is similar to that used by [AFL] for its -x option:
+To add a dictionary, first create a dictionary file. A dictionary file is a flat text file
+where tokens are listed one per line in the format `name="value"`. The
+alphanumeric name is ignored and can be omitted, although it is a convenient
+way to document the meaning of a particular token. The value must appear in
+quotes, with hex escaping (`\xNN`) applied to all non-printable, high-bit, or
+otherwise problematic characters (`\\` and `\"` shorthands are recognized too).
+This syntax is similar to the one used by the [AFL] fuzzing engine (`-x` option).
+
+An example dictionary looks like:

 ```
 # Lines starting with '#' and empty lines are ignored.

-# Adds "blah" (w/o quotes) to the dictionary.
+# Adds "blah" word (w/o quotes) to the dictionary.
 kw1="blah"
 # Use \\ for backslash and \" for quotes.
 kw2="\"ac\\dc\""
-# Use \xAB for hex values
+# Use \xAB for hex values.
 kw3="\xF7\xF8"
-# the name of the keyword followed by '=' may be omitted:
+# Key name before '=' can be omitted:
 "foo\x0Abar"
 ```

-Test your dictionary by running your fuzzer locally:
+Make sure to test your dictionary by running your fuzzer locally:

 ```bash
 ./out/libfuzzer/my_protocol_fuzzer -dict=<path_to_dict> <path_to_corpus>
 ```

-You should see lots of new units discovered.
+If the dictionary is effective, you should see new units discovered in fuzzer output.
+
+To submit a dictionary to Chromium repository:

-Add `dict` attribute to fuzzer target:
+1) Add the dictionary file in the same directory as your fuzz target, with name
+`<my_fuzzer>.dict`.
+2) Add `dict` attribute to fuzzer definition in BUILD.gn file:

 ```
 fuzzer_test("my_protocol_fuzzer") {
  ...
-  dict = "protocol.dict"
+  dict = "my_protocol_fuzzer.dict"
 }
 ```

-Make sure to submit dictionary file to git. The dictionary will be used
-automatically by ClusterFuzz once it picks up new fuzzer version (once a day).
+The dictionary will be used automatically by ClusterFuzz once it picks up a new
+revision build.
+
+### Corpus Minimization
+
+It's important to minimize seed corpus to a *small set of interesting inputs* before
+uploading. The seed corpus is synced to all fuzzing bots for every iteration, so it
+is important to keep it small both for fuzzing efficiency and to prevent our bots
+from running out of disk space (it should not exceed 1 GB).
+
+The minimization can be done using `-merge=1` option of libFuzzer:
+
+```bash
+# Create an empty directory.
+mkdir seed_corpus_minimized
+
+# Run the fuzzer with -merge=1 flag.
+./my_fuzzer -merge=1 ./seed_corpus_minimized ./seed_corpus
+```
+
+After running the command above, `seed_corpus_minimized` directory will contain
+a minimized corpus that gives the same code coverage as the initial
+`seed_corpus` directory.

 ## Fuzzer Speed

-Fuzzer speed is printed while fuzzer runs:
+Fuzzer speed is calculated in executions per second. It is printed while the fuzzer
+is running:

 ```
 #19346  NEW    cov: 2815 bits: 1082 indir: 43 units: 150 exec/s: 19346 L: 62
 ```

-Because libFuzzer performs randomized search, it is critical to have it as fast
-as possible. You should try to get to at least 1,000 exec/s. Profile the fuzzer
+Because libFuzzer performs randomized mutations, it is critical to have it run as
+fast as possible to navigate the large search space efficiently and find interesting
+code paths. You should try to get to at least 1,000 exec/s from your fuzzer runs
+locally before submitting the fuzzer to Chromium repository. Profile the fuzzer
 using any standard tool to see where it spends its time.


 ### Initialization/Cleanup

-Try to keep your fuzzing function as simple as possible. Prefer to use static
-initialization and shared resources rather than bringing environment up and down
-every single run.
+Try to keep the `LLVMFuzzerTestOneInput` function as simple as possible. If your fuzzing
+function is too complex, it can bring down fuzzer execution speed, or it might target
+very specific use cases and fail to account for unexpected scenarios.

-Fuzzers don't have to shutdown gracefully (we either kill them or they crash
-because sanitizer has found a problem). You can skip freeing static resource.
+Prefer to use static initialization and shared resources rather than bringing the
+environment up and down on every single run. Otherwise, every run pays the setup
+cost, which lowers fuzzer speed and its ability to find new interesting inputs.
+Check out the example on [startup initialization] in libFuzzer documentation.

-Of course all resources allocated within `LLVMFuzzerTestOneInput` function
-should be deallocated since this function is called millions of times during
-one fuzzing session.
+Fuzzers don't have to shut down gracefully. We either kill them or they crash
+because a sanitizer found a problem. You can skip freeing static
+resources.
+
+All resources allocated within `LLVMFuzzerTestOneInput` function should be
+de-allocated since this function is called millions of times during a fuzzing session.
+Otherwise, we will hit OOMs frequently and reduce overall fuzzing efficiency.


 ### Memory Usage

-Avoid allocation of dynamic memory wherever possible. Instrumentation works
-faster for stack-based and static objects than for heap allocated ones.
+Avoid allocation of dynamic memory wherever possible. Memory instrumentation
+works faster for stack-based and static objects than for heap-allocated ones.

-It is always a good idea to play with different versions of a fuzzer to find the
-fastest implementation.
+It is always a good idea to try different variants for your fuzzer locally, and then
+submit the fastest implementation.


 ### Maximum Testcase Length

-Experiment with different values of `-max_len` parameter. This parameter often
-significantly affects execution speed, but not always.
+You can control the maximum length of a test input using the `-max_len` parameter
+(see [custom options](#Custom-Options)). This parameter can often significantly
+improve execution speed. Beware that you might miss coverage and unexpected
+scenarios that only occur with longer inputs.

 1) Define which `-max_len` value is reasonable for your target. For example, it
 may be useless to fuzz an image decoder with too small value of testcase length.

 2) Increase the value defined on previous step. Check its influence on execution
 speed of fuzzer. If speed doesn't drop significantly for long inputs, it is fine
-to have some bigger value for `-max_len`.
+to have some bigger value for `-max_len` or even skip it completely.

-In general, bigger `-max_len` value gives better coverage. Coverage is main
+In general, bigger `-max_len` value gives better coverage which is the main
 priority for fuzzing. However, low execution speed may result in waste of
-resources used for fuzzing. If large inputs make fuzzer too slow you have to
-adjust value of `-max_len` and find a trade-off between coverage and execution
-speed.
+fuzzing resources and being unable to find interesting inputs in reasonable time.
+If large inputs make the fuzzer too slow, you should adjust the value of `-max_len`
+and find a trade-off between coverage and execution speed.

 *Note:* ClusterFuzz runs two different fuzzing engines (**LibFuzzer** and
 **AFL**) using the same target functions. AFL doesn't support `-max_len`
@@ -216,44 +246,44 @@ For more information check out the discussion in [issue 638836].

 ## Code Coverage

-[ClusterFuzz status] page provides fuzzer source-level coverage report from the
-recent run. Looking at the report might provide an insight to improve fuzzer
+[ClusterFuzz status] page provides fuzzer source-level coverage report from
+recent runs. Looking at the report might provide an insight to improve fuzzer
 coverage.

-You can also generate source-level coverage report locally via running
+You can also generate source-level coverage report locally by running the
 [coverage script] stored in Chromium repository. The script provides detailed
-instructions as well as usage example.
+instructions as well as a usage example.

 We encourage you to try out the script, as it usually generates a better code
 coverage visualization compared to the coverage report hosted on ClusterFuzz.
-
 *NOTE: This is an experimental feature and an active area of work. We are
 working on improving this process.*


 ## Corpus Size

-After running for a while the fuzzer would reach a plateau and won't discover
-new interesting input. Corpus for a reasonably complex functionality
-should contain hundreds (if not thousands) of items.
+After running for a while, the fuzzer would reach a plateau and won't discover
+new interesting inputs. Corpus for a reasonably complex functionality should
+contain hundreds (if not thousands) of items.

-Too small corpus size indicates some code barrier that
-libFuzzer is having problems penetrating. Common cases include: checksums,
-magic numbers etc. The easiest way to diagnose this problem is to generate a 
-[coverage report](#Coverage). To fix the issue you can:
+A corpus that is too small indicates that the fuzzer is hitting a code barrier and is
+unable to get past it. Common cases of such issues include checksums, magic numbers,
+etc. The easiest way to diagnose this problem is to generate and analyze a
+[coverage report](#Coverage). To fix the issue, you can:

-* change the code (e.g. disable crc checks while fuzzing)
-* prepare [corpus seed](#Corpus-Seed)
-* prepare [fuzzer dictionary](#Fuzzer-Dictionary)
-* specify [custom options](#Custom-Options)
+* change the code (e.g. disable crc checks while fuzzing).
+* prepare or improve [seed corpus](#Seed-Corpus).
+* prepare or improve [fuzzer dictionary](#Fuzzer-Dictionary).
+* add [custom options](#Custom-Options).

 ### Custom Options

-It is possible to specify [libFuzzer parameters](http://llvm.org/docs/LibFuzzer.html#usage)
-for any fuzzer being run at ClusterFuzz. Custom options will overwrite default
-values provided by ClusterFuzz.
+Custom options help to fine tune libFuzzer execution parameters and will also
+override the default values used by ClusterFuzz. Please read [libFuzzer options]
+page for detailed documentation on how these work.

-Just list all parameters in `libfuzzer_options` variable of build target:
+Add the options needed in `libfuzzer_options` attribute to your fuzzer definition in
+BUILD.gn file:

 ```
 fuzzer_test("my_protocol_fuzzer") {
@@ -266,11 +296,13 @@ fuzzer_test("my_protocol_fuzzer") {
 ```

 Please note that `dict` parameter should be provided [separately](#Fuzzer-Dictionary).
-Other options may be passed through `libfuzzer_options` property.
-
+All other options can be passed using `libfuzzer_options` property.

 [AFL]: http://lcamtuf.coredump.cx/afl/
 [ClusterFuzz status]: clusterfuzz.md#Status-Links
 [Corpus GCS Bucket]: https://goto.google.com/libfuzzer-clusterfuzz-corpus
 [issue 638836]: https://bugs.chromium.org/p/chromium/issues/detail?id=638836
 [coverage script]: https://cs.chromium.org/chromium/src/testing/libfuzzer/coverage.py
+[gsutil]: https://cloud.google.com/storage/docs/gsutil
+[libFuzzer options]: http://llvm.org/docs/LibFuzzer.html#options
+[startup initialization]: http://llvm.org/docs/LibFuzzer.html#startup-initialization
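
The dictionary syntax described in the efficient_fuzzer.md change above (`name="value"` entries with `\xNN`, `\\` and `\"` escapes) can be illustrated with a small C++ sketch. `MakeDictionaryEntry` is a hypothetical helper, not part of libFuzzer or Chromium, and it assumes that every byte outside printable ASCII should be hex-escaped.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not a libFuzzer or Chromium API): formats one
// dictionary line as name="value". An empty name yields a bare "value" line.
std::string MakeDictionaryEntry(const std::string& name,
                                const std::string& value) {
  std::string out = name.empty() ? "\"" : name + "=\"";
  for (unsigned char c : value) {
    if (c == '\\') {
      out += "\\\\";  // Backslash shorthand.
    } else if (c == '"') {
      out += "\\\"";  // Quote shorthand.
    } else if (c < 0x20 || c >= 0x7F) {
      // Assumption: hex-escape all non-printable and high-bit bytes.
      char buf[5];
      std::snprintf(buf, sizeof(buf), "\\x%02X", static_cast<unsigned>(c));
      out += buf;
    } else {
      out += static_cast<char>(c);
    }
  }
  out += '"';
  return out;
}
```

A helper like this could be used to generate dictionary lines from binary tokens (magic numbers, tags) found in test files, instead of escaping them by hand.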
--- a/testing/libfuzzer/getting_started.md
+++ b/testing/libfuzzer/getting_started.md
 # Getting Started with libFuzzer in Chrome

-*** note
-**Prerequisites:** libFuzzer in Chrome is supported with GN on Linux only. 
+***
+**Prerequisites:** libFuzzer in Chrome is supported with GN on Linux and Mac only. 
 ***

 This document will walk you through:
@@ -15,9 +15,9 @@ This document will walk you through:
 Use `use_libfuzzer` GN argument together with sanitizer to generate build files:

 *Notice*: current implementation also supports `use_afl` argument, but it is
-recommended to use libFuzzer for development. Running libFuzzer locally doesn't
-require any special configuration and quickly gives meaningful output for speed,
-coverage and other parameters.
+recommended to use libFuzzer for local development. Running libFuzzer locally
+doesn't require any special configuration and gives meaningful output quickly for
+speed, coverage and other parameters.

 ```bash
 # With address sanitizer
@@ -42,7 +42,7 @@ To get the exact GN configuration that are used on our builders, see

 ## Write Fuzzer Function

-Create a new .cc file and define a `LLVMFuzzerTestOneInput` function:
+Create a new `<my_fuzzer>.cc` file and define a `LLVMFuzzerTestOneInput` function:

 ```cpp
 #include <stddef.h>
@@ -54,11 +54,16 @@ extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
 }
 ```

-[url_parse_fuzzer.cc] is a simple example of real-world fuzzer.
+*Note*: You should create the fuzzer file `<my_fuzzer>.cc` next to the code that is being
+tested and in the same directory as your other unit tests. Please do not use the
+`testing/libfuzzer/fuzzers` directory. It was used for initial sample fuzzers and
+is no longer recommended for any new fuzzers.
+
+[quic_stream_factory_fuzzer.cc] is a simple example of a real-world fuzzer.

 ## Define GN Target

-Define `fuzzer_test` GN target:
+Define `fuzzer_test` GN target in BUILD.gn:

 ```python
 import("//testing/libfuzzer/fuzzer_test.gni")
@@ -88,7 +93,7 @@ INFO: PreferSmall: 1
 #2      NEW    cov: 2710 bits: 359 indir: 36 units: 2 exec/s: 0 L: 64 MS: 0
 ```

-The `... NEW ...` line appears when libFuzzer finds new and interesting input.
+The `... NEW ...` line appears when libFuzzer finds new and interesting inputs.
 An efficient fuzzer should be able to find lots of them rather quickly.
 The `... pulse ...` line will appear periodically to show the current status.

@@ -99,15 +104,15 @@ stacktrace, make sure that you have directory containing `llvm-symbolizer`
 binary added in `$PATH`. The symbolizer binary is included in Chromium's Clang
 package located at `third_party/llvm-build/Release+Asserts/bin/` directory.

-Alternatively, you can set `external_symbolizer_path` option via `ASAN_OPTIONS`
-env variable:
+Alternatively, you can set `external_symbolizer_path` option via
+`ASAN_OPTIONS` env variable:

 ```bash
 $ ASAN_OPTIONS=external_symbolizer_path=/my/local/llvm/build/llvm-symbolizer \
    ./fuzzer ./crash-input
 ```

-The same approach works with other sanitizers (e.g. `MSAN_OPTIONS` and others).
+The same approach works with other sanitizers (e.g. `MSAN_OPTIONS`, `UBSAN_OPTIONS`, etc).

 ## Improving Your Fuzzer

@@ -120,18 +125,19 @@ in [Seed Corpus] section of efficient fuzzer guide.
 *Make sure corpus files are appropriately licensed.*
 * Create mutation dictionary. With a `dict = "protocol.dict"` attribute and
 `key=value` dictionary file format, mutations can be more effective.
-See [Fuzzer Dictionary].
+See [Fuzzer Dictionary] section of efficient fuzzer guide.
 * Specify maximum testcase length. By default libFuzzer uses `-max_len=64`
 (or takes the longest testcase in a corpus). ClusterFuzz takes
 random value in range from `1` to `10000` for each fuzzing session and passes
 that value to libFuzzers. If corpus contains testcases of size greater than
 `max_len`, libFuzzer will use only first `max_len` bytes of such testcases. 
-See [Maximum Testcase Length].
+See [Maximum Testcase Length] section of efficient fuzzer guide.

 ## Disable noisy error message logging

-If the code that you are a fuzzing generates error messages when encountering
-incorrect or invalid data then you need to silence those errors in the fuzzer.
+If the code that you are fuzzing generates a lot of error messages when
+encountering incorrect or invalid data, then you need to silence those errors
+in the fuzzer. Otherwise, the fuzzer will be slow and inefficient.

 If the target uses the Chromium logging APIs, the best way to do that is to
 override the environment used for logging in your fuzzer:
@@ -148,12 +154,15 @@ Environment* env = new Environment();

 ## Submitting Fuzzer to ClusterFuzz

-ClusterFuzz builds and executes all `fuzzer_test` targets in the source tree.
-The only thing you should do is to submit a fuzzer into Chrome.
+ClusterFuzz builds and executes all `fuzzer_test` targets in the Chromium
+repository. It is extremely important to submit a fuzzer into the Chromium
+repository so that ClusterFuzz can run it at scale. Do not rely on just
+running fuzzing locally in your own environment, as it will catch far fewer
+issues; a fuzzer needs to run continuously to catch regressions.

 ## Next Steps

-* After your fuzzer is submitted, you should check its [ClusterFuzz status] in
+* After your fuzzer is submitted, you should check the [ClusterFuzz status] page in
 a day or two.
 * Check the [Efficient Fuzzer Guide] to better understand your fuzzer
 performance and for optimization hints.
@@ -168,5 +177,5 @@ performance and for optimization hints.
 [Seed Corpus]: efficient_fuzzer.md#Seed-Corpus
 [Undefined Behavior Sanitizer]: http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
 [crbug/598448]: https://bugs.chromium.org/p/chromium/issues/detail?id=598448
-[url_parse_fuzzer.cc]: https://code.google.com/p/chromium/codesearch#chromium/src/testing/libfuzzer/fuzzers/url_parse_fuzzer.cc
+[quic_stream_factory_fuzzer.cc]: https://cs.chromium.org/chromium/src/net/quic/chromium/quic_stream_factory_fuzzer.cc
 [Build Config]: reference.md#Builder-configurations
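
The `Environment* env = new Environment();` pattern referenced in the getting_started.md hunks above, combined with the advice to release per-input resources, can be sketched as a self-contained fuzz target. This is only an illustration under stated assumptions: `ParseForTest` and the `init_count` counter are hypothetical stand-ins, and a real Chromium fuzzer would place its process-wide setup (for example, logging configuration) in the `Environment` constructor instead.

```cpp
#include <stddef.h>
#include <stdint.h>

#include <string>

// One-time setup object. The counter is hypothetical and exists only to make
// it observable that the constructor runs once per process.
struct Environment {
  Environment() { ++init_count; }
  static int init_count;
};
int Environment::init_count = 0;

// Created once at load time and intentionally never freed: fuzzers do not
// need to shut down gracefully.
Environment* env = new Environment();

// Hypothetical stand-in for the code under test.
bool ParseForTest(const std::string& input) {
  return !input.empty() && input[0] == '{';
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  // Per-input memory is owned by std::string and released on return, so
  // repeated calls do not accumulate allocations.
  std::string input(reinterpret_cast<const char*>(data), size);
  ParseForTest(input);
  return 0;
}
```

Keeping the setup outside `LLVMFuzzerTestOneInput` means each execution only pays for the parsing itself, which is what the exec/s metric measures.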