Convert abarth's "Strings in Blink" doc to a README in the tree

In August 2013, Adam Barth wrote an overview of the WTF::String type[1] and associated classes. Now that we prefer to keep docs in the tree, move it into Blink's platform/wtf/text directory as a README. The following edits were made to the doc: * Markdown formatting (headers, italics, code) * Updated StringImpl flag count/descriptions * Updated method names with new casing/spelling. * Dropped discussion of String::Append() (removed) * Dropped #blink IRC channel reference [1] https://docs.google.com/document/d/1kOCUlJdh2WJMJGDf-WoEQhmnjKLaOYRbiHz5TiGJl14/edit?usp=sharing Change-Id: I6e6c92ef21d32f647487e0558c3fdf66cd7f7e51 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/1555003 Auto-Submit: Joshua Bell <jsbell@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Commit-Queue: Joshua Bell <jsbell@chromium.org> Cr-Commit-Position: refs/heads/master@{#648730}

Convert abarth's "Strings in Blink" doc to a README in the tree
In August 2013, Adam Barth wrote an overview of the WTF::String type[1] and associated classes. Now that we prefer to keep docs in the tree, move it into Blink's platform/wtf/text directory as a README. The following edits were made to the doc: * Markdown formatting (headers, italics, code) * Updated StringImpl flag count/descriptions * Updated method names with new casing/spelling. * Dropped discussion of String::Append() (removed) * Dropped #blink IRC channel reference [1] https://docs.google.com/document/d/1kOCUlJdh2WJMJGDf-WoEQhmnjKLaOYRbiHz5TiGJl14/edit?usp=sharing Change-Id: I6e6c92ef21d32f647487e0558c3fdf66cd7f7e51 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/1555003 Auto-Submit: Joshua Bell <jsbell@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Commit-Queue: Joshua Bell <jsbell@chromium.org> Cr-Commit-Position: refs/heads/master@{#648730}
393435b9 · Joshua Bell · Commit Bot · 61750b89 · 393435b9 · 393435b9
Commit 393435b9 authored Apr 08, 2019 by Joshua Bell Committed by Commit Bot Apr 08, 2019
2 changed files
--- a/third_party/blink/renderer/platform/wtf/text/README.md
+++ b/third_party/blink/renderer/platform/wtf/text/README.md
+# Strings in Blink
+
+_Everything you always wanted to know but were afraid to ask_
+
+This document covers the `String` type in Blink, often written with an
+explicit namespace as `WTF::String` to disambiguate from string
+concepts or other types. It also briefly covers associated classes
+used for constructing strings (`StringBuilder`, `StringBuffer`), the
+internal `StringImpl` class, and the special `AtomicString` variant.
+It does not cover other text-related types or utilities (e.g.
+encodings, views, line endings, etc).
+
+## Overview
+
+A `WTF::String` represents a sequence of zero or more Unicode code
+points. A `String` can also represent one of two zero-length strings:
+the empty string and the null string. These correspond to "" and
+`null` in JavaScript, respectively. Both the empty and the null string
+return true from `String::IsEmpty()` but only the null string returns
+true from `String::IsNull()`.
+
+
+Unlike `std::string`, Blink’s `String` object is a pointer to a
+reference counted character buffer. This design makes it easier to
+share the underlying character buffer between different consumers
+because multiple consumers can reference the same underlying buffer.
+The disadvantage of this design is that we need to be careful when
+mixing Strings with multithreading because the character buffer’s
+reference counting is not thread safe.
+
+
+## Storage
+
+### Encoding
+
+A `String` can represent Unicode code points with either `LChar`s or
+`UChar`s, which use 8 bits and 16 bits per code unit respectively.
+Each `LChar` represents a single code unit of ISO-8859-1, more
+commonly called Latin-1 (hence the L in `LChar`). Unlike UTF-8, this
+encoding cannot represent every Unicode code point. However, also
+unlike UTF-8, every representable Unicode code point can represented
+with a single code unit and the code unit is simply the 8 least
+significant bits of the code point. This property makes Latin-1 an
+attractive encoding because we can decode a Latin-1 code unit to a
+Unicode code point simply by zero-extending the `LChar` value.
+
+
+Each `UChar` represents a single UTF-16 code unit (hence the U in
+`UChar`). Unfortunately, Strings do not always contain valid UTF-16
+sequences. Strings that have round-tripped through JavaScript can
+contain invalid UTF-16 sequences because JavaScript isn’t required to
+pair surrogates in its strings. Most code that works with Strings can
+ignore this issue because they operate code-unit-by-code-unit, but
+subsystems need to operate on code points outside the Basic
+Multilingual Plane need to be prepared to handle unpaired surrogates.
+
+
+In addition to `LChar` and `UChar`, Strings also use the type
+`UChar32`, which is a UTF-32 code unit. `UChar32` is particularly easy
+to work with because every UTF-32 code unit has the same numerical
+value as its corresponding Unicode code point. Practically speaking,
+that means you can treat `UChar32` values as if they were Unicode code
+points.
+
+### Layout
+
+The `String` object itself is simply a pointer to a `StringImpl`
+object, which contains the actual character buffer. `String` uses a
+`scoped_refptr<>` to automatically `AddRef()` and `Release()` the
+`StringImpl` object on construction and destruction. The `StringImpl`
+pointer can be zero, in which case the `String` represents the null
+string. Typically, `String` objects are allocated on the stack or as
+members variables. `StringImpl` objects are always allocated in the
+heap.
+
+
+`StringImpl` objects are (logically) immutable, but a given `String`
+object can refer to different `StringImpl` objects over time. For
+example, `String::Replace()` works by creating a new `StringImpl`
+object with the replaced characters rather than by mutating the
+original `StringImpl` object.
+
+
+Rather than using a fixed `sizeof(StringImpl)` allocation size,
+`StringImpl` object are allocated a variable amount of memory and
+store their character data in the same memory allocation as the
+`StringImpl` object itself. In a sense, the `StringImpl` object is a
+header for the actual array of characters, be they `LChar`s or
+`UChar`s.
+
+#### Reference count
+
+The `StringImpl` header contains three 32 bit fields. The first is a
+reference count, which is incremented and decremented by the
+`AddRef()` and `Release()` functions. When the reference count reaches
+zero the `StringImpl` object is deallocated, unless the `StringImpl`
+is marked static (in which case the `StringImpl` object is never
+deallocated).
+
+
+#### Length
+
+The next 32 bits represent the (potentially zero) length of the
+string. The length field always represents the number of code units,
+regardless of whether the `StringImpl` uses `LChar`s or `UChar`s. We
+use a 32 bit length on both 32-bit and 64-bit systems. To avoid
+creating a string whose length is too long to represent in 32 bits, we
+RELEASE_ASSERT that the length doesn’t overflow, which means we’ll
+crash in a controlled way if you try to create a string that’s
+absurdly long.
+
+#### Hash and flags
+
+The final 32 bits are used to cache the string’s hash value and to
+store a number of Boolean flags. We use 24 bits for the hash and
+reserve 8 bits for flags. As of April 2019, the flags represent
+whether the `StringImpl`:
+
+* ...contains only ASCII (7-bit), or whether this is unknown (2
+    flags).
+* ...is a member of the `AtomicString` table (discussed below).
+* ...contains `LChar`s or `UChar`s.
+* ...is Static and will never be deallocated (discussed below).
+
+We haven’t evaluated the performance trade-offs of caching the hash
+value in the `StringImpl` object recently. It’s possible that the
+cache hit rate is sufficiently low that we should remove the final 32
+bit field and move the two flags into the length, reducing the length
+field to 30 bits. Small changes to `StringImpl` can have large effects
+on the overall system, which means we should measure the performance
+impact of this sort of change carefully.
+
+
+## Construction
+
+There are multiple interfaces for constructing strings, each of which
+is useful in different situations.
+
+### `String` constructor
+The most straightforward interface for constructing strings is the
+`String` constructor. You should use this interface when the source
+data for the `String` is an array of `LChar`s or `UChar`s. The
+`String` constructor allocates a new `StringImpl` object and copies
+the source characters into the `StringImpl` object as efficiently as
+possible.
+
+
+The `String` constructor that takes a `UChar` array always creates a
+`UChar`-based `StringImpl` object even if all the source characters
+would actually fit in `LChar`s. If your source data is an array of
+`UChar`s but you have reason to believe that the string will usually
+be representable in Latin-1, you should consider using
+`StringImpl::Create8BitIfPossible()`, which creates an `LChar`- or
+`UChar`-based `StringImpl` object as appropriate at the cost of
+checking whether all the source characters can be represented in
+Latin-1.
+
+
+### `operator+`
+
+The + operator on Strings is the most efficient way to combine smaller
+strings into larger strings. Using templates, `operator+` builds a
+tree of temporary objects that mirrors the tree of `operator+`
+invocations. When the temporary objects are (implicitly) collapsed to
+a `String`, we first compute the length of the final string and then
+allocate a single `StringImpl` object of exactly the correct size.
+After allocating the object, we copy all the characters into the
+string. This approach means we copy the characters exactly once into
+the correctly sized buffer, which is maximally efficient.
+
+### `StringBuilder`
+
+If you’re unable to use `operator+` to build your string, for example
+because you need to use a loop, you should use `StringBuilder`. Like
+similar interfaces in other libraries, `StringBuilder` lets you build
+a `String` by incrementally appending content. `StringBuilder` tries
+to use 8-bit `StringImpl` objects whenever possible but will upconvert
+its internal buffer to 16 bits if necessarily.
+
+
+`StringBuilder::Append()` grows its buffer exponentially, which means
+`StringBuilder` avoids the pathologically bad O(N^2) performance that
+repeated appends/concatenation can cause. One way to further optimize
+`String` construction when using `StringBuilder` is to call
+`StringBuilder::ReserveCapacity()` with (an estimate of) the final
+length of the `String` (in code units) before appending characters. If
+you give `StringBuilder` an accurate estimate of the length of the
+string, `StringBuilder` can pre-allocate the appropriate amount of
+memory and avoid having to reallocate its buffer and copy your string.
+
+
+### `StringBuffer`
+
+Finally, there are some cases where neither the `String` constructor
+or `StringBuilder` work well. For example, sometimes rather than
+having a source array of `UChar`s from which to construct a String,
+you might have a function that will write the `UChar`s into a buffer
+you provide. `StringBuffer` can help you in these cases by allocating
+a character buffer and letting the function write into it.
+
+
+Conceptually, a `StringBuffer` represents the underlying character
+buffer from a String. However, unlike `StringImpl` objects,
+`StringBuffer`s are mutable. `StringBuffer`s work well when you know
+ahead of time exactly how large a buffer you need and whether you want
+to use `LChar`s or `UChar`s. If you’re uncertain about the length of
+the `String` you’re constructing, you probably should use
+`StringBuilder`.
+
+## Atomic Strings
+
+Some `StringImpl` objects are marked as _Atomic_, which means they’re
+stored in a thread-local `HashSet` called the `AtomicString` table.
+Rather than interacting directly with these anointed `StringImpl`
+objects, we usually hold pointers to them via `AtomicString` objects
+(rather than `String` objects). Using `AtomicString` rather than
+`String` to point to an Atomic `StringImpl` object lets the compiler
+generate (and skip!) the appropriate hash lookups in the
+`AtomicString` table as well as use faster comparison operations with
+other `AtomicString`s.
+
+### Construction
+
+Typically, constructing an `AtomicString` from a `String` object will
+involve a hash lookup in the `AtomicString` table for the current
+thread. If the string represented by the `String` object is not
+present in the `AtomicString` table, the `StringImpl` object from that
+`String` will be marked Atomic and added to the table. If the
+represented string is already present in the `AtomicString` table, the
+already-Atomic `StringImpl` object from the table will be used to
+construct the `AtomicString` rather than the `StringImpl` from the
+original String.
+
+
+If you wish to construct an `AtomicString` from an array of `LChar`s
+or `UChar`s, you should use the `AtomicString` constructor directly
+rather than first constructing a `String` object. If the string is
+already present in the `AtomicString` table, the `AtomicString`
+constructor will grab a reference to the existing Atomic `StringImpl`
+object rather than first allocating and populating a `StringImpl`
+object as the `String` constructor would.
+
+### Fast comparisons
+
+If two `StringImpl` objects are atomic, you can compare them for
+equality by comparing their addresses rather than by comparing them
+character-by-character. The reason this works is that we maintain the
+invariant that no two Atomic `StringImpl` objects on a given thread
+represent the same string. Therefore, the two `StringImpl` objects
+represent the same string if, and only if, they are actually the same
+`StringImpl` object. We’ve overloaded `operator==` on `AtomicString`
+to let the compiler generate these optimized comparisons
+automatically.
+
+### Deduplication
+
+Because there are no duplicate Atomic `StringImpl` objects on a given
+thread, `AtomicString`s are useful for coalescing duplicate strings
+into a single `StringImpl` object, saving memory. Unfortunately,
+`AtomicString`s are thread-specific and cannot be used to coalesce
+duplicate strings across threads.
+
+## Threading
+
+### Isolated copies
+
+String objects that you construct yourself are not thread-safe. The
+underlying `StringImpl` object can be shared only by `String` objects
+on the same thread because the `StringImpl`’s reference count isn’t
+incremented or decremented atomically.
+
+
+In some limited cases, you can safely send a `String` from one thread
+to another. In order for that to be safe, you need to make sure that
+the underlying `StringImpl` object has exactly one reference---the one
+you’re sending to another thread. If there is only one outstanding
+reference to the `StringImpl`, then there won’t be any reference count
+data races. The easiest way to get a `StringImpl` object with only one
+reference is to call `String::IsolatedCopy()`. You can check that a
+given `String` is safe to send to another thread by calling
+`String::IsSafeToSendToAnotherThread()`, typically in a `DCHECK`.
+
+
+If you look at the implementation of `IsSafeToSendToAnotherThread()`,
+you’ll notice that it always returns false if the `StringImpl` is
+Atomic, regardless of the reference count. That’s because Atomic
+`StringImpl` objects are not safe to send to another thread because
+they’re associated with an `AtomicString` table local to the current
+thread. If you do send an `AtomicString` to another thread and the
+`StringImpl` object is destructed on that thread, it will try to
+remove itself from that thread’s `AtomicString` table rather than from
+the original thread’s `AtomicString` table.
+
+### Static Strings
+
+At startup, we create a number of _Static_ `StringImpl` objects that
+are safe to use from any thread. These `StringImpl` objects maintain
+the invariant that the least significant bit of their reference count
+is always one, which means their reference count never reaches zero
+and they are never deallocated. In addition to preventing their
+deallocation, we also pre-populate the hash value to ensure that
+Static `StringImpl` objects are otherwise immutable.
+
+
+We first introduced these Static strings for the threaded HTML parser,
+but we are gradually using them more widely in the codebase. There are
+still some delicate interactions between Static strings and the
+`AtomicString` table, but hopefully we’ll smooth over these rough
+edges over time.
+
+# Conclusion
+
+This document contains a brief introduction to Blink’s `String` class.
+There are many details that are not included, but hopefully this
+document has given you a good high-level understanding of Strings.
+More details are available in the source, either in code or in
+comments. Happy hacking!
+
+_Originally authored by Adam Barth (abarth), 5 August 2013._
--- a/third_party/blink/renderer/platform/wtf/text/wtf_string.h
+++ b/third_party/blink/renderer/platform/wtf/text/wtf_string.h
@@ -57,8 +57,7 @@ enum UTF8ConversionMode {
             ? op##IgnoringASCIICase args               \
             : op##IgnoringCase args)

-// You can find documentation about this class in this doc:
-// https://docs.google.com/document/d/1kOCUlJdh2WJMJGDf-WoEQhmnjKLaOYRbiHz5TiGJl14/edit?usp=sharing
+// You can find documentation about this class in README.md in this directory.
 class WTF_EXPORT String {
  USING_FAST_MALLOC(String);