Commit 3df5254f authored by Samuel Huang, committed by Commit Bot

[Supersize] Refactor to prepare for LTO string support integration.

This CL refactors Supersize to prepare for bcanalyzer.py integration.
Details:
- Variable and function renames.
- Split _BulkObjectFileAnalyzerWorker.AnalyzePaths() (see the sketch after this list):
  - _ClassifyPaths(): Explicitly store .a files and .o files. For LTO
    we'll split .o into ELF and BC buckets.
  - _MakeBatches(): Reusable later for ELF and BC separately.
  - _DoBulkFork(): Reusable.
  - _RunNm(): Absorbs nm-specific code, will add alternative.
- Split _BulkObjectFileAnalyzerWorker.AnalyzeStringLiterals()
  - _ReadElfStringData(): Separation of concern.
  - _GetEncodedRangesFromStringAddresses(): ELF-specific code. Will
    add alternative.
  - Restructure how results are merged, with more comments.
- Split ResolveStringPiecesIndirect() (was ResolveStringPieces()):
  - _AnnotateStringData(): Reusable.
  - Will add alternative: ResolveStringPieces().
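
A minimal standalone sketch of the new classify-and-batch flow (simplified
stand-ins for the real methods; names and the batch size mirror this CL):

  def classify_paths(paths):
    arch_paths = [p for p in paths if p.endswith('.a')]
    obj_paths = [p for p in paths if not p.endswith('.a')]
    return arch_paths, obj_paths

  def make_batches(paths, size=None):
    if size is None:
      return [(p,) for p in paths]  # .a files: one per job, never grouped.
    return [(paths[i:i + size],) for i in xrange(0, len(paths), size)]

  arch_paths, obj_paths = classify_paths(['lib/foo.a', 'obj/a.o', 'obj/b.o'])
  batches = make_batches(arch_paths) + make_batches(obj_paths, size=50)
  # Each batch becomes one argument tuple for BulkForkAndCall(), which
  # _RunNm() points at nm.RunNmOnIntermediates.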

Bug: 723798
Change-Id: Ib8e0e1785ae11652a17d060c9629e3986df0f93a
Reviewed-on: https://chromium-review.googlesource.com/1146775
Commit-Queue: Samuel Huang <huangs@chromium.org>
Reviewed-by: Samuel Huang <huangs@chromium.org>
Reviewed-by: agrieve <agrieve@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577274}
parent eafc94cb
......@@ -817,8 +817,8 @@ def _ParseElfInfo(map_path, elf_path, tool_prefix, track_string_literals,
# More likely for there to be a bug in supersize than an ELF to not have a
# single string literal.
assert merge_string_syms
string_positions = [(s.address, s.size) for s in merge_string_syms]
bulk_analyzer.AnalyzeStringLiterals(elf_path, string_positions)
string_ranges = [(s.address, s.size) for s in merge_string_syms]
bulk_analyzer.AnalyzeStringLiterals(elf_path, string_ranges)
logging.info('Stripping linker prefixes from symbol names')
_StripLinkerAddedSymbolPrefixes(raw_symbols)
......
......@@ -2,12 +2,10 @@
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
"""Runs nm on every .o file that comprises an ELF (plus some analysis).
The design of this file is entirely to work around Python's lack of concurrency.
"""Runs nm on specified .a and .o file, plus some analysis.
CollectAliasesByAddress():
Runs "nm" on the elf to collect all symbol names. This reveals symbol names of
Runs nm on the elf to collect all symbol names. This reveals symbol names of
identical-code-folded functions.
CollectAliasesByAddressAsync():
......@@ -20,7 +18,6 @@ RunNmOnIntermediates():
"""
import collections
import os
import subprocess
import concurrent
......@@ -146,7 +143,7 @@ def _ParseOneObjectFileNmOutput(lines):
string_addresses.append(line[:space_idx].lstrip('0') or '0')
elif _IsRelevantObjectFileName(mangled_name):
symbol_names.add(mangled_name)
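# symbol_names holds mangled names defined in this object file;
# string_addresses holds the addresses of its string literal symbols, kept as
# hex strings with leading zeros stripped (see above).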
return string_addresses, symbol_names
return symbol_names, string_addresses
# This is a target for BulkForkAndCall().
......@@ -176,14 +173,14 @@ def RunNmOnIntermediates(target, tool_prefix, output_directory):
assert not is_archive
path = target[0]
string_addresses_by_path = {}
symbol_names_by_path = {}
string_addresses_by_path = {}
while path:
if is_archive:
# E.g. foo/bar.a(baz.o)
path = '%s(%s)' % (target, path)
string_addresses, mangled_symbol_names = _ParseOneObjectFileNmOutput(lines)
mangled_symbol_names, string_addresses = _ParseOneObjectFileNmOutput(lines)
symbol_names_by_path[path] = mangled_symbol_names
if string_addresses:
string_addresses_by_path[path] = string_addresses
......
......@@ -7,6 +7,10 @@
This file works around Python's lack of concurrency.
_BulkObjectFileAnalyzerWorker:
Performs the actual work. Uses Process Pools to shard out per-object-file
work and then aggregates results.
_BulkObjectFileAnalyzerMaster:
Creates a subprocess and sends IPCs to it asking it to do work.
......@@ -15,19 +19,17 @@ _BulkObjectFileAnalyzerSlave:
Runs _BulkObjectFileAnalyzerWorker on a background thread in order to stay
responsive to IPCs.
_BulkObjectFileAnalyzerWorker:
Performs the actual work. Uses Process Pools to shard out per-object-file
work and then aggregates results.
BulkObjectFileAnalyzer:
Alias for _BulkObjectFileAnalyzerMaster, but when SUPERSIZE_DISABLE_ASYNC=1,
alias for _BulkObjectFileAnalyzerWorker.
* AnalyzePaths: Run "nm" on all .o files to collect symbol names that exist
Extracts information from .o files. Alias for _BulkObjectFileAnalyzerMaster,
but when SUPERSIZE_DISABLE_ASYNC=1, alias for _BulkObjectFileAnalyzerWorker.
* AnalyzePaths(): Processes all .o files to collect symbol names that exist
within each. Does not work with thin archives (expand them first).
* SortPaths: Sort results of AnalyzePaths().
* AnalyzeStringLiterals: Must be run after AnalyzePaths() has completed.
* SortPaths(): Sort results of AnalyzePaths().
* AnalyzeStringLiterals(): Must be run after AnalyzePaths() has completed.
Extracts string literals from .o files, and then locates them within the
"** merge strings" sections within an ELF's .rodata section.
* GetSymbolNames(): Accessor.
* Close(): Disposes data.
This file can also be run stand-alone in order to test out the logic on smaller
sample sizes.
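Illustrative call sequence (inferred from the method list above; variable
names are placeholders):
  analyzer = BulkObjectFileAnalyzer(tool_prefix, output_directory)
  analyzer.AnalyzePaths(object_and_archive_paths)
  analyzer.SortPaths()
  analyzer.AnalyzeStringLiterals(elf_path, string_ranges)  # After AnalyzePaths().
  symbol_names = analyzer.GetSymbolNames()
  analyzer.Close()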
......@@ -78,36 +80,58 @@ def _MakeToolPrefixAbsolute(tool_prefix):
return tool_prefix
class _PathsByType:
def __init__(self, arch, obj):
self.arch = arch
self.obj = obj
class _BulkObjectFileAnalyzerWorker(object):
def __init__(self, tool_prefix, output_directory):
self._tool_prefix = _MakeToolPrefixAbsolute(tool_prefix)
self._output_directory = output_directory
self._list_of_encoded_elf_string_ranges_by_path = None
self._paths_by_name = collections.defaultdict(list)
self._encoded_string_addresses_by_path_chunks = []
self._list_of_encoded_elf_string_positions_by_path = None
def AnalyzePaths(self, paths):
def iter_job_params():
object_paths = []
for path in paths:
# Note: ResolveStringPieces() relies upon .a not being grouped.
if path.endswith('.a'):
yield (path,)
else:
object_paths.append(path)
BATCH_SIZE = 50 # Chosen arbitrarily.
for i in xrange(0, len(object_paths), BATCH_SIZE):
batch = object_paths[i:i + BATCH_SIZE]
yield (batch,)
params = list(iter_job_params())
def _ClassifyPaths(self, paths):
"""Classifies |paths| (.o and .a files) by file type into separate lists.
Returns:
A _PathsByType instance storing classified disjoint sublists of |paths|.
"""
arch_paths = []
obj_paths = []
for path in paths:
if path.endswith('.a'):
arch_paths.append(path)
else:
obj_paths.append(path)
return _PathsByType(arch=arch_paths, obj=obj_paths)
def _MakeBatches(self, paths, size=None):
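"""Wraps paths into 1-tuples, to serve as argument tuples for BulkForkAndCall().
If |size| is given, each 1-tuple instead holds a list of up to |size| paths.
E.g. (illustrative): _MakeBatches(['a.o', 'b.o', 'c.o'], size=2)
  -> [(['a.o', 'b.o'],), (['c.o'],)].
"""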
if size is None:
# Create 1-tuples of strings.
return [(p,) for p in paths]
# Create 1-tuples of arrays of strings.
return [(paths[i:i + size],) for i in xrange(0, len(paths), size)]
def _DoBulkFork(self, runner, batches):
# Order of the jobs doesn't matter since each job owns independent paths,
# and our output is a dict where paths are the key.
results = concurrent.BulkForkAndCall(
nm.RunNmOnIntermediates, params, tool_prefix=self._tool_prefix,
return concurrent.BulkForkAndCall(
runner, batches, tool_prefix=self._tool_prefix,
output_directory=self._output_directory)
def _RunNm(self, paths_by_type):
"""Calls nm to get symbols and string addresses."""
# Downstream functions rely upon .a not being grouped.
batches = self._MakeBatches(paths_by_type.arch, None)
# Combine object files and Bitcode files for nm
BATCH_SIZE = 50 # Arbitrarily chosen.
batches.extend(self._MakeBatches(paths_by_type.obj, BATCH_SIZE))
results = self._DoBulkFork(nm.RunNmOnIntermediates, batches)
# Names are still mangled.
all_paths_by_name = self._paths_by_name
for encoded_syms, encoded_strs in results:
......@@ -118,40 +142,60 @@ class _BulkObjectFileAnalyzerWorker(object):
if encoded_strs != concurrent.EMPTY_ENCODED_DICT:
self._encoded_string_addresses_by_path_chunks.append(encoded_strs)
def AnalyzePaths(self, paths):
logging.debug('worker: AnalyzePaths() started.')
paths_by_type = self._ClassifyPaths(paths)
logging.info('File counts: {\'arch\': %d, \'obj\': %d}',
len(paths_by_type.arch), len(paths_by_type.obj))
self._RunNm(paths_by_type)
logging.debug('worker: AnalyzePaths() completed.')
def SortPaths(self):
# Finally, demangle all names, which can result in some merging of lists.
# Demangle all names, which can result in some merging of lists.
self._paths_by_name = demangle.DemangleKeysAndMergeLists(
self._paths_by_name, self._tool_prefix)
# Sort and uniquify.
for key in self._paths_by_name.iterkeys():
self._paths_by_name[key] = sorted(set(self._paths_by_name[key]))
def AnalyzeStringLiterals(self, elf_path, elf_string_positions):
logging.debug('worker: AnalyzeStringLiterals() started.')
# Read string_data from elf_path, to be shared by forked processes.
def _ReadElfStringData(self, elf_path, elf_string_ranges):
# Read string_data from elf_path, to be shared with forked processes.
address, offset, _ = string_extract.LookupElfRodataInfo(
elf_path, self._tool_prefix)
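# |elf_string_ranges| holds (address, size) pairs whose addresses are virtual
# addresses inside .rodata; subtracting .rodata's (virtual address - file
# offset) delta converts them to file offsets for ReadFileChunks().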
adjust = address - offset
abs_string_positions = (
(addr - adjust, s) for addr, s in elf_string_positions)
string_data = string_extract.ReadFileChunks(elf_path, abs_string_positions)
abs_elf_string_ranges = (
(addr - adjust, s) for addr, s in elf_string_ranges)
return string_extract.ReadFileChunks(elf_path, abs_elf_string_ranges)
def _GetEncodedRangesFromStringAddresses(self, string_data):
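"""Bulk-forks ResolveStringPiecesIndirect() over stored string address chunks.
Returns:
  One entry per forked job; each entry is a list (one item per string
  section) of encoded {path: [string_ranges]} dicts.
"""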
params = ((chunk,)
for chunk in self._encoded_string_addresses_by_path_chunks)
# Order of the jobs doesn't matter since each job owns independent paths,
# and our output is a dict where paths are the key.
results = concurrent.BulkForkAndCall(
string_extract.ResolveStringPieces, params, string_data=string_data,
tool_prefix=self._tool_prefix, output_directory=self._output_directory)
results = list(results)
final_result = []
for i in xrange(len(elf_string_positions)):
final_result.append(
concurrent.JoinEncodedDictOfLists([r[i] for r in results]))
self._list_of_encoded_elf_string_positions_by_path = final_result
string_extract.ResolveStringPiecesIndirect, params,
string_data=string_data, tool_prefix=self._tool_prefix,
output_directory=self._output_directory)
return list(results)
def AnalyzeStringLiterals(self, elf_path, elf_string_ranges):
logging.debug('worker: AnalyzeStringLiterals() started.')
string_data = self._ReadElfStringData(elf_path, elf_string_ranges)
# [source_idx][batch_idx][section_idx] -> Encoded {path: [string_ranges]}.
encoded_ranges_sources = [
self._GetEncodedRangesFromStringAddresses(string_data),
]
# [section_idx] -> {path: [string_ranges]}.
self._list_of_encoded_elf_string_ranges_by_path = []
# Contract [source_idx] and [batch_idx], then decode and join.
for section_idx in xrange(len(elf_string_ranges)): # Fetch result.
t = []
for encoded_ranges in encoded_ranges_sources: # [source_idx].
t.extend([b[section_idx] for b in encoded_ranges]) # [batch_idx].
self._list_of_encoded_elf_string_ranges_by_path.append(
concurrent.JoinEncodedDictOfLists(t))
logging.debug('worker: AnalyzeStringLiterals() completed.')
def GetSymbolNames(self):
......@@ -159,10 +203,10 @@ class _BulkObjectFileAnalyzerWorker(object):
def GetStringPositions(self):
return [concurrent.DecodeDictOfLists(x, value_transform=_DecodePosition)
for x in self._list_of_encoded_elf_string_positions_by_path]
for x in self._list_of_encoded_elf_string_ranges_by_path]
def GetEncodedStringPositions(self):
return self._list_of_encoded_elf_string_positions_by_path
return self._list_of_encoded_elf_string_ranges_by_path
def Close(self):
pass
......@@ -198,7 +242,7 @@ class _BulkObjectFileAnalyzerMaster(object):
else:
# We are the child process.
logging.root.handlers[0].setFormatter(logging.Formatter(
'nm: %(levelname).1s %(relativeCreated)6d %(message)s'))
'obj_analyzer: %(levelname).1s %(relativeCreated)6d %(message)s'))
worker_analyzer = _BulkObjectFileAnalyzerWorker(
self._tool_prefix, self._output_directory)
slave = _BulkObjectFileAnalyzerSlave(worker_analyzer, child_conn)
......@@ -215,8 +259,8 @@ class _BulkObjectFileAnalyzerMaster(object):
def SortPaths(self):
self._pipe.send((_MSG_SORT_PATHS,))
def AnalyzeStringLiterals(self, elf_path, string_positions):
self._pipe.send((_MSG_ANALYZE_STRINGS, elf_path, string_positions))
def AnalyzeStringLiterals(self, elf_path, elf_string_ranges):
self._pipe.send((_MSG_ANALYZE_STRINGS, elf_path, elf_string_ranges))
def GetSymbolNames(self):
self._pipe.send((_MSG_GET_SYMBOL_NAMES,))
......@@ -308,7 +352,7 @@ class _BulkObjectFileAnalyzerSlave(object):
if e.errno in (errno.EPIPE, errno.ECONNRESET):
sys.exit(1)
logging.debug('nm bulk subprocess finished.')
logging.debug('bulk subprocess finished.')
sys.exit(0)
......
......@@ -10,7 +10,7 @@ LookupElfRodataInfo():
ReadFileChunks():
Reads raw data from a file, given a list of ranges in the file.
ResolveStringPieces():
ResolveStringPiecesIndirect():
BulkForkAndCall() target: Given {path: [string addresses]} and
[raw_string_data for each string_section]:
- Reads {path: [src_strings]}.
......@@ -46,17 +46,17 @@ def LookupElfRodataInfo(elf_path, tool_prefix):
raise AssertionError('No .rodata for command: ' + repr(args))
def ReadFileChunks(path, positions):
"""Returns a list of strings corresponding to |positions|.
def ReadFileChunks(path, section_ranges):
"""Returns a list of raw data from |path|, specified by |section_ranges|.
Args:
positions: List of (offset, size).
section_ranges: List of (offset, size).
"""
ret = []
if not positions:
if not section_ranges:
return ret
with open(path, 'rb') as f:
for offset, size in positions:
for offset, size in section_ranges:
f.seek(offset)
ret.append(f.read(size))
return ret
......@@ -190,9 +190,59 @@ def _IterStringLiterals(path, addresses, obj_sections):
yield section_data[prev_offset:]
def _AnnotateStringData(string_data, path_value_gen):
"""Annotates each |string_data| section data with paths and ranges.
Args:
string_data: [raw_string_data for each string_section] from an ELF file.
path_value_gen: A generator of (path, value) pairs, where |path|
is the path to an object file and |value| is a string to annotate.
Returns:
[{path: [string_ranges]} for each string_section].
"""
ret = [collections.defaultdict(list) for _ in string_data]
# Brute-force search ** merge strings sections in |string_data| for string
# values from |path_value_gen|. This is by far the slowest part of
# AnalyzeStringLiterals().
# TODO(agrieve): Pre-process |string_data| into a dict of literal->address (at
# least for ASCII strings).
for path, value in path_value_gen:
first_match = -1
first_match_dict = None
for target_dict, data in itertools.izip(ret, string_data):
# Set offset so that it will be 0 when len(value) is added to it below.
offset = -len(value)
while True:
offset = data.find(value, offset + len(value))
if offset == -1:
break
# Preferring exact matches (those following \0) over substring matches
# significantly increases accuracy (although it shows that the linker isn't
# being optimal).
if offset == 0 or data[offset - 1] == '\0':
break
if first_match == -1:
first_match = offset
first_match_dict = target_dict
if offset != -1:
break
if offset == -1:
# Exact match not found, so take suffix match if it exists.
offset = first_match
target_dict = first_match_dict
# Missing strings happen when optimizations make them unused.
if offset != -1:
# Encode tuple as a string for easier marshalling.
target_dict[path].append(str(offset) + ':' + str(len(value)))
return ret
# This is a target for BulkForkAndCall().
def ResolveStringPieces(encoded_string_addresses_by_path, string_data,
tool_prefix, output_directory):
def ResolveStringPiecesIndirect(encoded_string_addresses_by_path, string_data,
tool_prefix, output_directory):
string_addresses_by_path = concurrent.DecodeDictOfLists(
encoded_string_addresses_by_path)
# Assign |target| as archive path, or a list of object paths.
......@@ -208,42 +258,11 @@ def ResolveStringPieces(encoded_string_addresses_by_path, string_data,
string_sections_by_path = _ReadStringSections(
target, output_directory, section_positions_by_path)
# list of elf_positions_by_path.
ret = [collections.defaultdict(list) for _ in string_data]
# Brute-force search of strings within ** merge strings sections.
# This is by far the slowest part of AnalyzeStringLiterals().
# TODO(agrieve): Pre-process string_data into a dict of literal->address (at
# least for ascii strings).
for path, object_addresses in string_addresses_by_path.iteritems():
for value in _IterStringLiterals(
path, object_addresses, string_sections_by_path.get(path)):
first_match = -1
first_match_dict = None
for target_dict, data in itertools.izip(ret, string_data):
# Set offset so that it will be 0 when len(value) is added to it below.
offset = -len(value)
while True:
offset = data.find(value, offset + len(value))
if offset == -1:
break
# Preferring exact matches (those following \0) over substring matches
# significantly increases accuracy (although shows that linker isn't
# being optimal).
if offset == 0 or data[offset - 1] == '\0':
break
if first_match == -1:
first_match = offset
first_match_dict = target_dict
if offset != -1:
break
if offset == -1:
# Exact match not found, so take suffix match if it exists.
offset = first_match
target_dict = first_match_dict
# Missing strings happen when optimization make them unused.
if offset != -1:
# Encode tuple as a string for easier mashalling.
target_dict[path].append(
str(offset) + ':' + str(len(value)))
def GeneratePathAndValues():
for path, object_addresses in string_addresses_by_path.iteritems():
for value in _IterStringLiterals(
path, object_addresses, string_sections_by_path.get(path)):
yield path, value
ret = _AnnotateStringData(string_data, GeneratePathAndValues())
return [concurrent.EncodeDictOfLists(x) for x in ret]