Commit 3df5254f authored by Samuel Huang, committed by Commit Bot

[Supersize] Refactor to prepare for LTO string support integration.

This CL refactors Supersize to prepare for bcanalyzer.py integration.
Details:
- Variable and function renames.
- Split _BulkObjectFileAnalyzerWorker.AnalyzePaths():
  - _ClassifyPaths(): Explicitly store .a files and .o files. For LTO
    we'll split .o into ELF and BC buckets.
  - _MakeBatches(): Reusable later for ELF and BC separately.
  - _DoBulkFork(): Reusable.
  - _RunNm(): Absorbs nm-specific code, will add alternative.
- Split _BulkObjectFileAnalyzerWorker.AnalyzeStringLiterals():
  - _ReadElfStringData(): Separation of concern.
  - _GetEncodedRangesFromStringAddresses(): ELF-specific code. Will
    add alternative.
  - Restructure how results are merged, with more comments.
- Split ResolveStringPiecesIndirect() (was ResolveStringPieces()):
  - _AnnotateStringData(): Reusable.
  - Will add alternative: ResolveStringPieces().
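
As a reading aid (not part of this CL), here is a minimal standalone sketch of the batching behavior that the new _ClassifyPaths()/_MakeBatches() helpers described above implement. The function names and file paths below are made up for illustration; the real code lives on _BulkObjectFileAnalyzerWorker and is shown in the diff further down.

# Illustrative stand-ins mirroring the new helpers (hypothetical names).
def classify_paths(paths):
  # .a archives are kept separate; everything else is treated as a .o file.
  arch_paths = [p for p in paths if p.endswith('.a')]
  obj_paths = [p for p in paths if not p.endswith('.a')]
  return arch_paths, obj_paths

def make_batches(paths, size=None):
  if size is None:
    # One 1-tuple per path: each .a file becomes its own job.
    return [(p,) for p in paths]
  # 1-tuples wrapping lists of paths: .o files are grouped |size| at a time.
  return [(paths[i:i + size],) for i in range(0, len(paths), size)]

arch, obj = classify_paths(['libfoo.a', 'a.o', 'b.o', 'c.o'])
print(make_batches(arch) + make_batches(obj, size=2))
# [('libfoo.a',), (['a.o', 'b.o'],), (['c.o'],)]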

Bug: 723798
Change-Id: Ib8e0e1785ae11652a17d060c9629e3986df0f93a
Reviewed-on: https://chromium-review.googlesource.com/1146775
Commit-Queue: Samuel Huang <huangs@chromium.org>
Reviewed-by: Samuel Huang <huangs@chromium.org>
Reviewed-by: agrieve <agrieve@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577274}
parent eafc94cb
@@ -817,8 +817,8 @@ def _ParseElfInfo(map_path, elf_path, tool_prefix, track_string_literals,
     # More likely for there to be a bug in supersize than an ELF to not have a
     # single string literal.
     assert merge_string_syms
-    string_positions = [(s.address, s.size) for s in merge_string_syms]
-    bulk_analyzer.AnalyzeStringLiterals(elf_path, string_positions)
+    string_ranges = [(s.address, s.size) for s in merge_string_syms]
+    bulk_analyzer.AnalyzeStringLiterals(elf_path, string_ranges)
   logging.info('Stripping linker prefixes from symbol names')
   _StripLinkerAddedSymbolPrefixes(raw_symbols)
...
@@ -2,12 +2,10 @@
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.
-"""Runs nm on every .o file that comprises an ELF (plus some analysis).
+"""Runs nm on specified .a and .o file, plus some analysis.
-The design of this file is entirely to work around Python's lack of concurrency.
 CollectAliasesByAddress():
-  Runs "nm" on the elf to collect all symbol names. This reveals symbol names of
+  Runs nm on the elf to collect all symbol names. This reveals symbol names of
   identical-code-folded functions.
 CollectAliasesByAddressAsync():
@@ -20,7 +18,6 @@ RunNmOnIntermediates():
 """
 import collections
-import os
 import subprocess
 import concurrent
@@ -146,7 +143,7 @@ def _ParseOneObjectFileNmOutput(lines):
       string_addresses.append(line[:space_idx].lstrip('0') or '0')
     elif _IsRelevantObjectFileName(mangled_name):
       symbol_names.add(mangled_name)
-  return string_addresses, symbol_names
+  return symbol_names, string_addresses
 # This is a target for BulkForkAndCall().
@@ -176,14 +173,14 @@ def RunNmOnIntermediates(target, tool_prefix, output_directory):
     assert not is_archive
     path = target[0]
-  string_addresses_by_path = {}
   symbol_names_by_path = {}
+  string_addresses_by_path = {}
   while path:
     if is_archive:
       # E.g. foo/bar.a(baz.o)
       path = '%s(%s)' % (target, path)
-    string_addresses, mangled_symbol_names = _ParseOneObjectFileNmOutput(lines)
+    mangled_symbol_names, string_addresses = _ParseOneObjectFileNmOutput(lines)
     symbol_names_by_path[path] = mangled_symbol_names
     if string_addresses:
       string_addresses_by_path[path] = string_addresses
...
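
An aside on the unchanged parsing line shown above (line[:space_idx].lstrip('0') or '0'): it keeps the leading hex address field of an nm output line, drops leading zeros, and falls back to '0' when the address is all zeros. A quick illustration with made-up nm-style lines (not from this CL):

lines = [
    '0000012c r .L.str.3',  # hypothetical nm output line
    '00000000 r .L.str',
]
for line in lines:
  space_idx = line.index(' ')
  # Same expression as in _ParseOneObjectFileNmOutput().
  print(line[:space_idx].lstrip('0') or '0')
# Prints '12c', then '0'.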
@@ -7,6 +7,10 @@
 This file works around Python's lack of concurrency.
+_BulkObjectFileAnalyzerWorker:
+  Performs the actual work. Uses Process Pools to shard out per-object-file
+  work and then aggregates results.
 _BulkObjectFileAnalyzerMaster:
   Creates a subprocess and sends IPCs to it asking it to do work.
@@ -15,19 +19,17 @@ _BulkObjectFileAnalyzerSlave:
   Runs _BulkObjectFileAnalyzerWorker on a background thread in order to stay
   responsive to IPCs.
-_BulkObjectFileAnalyzerWorker:
-  Performs the actual work. Uses Process Pools to shard out per-object-file
-  work and then aggregates results.
 BulkObjectFileAnalyzer:
-  Alias for _BulkObjectFileAnalyzerMaster, but when SUPERSIZE_DISABLE_ASYNC=1,
-  alias for _BulkObjectFileAnalyzerWorker.
-  * AnalyzePaths: Run "nm" on all .o files to collect symbol names that exist
+  Extracts information from .o files. Alias for _BulkObjectFileAnalyzerMaster,
+  but when SUPERSIZE_DISABLE_ASYNC=1, alias for _BulkObjectFileAnalyzerWorker.
+  * AnalyzePaths(): Processes all .o files to collect symbol names that exist
     within each. Does not work with thin archives (expand them first).
-  * SortPaths: Sort results of AnalyzePaths().
-  * AnalyzeStringLiterals: Must be run after AnalyzePaths() has completed.
+  * SortPaths(): Sort results of AnalyzePaths().
+  * AnalyzeStringLiterals(): Must be run after AnalyzePaths() has completed.
     Extracts string literals from .o files, and then locates them within the
     "** merge strings" sections within an ELF's .rodata section.
+  * GetSymbolNames(): Accessor.
+  * Close(): Disposes data.
 This file can also be run stand-alone in order to test out the logic on smaller
 sample sizes.
@@ -78,36 +80,58 @@ def _MakeToolPrefixAbsolute(tool_prefix):
   return tool_prefix
+class _PathsByType:
+  def __init__(self, arch, obj):
+    self.arch = arch
+    self.obj = obj
 class _BulkObjectFileAnalyzerWorker(object):
   def __init__(self, tool_prefix, output_directory):
     self._tool_prefix = _MakeToolPrefixAbsolute(tool_prefix)
     self._output_directory = output_directory
+    self._list_of_encoded_elf_string_ranges_by_path = None
     self._paths_by_name = collections.defaultdict(list)
     self._encoded_string_addresses_by_path_chunks = []
-    self._list_of_encoded_elf_string_positions_by_path = None
-  def AnalyzePaths(self, paths):
-    def iter_job_params():
-      object_paths = []
-      for path in paths:
-        # Note: ResolveStringPieces() relies upon .a not being grouped.
-        if path.endswith('.a'):
-          yield (path,)
-        else:
-          object_paths.append(path)
-      BATCH_SIZE = 50 # Chosen arbitrarily.
-      for i in xrange(0, len(object_paths), BATCH_SIZE):
-        batch = object_paths[i:i + BATCH_SIZE]
-        yield (batch,)
-    params = list(iter_job_params())
+  def _ClassifyPaths(self, paths):
+    """Classifies |paths| (.o and .a files) by file type into separate lists.
+    Returns:
+      A _PathsByType instance storing classified disjoint sublists of |paths|.
+    """
+    arch_paths = []
+    obj_paths = []
+    for path in paths:
+      if path.endswith('.a'):
+        arch_paths.append(path)
+      else:
+        obj_paths.append(path)
+    return _PathsByType(arch=arch_paths, obj=obj_paths)
+  def _MakeBatches(self, paths, size=None):
+    if size is None:
+      # Create 1-tuples of strings.
+      return [(p,) for p in paths]
+    # Create 1-tuples of arrays of strings.
+    return [(paths[i:i + size],) for i in xrange(0, len(paths), size)]
+  def _DoBulkFork(self, runner, batches):
     # Order of the jobs doesn't matter since each job owns independent paths,
     # and our output is a dict where paths are the key.
-    results = concurrent.BulkForkAndCall(
-        nm.RunNmOnIntermediates, params, tool_prefix=self._tool_prefix,
+    return concurrent.BulkForkAndCall(
+        runner, batches, tool_prefix=self._tool_prefix,
         output_directory=self._output_directory)
+  def _RunNm(self, paths_by_type):
+    """Calls nm to get symbols and string addresses."""
+    # Downstream functions rely upon .a not being grouped.
+    batches = self._MakeBatches(paths_by_type.arch, None)
+    # Combine object files and Bitcode files for nm
+    BATCH_SIZE = 50 # Arbitrarily chosen.
+    batches.extend(self._MakeBatches(paths_by_type.obj, BATCH_SIZE))
+    results = self._DoBulkFork(nm.RunNmOnIntermediates, batches)
     # Names are still mangled.
     all_paths_by_name = self._paths_by_name
     for encoded_syms, encoded_strs in results:
@@ -118,40 +142,60 @@ class _BulkObjectFileAnalyzerWorker(object):
       if encoded_strs != concurrent.EMPTY_ENCODED_DICT:
         self._encoded_string_addresses_by_path_chunks.append(encoded_strs)
+  def AnalyzePaths(self, paths):
+    logging.debug('worker: AnalyzePaths() started.')
+    paths_by_type = self._ClassifyPaths(paths)
+    logging.info('File counts: {\'arch\': %d, \'obj\': %d}',
+                 len(paths_by_type.arch), len(paths_by_type.obj))
+    self._RunNm(paths_by_type)
     logging.debug('worker: AnalyzePaths() completed.')
   def SortPaths(self):
-    # Finally, demangle all names, which can result in some merging of lists.
+    # Demangle all names, which can result in some merging of lists.
     self._paths_by_name = demangle.DemangleKeysAndMergeLists(
         self._paths_by_name, self._tool_prefix)
     # Sort and uniquefy.
     for key in self._paths_by_name.iterkeys():
       self._paths_by_name[key] = sorted(set(self._paths_by_name[key]))
-  def AnalyzeStringLiterals(self, elf_path, elf_string_positions):
-    logging.debug('worker: AnalyzeStringLiterals() started.')
-    # Read string_data from elf_path, to be shared by forked processes.
+  def _ReadElfStringData(self, elf_path, elf_string_ranges):
+    # Read string_data from elf_path, to be shared with forked processes.
     address, offset, _ = string_extract.LookupElfRodataInfo(
         elf_path, self._tool_prefix)
     adjust = address - offset
-    abs_string_positions = (
-        (addr - adjust, s) for addr, s in elf_string_positions)
-    string_data = string_extract.ReadFileChunks(elf_path, abs_string_positions)
+    abs_elf_string_ranges = (
+        (addr - adjust, s) for addr, s in elf_string_ranges)
+    return string_extract.ReadFileChunks(elf_path, abs_elf_string_ranges)
+  def _GetEncodedRangesFromStringAddresses(self, string_data):
     params = ((chunk,)
               for chunk in self._encoded_string_addresses_by_path_chunks)
     # Order of the jobs doesn't matter since each job owns independent paths,
     # and our output is a dict where paths are the key.
     results = concurrent.BulkForkAndCall(
-        string_extract.ResolveStringPieces, params, string_data=string_data,
-        tool_prefix=self._tool_prefix, output_directory=self._output_directory)
-    results = list(results)
-    final_result = []
-    for i in xrange(len(elf_string_positions)):
-      final_result.append(
-          concurrent.JoinEncodedDictOfLists([r[i] for r in results]))
-    self._list_of_encoded_elf_string_positions_by_path = final_result
+        string_extract.ResolveStringPiecesIndirect, params,
+        string_data=string_data, tool_prefix=self._tool_prefix,
+        output_directory=self._output_directory)
+    return list(results)
+  def AnalyzeStringLiterals(self, elf_path, elf_string_ranges):
+    logging.debug('worker: AnalyzeStringLiterals() started.')
+    string_data = self._ReadElfStringData(elf_path, elf_string_ranges)
+    # [source_idx][batch_idx][section_idx] -> Encoded {path: [string_ranges]}.
+    encoded_ranges_sources = [
+        self._GetEncodedRangesFromStringAddresses(string_data),
+    ]
+    # [section_idx] -> {path: [string_ranges]}.
+    self._list_of_encoded_elf_string_ranges_by_path = []
+    # Contract [source_idx] and [batch_idx], then decode and join.
+    for section_idx in xrange(len(elf_string_ranges)): # Fetch result.
+      t = []
+      for encoded_ranges in encoded_ranges_sources: # [source_idx].
+        t.extend([b[section_idx] for b in encoded_ranges]) # [batch_idx].
+      self._list_of_encoded_elf_string_ranges_by_path.append(
+          concurrent.JoinEncodedDictOfLists(t))
     logging.debug('worker: AnalyzeStringLiterals() completed.')
   def GetSymbolNames(self):
@@ -159,10 +203,10 @@ class _BulkObjectFileAnalyzerWorker(object):
   def GetStringPositions(self):
     return [concurrent.DecodeDictOfLists(x, value_transform=_DecodePosition)
-            for x in self._list_of_encoded_elf_string_positions_by_path]
+            for x in self._list_of_encoded_elf_string_ranges_by_path]
   def GetEncodedStringPositions(self):
-    return self._list_of_encoded_elf_string_positions_by_path
+    return self._list_of_encoded_elf_string_ranges_by_path
   def Close(self):
     pass
@@ -198,7 +242,7 @@ class _BulkObjectFileAnalyzerMaster(object):
     else:
       # We are the child process.
       logging.root.handlers[0].setFormatter(logging.Formatter(
-          'nm: %(levelname).1s %(relativeCreated)6d %(message)s'))
+          'obj_analyzer: %(levelname).1s %(relativeCreated)6d %(message)s'))
       worker_analyzer = _BulkObjectFileAnalyzerWorker(
           self._tool_prefix, self._output_directory)
       slave = _BulkObjectFileAnalyzerSlave(worker_analyzer, child_conn)
@@ -215,8 +259,8 @@ class _BulkObjectFileAnalyzerMaster(object):
   def SortPaths(self):
     self._pipe.send((_MSG_SORT_PATHS,))
-  def AnalyzeStringLiterals(self, elf_path, string_positions):
-    self._pipe.send((_MSG_ANALYZE_STRINGS, elf_path, string_positions))
+  def AnalyzeStringLiterals(self, elf_path, elf_string_ranges):
+    self._pipe.send((_MSG_ANALYZE_STRINGS, elf_path, elf_string_ranges))
   def GetSymbolNames(self):
     self._pipe.send((_MSG_GET_SYMBOL_NAMES,))
@@ -308,7 +352,7 @@ class _BulkObjectFileAnalyzerSlave(object):
         if e.errno in (errno.EPIPE, errno.ECONNRESET):
           sys.exit(1)
-    logging.debug('nm bulk subprocess finished.')
+    logging.debug('bulk subprocess finished.')
     sys.exit(0)
...
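
The indexing comments added to AnalyzeStringLiterals() above ([source_idx][batch_idx][section_idx] -> ...) are easier to follow with toy data. Below is a minimal sketch of the contraction loop with made-up values; a plain dict merge stands in for concurrent.JoinEncodedDictOfLists(), and the helper names are hypothetical.

# Toy stand-in for concurrent.JoinEncodedDictOfLists(): merge dicts of lists.
def join_dicts_of_lists(dicts):
  merged = {}
  for d in dicts:
    for path, ranges in d.items():
      merged.setdefault(path, []).extend(ranges)
  return merged

# [source_idx][batch_idx][section_idx] -> {path: [string_ranges]}.
# One source (nm-based string addresses), two batches, two ELF string sections.
encoded_ranges_sources = [
    [
        [{'a.o': ['0:5']}, {'a.o': ['12:3']}],  # batch 0: section 0, section 1
        [{'b.o': ['20:7']}, {}],                # batch 1: section 0, section 1
    ],
]

# Contract [source_idx] and [batch_idx]; keep one merged dict per section.
num_sections = 2
results_by_section = []
for section_idx in range(num_sections):
  per_section = []
  for batches in encoded_ranges_sources:                  # [source_idx]
    per_section.extend(b[section_idx] for b in batches)   # [batch_idx]
  results_by_section.append(join_dicts_of_lists(per_section))

print(results_by_section)
# [{'a.o': ['0:5'], 'b.o': ['20:7']}, {'a.o': ['12:3']}]  (order may vary)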
@@ -10,7 +10,7 @@ LookupElfRodataInfo():
 ReadFileChunks():
   Reads raw data from a file, given a list of ranges in the file.
-ResolveStringPieces():
+ResolveStringPiecesIndirect():
   BulkForkAndCall() target: Given {path: [string addresses]} and
   [raw_string_data for each string_section]:
   - Reads {path: [src_strings]}.
@@ -46,17 +46,17 @@ def LookupElfRodataInfo(elf_path, tool_prefix):
   raise AssertionError('No .rodata for command: ' + repr(args))
-def ReadFileChunks(path, positions):
-  """Returns a list of strings corresponding to |positions|.
+def ReadFileChunks(path, section_ranges):
+  """Returns a list of raw data from |path|, specified by |section_ranges|.
   Args:
-    positions: List of (offset, size).
+    section_ranges: List of (offset, size).
   """
   ret = []
-  if not positions:
+  if not section_ranges:
     return ret
   with open(path, 'rb') as f:
-    for offset, size in positions:
+    for offset, size in section_ranges:
       f.seek(offset)
       ret.append(f.read(size))
   return ret
@@ -190,33 +190,25 @@ def _IterStringLiterals(path, addresses, obj_sections):
     yield section_data[prev_offset:]
-# This is a target for BulkForkAndCall().
-def ResolveStringPieces(encoded_string_addresses_by_path, string_data,
-                        tool_prefix, output_directory):
-  string_addresses_by_path = concurrent.DecodeDictOfLists(
-      encoded_string_addresses_by_path)
-  # Assign |target| as archive path, or a list of object paths.
-  any_path = next(string_addresses_by_path.iterkeys())
-  target = _ExtractArchivePath(any_path)
-  if not target:
-    target = string_addresses_by_path.keys()
-  # Run readelf to find location of .rodata within the .o files.
-  section_positions_by_path = _LookupStringSectionPositions(
-      target, tool_prefix, output_directory)
-  # Load the .rodata sections (from object files) as strings.
-  string_sections_by_path = _ReadStringSections(
-      target, output_directory, section_positions_by_path)
-  # list of elf_positions_by_path.
+def _AnnotateStringData(string_data, path_value_gen):
+  """Annotates each |string_data| section data with paths and ranges.
+  Args:
+    string_data: [raw_string_data for each string_section] from an ELF file.
+    path_value_gen: A generator of (path, value) pairs, where |path|
+      is the path to an object file and |value| is a string to annotate.
+  Returns:
+    [{path: [string_ranges]} for each string_section].
+  """
   ret = [collections.defaultdict(list) for _ in string_data]
-  # Brute-force search of strings within ** merge strings sections.
-  # This is by far the slowest part of AnalyzeStringLiterals().
-  # TODO(agrieve): Pre-process string_data into a dict of literal->address (at
-  # least for ascii strings).
-  for path, object_addresses in string_addresses_by_path.iteritems():
-    for value in _IterStringLiterals(
-        path, object_addresses, string_sections_by_path.get(path)):
+  # Brute-force search ** merge strings sections in |string_data| for string
+  # values from |path_value_gen|. This is by far the slowest part of
+  # AnalyzeStringLiterals().
+  # TODO(agrieve): Pre-process |string_data| into a dict of literal->address (at
+  # least for ASCII strings).
+  for path, value in path_value_gen:
     first_match = -1
     first_match_dict = None
     for target_dict, data in itertools.izip(ret, string_data):
@@ -243,7 +235,34 @@ def ResolveStringPieces(encoded_string_addresses_by_path, string_data,
       # Missing strings happen when optimization make them unused.
       if offset != -1:
         # Encode tuple as a string for easier mashalling.
-          target_dict[path].append(
-              str(offset) + ':' + str(len(value)))
+        target_dict[path].append(str(offset) + ':' + str(len(value)))
+  return ret
+# This is a target for BulkForkAndCall().
+def ResolveStringPiecesIndirect(encoded_string_addresses_by_path, string_data,
+                                tool_prefix, output_directory):
+  string_addresses_by_path = concurrent.DecodeDictOfLists(
+      encoded_string_addresses_by_path)
+  # Assign |target| as archive path, or a list of object paths.
+  any_path = next(string_addresses_by_path.iterkeys())
+  target = _ExtractArchivePath(any_path)
+  if not target:
+    target = string_addresses_by_path.keys()
+  # Run readelf to find location of .rodata within the .o files.
+  section_positions_by_path = _LookupStringSectionPositions(
+      target, tool_prefix, output_directory)
+  # Load the .rodata sections (from object files) as strings.
+  string_sections_by_path = _ReadStringSections(
+      target, output_directory, section_positions_by_path)
+  def GeneratePathAndValues():
+    for path, object_addresses in string_addresses_by_path.iteritems():
+      for value in _IterStringLiterals(
+          path, object_addresses, string_sections_by_path.get(path)):
+        yield path, value
+  ret = _AnnotateStringData(string_data, GeneratePathAndValues())
   return [concurrent.EncodeDictOfLists(x) for x in ret]
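
To make the brute-force search concrete: the new _AnnotateStringData() looks for each candidate string inside every "** merge strings" blob and records 'offset:length' entries keyed by object path. Below is a simplified standalone sketch with made-up data; it omits the first-match bookkeeping in the real code, and the function and path names are hypothetical.

from collections import defaultdict

def annotate(string_data, path_value_pairs):
  # For each (path, value), find value in one of the section blobs and record
  # 'offset:length' under that path, roughly as _AnnotateStringData() does.
  ret = [defaultdict(list) for _ in string_data]
  for path, value in path_value_pairs:
    for target_dict, data in zip(ret, string_data):
      offset = data.find(value)
      if offset != -1:
        target_dict[path].append(str(offset) + ':' + str(len(value)))
        break  # Simplification: the real code tracks a first-match fallback.
  return ret

sections = [b'Hello\x00World\x00', b'spam\x00eggs\x00']
pairs = [('a.o', b'World\x00'), ('b.o', b'eggs\x00')]
print([dict(d) for d in annotate(sections, pairs)])
# [{'a.o': ['6:6']}, {'b.o': ['5:5']}]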