Commit 3df5254f authored by Samuel Huang, committed by Commit Bot

[Supersize] Refactor to prepare for LTO string support integration.

This CL refactors Supersize to prepare for bcanalyzer.py integration.
Details:
- Variable and function renames.
- Split _BulkObjectFileAnalyzerWorker.AnalyzePaths() (see the sketch after this list):
  - _ClassifyPaths(): Explicitly store .a files and .o files. For LTO
    we'll split .o into ELF and BC buckets.
  - _MakeBatches(): Reusable later for ELF and BC separately.
  - _DoBulkFork(): Reusable.
  - _RunNm(): Absorbs nm-specific code, will add alternative.
- Split _BulkObjectFileAnalyzerWorker.AnalyzeStringLiterals()
  - _ReadElfStringData(): Separation of concern.
  - _GetEncodedRangesFromStringAddresses(): ELF-specific code. Will
    add alternative.
  - Restructure how results are merged, with more comments.
- Split ResolveStringPiecesIndirect() (was ResolveStringPieces()):
  - _AnnotateStringData(): Reusable.
  - Will add alternative: ResolveStringPieces().
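
A minimal standalone sketch of the new classify-and-batch flow (simplified
stand-ins for the real methods; names and the batch size mirror this CL):

  def classify_paths(paths):
    arch_paths = [p for p in paths if p.endswith('.a')]
    obj_paths = [p for p in paths if not p.endswith('.a')]
    return arch_paths, obj_paths

  def make_batches(paths, size=None):
    if size is None:
      return [(p,) for p in paths]  # .a files: one per job, never grouped.
    return [(paths[i:i + size],) for i in xrange(0, len(paths), size)]

  arch_paths, obj_paths = classify_paths(['lib/foo.a', 'obj/a.o', 'obj/b.o'])
  batches = make_batches(arch_paths) + make_batches(obj_paths, size=50)
  # Each batch becomes one argument tuple for BulkForkAndCall(), which
  # _RunNm() points at nm.RunNmOnIntermediates.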

Bug: 723798
Change-Id: Ib8e0e1785ae11652a17d060c9629e3986df0f93a
Reviewed-on: https://chromium-review.googlesource.com/1146775
Commit-Queue: Samuel Huang <huangs@chromium.org>
Reviewed-by: Samuel Huang <huangs@chromium.org>
Reviewed-by: agrieve <agrieve@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577274}
parent eafc94cb
......@@ -817,8 +817,8 @@ def _ParseElfInfo(map_path, elf_path, tool_prefix, track_string_literals,
# More likely for there to be a bug in supersize than an ELF to not have a
# single string literal.
assert merge_string_syms
string_positions = [(s.address, s.size) for s in merge_string_syms]
bulk_analyzer.AnalyzeStringLiterals(elf_path, string_positions)
string_ranges = [(s.address, s.size) for s in merge_string_syms]
bulk_analyzer.AnalyzeStringLiterals(elf_path, string_ranges)
logging.info('Stripping linker prefixes from symbol names')
_StripLinkerAddedSymbolPrefixes(raw_symbols)
......
......@@ -2,12 +2,10 @@
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
"""Runs nm on every .o file that comprises an ELF (plus some analysis).
The design of this file is entirely to work around Python's lack of concurrency.
"""Runs nm on specified .a and .o file, plus some analysis.
CollectAliasesByAddress():
Runs "nm" on the elf to collect all symbol names. This reveals symbol names of
Runs nm on the elf to collect all symbol names. This reveals symbol names of
identical-code-folded functions.
CollectAliasesByAddressAsync():
......@@ -20,7 +18,6 @@ RunNmOnIntermediates():
"""
import collections
import os
import subprocess
import concurrent
......@@ -146,7 +143,7 @@ def _ParseOneObjectFileNmOutput(lines):
string_addresses.append(line[:space_idx].lstrip('0') or '0')
elif _IsRelevantObjectFileName(mangled_name):
symbol_names.add(mangled_name)
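# symbol_names holds mangled names defined in this object file;
# string_addresses holds the addresses of its string literal symbols, kept as
# hex strings with leading zeros stripped (see above).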
return string_addresses, symbol_names
return symbol_names, string_addresses
# This is a target for BulkForkAndCall().
......@@ -176,14 +173,14 @@ def RunNmOnIntermediates(target, tool_prefix, output_directory):
assert not is_archive
path = target[0]
string_addresses_by_path = {}
symbol_names_by_path = {}
string_addresses_by_path = {}
while path:
if is_archive:
# E.g. foo/bar.a(baz.o)
path = '%s(%s)' % (target, path)
string_addresses, mangled_symbol_names = _ParseOneObjectFileNmOutput(lines)
mangled_symbol_names, string_addresses = _ParseOneObjectFileNmOutput(lines)
symbol_names_by_path[path] = mangled_symbol_names
if string_addresses:
string_addresses_by_path[path] = string_addresses
......
......@@ -7,6 +7,10 @@
This file works around Python's lack of concurrency.
_BulkObjectFileAnalyzerWorker:
Performs the actual work. Uses Process Pools to shard out per-object-file
work and then aggregates results.
_BulkObjectFileAnalyzerMaster:
Creates a subprocess and sends IPCs to it asking it to do work.
......@@ -15,19 +19,17 @@ _BulkObjectFileAnalyzerSlave:
Runs _BulkObjectFileAnalyzerWorker on a background thread in order to stay
responsive to IPCs.
_BulkObjectFileAnalyzerWorker:
Performs the actual work. Uses Process Pools to shard out per-object-file
work and then aggregates results.
BulkObjectFileAnalyzer:
Alias for _BulkObjectFileAnalyzerMaster, but when SUPERSIZE_DISABLE_ASYNC=1,
alias for _BulkObjectFileAnalyzerWorker.
* AnalyzePaths: Run "nm" on all .o files to collect symbol names that exist
Extracts information from .o files. Alias for _BulkObjectFileAnalyzerMaster,
but when SUPERSIZE_DISABLE_ASYNC=1, alias for _BulkObjectFileAnalyzerWorker.
* AnalyzePaths(): Processes all .o files to collect symbol names that exist
within each. Does not work with thin archives (expand them first).
* SortPaths: Sort results of AnalyzePaths().
* AnalyzeStringLiterals: Must be run after AnalyzePaths() has completed.
* SortPaths(): Sort results of AnalyzePaths().
* AnalyzeStringLiterals(): Must be run after AnalyzePaths() has completed.
Extracts string literals from .o files, and then locates them within the
"** merge strings" sections within an ELF's .rodata section.
* GetSymbolNames(): Accessor.
* Close(): Disposes data.
This file can also be run stand-alone in order to test out the logic on smaller
sample sizes.
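Illustrative call sequence (inferred from the method list above; variable
names are placeholders):
  analyzer = BulkObjectFileAnalyzer(tool_prefix, output_directory)
  analyzer.AnalyzePaths(object_and_archive_paths)
  analyzer.SortPaths()
  analyzer.AnalyzeStringLiterals(elf_path, string_ranges)  # After AnalyzePaths().
  symbol_names = analyzer.GetSymbolNames()
  analyzer.Close()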
......@@ -78,36 +80,58 @@ def _MakeToolPrefixAbsolute(tool_prefix):
return tool_prefix
class _PathsByType:
def __init__(self, arch, obj):
self.arch = arch
self.obj = obj
class _BulkObjectFileAnalyzerWorker(object):
def __init__(self, tool_prefix, output_directory):
self._tool_prefix = _MakeToolPrefixAbsolute(tool_prefix)
self._output_directory = output_directory
self._list_of_encoded_elf_string_ranges_by_path = None
self._paths_by_name = collections.defaultdict(list)
self._encoded_string_addresses_by_path_chunks = []
self._list_of_encoded_elf_string_positions_by_path = None
def AnalyzePaths(self, paths):
def iter_job_params():
object_paths = []
for path in paths:
# Note: ResolveStringPieces() relies upon .a not being grouped.
if path.endswith('.a'):
yield (path,)
else:
object_paths.append(path)
BATCH_SIZE = 50 # Chosen arbitrarily.
for i in xrange(0, len(object_paths), BATCH_SIZE):
batch = object_paths[i:i + BATCH_SIZE]
yield (batch,)
params = list(iter_job_params())
def _ClassifyPaths(self, paths):
"""Classifies |paths| (.o and .a files) by file type into separate lists.
Returns:
A _PathsByType instance storing classified disjoint sublists of |paths|.
"""
arch_paths = []
obj_paths = []
for path in paths:
if path.endswith('.a'):
arch_paths.append(path)
else:
obj_paths.append(path)
return _PathsByType(arch=arch_paths, obj=obj_paths)
def _MakeBatches(self, paths, size=None):
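"""Wraps paths into 1-tuples, to serve as argument tuples for BulkForkAndCall().
If |size| is given, each 1-tuple instead holds a list of up to |size| paths.
E.g. (illustrative): _MakeBatches(['a.o', 'b.o', 'c.o'], size=2)
  -> [(['a.o', 'b.o'],), (['c.o'],)].
"""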
if size is None:
# Create 1-tuples of strings.
return [(p,) for p in paths]
# Create 1-tuples of arrays of strings.
return [(paths[i:i + size],) for i in xrange(0, len(paths), size)]
def _DoBulkFork(self, runner, batches):
# Order of the jobs doesn't matter since each job owns independent paths,
# and our output is a dict where paths are the key.
results = concurrent.BulkForkAndCall(
nm.RunNmOnIntermediates, params, tool_prefix=self._tool_prefix,
return concurrent.BulkForkAndCall(
runner, batches, tool_prefix=self._tool_prefix,
output_directory=self._output_directory)
def _RunNm(self, paths_by_type):
"""Calls nm to get symbols and string addresses."""
# Downstream functions rely upon .a not being grouped.
batches = self._MakeBatches(paths_by_type.arch, None)
# Combine object files and Bitcode files for nm
BATCH_SIZE = 50 # Arbitrarily chosen.
batches.extend(self._MakeBatches(paths_by_type.obj, BATCH_SIZE))
results = self._DoBulkFork(nm.RunNmOnIntermediates, batches)
# Names are still mangled.
all_paths_by_name = self._paths_by_name
for encoded_syms, encoded_strs in results:
......@@ -118,40 +142,60 @@ class _BulkObjectFileAnalyzerWorker(object):
if encoded_strs != concurrent.EMPTY_ENCODED_DICT:
self._encoded_string_addresses_by_path_chunks.append(encoded_strs)
def AnalyzePaths(self, paths):
logging.debug('worker: AnalyzePaths() started.')
paths_by_type = self._ClassifyPaths(paths)
logging.info('File counts: {\'arch\': %d, \'obj\': %d}',
len(paths_by_type.arch), len(paths_by_type.obj))
self._RunNm(paths_by_type)
logging.debug('worker: AnalyzePaths() completed.')
def SortPaths(self):
# Finally, demangle all names, which can result in some merging of lists.
# Demangle all names, which can result in some merging of lists.
self._paths_by_name = demangle.DemangleKeysAndMergeLists(
self._paths_by_name, self._tool_prefix)
# Sort and uniquify.
for key in self._paths_by_name.iterkeys():
self._paths_by_name[key] = sorted(set(self._paths_by_name[key]))
def AnalyzeStringLiterals(self, elf_path, elf_string_positions):
logging.debug('worker: AnalyzeStringLiterals() started.')
# Read string_data from elf_path, to be shared by forked processes.
def _ReadElfStringData(self, elf_path, elf_string_ranges):
# Read string_data from elf_path, to be shared with forked processes.
address, offset, _ = string_extract.LookupElfRodataInfo(
elf_path, self._tool_prefix)
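# |elf_string_ranges| holds (address, size) pairs whose addresses are virtual
# addresses inside .rodata; subtracting .rodata's (virtual address - file
# offset) delta converts them to file offsets for ReadFileChunks().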
adjust = address - offset
abs_string_positions = (
(addr - adjust, s) for addr, s in elf_string_positions)
string_data = string_extract.ReadFileChunks(elf_path, abs_string_positions)
abs_elf_string_ranges = (
(addr - adjust, s) for addr, s in elf_string_ranges)
return string_extract.ReadFileChunks(elf_path, abs_elf_string_ranges)
def _GetEncodedRangesFromStringAddresses(self, string_data):
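"""Bulk-forks ResolveStringPiecesIndirect() over stored string address chunks.
Returns:
  One entry per forked job; each entry is a list (one item per string
  section) of encoded {path: [string_ranges]} dicts.
"""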
params = ((chunk,)
for chunk in self._encoded_string_addresses_by_path_chunks)
# Order of the jobs doesn't matter since each job owns independent paths,
# and our output is a dict where paths are the key.
results = concurrent.BulkForkAndCall(
string_extract.ResolveStringPieces, params, string_data=string_data,
tool_prefix=self._tool_prefix, output_directory=self._output_directory)
results = list(results)
final_result = []
for i in xrange(len(elf_string_positions)):
final_result.append(
concurrent.JoinEncodedDictOfLists([r[i] for r in results]))
self._list_of_encoded_elf_string_positions_by_path = final_result
string_extract.ResolveStringPiecesIndirect, params,
string_data=string_data, tool_prefix=self._tool_prefix,
output_directory=self._output_directory)
return list(results)
def AnalyzeStringLiterals(self, elf_path, elf_string_ranges):
logging.debug('worker: AnalyzeStringLiterals() started.')
string_data = self._ReadElfStringData(elf_path, elf_string_ranges)
# [source_idx][batch_idx][section_idx] -> Encoded {path: [string_ranges]}.
encoded_ranges_sources = [
self._GetEncodedRangesFromStringAddresses(string_data),
]
# [section_idx] -> {path: [string_ranges]}.
self._list_of_encoded_elf_string_ranges_by_path = []
# Contract [source_idx] and [batch_idx], then decode and join.
for section_idx in xrange(len(elf_string_ranges)): # Fetch result.
t = []
for encoded_ranges in encoded_ranges_sources: # [source_idx].
t.extend([b[section_idx] for b in encoded_ranges]) # [batch_idx].
self._list_of_encoded_elf_string_ranges_by_path.append(
concurrent.JoinEncodedDictOfLists(t))
logging.debug('worker: AnalyzeStringLiterals() completed.')
def GetSymbolNames(self):
......@@ -159,10 +203,10 @@ class _BulkObjectFileAnalyzerWorker(object):
def GetStringPositions(self):
return [concurrent.DecodeDictOfLists(x, value_transform=_DecodePosition)
for x in self._list_of_encoded_elf_string_positions_by_path]
for x in self._list_of_encoded_elf_string_ranges_by_path]
def GetEncodedStringPositions(self):
return self._list_of_encoded_elf_string_positions_by_path
return self._list_of_encoded_elf_string_ranges_by_path
def Close(self):
pass
......@@ -198,7 +242,7 @@ class _BulkObjectFileAnalyzerMaster(object):
else:
# We are the child process.
logging.root.handlers[0].setFormatter(logging.Formatter(
'nm: %(levelname).1s %(relativeCreated)6d %(message)s'))
'obj_analyzer: %(levelname).1s %(relativeCreated)6d %(message)s'))
worker_analyzer = _BulkObjectFileAnalyzerWorker(
self._tool_prefix, self._output_directory)
slave = _BulkObjectFileAnalyzerSlave(worker_analyzer, child_conn)
......@@ -215,8 +259,8 @@ class _BulkObjectFileAnalyzerMaster(object):
def SortPaths(self):
self._pipe.send((_MSG_SORT_PATHS,))
def AnalyzeStringLiterals(self, elf_path, string_positions):
self._pipe.send((_MSG_ANALYZE_STRINGS, elf_path, string_positions))
def AnalyzeStringLiterals(self, elf_path, elf_string_ranges):
self._pipe.send((_MSG_ANALYZE_STRINGS, elf_path, elf_string_ranges))
def GetSymbolNames(self):
self._pipe.send((_MSG_GET_SYMBOL_NAMES,))
......@@ -308,7 +352,7 @@ class _BulkObjectFileAnalyzerSlave(object):
if e.errno in (errno.EPIPE, errno.ECONNRESET):
sys.exit(1)
logging.debug('nm bulk subprocess finished.')
logging.debug('bulk subprocess finished.')
sys.exit(0)
......
......@@ -10,7 +10,7 @@ LookupElfRodataInfo():
ReadFileChunks():
Reads raw data from a file, given a list of ranges in the file.
ResolveStringPieces():
ResolveStringPiecesIndirect():
BulkForkAndCall() target: Given {path: [string addresses]} and
[raw_string_data for each string_section]:
- Reads {path: [src_strings]}.
......@@ -46,17 +46,17 @@ def LookupElfRodataInfo(elf_path, tool_prefix):
raise AssertionError('No .rodata for command: ' + repr(args))
def ReadFileChunks(path, positions):
"""Returns a list of strings corresponding to |positions|.
def ReadFileChunks(path, section_ranges):
"""Returns a list of raw data from |path|, specified by |section_ranges|.
Args:
positions: List of (offset, size).
section_ranges: List of (offset, size).
"""
ret = []
if not positions:
if not section_ranges:
return ret
with open(path, 'rb') as f:
for offset, size in positions:
for offset, size in section_ranges:
f.seek(offset)
ret.append(f.read(size))
return ret
......@@ -190,9 +190,59 @@ def _IterStringLiterals(path, addresses, obj_sections):
yield section_data[prev_offset:]
def _AnnotateStringData(string_data, path_value_gen):
"""Annotates each |string_data| section data with paths and ranges.
Args:
string_data: [raw_string_data for each string_section] from an ELF file.
path_value_gen: A generator of (path, value) pairs, where |path|
is the path to an object file and |value| is a string to annotate.
Returns:
[{path: [string_ranges]} for each string_section].
"""
ret = [collections.defaultdict(list) for _ in string_data]
# Brute-force search ** merge strings sections in |string_data| for string
# values from |path_value_gen|. This is by far the slowest part of
# AnalyzeStringLiterals().
# TODO(agrieve): Pre-process |string_data| into a dict of literal->address (at
# least for ASCII strings).
for path, value in path_value_gen:
first_match = -1
first_match_dict = None
for target_dict, data in itertools.izip(ret, string_data):
# Set offset so that it will be 0 when len(value) is added to it below.
offset = -len(value)
while True:
offset = data.find(value, offset + len(value))
if offset == -1:
break
# Preferring exact matches (those following \0) over substring matches
# significantly increases accuracy (although it shows that the linker isn't
# being optimal).
if offset == 0 or data[offset - 1] == '\0':
break
if first_match == -1:
first_match = offset
first_match_dict = target_dict
if offset != -1:
break
if offset == -1:
# Exact match not found, so take suffix match if it exists.
offset = first_match
target_dict = first_match_dict
# Missing strings happen when optimizations make them unused.
if offset != -1:
# Encode tuple as a string for easier marshalling.
target_dict[path].append(str(offset) + ':' + str(len(value)))
return ret
# This is a target for BulkForkAndCall().
def ResolveStringPieces(encoded_string_addresses_by_path, string_data,
tool_prefix, output_directory):
def ResolveStringPiecesIndirect(encoded_string_addresses_by_path, string_data,
tool_prefix, output_directory):
string_addresses_by_path = concurrent.DecodeDictOfLists(
encoded_string_addresses_by_path)
# Assign |target| as archive path, or a list of object paths.
......@@ -208,42 +258,11 @@ def ResolveStringPieces(encoded_string_addresses_by_path, string_data,
string_sections_by_path = _ReadStringSections(
target, output_directory, section_positions_by_path)
# list of elf_positions_by_path.
ret = [collections.defaultdict(list) for _ in string_data]
# Brute-force search of strings within ** merge strings sections.
# This is by far the slowest part of AnalyzeStringLiterals().
# TODO(agrieve): Pre-process string_data into a dict of literal->address (at
# least for ascii strings).
for path, object_addresses in string_addresses_by_path.iteritems():
for value in _IterStringLiterals(
path, object_addresses, string_sections_by_path.get(path)):
first_match = -1
first_match_dict = None
for target_dict, data in itertools.izip(ret, string_data):
# Set offset so that it will be 0 when len(value) is added to it below.
offset = -len(value)
while True:
offset = data.find(value, offset + len(value))
if offset == -1:
break
# Preferring exact matches (those following \0) over substring matches
# significantly increases accuracy (although shows that linker isn't
# being optimal).
if offset == 0 or data[offset - 1] == '\0':
break
if first_match == -1:
first_match = offset
first_match_dict = target_dict
if offset != -1:
break
if offset == -1:
# Exact match not found, so take suffix match if it exists.
offset = first_match
target_dict = first_match_dict
# Missing strings happen when optimization make them unused.
if offset != -1:
# Encode tuple as a string for easier mashalling.
target_dict[path].append(
str(offset) + ':' + str(len(value)))
def GeneratePathAndValues():
for path, object_addresses in string_addresses_by_path.iteritems():
for value in _IterStringLiterals(
path, object_addresses, string_sections_by_path.get(path)):
yield path, value
ret = _AnnotateStringData(string_data, GeneratePathAndValues())
return [concurrent.EncodeDictOfLists(x) for x in ret]