Work around bad XML output by dexdump.

The version of dexdump in build-tools 30.0.1 includes more information
than previous versions in its output, which has revealed that it doesn't
encode its output as valid XML in all cases; Java strings with control
characters or nulls end up emitted literally in the XML output, which
breaks ElementTree's parser. b/161925303 has been filed internally to
track fixing this in dexdump itself.

Since we only need to be able to extract a few specific things from the
dexdump output in our tooling, just replace invalid characters in the
output with the Unicode replacement character (as Python does when using
the 'replace' error handler in encoding) before parsing it.

Bug: 1106471
Change-Id: Id1c3a40e5ce91125dbee9fdf1923383d02314f55
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2314986
Reviewed-by: Andrew Grieve <agrieve@chromium.org>
Auto-Submit: Richard Coles <torne@chromium.org>
Commit-Queue: Richard Coles <torne@chromium.org>
Cr-Commit-Position: refs/heads/master@{#791232}
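
For illustration only, the parser failure described in the commit message can be reproduced with a small stand-in document. This is a minimal sketch assuming Python 3; the XML fragment and the helper name `parses` are hypothetical and are not taken from real dexdump output.

```python
from xml.etree import ElementTree

# Hypothetical stand-in for dexdump output: the attribute value contains a
# literal U+0001 control character, which XML 1.0 does not allow.
bad_xml = (b'<?xml version="1.0" encoding="utf-8"?>'
           b'<api><field value="a\x01b"/></api>')
good_xml = (b'<?xml version="1.0" encoding="utf-8"?>'
            b'<api><field value="ab"/></api>')


def parses(xml_bytes):
    """Returns True if ElementTree accepts the document, False otherwise."""
    try:
        ElementTree.fromstring(xml_bytes)
        return True
    except ElementTree.ParseError:
        return False
```

Under these assumptions, `parses(bad_xml)` is false while `parses(good_xml)` is true, which is the behavior the workaround has to avoid.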

Commit 24c982a7 authored Jul 23, 2020 by Torne (Richard Coles) Committed by Commit Bot Jul 23, 2020

Showing with 11 additions and 1 deletion

build/android/pylib/utils/dexdump.py +11 -1

--- a/build/android/pylib/utils/dexdump.py
+++ b/build/android/pylib/utils/dexdump.py
@@ -3,6 +3,7 @@
 # found in the LICENSE file.
 import os
+import re
 import shutil
 import tempfile
 from xml.etree import ElementTree
@@ -37,7 +38,16 @@ def Dump(apk_path):
     cmd_helper.RunCmd(['unzip', apk_path, 'classes.dex'], cwd=dexfile_dir)
     dexfile = os.path.join(dexfile_dir, 'classes.dex')
     output_xml = cmd_helper.GetCmdOutput([DEXDUMP_PATH, '-l', 'xml', dexfile])
-    return _ParseRootNode(ElementTree.fromstring(output_xml))
+    # Dexdump doesn't escape its XML output very well; decode it as utf-8 with
+    # invalid sequences replaced, then replace forbidden characters with
+    # U+FFFD and re-encode it (etree expects a byte string as input so it can
+    # figure out the encoding itself from the XML declaration).
+    BAD_XML_CHARS = re.compile(
+        u'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\x84\x86-\x9f' +
+        u'\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')
+    decoded_xml = output_xml.decode('utf-8', 'replace')
+    clean_xml = BAD_XML_CHARS.sub(u'\ufffd', decoded_xml)
+    return _ParseRootNode(ElementTree.fromstring(clean_xml.encode('utf-8')))
   finally:
     shutil.rmtree(dexfile_dir)
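
The sanitizing step added by this patch can be tried in isolation. The following is a rough sketch assuming Python 3 (the patch itself uses `u''` literals); the function name `sanitize_xml` and the sample input are hypothetical, and the character ranges are copied from `BAD_XML_CHARS` in the diff above.

```python
import re
from xml.etree import ElementTree

# Same character ranges as the patch, written as a raw pattern so the regex
# engine (not the string literal) interprets the \x and \u escapes.
BAD_XML_CHARS = re.compile(
    r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\x84\x86-\x9f'
    r'\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')


def sanitize_xml(raw_bytes):
    """Returns UTF-8 bytes with invalid content replaced by U+FFFD."""
    # Invalid UTF-8 sequences become U+FFFD during decoding.
    decoded = raw_bytes.decode('utf-8', 'replace')
    # Forbidden or discouraged XML characters also become U+FFFD.
    cleaned = BAD_XML_CHARS.sub('\ufffd', decoded)
    # Re-encode so the parser can read the encoding from the XML declaration.
    return cleaned.encode('utf-8')


# Hypothetical input: one control character and one invalid UTF-8 byte.
raw = (b'<?xml version="1.0" encoding="utf-8"?>'
       b'<api><field value="a\x01b\xffc"/></api>')
root = ElementTree.fromstring(sanitize_xml(raw))
value = root.find('field').get('value')
```

Tab, newline, carriage return and U+0085 fall outside the ranges and pass through unchanged, so only the characters the parser would reject (plus a few discouraged ones) are rewritten.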