PDF a11y: Don't break paragraphs on font changes
The current heuristic always creates a new paragraph whenever the font size changes, which is undesirable.

Simply removing the size check causes many of the existing tests to fail, because the size change is often the only reason that we have multiple paragraphs at all. The problem is that our PDF file has too few lines on each page to compute reasonable line & paragraph size thresholds. So this change required changing the heuristics.

The new heuristic is as follows:
1. We keep track of the top & bottom of the current line, as weighted averages of the (recent) text boxes on the line.
2. When we encounter a new text box, if it significantly overlaps the top-to-bottom range, it's considered part of the same line.
3. If we are starting a new line, we also check the paragraph threshold to see if we should start a new paragraph. If the paragraph threshold couldn't be computed (because there weren't enough lines on the page), we compare against the line size.

We also introduce the `PDFExtensionAccessibilityTextExtractionTest` test suite. These tests are like the tree-dump tests, but they dump raw text content, split into lines and paragraphs. (Compared to tree-dump tests, this approach allows us to test that the kNextOnLine and kPreviousOnLine attributes are correct.)

Bug: 985604
Change-Id: Idfce6edfef42580e7fac4d8a7753c82495c15bd1
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/1743032
Commit-Queue: Ian Prest <iapres@microsoft.com>
Reviewed-by: Lei Zhang <thestig@chromium.org>
Reviewed-by: Kevin Babbitt <kbabbitt@microsoft.com>
Cr-Commit-Position: refs/heads/master@{#690153}
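
The three-step heuristic described in the commit message can be sketched roughly as below. This is an illustration only, not the code from this change: the `TextBox`, `Break`, and `LineTracker` names, the 0.5 overlap ratio, the 0.5 averaging weight, and the reading of "compare against the line size" as "compare the vertical gap to the current line height" are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical text box: vertical extent only, with y growing downward.
struct TextBox {
  float top;
  float bottom;
};

enum class Break { kSameLine, kNewLine, kNewParagraph };

// Illustrative tracker; the constants are assumed, not taken from Chromium.
class LineTracker {
 public:
  // No paragraph threshold could be computed (too few lines on the page).
  LineTracker() : has_threshold_(false), threshold_(0.f) {}
  // A paragraph threshold is available.
  explicit LineTracker(float threshold)
      : has_threshold_(true), threshold_(threshold) {}

  Break Add(const TextBox& box) {
    assert(box.bottom > box.top);
    if (!has_line_) {
      StartLine(box);
      return Break::kNewParagraph;  // The first box opens the first paragraph.
    }
    // Step 2: a box that significantly overlaps the tracked range stays on
    // the current line, regardless of its font size.
    const float overlap =
        std::min(bottom_, box.bottom) - std::max(top_, box.top);
    const float smaller = std::min(bottom_ - top_, box.bottom - box.top);
    if (overlap >= kOverlapRatio * smaller) {
      // Step 1: update the range as a weighted average, so that recent
      // boxes count more than older ones.
      top_ = kWeight * top_ + (1.f - kWeight) * box.top;
      bottom_ = kWeight * bottom_ + (1.f - kWeight) * box.bottom;
      return Break::kSameLine;
    }
    // Step 3: starting a new line. Use the paragraph threshold if there is
    // one; otherwise fall back to the height of the line just finished.
    const float line_size = bottom_ - top_;
    const float gap = box.top - bottom_;
    const float limit = has_threshold_ ? threshold_ : line_size;
    StartLine(box);
    return gap > limit ? Break::kNewParagraph : Break::kNewLine;
  }

 private:
  static constexpr float kOverlapRatio = 0.5f;  // assumed value
  static constexpr float kWeight = 0.5f;        // assumed value

  void StartLine(const TextBox& box) {
    has_line_ = true;
    top_ = box.top;
    bottom_ = box.bottom;
  }

  bool has_threshold_;
  float threshold_;
  bool has_line_ = false;
  float top_ = 0.f;
  float bottom_ = 0.f;
};
```

In this sketch a taller box that overlaps the current line (a font-size change mid-line) returns `kSameLine`, which is the behavior the commit is after; a paragraph break only arises from vertical spacing between lines.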