Commit aa1173a0 authored by Alex Turner, committed by Commit Bot

Update filter_many.sh script to correctly handle long rules

Currently the filter_many script uses xargs to aggregate the matched
rules (and their frequencies) from multiple processes, all of which
write to the same pipe. Some easylist rules are very long (e.g. >13k
chars). However, Linux only guarantees that writes to a pipe are atomic
up to PIPE_BUF = 4096 bytes, so longer lines are (rarely) interleaved.
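
The limit can be checked on a given machine. The following is a minimal
sketch, assuming a `getconf` that understands `PIPE_BUF`; the
13,000-character line is synthetic and only stands in for a long rule.

```shell
# Ask the system for the largest pipe write that is guaranteed to be atomic.
pipe_buf=$(getconf PIPE_BUF /)
echo "PIPE_BUF: $pipe_buf bytes"

# Build a synthetic 13,000-character line (not a real easylist rule).
long_rule=$(printf '%13000s' '' | tr ' ' 'a')
rule_len=${#long_rule}

# A line longer than PIPE_BUF may be split across several pipe writes,
# which is where output from concurrent writers can interleave.
if [ "$rule_len" -gt "$pipe_buf" ]; then
  echo "A ${rule_len}-byte line exceeds PIPE_BUF (${pipe_buf})."
fi
```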

We thus change the script to have each process write to its own
temporary file; these files are later aggregated serially.
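
The pattern can be sketched in isolation. This is an illustration under
stated assumptions rather than the script itself: the worker below is a
hypothetical stand-in for the filter tool that just emits one
13,000-character line, and `tmp_dir`, `combined.txt`, and the worker
count are made-up names and values.

```shell
# Private directory for the per-worker output files.
tmp_dir=$(mktemp -d)
export tmp_dir

# Start 8 jobs, at most 4 at a time. Each job creates its own uniquely
# named file, so no two workers ever write to the same destination.
printf '%s\n' 1 2 3 4 5 6 7 8 |
xargs -P 4 -I {} sh -c '
  out=$(mktemp "$tmp_dir/output.XXXXXXXXXX")
  # Hypothetical worker output: a single 13,000-character line.
  printf "%13000s\n" "" | tr " " "x" > "$out"
'

# Aggregate serially, after all workers have finished.
cat "$tmp_dir"/output.* > "$tmp_dir/combined.txt"

# Count all lines, and the lines that arrived with their full length.
line_count=$(awk 'END { print NR }' "$tmp_dir/combined.txt")
intact_count=$(awk 'length($0) == 13000 { n++ } END { print n + 0 }' "$tmp_dir/combined.txt")
echo "lines: $line_count, intact: $intact_count"

rm -rf "$tmp_dir"
```

Because the concatenation happens in a single process after xargs
returns, every line reaches the combined file whole regardless of its
length.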

Bug: 1039730
Change-Id: If1b13a49134d238710a6e345fc274b5759fb2092
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/1990133
Commit-Queue: Alex Turner <alexmt@chromium.org>
Reviewed-by: Josh Karlin <jkarlin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#729535}
parent c3ac85ff
@@ -15,7 +15,8 @@
 # easylist_indexed > sorted_list
 # The number of processes you want to run in parallel. 8 is reasonable for a
-# typical machine. 80 is good for a powerful workstation.
+# typical machine. 80 is good for a powerful workstation. If 0 is specified,
+# this script uses as many as possible.
 PROCESS_COUNT=$1
 # The path to the directory that contains gzip files of resource requests from
@@ -28,13 +29,22 @@ FILTER_TOOL=$3
 # The path to the indexed easylist file.
 EASYLIST=$4
+# Create temporary directory.
+TEMP_DIR=$(mktemp -d)
 # For each gzip file:
 ls $GZIP_PATH/*.gz |
-# Unzip the file and count the number of times each rule matches.
-# Do this in parallel.
+# In parallel, unzip the file and count the number of times each rule matches.
+# The results are saved to independent temporary files to ensure that writes
+# aren't interleaved mid-rule.
 xargs -t -I {} -P $PROCESS_COUNT \
-sh -c "gunzip -c '{}' | $FILTER_TOOL --ruleset=$EASYLIST match_rules" |
+sh -c "gunzip -c {} | \
+$FILTER_TOOL --ruleset=$EASYLIST match_rules \
+> \$(mktemp $TEMP_DIR/output.XXXXXXXXXX)"
+# Aggregate the results from those files.
+cat $TEMP_DIR/output.* |
 # Sort the results by filter rule.
 sort -k 2 |
@@ -45,3 +55,6 @@ awk 'NR>1 && rule!=$2 {print count,rule; count=0} {count+=$1} {rule=$2} \
 # Sort the output in descending order by match count.
 sort -n -r
+# Delete the temporary folder.
+rm -rf $TEMP_DIR