mirror of https://github.com/mozilla/pdf.js.git synced 2025-04-22 16:18:08 +02:00

Handle toUnicode cmaps that omit leading zeros in hex encoded UTF-16 (issue 18099)

Add unit test to check compatibility with such cmaps

In the PDF in issue 18099, the toUnicode cmap had a line mapping the glyph char codes 00 through 7F to the corresponding code points. The syntax for mapping a range of char codes to a range of Unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the Unicode code points are supposed to be given in UTF-16 BE, the PDF's line should probably have read
<00> <7F> <0000>
Instead, it omitted two leading zeros from the UTF-16 value, like this:
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding high bytes (01 became \u0100, 02 became \u0200, et cetera), which turned Latin text in the PDF into Chinese when it was copied.
I'm not sure whether the PDF spec actually allows this, but since there is at least one PDF in the wild that does it, and other PDF readers handle it correctly, PDF.js should probably support it too.
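The decoding path this commit touches can be sketched as a standalone function (hypothetical name `decodeToUnicodeToken`; the surrogate-pair branch below is standard UTF-16 BE decoding assumed from context, not part of the hunk shown):

```javascript
// Sketch of decoding a toUnicode destination token (a string of raw bytes)
// as UTF-16 BE. Without the padding step, a one-byte token like "\x02"
// would be read as the single code unit 0x0200 instead of 0x0002.
function decodeToUnicodeToken(token) {
  // Add back the omitted leading zero byte on odd-length tokens,
  // mirroring the fix in this commit.
  if (token.length % 2 !== 0) {
    token = "\u0000" + token;
  }
  const codePoints = [];
  for (let k = 0; k < token.length; k += 2) {
    const w1 = (token.charCodeAt(k) << 8) | token.charCodeAt(k + 1);
    if ((w1 & 0xf800) !== 0xd800) {
      // Not a surrogate: a single UTF-16 code unit is the code point.
      codePoints.push(w1);
      continue;
    }
    // High surrogate: combine with the following low surrogate
    // (assumes well-formed input).
    k += 2;
    const w2 = (token.charCodeAt(k) << 8) | token.charCodeAt(k + 1);
    codePoints.push(((w1 & 0x3ff) << 10) + (w2 & 0x3ff) + 0x10000);
  }
  return String.fromCodePoint(...codePoints);
}
```

With the padding in place, the faulty cmap's incremented one-byte tokens decode to the intended low code points rather than to characters with only the high byte set.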
This commit is contained in:
alexcat3 2024-07-05 13:04:11 -04:00
parent e777ae2258
commit 1c364422a6
4 changed files with 21 additions and 0 deletions


@@ -3852,6 +3852,11 @@ class PartialEvaluator {
         map[charCode] = String.fromCodePoint(token);
         return;
       }
+      // Add back omitted leading zeros on odd length tokens
+      // (fixes issue #18099)
+      if (token.length % 2 !== 0) {
+        token = "\u0000" + token;
+      }
       const str = [];
       for (let k = 0; k < token.length; k += 2) {
         const w1 = (token.charCodeAt(k) << 8) | token.charCodeAt(k + 1);