mirror of https://github.com/mozilla/pdf.js.git synced 2025-04-22 16:18:08 +02:00

Handle toUnicode cmaps that omit leading zeros in hex encoded UTF-16 (issue 18099)

Add unit test to check compatibility with such cmaps

In the PDF in issue 18099, the toUnicode cmap had a line mapping the glyph char codes 00 through 7F to the corresponding code points. The syntax for mapping a range of char codes to a range of Unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the Unicode code points are supposed to be given in UTF-16 BE, the PDF's line should probably have read
<00> <7F> <0000>
Instead, it omitted two leading zeros from the UTF-16 value, like this:
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding high bytes (01 became \u0100, 02 became \u0200, et cetera), which turned Latin text in the PDF into Chinese when it was copied.
I'm not sure whether the PDF spec actually allows this, but since there is at least one PDF in the wild that does it, and other PDF readers handle it correctly, PDF.js should probably support it too.
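The decoding path this commit touches can be sketched as a standalone function (hypothetical name `decodeToUnicodeToken`; the surrogate-pair branch below is standard UTF-16 BE decoding assumed from context, not part of the hunk shown):

```javascript
// Sketch of decoding a toUnicode destination token (a string of raw bytes)
// as UTF-16 BE. Without the padding step, a one-byte token like "\x02"
// would be read as the single code unit 0x0200 instead of 0x0002.
function decodeToUnicodeToken(token) {
  // Add back the omitted leading zero byte on odd-length tokens,
  // mirroring the fix in this commit.
  if (token.length % 2 !== 0) {
    token = "\u0000" + token;
  }
  const codePoints = [];
  for (let k = 0; k < token.length; k += 2) {
    const w1 = (token.charCodeAt(k) << 8) | token.charCodeAt(k + 1);
    if ((w1 & 0xf800) !== 0xd800) {
      // Not a surrogate: a single UTF-16 code unit is the code point.
      codePoints.push(w1);
      continue;
    }
    // High surrogate: combine with the following low surrogate
    // (assumes well-formed input).
    k += 2;
    const w2 = (token.charCodeAt(k) << 8) | token.charCodeAt(k + 1);
    codePoints.push(((w1 & 0x3ff) << 10) + (w2 & 0x3ff) + 0x10000);
  }
  return String.fromCodePoint(...codePoints);
}
```

With the padding in place, the faulty cmap's incremented one-byte tokens decode to the intended low code points rather than to characters with only the high byte set.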
This commit is contained in:
alexcat3 2024-07-05 13:04:11 -04:00
parent e777ae2258
commit 1c364422a6
4 changed files with 21 additions and 0 deletions


@@ -3852,6 +3852,11 @@ class PartialEvaluator {
         map[charCode] = String.fromCodePoint(token);
         return;
       }
+      // Add back omitted leading zeros on odd length tokens
+      // (fixes issue #18099)
+      if (token.length % 2 !== 0) {
+        token = "\u0000" + token;
+      }
       const str = [];
       for (let k = 0; k < token.length; k += 2) {
         const w1 = (token.charCodeAt(k) << 8) | token.charCodeAt(k + 1);