mirror of
https://github.com/mozilla/pdf.js.git
synced 2025-04-20 15:18:08 +02:00
Handle toUnicode cmaps that omit leading zeros in hex encoded UTF-16 (issue 18099)
Add unit test to check compatibility with such cmaps

In the PDF in issue 18099, the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding code points. The syntax to map a range of char codes to a range of Unicode code points is

    <start_char_code> <end_char_code> <start_unicode_codepoint>

As the Unicode code points are supposed to be given in UTF-16 BE, the PDF's line should probably have read

    <00> <7F> <0000>

Instead it omitted two leading zeros from the UTF-16, like this:

    <00> <7F> <00>

This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding HIGH bytes (01 became \u0100, 02 became \u0200, et cetera), which ended up turning Latin text in the PDF into Chinese when it was copied.

I'm not sure if the PDF spec actually allows PDFs to do this, but since there's at least one PDF in the wild that does, and other PDF readers read it correctly, PDF.js should probably support this.
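The failure mode described in this message can be reproduced with a small standalone sketch. The helpers below (`hexToBytes`, `decodeUtf16BE`, `expandRange`) are hypothetical names made up for illustration and are not the PDF.js implementation; the sketch assumes the range is expanded by advancing the last destination byte before the bytes are decoded as UTF-16 BE, and that a missing low byte is read as 0.

```javascript
// Illustrative sketch only: these helpers are hypothetical, not PDF.js code.

// Convert a hex string such as "0041" into an array of byte values.
function hexToBytes(hex) {
  const bytes = [];
  for (let i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return bytes;
}

// Decode big-endian UTF-16 bytes. With `padOddLength` set, an odd number of
// bytes gets a zero byte prepended, restoring the omitted leading zeros.
function decodeUtf16BE(bytes, padOddLength) {
  const input = padOddLength && bytes.length % 2 !== 0 ? [0, ...bytes] : bytes;
  let result = "";
  for (let i = 0; i < input.length; i += 2) {
    // A missing low byte is read as 0, so a lone byte lands in the HIGH position.
    result += String.fromCharCode((input[i] << 8) | (input[i + 1] ?? 0));
  }
  return result;
}

// Expand "<start> <end> <dst>" into a Map from char code to decoded string.
// Assumes the last destination byte does not overflow within the range.
function expandRange(startHex, endHex, dstHex, padOddLength) {
  const start = parseInt(startHex, 16);
  const end = parseInt(endHex, 16);
  const dstBytes = hexToBytes(dstHex);
  const map = new Map();
  for (let code = start; code <= end; code++) {
    const bytes = dstBytes.slice();
    // Advance the last byte by the offset into the range.
    bytes[bytes.length - 1] += code - start;
    map.set(code, decodeUtf16BE(bytes, padOddLength));
  }
  return map;
}

const broken = expandRange("00", "7F", "00", false);
const padded = expandRange("00", "7F", "00", true);
const wellFormed = expandRange("00", "7F", "0000", false);
```

Under these assumptions, `broken.get(0x48)` yields the character U+4800 rather than "H", while both `padded.get(0x48)` and `wellFormed.get(0x48)` yield "H". Padding only odd-length destinations leaves well-formed cmaps untouched.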
parent
e777ae2258
commit
1c364422a6
4 changed files with 21 additions and 0 deletions
1
test/pdfs/.gitignore
vendored
@@ -653,3 +653,4 @@
 !bug1539074.1.pdf
 !issue18305.pdf
 !issue18360.pdf
+!issue18099_reduced.pdf
BIN
test/pdfs/issue18099_reduced.pdf
Normal file
Binary file not shown.
@@ -3419,6 +3419,21 @@ Caron Broadcasting, Inc., an Ohio corporation (“Lessee”).`)
       await loadingTask.destroy();
     });
 
+    it("gets text content, correctly handling documents with toUnicode cmaps that omit leading zeros on hex-encoded UTF-16", async function () {
+      const loadingTask = getDocument(
+        buildGetDocumentParams("issue18099_reduced.pdf")
+      );
+      const pdfDoc = await loadingTask.promise;
+      const pdfPage = await pdfDoc.getPage(1);
+      const { items } = await pdfPage.getTextContent({
+        disableNormalization: true,
+      });
+      const text = mergeText(items);
+      expect(text).toEqual("Hello world!");
+
+      await loadingTask.destroy();
+    });
+
     it("gets text content, and check that out-of-page text is not present (bug 1755201)", async function () {
       if (isNodeJS) {
         pending("Linked test-cases are not supported in Node.js.");