This extends PR 13796 to also handle the case where sub-streams contain invalid data, i.e. anything that isn't a Stream, however please note that in these cases there's no guarantee that we'll render the page "correctly".
Note that Adobe Reader, i.e. the PDF reference implementation, cannot render the last page of the referenced PDF document.
It fixes#18956.
In the patch #18029, for performance reasons and because I thought it was useless, I deliberately chose to not fill the mask
with the backdrop color when it's full black: it was a bad idea.
So in this patch we always add the backdrop color to the mask.
The PDF document is clearly corrupt, since it has /FontFile2 entries that are Dictionaries which obviously isn't correct.
While there's obviously no guarantee that things will look perfect this way, actually rendering the text at all should be an improvement in general.
The code parses the /RBGroups entry in the OC configuration dict and adds the property `rbGroups' to instances of the OptionalContentGroup class. rbGroups takes an array of Sets, where each Set instance represents an RB group the OptionalContentGroup instance is a member of. Such a Set instance contains all OCG ids within the corresponding RB group. RB groups an OCG is associated with are processed when its visibility is set to true, as required by the PDF spec.
Given that the sub-title of that document is "Public domain texts for young people." and that the images have clear sources at the end of the document, it should (hopefully) be OK to add it to the repository rather than relying on a linked test-case.
Given the following excerpt from the [Wikipedia article](https://en.wikipedia.org/wiki/Gill_Sans), mapping this to Helvetica should hopefully be fine:
> It has been described as "the British Helvetica" because of its lasting popularity in British design.
The idea is to insert a span in the text layer with an aria-role set to img
and use the bounding box provided by the attribute field in the tag dict in
order to have non-null dimensions for the image to make it "visible".
Given that we handle non-embedded Calibri fonts as "mapped to standard font", we really ought to be able to use the same glyph mapping as for an actual standard font.
Note that this actually improves consistency in the code, given how we already handle such fonts if they happen to be of the `CIDFontType2` type; see b47c7eca83/src/core/fonts.js (L1186-L1190)
According to the PDF specification these destinations should have a zoom parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870
Hence we try to work-around bad PDF generators by making the zoom parameter optional when validating explicit destinations in both the worker and the viewer.
Switching to an editing mode can be asynchronous (e.g. if an editable annotation exists on a
visible page), so we must add a new editor only when the page rendering is done.
This functionality is purposely limited to development mode and GENERIC builds, since it's unnecessary in e.g. the *built-in* Firefox PDF Viewer, and will only be used when a `<base>`-element is actually present.
*Please note:* We also have tests in mozilla-central that will *indirectly* ensure that relative filter-URLs work as intended in the Firefox PDF Viewer, see https://searchfox.org/mozilla-central/source/toolkit/components/pdfjs/test/browser_pdfjs_filters.js
---
To test that the issue is fixed, the following code can be used:
```html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<base href=".">
<title>base href (issue 18406)</title>
</head>
<body>
<ul>
<li>Place this code in a file, named `base_href.html`, in the root of the PDF.js repository</li>
<li>Run <pre>npx gulp dist-install</pre></li>
<li>Run <pre>npx gulp server</pre></li>
<li>Open <a href="http://localhost:8888/base_href.html">http://localhost:8888/base_href.html</a> in a browser</li>
<li>Compare rendering with <a href="http://localhost:8888/web/viewer.html?file=/test/pdfs/issue16287.pdf">http://localhost:8888/web/viewer.html?file=/test/pdfs/issue16287.pdf</a></li>
</ul>
<canvas id="the-canvas" style="border: 1px solid black; direction: ltr;"></canvas>
<script src="/node_modules/pdfjs-dist/build/pdf.mjs" type="module"></script>
<script id="script" type="module">
//
// If absolute URL from the remote server is provided, configure the CORS
// header on that server.
//
const url = '/test/pdfs/issue16287.pdf';
//
// The workerSrc property shall be specified.
//
pdfjsLib.GlobalWorkerOptions.workerSrc =
'/node_modules/pdfjs-dist/build/pdf.worker.mjs';
//
// Asynchronous download PDF
//
const loadingTask = pdfjsLib.getDocument(url);
const pdf = await loadingTask.promise;
//
// Fetch the first page
//
const page = await pdf.getPage(1);
const scale = 1.5;
const viewport = page.getViewport({ scale });
// Support HiDPI-screens.
const outputScale = window.devicePixelRatio || 1;
//
// Prepare canvas using PDF page dimensions
//
const canvas = document.getElementById("the-canvas");
const context = canvas.getContext("2d");
canvas.width = Math.floor(viewport.width * outputScale);
canvas.height = Math.floor(viewport.height * outputScale);
canvas.style.width = Math.floor(viewport.width) + "px";
canvas.style.height = Math.floor(viewport.height) + "px";
const transform = outputScale !== 1
? [outputScale, 0, 0, outputScale, 0, 0]
: null;
//
// Render PDF page into canvas context
//
const renderContext = {
canvasContext: context,
transform,
viewport,
};
page.render(renderContext);
</script>
</body>
</html>
```
According to the PDF specification these destinations should have a coordinate parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870
Hence we try to work-around bad PDF generators by making the coordinate parameter optional when validating explicit destinations in both the worker and the viewer.
Add unit test to check compatability with such cmaps
In the PDF in issue 18099. the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding code points. The syntax to map a range of char codes to a range of unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the unicode code points are supposed to be given in UTF-16 BE, the PDF's line SHOULD have probably read
<00> <7F> <0000>
Instead it omitted two leading zeros from the UTF-16 like this
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding HIGH bytes (01 became \u0100, 02 became \u0200, et cetera), which ended up turning latin text in the PDF into chinese when it was copied
I'm not sure if the PDF spec actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, PDF.js should probably support this
There's no specification for that (even if it's possible to have an idea from
the xfa specs) so we just want to hide them in order to avoid to display something
wrong.