1
0
Fork 0
mirror of https://github.com/mozilla/pdf.js.git synced 2025-04-26 01:58:06 +02:00

[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303)

*This patch basically extends the approach from PR 10392, by also checking the last page.*

Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser).

Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially counting the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.

Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.
 - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
 - For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents,
 - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).

Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
This commit is contained in:
Jonas Jenwald 2021-11-25 18:34:11 +01:00
parent 9a1e27efc5
commit d0c4bbd828
8 changed files with 215 additions and 16 deletions

View file

@ -50,6 +50,7 @@ import {
getInheritableProperty,
isWhiteSpace,
MissingDataException,
PageDictMissingException,
validateCSSFont,
XRefEntryException,
XRefParseException,
@ -787,7 +788,9 @@ class PDFDocument {
get numPages() {
let num = 0;
if (this.xfaFactory) {
if (this.catalog.hasActualNumPages) {
num = this.catalog.numPages;
} else if (this.xfaFactory) {
// num is a Promise.
num = this.xfaFactory.getNumPages();
} else if (this.linearization) {
@ -1331,8 +1334,13 @@ class PDFDocument {
return promise;
}
checkFirstPage() {
return this.getPage(0).catch(async reason => {
async checkFirstPage(recoveryMode = false) {
if (recoveryMode) {
return;
}
try {
await this.getPage(0);
} catch (reason) {
if (reason instanceof XRefEntryException) {
// Clear out the various caches to ensure that we haven't stored any
// inconsistent and/or incorrect state, since that could easily break
@ -1342,7 +1350,60 @@ class PDFDocument {
throw new XRefParseException();
}
});
}
}
async checkLastPage(recoveryMode = false) {
this.catalog.setActualNumPages(); // Ensure that it's always reset.
let numPages;
try {
await Promise.all([
this.pdfManager.ensureDoc("xfaFactory"),
this.pdfManager.ensureDoc("linearization"),
this.pdfManager.ensureCatalog("numPages"),
]);
if (this.xfaFactory) {
return; // The Page count is always calculated for XFA-documents.
} else if (this.linearization) {
numPages = this.linearization.numPages;
} else {
numPages = this.catalog.numPages;
}
if (numPages === 1) {
return;
} else if (!Number.isInteger(numPages)) {
throw new FormatError("Page count is not an integer.");
}
await this.getPage(numPages - 1);
} catch (reason) {
warn(`checkLastPage - invalid /Pages tree /Count: ${numPages}.`);
// Clear out the various caches to ensure that we haven't stored any
// inconsistent and/or incorrect state, since that could easily break
// subsequent `this.getPage` calls.
await this.cleanup();
let pageIndex = 1; // The first page was already loaded.
while (true) {
try {
await this.getPage(pageIndex, /* skipCount = */ true);
} catch (reasonLoop) {
if (reasonLoop instanceof PageDictMissingException) {
break;
}
if (reasonLoop instanceof XRefEntryException) {
if (!recoveryMode) {
throw new XRefParseException();
}
break;
}
}
pageIndex++;
}
this.catalog.setActualNumPages(pageIndex);
}
}
fontFallback(id, handler) {