1
0
Fork 0
mirror of https://github.com/mozilla/pdf.js.git synced 2025-04-26 10:08:06 +02:00
pdf.js/src/core/image_utils.js

188 lines
4.6 KiB
JavaScript
Raw Normal View History

/* Copyright 2019 Mozilla Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* eslint no-var: error */
import { assert, info, shadow, unreachable } from "../shared/util.js";
Attempt to cache repeated images at the document, rather than the page, level (issue 11878) Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually references exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1] Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2] However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages. In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data). While this obviously benefit both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fallback to the default behaviour. Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3] *Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator. --- [1] There's e.g. PDF documents that use the same image as background on all pages. [2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer. [3] If the latter case were true, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
2020-05-18 14:17:56 +02:00
import { RefSetCache } from "./primitives.js";
class BaseLocalCache {
Improve the *local* image caching in `PartialEvaluator.getOperatorList` Currently the local `imageCache`, as used in `PartialEvaluator.getOperatorList`, will miss certain cases of repeated images because the caching is *only* done by name (usually using a format such as e.g. "Im0", "Im1", ...). However, in some PDF documents the `/XObject` dictionaries many contain hundreds (or even thousands) of distinctly named images, despite them referring to only a handful of actual image objects (via the XRef table). With these changes we'll now cache *local* images using both name and (where applicable) reference, thus improving re-usage of images resources even further. This patch was tested using the PDF file from [bug 857031](https://bugzilla.mozilla.org/show_bug.cgi?id=857031), i.e. https://bug857031.bmoattachments.org/attachment.cgi?id=732270, with the following manifest file: ``` [ { "id": "bug857031", "file": "../web/pdfs/bug857031.pdf", "md5": "", "rounds": 250, "lastPage": 1, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, page, stat -- browser | page | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05) ------- | ---- | ------------ | ----- | ------------ | ----------- | --- | ----- | ------------- firefox | 0 | Overall | 250 | 2749 | 2656 | -93 | -3.38 | faster firefox | 0 | Page Request | 250 | 3 | 4 | 1 | 50.14 | slower firefox | 0 | Rendering | 250 | 2746 | 2652 | -94 | -3.44 | faster ``` While this is certainly an improvement, since we now avoid re-parsing ~1000 images on the first page, all of the image resources are small enough that the total rendering time doesn't improve that much in this particular case. In pathological cases, such as e.g. the PDF document in issue 4958, the improvements with this patch can be very significant. Looking for example at page 2, from issue 4958, the rendering time drops from ~60 seconds with `master` to ~30 seconds with this patch (obviously still slow, but it really showcases the potential of this patch nicely). Finally, note that there's also potential for additional improvements by re-using `LocalImageCache` instances for e.g. /XObject data of the `Form`-type. However, given that recent changes in this area I purposely didn't want to complicate *this* patch more than necessary.
2020-05-23 13:55:31 +02:00
constructor() {
if (this.constructor === BaseLocalCache) {
unreachable("Cannot initialize BaseLocalCache.");
}
Improve the *local* image caching in `PartialEvaluator.getOperatorList` Currently the local `imageCache`, as used in `PartialEvaluator.getOperatorList`, will miss certain cases of repeated images because the caching is *only* done by name (usually using a format such as e.g. "Im0", "Im1", ...). However, in some PDF documents the `/XObject` dictionaries many contain hundreds (or even thousands) of distinctly named images, despite them referring to only a handful of actual image objects (via the XRef table). With these changes we'll now cache *local* images using both name and (where applicable) reference, thus improving re-usage of images resources even further. This patch was tested using the PDF file from [bug 857031](https://bugzilla.mozilla.org/show_bug.cgi?id=857031), i.e. https://bug857031.bmoattachments.org/attachment.cgi?id=732270, with the following manifest file: ``` [ { "id": "bug857031", "file": "../web/pdfs/bug857031.pdf", "md5": "", "rounds": 250, "lastPage": 1, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, page, stat -- browser | page | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05) ------- | ---- | ------------ | ----- | ------------ | ----------- | --- | ----- | ------------- firefox | 0 | Overall | 250 | 2749 | 2656 | -93 | -3.38 | faster firefox | 0 | Page Request | 250 | 3 | 4 | 1 | 50.14 | slower firefox | 0 | Rendering | 250 | 2746 | 2652 | -94 | -3.44 | faster ``` While this is certainly an improvement, since we now avoid re-parsing ~1000 images on the first page, all of the image resources are small enough that the total rendering time doesn't improve that much in this particular case. In pathological cases, such as e.g. the PDF document in issue 4958, the improvements with this patch can be very significant. Looking for example at page 2, from issue 4958, the rendering time drops from ~60 seconds with `master` to ~30 seconds with this patch (obviously still slow, but it really showcases the potential of this patch nicely). Finally, note that there's also potential for additional improvements by re-using `LocalImageCache` instances for e.g. /XObject data of the `Form`-type. However, given that recent changes in this area I purposely didn't want to complicate *this* patch more than necessary.
2020-05-23 13:55:31 +02:00
this._nameRefMap = new Map();
this._imageMap = new Map();
this._imageCache = new RefSetCache();
}
getByName(name) {
const ref = this._nameRefMap.get(name);
if (ref) {
return this.getByRef(ref);
}
return this._imageMap.get(name) || null;
}
getByRef(ref) {
return this._imageCache.get(ref) || null;
}
set(name, ref, data) {
unreachable("Abstract method `set` called.");
}
}
class LocalImageCache extends BaseLocalCache {
Improve the *local* image caching in `PartialEvaluator.getOperatorList` Currently the local `imageCache`, as used in `PartialEvaluator.getOperatorList`, will miss certain cases of repeated images because the caching is *only* done by name (usually using a format such as e.g. "Im0", "Im1", ...). However, in some PDF documents the `/XObject` dictionaries many contain hundreds (or even thousands) of distinctly named images, despite them referring to only a handful of actual image objects (via the XRef table). With these changes we'll now cache *local* images using both name and (where applicable) reference, thus improving re-usage of images resources even further. This patch was tested using the PDF file from [bug 857031](https://bugzilla.mozilla.org/show_bug.cgi?id=857031), i.e. https://bug857031.bmoattachments.org/attachment.cgi?id=732270, with the following manifest file: ``` [ { "id": "bug857031", "file": "../web/pdfs/bug857031.pdf", "md5": "", "rounds": 250, "lastPage": 1, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, page, stat -- browser | page | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05) ------- | ---- | ------------ | ----- | ------------ | ----------- | --- | ----- | ------------- firefox | 0 | Overall | 250 | 2749 | 2656 | -93 | -3.38 | faster firefox | 0 | Page Request | 250 | 3 | 4 | 1 | 50.14 | slower firefox | 0 | Rendering | 250 | 2746 | 2652 | -94 | -3.44 | faster ``` While this is certainly an improvement, since we now avoid re-parsing ~1000 images on the first page, all of the image resources are small enough that the total rendering time doesn't improve that much in this particular case. In pathological cases, such as e.g. the PDF document in issue 4958, the improvements with this patch can be very significant. Looking for example at page 2, from issue 4958, the rendering time drops from ~60 seconds with `master` to ~30 seconds with this patch (obviously still slow, but it really showcases the potential of this patch nicely). Finally, note that there's also potential for additional improvements by re-using `LocalImageCache` instances for e.g. /XObject data of the `Form`-type. However, given that recent changes in this area I purposely didn't want to complicate *this* patch more than necessary.
2020-05-23 13:55:31 +02:00
set(name, ref = null, data) {
if (!name) {
throw new Error('LocalImageCache.set - expected "name" argument.');
}
if (ref) {
if (this._imageCache.has(ref)) {
return;
}
this._nameRefMap.set(name, ref);
this._imageCache.put(ref, data);
return;
}
// name
if (this._imageMap.has(name)) {
return;
}
this._imageMap.set(name, data);
}
}
class LocalColorSpaceCache extends BaseLocalCache {
set(name = null, ref = null, data) {
if (!name && !ref) {
throw new Error(
'LocalColorSpaceCache.set - expected "name" and/or "ref" argument.'
);
}
if (ref) {
if (this._imageCache.has(ref)) {
return;
}
if (name) {
// Optional when `ref` is defined.
this._nameRefMap.set(name, ref);
}
this._imageCache.put(ref, data);
return;
}
// name
if (this._imageMap.has(name)) {
return;
}
this._imageMap.set(name, data);
}
}
Attempt to cache repeated images at the document, rather than the page, level (issue 11878) Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually references exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1] Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2] However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages. In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data). While this obviously benefit both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fallback to the default behaviour. Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3] *Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator. --- [1] There's e.g. PDF documents that use the same image as background on all pages. [2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer. [3] If the latter case were true, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
2020-05-18 14:17:56 +02:00
class GlobalImageCache {
static get NUM_PAGES_THRESHOLD() {
return shadow(this, "NUM_PAGES_THRESHOLD", 2);
}
static get MAX_IMAGES_TO_CACHE() {
return shadow(this, "MAX_IMAGES_TO_CACHE", 10);
}
constructor() {
if (
typeof PDFJSDev === "undefined" ||
PDFJSDev.test("!PRODUCTION || TESTING")
) {
assert(
GlobalImageCache.NUM_PAGES_THRESHOLD > 1,
"GlobalImageCache - invalid NUM_PAGES_THRESHOLD constant."
);
}
this._refCache = new RefSetCache();
this._imageCache = new RefSetCache();
}
shouldCache(ref, pageIndex) {
const pageIndexSet = this._refCache.get(ref);
const numPages = pageIndexSet
? pageIndexSet.size + (pageIndexSet.has(pageIndex) ? 0 : 1)
: 1;
if (numPages < GlobalImageCache.NUM_PAGES_THRESHOLD) {
return false;
}
if (
!this._imageCache.has(ref) &&
this._imageCache.size >= GlobalImageCache.MAX_IMAGES_TO_CACHE
) {
return false;
}
return true;
}
addPageIndex(ref, pageIndex) {
let pageIndexSet = this._refCache.get(ref);
if (!pageIndexSet) {
pageIndexSet = new Set();
this._refCache.put(ref, pageIndexSet);
}
pageIndexSet.add(pageIndex);
}
getData(ref, pageIndex) {
const pageIndexSet = this._refCache.get(ref);
if (!pageIndexSet) {
Attempt to cache repeated images at the document, rather than the page, level (issue 11878) Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually references exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1] Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2] However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages. In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data). While this obviously benefit both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fallback to the default behaviour. Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3] *Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator. --- [1] There's e.g. PDF documents that use the same image as background on all pages. [2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer. [3] If the latter case were true, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
2020-05-18 14:17:56 +02:00
return null;
}
if (pageIndexSet.size < GlobalImageCache.NUM_PAGES_THRESHOLD) {
return null;
}
if (!this._imageCache.has(ref)) {
return null;
}
// Ensure that we keep track of all pages containing the image reference.
pageIndexSet.add(pageIndex);
return this._imageCache.get(ref);
}
setData(ref, data) {
if (!this._refCache.has(ref)) {
throw new Error(
'GlobalImageCache.setData - expected "addPageIndex" to have been called.'
);
}
if (this._imageCache.has(ref)) {
return;
}
if (this._imageCache.size >= GlobalImageCache.MAX_IMAGES_TO_CACHE) {
info(
"GlobalImageCache.setData - ignoring image above MAX_IMAGES_TO_CACHE."
);
return;
}
this._imageCache.put(ref, data);
}
clear(onlyData = false) {
if (!onlyData) {
this._refCache.clear();
}
Attempt to cache repeated images at the document, rather than the page, level (issue 11878) Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually references exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1] Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2] However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages. In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data). While this obviously benefit both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fallback to the default behaviour. Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3] *Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator. --- [1] There's e.g. PDF documents that use the same image as background on all pages. [2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer. [3] If the latter case were true, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
2020-05-18 14:17:56 +02:00
this._imageCache.clear();
}
}
export { LocalImageCache, LocalColorSpaceCache, GlobalImageCache };