Automatically detect links in the text content of a file and automatically
generate link annotations at the appropriate locations to achieve
automatic link detection and hyperlinking.
When a dash separates two digits, it's very likely to not be a hyphen
inserted to split a word into two lines (e.g. "par\n-ser"), but rather
either a minus sign, a range, or a date. For example, in the tracemonkey
PDF there is `2008-02` (a date) split across two lines.
Preserving the dash, similarly to how we do for compound words, allows
searches for "2008-02" to find a match.
It helps to slightly decrease memory use in reducing the number of created arrays.
In searching for "a" in pdf.pdf, the time spent in getOriginalIndex is decreased by
around 30%.
This patch allows embedders of PDF.js to provide custom match
logic for seaching in PDFs. This is done by subclassing the
PDFFindController class and overriding the `match` method.
`match` is called once per PDF page, receives as parameters the
search query, the page contents, and the page index, and returns
an array of { index, length } objects representing the search
results.
Allow apps with supportsIntegratedFind to better monitor the find state.
A recognized use case is the Firefox findbar, its "not found" sound must
consider `entireWord` and only make noise when it is off.
See related implementation in
https://hg.mozilla.org/mozilla-central/rev/16b902cbcf26
This change can help if we have to move the implementation from cpp to jsm.
This replaces our custom `PromiseCapability`-class with the new native `Promise.withResolvers()` functionality, which does *almost* the same thing[1]; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers
The only difference is that `PromiseCapability` also had a `settled`-getter, which was however not widely used and the call-sites can either be removed or re-factored to avoid it. In particular:
- In `src/display/api.js` we can tweak the `PDFObjects`-class to use a "special" initial data-value and just compare against that, in order to replace the `settled`-state.
- In `web/app.js` we change the only case to manually track the `settled`-state, which should hopefully be OK given how this is being used.
- In `web/pdf_outline_viewer.js` we can remove the `settled`-checks, since the code should work just fine without it. The only thing that could potentially happen is that we try to `resolve` a Promise multiple times, which is however *not* a problem since the value of a Promise cannot be changed once fulfilled or rejected.
- In `web/pdf_viewer.js` we can remove the `settled`-checks, since the code should work fine without them:
- For the `_onePageRenderedCapability` case the `settled`-check is used in a `EventBus`-listener which is *removed* on its first (valid) invocation.
- For the `_pagesCapability` case the `settled`-check is used in a print-related helper that works just fine with "only" the other checks.
- In `test/unit/api_spec.js` we can change the few relevant cases to manually track the `settled`-state, since this is both simple and *test-only* code.
---
[1] In browsers/environments that lack native support, note [the compatibility data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers#browser_compatibility), it'll be polyfilled via the `core-js` library (but only in `legacy` builds).
This manually ignores some cases where the resulting auto-formatting would not, as far as I'm concerned, constitute a readability improvement or where we'd just end up with more overall indentation.
Please see https://eslint.org/docs/latest/rules/arrow-body-style
The way that this was implemented in PR 10217 has always bothered me slightly, since the `isPageVisible`-method that I introduced there always felt quite out-of-place in the `IPDFLinkService`-implementations.
Hence this is instead replaced by a callback-function in `PDFFindController`, to handle the page-visibility checks. Note that since the `PDFViewer`-constructor always sets this callback-function, e.g. the viewer-component examples still work as-is.
Some arabic chars like \ufe94 could be searched in a pdf, hence it must be normalized
when creating the search query. So to avoid to duplicate the normalization code,
everything is moved in the find controller.
The previous code to normalize text was using NFKC but with a hardcoded map, hence it
has been replaced by the use of normalize("NFKC") (it helps to reduce the bundle size
by 30kb).
In playing with this \ufe94 char, I noticed that the bidi algorithm wasn't taking into
account some RTL unicode ranges, the generated font wasn't embedding the mapping this
char and the unicode ranges in the OS/2 table weren't up-to-date.
When normalized some chars can be replaced by several ones and it induced to have
some extra chars in the text layer. To avoid any regression, when copying some text
from the text layer, a copied string is normalized (NFKC) before being put in the
clipboard (it works like this in either Acrobat or Chrome).
*Please note:* This patch only extends the `PDFFindController` implementation itself to support this functionality, however it's *purposely* not exposed in the default viewer.
This replaces the previous `phraseSearch`-parameter, and a `query`-string will now always be interpreted as a phrase-search.
To enable searching for individual words, the `query`-parameter must instead consist of an Array of strings. This way it's now also possible to combine phrase/word searches, with a `query`-parameter looking something like `["Lorem ipsum", "foo", "bar"]` which will search for the phrase "Lorem ipsum" *and* the words "foo" respectively "bar".
At the beginning of a search we can an update can be triggered with 0 over 0
found matches.
In the GeckoView context, we can't update the finder whenever we want but only
when it has been required.
In GeckoView, on an event, a callback must be executed with the result of an action,
but the callback can be used only one time.
So for each FindInPage event, we must trigger only one matches count update.
Adding some logging with `console.{time, timeEnd}` around all the constant definitions at the top of the `web/pdf_find_controller.js` file, I noticed that computing `DIACRITICS_EXCEPTION_STR` took close to half the total time.
My first idea was just to try and make it slightly more efficient, by reducing the amount of iterations and intermediate allocations. However, with this constant only being used during "match diacritics" searches it thus seemed like a good candidate for lazy initialization.
*Please note:* Given that this is a micro optimization, I fully understand if the patch is rejected.
This was moved into the `src/display/`-folder in PR 15110, for the initial editor-a11y patch. However, with the changes in PR 15237 we're again only using `binarySearchFirstItem` in the `web/`-folder and it thus seem reasonable to move it back there.
The primary reason for moving it back is that `binarySearchFirstItem` is currently exposed in the public API, and we always want to avoid that unless it's either PDF-related functionality or code that simply must be shared between the `src/`- and `web/`-folders. In this case, `binarySearchFirstItem` is a general helper function that doesn't really satisfy either of those alternatives.
- In the annotationEditorLayer, reorder the editors in the DOM according
the position of the elements on the screen;
- add an aria-owns attribute on the "nearest" element in the text layer
which points to the added editor.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1771477;
- hangul contains some syllables which are decomposed when using NFD, hence
the text must be correctly shifted in case it contains some of them.
Currently we're resolving the Promises in the `_extractTextPromises` Array with the page-index, despite that not really being necessary since the Promises in the Array are explicitly inserted in the correct order.
Furthermore, we can replace the standard `for`-loop with a `for...of`-loop which results in ever so slightly more compact code.
- it aims to fix#14562;
- 'X-\n' were not correctly positioned;
- when X is a diacritic (e.g. in "sä-\n", which is decomposed into "sa¨-\n") we must handle both things:
- diacritics on the one hand;
- "-\n" on the other hand.
- get original index in using a dichotomic seach instead of a linear one;
- normalize the text in using NFD;
- convert the query string into a RegExp;
- replace whitespaces in the query with \s+;
- handle hyphens at eol use to break a word;
- add some \s* around punctuation signs
With these changes, we'll now *always* replace all whitespaces with standard spaces (0x20). This behaviour is already, since many years, the default in both the viewer and the browser-tests.
This patch, first of all, removes circular dependencies in the TypeScript definitions. Secondly, it also moves `RenderingStates` into `web/ui_utils.js` to break another type-dependency and directly use the `XfaLayerBuilder` during XFA-printing.
Finally, note that this patch *slightly* reduces the size of the default viewer (e.g. in the `MOZCENTRAL` build) by not having to bundle code which is completely unused.
This patch circumvents the issues seen when trying to update TypeScript to version `4.5`, by "simply" fixing the broken/missing JSDocs and `typedef`s such that `gulp typestest` now passes.
As always, given that I don't really know anything about TypeScript, I cannot tell if this is a "correct" and/or proper way of doing things; we'll need TypeScript users to help out with testing!
*Please note:* I'm sorry about the size of this patch, but given how intertwined all of this unfortunately is it just didn't seem easy to split this into smaller parts.
However, one good thing about this TypeScript update is that it helped uncover a number of pre-existing bugs in our JSDocs comments.
Looking at the code, I do have to agree with the point made in issue 12731 about it being unexpected/unhelpful that the `PDFFindController.executeCommand`-method isn't directly usable with the "find"-event.
The reason for it being this way is, as so often, for historical reasons: The `executeCommand`-method was added (just) prior to the introduction of the `EventBus` in the viewer.
Obviously we cannot simply change the existing `PDFFindController.executeCommand`-method, since that'd be a breaking change in code which has existed for over five years.
Initially I figured that we could simply add a new method in `PDFFindController` that'd accept the state from the "find"-event, however after thinking about this and looking through the use-cases in the default viewer I settled on a slightly different approach: Let the `PDFFindController` just listen for the "find"-event (on the `EventBus`-instance) directly instead, which also removes one level of (unneeded) indirection during searching in the default viewer.
For GENERIC builds of the PDF.js library, the old `PDFFindController.executeCommand`-method is still available with a deprecation warning.
Note that as far as I can tell, this is *not* a regression but rather a bug which has existed since basically "forever".
**In order to reproduce this easily:**
- Open the viewer.
- Set the zoom level to `400%`,
- Search for "expression".
The problem here is that when scrolling matches into view, we're scrolling to the start of the *containing* `textLayer` element rather than the start of the highlighted match itself.[1] When the entire width (or at least most) of the page is visible in the viewer, that doesn't really matter though which is likely why this bug has gone unnoticed for so long.[2]
Given that the highlighted match can be placed anywhere, e.g. even at the very end, within its `textLayer` element it's quite easy to see why the current implementation becomes a problem at higher zoom levels. All of this is then *further* exacerbated by `PDFFindController.scrollMatchIntoView` using a negative left offset, to ensure that the current match has some (visible) context available once scrolled into view.
In order to address this long-standing bug, we'll determine the (left) offset of the `selected` match and use that to modify the final position scrolled to in `PDFFindController.scrollMatchIntoView` such that the match is visible regardless of zoom level.
---
[1] Unfortunately we cannot directly scroll to the `selected` match, since it's not absolutely positioned and changing that would cause other bugs/regressions (note recent patches in that area).
[2] I did actually stumble upon this problem a little while ago, while working on PR 13482, but forgot to look into this again until I saw the new issue.
This patch is a rebased *and* refactored version of PR 9448, such that it applies cleanly given that `PDFFindController` has changed since that PR was opened; obviously keeping the original author information intact.
This patch will thus ensure that e.g. fractions, and other things that we normalize before searching, will still be highlighted correctly in the textLayer.
Furthermore, this patch also adds basic unit-tests for this functionality.
*Note:* The `[api-minor]` tag is added, since third-party implementations of the `PDFFindController` must now always use the `pageMatchesLength` property to get accurate length information (see the `web/text_layer_builder.js` changes).
Co-authored-by: Ross Johnson <ross@mazira.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Please note that these changes were done automatically, using `gulp lint --fix`.
Given that the major version number was increased, there's a fair number of (primarily whitespace) changes; please see https://prettier.io/blog/2020/03/21/2.0.0.html
In order to reduce the size of these changes somewhat, this patch maintains the old "arrowParens" style for now (once mozilla-central updates Prettier we can simply choose the same formatting, assuming it will differ here).