1
0
Fork 0
mirror of https://github.com/mozilla/pdf.js.git synced 2025-04-19 06:38:07 +02:00
Commit graph

151 commits

Author SHA1 Message Date
Jonas Jenwald
a2d15ceb84 Remove a few eslint-disable statements in the web/ folder
These cases could be easily re-written to avoid having to disable ESLint rules.
2025-02-17 13:40:09 +01:00
Ujjwal Sharma
61ba1ea48c Enable automatic URL linking
Automatically detect links in the text content of a file and automatically
generate link annotations at the appropriate locations to achieve
automatic link detection and hyperlinking.
2025-02-05 16:56:54 +01:00
Nicolò Ribaudo
8358ab63b3
Allow searching for number-number on two lines
When a dash separates two digits, it's very likely to not be a hyphen
inserted to split a word into two lines (e.g. "par\n-ser"), but rather
either a minus sign, a range, or a date. For example, in the tracemonkey
PDF there is `2008-02` (a date) split across two lines.

Preserving the dash, similarly to how we do for compound words, allows
searches for "2008-02" to find a match.
2025-01-15 14:23:04 +01:00
Calixte Denizet
94d53d5b45 Very slightly improve the performance when searching in a pdf
It helps to slightly decrease memory use in reducing the number of created arrays.
In searching for "a" in pdf.pdf, the time spent in getOriginalIndex is decreased by
around 30%.
2024-11-28 17:37:19 +01:00
Calixte Denizet
aa9503e51f Correctly compute the mapping between text and normalized text when it contains a compound word on two lines
It fixes #19120.

The original text doesn't contain the cr so we must take that into account.
2024-11-28 15:56:04 +01:00
Calixte Denizet
06f9d8002d Consider foo-\nBar as a compound word
Fixes #18693.
2024-09-11 15:01:54 +02:00
Nicolò Ribaudo
f051597e23
Allow specifying custom match logic in PDFFindController
This patch allows embedders of PDF.js to provide custom match
logic for seaching in PDFs. This is done by subclassing the
PDFFindController class and overriding the `match` method.

`match` is called once per PDF page, receives as parameters the
search query, the page contents, and the page index, and returns
an array of { index, length } objects representing the search
results.
2024-08-13 10:45:57 +02:00
bootleq
890c567eca Expose entireWord in updateFindControlState
Allow apps with supportsIntegratedFind to better monitor the find state.

A recognized use case is the Firefox findbar, its "not found" sound must
consider `entireWord` and only make noise when it is off.

See related implementation in
https://hg.mozilla.org/mozilla-central/rev/16b902cbcf26

This change can help if we have to move the implementation from cpp to jsm.
2024-06-21 13:12:59 +08:00
Jonas Jenwald
e4d0e84802 [api-minor] Replace the PromiseCapability with Promise.withResolvers()
This replaces our custom `PromiseCapability`-class with the new native `Promise.withResolvers()` functionality, which does *almost* the same thing[1]; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers

The only difference is that `PromiseCapability` also had a `settled`-getter, which was however not widely used and the call-sites can either be removed or re-factored to avoid it. In particular:
 - In `src/display/api.js` we can tweak the `PDFObjects`-class to use a "special" initial data-value and just compare against that, in order to replace the `settled`-state.
 - In `web/app.js` we change the only case to manually track the `settled`-state, which should hopefully be OK given how this is being used.
 - In `web/pdf_outline_viewer.js` we can remove the `settled`-checks, since the code should work just fine without it. The only thing that could potentially happen is that we try to `resolve` a Promise multiple times, which is however *not* a problem since the value of a Promise cannot be changed once fulfilled or rejected.
 - In `web/pdf_viewer.js` we can remove the `settled`-checks, since the code should work fine without them:
     - For the `_onePageRenderedCapability` case the `settled`-check is used in a `EventBus`-listener which is *removed* on its first (valid) invocation.
     - For the `_pagesCapability` case the `settled`-check is used in a print-related helper that works just fine with "only" the other checks.
 - In `test/unit/api_spec.js` we can change the few relevant cases to manually track the `settled`-state, since this is both simple and *test-only* code.

---
[1] In browsers/environments that lack native support, note [the compatibility data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers#browser_compatibility), it'll be polyfilled via the `core-js` library (but only in `legacy` builds).
2024-04-01 11:42:37 +02:00
Jonas Jenwald
f9a384d711 Enable the arrow-body-style ESLint rule
This manually ignores some cases where the resulting auto-formatting would not, as far as I'm concerned, constitute a readability improvement or where we'd just end up with more overall indentation.

Please see https://eslint.org/docs/latest/rules/arrow-body-style
2024-01-21 16:20:55 +01:00
Jonas Jenwald
f87ec67ab1 [api-major] Remove various deprecated functionality and options 2023-09-23 17:44:09 +02:00
Jonas Jenwald
c018070e80 Enable the no-lonely-if ESLint rule
These changes were mostly done automatically, using `gulp lint --fix`, and only a few spots with comments needed manual tweaking; please see https://eslint.org/docs/latest/rules/no-lonely-if
2023-07-21 20:10:44 +02:00
Jonas Jenwald
d0bf505312 Re-factor the isPageVisible-handling in the find-controller (PR 10217 follow-up)
The way that this was implemented in PR 10217 has always bothered me slightly, since the `isPageVisible`-method that I introduced there always felt quite out-of-place in the `IPDFLinkService`-implementations.
Hence this is instead replaced by a callback-function in `PDFFindController`, to handle the page-visibility checks. Note that since the `PDFViewer`-constructor always sets this callback-function, e.g. the viewer-component examples still work as-is.
2023-05-26 13:59:39 +02:00
Jonas Jenwald
317abd6d07 Change the createPromiseCapability helper function into a PromiseCapability class
This is not only slightly more compact, but it also simplifies the handling of the `settled` getter.
2023-04-29 13:43:24 +02:00
Sebastien Corbin
d18b9ee472 Update type documentations for #16307 and #16359 2023-04-28 09:28:21 +02:00
Calixte Denizet
117bbf7cd9 [api-minor] Don't normalize the text used in the text layer.
Some arabic chars like \ufe94 could be searched in a pdf, hence it must be normalized
when creating the search query. So to avoid to duplicate the normalization code,
everything is moved in the find controller.
The previous code to normalize text was using NFKC but with a hardcoded map, hence it
has been replaced by the use of normalize("NFKC") (it helps to reduce the bundle size
by 30kb).
In playing with this \ufe94 char, I noticed that the bidi algorithm wasn't taking into
account some RTL unicode ranges, the generated font wasn't embedding the mapping this
char and the unicode ranges in the OS/2 table weren't up-to-date.

When normalized some chars can be replaced by several ones and it induced to have
some extra chars in the text layer. To avoid any regression, when copying some text
from the text layer, a copied string is normalized (NFKC) before being put in the
clipboard (it works like this in either Acrobat or Chrome).
2023-04-17 14:31:23 +02:00
Jonas Jenwald
0e19c3a120 [api-minor] Add support, in PDFFindController, for mixing phrase/word searches (issue 7442)
*Please note:* This patch only extends the `PDFFindController` implementation itself to support this functionality, however it's *purposely* not exposed in the default viewer.

This replaces the previous `phraseSearch`-parameter, and a `query`-string will now always be interpreted as a phrase-search.
To enable searching for individual words, the `query`-parameter must instead consist of an Array of strings. This way it's now also possible to combine phrase/word searches, with a `query`-parameter looking something like `["Lorem ipsum", "foo", "bar"]` which will search for the phrase "Lorem ipsum" *and* the words "foo" respectively "bar".
2023-04-15 13:32:37 +02:00
Calixte Denizet
d8795f9f8f Fix search of numbers inside fractions 2023-04-11 20:57:26 +02:00
Jonas Jenwald
1fc09f0235 Enable the unicorn/prefer-string-replace-all ESLint plugin rule
Note that the `replaceAll` method still requires that a *global* regular expression is used, however by using this method it's immediately obvious when looking at the code that all occurrences will be replaced; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replaceAll#parameters

Please find additional details at https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-string-replace-all.md
2023-03-23 12:57:10 +01:00
Jonas Jenwald
e981badb94 Mark updateMatchesCountOnProgress, in the PDFFindControllerOptions, as optional (issue 15990) 2023-03-13 16:46:33 +01:00
Calixte Denizet
07b094729e Fix search in pdf a containing some UTF-32 characters (bug 1820909)
Some chars were supposed to have a length equals to 1 but UTF-32 chars
can be longuer.
2023-03-09 15:03:01 +01:00
Calixte Denizet
fc7d74385f Don't replace an eol by a whitespace when the last char is a Katakana-Hiragana diacritic 2023-02-16 11:31:58 +01:00
Calixte Denizet
dc94b750de [GV] Avoid to update the finder when the results aren't complete
At the beginning of a search we can an update can be triggered with 0 over 0
found matches.
In the GeckoView context, we can't update the finder whenever we want but only
when it has been required.
2023-01-20 18:13:16 +01:00
Calixte Denizet
661f425934 [GV] Add an option in the find controller to update matches count only when the last page is reached (bug 1803188).
In GeckoView, on an event, a callback must be executed with the result of an action,
but the callback can be used only one time.
So for each FindInPage event, we must trigger only one matches count update.
2023-01-06 10:56:26 +01:00
Calixte Denizet
69c88477a9 Avoid an infinite loop when searching for a single diacritic 2023-01-02 12:27:07 +01:00
Calixte Denizet
ea1995991b Don't add an extra space after a Katakana or a Hiragana at the eol when searching 2022-11-29 10:46:48 +01:00
Jonas Jenwald
176e8f0ddc Initialize the find-related DIACRITICS_EXCEPTION_STR constant lazily
Adding some logging with `console.{time, timeEnd}` around all the constant definitions at the top of the `web/pdf_find_controller.js` file, I noticed that computing `DIACRITICS_EXCEPTION_STR` took close to half the total time.
My first idea was just to try and make it slightly more efficient, by reducing the amount of iterations and intermediate allocations. However, with this constant only being used during "match diacritics" searches it thus seemed like a good candidate for lazy initialization.

*Please note:* Given that this is a micro optimization, I fully understand if the patch is rejected.
2022-11-15 12:46:16 +01:00
Calixte Denizet
2be64d63e1 Normalize fullwidth, halfwidth and circled chars when searching 2022-11-14 19:27:51 +01:00
Calixte Denizet
6c6f6fb2b8 Don't replace cr by a white space when the last char on the line is an ideographic char 2022-09-04 14:21:05 +02:00
Jonas Jenwald
0024165f1f Move binarySearchFirstItem back to the web/-folder (PR 15237 follow-up)
This was moved into the `src/display/`-folder in PR 15110, for the initial editor-a11y patch. However, with the changes in PR 15237 we're again only using `binarySearchFirstItem` in the `web/`-folder and it thus seem reasonable to move it back there.
The primary reason for moving it back is that `binarySearchFirstItem` is currently exposed in the public API, and we always want to avoid that unless it's either PDF-related functionality or code that simply must be shared between the `src/`- and `web/`-folders. In this case, `binarySearchFirstItem` is a general helper function that doesn't really satisfy either of those alternatives.
2022-08-14 11:38:17 +02:00
Calixte Denizet
624b26e1de [Editor] Improve a11y for newly added element (#15109)
- In the annotationEditorLayer, reorder the editors in the DOM according
  the position of the elements on the screen;
- add an aria-owns attribute on the "nearest" element in the text layer
  which points to the added editor.
2022-07-19 18:52:17 +02:00
Calixte Denizet
c7afce4210 Support Hangul syllables when searching some text (bug 1771477)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1771477;
- hangul contains some syllables which are decomposed when using NFD, hence
  the text must be correctly shifted in case it contains some of them.
2022-05-28 16:50:03 +02:00
Jonas Jenwald
61a52e8043 Convert all "private" methods in PDFFindController into proper ones
Given that none of these methods are/were ever intended to be called manually, we can now enforce this with modern class-features.
2022-03-19 12:26:03 +01:00
Jonas Jenwald
cc1bca6268 Slightly simplify the PDFFindController._extractText method
Currently we're resolving the Promises in the `_extractTextPromises` Array with the page-index, despite that not really being necessary since the Promises in the Array are explicitly inserted in the correct order.
Furthermore, we can replace the standard `for`-loop with a `for...of`-loop which results in ever so slightly more compact code.
2022-03-19 12:13:29 +01:00
Jonas Jenwald
38d30f3be5 Remove the deprecated PDFFindController.executeCommand method
This *partially* reverts commit fa8c0ef616, since it's now been included in two official releases.
2022-03-02 11:23:14 +01:00
Calixte Denizet
18f4e560ae [Search] Some matches were incorrectly shifted because of some '-\n'
- it aims to fix #14562;
- 'X-\n' were not correctly positioned;
- when X is a diacritic (e.g. in "sä-\n", which is decomposed into "sa¨-\n") we must handle both things:
  - diacritics on the one hand;
  - "-\n" on the other hand.
2022-02-14 10:12:33 +01:00
Calixte Denizet
1f41028fcb Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113)
- get original index in using a dichotomic seach instead of a linear one;
  - normalize the text in using NFD;
  - convert the query string into a RegExp;
  - replace whitespaces in the query with \s+;
  - handle hyphens at eol use to break a word;
  - add some \s* around punctuation signs
2022-02-03 15:42:55 +01:00
Jonas Jenwald
403baa7bba [api-minor] Remove the normalizeWhitespace option in the PDFPageProxy.{getTextContent, streamTextContent} methods (issue 14519, PR 14428 follow-up)
With these changes, we'll now *always* replace all whitespaces with standard spaces (0x20). This behaviour is already, since many years, the default in both the viewer and the browser-tests.
2022-02-03 09:17:22 +01:00
Jonas Jenwald
e19020c028 Move the Default{...}LayerFactory into a new web/default_factory.js file
This patch, first of all, removes circular dependencies in the TypeScript definitions. Secondly, it also moves `RenderingStates` into `web/ui_utils.js` to break another type-dependency and directly use the `XfaLayerBuilder` during XFA-printing.
Finally, note that this patch *slightly* reduces the size of the default viewer (e.g. in the `MOZCENTRAL` build) by not having to bundle code which is completely unused.
2021-12-15 23:17:08 +01:00
Jonas Jenwald
e0dba504d2 Fix broken/missing JSDocs and typedefs, to allow updating TypeScript to the latest version (issue 14342)
This patch circumvents the issues seen when trying to update TypeScript to version `4.5`, by "simply" fixing the broken/missing JSDocs and `typedef`s such that `gulp typestest` now passes.
As always, given that I don't really know anything about TypeScript, I cannot tell if this is a "correct" and/or proper way of doing things; we'll need TypeScript users to help out with testing!

*Please note:* I'm sorry about the size of this patch, but given how intertwined all of this unfortunately is it just didn't seem easy to split this into smaller parts.
However, one good thing about this TypeScript update is that it helped uncover a number of pre-existing bugs in our JSDocs comments.
2021-12-15 23:14:25 +01:00
Jonas Jenwald
fa8c0ef616 [api-minor] Change PDFFindController to use the "find"-event directly (issue 12731)
Looking at the code, I do have to agree with the point made in issue 12731 about it being unexpected/unhelpful that the `PDFFindController.executeCommand`-method isn't directly usable with the "find"-event.
The reason for it being this way is, as so often, for historical reasons: The `executeCommand`-method was added (just) prior to the introduction of the `EventBus` in the viewer.

Obviously we cannot simply change the existing `PDFFindController.executeCommand`-method, since that'd be a breaking change in code which has existed for over five years.
Initially I figured that we could simply add a new method in `PDFFindController` that'd accept the state from the "find"-event, however after thinking about this and looking through the use-cases in the default viewer I settled on a slightly different approach: Let the `PDFFindController` just listen for the "find"-event (on the `EventBus`-instance) directly instead, which also removes one level of (unneeded) indirection during searching in the default viewer.

For GENERIC builds of the PDF.js library, the old `PDFFindController.executeCommand`-method is still available with a deprecation warning.
2021-10-16 10:36:22 +02:00
Jonas Jenwald
b42120bdb0 Take the position of the selected element into account when scrolling matches (issue 13596)
Note that as far as I can tell, this is *not* a regression but rather a bug which has existed since basically "forever".

**In order to reproduce this easily:**
 - Open the viewer.
 - Set the zoom level to `400%`,
 - Search for "expression".

The problem here is that when scrolling matches into view, we're scrolling to the start of the *containing* `textLayer` element rather than the start of the highlighted match itself.[1] When the entire width (or at least most) of the page is visible in the viewer, that doesn't really matter though which is likely why this bug has gone unnoticed for so long.[2]
Given that the highlighted match can be placed anywhere, e.g. even at the very end, within its `textLayer` element it's quite easy to see why the current implementation becomes a problem at higher zoom levels. All of this is then *further* exacerbated by `PDFFindController.scrollMatchIntoView` using a negative left offset, to ensure that the current match has some (visible) context available once scrolled into view.

In order to address this long-standing bug, we'll determine the (left) offset of the `selected` match and use that to modify the final position scrolled to in `PDFFindController.scrollMatchIntoView` such that the match is visible regardless of zoom level.

---
[1] Unfortunately we cannot directly scroll to the `selected` match, since it's not absolutely positioned and changing that would cause other bugs/regressions (note recent patches in that area).

[2] I did actually stumble upon this problem a little while ago, while working on PR 13482, but forgot to look into this again until I saw the new issue.
2021-06-21 11:49:33 +02:00
Tim van der Meij
ed0990ab6f
Merge pull request #13492 from MMeent/patch-1
Add normalization for Hyphen -> Hyphen-minus
2021-06-04 21:00:02 +02:00
MMeent
3631121841
Add normalization for Hyphen -> Hyphen-minus
Previously these two characters were not searchable interchangably, even when Hyphen-Minus is being changed to Hyphen in some text to PDF pipelines.
2021-06-04 15:54:52 +02:00
Jonas Jenwald
29e6930bb6 Fix scrolling of search results in documents with marked content (bug 1714183)
This regressed in PR 13171, since the `span`s with the marked content identifiers interfere with scrolling of search results.
2021-06-03 12:41:51 +02:00
Tim van der Meij
ff393d6e96
Convert the pendingFindMatches member, in web/pdf_find_controller.js, from an object to a set
We only want to track page numbers instead of actual data, so using a
set conveys that intention more clearly and is slightly more efficient.
2021-04-05 19:33:53 +02:00
Jonas Jenwald
bc13932ac1 Use more optional chaining in the web/-folder (PR 12961 follow-up)
I overlooked these cases previously, but there's no reason why optional chaining (and nullish coalescing) cannot be used here as well.
2021-03-07 16:20:52 +01:00
Ross Johnson
6dae2677d5 [api-minor] Highlight search results correctly for normalized text (PR 9448)
This patch is a rebased *and* refactored version of PR 9448, such that it applies cleanly given that `PDFFindController` has changed since that PR was opened; obviously keeping the original author information intact.

This patch will thus ensure that e.g. fractions, and other things that we normalize before searching, will still be highlighted correctly in the textLayer.

Furthermore, this patch also adds basic unit-tests for this functionality.

*Note:* The `[api-minor]` tag is added, since third-party implementations of the `PDFFindController` must now always use the `pageMatchesLength` property to get accurate length information (see the `web/text_layer_builder.js` changes).

Co-authored-by: Ross Johnson <ross@mazira.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
2021-01-12 18:08:08 +01:00
DesWurstes
72f48ee089 Return the query with the findcontrols 2020-08-20 11:18:43 +01:00
Jonas Jenwald
426945b480 Update Prettier to version 2.0
Please note that these changes were done automatically, using `gulp lint --fix`.

Given that the major version number was increased, there's a fair number of (primarily whitespace) changes; please see https://prettier.io/blog/2020/03/21/2.0.0.html
In order to reduce the size of these changes somewhat, this patch maintains the old "arrowParens" style for now (once mozilla-central updates Prettier we can simply choose the same formatting, assuming it will differ here).
2020-04-14 12:28:14 +02:00