Search Config Question: Dealing with (lack of) spaces in OCR text

ebellempire · August 22, 2023, 4:42pm

I’m using version 8.7.4, installed manually on an existing website. The PDFs on this site are historical documents whose OCR’d text generally has no spaces between words, so searching for a phrase basically never works. See example below.

Single word search

Note the OCR text preview

Is there a configuration that can help with this issue (e.g. by turning spaces into an optional wildcard)? I know that the core PDFjs viewer, as implemented in Firefox, does not have this limitation, so maybe there’s a well-known way to deal with this.

Below is my code, which is pretty close to default.

  const pdf = document.querySelector("[data-pdf]");
  if(pdf){
    WebViewer({
      path: 'xxx',
      licenseKey: "xxx",
      initialDoc: pdf.dataset.pdf,
      disabledElements: [
        'toolsHeader',
        'ribbonsDropdown',
        'toggleNotesButton',
        'printButton',
        'ribbons',
        'contextMenuPopup'
      ]
    }, document.getElementById('viewer')).then(instance => {
      const docViewer = instance.docViewer;
      const annotManager = instance.annotManager;
      instance.UI.setTheme('dark');
    });
  }

Thanks!

NOTE: Continuing with some additional screenshots below (new users can only include one file upload per post).

ebellempire · August 22, 2023, 4:43pm

Searching a phrase w/o spaces

Note the OCR text preview

ebellempire · August 22, 2023, 4:44pm

Searching a phrase with spaces

ebellempire · August 22, 2023, 4:46pm

Same document search in Firefox

Works as expected in PDFjs, as implemented in Firefox

darian.chen · August 23, 2023, 6:47pm

Hi there,

You would have to implement your own regex search.

Here is a link to a useful guide on searching: Basic searching | Documentation
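
As a rough illustration of the regex approach (a minimal sketch, not an official recipe), the idea is to turn the user's phrase into a pattern where the whitespace between words is optional. The helper names `escapeRegex` and `phraseToSpaceOptionalPattern` are hypothetical, and the commented-out `searchTextFull` call with `{ regex: true }` is an assumption that should be checked against the searching guide for your version.

```javascript
// Escape characters that have special meaning in a regular expression,
// so the user's words are matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a regex source string in which every run of
// whitespace between words is replaced by \s* (zero or more whitespace).
function phraseToSpaceOptionalPattern(phrase) {
  return phrase
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map(escapeRegex)
    .join('\\s*');
}

const pattern = phraseToSpaceOptionalPattern('General Assembly');
const regex = new RegExp(pattern, 'i');

console.log(pattern);                                // General\s*Assembly
console.log(regex.test('theGeneralAssemblyof1850')); // true
console.log(regex.test('the General Assembly of'));  // true

// Assumption: inside the WebViewer(...).then(instance => { ... }) callback,
// the pattern could be passed to the viewer's search in regex mode, e.g.:
//   instance.UI.searchTextFull(pattern, { regex: true });
```

Because the built-in search box sends the typed text as-is, you would likely need to run the conversion yourself (for example from your own input field) before triggering the search.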

Thank you.

Best Regards,
Darian Chen
