Open Source Document Management System | OpenKM - [SOLVED] OCR not extracting text from PDFs, but scanned correctly

#52641 by LauryMenton
Sun Jul 18, 2021 6:19 pm

Hello, I have a problem with the text extractor in OpenKM (CE): it does not extract the text from any PDF, although every document is reported as processed correctly (checked from the Administration panel).

As an example, this is one of the documents that was reported as successfully analyzed by OCR, yet its 'NDC_TEXT' column is empty.

The PDF above and the others I have uploaded to OpenKM are all OCR-compatible and fully searchable.

I left the list of included text extractors (registered.text.extractors) at its default:

Code:

com.openkm.extractor.PlainTextExtractor
com.openkm.extractor.MsWordTextExtractor
com.openkm.extractor.MsExcelTextExtractor
com.openkm.extractor.MsPowerPointTextExtractor
com.openkm.extractor.OpenOfficeTextExtractor
com.openkm.extractor.RTFTextExtractor
com.openkm.extractor.HTMLTextExtractor
com.openkm.extractor.XMLTextExtractor
com.openkm.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

I also left system.ocr without any value. I did try setting it to tesseract (value=/usr/bin/tesseract ${fileIn} ${fileOut} -l spa), but since that is used for images, I cleared the field again (it didn't extract any text in the test I made).

Any ideas?

- - - - - - - - - - - - - - - - -

Some useful information:
- Docker installation (v6.3.11).

Last edited by LauryMenton on Mon Jul 26, 2021 3:21 pm, edited 1 time in total.

Re: OCR not extracting text from PDFs, but scanned correctly

#52655 by jllort
Thu Jul 22, 2021 2:18 pm

Go to Administration > Tools > Check text extractor and check whether the application extracts the text from there.
https://docs.openkm.com/kcenter/view/ok ... ction.html

Re: OCR not extracting text from PDFs, but scanned correctly

#52656 by LauryMenton
Thu Jul 22, 2021 2:52 pm

Already checked, forgot to mention it.

After providing the UUID and path and uploading the document, nothing is shown, just the grid below with no values set.

The text extractor seems to be: com.openkm.extractor.AbbyTextExtractor, as shown there.

Re: OCR not extracting text from PDFs, but scanned correctly

#52661 by jllort
Sun Jul 25, 2021 8:37 am

* The text extractor used should be tesseract.
* Ensure system.ocr is configured correctly (check tesseract from the command line).
* Restart OpenKM and try again with Check text extractor in the administration.

If it does not work, share some screenshots:
* system.ocr configuration screen
* administration check text extraction screen
* the document you are trying to process from Administration > Check text extraction
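
One way to check tesseract from the command line, as suggested in the list above, is to expand the system.ocr value by hand and run it on a sample image. The sketch below assumes that ${fileIn} and ${fileOut} are replaced with plain file paths before the command runs; the placeholders come from the value quoted earlier in this thread, but the exact substitution OpenKM performs is an assumption. The helper names and sample paths are hypothetical.

```python
import shlex
import subprocess


def expand_ocr_command(template, file_in, file_out):
    """Split a system.ocr-style template into argv and fill in the placeholders.

    Assumption: ${fileIn} and ${fileOut} stand for plain file paths.
    """
    argv = []
    for token in shlex.split(template):
        token = token.replace("${fileIn}", file_in)
        token = token.replace("${fileOut}", file_out)
        argv.append(token)
    return argv


def run_ocr_check(template, file_in, file_out):
    """Run the expanded command and return (exit code, stderr text)."""
    argv = expand_ocr_command(template, file_in, file_out)
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stderr


if __name__ == "__main__":
    template = "/usr/bin/tesseract ${fileIn} ${fileOut} -l spa"
    # Hypothetical sample paths; point them at a real scanned image to test.
    print(expand_ocr_command(template, "/tmp/sample.png", "/tmp/sample"))
```

Calling run_ocr_check with a real image shows whether the binary itself fails, for example because a language file is missing, independently of OpenKM. The template is split before substitution so that paths containing spaces stay in a single argument.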

Re: OCR not extracting text from PDFs, but scanned correctly

#52666 by LauryMenton
Mon Jul 26, 2021 3:21 pm

The OCR is now working properly.

As you suggested, I just needed to:
1. Configure 'tesseract' as the primary OCR tool (system.ocr).
2. Install the missing language pack for Spanish; I had assumed it was already included in the OpenKM installation.

Solved. And thanks again!
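
A missing language pack like this one can be spotted by comparing the -l value with what tesseract reports as installed. This is a minimal sketch, assuming `tesseract --list-langs` prints a header line followed by one language code per line. The function names are hypothetical, and the sample output is illustrative, not captured from a real installation.

```python
import subprocess


def parse_installed_languages(list_langs_output):
    """Return the language codes found in `tesseract --list-langs` style output.

    Assumption: the header line starts with "List of" and every other
    non-empty line is a language code.
    """
    languages = set()
    for line in list_langs_output.splitlines():
        line = line.strip()
        if not line or line.startswith("List of"):
            continue
        languages.add(line)
    return languages


def missing_languages(lang_spec, installed):
    """Return the codes in a -l value (such as "spa") that are not installed."""
    return [code for code in lang_spec.split("+") if code and code not in installed]


def installed_languages(binary="tesseract"):
    """Ask the tesseract binary which language packs it can see."""
    result = subprocess.run([binary, "--list-langs"], capture_output=True, text=True)
    # Read both streams in case the list is not printed to stdout
    # (an assumption about some tesseract versions).
    return parse_installed_languages(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    # Illustrative output only, not taken from a real machine.
    sample = "List of available languages (2):\neng\nosd\n"
    print(missing_languages("spa", parse_installed_languages(sample)))
```

On Debian or Ubuntu based systems the Spanish data is usually shipped in the tesseract-ocr-spa package; whether that applies to a given Docker image is something to verify.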

Re: [SOLVED] OCR not extracting text from PDFs, but scanned correctly

#52679 by jllort
Sat Jul 31, 2021 10:46 am

With tesseract 4 you can use several dictionaries with the parameter -l eng+spa (the OCR engine tries to identify the language of the document and applies one of the selected dictionaries).
