Open Source Document Management System

PDF Search

5 posts

#52900 by farkinid2
Mon Oct 04, 2021 5:50 am

Long-time OpenKM user here. I've been running OpenKM Community in my own office for a while now and have never encountered this problem before.

I've deployed an OpenKM Community server in a VM as a test bed, currently running OpenKM 6.3.11. The system works as intended except for PDF files.

I've uploaded a couple of PDF files into the system, but none of them has been indexed successfully. The files are a mix of scanned documents, print-to-PDF documents, and scanned documents that I manually converted to text with Tesseract. At this point the search function works for all docx, xlsx, and txt files, but no text has been extracted from any PDF.

All files have already been processed by the text extractor (no files in the queue). I've attached a screenshot of the list of words extracted for a test file, as well as a sample PDF file.

On a side note, if we were to subscribe to OpenKM online but have very large scanned PDF files to process, what sort of limitations would we be facing? For example, some users' files are scanned documents of approximately 800 MB per file. There are approximately 50,000 files of varying sizes.
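For a rough sense of scale, here is a minimal Python sketch of the worst-case storage figure implied by those numbers. It assumes every one of the roughly 50,000 files is as large as the biggest ones (about 800 MB), which overstates the real total since the sizes vary. The function name is hypothetical and only for illustration.

```python
# Rough upper bound only: assumes every file is as large as the biggest ones
# mentioned (about 800 MB), which the post says is not the case.
FILE_COUNT = 50_000          # approximate number of files, from the post
MAX_FILE_SIZE_MB = 800       # approximate size of the largest files, from the post


def upper_bound_storage_tb(file_count: int, max_file_size_mb: int) -> float:
    """Return the worst-case total size in decimal terabytes (1 TB = 1,000,000 MB)."""
    total_mb = file_count * max_file_size_mb
    return total_mb / 1_000_000


if __name__ == "__main__":
    print(f"Worst case: {upper_bound_storage_tb(FILE_COUNT, MAX_FILE_SIZE_MB):.0f} TB")
```

So the collection is at most about 40 TB, and likely well below that given the mix of sizes.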

Attachments

pdf_test.pdf: Test pdf file (43.05 KiB)

pdf_no_text.JPG: No words extracted (53.07 KiB)

Re: PDF Search

#52912 by farkinid2
Thu Oct 07, 2021 10:38 pm

Just a quick note. I've managed to resolve the situation.

I went to Utilities -> Plugins -> Text Extractor and disabled the Cuneiform text extractor.