PDF Search

Posted: Mon Oct 04, 2021 5:50 am
by farkinid2
Long-time OpenKM user here. I've been running OpenKM Community in my own office for a while now but have never encountered this problem.

I've deployed an OpenKM Community server on a VM as a test bed. It is currently running OpenKM 6.3.11. The system is operating as intended except with regard to PDF files.

I've uploaded a couple of PDF files into the system, but none of them has been successfully indexed. The files are a mix of scanned documents, print-to-PDF documents, and scanned documents that I manually ran through Tesseract to convert the text to fonts. At this point, search works for all docx, xlsx, and txt files, but no text has been extracted from any PDF.

All files have already been processed in the text extractor (no files in queue). I've attached a screenshot of the list of words extracted for a test file as well as a sample pdf file.

On a side note, if we were to subscribe to OpenKM online but have very large scanned PDF files to process, what sort of limitations would we be facing? For example, some users' files are scanned documents of approximately 800 MB each. There are approximately 50,000 files of varying sizes.
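
For a rough sense of scale, here is a minimal Python sketch of the worst case. It assumes every one of the roughly 50,000 files is as large as the biggest ones (about 800 MB); since the sizes vary, the real total would be lower. The function name is hypothetical and the result uses decimal units.

```python
def worst_case_storage_tb(file_count: int, max_file_mb: float) -> float:
    """Upper bound on total size in decimal terabytes.

    Assumes (pessimistically) that every file is max_file_mb megabytes.
    """
    total_mb = file_count * max_file_mb
    return total_mb / 1_000_000  # 1 TB = 1,000,000 MB in decimal units


# Figures from the post: ~50,000 files, largest around 800 MB each.
print(worst_case_storage_tb(50_000, 800))  # 40.0
```

So the upper bound is about 40 TB, which is the figure any storage or upload limit would have to be compared against.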

Re: PDF Search

Posted: Thu Oct 07, 2021 10:38 pm
by farkinid2
Just a quick note. I've managed to resolve the situation.

I went to Utilities -> Plugins -> Text Extractor and disabled the Cuneiform text extractor.

Re: PDF Search

Posted: Sat Oct 09, 2021 6:32 pm
by jllort
I have updated this section in the documentation, https://docs.openkm.com/kcenter/view/ok ... ngine.html, because other users had the same issue (take a look at the warning section at the top).

Re: PDF Search

Posted: Sat Oct 16, 2021 12:24 pm
by saleem55
I have the same problem: I cannot extract text from PDF files.

Re: PDF Search

Posted: Sat Oct 16, 2021 12:43 pm
by saleem55
Resolved.
Disabling the force PDF OCR option fixed it.
The solution was in one of the resolved topics.