Searches show no results

Posted: Fri Jul 21, 2017 4:49 pm
by xtrailrunner
I have installed OpenKM 6.3.4 on top of Linux Mint 17.3.
As the next step I uploaded several documents into different folders with keywords, categories and notes (as user admin).
But I cannot see a preview of the documents and searches show no results.
I added the German lexer to the OpenKM.cfg file and rebuilt the indexes, but searches still return no results.
Am I missing something, or should I use a user other than admin to upload the documents?
Any advice is welcome.
Regards Juergen

Re: Searches show no results

Posted: Sat Jul 22, 2017 10:00 am
by jllort
If you are trying to search by content, keep in mind that documents go into a queue ( Administration / stats / pending extraction queue ) and are processed every 5 minutes ( the crontab task named "text extraction" takes care of it ).

Focus on a single document. With this procedure you can check the text extraction process ( https://docs.openkm.com/kcenter/view/ok ... ction.html -> copy the document UUID, paste it there, and execute. It lets you check both the text extraction process and the extracted text ).

With a database query you can check whether a document has already been extracted ( https://docs.openkm.com/kcenter/view/ok ... query.html ); the query for it is:
Code:
select * from OKM_NODE_DOCUMENT where NBS_UUID = 'HERE THE UUID OF YOUR DOCUMENT';
The field NDC_TEXT_EXTRACTED is 'T' or 'F', indicating whether the document has been processed or not ( true / false ).
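
As a rough sketch, the same check could be scripted. The snippet below uses Python's sqlite3 with an in-memory table purely as a stand-in for the real OpenKM database (an assumption for illustration; a real installation would need its own database driver and connection settings). Only the table and column names come from the query above; the helper name is_text_extracted and the sample UUIDs are hypothetical.

```python
import sqlite3


def is_text_extracted(conn, uuid):
    """Return True/False for the extraction flag, or None if the UUID is unknown."""
    row = conn.execute(
        "SELECT NDC_TEXT_EXTRACTED FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),  # parameter binding avoids quoting problems with the UUID
    ).fetchone()
    if row is None:
        return None
    return row[0] == "T"


# Stand-in table with made-up rows; NOT the real OpenKM schema, only the two
# columns mentioned in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("uuid-done", "T"), ("uuid-pending", "F")],
)

for sample in ("uuid-done", "uuid-pending", "uuid-missing"):
    print(sample, is_text_extracted(conn, sample))
```

A result of False would suggest the document is still waiting in the pending extraction queue, while None would mean the UUID was not found at all.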

About the preview: please do not mix several topics in the same post. Open a new topic for it, thanks.

Re: Searches show no results

Posted: Sun May 06, 2018 2:43 pm
by xtrailrunner
Thanks a lot!
So far I have uploaded scanned documents in PDF format, but apparently the PDFs contain only one image per page (of which image type I don't know).

After adding the Tesseract OCR engine to my configuration I could extract keywords, but I still cannot find all documents using a specific keyword. So I used the "check extraction" function on a document I could not find. Because of the poor quality of the text extraction (German), the keyword was not recognized in the text.
What should I do:
- replace the engine with another one
- add an additional engine.
Will the quality of extraction improve if I scan my documents to PNG or JPG?
A disadvantage would be that a multi-page document would end up as multiple files to upload.
What would you recommend in such a situation?
Regards Juergen

Re: Searches show no results

Posted: Mon May 07, 2018 11:36 am
by jllort
I suggest running some tests.
1- Scan a document at 300-600 dpi as PNG ( then check the results ), and also as multipage TIFF
2- Once the previous step succeeds you have two options -> try scanning directly to PDF, or first to images and then convert them to PDF
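
A minimal sketch of how step 1 could be scripted, assuming the tesseract command-line tool and its German language data ("deu") are installed. The file names are hypothetical placeholders for your own test scans, and the helper names are made up for this example.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, lang="deu"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    image = Path(image)
    output_base = image.with_suffix("")  # tesseract appends ".txt" itself
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, lang="deu"):
    """Run tesseract on one image and return the recognized text."""
    subprocess.run(tesseract_command(image, lang), check=True)
    return Path(image).with_suffix(".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical test scans of the same page in different formats.
    for sample in ["scan_300dpi.png", "scan_600dpi.png", "scan_300dpi.tif"]:
        print(" ".join(tesseract_command(sample)))
```

The main block only prints the commands; calling run_ocr on a real scan would execute them and return the text, which can then be checked for the keywords you expect to find.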

You can also try our scanner tool for this; take a look at our OpenKM download section https://www.openkm.com/en/download.html

Once you find the right format you can go ahead with all your documents. About changing the OCR engine: Tesseract is the only one that really works in the open source world; any other OCR engine will add some cost. We ran some tests in the past with ocr4linux ( https://www.ocr4linux.com/en:start ), which works really well, but it restricts the number of pages that can be processed ( it depends on your license ).

Re: Searches show no results

Posted: Mon May 07, 2018 3:31 pm
by xtrailrunner
Thanks. The OpenKM scanner seems to be available only for Windows. Because I'm using OpenKM on Linux, I'm not sure whether I could integrate it by running the scanner in a virtual machine with Windows.
Regards Juergen

Re: Searches show no results

Posted: Tue May 08, 2018 3:41 pm
by jllort
You can get OpenKM working in a VM, although it is not the best scenario. Configuring the scanner there is another story :)

To clarify: the tool matters less than finding the right scanning configuration. That's why I suggested testing several image formats etc., to isolate the best configuration for you, if one exists. You can begin from the top: 600 dpi with PNG format. If the OCR engine does not work with this configuration, forget open source; you will have to go for a commercial one.
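
One possible way to compare configurations, sketched under assumptions: for each scan setting, take the text the OCR engine extracted and count how many expected keywords it contains. The configuration names, sample texts, and keywords below are invented for illustration and are not real OCR output.

```python
def keyword_recall(text, keywords):
    """Fraction of the expected keywords that appear in the extracted text."""
    if not keywords:
        return 0.0
    lowered = text.casefold()  # case-insensitive comparison
    found = [k for k in keywords if k.casefold() in lowered]
    return len(found) / len(keywords)


def best_configuration(results, keywords):
    """Return the configuration name whose text contains the most keywords."""
    return max(results, key=lambda name: keyword_recall(results[name], keywords))


# Made-up extracted texts for the same page under two scan settings.
results = {
    "600dpi-png": "Rechnung Nr. 4711 vom 21. Juli 2017",
    "300dpi-jpg": "Rechnnng Nr. 47ll vom 2l. Juli 20l7",
}
keywords = ["Rechnung", "4711", "Juli"]

for name, text in results.items():
    print(name, round(keyword_recall(text, keywords), 2))
print("best:", best_configuration(results, keywords))
```

If even the best configuration misses most keywords, that would point toward the commercial-engine route mentioned above.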