Hello,
I'm trying to set up OpenKM as a DMS for scanned documents. I scan my documents with a Xerox workstation, which already performs OCR and creates PDFs with a text overlay.
If I open one of those PDFs, I can copy the text into a text editor, and it comes out well formatted, as good as the OCR capabilities of this Xerox device allow (I attached a sample PDF from which I can copy the following text into an editor):
Code:
The quick brown fox jumps over the lazy dog
Unfortunately, if I upload this PDF into OpenKM and let the indexer run, there are spaces between every character:
Code:
mysql> select * from OKM_NODE_DOCUMENT;
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
| NDC_CHECKED_OUT | NDC_CIPHER_NAME | NDC_DESCRIPTION | NDC_ENCRYPTION | NDC_LANGUAGE | NDC_LAST_MODIFIED | NLK_CREATED | NLK_OWNER | NLK_TOKEN | NDC_LOCKED | NDC_MIME_TYPE | NDC_SIGNED | NDC_TEXT | NDC_TEXT_EXTRACTED | NDC_TITLE | NBS_UUID |
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
| F | NULL | NULL | F | cs | 2018-10-28 21:25:14 | NULL | NULL | NULL | F | application/pdf | F | T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g | T | | 9a459065-e493-4d95-8641-e1d84ed97dbb |
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
Therefore the content is not really searchable.
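To check whether other indexed documents show the same symptom, the stored text could be tested with a small heuristic like the one below. This is only a rough sketch: the function name `looks_letter_spaced`, the minimum token count, and the 0.8 threshold are illustrative assumptions and not part of OpenKM. It only detects the problem; it does not repair the text.

```python
def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Return True if most whitespace-separated tokens are single characters.

    Heuristic only: extracted text such as "T h e q u i c k" consists almost
    entirely of one-character tokens, while normal prose does not.
    """
    tokens = text.split()
    # Too little text to judge reliably (the cutoff of 5 is an arbitrary assumption).
    if len(tokens) < 5:
        return False
    single_char_tokens = sum(1 for token in tokens if len(token) == 1)
    return single_char_tokens / len(tokens) >= threshold


if __name__ == "__main__":
    # The value found in NDC_TEXT versus the text copied from the PDF.
    indexed = "T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g"
    copied = "The quick brown fox jumps over the lazy dog"
    print(looks_letter_spaced(indexed))
    print(looks_letter_spaced(copied))
```

Running this over the values of the NDC_TEXT column would show whether every scanned PDF is affected or only some of them.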
Can somebody point me in the right direction as to what's going wrong here?
Cheers,
Henrik