• Uploading images - extracting problem

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #31259  by Aku
 
Hey,

Can you test these 3 documents in your OpenKM?
jpg: datei_bunt.jpg (456.57 KiB)
png: datei_bunt.png (2.34 MiB)
tif: datei_bunt.tif (327.88 KiB)
After I uploaded them, I can't search for their text.

I have installed Tesseract ("tesseract-ocr-setup-3.02.02.exe").
-> OK - C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe (the path is correct)
-> PDF, Word, Excel and TXT files work correctly.
-> I can see the preview without errors.


I get the following error with these files:
2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': Too few text extracted
2015-02-12 14:02:05,373 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=cfb84da2-e46f-4461-b051-ef422b06170b, docPath=/okm:root/Scans/datei_bunt.png, docVerUuid=fe506f55-1fe0-4959-a7fe-d6cf1dfc88d4, date=Thu Feb 12 14:01:03 CET 2015}
2015-02-12 14:02:10,802 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.png': Too few text extracted
2015-02-12 14:02:10,849 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=d42f77ce-62c6-43fe-96cc-f8462ebe2bd2, docPath=/okm:root/Scans/datei_bunt.jpg, docVerUuid=60f97e6f-4af6-42cf-9941-17a5c468b68c, date=Thu Feb 12 14:01:02 CET 2015}
2015-02-12 14:02:16,091 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.jpg': Too few text extracted
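
To see at a glance which documents failed, the warnings above can be pulled out of the log with a short script. This is only a sketch: the regular expression is derived from the three lines shown here and assumes other OpenKM log lines for this warning follow the same wording; the function name `failed_extractions` is made up for illustration.

```python
import re

# Matches the warning logged when text extraction fails, capturing the
# document path and the reason. Pattern derived from the log lines above.
FAILURE = re.compile(
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+)$"
)


def failed_extractions(log_lines):
    """Return (path, reason) pairs for every failed extraction in the log."""
    failures = []
    for line in log_lines:
        match = FAILURE.search(line)
        if match:
            failures.append((match.group("path"), match.group("reason").strip()))
    return failures


sample = [
    "2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- "
    "There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': "
    "Too few text extracted",
    "2015-02-12 14:02:05,373 [Thread-122] INFO  "
    "com.openkm.extractor.TextExtractorWorker- processSerial.Working on {...}",
]

for path, reason in failed_extractions(sample):
    print(path, "->", reason)
```

INFO lines such as the `processSerial.Working on` entries are ignored, so only the documents that need attention are listed.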

André
 #31284  by jllort
 
I've tested it in our online demo (demo.openkm.com) and it works correctly (you can take a look here: http://demo.openkm.com/OpenKM/index.jsp ... 287f3484a2). Try a content search for a term like "felsmann".

Please post your Tesseract configuration parameters here and also tell us your OpenKM version.
 #31304  by Aku
 
Hello,

I also tested it in your online environment. Here are the URLs of the three files (if the documents haven't been deleted :D):
- http://demo.openkm.com/OpenKM/index.jsp ... 9f9d1ede4a (png: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 8f1983de26 (jpg: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 79bb37247d (tiff: not successful)


I waited 10 minutes, because I do not know how long the TextExtractor interval is.
Searching for the string "felsmann" found the following files: png and jpg.
- The tiff is missing from the results in your online environment.


My tesseract configuration: C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
OpenKM version: Version 6.3.0 (build: 8156) With Community Extension
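
To check Tesseract outside OpenKM, the command template above can be filled in and run by hand. The sketch below assumes that `${fileIn}` and `${fileOut}` are replaced with plain file paths and that `${fileOut}` is an output base name, since the `tesseract` command line appends `.txt` to it; how OpenKM itself substitutes these values is not stated in this thread. The `C:\temp` paths and the function names are hypothetical.

```python
import subprocess
from string import Template


def build_command(template, file_in, file_out):
    """Fill the ${fileIn}/${fileOut} placeholders of an OCR command template."""
    # Split before substituting so that a substituted path containing spaces
    # stays a single argument. The executable path in the template itself
    # must not contain spaces for this simple approach to work.
    return [
        Template(token).substitute(fileIn=file_in, fileOut=file_out)
        for token in template.split()
    ]


def run_ocr(template, image_path, output_base):
    """Run the OCR command and return the recognized text."""
    subprocess.run(build_command(template, image_path, output_base), check=True)
    # The tesseract CLI writes its result to "<output_base>.txt".
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    template = r"C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"
    # Hypothetical paths, shown only to illustrate the resulting argument list.
    print(build_command(template, r"C:\temp\datei_bunt.tif", r"C:\temp\datei_bunt"))
```

If `run_ocr` returns sensible text for a file that OpenKM reports as "Too few text extracted", the Tesseract installation is probably fine and the problem lies elsewhere in the OpenKM configuration.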


André :)
 #31342  by Aku
 
We generate these files through a scanner import (black and white).


There are two ways we import documents from a scanner.

First:
- We import them as pdf-files.

Second:
- We import them as tif-files.


The text is detected in the pdf-file, but not in the tif-file. :(
- I have tried different documents... (always the same problem)

And then I get this error:
2015-02-19 10:58:15,378 [http-bio-0.0.0.0-8080-exec-8] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/unverteilt/2015-02-19T20150219081221.tif': Too few text extracted
André
 #31367  by jllort
 
Can you upload the TIFF file here, together with the same document as a PDF? I want to run some tests with them.
 #31394  by Aku
 
Hello,

Here is the PDF file that I created with our scanner:
(24.79 KiB)
Here is the TIF file that I created with our scanner:
file.tif (62.86 KiB)

The pdf is working.
tif: "Too few text extracted"

André
 #31406  by Aku
 
Now I have a problem with TIFF files that were not imported with the scanner.
- I think it's the same problem as with the TIFF and other image files.
2015-02-24 08:46:03,224 [http-bio-0.0.0.0-8080-exec-2] WARN  com.openkm.extractor.PdfTextExtractor- PDF does not contains text layer
Does OpenKM only detect text if the files contain a text layer?

André
 #31415  by Aku
 
I have found the problem.

In the configuration I had to use "com.openkm.extractor.Tesseract3TextExtractor" instead of "com.openkm.extractor.CuneiformTextExtractor".
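
The change can be pictured as swapping one class name in the list of registered extractors. The sketch below is only an illustration: it assumes the extractors are configured as a list of class names, and the post does not say which setting holds that list. The function name `switch_to_tesseract` is made up; the class names are the ones mentioned in this thread.

```python
CUNEIFORM = "com.openkm.extractor.CuneiformTextExtractor"
TESSERACT3 = "com.openkm.extractor.Tesseract3TextExtractor"


def switch_to_tesseract(extractors):
    """Return the extractor list with Cuneiform replaced by Tesseract 3."""
    updated = []
    for name in extractors:
        name = name.strip()
        if name == CUNEIFORM:
            name = TESSERACT3
        # Skip blank entries and avoid registering the same class twice.
        if name and name not in updated:
            updated.append(name)
    return updated


# Assumed example list; a real configuration will contain more entries.
current = ["com.openkm.extractor.PdfTextExtractor", CUNEIFORM]
print("\n".join(switch_to_tesseract(current)))
```

All other entries are left in their original order, so only the OCR extractor changes.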


Thanks. :)
 #31444  by jllort
 
OK, thanks for the clue; I will try to keep it in mind for other posts. In the latest releases we've merged everything into a wrapper, so it is no longer necessary to choose the class: the correct one is selected automatically depending on the system.ocr value. I think it's not yet available in the Community edition (I'm not sure whether version 6.3 comes with it).
