
Uploading images - extracting problem

Posted: Thu Feb 12, 2015 1:06 pm
by Aku
Hey,

Can you test these 3 documents in your OpenKM?
jpg: datei_bunt.jpg (456.57 KiB)
png: datei_bunt.png (2.34 MiB)
tif: datei_bunt.tif (327.88 KiB)
After I uploaded them, I can't search for their text.

I have installed Tesseract ("tesseract-ocr-setup-3.02.02.exe")
-> OK - C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe (the path is correct)
-> PDF, Word, Excel and txt files work correctly.
-> I can see the preview without errors.


I get the following errors with these files:
2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': Too few text extracted
2015-02-12 14:02:05,373 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=cfb84da2-e46f-4461-b051-ef422b06170b, docPath=/okm:root/Scans/datei_bunt.png, docVerUuid=fe506f55-1fe0-4959-a7fe-d6cf1dfc88d4, date=Thu Feb 12 14:01:03 CET 2015}
2015-02-12 14:02:10,802 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.png': Too few text extracted
2015-02-12 14:02:10,849 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=d42f77ce-62c6-43fe-96cc-f8462ebe2bd2, docPath=/okm:root/Scans/datei_bunt.jpg, docVerUuid=60f97e6f-4af6-42cf-9941-17a5c468b68c, date=Thu Feb 12 14:01:02 CET 2015}
2015-02-12 14:02:16,091 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.jpg': Too few text extracted
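
For anyone triaging a longer log, here is a minimal Python sketch (not part of OpenKM) that pulls the failed documents out of WARN lines like the ones above. The regular expression is inferred only from these samples, so the general log format is an assumption.

```python
import re

# Pattern inferred from the WARN lines in this post; the general log
# format is an assumption, not something taken from OpenKM documentation.
FAILED = re.compile(
    r"WARN\s+com\.openkm\.dao\.NodeDocumentDAO- "
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+)$"
)

def failed_extractions(log_lines):
    """Return (document path, reason) pairs for every failed extraction."""
    results = []
    for line in log_lines:
        match = FAILED.search(line.strip())
        if match:
            results.append((match.group("path"), match.group("reason")))
    return results

sample = [
    "2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- "
    "There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': "
    "Too few text extracted",
    "2015-02-12 14:02:10,802 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- "
    "There was a problem extracting text from '/okm:root/Scans/datei_bunt.png': "
    "Too few text extracted",
    "a line that does not report an extraction problem",
]

for path, reason in failed_extractions(sample):
    print(path, "->", reason)
```

Lines that do not match the pattern are skipped, so the whole log file can be passed in line by line.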

André

Re: Uploading images - extracting problem

Posted: Sat Feb 14, 2015 9:14 am
by jllort
I've tested it in our online demo ( demo.openkm.com ) and it works correctly ( you can take a look here http://demo.openkm.com/OpenKM/index.jsp ... 287f3484a2 ). Try a search by content, for example "felsmann".

Please post your Tesseract configuration parameters here and also tell us your OpenKM version.

Re: Uploading images - extracting problem

Posted: Mon Feb 16, 2015 8:54 am
by Aku
Hello,

I also tested it in your online environment. Here are the URLs of the three files (if the documents haven't been deleted :D ):
- http://demo.openkm.com/OpenKM/index.jsp ... 9f9d1ede4a (png: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 8f1983de26 (jpg: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 79bb37247d (tiff: not successful)


I waited 10 minutes, because I don't know how long the TextExtractor interval is.
Searching for the string "felsmann" found the following files: png & jpg
- The tiff is missing from the results in your online environment.


My tesseract configuration: C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
OpenKM version: Version 6.3.0 (build: 8156) With Community Extension
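
To check the OCR step outside OpenKM, the command template above can be expanded by hand and run against one of the images. This is a minimal Python sketch: how OpenKM itself expands `${fileIn}` and `${fileOut}` is an assumption, the function name `build_ocr_command` is made up for illustration, and the two file paths are hypothetical.

```python
from string import Template

def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut} in a command template, token by token."""
    values = {"fileIn": file_in, "fileOut": file_out}
    # Split first, substitute second, so input/output paths that contain
    # spaces stay a single argument. An executable path with spaces in the
    # template itself would still be split; this sketch does not handle that.
    return [Template(token).substitute(values) for token in template.split()]

command = build_ocr_command(
    r"C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}",
    r"C:\temp\datei_bunt.tif",  # hypothetical input image
    r"C:\temp\datei_bunt",      # hypothetical output base name
)
print(command)
# To run it by hand: import subprocess; subprocess.run(command, check=True)
```

Tesseract 3.x treats the second argument as an output base name and appends `.txt`, so the recognized text can then be inspected directly. If that file contains the expected text, the OCR engine works and the problem is more likely in the OpenKM configuration.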


André :)

Re: Uploading images - extracting problem

Posted: Mon Feb 16, 2015 1:43 pm
by Aku
The tif file is now found in your demo environment too, after some hours. :D

Re: Uploading images - extracting problem

Posted: Wed Feb 18, 2015 10:16 pm
by jllort
How do you generate this tiff file?

Re: Uploading images - extracting problem

Posted: Thu Feb 19, 2015 7:16 am
by Aku
We generate these files with a scanner import (black and white).


There are two ways we import documents from a scanner.

First:
- We import them as pdf-files.

Second:
- We import them as tif-files.


The text is detected in the PDF file, but not in the tif file. :(
- I have tried different documents (always the same problem).

And then I get this error:
2015-02-19 10:58:15,378 [http-bio-0.0.0.0-8080-exec-8] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/unverteilt/2015-02-19T20150219081221.tif': Too few text extracted
André

Re: Uploading images - extracting problem

Posted: Sat Feb 21, 2015 8:54 am
by jllort
Can you upload the tiff file and the same scan as a PDF here? I want to run some tests with them.

Re: Uploading images - extracting problem

Posted: Mon Feb 23, 2015 7:42 am
by Aku
Hello,

Here is the PDF file that I have created with our scanner:
(24.79 KiB)
Here is the TIF file that I have created with our scanner:
file.tif (62.86 KiB)

The pdf is working.
tif: "Too few text extracted"

André

Re: Uploading images - extracting problem

Posted: Tue Feb 24, 2015 7:54 am
by Aku
Now I have a problem with tiff files that were not imported with the scanner.
- I think it's the same problem as with the tiff and other image files.
2015-02-24 08:46:03,224 [http-bio-0.0.0.0-8080-exec-2] WARN  com.openkm.extractor.PdfTextExtractor- PDF does not contains text layer
Does OpenKM only detect text if the files contain a text layer?

André

Re: Uploading images - extracting problem

Posted: Wed Feb 25, 2015 7:15 am
by Aku
I have found the problem.

In the configuration I had to use "com.openkm.extractor.Tesseract3TextExtractor" instead of "com.openkm.extractor.CuneiformTextExtractor".


Thanks. :)

Re: Uploading images - extracting problem

Posted: Fri Feb 27, 2015 4:34 pm
by jllort
OK, thanks for the clue. I will try to keep it in mind for other posts. In the latest releases we've merged everything into a wrapper, so it is no longer necessary to choose the class: the correct class is chosen automatically depending on the system.ocr value. I think it's still not available in the community edition ( I'm not sure whether version 6.3 comes with it ).
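
To illustrate the idea of such a wrapper, here is a rough Python sketch. It is not the OpenKM implementation: the two class names come from this thread, but the selection rule (looking for the engine name in the system.ocr command line) and the function name `pick_extractor` are assumptions made for illustration only.

```python
# Class names are taken from this thread; the selection rule is a guess.
EXTRACTOR_BY_ENGINE = {
    "tesseract": "com.openkm.extractor.Tesseract3TextExtractor",
    "cuneiform": "com.openkm.extractor.CuneiformTextExtractor",
}

def pick_extractor(system_ocr):
    """Guess the extractor class from the configured OCR command line."""
    command = system_ocr.lower()
    for engine, class_name in EXTRACTOR_BY_ENGINE.items():
        if engine in command:
            return class_name
    return None  # unknown engine: leave the choice to the administrator

print(pick_extractor(r"C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"))
# prints com.openkm.extractor.Tesseract3TextExtractor
```

With a rule like this, the mismatch described in this thread (a Tesseract command combined with the Cuneiform extractor class) could not occur, because the class always follows the configured command.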

Re: Uploading images - extracting problem

Posted: Mon Mar 02, 2015 5:55 am
by Aku
Okay, thanks for your reply :)