Open Source Document Management System | OpenKM - Uploading images

Uploading images - extracting problem


#31259 by Aku
Thu Feb 12, 2015 1:06 pm

Hey,

Can you test these three documents in your OpenKM?

- jpg: datei_bunt.jpg (456.57 KiB)
- png: datei_bunt.png (2.34 MiB)
- tif: datei_bunt.tif (327.88 KiB)

After I uploaded them, I can't search for their text.

I have installed Tesseract ("tesseract-ocr-setup-3.02.02.exe").
-> OK - C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe (it's correct)
-> PDF, Word, Excel and txt files are working correctly.
-> I can see the preview without errors.

I get the following error with these files:

Code:

2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': Too few text extracted
2015-02-12 14:02:05,373 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=cfb84da2-e46f-4461-b051-ef422b06170b, docPath=/okm:root/Scans/datei_bunt.png, docVerUuid=fe506f55-1fe0-4959-a7fe-d6cf1dfc88d4, date=Thu Feb 12 14:01:03 CET 2015}
2015-02-12 14:02:10,802 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.png': Too few text extracted
2015-02-12 14:02:10,849 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=d42f77ce-62c6-43fe-96cc-f8462ebe2bd2, docPath=/okm:root/Scans/datei_bunt.jpg, docVerUuid=60f97e6f-4af6-42cf-9941-17a5c468b68c, date=Thu Feb 12 14:01:02 CET 2015}
2015-02-12 14:02:16,091 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.jpg': Too few text extracted

André
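
To see at a glance which uploads failed, the warnings in a log excerpt like the one above can be collected with a small script. This is only a sketch: the regular expression is fitted to the NodeDocumentDAO lines quoted in this post, and the function name and the SAMPLE list are made up for illustration; other OpenKM versions may log in a different format.

```python
import re

# Fitted to the NodeDocumentDAO warning lines quoted in the post (assumption:
# other versions or log layouts may differ).
WARN_RE = re.compile(
    r"WARN\s+com\.openkm\.dao\.NodeDocumentDAO- "
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+)$"
)


def failed_extractions(log_lines):
    """Return (path, reason) pairs for every extraction warning in log_lines."""
    failures = []
    for line in log_lines:
        match = WARN_RE.search(line)
        if match:
            failures.append((match.group("path"), match.group("reason").strip()))
    return failures


# Lines copied from the log excerpt in the post.
SAMPLE = [
    "2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- "
    "There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': "
    "Too few text extracted",
    "2015-02-12 14:02:05,373 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- "
    "processSerial.Working on {docPath=/okm:root/Scans/datei_bunt.png}",
    "2015-02-12 14:02:10,802 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- "
    "There was a problem extracting text from '/okm:root/Scans/datei_bunt.png': "
    "Too few text extracted",
]

if __name__ == "__main__":
    for path, reason in failed_extractions(SAMPLE):
        print(f"{path}: {reason}")
```

Feeding it the lines of the real server log instead of SAMPLE gives one line per document whose text could not be extracted.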


Re: Uploading images - extracting problem

#31284 by jllort
Sat Feb 14, 2015 9:14 am

I've tested it in our online demo (demo.openkm.com) and it works correctly (you can take a look here: http://demo.openkm.com/OpenKM/index.jsp ... 287f3484a2). Try a search by content, for example "felsmann".

Please post your Tesseract configuration parameters here and also tell us your OpenKM version.


Re: Uploading images - extracting problem

#31304 by Aku
Mon Feb 16, 2015 8:54 am

Hello,

I also tested it in your online environment. Here are the URLs of the three files (if the documents haven't been deleted):
- http://demo.openkm.com/OpenKM/index.jsp ... 9f9d1ede4a (png: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 8f1983de26 (jpg: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 79bb37247d (tiff: not successful)

I waited 10 minutes, because I don't know how long the TextExtractor interval is.
Searching for "felsmann" found the png and jpg files.
- The tiff is missing from the search results in your online environment.

My tesseract configuration: C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
OpenKM version: Version 6.3.0 (build: 8156) With Community Extension

André
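
One way to rule out Tesseract itself is to expand the configured command by hand and run it on a test image outside OpenKM. The sketch below assumes that ${fileIn} and ${fileOut} are plain placeholders that get replaced by file paths; how OpenKM really invokes the command is not shown in this thread, and the function names are hypothetical. Tesseract 3 appends ".txt" to the output base name, which is why run_ocr reads file_out + ".txt".

```python
import subprocess
from string import Template

# Command template copied from the post above.
OCR_TEMPLATE = r"C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def build_ocr_command(template, file_in, file_out):
    """Fill in the placeholders and return the command as an argument list.

    The template is split on whitespace before substitution, so input paths
    containing spaces stay in one argument. Assumption: the executable path
    in the template itself contains no spaces.
    """
    return [
        Template(part).substitute(fileIn=file_in, fileOut=file_out)
        for part in template.split()
    ]


def run_ocr(command, file_out):
    """Run the command and return the text Tesseract wrote to file_out + '.txt'."""
    subprocess.run(command, check=True)
    with open(file_out + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical test paths; replace them with a real scan before calling run_ocr.
    cmd = build_ocr_command(OCR_TEMPLATE, r"C:\tmp\datei_bunt.tif", r"C:\tmp\datei_bunt")
    print(cmd)
```

If run_ocr returns readable text for the tif, Tesseract is fine and the problem is more likely in how OpenKM is configured to call it.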


Re: Uploading images - extracting problem

#31310 by Aku
Mon Feb 16, 2015 1:43 pm

After some hours, the tif file is now found in your demo environment as well.


Re: Uploading images - extracting problem

#31334 by jllort
Wed Feb 18, 2015 10:16 pm

How do you generate this tiff file?


Re: Uploading images - extracting problem

#31342 by Aku
Thu Feb 19, 2015 7:16 am

We generate these files with a scanner import (black and white).

There are two ways we import documents from a scanner.

First:
- We import them as pdf files.

Second:
- We import them as tif files.

The text is detected in the pdf file, but not in the tif file.

- I have tried different documents... (always the same problem)

And then I get this error:

Code:

2015-02-19 10:58:15,378 [http-bio-0.0.0.0-8080-exec-8] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/unverteilt/2015-02-19T20150219081221.tif': Too few text extracted

André


Re: Uploading images - extracting problem

#31367 by jllort
Sat Feb 21, 2015 8:54 am

Can you upload the tiff file here, and also the same scan as a pdf? I want to do some tests with them.


Re: Uploading images - extracting problem

#31394 by Aku
Mon Feb 23, 2015 7:42 am

Hello,

Here is the PDF file that I created with our scanner:

file.pdf (24.79 KiB)

Here is the TIF file that I created with our scanner:

file.tif (62.86 KiB)

The pdf is working.
tif: "Too few text extracted"

André


Re: Uploading images - extracting problem

#31406 by Aku
Tue Feb 24, 2015 7:54 am

Now I also have a problem with tiff files that were not imported with the scanner.
- I think it's the same problem as with the tiff and the other image files.

Code:

2015-02-24 08:46:03,224 [http-bio-0.0.0.0-8080-exec-2] WARN  com.openkm.extractor.PdfTextExtractor- PDF does not contains text layer

Does OpenKM only detect text if the file contains a text layer?

André


Re: Uploading images - extracting problem

#31415 by Aku
Wed Feb 25, 2015 7:15 am

I have found the problem.

In the configuration I had to use "com.openkm.extractor.Tesseract3TextExtractor" instead of "com.openkm.extractor.CuneiformTextExtractor".

Thanks.


Re: Uploading images - extracting problem

#31444 by jllort
Fri Feb 27, 2015 4:34 pm

OK, thanks for the clue; I will try to keep it in mind for other posts. In the latest releases we've merged everything into a wrapper, so it is no longer necessary to choose the class: the correct class is chosen automatically depending on the system.ocr value. I think this is still not available in the community edition (I'm not sure whether version 6.3 comes with it).
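
The idea of picking the extractor class from the system.ocr value can be illustrated roughly as follows. This is not OpenKM's code: the two class names are the ones quoted in this thread, but matching on the executable name, the function names, and the assumption that system.ocr holds a command line like the one posted earlier are all illustrative guesses.

```python
import ntpath

# Class names as quoted in this thread. Mapping them by executable name is an
# assumption made for this sketch, not OpenKM's documented behaviour.
EXTRACTOR_BY_BINARY = {
    "tesseract": "com.openkm.extractor.Tesseract3TextExtractor",
    "cuneiform": "com.openkm.extractor.CuneiformTextExtractor",
}


def extractor_for(system_ocr):
    """Guess the extractor class from the executable named in system.ocr."""
    if not system_ocr.strip():
        return None
    executable = system_ocr.split()[0]
    # ntpath treats both "\\" and "/" as separators, so Windows and Unix paths work.
    name = ntpath.basename(executable).lower()
    stem = name[:-4] if name.endswith(".exe") else name
    return EXTRACTOR_BY_BINARY.get(stem)


def check_config(system_ocr, registered_extractors):
    """Return a warning if the OCR binary and the registered classes disagree."""
    expected = extractor_for(system_ocr)
    if expected is None or expected in registered_extractors:
        return None
    return f"system.ocr points at a binary that needs {expected}"


if __name__ == "__main__":
    ocr = r"C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"
    # The combination that caused the problem in this thread.
    print(check_config(ocr, ["com.openkm.extractor.CuneiformTextExtractor"]))
```

A check like this would have flagged the setup from this thread, where a Tesseract binary was combined with the Cuneiform extractor class.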


Re: Uploading images - extracting problem

#31477 by Aku
Mon Mar 02, 2015 5:55 am

Okay, thanks for your reply.
