Open Source Document Management System | OpenKM - PDF Indexing not working (inserts spaces between characters)

PDF Indexing not working (inserts spaces between characters)

#46982 by henrik
Sun Oct 28, 2018 9:37 pm

Hello,
I'm trying to set up OpenKM as a DMS for scanned documents. I scan my documents with a Xerox workstation, which already performs OCR and creates PDFs with a text overlay.
If I open one of those PDFs, I can copy the text into a text editor and it comes out well formatted, as far as the OCR capabilities of the Xerox device allow. (I attached a sample PDF from which I can copy the following text into an editor.)

Code: Select all

 The quick brown fox jumps over the lazy dog

Unfortunately, if I upload this PDF into OpenKM and let the indexer run, there are spaces between every character:

Code: Select all

mysql> select * from OKM_NODE_DOCUMENT;
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
| NDC_CHECKED_OUT | NDC_CIPHER_NAME | NDC_DESCRIPTION | NDC_ENCRYPTION | NDC_LANGUAGE | NDC_LAST_MODIFIED   | NLK_CREATED | NLK_OWNER | NLK_TOKEN | NDC_LOCKED | NDC_MIME_TYPE   | NDC_SIGNED | NDC_TEXT                                                               | NDC_TEXT_EXTRACTED | NDC_TITLE | NBS_UUID                             |
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
| F               | NULL            | NULL            | F              | cs           | 2018-10-28 21:25:14 | NULL        | NULL      | NULL      | F          | application/pdf | F          | T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g  | T                  |           | 9a459065-e493-4d95-8641-e1d84ed97dbb |
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+

Therefore the content is not really searchable.
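To illustrate why this breaks search, here is a minimal sketch that assumes a simple whitespace tokenizer. The analyzer OpenKM actually uses may differ in detail, and the class and method names are made up for the example. Each letter becomes its own token, so a query for a whole word no longer matches.

```java
import java.util.Arrays;
import java.util.List;

public class SpacedTextDemo {
    // Hypothetical helper: lowercases the text and splits it on whitespace,
    // roughly what a basic full-text tokenizer would do.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        String expected = "The quick brown fox jumps over the lazy dog";
        String indexed = "T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g";

        // The correctly spaced text contains the word as a token.
        System.out.println(tokenize(expected).contains("quick"));
        // The text as stored in NDC_TEXT only contains single-letter tokens.
        System.out.println(tokenize(indexed).contains("quick"));
        System.out.println(tokenize(indexed).size());
    }
}
```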

Can somebody point me in the right direction as to what is going wrong here?

Cheers,
Henrik

Attachments

Xerox Scan_28102018232245.PDF

(7.02 KiB) Downloaded 341 times


Re: PDF Indexing not working (inserts spaces between characters)

#46992 by jllort
Wed Oct 31, 2018 4:13 pm

This happens because of the text extractor being used. The first screenshot, from OpenKM Community, uses the class named PdfTextExtractor:

support_479.png (25.58 KiB) Viewed 3553 times

The second screenshot is from the Professional edition, which comes with more text extractor options; there I used PdfLayerTextExtractor:

support_479.png (25.58 KiB) Viewed 3553 times

Take a look at the Community source code. Either a new text extractor should be added that gets the text directly from the layer, or the existing PdfTextExtractor should be modified to check whether a text layer is available and, if so, not process the document with the OCR engine (which I think is what is happening now):
https://github.com/openkm/document-mana ... /extractor
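The second option described above could look roughly like the following sketch. It is not OpenKM's actual code: the class name, the method name, and the two suppliers are hypothetical stand-ins for "read the embedded text layer" and "run the OCR engine". The only point is that OCR is called when the PDF has no usable text layer, and skipped otherwise.

```java
import java.util.function.Supplier;

public class LayerFirstExtractor {
    // Hypothetical helper: returns the embedded text layer if it has content,
    // and only falls back to the (expensive) OCR supplier when it does not.
    static String extract(Supplier<String> textLayer, Supplier<String> ocrEngine) {
        String layer = textLayer.get();
        if (layer != null && !layer.trim().isEmpty()) {
            return layer.trim();
        }
        return ocrEngine.get();
    }

    public static void main(String[] args) {
        // A PDF with a text layer: the OCR supplier is never invoked.
        System.out.println(extract(() -> "The quick brown fox", () -> "T h e q u i c k"));
        // A PDF whose text layer is blank: the OCR result is used.
        System.out.println(extract(() -> "   ", () -> "text from OCR"));
    }
}
```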


Re: PDF Indexing not working (inserts spaces between characters)

#46994 by henrik
Wed Oct 31, 2018 9:45 pm

Thanks for pointing me in exactly the right direction.
I dusted off my ancient Java knowledge, had a look at Apache PDFBox, and could fully reproduce the issue. I also figured out that if you set the spacing tolerance (setSpacingTolerance) to -1, the issue disappears for the Xerox PDFs. Unfortunately, with -1 the issue then appears for all "regular" PDFs with text inside, so I wrote a small snippet that evaluates the spacing and extracts the correctly spaced text for every PDF (that I tested...):

Code: Select all

	
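	// Imports needed at the top of the file. The package names assume PDFBox 1.8.x,
	// which is what the PDFParser(InputStream) constructor below implies.
	import java.io.FileInputStream;
	import java.io.IOException;
	import org.apache.pdfbox.pdfparser.PDFParser;
	import org.apache.pdfbox.pdmodel.PDDocument;
	import org.apache.pdfbox.util.PDFTextStripper;

	private static String pdfextract(String filepath) {
		PDDocument document = null;
		try (FileInputStream s1 = new FileInputStream(filepath)) {
			PDFParser parser = new PDFParser(s1);
			parser.parse();
			document = parser.getPDDocument();
			PDFTextStripper stripper = new PDFTextStripper();
			stripper.setSortByPosition(true);
			stripper.setLineSeparator("\n");
			// First pass with the default spacing tolerance.
			String text = stripper.getText(document);
			// Count spaces versus all other characters in the extracted text.
			int spacecounter = 0, charcounter = 0;
			for (int i = 0; i < text.length(); i++) {
				if (text.charAt(i) == ' ') {
					spacecounter++;
				} else {
					charcounter++;
				}
			}
			System.out.println("Spaces: " + spacecounter + " Characters: " + charcounter);
			// If spaces + 20% outnumber the other characters, assume the spacing
			// problem is present and extract again with a spacing tolerance of -1.
			if (spacecounter * 1.2 > charcounter) {
				stripper.setSpacingTolerance(-1);
				text = stripper.getText(document);
			}
			return "TextStripped: " + text.trim();
		} catch (IOException e) {
			e.printStackTrace();
			return "";
		} finally {
			// Release the document's resources even if extraction failed.
			if (document != null) {
				try {
					document.close();
				} catch (IOException ignored) {
					// nothing useful to do here
				}
			}
		}
	}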

This code checks whether the number of spaces plus 20% exceeds the number of other characters (which indicates that the spacing problem is present) and, if so, sets the spacing tolerance to -1.
I tested it on 100 random PDFs from mixed origins and a couple of documents from the Xerox scanner mentioned above, and it works quite nicely.
Unfortunately I'm neither a software developer nor a Java specialist, so this code is messy and the idea of counting characters is probably far from performant, especially on big PDFs.
It would be great if this could serve as a prototype for the developers on how to fix this issue, and/or if somebody could tell me how to patch this into my running instance (I could live with this solution).
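Regarding the performance concern: the counting check can be separated from PDFBox and limited to a sample, so that large documents do not have to be scanned in full. The following is only a sketch under that assumption. It keeps the 1.2 factor from the snippet above, but the class name, the method name, and the sample-size parameter are made up for illustration, and it has not been tested against real PDFs.

```java
public class SpacingCheck {
    // Hypothetical helper: returns true if spaces (+20%) outnumber all other
    // characters within the first maxSample characters of the extracted text.
    static boolean looksCharacterSpaced(String text, int maxSample) {
        int limit = Math.min(text.length(), maxSample);
        int spaces = 0;
        for (int i = 0; i < limit; i++) {
            if (text.charAt(i) == ' ') {
                spaces++;
            }
        }
        int others = limit - spaces;
        return spaces * 1.2 > others;
    }

    public static void main(String[] args) {
        String spaced = "T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g";
        String normal = "The quick brown fox jumps over the lazy dog";
        // Only the first 1000 characters are inspected, whatever the document size.
        System.out.println(looksCharacterSpaced(spaced, 1000));
        System.out.println(looksCharacterSpaced(normal, 1000));
    }
}
```

Whether a sample from the start of the document is representative is itself an assumption; a document with a normally spaced cover page followed by scanned pages would need a different sampling strategy.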

Cheers,
Henrik


Re: PDF Indexing not working (inserts spaces between characters)

#47002 by jllort
Thu Nov 01, 2018 10:54 am

Go to our download section, https://www.openkm.com/en/download.html, where we have prepared a development environment based on a virtual machine. There is a video in which we explain how to install it. From there I think it will be easy to apply changes; the environment is really intended as a quick starting point for playing with the code.
