• PDF Indexing not working (inserts spaces between characters)

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #46982  by henrik
 
Hello,
I'm trying to set up OpenKM as a DMS for scanned documents. I scan my documents with a Xerox workstation, which already does OCR and creates PDFs with a text overlay.
If I open one of those PDFs, I can copy text into a text editor, and it is well formatted, within the OCR capabilities of this Xerox device (I attached a sample PDF from which I can copy the following text into an editor):
Code:
 The quick brown fox jumps over the lazy dog
Unfortunately, if I upload this PDF into OpenKM and let the indexer run, there are spaces between every character:
Code:
mysql> select * from OKM_NODE_DOCUMENT;
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
| NDC_CHECKED_OUT | NDC_CIPHER_NAME | NDC_DESCRIPTION | NDC_ENCRYPTION | NDC_LANGUAGE | NDC_LAST_MODIFIED   | NLK_CREATED | NLK_OWNER | NLK_TOKEN | NDC_LOCKED | NDC_MIME_TYPE   | NDC_SIGNED | NDC_TEXT                                                               | NDC_TEXT_EXTRACTED | NDC_TITLE | NBS_UUID                             |
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
| F               | NULL            | NULL            | F              | cs           | 2018-10-28 21:25:14 | NULL        | NULL      | NULL      | F          | application/pdf | F          | T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g  | T                  |           | 9a459065-e493-4d95-8641-e1d84ed97dbb |
+-----------------+-----------------+-----------------+----------------+--------------+---------------------+-------------+-----------+-----------+------------+-----------------+------------+------------------------------------------------------------------------+--------------------+-----------+--------------------------------------+
Therefore the content is not really searchable.
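
To illustrate why the spaced text cannot be found, here is a minimal sketch that assumes the search index splits text into tokens on whitespace. The class name TokenDemo and the tokens helper are made up for this illustration; the analyzer OpenKM actually uses may behave differently.

```java
import java.util.Arrays;
import java.util.List;

public class TokenDemo {
    // Hypothetical stand-in for a search analyzer: lower-case the text
    // and split it on runs of whitespace.
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        String copied = "The quick brown fox jumps over the lazy dog";
        String indexed = "T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g";

        // The text copied from the PDF yields whole words as tokens.
        System.out.println(tokens(copied).contains("quick"));
        // The indexed text yields only single-letter tokens, so a search
        // for a whole word has nothing to match.
        System.out.println(tokens(indexed).contains("quick"));
    }
}
```

Under that assumption, the first check finds the word "quick" and the second does not, because every token in the indexed text is a single letter.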

Can somebody point me in the right direction as to what's going wrong here?

Cheers,
Henrik
 #46992  by jllort
 
This happens because of the text extractor in use. The first screenshot, from OpenKM Community, shows that the class named PdfTextExtractor is used:
[Screenshot: support_479.png]
The second screenshot is from the Professional edition, which comes with more text extractor options; there I used PdfLayerTextExtractor:
[Screenshot: support_479.png]

Take a look at the Community source code. Either a new text extractor should be added that gets the text directly from the layer, or the existing PdfTextExtractor should be modified to check whether a text layer is available and, if so, not process the document with the OCR engine (which I think is what is happening now).
https://github.com/openkm/document-mana ... /extractor
 #46994  by henrik
 
Thanks for pointing me in exactly the right direction.
I dusted off my ancient Java knowledge, had a look at Apache PDFBox and could fully reproduce the issue. I also figured out that if you set the spacing tolerance to -1, the issue disappears for the Xerox PDFs. Unfortunately, with -1 the issue is then present for all "regular" PDFs with text inside, so I wrote a small snippet that evaluates the spacing and extracts the correctly spaced text for every PDF (that I tested...)
Code:
	
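	// Imports assume PDFBox 1.x, matching the PDFParser(InputStream) constructor used below.
	import java.io.FileInputStream;
	import java.io.IOException;
	import org.apache.pdfbox.pdfparser.PDFParser;
	import org.apache.pdfbox.pdmodel.PDDocument;
	import org.apache.pdfbox.util.PDFTextStripper;

	private static String pdfextract(String filepath) {
		PDDocument document = null;
		try (FileInputStream in = new FileInputStream(filepath)) {
			PDFParser parser = new PDFParser(in);
			parser.parse();
			document = parser.getPDDocument();
			PDFTextStripper stripper = new PDFTextStripper();
			stripper.setSortByPosition(true);
			stripper.setLineSeparator("\n");
			// First pass with the default spacing tolerance.
			String text = stripper.getText(document);
			int spacecounter = 0, charcounter = 0;
			for (int i = 0; i < text.length(); i++) {
				if (text.charAt(i) == ' ') {
					spacecounter++;
				} else {
					charcounter++;
				}
			}
			System.out.println("Spaces: " + spacecounter + " Characters: " + charcounter);
			// Roughly one space per character means the spacing problem is present:
			// extract again with the spacing tolerance set to -1.
			if (spacecounter * 1.2 > charcounter) {
				stripper.setSpacingTolerance(-1);
				text = stripper.getText(document);
			}
			return "TextStripped: " + text.trim();
		} catch (IOException e) {
			e.printStackTrace();
			return "";
		} finally {
			// Close the document so file handles and scratch data are released.
			if (document != null) {
				try {
					document.close();
				} catch (IOException ignored) {
					// Nothing useful can be done if closing fails.
				}
			}
		}
	}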
This code checks whether the number of spaces plus 20% is higher than the count of all other characters (meaning the spacing problem is present) and, if so, sets the spacing tolerance to -1.
I tested it on 100 random PDFs from mixed origins and a couple of documents from the mentioned Xerox scanner, and it works quite nicely.
Unfortunately I'm neither a software developer nor a Java specialist, so this code is messy, and the character-counting approach probably performs poorly, especially on big PDFs.
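
As a rough sketch of how the check could be separated from PDFBox and kept cheap on large documents, the version below applies the same spaces-times-1.2 rule to only a prefix of the extracted text. The class name SpacingHeuristic, the method looksCharacterSpaced and the maxChars parameter are assumptions made for this illustration, and I have not verified that a prefix is always representative of the whole document.

```java
public class SpacingHeuristic {
    // Hypothetical helper: returns true when, within the first maxChars
    // characters, the number of spaces times 1.2 exceeds the number of
    // all other characters (the same rule as in the snippet above).
    static boolean looksCharacterSpaced(String text, int maxChars) {
        int limit = Math.min(text.length(), maxChars);
        int spaces = 0;
        int others = 0;
        for (int i = 0; i < limit; i++) {
            if (text.charAt(i) == ' ') {
                spaces++;
            } else {
                others++;
            }
        }
        return spaces * 1.2 > others;
    }

    public static void main(String[] args) {
        String spaced = "T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g";
        String regular = "The quick brown fox jumps over the lazy dog";
        // Inspect at most 1000 characters of each text.
        System.out.println(looksCharacterSpaced(spaced, 1000));
        System.out.println(looksCharacterSpaced(regular, 1000));
    }
}
```

With a fixed maxChars the cost of the check no longer grows with the document size. The sample should not be too small, though: with only a handful of characters the ratio can land on either side of the threshold.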
It would be great if this could serve as a prototype for the developers on how to fix this issue, and/or if somebody could tell me how to patch this into my running instance (I could live with this solution ;) ).

Cheers,
Henrik
 #47002  by jllort
 
Go to our download section at https://www.openkm.com/en/download.html. We have prepared a development environment based on a virtual machine, and there is a video where we explain how to install it. From there I think it will be easy to apply the changes; the environment was designed as a quick starting point for playing with the code.
