ImageEn Forum - ImageEnView LoadFileFromPDF GetText Trimming Spaces

Sidney Egnew

USA
59 Posts

Posted - Apr 21 2022 : 09:47:41

I am using a TImageEnView to extract text from PDF files. The PDF text contains data in fixed-field format, and I need to extract the fields from that data. The text returned by GetText collapses every run of multiple spaces into a single space. I need the text returned with its spaces intact. Is this possible?

The text obtained by copying from the PDF matches what GetText returns, but neither matches what is displayed.

PDF Content:
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601

Returned Text:
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601

Code:
v_ImageView: TImageEnView;
...
v_ImageView.PdfViewer.Enabled := True;
v_ImageView.IO.LoadFromFilePDF(p_FileName,-1,-1,p_Password);
Result := v_ImageView.PdfViewer.GetText(0,100000);
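
As an illustration of why the collapsed spaces matter, here is a minimal sketch of position-based field extraction. The column layout is an assumption (the thread does not give the real one), and SliceField is a hypothetical helper, not part of ImageEn. With the padding intact, each field can be cut out by column; once the padding is collapsed, the same columns pick up data from the neighbouring fields.

```pascal
program FixedFieldSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils; // System.SysUtils if unit scope names are not configured

// Hypothetical helper (not part of ImageEn): returns the field that occupies
// Width characters starting at 1-based column StartCol, with padding trimmed.
function SliceField(const Line: string; StartCol, Width: Integer): string;
begin
  Result := Trim(Copy(Line, StartCol, Width));
end;

var
  Padded, Collapsed: string;
begin
  // Assumed layout: ID(4) Name(20) Date(6) Color(10) Code(3)
  Padded := Format('%-4s%-20s%-6s%-10s%-3s',
    ['5200', 'Doe, Jane', '123122', 'Blue', '503']);
  // The same record with its padding collapsed, as quoted in the thread
  Collapsed := '5200Doe, Jane 123122Blue 503';

  WriteLn(SliceField(Padded, 5, 20));     // Doe, Jane
  WriteLn(SliceField(Collapsed, 5, 20));  // Doe, Jane 123122Blue
end.
```

The second call shows the failure mode: the name column swallows the date and colour fields because nothing is left to hold them in their columns.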

xequte

39076 Posts

Posted - Apr 21 2022 : 14:31:46

Hi Sidney

I'm afraid that is what is returned by PDFium. ImageEn is not touching the data in any way.

Nigel
Xequte Software
www.imageen.com

Sidney Egnew

USA
59 Posts

Posted - Apr 21 2022 : 15:48:59

What if I print the image to a file and then OCR it? Is that possible?

Can I write the image to a non-password-protected PDF file?

Thanks

xequte

39076 Posts

Posted - Apr 21 2022 : 16:26:02

Hi Sidney

You don't need to write the image to file, you can just apply OCR to the page bitmap (using IEVision).

That should preserve the layout, but it seems a roundabout way to do it.

Nigel
Xequte Software
www.imageen.com

Sidney Egnew

USA
59 Posts

Posted - Apr 21 2022 : 19:24:45

Nigel,

I set up an OCR program and attempted to access the text. It did not go well.

1) Tested using 1.tif from OCR demos.
The text was recognized.
   ImageEnView1.IO.LoadFromFile(c_FileName);
   ImageEnView1.Fit;

2) Tested using the password-protected PDF.
I got one line that is nothing like the PDF: "G mmewslli SIS"
   ImageEnView1.PdfViewer.Enabled := True;
   ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1,c_Password);
   ImageEnView1.Fit;

3) I loaded the PDF into Acrobat and printed it to "Microsoft Print to PDF".
I got nothing when I loaded the new non-password PDF and did the OCR.
   ImageEnView1.PdfViewer.Enabled := True;
   ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
   ImageEnView1.Fit;

4) I loaded a different non-password-protected PDF.
The text was recognized, although not very well.
   ImageEnView1.PdfViewer.Enabled := True;
   ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
   ImageEnView1.Fit;

This is the common OCR routine:
   MainMemo.Lines.Clear;
   IEVisionOCR := IEVisionLib.createOCR(IEOCRLanguageList[OCR_English_language].Code);
   try
     MainMemo.Text := IEVisionOCR.recognize(ImageEnView1.IEBitmap.GetIEVisionImage(), IEVisionRect(0, 0, 0, 0)).c_str();
   except
     on E: Exception do
       ShowMessage(E.Message);
   end;

Sidney Egnew

USA
59 Posts

Posted - Apr 21 2022 : 21:13:53

OCR was not working very well, so I decided to use PDFViewer.GetText(TRect). This works: I was able to arbitrarily set the ImageEnView width to 8500 and height to 11000 and extract the last columns using a rectangle (6300,0,7220,11000). Based on this, the character size appears to be about 66 x 135.

This doesn't completely resolve the issue, because white space is still collapsed to a single space. However, if I define a rectangle for each individual field, I should be able to detect data from later fields spilling into a field's space and remove it.

I thought I could use ImageEnView1.SelectedRectangle to determine the rectangles I need but the result is not what I expected. When I create a window over the first section of text I get X=66 Y=68 W=90 H=13. These numbers are incorrect.

Can I establish a rectangle over each column of data and get the rectangle values needed for GetText?
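
The per-column rectangle idea above can be sketched as plain arithmetic. This is only a sketch under stated assumptions: FieldRect and its parameters are hypothetical names, the 66-unit character width is the rough estimate from this post, and the column position (95) and width (14) are made-up values. A real layout would need the character width and left offset calibrated against the actual PDF before the rectangles are passed to GetText.

```pascal
program FieldRectSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils, Types; // System.SysUtils, System.Types without unit scope names

// Hypothetical helper (not part of ImageEn): builds a full-height rectangle
// covering Width characters starting at 0-based column StartCol.
// CharWidth and LeftOffset are in the same units as the 8500 x 11000 page.
function FieldRect(StartCol, Width, CharWidth, LeftOffset,
  PageHeight: Integer): TRect;
begin
  Result.Left   := LeftOffset + StartCol * CharWidth;
  Result.Top    := 0;
  Result.Right  := LeftOffset + (StartCol + Width) * CharWidth;
  Result.Bottom := PageHeight;
end;

var
  R: TRect;
begin
  // Assumed values: column 95, 14 characters wide, 66 units per character
  R := FieldRect(95, 14, 66, 0, 11000);
  WriteLn(Format('%d,%d,%d,%d', [R.Left, R.Top, R.Right, R.Bottom]));
  // prints 6270,0,7194,11000
end.
```

The result lands near, but not exactly on, the hand-picked rectangle (6300,0,7220,11000), which suggests the 66-unit estimate or the left offset would need adjusting for real use.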

xequte

39076 Posts

Posted - Apr 25 2022 : 21:30:18

Hi Sidney

SelectedRectangle is only for image selections. With PDFium you need to get the text rects from starting and ending text indexes (e.g. the selected text):

https://www.imageen.com/help/TIEPdfViewerInteraction.GetTextRects.html

Nigel
Xequte Software
www.imageen.com
