 ImageEnView LoadFileFromPDF GetText Trimming Spaces

Sidney Egnew

USA
55 Posts

Posted - Apr 21 2022 : 09:47:41
I am using a TImageEnView to extract text from PDF files. The PDF text contains data in fixed-field format, and I need to extract the fields from that data. Wherever the data contains a run of multiple spaces, GetText returns a single space. I need the text returned without the spaces being collapsed. Is this possible?

The text I get by copying from the PDF matches the text returned by GetText, but neither matches what is displayed.

PDF Content:
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601

Returned Text
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601

Code
v_ImageView: TImageEnView;
...
v_ImageView.PdfViewer.Enabled := True;
v_ImageView.IO.LoadFromFilePDF(p_FileName,-1,-1,p_Password);
Result := v_ImageView.PdfViewer.GetText(0,100000);
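
To illustrate why the collapsed spaces matter for fixed-field data, here is a minimal, self-contained sketch. The field layout (a 4-character code, a 16-character name, then a 6-character date) is purely hypothetical, since the real column widths of the PDF are not shown in this thread; FieldAt is an illustrative helper, not part of ImageEn.

```pascal
program FixedFieldDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Illustrative helper: return the field that starts at 1-based position
// Start and is Len characters wide, without its trailing padding.
function FieldAt(const Line: string; Start, Len: Integer): string;
begin
  Result := TrimRight(Copy(Line, Start, Len));
end;

var
  Padded, Collapsed: string;
begin
  // Hypothetical layout: code (4 chars), name (16 chars), date (6 chars).
  Padded := '5200' + Format('%-16s', ['Doe, Jane']) + '123122';
  // The same record after every run of spaces has been reduced to one space.
  Collapsed := '5200Doe, Jane 123122';

  Writeln('[', FieldAt(Padded, 5, 16), ']');     // name field, intact
  Writeln('[', FieldAt(Padded, 21, 6), ']');     // date field, intact
  Writeln('[', FieldAt(Collapsed, 5, 16), ']');  // name field now swallows the date
  Writeln('[', FieldAt(Collapsed, 21, 6), ']');  // date field is empty
end.
```

With the padding intact, position-based slicing recovers each field; once the padding is collapsed, every field after the first shortened one shifts left and the offsets no longer line up.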

xequte

38613 Posts

Posted - Apr 21 2022 : 14:31:46
Hi Sidney

I'm afraid that is what is returned by PDFium. ImageEn is not touching the data in any way.

Nigel
Xequte Software
www.imageen.com

Sidney Egnew

USA
55 Posts

Posted - Apr 21 2022 : 15:48:59
What if I print the image to a file and then OCR it? Is that possible?

Can I write the image to a non-password-protected PDF file?

Thanks

xequte

38613 Posts

Posted - Apr 21 2022 : 16:26:02
Hi Sidney

You don't need to write the image to a file; you can just apply OCR to the page bitmap (using IEVision).

That should preserve the layout, but it seems a roundabout way to do it.

Nigel
Xequte Software
www.imageen.com

Sidney Egnew

USA
55 Posts

Posted - Apr 21 2022 : 19:24:45
Nigel,

I set up an OCR program and attempted to access the text. It did not go well.

1) Tested using 1.tif from OCR demos.
The text was recognized.
ImageEnView1.IO.LoadFromFile(c_FileName);
ImageEnView1.Fit;

2) Tested using the password-protected PDF.
I got one line that is nothing like the PDF: "G mmewslli SIS"
ImageEnView1.PdfViewer.Enabled := True;
ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1,c_Password);
ImageEnView1.Fit;

3) I loaded the PDF into Acrobat and printed it to "Microsoft Print to PDF".
I got nothing when I loaded the new non-password PDF and did the OCR.
ImageEnView1.PdfViewer.Enabled := True;
ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
ImageEnView1.Fit;

4) I loaded a different non-password-protected PDF.
The text was recognized although not very well.
ImageEnView1.PdfViewer.Enabled := True;
ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
ImageEnView1.Fit;


This is the common OCR routine:
MainMemo.Lines.Clear;
IEVisionOCR := IEVisionLib.createOCR(IEOCRLanguageList[OCR_English_language].Code);
try
  MainMemo.Text := IEVisionOCR.recognize(ImageEnView1.IEBitmap.GetIEVisionImage(), IEVisionRect(0, 0, 0, 0)).c_str();
except
  on E: Exception do
    ShowMessage(E.Message);
end;

Sidney Egnew

USA
55 Posts

Posted - Apr 21 2022 : 21:13:53
OCR was not working very well, so I decided to use PDFViewer.GetText(TRect). This works: I arbitrarily set the ImageEnView width to 8500 and height to 11000 and was able to extract the last column using the rectangle (6300,0,7220,11000). Based on this, the character size appears to be about 66 x 135.

This doesn't completely resolve the issue, because white space is still trimmed to a single space. However, if I establish a rectangle for each individual field, I should be able to detect data from later fields spilling into a field's space and just remove that data.

I thought I could use ImageEnView1.SelectedRectangle to determine the rectangles I need but the result is not what I expected. When I create a window over the first section of text I get X=66 Y=68 W=90 H=13. These numbers are incorrect.

Can I establish a rectangle over each column of data and get the rectangle values needed for GetText?
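
As a rough sketch of the per-column idea described in this post, the helper below builds a full-height rectangle from a column's left and right edges. The numbers for the last column (6300 to 7220, page height 11000) are the ones quoted above; ColumnRect is a hypothetical helper, not an ImageEn function, and the coordinate units GetText expects are an assumption here. Newer Delphi versions may need System.Types in place of Types.

```pascal
program ColumnRects;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
uses
  SysUtils, Types;

// Hypothetical helper: a full-height rectangle spanning one column,
// given the column's left and right edges and the page height.
function ColumnRect(ALeft, ARight, APageHeight: Integer): TRect;
begin
  Result.Left := ALeft;
  Result.Top := 0;
  Result.Right := ARight;
  Result.Bottom := APageHeight;
end;

const
  PageHeight = 11000;  // view height quoted in the post
var
  R: TRect;
begin
  // Edges of the last column, as quoted in the post.
  R := ColumnRect(6300, 7220, PageHeight);
  Writeln(Format('(%d,%d,%d,%d)', [R.Left, R.Top, R.Right, R.Bottom]));
  // In the application, R would then be passed to the rectangle-based
  // GetText call mentioned in the post (exact signature not verified here).
end.
```

Defining one such rectangle per field keeps the column boundaries in a single place, so any text that falls outside its own column can be attributed to the neighbouring field.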


xequte

38613 Posts

Posted - Apr 25 2022 : 21:30:18
Hi Sidney

SelectedRectangle is only for image selections. With PDFium you need to get the text rects from starting and ending text indexes (e.g. the selected text):

https://www.imageen.com/help/TIEPdfViewerInteraction.GetTextRects.html


Nigel
Xequte Software
www.imageen.com