ImageEn for Delphi and C++ Builder

 

ImageEn Forum
 ImageEnView LoadFileFromPDF GetText Trimming Spaces

T O P I C    R E V I E W
Sidney Egnew Posted - Apr 21 2022 : 09:47:41
I am using a TImageEnView to extract text from PDF files. The PDF text contains data in fixed-field format, and I need to extract the fields from that data. GetText returns a single space wherever the data contains multiple spaces. I need the text returned without the spaces being collapsed. Is this possible?

The text obtained by copying from the PDF matches the text returned by GetText, but neither matches what is displayed.

PDF Content:
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601

Returned Text
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601

Code
v_ImageView: TImageEnView;
...
v_ImageView.PdfViewer.Enabled := True;
v_ImageView.IO.LoadFromFilePDF(p_FileName,-1,-1,p_Password);
Result := v_ImageView.PdfViewer.GetText(0,100000);
6   L A T E S T    R E P L I E S    (Newest First)
xequte Posted - Apr 25 2022 : 21:30:18
Hi Sidney

SelectedRectangle is only for image selections. With PDFium you need to get the text rects from starting and ending text indexes (e.g. the selected text):

https://www.imageen.com/help/TIEPdfViewerInteraction.GetTextRects.html


Nigel
Xequte Software
www.imageen.com
Sidney Egnew Posted - Apr 21 2022 : 21:13:53
OCR was not working very well, so I decided to use PdfViewer.GetText(TRect). This works: I was able to arbitrarily set the ImageEnView width to 8500 and height to 11000, then extract the last column using the rectangle (6300,0,7220,11000). Based on this, the character size appears to be about 66 x 135.

This doesn't completely resolve the issue, because white space is still collapsed to a single space. But if I define a rectangle for each individual field, I should be able to detect when text from a later field intrudes into a field's space and simply remove that data.

I thought I could use ImageEnView1.SelectedRectangle to determine the rectangles I need but the result is not what I expected. When I create a window over the first section of text I get X=66 Y=68 W=90 H=13. These numbers are incorrect.

Can I establish a rectangle over each column of data and get the rectangle values needed for GetText?
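
One possible way to derive such column rectangles from the figures above, as a minimal sketch only. It assumes a page height of 11000 and a character cell about 66 units wide (the approximate values from this post), and that fields are known by 0-based starting character column and length. ColumnRect is a hypothetical helper, not part of ImageEn, and the sketch makes no ImageEn calls; whether the resulting rectangles line up with the PDF text would need to be checked against the actual file.

```delphi
program ColumnRects;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Types;

const
  // Assumed values, taken from the approximate figures in the post above
  PageHeight = 11000;
  CharWidth  = 66;

// Hypothetical helper: full-height rectangle covering FieldLen characters
// starting at the 0-based character column StartCol.
function ColumnRect(StartCol, FieldLen: Integer): TRect;
begin
  Result := Rect(StartCol * CharWidth, 0,
                 (StartCol + FieldLen) * CharWidth, PageHeight);
end;

var
  R: TRect;
begin
  // A 14-character field starting at column 95, roughly the last column above
  R := ColumnRect(95, 14);
  Writeln(Format('%d,%d,%d,%d', [R.Left, R.Top, R.Right, R.Bottom]));
  // prints 6270,0,7194,11000
end.
```

Each rectangle could then be passed to the rectangle-based GetText call mentioned above, one field at a time.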

Sidney Egnew Posted - Apr 21 2022 : 19:24:45
Nigel,

I set up an OCR program and attempted to access the text. It did not go well.

1) Tested using 1.tif from OCR demos.
The text was recognized.
ImageEnView1.IO.LoadFromFile(c_FileName);
ImageEnView1.Fit;

2) Tested using the password-protected PDF.
I got one line that is nothing like the PDF: "G mmewslli SIS"
ImageEnView1.PdfViewer.Enabled := True;
ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1,c_Password);
ImageEnView1.Fit;

3) I loaded the PDF into Acrobat and printed it to "Microsoft Print to PDF".
I got nothing when I loaded the new non-password PDF and did the OCR.
ImageEnView1.PdfViewer.Enabled := True;
ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
ImageEnView1.Fit;

4) I loaded a different non-password-protected PDF.
The text was recognized although not very well.
ImageEnView1.PdfViewer.Enabled := True;
ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
ImageEnView1.Fit;


This is the common OCR routine:
MainMemo.Lines.Clear;
IEVisionOCR := IEVisionLib.createOCR(IEOCRLanguageList[OCR_English_language].Code);
try
  MainMemo.Text := IEVisionOCR.recognize(ImageEnView1.IEBitmap.GetIEVisionImage(), IEVisionRect(0, 0, 0, 0)).c_str();
except
  on E: Exception do
    ShowMessage(E.Message);
end;
xequte Posted - Apr 21 2022 : 16:26:02
Hi Sidney

You don't need to write the image to file, you can just apply OCR to the page bitmap (using IEVision).

That should preserve the layout, but it seems a roundabout way to do it.



Nigel
Xequte Software
www.imageen.com
Sidney Egnew Posted - Apr 21 2022 : 15:48:59
What if I print the image to a file and then OCR it? Is that possible?

Can I write the image to a non-password-protected PDF file?

Thanks
xequte Posted - Apr 21 2022 : 14:31:46
Hi Sidney

I'm afraid that is what is returned by PDFium. ImageEn is not touching the data in any way.

Nigel
Xequte Software
www.imageen.com