ImageEn Forum

ImageEnView LoadFileFromPDF GetText Trimming Spaces

TOPIC REVIEW
Sidney Egnew	Posted - Apr 21 2022 : 09:47:41

I am using a TImageEnView to extract text from PDF files. The PDF text contains data in fixed-field format, and I need to extract the fields from that data. GetText returns a single space wherever the data contains multiple spaces. I need the text returned without the spaces being trimmed. Is this possible?

The text obtained by copying from the PDF matches what GetText returns, but does not match what is displayed.

PDF Content:
    5200Doe, Jane 123122Blue 503
    5200Smith, John 010120Purple 601

Returned Text:
    5200Doe, Jane 123122Blue 503
    5200Smith, John 010120Purple 601

Code:
    v_ImageView: TImageEnView;
    ...
    v_ImageView.PdfViewer.Enabled := True;
    v_ImageView.IO.LoadFromFilePDF(p_FileName,-1,-1,p_Password);
    Result := v_ImageView.PdfViewer.GetText(0,100000);

6 LATEST REPLIES (Newest First)
xequte	Posted - Apr 25 2022 : 21:30:18

Hi Sidney

SelectedRectangle is only for image selections. With PDFium you need to get the text rects from starting and ending text indexes (e.g. the selected text):

https://www.imageen.com/help/TIEPdfViewerInteraction.GetTextRects.html

Nigel
Xequte Software
www.imageen.com

Sidney Egnew	Posted - Apr 21 2022 : 21:13:53

OCR was not working very well, so I decided to use PdfViewer.GetText(TRect). This works: I was able to arbitrarily set the ImageEnView width to 8500 and height to 11000 and extract the last column using the rectangle (6300,0,7220,11000). Based on this, the character size appears to be about 66 x 135.

This doesn't completely resolve the issue, because white space is still trimmed to a single space. But if I establish a rectangle for each individual field, I should be able to detect text from later fields that spills into a field's space and just remove that data.

I thought I could use ImageEnView1.SelectedRectangle to determine the rectangles I need, but the result is not what I expected. When I draw a selection over the first section of text I get X=66 Y=68 W=90 H=13. These numbers are incorrect.

Can I establish a rectangle over each column of data and get the rectangle values needed for GetText?
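
The per-column rectangle idea in the post above could be sketched roughly as follows. This is only an illustration under stated assumptions, not ImageEn's API: TColumnBounds and ColumnRects are hypothetical names, and the 8500 x 11000 view size and the 6300..7220 column are the values quoted in the post. Each resulting TRect would then be handed to the GetText(TRect) call mentioned in the post; that call is left out because the thread does not show its exact signature.

```pascal
program ColumnRectsDemo;

{$APPTYPE CONSOLE}

uses
  System.Types;

type
  // Hypothetical helper type: horizontal extent of one fixed field,
  // in the same units as the view (8500 wide in the post above).
  TColumnBounds = record
    Left, Right: Integer;
  end;

// Build one full-height rectangle per column (hypothetical helper).
function ColumnRects(const Bounds: array of TColumnBounds;
  PageHeight: Integer): TArray<TRect>;
var
  i: Integer;
begin
  SetLength(Result, Length(Bounds));
  for i := 0 to High(Bounds) do
    Result[i] := Rect(Bounds[i].Left, 0, Bounds[i].Right, PageHeight);
end;

var
  Cols: array[0..0] of TColumnBounds;
  Rects: TArray<TRect>;
  R: TRect;
begin
  // The last column from the post: X from 6300 to 7220, full page height.
  Cols[0].Left := 6300;
  Cols[0].Right := 7220;

  Rects := ColumnRects(Cols, 11000);
  for R in Rects do
    Writeln(R.Left, ',', R.Top, ',', R.Right, ',', R.Bottom);
  // prints 6300,0,7220,11000
end.
```

Keeping the column bounds in one table means each field's rectangle is derived from the same page height, so extending this to the remaining columns is just a matter of adding more entries once their X ranges are known.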

Sidney Egnew	Posted - Apr 21 2022 : 19:24:45

Nigel,

I set up an OCR program and attempted to access the text. It did not go well.

1) Tested using 1.tif from the OCR demos. The text was recognized.

    ImageEnView1.IO.LoadFromFile(c_FileName);
    ImageEnView1.Fit;

2) Tested using the password-protected PDF. I got one line that is nothing like the PDF: "G mmewslli SIS"

    ImageEnView1.PdfViewer.Enabled := True;
    ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1,c_Password);
    ImageEnView1.Fit;

3) I loaded the PDF into Acrobat and printed it to "Microsoft Print to PDF". I got nothing when I loaded the new non-password PDF and did the OCR.

    ImageEnView1.PdfViewer.Enabled := True;
    ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
    ImageEnView1.Fit;

4) I loaded a different non-password-protected PDF. The text was recognized, although not very well.

    ImageEnView1.PdfViewer.Enabled := True;
    ImageEnView1.IO.LoadFromFilePDF(c_FileName,-1,-1);
    ImageEnView1.Fit;

This is the common OCR routine:

    MainMemo.Lines.Clear;
    IEVisionOCR := IEVisionLib.createOCR(IEOCRLanguageList[OCR_English_language].Code);
    try
      MainMemo.Text := IEVisionOCR.recognize(ImageEnView1.IEBitmap.GetIEVisionImage(), IEVisionRect(0, 0, 0, 0)).c_str();
    except
      on E: Exception do
        ShowMessage(E.Message);
    end;

xequte	Posted - Apr 21 2022 : 16:26:02

Hi Sidney

You don't need to write the image to file, you can just apply OCR to the page bitmap (using IEVision). That should preserve the layout, but it seems a long way to do it.

Nigel
Xequte Software
www.imageen.com

Sidney Egnew	Posted - Apr 21 2022 : 15:48:59

What if I print the image to a file and then OCR it? Is that possible? Can I write the image to a non-password-protected PDF file?

Thanks

xequte	Posted - Apr 21 2022 : 14:31:46

Hi Sidney

I'm afraid that is what is returned by PDFium. ImageEn is not touching the data in any way.

Nigel
Xequte Software
www.imageen.com