Author |
Topic |
|
Sidney Egnew
USA
55 Posts |
Posted - Apr 21 2022 : 09:47:41
|
I am using a TImageEnView to extract text from PDF files. The PDF text contains data in fixed-field format, and I need to extract the fields from that data. GetText returns a single space wherever there are multiple spaces in the data. I need the text returned without the runs of spaces being collapsed. Is this possible?
The text obtained by copying from the PDF matches the text returned by GetText, but neither matches what is displayed.
PDF Content:
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601
Returned Text:
5200Doe, Jane 123122Blue 503
5200Smith, John 010120Purple 601
Code:
v_ImageView: TImageEnView;
...
v_ImageView.PdfViewer.Enabled := True;
v_ImageView.IO.LoadFromFilePDF(p_FileName, -1, -1, p_Password);
Result := v_ImageView.PdfViewer.GetText(0, 100000);
|
|
xequte
38613 Posts |
Posted - Apr 21 2022 : 14:31:46
|
Hi Sidney
I'm afraid that is what is returned by PDFium. ImageEn is not touching the data in any way.
Nigel Xequte Software www.imageen.com
|
|
|
Sidney Egnew
USA
55 Posts |
Posted - Apr 21 2022 : 15:48:59
|
What if I print the image to a file and then OCR it? Is that possible?
Can I write the image to a non-password-protected PDF file?
Thanks
|
|
xequte
38613 Posts |
Posted - Apr 21 2022 : 16:26:02
|
Hi Sidney
You don't need to write the image to file, you can just apply OCR to the page bitmap (using IEVision).
That should preserve the layout, but it seems a roundabout way to do it.
Nigel Xequte Software www.imageen.com
|
|
|
Sidney Egnew
USA
55 Posts |
Posted - Apr 21 2022 : 19:24:45
|
Nigel,
I set up an OCR program and attempted to access the text. It did not go well.
1) Tested using 1.tif from the OCR demos. The text was recognized.
   ImageEnView1.IO.LoadFromFile(c_FileName);
   ImageEnView1.Fit;
2) Tested using the password-protected PDF. I got one line that is nothing like the PDF: "G mmewslli SIS"
   ImageEnView1.PdfViewer.Enabled := True;
   ImageEnView1.IO.LoadFromFilePDF(c_FileName, -1, -1, c_Password);
   ImageEnView1.Fit;
3) I loaded the PDF into Acrobat and printed it to "Microsoft Print to PDF". I got nothing when I loaded the new non-password PDF and did the OCR.
   ImageEnView1.PdfViewer.Enabled := True;
   ImageEnView1.IO.LoadFromFilePDF(c_FileName, -1, -1);
   ImageEnView1.Fit;
4) I loaded a different non-password-protected PDF. The text was recognized, although not very well.
   ImageEnView1.PdfViewer.Enabled := True;
   ImageEnView1.IO.LoadFromFilePDF(c_FileName, -1, -1);
   ImageEnView1.Fit;
This is the common OCR routine:
   MainMemo.Lines.Clear;
   IEVisionOCR := IEVisionLib.createOCR(IEOCRLanguageList[OCR_English_language].Code);
   try
     MainMemo.Text := IEVisionOCR.recognize(ImageEnView1.IEBitmap.GetIEVisionImage(), IEVisionRect(0, 0, 0, 0)).c_str();
   except
     on E: Exception do
       ShowMessage(E.Message);
   end;
|
|
|
Sidney Egnew
USA
55 Posts |
Posted - Apr 21 2022 : 21:13:53
|
OCR was not working very well, so I decided to use PdfViewer.GetText(TRect). This works: I arbitrarily set the ImageEnView width to 8500 and height to 11000, and was able to extract the last columns using the rectangle (6300,0,7220,11000). Based on this, the character cell size appears to be about 66 x 135.
This doesn't completely resolve the issue, because white space is still trimmed to a single space. But if I establish a rectangle for each individual field, I should be able to detect data from later fields intruding into a field's space and just remove that data.
I thought I could use ImageEnView1.SelectedRectangle to determine the rectangles I need, but the result is not what I expected. When I create a selection over the first section of text I get X=66 Y=68 W=90 H=13. These numbers are incorrect.
Can I establish a rectangle over each column of data and get the rectangle values needed for GetText?
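One way the per-field rectangles might be derived, as a rough sketch only: if the character cell width really is about 66 units at the 8500 x 11000 size mentioned above, a field's rectangle can be computed from its character columns instead of being drawn by hand. Everything named here (TFieldSpec, c_Fields, c_LeftMargin, the column positions) is hypothetical and for illustration; the real field layout is not given in this thread, and the 66-unit width is only the estimate from above. The resulting rectangles would then be passed to the GetText(TRect) call already described.

```delphi
program FieldRects;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Types;

type
  // Hypothetical description of one fixed-width field, in character columns
  TFieldSpec = record
    Name: string;
    StartCol: Integer;  // zero-based first character column of the field
    Len: Integer;       // field length in characters
  end;

const
  c_CharWidth  = 66;     // estimated character cell width (from the post, not measured)
  c_PageHeight = 11000;  // view height used in the post
  c_LeftMargin = 0;      // assumed offset of column 0; adjust if the text is indented

  // Hypothetical layout; replace with the real field positions
  c_Fields: array[0..2] of TFieldSpec = (
    (Name: 'Code';  StartCol: 0;  Len: 4),
    (Name: 'Name';  StartCol: 4;  Len: 20),
    (Name: 'Date';  StartCol: 24; Len: 6)
  );

// Convert a field's character columns into a full-height rectangle
function FieldToRect(const p_Field: TFieldSpec): TRect;
begin
  Result := Rect(
    c_LeftMargin + p_Field.StartCol * c_CharWidth,
    0,
    c_LeftMargin + (p_Field.StartCol + p_Field.Len) * c_CharWidth,
    c_PageHeight);
end;

var
  i: Integer;
  r: TRect;
begin
  // Print the rectangle for each field so the values can be checked by eye
  for i := Low(c_Fields) to High(c_Fields) do
  begin
    r := FieldToRect(c_Fields[i]);
    Writeln(Format('%-6s (%d,%d,%d,%d)',
      [c_Fields[i].Name, r.Left, r.Top, r.Right, r.Bottom]));
  end;
end.
```

This only works if the PDF uses a fixed-pitch font and the text starts at a known left offset; with a proportional font the column-to-coordinate mapping would not hold.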
|
|
|
xequte
38613 Posts |
|
|