TOPIC REVIEW |
fsterk92 |
Posted - May 26 2022 : 12:21:11
Hello,
I want to extract all text content including all text positions from a PDF file within a Delphi console app. My idea is to use the PDFViewer component in the ImageEnView class (without the display part).
I have written the following (working) test code (using Delphi XE2 and ImageEn 11.0.0) and I have some questions (below):
program pdfTextExtractor;

{$APPTYPE CONSOLE}
{$R *.res}

uses
  System.SysUtils, hyiedefs, hyieutils, iexBitmaps, Math,
  iesettings, iexLayers, iexRulers, iexToolbars, iexUserInteractions, imageenio,
  imageenproc, ieview, imageenview, Vcl.StdCtrls, iexPdfiumCore;

var
  imageEnView: TImageEnView;
  i, j: Integer;
  text: String;
  textRect, charBox: TDRect;

begin
  try
    // init imageen view
    IEGlobalSettings().RegisterPlugIns([ iepiPDFium ]);
    IEGlobalSettings().PdfViewerDefaults.DPI := 300;
    imageEnView := TImageEnView.Create(nil);
    try
      with imageEnView.PdfViewer do
      begin
        // init pdf viewer and load pdf file
        Enabled := True;
        LoadFromFile('c:\temp\test.pdf');
        // step through all document pages
        for i := 0 to Document.PageCount - 1 do
        begin
          // variant 1 : step through all text rects of current page and output position and content
          for j := 0 to Page[i].GetTextRectCount(1, -1) - 1 do // (1, -1) = all text rects
          begin
            textRect := Page[i].GetTextRect(j);
            text := Page[i].GetTextAt(textRect);
            // arguments follow the order of the labels: left, top, right, bottom
            WriteLn(Format('page %d, left %d, top %d, right %d, bottom %d: %s',
              [i + 1, Floor(textRect.Left), Floor(textRect.Top), Floor(textRect.Right), Floor(textRect.Bottom), text]));
          end;
          // variant 2 : step through all char boxes of current page and output position and content
          for j := 0 to Page[i].GetCharCount - 1 do
          begin
            charBox := Page[i].GetCharBox(j);
            text := Page[i].GetTextAt(charBox);
            WriteLn(Format('page %d, left %d, top %d, right %d, bottom %d: %s',
              [i + 1, Floor(charBox.Left), Floor(charBox.Top), Floor(charBox.Right), Floor(charBox.Bottom), text]));
          end;
        end;
      end;
    finally
      // release the non-visual view once all pages have been processed
      imageEnView.Free;
    end;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
The test program is working and outputs all text elements including the text positions (variant 1) and also outputs all char elements including the char positions (variant 2).
Now I have some questions:
1. Is it 'allowed' to use the TImageEnView component within a console app without any display output? If so, are there any known side effects that have to be considered? Or is there a better way? I have also tried to use the PDFViewer / TIEPDFViewerInteraction class without TImageEnView (pdfViewer := TIEPdfViewerInteraction.Create(nil)) but this causes an access violation.
2. Is there a better way to step through the text or char elements? Currently I am stepping through the positions and then use the method GetTextAt to access the text or char elements. This might not be the best way regarding performance.
3. There seems to be no resolution (dpi) property within PDFViewer. From the comments in the ImageEn / PDFium source code it looks like the PDF format is based on a coordinate system (user space) with its origin at the bottom left and a fixed 'resolution' of 72 points per inch. Is it right that all text positions returned by ImageEnView.PDFViewer are based on that 72 dpi resolution? Or are there any further parameters that have to be considered if I want to transform the positions, e.g. into millimeters?
Many thanks in advance for any help or further information.
Regards
Florian
|
5 LATEST REPLIES (Newest First) |
xequte |
Posted - Jul 21 2022 : 03:11:02
Also, from version 11.0.2, you can output PDF pages directly as text and formatted text.
Nigel Xequte Software www.imageen.com
|
xequte |
Posted - May 29 2022 : 16:41:40
Here is an example to access all text rects of the page (in terms of current DPI):
// Output all text rects in current page
var
  rects: TIERectArray;
  text: string;
  i: Integer;
begin
  rects := ImageEnView1.PdfViewer.GetTextRects( 0, MAXINT );
  for i := Low( rects ) to High( rects ) do
  begin
    text := ImageEnView1.PdfViewer.GetText( rects[i] );
    memo1.Lines.Add( Format( '%d (%d, %d, %d, %d): %s', [ i + 1, rects[i].Left, rects[i].Top, rects[i].Right, rects[i].Bottom, text ]));
  end;
end;
Nigel Xequte Software www.imageen.com
|
xequte |
Posted - May 27 2022 : 22:22:38
Hi Florian
3. Yes, as you are using the PDFPage methods directly, they are not affected by ImageEn properties.
Nigel Xequte Software www.imageen.com
|
fsterk92 |
Posted - May 27 2022 : 02:33:37
Hi Nigel,
thank you for your quick answer!
1. That sounds good, so I will use TImageEnView as a non-visual component. If I encounter any problems with the non-visual use of the PDFViewer, I will let you know. I have also tried to reproduce the access violation when using the TIEPDFViewerInteraction class directly, but this time it worked fine, so I think that was an error on my side.
2. Ok, that's fine for me.
3. Ok, I had set IEGlobalSettings().PdfViewerDefaults.DPI to 300 and expected the returned positions to use this resolution. But as I have 'bypassed' ImageEnView by directly accessing the GetCharBox and GetTextRect methods of the PDFViewer / PDFium libraries, the results are returned based on 72 points per inch. As this doesn't change, I can work with this.
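For the millimeter conversion asked about in the original question, a minimal sketch follows. It assumes the positions really are in PDF points (72 per inch), as described in this thread, and uses the fact that one inch is 25.4 mm. The helper name PointsToMm and the sample values are hypothetical and not part of ImageEn.

```delphi
program PdfPointsToMm;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  POINTS_PER_INCH = 72.0;  // PDF user space: 72 points per inch
  MM_PER_INCH = 25.4;      // one inch is exactly 25.4 mm

// Hypothetical helper (not an ImageEn function): converts a length
// or coordinate given in PDF points to millimeters.
function PointsToMm(const APoints: Double): Double;
begin
  Result := APoints * MM_PER_INCH / POINTS_PER_INCH;
end;

begin
  // Sample values only; in the test program these would be the
  // Left/Top/Right/Bottom fields of a returned TDRect.
  WriteLn(Format('left %.2f mm, top %.2f mm',
    [PointsToMm(72.0), PointsToMm(144.0)]));
end.
```

With this factor, 72 points map to 25.4 mm and 144 points to 50.8 mm. Whether the vertical origin is at the top or the bottom of the page is not settled in this thread, so the sketch converts units only and does not flip the axis.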
Many thanks!
Florian
|
xequte |
Posted - May 26 2022 : 14:13:45
Hi Florian
1. Yes, there are many use cases for TImageEnView as a non-visual component and that is something we test for. However, non-visual use of the PDFViewer is relatively new and has not been extensively tested, so you may encounter some issues. If so, let us know the steps to reproduce them.
1b. You should not use TIEPDFViewerInteraction directly, as it is designed to have a TImageEnView as a parent. That said, I don't like that you are seeing an A/V rather than a friendlier message, so please give me the steps to reproduce that.
1c. You can use TPdfDocument directly, but it is not documented.
2. You can use the GetText* methods to step through the text rects
3. PDF documents are built on "PDF Points", of which there are 72 per inch (i.e. 72 dpi). In ImageEnView we scale this using IEGlobalSettings().PdfViewerDefaults.DPI
https://www.imageen.com/help/TIEGlobalSettings.PdfViewerDefaults.html
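As a rough illustration of the scaling described in point 3, the sketch below converts a length in PDF points to pixels at a given DPI using a simple proportion. This is an assumption about the arithmetic only; how ImageEnView rounds or offsets values internally is not stated in this thread, and the helper name PointsToPixels is hypothetical.

```delphi
program PdfPointsToPixels;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  POINTS_PER_INCH = 72.0;  // PDF points per inch

// Hypothetical helper (not an ImageEn function): scales a length in
// PDF points to pixels at the given DPI by simple proportion.
function PointsToPixels(const APoints, ADpi: Double): Double;
begin
  Result := APoints * ADpi / POINTS_PER_INCH;
end;

begin
  // One inch (72 points) at the DPI used in the test program (300)
  WriteLn(Format('%.2f px', [PointsToPixels(72.0, 300.0)]));
end.
```

Under this proportion, one inch (72 points) corresponds to 300 pixels at 300 DPI, and values read directly from the PDFPage methods would stay in points regardless of the DPI setting.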
Nigel Xequte Software www.imageen.com
|