 Extracting all text incl. text positions from PDF file within Delphi console app

fsterk92

Germany
5 Posts

Posted - May 26 2022 :  12:21:11
Hello,

I want to extract all text content including all text positions from a PDF file within a Delphi console app. My idea is to use the PDFViewer component in the ImageEnView class (without the display part).

I have written the following (working) test code (using Delphi XE2 and ImageEn 11.0.0) and I have some questions (below):

program pdfTextExtractor;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils, hyiedefs, hyieutils, iexBitmaps, Math,
  iesettings, iexLayers, iexRulers, iexToolbars, iexUserInteractions, imageenio,
  imageenproc, ieview, imageenview, Vcl.StdCtrls, iexPdfiumCore;
var
  imageEnView: TImageEnView;
  i, j: Integer;
  text: String;
  textRect, charBox: TDRect;
begin
  try

    // init imageen view
    IEGlobalSettings().RegisterPlugIns([ iepiPDFium ]);
    IEGlobalSettings().PdfViewerDefaults.DPI := 300;
    imageEnView := TImageEnView.Create(nil);
    with imageEnView.PdfViewer do
    begin

      // init pdf viewer and load pdf file
      Enabled := True;
      LoadFromFile('c:\temp\test.pdf');

      // step through all document pages
      for i := 0 to Document.PageCount - 1 do
      begin

        // variant 1 : step through all text rects of current page and output position and content
        for j := 0 to Page[i].GetTextRectCount(1, -1) - 1 do  // (1, -1) = all text rects
        begin
          textRect := Page[i].GetTextRect(j);
          text := Page[i].GetTextAt(textRect);
          // arguments follow the label order: left, top, right, bottom
          WriteLn(Format('page %d, left %d, top %d, right %d, bottom %d: %s',
            [i + 1, Floor(textRect.Left), Floor(textRect.Top), Floor(textRect.Right), Floor(textRect.Bottom), text]));
        end;

        // variant 2 : step through all char boxes of current page and output position and content
        for j := 0 to Page[i].GetCharCount - 1 do
        begin
          charBox := Page[i].GetCharBox(j);
          text := Page[i].GetTextAt(charBox);
          // arguments follow the label order: left, top, right, bottom
          WriteLn(Format('page %d, left %d, top %d, right %d, bottom %d: %s',
            [i + 1, Floor(charBox.Left), Floor(charBox.Top), Floor(charBox.Right), Floor(charBox.Bottom), text]));
        end;

      end;

    end;

    // release the view once all pages have been processed
    imageEnView.Free;

  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

The test program is working and outputs all text elements including the text positions (variant 1) and also outputs all char elements including the char positions (variant 2).

Now I have some questions:

1. Is it 'allowed' to use the TImageEnView component within a console app without any display output? If so, are there any known side effects that have to be considered? Or is there a better way? I have also tried to use the PDFViewer / TIEPDFViewerInteraction class without TImageEnView (pdfViewer := TIEPdfViewerInteraction.Create(nil)) but this causes an access violation.

2. Is there a better way to step through the text or char elements? Currently I step through the positions and then call GetTextAt for each one to read the text or char element. This might not be the best approach in terms of performance.

3. There seems to be no resolution (dpi) property within PDFViewer. From the comments in the ImageEn / PDFium source code, it looks like the PDF format is based on a coordinate system (user space) with its origin at the bottom left and a fixed 'resolution' of 72 points per inch. Is it correct that all text positions returned by ImageEnView.PDFViewer are based on that 72 dpi resolution? Or are there further parameters to consider if I want to transform the positions, e.g. into millimeters?

Many thanks in advance for any help or further information.

Regards

Florian

xequte

38608 Posts

Posted - May 26 2022 :  14:13:45
Hi Florian

1. Yes, there are many use cases for TImageEnView as a non-visual component and that is something we test for. However, non-visual use of the PDFViewer is relatively new and has not been extensively tested, so you may encounter some issues. If so, let us know with the steps to reproduce.

1b. You should not use TIEPDFViewerInteraction directly; it is designed to have a TImageEnView as a parent. That said, I don't like that you are seeing an A/V rather than a more friendly message, so please give me the steps to reproduce that.

1c. You can use TPdfDocument directly, but it is not documented.

2. You can use the GetText* methods to step through the text rects.

3. PDF documents are built on "PDF Points", which are 72 per inch. In ImageEnView we scale this using IEGlobalSettings().PdfViewerDefaults.DPI.

https://www.imageen.com/help/TIEGlobalSettings.PdfViewerDefaults.html



Nigel
Xequte Software
www.imageen.com

fsterk92

Germany
5 Posts

Posted - May 27 2022 :  02:33:37
Hi Nigel,

thank you for your quick answer!

1. That sounds good, so I will use TImageEnView as a non-visual component. If I encounter any problems with the non-visual use of the PDFViewer, I will let you know. I have also tried to reproduce the access violation when using the TIEPDFViewerInteraction class directly, but this time it worked fine, so I think that was an error on my side.

2. Ok, that's fine for me.

3. Ok, I have set IEGlobalSettings().PdfViewerDefaults.DPI to 300 and expected the returned positions to use this resolution. But as I have 'bypassed' ImageEnView by directly accessing the GetCharBox and GetTextRect methods of the PDFViewer / PDFium libraries, the results are returned based on 72 points per inch. As this doesn't change, I can work with this.

Many thanks!

Florian

xequte

38608 Posts

Posted - May 27 2022 :  22:22:38
Hi Florian

3. Yes, as you are using the PDFPage methods directly, they are not affected by ImageEn properties.
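
For readers who want the millimeter conversion asked about above, here is a minimal sketch. It assumes the positions are plain PDF points at 72 per inch, as discussed in this thread. PointsToMM and the two constants are hypothetical names for illustration and are not part of ImageEn; the origin and page rotation are not handled.

```pascal
program PointsToMillimeters;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  POINTS_PER_INCH = 72.0;  // PDF points per inch (assumed, see text above)
  MM_PER_INCH     = 25.4;  // millimeters per inch

// Hypothetical helper (not an ImageEn function):
// converts a length or coordinate in PDF points to millimeters.
function PointsToMM(points: Double): Double;
begin
  Result := points / POINTS_PER_INCH * MM_PER_INCH;
end;

begin
  // 72 points are one inch
  WriteLn(Format('%.2f mm', [PointsToMM(72.0)]));
  // roughly the height of an A4 page in points
  WriteLn(Format('%.2f mm', [PointsToMM(842.0)]));
end.
```

The same helper could be applied to each of Left, Top, Right and Bottom of a rect returned by GetTextRect or GetCharBox. If a top-left origin is needed, the vertical values would additionally have to be subtracted from the page height.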

Nigel
Xequte Software
www.imageen.com

xequte

38608 Posts

Posted - May 29 2022 :  16:41:40
Here is an example to access all text rects of the page (in terms of current DPI):

// Output all text rects in current page
var
  rects: TIERectArray;
  text: string;
  i: Integer;  // loop index over the returned rects
begin
  rects := ImageEnView1.PdfViewer.GetTextRects( 0, MAXINT );
  for i := Low( rects ) to High( rects ) do
  begin
    text := ImageEnView1.PdfViewer.GetText( rects[i] );
    memo1.Lines.Add( Format( '%d (%d, %d, %d, %d): %s', [ i + 1, rects[i].Left, rects[i].Top, rects[i].Right, rects[i].Bottom, text ]));
  end;
end;


Nigel
Xequte Software
www.imageen.com

xequte

38608 Posts

Posted - Jul 21 2022 :  03:11:02
Also, from 11.0.2, you can just output PDF pages as text and formatted text.

Nigel
Xequte Software
www.imageen.com