ImageEn Forum - Maintaining Correct Spacing in OCR for PDF Data


Sidney Egnew

USA
59 Posts

Posted - Feb 13 2025 : 14:54:44

I need the OCR routine to give me the correct spacing for text in a PDF.

I receive encrypted PDF files containing data in NACHA format. NACHA format has fields in specific columns without delimiters. The PDF is using a fixed font and all data fields are in the correct positions in the PDF.

Code to Load the PDF


  v_ImageEnView1.PdfViewer.Enabled := True;
  v_ImageEnView1.IO.LoadFromFilePDF(v_FileName,-1,-1,c_Password);
  v_ImageEnView1.Fit;
  v_Text := v_ImageEnView1.PdfViewer.GetText(0,1000000);

The value in v_Text is correct except that every run of spaces has been collapsed to a single space. Since the data is fixed format, this makes it impossible to extract the individual fields. To overcome this, I identified the line spacing and the column positions for each record format in the file, which let me extract the data from each rectangle. This has worked fine for a long time but failed when we began receiving multi-page PDFs in 2025.

I tried a number of approaches but have not been able to extract the data after page 1 of the PDF. I then decided to extract each PDF page into a separate PDF file. With the single-page PDF files, v_ImageEnView1.PdfViewer.GetText returns no text at all, even though the PDF image is correct.

I then decided to save the single-page PDF files as TIFF files and OCR the TIFF files instead of the PDF. I created the TIFF files and tested using the IEVision\demos\OCR project. The OCR demo loaded the file and recognized the text, but it too collapsed every run of spaces to a single space. So it looks like my real problem is that IEVision OCR is converting multiple spaces to single spaces.

Is there a way to get the OCR to leave multiple spaces in place?

Thanks, Sidney

xequte

39053 Posts

Posted - Feb 13 2025 : 15:02:07

Hi Sidney

Firstly, we should consider again the easier (less computational) methods.

1. Can you give us the steps to reproduce the issue of being unable to extract text after page 1?

2. Did you consider iterating over all the text objects in the page to get their positions?

http://www.imageen.com/help/TIEPdfViewer.Objects.html

With regard to OCR, did you try the OCR with Layout demo?

Nigel
Xequte Software
www.imageen.com

Sidney Egnew

USA
59 Posts

Posted - Feb 13 2025 : 18:31:14

1. Can you give us the steps to reproduce the issue of being unable to extract text after page 1?

After creating the TImageEnView, I set the following:


  v_ImageEnView.Align := alNone;
  v_ImageEnView.Width := 612;
  v_ImageEnView.Height := 792;

To load the PDF:


  v_ImageEnView.PdfViewer.Enabled := True;
  v_ImageEnView.IO.LoadFromFilePDF(v_FileName,-1,-1,c_Password);
  v_ImageEnView.Fit;
  v_TextMemo.Text := v_ImageEnView.PdfViewer.GetText(0,1000000);

  // Initialize PDF data position
  v_LineNo := 0;
  v_Top := 21;
  v_Bottom := 27;

  while GetDataLine do
  begin
    if v_RecordType = '5' then
      ProcessBatchHeader
    else if v_RecordType = '6' then
      ProcessEntryDetail
    else if v_RecordType = '7' then
      ProcessAddendaRecord
    else if v_RecordType = '8' then
      ProcessBatchControl
    else
      LogMemo.Lines.Add('Error Processing Record: '+v_RecordType);
    Application.ProcessMessages;
  end;


  function GetDataLine: Boolean;
  begin
    // Advance one text line (9 units) down the page
    v_LineNo := v_LineNo+1;
    v_Top := v_Top+9;
    v_Bottom := v_Bottom+9;
    v_RecordType := StringField(0,'RECORDTYPE');
    // Only record types 5 to 8 are data lines; anything else stops the loop
    Result := (v_RecordType = '5') or (v_RecordType = '6') or
              (v_RecordType = '7') or (v_RecordType = '8');
  end;


  function StringField (const p_RecordType: Integer;
                        const p_FieldName: String): String;
  begin
    // Get Field Location from Database
    dmMain.GetFieldLocation (p_RecordType,p_FieldName,
                             v_Left,v_Right,v_FieldSize,v_FieldType);
    v_Rect := Trect.Create(v_Left,v_Top,v_Right,v_Bottom);
    Result := Trim(v_ImageEnView.PdfViewer.GetText(v_Rect));
  end;

When the last data line on the page is processed, GetDataLine returns False and the process ends. That is why the data on the other pages is not processed.

To fix this I would need to add code that checks whether there are more pages, loads the next page, and repeats the process (assuming the offsets are the same on all pages).

Can you tell me how to do that?

But, it would be so much better if the OCR would simply return the correct number of spaces instead of a single space. Then I could just process the text without all the rectangles.
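
To illustrate why the spacing matters, here is a minimal sketch of what processing the text directly could look like. It assumes the standard 94-character NACHA Entry Detail layout (record type in column 1, account number in columns 13-29, amount in columns 30-39, name in columns 55-76). The Field helper and all sample values are made up for illustration and are not part of ImageEn.

```pascal
program NachaSliceSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NachaRecordLength = 94;  // NACHA records are 94 characters wide

// Hypothetical helper: return 1-based columns FirstCol..LastCol of a line.
function Field(const Line: string; FirstCol, LastCol: Integer): string;
begin
  Result := Copy(Line, FirstCol, LastCol - FirstCol + 1);
end;

var
  Line, Collapsed: string;
begin
  // Made-up Entry Detail record; Format pads each field to its fixed width
  Line := '6' + '22' + '07640125' + '1' +
          Format('%-17s', ['123456789']) +   // account number, columns 13-29
          '0000012345' +                     // amount in cents, columns 30-39
          Format('%-15s', ['ID0001']) +      // individual ID, columns 40-54
          Format('%-22s', ['JANE DOE']) +    // individual name, columns 55-76
          '  ' + '0' + '076401250000001';    // columns 77-94
  if Length(Line) <> NachaRecordLength then
    Halt(1);

  // With spacing intact, every field is a simple column slice
  WriteLn('Record type: ', Field(Line, 1, 1));
  WriteLn('Amount     : ', Field(Line, 30, 39));
  WriteLn('Name       : ', Trim(Field(Line, 55, 76)));

  // Collapse runs of spaces the way the extracted text arrives
  Collapsed := Line;
  while Pos('  ', Collapsed) > 0 do
    Collapsed := StringReplace(Collapsed, '  ', ' ', [rfReplaceAll]);

  // The same columns no longer hold the amount
  WriteLn('Collapsed  : ', Field(Collapsed, 30, 39));
end.
```

Once the runs of spaces are collapsed, every field after the first gap shifts left, which is why the rectangles are needed today.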

2. Did you consider iterating over all the text objects in the page to get their positions?

Yes, that is one of the many "number of approaches" I tried. I used demos\other\PDFPageObjects.

All it shows is Object[0] - Form.

xequte

39053 Posts

Posted - Feb 13 2025 : 21:48:05

Hi Sidney

To go to the next page, you can just use PageIndex:

// Navigate to the next page
ImageEnView1.PdfViewer.PageIndex := ImageEnView1.PdfViewer.PageIndex + 1;

http://www.imageen.com/help/TIEPdfViewer.PageIndex.html
http://www.imageen.com/help/TIEPdfViewer.PageCount.html
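
A rough sketch of how a loop over all pages might be structured with those two properties. TStubViewer is a hypothetical stand-in written only so the example runs without ImageEn; it models PageCount and PageIndex and nothing else. In real code the loop would read them from the PdfViewer and call the existing record handlers where the placeholder is.

```pascal
program PageLoopSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical stand-in for the viewer (not an ImageEn class).
  // Only PageCount and PageIndex mirror properties named in this thread.
  TStubViewer = class
  public
    PageCount: Integer;
    PageIndex: Integer;
    // Invented for the demo: number of data lines on the current page
    function LinesOnCurrentPage: Integer;
  end;

function TStubViewer.LinesOnCurrentPage: Integer;
begin
  if PageIndex = PageCount - 1 then
    Result := 3     // short last page
  else
    Result := 40;
end;

var
  Viewer: TStubViewer;
  Page, LineNo, Total: Integer;
begin
  Viewer := TStubViewer.Create;
  try
    Viewer.PageCount := 3;
    Total := 0;
    for Page := 0 to Viewer.PageCount - 1 do
    begin
      // Select the page before reading anything from it
      Viewer.PageIndex := Page;
      // Restart the per-page line position here (v_Top / v_Bottom above)
      for LineNo := 1 to Viewer.LinesOnCurrentPage do
        Inc(Total);   // placeholder for the per-record processing
    end;
    WriteLn('Pages: ', Viewer.PageCount, '  lines processed: ', Total);
  finally
    Viewer.Free;
  end;
end.
```

The point of the structure is that the line position is reset once per page, so each page is walked from its own first line.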

You did not attach an example PDF, but assuming it contains text then I don't think OCR is a great solution because it is unnecessarily computationally expensive, i.e.

1. Load a text document
2. Render text document as an image
3. Perform OCR to convert image back into text

That process is slow and has potential to introduce errors.

Did you see the example at:

http://www.imageen.com/help/TIEPdfViewer.Objects.html

Showing how to parse page objects?

Nigel
Xequte Software
www.imageen.com

Sidney Egnew

USA
59 Posts

Posted - Feb 14 2025 : 02:32:31

I modified my code to use PageCount and PageIndex and that allowed me to OCR all the data.

I thought that would let me implement a solution, because I realized I could have the PDF processing code extract whole lines of data instead of individual fields. That would significantly reduce the complexity of the program and would be easy to implement.

Unfortunately, it did not work out. The reason I had to implement field level extraction is because PdfViewer.GetText(v_Rect) also converts multiple spaces to single spaces. This makes getting the entire line impossible.

Additionally, the starting point on the subsequent pages differs from the first page. I can work around that if there is a way to get PDFViewer to give me the location of the first character. How can I do that?

I would like you to address the OCR issue, because computational cost is not a concern here:

1) I receive one file a day.

1 page - 574 times, usually just a few lines on the page
2 pages - 2 times (2025)
3 pages - 1 time and third page only had 3 lines (2025)

2) It only takes a second or two to OCR a page.

I have been doing this since I implemented the process. I loaded a memo with the OCR text during testing and never took the code out.

v_TextMemo.Text := v_ImageEnView.PdfViewer.GetText(0,1000000);

Questions

Can you let me know if OCR can return all spaces instead of one space? This is assuming a fixed font, which is the case here.

If possible, how would I code it?

If not possible, could I use a language file for Courier that treats spaces as characters? I use a language file to extract MICR.

Summary

When I initially started this project, I thought the "Single Space" issue was a problem with the PDF. I now see the issue is with IEVision OCR. I have a solution that will work, but it would be easier to have IEVision OCR provide the correct number of spaces.

I still don't have a solution.

Sidney Egnew

USA
59 Posts

Posted - Feb 14 2025 : 22:20:40

I successfully determined the location for all lines of text on all pages using


 rects := myImageEnView.PdfViewer.GetTextRects( 0, -1 );

I will still have to pull each field individually but this will guarantee I get all the data.
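
For anyone who wants whole lines rather than fields: once the position of each piece of text is known, the fixed-pitch spacing can in principle be rebuilt by turning each left edge into a column number. A minimal sketch under stated assumptions follows. TFragment and RebuildLine are hypothetical (not ImageEn types or calls), the positions are made up, and the character width is assumed constant because the font is fixed.

```pascal
program RebuildSpacingSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical record: one run of text and the x position of its left edge.
  // Fill it from whatever positions the viewer reports.
  TFragment = record
    Left: Double;
    Text: string;
  end;

// Rebuild one fixed-pitch line. LineLeft is the x position of column 1 and
// CharWidth the advance of one character, both in the same units as Left.
// Fragments must already be sorted left to right.
function RebuildLine(const Frags: array of TFragment;
  LineLeft, CharWidth: Double): string;
var
  i, Col: Integer;
begin
  Result := '';
  for i := Low(Frags) to High(Frags) do
  begin
    // 0-based column where this fragment starts
    Col := Round((Frags[i].Left - LineLeft) / CharWidth);
    // Pad with spaces up to that column, then append the text
    if Col > Length(Result) then
      Result := Result + StringOfChar(' ', Col - Length(Result));
    Result := Result + Frags[i].Text;
  end;
end;

var
  Frags: array[0..2] of TFragment;
begin
  // Made-up positions: 6 units per character, line starting at x = 36
  Frags[0].Left := 36;  Frags[0].Text := '6';
  Frags[1].Left := 108; Frags[1].Text := '123456789';
  Frags[2].Left := 210; Frags[2].Text := '0000012345';
  WriteLn('[', RebuildLine(Frags, 36, 6), ']');
end.
```

Taking LineLeft from the leftmost fragment on each page would also absorb a different starting point on later pages, provided the character width stays the same.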

xequte

39053 Posts

Posted - Feb 15 2025 : 16:15:19

Hi Sidney

Do you have any outstanding requirements for this?

Nigel
Xequte Software
www.imageen.com
