How to get Tamil Language Legacy model for Tesseract OSD?

Captain_Odyssey · November 5, 2024, 7:07pm

Dear all,

I’m currently trying to use the Python wrapper for Tesseract (pytesseract) to correct the rotation, in multiples of 90 degrees, of images of Tamil newspapers. Specifically, I want to use pytesseract.image_to_osd(binary, config='--oem 0 -l tam --psm 0') to get the orientation (OSD) data for each image so that I can correct it. I tried --oem 0, 1, 2 and 3, and none of them worked, including the modes that request the legacy engine. (The meanings of oem and psm are explained here: All Tesseract OCR options – Muthukrishnan)

Error for --oem 0 and 2:
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Warning, detects only orientation with -l tam Error: Tesseract (legacy) engine requested, but components are not present in C:\\Program Files\\Tesseract-OCR\\tessdata/tam.traineddata!! Failed loading language 'tam' Tesseract couldn't load any languages! Could not initialize tesseract.")

Error for --oem 1 and 3:
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Warning, detects only orientation with -l tam Error, OSD requires a model for the legacy engine')

So a legacy model for Tamil seems to be needed for this task. I used the tam.traineddata from the legacy+LSTM repository (GitHub - tesseract-ocr/tessdata: Trained models with fast variant of the "best" LSTM models + legacy models). However, the bottom of that page says “The legacy tesseract models (--oem 0) have been removed for Indic and Arabic script language files.”, and that includes Tamil.

To confirm that this is indeed the problem, I also tried the legacy fra and eng packs. They worked perfectly when I ran:
pytesseract.image_to_osd(binary, config='--oem 0 -l fra --psm 0')
pytesseract.image_to_osd(binary, config='--oem 2 -l fra --psm 0')
and
pytesseract.image_to_osd(binary, config='--oem 0 -l eng --psm 0')
pytesseract.image_to_osd(binary, config='--oem 2 -l eng --psm 0')

The output looks like this:
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38
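
A report in this format is easy to turn into typed values before acting on it. The following is a minimal sketch using only the standard library; `parse_osd` is a hypothetical helper name (not part of pytesseract), and the field names are copied from the output above.

```python
def parse_osd(report: str) -> dict:
    """Parse a plain-text Tesseract OSD report into a dict.

    parse_osd is a hypothetical helper written for this example.
    """
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "key: value"
        key, value = key.strip(), value.strip()
        # Convert numeric fields; leave text such as the script name as-is.
        try:
            fields[key] = int(value)
        except ValueError:
            try:
                fields[key] = float(value)
            except ValueError:
                fields[key] = value
    return fields


sample = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38"""

osd = parse_osd(sample)
print(osd["Rotate"], osd["Orientation confidence"], osd["Script"])
```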

I guess the legacy Tamil pack was removed because the legacy engine performed poorly on Tamil. Could you please provide me with the legacy model if you have one? If not, what other suggestions do you have for the problem I am trying to solve?

Thanks for reading this email in your busy schedule and have a great day!

Sincerely,
Siyou
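
One thing that may be worth trying, although it was not raised or tested in this thread: OSD normally runs on Tesseract's script-independent `osd.traineddata` rather than on a per-language legacy model, and pytesseract's `image_to_osd` appears to default to `lang='osd'` unless a `-l` option in the config string overrides it. The sketch below assumes that `osd.traineddata` is present in the tessdata folder, that the image is a Pillow image, and that `Rotate` is the clockwise correction in degrees. `correct_orientation` and `osd_fn` are hypothetical names introduced for this example.

```python
def correct_orientation(image, osd_fn=None):
    """Rotate `image` upright using the 'rotate' value from Tesseract OSD.

    `osd_fn` is an optional callable returning a dict with a 'rotate' key;
    it exists so the logic can be exercised without Tesseract installed.
    """
    if osd_fn is None:
        # Imported lazily so the module loads even without pytesseract.
        import pytesseract
        from pytesseract import Output

        def osd_fn(img):
            # No "-l tam" here: the OSD model is requested instead
            # (assumption: osd.traineddata is installed).
            return pytesseract.image_to_osd(
                img, lang="osd", output_type=Output.DICT
            )

    rotate = int(osd_fn(image)["rotate"]) % 360
    if rotate == 0:
        return image
    # Pillow rotates counter-clockwise for positive angles, so the clockwise
    # correction is negated (assumption: verify on a page of known orientation).
    return image.rotate(-rotate, expand=True)
```

With Pillow and Tesseract installed, usage would look like `upright = correct_orientation(Image.open("page.tif"))`, where `page.tif` is a placeholder file name. Checking the result on a few pages of known orientation first would confirm both the rotation direction and whether OSD is reliable on Tamil script.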

tshrinivasan · November 8, 2024, 1:27pm

@Captain_Odyssey can you share the sample English and Tamil images?

Hopefully the images have orientation metadata.

With proper EXIF metadata on the images, we can use the exiftran utility to rotate them automatically.

Are all the images in the same direction?

Please share a few sample images.

Captain_Odyssey · November 9, 2024, 9:03pm

No problem, the image is here. I used google drive to share it because this forum does not support TIF image upload. The reason I didn’t upload the English image is that this dataset rarely has English pages, and when it does, the image is upright.

Thank you for this alternative method. You must be an expert in media processing, because few people know about EXIF orientation data. I tried rotating these images by their EXIF orientation, but I found that all of them, irrespective of their orientation on screen, have 0 rotation in the EXIF data. Hence, I think the camera was in its correct orientation when the photos were taken, and the rotation of the images is the result of people placing the pages in different directions on the table.
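
The EXIF check described above can be reproduced in a few lines. This is a sketch assuming Pillow is installed; `exif_rotation` and `read_exif_orientation` are hypothetical helper names, and a missing Orientation tag is treated as upright.

```python
EXIF_ORIENTATION_TAG = 0x0112  # standard EXIF "Orientation" tag ID

# Clockwise rotation (degrees) needed to display the image upright,
# for the four non-mirrored EXIF orientation values.
CLOCKWISE_FIX = {1: 0, 3: 180, 6: 90, 8: 270}


def exif_rotation(orientation):
    """Map an EXIF Orientation value to a clockwise correction in degrees.

    Returns 0 when the tag is missing and None for mirrored or unknown values.
    """
    if orientation is None:
        return 0
    return CLOCKWISE_FIX.get(orientation)


def read_exif_orientation(path):
    """Read the raw EXIF Orientation value from an image file, or None."""
    from PIL import Image  # imported lazily; requires Pillow

    with Image.open(path) as img:
        return img.getexif().get(EXIF_ORIENTATION_TAG)
```

If `exif_rotation(read_exif_orientation(path))` is 0 for every file, as reported here, the metadata cannot help and the orientation has to be inferred from the page content. For images that do carry a meaningful tag, Pillow's `ImageOps.exif_transpose` applies the correction directly.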

tshrinivasan · November 9, 2024, 9:52pm

Yes @Captain_Odyssey, as you said, the EXIF data says the orientation is normal. When multiple people do the scanning, they place the subject in all possible directions.

If they all followed the same pattern, we could rotate the images easily in one batch.
If each image has its own direction, then we have to rotate them manually.

gThumb can be used to preview the images, select several of them, and rotate them easily.

I will explore other automated ways and share them.
There may be some machine-learning-based solutions.

Captain_Odyssey · November 11, 2024, 5:48am

Thank you for that, the solution is very beautiful. It is definitely more straightforward than doing it in Windows file explorer.