Dear all,
I’m currently trying to use the Python wrapper for Tesseract (pytesseract) to correct the rotation, in multiples of 90 degrees, of images of Tamil newspapers. Specifically, I want to call pytesseract.image_to_osd(binary, config='--oem 0 -l tam --psm 0') to get the orientation and script detection (OSD) data for each image so that I can correct its rotation. I tried --oem 0, 1, 2 and 3, and none of them worked, including the modes that use the legacy engine. (The meanings of oem and psm are explained here: All Tesseract OCR options – Muthukrishnan)
Error for --oem 0 and 2:
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Warning, detects only orientation with -l tam Error: Tesseract (legacy) engine requested, but components are not present in C:\\Program Files\\Tesseract-OCR\\tessdata/tam.traineddata!! Failed loading language 'tam' Tesseract couldn't load any languages! Could not initialize tesseract.")
Error for --oem 1 and 3:
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Warning, detects only orientation with -l tam Error, OSD requires a model for the legacy engine')
So it seems that a legacy engine model for Tamil is needed for this task. I used the tam.traineddata from the legacy+LSTM repository (GitHub - tesseract-ocr/tessdata: Trained models with fast variant of the "best" LSTM models + legacy models). However, as you can see at the bottom of that page, it says "The legacy tesseract models (--oem 0) have been removed for Indic and Arabic script language files.", which includes Tamil.
To confirm that this is indeed the problem, I also tried the legacy fra and eng packs. They worked perfectly with the following calls:
pytesseract.image_to_osd(binary, config='--oem 0 -l fra --psm 0')
pytesseract.image_to_osd(binary, config='--oem 2 -l fra --psm 0')
and
pytesseract.image_to_osd(binary, config='--oem 0 -l eng --psm 0')
pytesseract.image_to_osd(binary, config='--oem 2 -l eng --psm 0')
The output looks like this:
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38
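For reference, here is a minimal sketch of how OSD text like the output above could be turned into values usable for the rotation step. It assumes the output keeps the "Key: value" layout shown; parse_osd and sample are hypothetical names used only for illustration, and the sample text is copied from the output above rather than produced by a live Tesseract call.

```python
def parse_osd(osd_text: str) -> dict:
    """Parse 'Key: value' lines of OSD text into a dict (hypothetical helper).

    Values are converted to int or float when possible, otherwise kept as str.
    """
    result = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that do not look like 'Key: value'
        key, value = key.strip(), value.strip()
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value
    return result


# Sample text copied from the OSD output shown above (assumed format).
sample = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38"""

osd = parse_osd(sample)
print(osd["Rotate"], osd["Script"])
```

The "Rotate" value could then be passed to whatever image rotation routine is in use, after checking whether that routine rotates clockwise or counter-clockwise.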
I guess the legacy Tamil pack was removed because the Tamil legacy engine performed poorly. Could you please share the legacy Tamil model if you have one? If not, what other suggestions do you have for the problem I am trying to solve?
Thank you for taking the time to read this email despite your busy schedule, and have a great day!
Sincerely,
Siyou