Tesseract 4.0, which supports 10+ Indic languages, has been released, and the results are awesome. It opens up many possibilities, and I thought I must record them for posterity! Read on:
- Tesseract may be embedded into scanning machines. Tesseract began at HP, its development has since been sponsored by Google, and it is released as Free Software (under the Apache 2.0 licence). So I hope that either Google will come out with new-age scanners with OCR for Indic languages, or some other company, even HP, will come out with Tesseract-enabled scanners. The result would be text files produced directly, bypassing the creation of PDF or JPG files, which is the digitisation norm now.
- Tesseract may be developed into an app, so that the text in any image can be read. I expect this to be a cloud service as well for now, as Tesseract still has a lot to learn.
- Indic language spellcheckers, hitherto a non-starter, will gain commercial value from now onwards. The spellchecker becomes an important interface between OCR and manual proofreading. If language experts / proofreaders are PC users, the spellchecker may even be the last step in proofreading.
- Tesseract may be used in Google Goggles, which will then start reading bulletin boards, menus and traffic signboards, translate / transliterate the text, and read it out in the same language or in a preferred language.
- The Bharatavani project’s dictionaries may play a pivotal role if governments are interested in developing content on all subjects in Indian languages. This effort, combined with Tesseract OCR, opens many possibilities in machine learning, content translation etc. In fact, all the dictionary databases can form one mega database and be part of the whole exercise.
- It is true that Google will benefit from this commercially. But this cannot and should not be a reason for not supporting Tesseract development. Technological inventions cannot be stopped; we have to adopt them and grow.
- Tesseract will greatly reduce the dependency on the DTP workforce for digitisation tasks. DTP operators will have to upgrade themselves to data cleaning and, to some extent, proofreading, even if they do not become language experts.
- Once the digitisation of existing image-based content is over, say after a decade, Tesseract OCR would be applied exclusively to signage reading.
- Nations like India face paradoxical situations. We still have a new generation of illiterates who can nevertheless operate smartphones. For them, Google Goggles with OCR will be helpful. People will be able to read anything, including print newspapers (if these still exist by then), through the lens and even listen to it. Text-to-speech and speech-to-text tools are already available.
- Google has already developed a well-functioning handwriting recognition tool and has embedded it in its Gmail Inbox. The same is available on Android smartphones too. Hence, handwriting recognition tools will read everything you write or have already written.
- Text to speech in some tools is up to a satisfactory level, even while IISc is desperately seeking funds to reinvent the wheel! I will come out with more details later, but I am confident that there are tools which are good enough to test and use.
TESSERACT INSTALLATION AND USAGE GUIDELINES
- Download Tesseract for Windows from the following link: https://github.com/UB-Mannheim/tesseract/wiki
- On that page, go down to the paragraph which says:
======================================================
The latest installers can be downloaded here: tesseract-ocr-setup-3.05.02-20180621.exe, tesseract-ocr-w32-setup-v4.0.0.20181030.exe (32 bit) and tesseract-ocr-w64-setup-v4.0.0.20181030.exe (64 bit). There are also older versions available.
======================================================
- Select your version (32 bit or 64 bit)
- Download and install the software. During installation, the installer offers to download additional language data. Choose all Indian languages.
- Now visit the YouTube page https://www.youtube.com/watch?v=rSKYTefQv5g and change the Windows environment variables as shown in the video.
- Open Windows PowerShell and check that it is working. The YouTube video refers to the Help file, but you can check simply by typing the command tesseract.exe
- This completes the installation process.
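The PowerShell check can also be scripted. The following is only a minimal Python sketch, not part of the video or the installer. It assumes Python 3.7 or later is installed, and the function names (find_tesseract, parse_language_list, installed_languages) are hypothetical names chosen for illustration. It looks for tesseract on the PATH and lists the installed language data using Tesseract's --list-langs option.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable found on PATH, or None."""
    return shutil.which(name)


def parse_language_list(text):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    The output starts with a header line ("List of available languages ...");
    every other non-empty line is treated as one language code.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages(exe):
    """Run `tesseract --list-langs` and return the codes it reports."""
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some builds print the list on stderr, so both streams are read.
    return parse_language_list(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract was not found on PATH; recheck the environment variable step")
    else:
        print("Found:", exe)
        print("Languages:", ", ".join(installed_languages(exe)))
```

If the script reports that tesseract was not found, the environment variable change from the video has most likely not taken effect; opening a new PowerShell window after the change usually helps.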
USAGE
- I have used the following two help pages to create my own BAT file: https://digitalaladore.wordpress.com/2014/11/17/using-tesseract-via-command-line/
- https://stackoverflow.com/questions/31680193/how-to-tesseract-multiple-files-in-the-same-folder-from-command-prompt
- The first one covers multi-page files, but not multiple languages. I have added the multi-language string (-l kan+eng) to the BAT file script below:
Creating a BAT FILE (type the script with no blank lines between the commands, even if it appears here with gaps)
==========================================================
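@Echo off
Rem Folder and file pattern of the scanned images to convert
Set "_SourcePath=C:\Users\sudarshana\Desktop\TEST3BVP\*.tif"
Rem Folder where the text files will be written (keep the trailing backslash)
Set "_OutputPath=C:\Users\sudarshana\Desktop\TEST3BVP\"
Rem Full path to tesseract.exe; straight quotes are needed because the path has spaces
Set _Tesseract="C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
Rem OCR each image in Kannada + English; each text file is named after its image
For %%A in ("%_SourcePath%") Do Echo Converting %%A...&%_Tesseract% "%%A" "%_OutputPath%%%~nA" -l kan+eng
Rem Clear the variables again
Set "_SourcePath="
Set "_OutputPath="
Set "_Tesseract="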
===========================================================
- Look at the folder paths in the _SourcePath and _OutputPath lines. Replace them with the path to your own folder. To get it, open the folder in Windows Explorer and click on the address bar; it will turn into a path string. Copy it and paste it in place of the example paths.
- Save this TXT file as a BAT file. To do this, select Save as, set “Save as type” to All files, and give the file a name ending in “.bat” (without the quotes).
- Check that you have written the language code correctly after -l at the end of the For line. The above example has kan+eng, which means Kannada and English. Other languages have their own codes (for example hin for Hindi and tam for Tamil). Use the ones you need, joined with a + sign.
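For readers who prefer Python to a BAT file, the same loop can be sketched roughly as below. This is only an illustration of the idea, not taken from the help pages above. It assumes Python 3 is installed, and build_command and ocr_folder are hypothetical helper names. The folder and the Tesseract path are the example values from the BAT file and must be replaced with your own.

```python
import subprocess
from pathlib import Path


def build_command(tesseract, image, output_dir, languages):
    """Build the Tesseract argument list for one image.

    Tesseract adds ".txt" to the output name itself, so the output base is
    just the image name without its extension (like %%~nA in the BAT file).
    """
    image = Path(image)
    output_base = Path(output_dir) / image.stem
    # Language codes are joined with "+", e.g. ["kan", "eng"] -> "kan+eng"
    return [str(tesseract), str(image), str(output_base), "-l", "+".join(languages)]


def ocr_folder(tesseract, source_dir, output_dir, languages, pattern="*.tif"):
    """Run Tesseract on every file in source_dir that matches pattern."""
    for image in sorted(Path(source_dir).glob(pattern)):
        print(f"Converting {image}...")
        subprocess.run(build_command(tesseract, image, output_dir, languages), check=True)


if __name__ == "__main__":
    # Example values copied from the BAT file; replace them with your own paths.
    folder = r"C:\Users\sudarshana\Desktop\TEST3BVP"
    tesseract_exe = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
    ocr_folder(tesseract_exe, folder, folder, ["kan", "eng"])
```

Passing the command as a list means that paths containing spaces need no extra quoting, and changing the language is just a matter of editing the list of codes.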