Tesseract OCR Training Notes

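1. Download tesseract from https://digi.bib.uni-mannheim.de/tesseract/ and install it. To recognize Chinese, select the Chinese language data during installation; it is listed under the additional language data options.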

2. Add the tesseract installation folder to the PATH environment variable.
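
A quick way to confirm that step 2 worked is to ask Python where it finds the executable. This is only a minimal sketch using the standard library, and the helper name find_tesseract is made up for illustration. If the folder cannot be added to PATH, pytesseract also lets you set pytesseract.pytesseract.tesseract_cmd to the full path of the executable instead.

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the tesseract executable found on PATH, or None."""
    return shutil.which(name)


path = find_tesseract()
if path is None:
    print("tesseract is not on PATH; check the environment variable from step 2")
else:
    print("tesseract found at:", path)
```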

3. Install the pytesseract library:

pip install pytesseract

4. A short Python program to run the recognition:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pytesseract
from PIL import Image

# Open the image and recognize it with the Simplified Chinese (chi_sim) language data
image = Image.open('name.jpg')
code = pytesseract.image_to_string(image, lang='chi_sim')
print(code)
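
If the call fails because chi_sim is missing, it helps to check which languages the installation actually has. The sketch below runs `tesseract --list-langs` and parses what it prints. The function names are hypothetical, and the parser assumes the first line of the output is a header, which may not hold for every version.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # "List of available languages (2):", so it is skipped.
    return lines[1:]


def installed_langs():
    """Run tesseract and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some versions print the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)
```

With tesseract on PATH, `'chi_sim' in installed_langs()` tells you whether the Chinese data from step 1 was installed.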

5. Input and result

[Figure: input image and recognition result]

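6. As the output shows, the recognition result is not satisfactory, so the language data needs to be trained. Download the training tool jTessBoxEditorFX.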

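7. Simply unzip it. It is a Java program and needs a JRE, so download and install the latest JRE. No environment variable has to be set by hand; the installer adds it automatically.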

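8. Prepare the training material. Open the test image used above in Paint and save it as a TIF image, named in the form language.font.expN, where N is a sequence number: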

my.test.exp0
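
The conversion can also be done with Pillow, which the script in step 4 already imports. This is a rough sketch rather than a required step, and the helper names training_name and convert_to_tif are invented for illustration.

```python
from pathlib import Path


def training_name(lang, font, index=0):
    """Build the base name in the form language.font.expN."""
    return f"{lang}.{font}.exp{index}"


def convert_to_tif(src, lang, font, index=0):
    """Save the image at `src` as a TIF named for training, next to the source."""
    from PIL import Image  # Pillow, already used by the script in step 4

    target = Path(src).with_name(training_name(lang, font, index) + ".tif")
    Image.open(src).save(target)  # Pillow infers TIFF from the .tif suffix
    return target
```

For example, `convert_to_tif('name.jpg', 'my', 'test')` would write my.test.exp0.tif into the same folder as name.jpg.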

9. In cmd, cd to the folder that contains the TIF image and generate the box file:

tesseract my.test.exp0.tif -l chi_sim my.test.exp0 batch.nochop makebox

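10. Use jTessBoxEditor to correct the box file: adjust the boxes so that each one contains exactly one character, change each character to its correct value, and save.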
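
Before opening the file in jTessBoxEditor, it can help to see what a box file contains. Each line holds a character followed by the left, bottom, right, and top pixel coordinates of its box (measured from the bottom-left corner of the image) and a page number. The sketch below parses that text and flags entries holding more than one character, which are the ones that need splitting. The names Box, parse_box_text, and multi_char_boxes, as well as the sample data, are illustrative only.

```python
from typing import NamedTuple


class Box(NamedTuple):
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_text(content):
    """Parse box-file text into Box records, one per non-empty line."""
    boxes = []
    for line in content.splitlines():
        if not line.strip():
            continue
        # The last five fields are numbers; everything before them is the text.
        text, *numbers = line.rsplit(" ", 5)
        boxes.append(Box(text, *map(int, numbers)))
    return boxes


def multi_char_boxes(boxes):
    """Return boxes whose text is not exactly one character long."""
    return [box for box in boxes if len(box.text) != 1]


# Made-up sample: the second entry wrongly merges two characters into one box.
sample = "文 10 20 40 52 0\n字识 45 20 110 52 0\n"
print(multi_char_boxes(parse_box_text(sample)))
```

To check a real file, read it with `open('my.test.exp0.box', encoding='utf-8').read()` and pass the text to parse_box_text.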

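11. Generate the language data. The batch script below, written by someone else, is very convenient: double-click it and it generates all the required files.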

echo Run Tesseract for Training.. 
tesseract.exe my.test.exp0.tif my.test.exp0 nobatch box.train 

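echo Compute the Character Set..
unicharset_extractor.exe my.test.exp0.box
rem mftraining -F reads a font_properties file. The line below is an assumption:
rem it declares the font "test" (from my.test.exp0) with all style flags set to 0.
echo test 0 0 0 0 0 > font_properties
mftraining -F font_properties -U unicharset -O my.unicharset my.test.exp0.tr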

echo Clustering.. 
cntraining.exe my.test.exp0.tr 

echo Rename Files.. 
rename normproto my.normproto 
rename inttemp my.inttemp 
rename pffmtable my.pffmtable 
rename shapetable my.shapetable  

echo Create Tessdata.. 
combine_tessdata.exe my. 

echo. & pause
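
For anyone who would rather drive the training from Python than from a batch file, the same sequence can be sketched with subprocess. It mirrors the commands in the script and assumes the training tools are on PATH; the function names are hypothetical. It also writes the font_properties file that mftraining -F reads, on the assumption that the font is called test and has no italic, bold, fixed, serif, or fraktur flags.

```python
import os
import subprocess


def training_commands(lang="my", font="test", index=0):
    """Return the training commands from the batch script as argument lists."""
    base = f"{lang}.{font}.exp{index}"
    return [
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]


def train(workdir=".", lang="my", font="test", index=0):
    """Run the whole sequence inside `workdir` and build <lang>.traineddata."""
    # Assumption: a single font with all five style flags set to 0.
    with open(os.path.join(workdir, "font_properties"), "w") as f:
        f.write(f"{font} 0 0 0 0 0\n")
    for command in training_commands(lang, font, index):
        subprocess.run(command, cwd=workdir, check=True)
    # Prefix the generated files with the language name, as the script does.
    for name in ("normproto", "inttemp", "pffmtable", "shapetable"):
        os.replace(os.path.join(workdir, name),
                   os.path.join(workdir, f"{lang}.{name}"))
    subprocess.run(["combine_tessdata", f"{lang}."], cwd=workdir, check=True)
```

Because every call uses check=True, the run stops at the first failing tool instead of carrying on with missing files, which the batch version does not do.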

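12. Find the generated my.traineddata file and copy it into the tessdata folder under the tesseract installation directory, then change the language used in the program to the new one (lang='my'). Result: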

[Figure: recognition result with the retrained language data]
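
Step 12 comes down to one file copy plus a one-word change in the script from step 4. A minimal sketch follows; install_traineddata is a hypothetical helper, and the tessdata path in the usage comment is an assumed install location that you should replace with your own.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata, tessdata_dir):
    """Copy a .traineddata file into the tessdata folder and return the lang name."""
    source = Path(traineddata)
    target_dir = Path(tessdata_dir)
    if not target_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {target_dir}")
    shutil.copy(source, target_dir / source.name)
    # The language name passed to tesseract is the file name without the suffix.
    return source.stem


# Usage (the path is an assumption; writing there may need administrator rights):
# lang = install_traineddata("my.traineddata", r"C:\Program Files\Tesseract-OCR\tessdata")
# text = pytesseract.image_to_string(Image.open("name.jpg"), lang=lang)
```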

13. After training it myself, the test results are reasonably good, but handwriting is still recognized poorly.
