Tesseract OCR does not recognize the division symbol "÷"

Time: 2022-01-03 09:01:18

I am using Tesseract on iOS 8 for an OCR-based app, but it incorrectly converts the division symbol "÷" in the image to a plus sign "+".

For example, this image (not reproduced here) always converts to the text string "8+4+4". It should be "8+4÷4".

I've tried using different trained data language files ("eng+equ", "ita"), adding "÷" to the whitelist, setting the ocr_engine variable to cube, converting the image to grayscale or black and white, and upscaling the image by 2 and 4 times.

Everything I've tried always returns a plus "+" sign instead of a division "÷" symbol.

I tried using only the "equ" trained data file and that DOES return the division symbol correctly - but all other characters are then garbage.

I've been looking into this (Google, *) for several days and cannot figure it out.

How do I get Tesseract to include and recognize the division "÷" symbol?

UPDATE:

The best I have been able to do is to set the AVCaptureSession preset to high:

AVCaptureSession *session = [[AVCaptureSession alloc] init];
session.sessionPreset = AVCaptureSessionPresetHigh;

The dimensions of the captured image above are then 676 × 405 pixels. I then binarize the image (named 'source') using the Tesseract OCR UIImage category:

// Binarize the source image to improve contrast (using the UIImage category provided by TesseractOCR)
UIImage *blackAndWhiteImage = [source blackAndWhite];
[self.tesseract setImage:blackAndWhiteImage];

This will usually convert the division symbol to the text "-1-", but I've also seen "-:-", as well as other digits and uppercase characters between the minus signs.

I can check for that in the returned text. But then it is impossible to know whether to treat the returned text "8-1-2" as a true subtraction or 'maybe' division.
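
The check described above could be sketched roughly as follows. This is only an illustration, not part of Tesseract or the iOS wrapper: the function name divisionCandidates and the pattern (a minus sign, one non-whitespace character, a minus sign) are assumptions. It only flags suspicious spans; as noted, it cannot tell a misread "÷" apart from a real "8-1-2".

```swift
import Foundation

/// Returns every substring of `text` that looks like a possibly misread "÷":
/// a single non-whitespace character between two minus signs ("-1-", "-:-").
/// Hypothetical helper; the pattern is an assumption based on the outputs above.
func divisionCandidates(in text: String) -> [String] {
    let pattern = "-\\S-"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).compactMap { match in
        // Convert the NSRange back to a Swift range before slicing.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// Flag the spans so the app can ask the user or apply its own heuristic.
print(divisionCandidates(in: "8+4-:-4"))
```

Because the result is a list of candidates rather than a rewritten string, the decision about whether "-1-" means division stays with the caller.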

5 Answers

#1


4  

Train the OCR engine with different fonts.

Here is the tool for training the engine. Have a look at this as well.

Or you can use jTessBoxEditor.

#2


2  

Make sure your whitelist includes the "÷" sign.

In Swift, this will do it:

tesseract.setVariableValue("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷", forKey: "tessedit_char_whitelist")

In Objective-C, here is the code:

[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷" forKey:@"tessedit_char_whitelist"];

You can customize the character set based on your needs.
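
As one possible customization for an arithmetic-only app, the whitelist could be narrowed to digits and operators. The sketch below is an assumption, not something confirmed to fix the problem (the question reports that whitelisting "÷" alone did not help); the group names and the inclusion of "×" are hypothetical choices.

```swift
// Hypothetical character groups for an app that only reads arithmetic.
let digits = "0123456789"
let operators = "+-×÷=()."

// The whitelist is just the concatenation of the allowed characters.
let whitelist = digits + operators
print(whitelist)

// With the iOS wrapper it would be passed exactly as in the answer above:
// tesseract.setVariableValue(whitelist, forKey: "tessedit_char_whitelist")
```

A smaller whitelist leaves the engine fewer letters to confuse with operators, but it cannot help if the trained data has no usable model for "÷".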

#3


1  

It seems that the symbol was not included in the existing trained data. You'd need to train for that symbol, and then use the resulting traineddata in combination with the existing ones.

You can use a tool, such as jTessBoxEditor, to assist you in the training process.

#4


0  

You can also try to capture this ambiguity via the unicharambigs file. Read more: https://github.com/tesseract-ocr/tesseract/blob/master/doc/unicharambigs.5.asc

1       +       1      ÷    0

Tesseract would read it as "optionally (the trailing 0 in the above config) replace the 1 char sequence '+' with the 1 character sequence '÷'".
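
To make the field layout of such an entry explicit, here is a small sketch that builds the line from its parts. The helper name ambigsLine is hypothetical, and the tab-separated layout (with space-separated characters inside a sequence) is my reading of the linked man page, so verify it there before relying on it.

```swift
/// Builds one unicharambigs entry: source length, source characters,
/// target length, target characters, and the type flag (1 = mandatory,
/// 0 = optional). Fields are joined with tabs; this layout is an assumption
/// taken from the unicharambigs documentation linked above.
func ambigsLine(source: [String], target: [String], mandatory: Bool) -> String {
    let fields = [
        String(source.count),
        source.joined(separator: " "),
        String(target.count),
        target.joined(separator: " "),
        mandatory ? "1" : "0",
    ]
    return fields.joined(separator: "\t")
}

// The optional "+" to "÷" substitution from the entry above.
print(ambigsLine(source: ["+"], target: ["÷"], mandatory: false))
```

Generating the line this way keeps the two length fields in sync with the character sequences, which is easy to get wrong when editing the file by hand.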

#5


0  

In Swift, changing engineMode works for me:

let tesseract = G8Tesseract(language: "eng")!
tesseract.engineMode = .tesseractCubeCombined
