How do I search a text for a string and similar words?

Time: 2022-09-13 09:35:54

I need to look up the word "age" and similar words in a text file.

I have the following sentences:

  • 18 years of age
  • man aged 51
  • man ages between 25 to 50
  • between 5 to 75 years of age.(with dot)
  • between 5 to 75 years of age, (with comma)
  • agent name is xyz (agent contains age).

String.contains returns true in every case. My requirement is that the first five sentences pass and the last one returns false.

I could solve this by writing some code that checks for a bunch of strings such as " age ", " age.", "ages", "aged", " age,", etc.

Is there a better way to solve this problem?

3 solutions

#1


3  

If you use a regex, you have to list all the possibilities.

string.matches("(?i).*\\bage[ds]?\\b.*");
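As a rough sketch of how that expression could be applied to the sentences from the question (the class and method names here are made up for illustration, and the sample strings leave out the parenthetical notes), something like the following may work. It uses Pattern with find(), so the leading and trailing .* are not needed:

```java
import java.util.regex.Pattern;

class AgeWordMatcher {
    // Same idea as the expression above: "age", optionally followed by "d" or "s",
    // delimited by word boundaries. CASE_INSENSITIVE plays the role of (?i).
    private static final Pattern AGE_WORD =
            Pattern.compile("\\bage[ds]?\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the line contains age, aged or ages as a whole word.
    static boolean mentionsAge(String line) {
        return AGE_WORD.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "18 years of age",
            "man aged 51",
            "man ages between 25 to 50",
            "between 5 to 75 years of age.",
            "between 5 to 75 years of age,",
            "agent name is xyz"
        };
        for (String s : samples) {
            System.out.println(mentionsAge(s) + " <- " + s);
        }
    }
}
```

Compiling the pattern once and reusing it is likely cheaper than calling String.matches on every line, since matches recompiles the expression each time.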

#2


1  

A naive (expensive) solution would be the following:

  1. tokenize each line (e.g., split on " ", or even on non-alphanumeric characters, which also removes punctuation).
  2. calculate the edit distance between each word and the word age
  3. if the current word has a small edit distance (e.g., below 2), return the line

The edit distance between two strings is the number of edits (additions, deletions and replacements) required to make one string equal to the other. You can find an implementation of edit distance in the simmetrics library, or maybe elsewhere, too.
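To make the tokenize-and-compare idea concrete, here is a minimal sketch that does not rely on simmetrics; it hand-rolls the standard Levenshtein distance instead. The names EditDistanceSearch, editDistance and lineMatches are hypothetical, and the threshold of 1 is just one reading of "below 2":

```java
import java.util.Locale;

class EditDistanceSearch {
    // Levenshtein distance computed with two rolling rows of the DP table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // replacement or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Tokenize on non-alphanumeric characters and accept the line if any
    // token is within maxDistance edits of the target word.
    static boolean lineMatches(String line, String target, int maxDistance) {
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && editDistance(token, target) <= maxDistance) {
                return true;
            }
        }
        return false;
    }
}
```

One caveat with such a short target word: unrelated words like "ago" or "are" are also a single edit away from "age", so this approach can produce false positives that the regex approach would not.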

Another option could be to stem the words at step 2 and use contains with the stem of the word age (also expensive).

If you already know all the acceptable answers (or at least their pattern), go for Avinash Raj's answer.

#3


1  

What you need is called a regular expression (or regex).

Here's a perfectly detailed definition of regular expressions and their use in Java, which can be done with the matches(String regex) method of the String class.

For your example, it could (normally) be: myString.matches(".*age? .*").

Pay attention to escaping special characters in Java. You can try your regexes here. I didn't do it in the example above, but you can try :)

In detail:

  • .* : the sentence can begin with anything
  • age : the sentence must contain 'age'
  • ? : makes the preceding 'e' optional; to let 'age' be followed by zero or one arbitrary character, as intended, write .? instead
  • (space) : then a space
  • .* : then anything again
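For a quick feel of how this kind of pattern behaves, the sketch below compares the expression as written with a variant that uses .? after age. The class and constant names are illustrative only, and the .? variant is an assumption about what was intended. Keep in mind that String.matches has to match the whole string, which is why the surrounding .* is needed:

```java
class SimpleAgeRegexDemo {
    // The expression as written in the answer: '?' applies to the 'e' only.
    static final String AS_WRITTEN = ".*age? .*";
    // Assumed variant: "age", then zero or one arbitrary character, then a space.
    static final String ONE_EXTRA_CHAR = ".*age.? .*";

    public static void main(String[] args) {
        String[] samples = {
            "18 years of age",
            "man aged 51",
            "man ages between 25 to 50",
            "between 5 to 75 years of age, (with comma)",
            "agent name is xyz"
        };
        for (String s : samples) {
            System.out.println(s
                    + " | as written: " + s.matches(AS_WRITTEN)
                    + " | with .?: " + s.matches(ONE_EXTRA_CHAR));
        }
    }
}
```

Because both patterns require a space after the word, a sentence that ends in "age" is not matched by either, and a word such as "image" followed by a space would be. Word boundaries (\\b) avoid both problems.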

Hope it helped.

