What is the best way to find specific markup in a string (in Java)?

Date: 2021-10-27 23:13:06

I have a string with markup in it which I need to find using Java.

For example:

string = abc<B>def</B>ghi<B>j</B>kl

Desired output:

segment [n] = start, end

segment [1] = 4, 6
segment [2] = 10, 10

6 solutions

#1


Regular expressions should work wonderfully for this.

Refer to the JavaDoc for:

  • java.lang.String.split()
  • java.util.regex package
  • java.util.Scanner

Note: StringTokenizer is not what you want since it splits around characters, not strings - the string delim is a list of characters, any one of which will split. It's good for the very simple cases like an unambiguous comma separated list.
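
To make the difference concrete, here is a minimal sketch that feeds the same delimiter text to both APIs. The class name DelimiterDemo, its helper methods, and the sample input are made up for illustration; only StringTokenizer and String.split() come from the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterDemo {

    // StringTokenizer treats every character of the delimiter argument
    // as a separate delimiter, so "<B>" means '<' or 'B' or '>'.
    public static List<String> tokenize(String input, String delimiterChars) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delimiterChars);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // String.split() interprets its argument as a regex, so "<B>"
    // only matches the whole three-character tag.
    public static List<String> splitOn(String input, String delimiterRegex) {
        return Arrays.asList(input.split(delimiterRegex));
    }

    public static void main(String[] args) {
        String input = "aBc<B>def"; // hypothetical input with a 'B' outside the tag
        System.out.println(tokenize(input, "<B>")); // [a, c, def]
        System.out.println(splitOn(input, "<B>"));  // [aBc, def]
    }
}
```

The tokenizer breaks "aBc" apart at the plain letter B, which is why it is a poor fit for tag-shaped delimiters.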

#2


Given your example I think I'd use regex and particularly I'd look at the grouping functionality offered by Matcher.

Tom

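import java.util.regex.Matcher;
import java.util.regex.Pattern;

String inputString = "abc<B>def</B>ghi<B>j</B>kl";

// Group 1 = opening tag, group 2 = text between the tags, group 3 = closing tag.
String stringPattern = "(<B>)([a-zA-Z]+)(<\\/B>)";

Pattern pattern = Pattern.compile(stringPattern);
Matcher matcher = pattern.matcher(inputString);

// Use find() to step through each match in turn. matches() only succeeds
// when the entire input matches the pattern, which is not the case here.
while (matcher.find()) {

    String firstGroup  = matcher.group(1);
    String secondGroup = matcher.group(2);
    String thirdGroup  = matcher.group(3);
}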
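
The same Matcher can also report positions, which is what the question asks for. The following is only a sketch under a few assumptions: positions are 1-based, inclusive, and counted in the text with the tags removed (this reproduces the 4, 6 and 10, 10 from the question), every <B> has a matching </B>, and there is no nesting. The class name TagSegments and the method find are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagSegments {

    private static final String OPEN = "<B>";
    private static final String CLOSE = "</B>";
    // Reluctant quantifier so each match stops at the nearest closing tag.
    private static final Pattern TAGGED = Pattern.compile(OPEN + "(.*?)" + CLOSE);

    // Returns {start, end} pairs: 1-based, inclusive, measured in the
    // tag-stripped text (an assumption based on the question's example).
    public static List<int[]> find(String input) {
        List<int[]> segments = new ArrayList<>();
        int removed = 0; // number of tag characters seen so far
        Matcher matcher = TAGGED.matcher(input);
        while (matcher.find()) {
            removed += OPEN.length();
            int start = matcher.start(1) - removed + 1; // 0-based to 1-based
            int end = matcher.end(1) - removed;         // end(1) is exclusive
            segments.add(new int[] {start, end});
            removed += CLOSE.length();
        }
        return segments;
    }

    public static void main(String[] args) {
        int n = 0;
        for (int[] s : find("abc<B>def</B>ghi<B>j</B>kl")) {
            n++;
            System.out.println("segment [" + n + "] = " + s[0] + ", " + s[1]);
        }
        // prints:
        // segment [1] = 4, 6
        // segment [2] = 10, 10
    }
}
```

Subtracting the running count of tag characters converts offsets in the raw string into offsets in the stripped text, so no second pass over the string is needed.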

#3


StringTokenizer gives you separate tokens, but it treats its delimiter argument as a set of individual characters rather than as one string. To split on a specific string, use the split() method of String, which takes a regular expression and returns the pieces as an array.

#4


StringTokenizer takes the whole String as an argument, and is not really a good idea for big strings. You can also use StreamTokenizer.

You also need to look at Scanner.
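
As a rough illustration of the Scanner route, the sketch below sets a regex delimiter that matches either tag and collects the pieces in between. The class name ScannerTokens and the delimiter pattern are assumptions chosen for the question's sample input, not something this answer specified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {

    // Splits the input on either tag. The delimiter is a regex, so a whole
    // tag is consumed at once rather than its individual characters.
    public static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("</?B>"); // matches "<B>" and "</B>"
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("abc<B>def</B>ghi<B>j</B>kl"));
        // prints: [abc, def, ghi, j, kl]
    }
}
```

Note that the token list alone does not say which pieces were inside tags, so this approach suits extracting text more than reporting positions.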

#5


This is a bit brute-force and makes some assumptions, but it works.

public class SegmentFinder
{

    public static void main(String[] args)
    {
        String string = "abc<B>def</B>ghi<B>j</B>kl";
        String startRegExp = "<B>";
        String endRegExp = "</B>";
        int segmentCounter = 0;
        int currentPos = 0;
        String[] array = string.split(startRegExp);
        for (int i = 0; i < array.length; i++)
        {           
            if (i > 0) // Ignore the first one
            {
                segmentCounter++;
                //this assumes that every start will have exactly one end
                String[] array2 = array[i].split(endRegExp);
                int elementLength = array2[0].length();
                System.out.println("segment["+segmentCounter +"] = "+ (currentPos+1) +","+ (currentPos+elementLength) );
                for(String s : array2)
                {
                    currentPos += s.length();  
                }
            }
            else
            {
                currentPos += array[i].length();                
            }
        }
    }
}

#6


Does your input look like your example, and do you only need the text between specific tags? Then a simple StringUtils.substringsBetween(yourString, "<B>", "</B>") from the Apache Commons Lang package (http://commons.apache.org/lang/) should do the job.
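
If adding a dependency is not an option, the same idea can be approximated with plain indexOf calls. This is a hand-written stand-in and not the library's implementation; the class name BetweenTags and the method textBetween are hypothetical, and the sketch assumes non-nested tags.

```java
import java.util.ArrayList;
import java.util.List;

public class BetweenTags {

    // Collects every substring that sits between an opening marker and the
    // next closing marker. Stops at the first unmatched opening marker.
    public static List<String> textBetween(String input, String open, String close) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = input.indexOf(open, pos);
            if (start < 0) {
                break; // no further opening marker
            }
            start += open.length();
            int end = input.indexOf(close, start);
            if (end < 0) {
                break; // opening marker without a closing one
            }
            found.add(input.substring(start, end));
            pos = end + close.length();
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(textBetween("abc<B>def</B>ghi<B>j</B>kl", "<B>", "</B>"));
        // prints: [def, j]
    }
}
```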

If you're up for a more general solution that handles different and possibly nested tags, you might want to look at a parser that takes HTML input and creates an XML document out of it, such as NekoHTML, TagSoup, or jTidy. You can then use XPath on the XML document to access the contents.
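
The XPath step might look roughly like the sketch below. It deliberately skips the HTML-cleaning parsers named above and uses only the JDK's built-in XML parser, so it assumes the input is already well-formed XML once wrapped in a root element. The class name XPathTags, the method textOf, and the <root> wrapper are illustrative choices.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTags {

    // Returns the text content of every element with the given tag name.
    public static List<String> textOf(String fragment, String tagName) {
        try {
            // Wrap the fragment in a hypothetical <root> so it has a single root element.
            String xml = "<root>" + fragment + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//" + tagName, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed input ends up here; a real HTML parser would be more forgiving.
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("abc<B>def</B>ghi<B>j</B>kl", "B"));
        // prints: [def, j]
    }
}
```

For real-world HTML, which is often not well-formed, one of the parsers mentioned above would need to produce the document first.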
