
Time: 2021-10-27 23:13:06

I have a string with markup in it, and I need to find the positions of the marked-up segments using Java.



string = abc<B>def</B>ghi<B>j</B>kl

Desired output (1-based positions, counted in the text with the markup removed):

segment [n] = start, end

segment [1] = 4, 6
segment [2] = 10, 10

6 answers


Regular expressions should work wonderfully for this.


Refer to your JavaDoc for


  • java.lang.String.split()
  • java.util.regex package
  • java.util.Scanner

Note: StringTokenizer is not what you want since it splits around characters, not strings - the string delim is a list of characters, any one of which will split. It's good for the very simple cases like an unambiguous comma separated list.
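To make that difference concrete, here is a small sketch. The class name TokenizerVsSplit and the sample input "aBc<B>def" are made up for illustration; the point is only that StringTokenizer treats "<B>" as the three delimiter characters '<', 'B' and '>', while split() treats it as one pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the original answer.
class TokenizerVsSplit {

    // Collects the tokens StringTokenizer produces; every character
    // in delims acts as a separate single-character delimiter.
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "aBc<B>def";
        // The lone 'B' inside "aBc" also splits here.
        System.out.println(tokenize(input, "<B>"));            // [a, c, def]
        // split() only breaks on the full "<B>" sequence.
        System.out.println(Arrays.asList(input.split("<B>"))); // [aBc, def]
    }
}
```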



Given your example I think I'd use regex and particularly I'd look at the grouping functionality offered by Matcher.



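import java.util.regex.Matcher;
import java.util.regex.Pattern;

String inputString = "abc<B>def</B>ghi<B>j</B>kl";

// Group 1 is the opening tag, group 2 the enclosed text, group 3 the closing tag.
String stringPattern = "(<B>)([a-zA-Z]+)(<\\/B>)";

Pattern pattern = Pattern.compile(stringPattern);
Matcher matcher = pattern.matcher(inputString);

// find() steps through each occurrence in turn;
// matches() would only succeed if the whole string matched the pattern.
while (matcher.find()) {

    String firstGroup  = matcher.group(1);
    String secondGroup = matcher.group(2);
    String thirdGroup  = matcher.group(3);
}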
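Building on the grouping idea, a minimal sketch of how the start and end positions from the question could be derived with Matcher.find(). The class name RegexSegments and the helper findSegments are hypothetical, and the sketch assumes the tags are well-formed and not nested. Positions are 1-based and counted with the markup removed, which is what the expected output in the question implies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class RegexSegments {

    // Returns {start, end} pairs, 1-based and inclusive,
    // measured in the text as it would read without the tags.
    static List<int[]> findSegments(String input) {
        Pattern pattern = Pattern.compile("<B>(.*?)</B>");
        Matcher matcher = pattern.matcher(input);
        List<int[]> segments = new ArrayList<>();
        int markupChars = 0; // tag characters seen before the current text
        while (matcher.find()) {
            markupChars += "<B>".length();
            int start = matcher.start(1) - markupChars + 1;
            int end = start + matcher.group(1).length() - 1;
            segments.add(new int[] {start, end});
            markupChars += "</B>".length();
        }
        return segments;
    }

    public static void main(String[] args) {
        List<int[]> segments = findSegments("abc<B>def</B>ghi<B>j</B>kl");
        for (int i = 0; i < segments.size(); i++) {
            int[] s = segments.get(i);
            System.out.println("segment [" + (i + 1) + "] = " + s[0] + ", " + s[1]);
        }
        // segment [1] = 4, 6
        // segment [2] = 10, 10
    }
}
```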


StringTokenizer will give you separate tokens, but it treats its delimiter argument as a set of single characters rather than as one string. To split around a specific string, use the split() method in String instead: you pass it a regular expression and it returns the separate pieces as an array.



StringTokenizer takes the whole String as an argument, so it is not really a good idea for big strings. You can also use StreamTokenizer.


You also need to look at Scanner.
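For what it's worth, a rough sketch of the Scanner route. The class name ScannerTokens and the delimiter pattern "</?B>" are my own choices for illustration. It assumes the input does not begin with a tag and that the tags are well-formed and not nested, so the second, fourth, ... tokens are the marked-up pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of the original answer.
class ScannerTokens {

    // Splits the input on either "<B>" or "</B>" and returns the pieces in order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("</?B>");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens alternate between plain text and marked-up text.
        System.out.println(tokens("abc<B>def</B>ghi<B>j</B>kl")); // [abc, def, ghi, j, kl]
    }
}
```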



It is a bit 'Brute Force' and makes some assumptions but this works.


public class SegmentFinder
{
    public static void main(String[] args)
    {
        String string = "abc<B>def</B>ghi<B>j</B>kl";
        String startRegExp = "<B>";
        String endRegExp = "</B>";
        int segmentCounter = 0;
        int currentPos = 0; // characters of plain text consumed so far
        String[] array = string.split(startRegExp);
        for (int i = 0; i < array.length; i++)
        {
            if (i > 0) // Ignore the first one
            {
                segmentCounter++;
                //this assumes that every start will have exactly one end
                String[] array2 = array[i].split(endRegExp);
                int elementLength = array2[0].length();
                System.out.println("segment["+segmentCounter +"] = "+ (currentPos+1) +","+ (currentPos+elementLength) );
                for(String s : array2)
                {
                    currentPos += s.length();
                }
            }
            else
            {
                // text before the first start tag
                currentPos += array[i].length();
            }
        }
    }
}


Does your input look like your example, and do you need to get the text between specific tags? Then a simple StringUtils.substringsBetween(yourString, "<B>", "</B>") from the Apache Commons Lang package (http://commons.apache.org/lang/) should do the job.
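If pulling in a library is not an option, the same idea can be sketched with plain indexOf calls. This is a hand-rolled stand-in, not the Commons Lang implementation; the class name BetweenTags and the method between are hypothetical, and nested tags are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a "substrings between" helper, standard library only.
class BetweenTags {

    // Returns every piece of text enclosed by open ... close, left to right.
    static List<String> between(String input, String open, String close) {
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(open, from);
            if (start < 0) {
                break; // no further opening tag
            }
            start += open.length();
            int end = input.indexOf(close, start);
            if (end < 0) {
                break; // opening tag without a closing tag
            }
            found.add(input.substring(start, end));
            from = end + close.length();
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(between("abc<B>def</B>ghi<B>j</B>kl", "<B>", "</B>")); // [def, j]
    }
}
```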


If you're up for a more general solution, for different and possibly nested tags, you might want to look at a parser that takes HTML input and creates an XML document out of it, such as NekoHTML, TagSoup or jTidy. You can then use XPath on the XML document to access the contents.
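As a sketch of the XPath step only: the example string happens to be well-formed once it is wrapped in a root element, so the JDK's own XML parser can stand in for NekoHTML, TagSoup or jTidy here. Real-world HTML would need one of those parsers first. The class name XPathSegments, the method boldTexts and the wrapper element name "root" are assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; assumes the markup is already well-formed XML.
class XPathSegments {

    // Returns the text content of every <B> element, in document order.
    static List<String> boldTexts(String markup) {
        try {
            // Wrap the fragment so the document has a single root element.
            String xml = "<root>" + markup + "</root>";
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//B", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(boldTexts("abc<B>def</B>ghi<B>j</B>kl")); // [def, j]
    }
}
```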



