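1 Overview

String matching, splitting, replacement, and extraction come up constantly in everyday programming, and the rules involved are often intricate. Hand-writing that logic costs a lot of time and effort. Regular expressions make this work far more efficient, and they are available in mainstream languages such as C, Java, Python, and Perl. Java has shipped the java.util.regex package since Java 1.4, which gives us a solid platform for using regular expressions. This article covers four uses of regular expressions in Java: matching, splitting, replacing, and extracting.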

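2 Matching

2.1 Source code analysis

To match a string against a regular expression, simply call the matches method of the String class. Note that matches performs a full match: the entire string must match the pattern. See the code below.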

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "hello123fdfd";
        System.out.println(str.matches("[a-z]+\\d+[a-z]+")); // prints true
        System.out.println(str.matches("\\d+")); // prints false
    }
}

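Here \d matches a digit. The extra backslash in front of it escapes the backslash itself inside the Java string literal, which is why the code writes \\d. The source code of the matches method is as follows.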

public boolean matches(String regex) {
    return Pattern.matches(regex, this);
}

As you can see, it simply delegates to the static matches method of the Pattern class, whose source code is as follows.

public static boolean matches(String regex, CharSequence input) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
}
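
To make the "full match" point concrete, here is a minimal sketch that contrasts Matcher.matches with Matcher.find on the same input. The class and method names (FullVsPartialMatch, fullMatch, partialMatch) are made up for this illustration and are not part of the JDK.

```java
import java.util.regex.Pattern;

public class FullVsPartialMatch {

    // True only if the whole input matches the regex (what String.matches does).
    static boolean fullMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if any substring of the input matches the regex.
    static boolean partialMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String str = "hello123fdfd";
        System.out.println(fullMatch(str, "\\d+"));    // false: the letters are not digits
        System.out.println(partialMatch(str, "\\d+")); // true: the substring "123" matches
    }
}
```

If a partial match is what you need, Matcher.find is usually the simpler choice compared with wrapping the pattern in .* on both sides.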

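2.2 Performance analysis

The source of Pattern.matches shows how it works: it first calls the static compile method to obtain a Pattern object, then calls that object's matcher method to create a Matcher, and finally calls matches on the Matcher. Compiling a pattern is relatively expensive, so calling String.matches many times with the same regular expression recompiles it on every call and is inefficient. The usual approach is as follows.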

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {

    private static final Pattern PATTERN = Pattern.compile("\\w+[0-9]+\\w+");

    public static void main(String[] args) {
        String str = "hello123fdfd";
        Matcher m = PATTERN.matcher(str);
        System.out.println(m.matches()); // prints true
    }
}

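That is, define a static final Pattern object so that the expression is compiled only once. The same technique also helps when performing a large number of replacements or extractions.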
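
As a sketch of how the precompiled pattern pays off, the example below checks several inputs against one shared Pattern. The class name ReusePattern, the helper countMatches, and the sample inputs are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ReusePattern {

    // Compiled once when the class is loaded, then shared by every call.
    private static final Pattern PATTERN = Pattern.compile("[a-z]+\\d+[a-z]+");

    // Counts how many inputs match the pattern in full.
    static int countMatches(List<String> inputs) {
        int count = 0;
        for (String input : inputs) {
            // Creating a Matcher is cheap; only compile() is costly.
            if (PATTERN.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("hello123fdfd", "12345", "abc1d", "abc");
        System.out.println(countMatches(inputs)); // prints 2
    }
}
```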

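3 Splitting

3.1 Basic splitting

Basic splitting calls the split method of the String class, which returns the resulting pieces as a string array. See the code below.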

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "192.168.0.134";
        String[] nums = str.split("\\.");
        for (String num : nums) {
            System.out.println(num);
        }
    }
}

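The program splits on the dot and returns the four pieces 192, 168, 0, and 134.

3.2 Splitting on repeated characters

Splitting a string on runs of a repeated character requires a backreference. See the code below.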
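
The escaping in "\\." matters because an unescaped dot is a regex metacharacter that matches any character. The sketch below shows the difference; the class name SplitOnDot and the helper countParts are made up for this illustration. Pattern.quote is a standard JDK method that produces a literal pattern from a string.

```java
import java.util.regex.Pattern;

public class SplitOnDot {

    // Number of pieces produced by splitting input on the given regex.
    static int countParts(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        String str = "192.168.0.134";
        // "." matches every character, so all pieces are empty and get dropped.
        System.out.println(countParts(str, "."));                // prints 0
        // Escaped dot: splits on the literal '.' only.
        System.out.println(countParts(str, "\\."));              // prints 4
        // Pattern.quote builds the literal pattern for us.
        System.out.println(countParts(str, Pattern.quote("."))); // prints 4
    }
}
```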

package October06;

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "abccdefggggh";
        String[] ss = str.split("(.)\\1+");
        for (String s : ss) {
            System.out.println(s);
        }
    }
}

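The code above splits on the runs "cc" and "gggg", so the program prints the three strings "ab", "def", and "h". Here (.) captures any single character, and \1+ requires that same character to repeat one or more times.

4 Replacing

4.1 Basic replacement

Basic replacement with a regular expression calls the replaceAll method of the String class. (String also has a replace method, which does not use regular expressions and performs a plain literal replacement; for example, replace("a", "b") replaces every "a" with "b".) See the code below.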
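
To see exactly which substrings the backreference pattern treats as delimiters, the following sketch lists them with Matcher.find. The class name RepeatedRuns and the helper findRuns are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedRuns {

    // (.) captures one character; \1+ requires that same character again, one or more times.
    private static final Pattern RUN = Pattern.compile("(.)\\1+");

    // Returns every run of a repeated character found in the input.
    static List<String> findRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findRuns("abccdefggggh")); // prints [cc, gggg]
    }
}
```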

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "absddfd13401487890fdsfd";
        String replacedStr = str.replaceAll("\\d{5,}", "#"); // absddfd#fdsfd
        System.out.println(replacedStr);
    }
}

4.2 Replacing repeated characters

Like splitting on repeated characters, replacing them requires a backreference. See the code below.

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "absddcef22dgfdddd76";
        String replacedStr = str.replaceAll("(.)\\1+", "$1"); // absdcef2dgfd76
        System.out.println(replacedStr);
    }
}

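The code above collapses each run of a repeated character into a single letter or digit. In the second argument of replaceAll, $n refers to the text captured by the n-th group of the regular expression, so $1 is the character captured by (.). To replace each run with a literal $ instead, the $ must be escaped. See the code below.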

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "absddcef22dgfdddd76";
        String replacedStr = str.replaceAll("(.)\\1+", "\\$"); // abs$cef$dgf$76
        System.out.println(replacedStr);
    }
}
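
When the replacement text comes from a variable, escaping it by hand is error-prone. One option is the JDK method Matcher.quoteReplacement, which escapes $ and \ so the text is inserted literally. The sketch below assumes a hypothetical helper collapseRunsTo; only quoteReplacement and replaceAll are real JDK calls.

```java
import java.util.regex.Matcher;

public class LiteralReplacement {

    // Replaces each run of a repeated character with the given text, taken literally.
    static String collapseRunsTo(String input, String literal) {
        return input.replaceAll("(.)\\1+", Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        String str = "absddcef22dgfdddd76";
        System.out.println(collapseRunsTo(str, "$"));  // prints abs$cef$dgf$76
        // "$1" is inserted as the two characters '$' and '1', not as a group reference.
        System.out.println(collapseRunsTo(str, "$1")); // prints abs$1cef$1dgf$176
    }
}
```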

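P.S. If one regular expression is used to perform replacements on a large number of strings, consider the precompiled-Pattern approach described in the matching section for efficiency; the source code of replaceAll shows the details.

5 Extracting

Extracting the substrings that match a regular expression requires the Pattern and Matcher classes. See the code below.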
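
String.replaceAll compiles its regex on every call and then delegates to Matcher.replaceAll, so the same work can be done with a Pattern compiled once. A minimal sketch, assuming an illustrative class ReuseForReplace, a helper mask, and made-up sample inputs:

```java
import java.util.regex.Pattern;

public class ReuseForReplace {

    // Compiled once; matches runs of five or more digits.
    private static final Pattern LONG_DIGITS = Pattern.compile("\\d{5,}");

    // Replaces every long digit run in the input with '#'.
    static String mask(String input) {
        return LONG_DIGITS.matcher(input).replaceAll("#");
    }

    public static void main(String[] args) {
        String[] inputs = {"absddfd13401487890fdsfd", "tel 1234", "id 9876543"};
        for (String input : inputs) {
            System.out.println(mask(input));
        }
        // prints:
        // absddfd#fdsfd
        // tel 1234
        // id #
    }
}
```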

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "absddcef22dgfdddd76";
        Pattern digitPattern = Pattern.compile("\\d+");
        Matcher m = digitPattern.matcher(str);
        while (m.find()) {
            System.out.println(m.group()); // prints 22, then 76
        }
    }
}

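The code above extracts the numbers from the string. The find method of the Matcher class searches for the next match of the pattern and records where it ended, so the next call resumes from that position (see the source code for details). The group method returns the text of the most recent match.

To extract individual parts of the regular expression separately, pass an argument to the group method. See the code below.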
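
The position that find records can be observed through Matcher.start and Matcher.end. The following sketch prints each match with its half-open index range; the class name MatchPositions and the helper locate are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPositions {

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns one entry per match, formatted as "text [start, end)".
    static List<String> locate(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            // start() is inclusive, end() is exclusive.
            found.add(m.group() + " [" + m.start() + ", " + m.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        for (String entry : locate("absddcef22dgfdddd76")) {
            System.out.println(entry);
        }
        // prints:
        // 22 [8, 10)
        // 76 [17, 19)
    }
}
```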

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "absddcef22.23dgfdddd45.76";
        Pattern digitPattern = Pattern.compile("(\\d+)\\.(\\d+)");
        Matcher m = digitPattern.matcher(str);
        while (m.find()) {
            System.out.println(m.group(1) + " " + m.group(2)); // prints "22 23", then "45 76"
        }
    }
}

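Note that each part to be extracted separately must be wrapped in parentheses in the regular expression. Passing 0 to group returns the text matched by the entire expression; in the code above, that is "22.23" and "45.76".

P.S. If one regular expression is used to extract from a large number of strings, consider the precompiled-Pattern approach described in the matching section for efficiency.

6 Putting it all together

In the code below, the task is to extract the IP addresses from a string and sort them by value, where "by value" means comparing the addresses segment by segment as numbers. Simply splitting on whitespace and sorting the resulting strings does not work, because strings compare character by character (for example, "192.168.0.1" would sort before "20.100.178.45"). The code takes an indirect route in five steps; see the comments for details.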
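
As a small illustration of group 0, the sketch below collects the whole-expression matches and also prints Matcher.groupCount, which reports the number of parenthesized groups (group 0 is not counted). The class name GroupZero and the helper wholeMatches are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupZero {

    private static final Pattern DECIMAL = Pattern.compile("(\\d+)\\.(\\d+)");

    // Returns the text matched by the entire expression for every match.
    static List<String> wholeMatches(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DECIMAL.matcher(input);
        while (m.find()) {
            result.add(m.group(0)); // same text as m.group()
        }
        return result;
    }

    public static void main(String[] args) {
        // Two capturing groups: the integer part and the fractional part.
        System.out.println(DECIMAL.matcher("").groupCount());          // prints 2
        System.out.println(wholeMatches("absddcef22.23dgfdddd45.76")); // prints [22.23, 45.76]
    }
}
```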

import java.util.Arrays;

/**
 * Created by fubinhe on 16/10/6.
 */
public class Regex {
    public static void main(String[] args) {
        String str = "192.168.0.1 10.4.56.43    20.100.178.45";
        // step 1. pad every segment with leading zeros
        str = str.replaceAll("(\\d+)", "00$1");
        // step 2. keep only the last 3 digits of each segment
        str = str.replaceAll("0*(\\d{3})", "$1");
        // step 3. split on whitespace
        String[] ips = str.split("\\s+");
        // step 4. sort
        Arrays.sort(ips);
        for (String ip : ips) {
            // step 5. strip the extra leading zeros
            System.out.println(ip.replaceAll("0*(\\d+)", "$1"));
        }
    }
}

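The output is shown in the figure below: the addresses are printed in the order 10.4.56.43, 20.100.178.45, 192.168.0.1, which is the intended result.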

[Figure: Chapter 6, Using regular expressions in Java]
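
To make the five steps easier to follow, here is a sketch that splits the same regexes into small helpers so the intermediate padded form can be inspected. The class name IpSortTrace and the helpers padSegments, stripZeros, and sortIps are hypothetical; the regular expressions are the ones used in the listing above.

```java
import java.util.Arrays;

public class IpSortTrace {

    // Steps 1 and 2: left-pad every numeric segment to exactly three digits.
    static String padSegments(String text) {
        return text.replaceAll("(\\d+)", "00$1")
                   .replaceAll("0*(\\d{3})", "$1");
    }

    // Step 5: drop leading zeros, keeping a single "0" for an all-zero segment.
    static String stripZeros(String ip) {
        return ip.replaceAll("0*(\\d+)", "$1");
    }

    // Steps 3 and 4 in between: split on whitespace and sort the padded strings.
    static String[] sortIps(String text) {
        String[] ips = padSegments(text).split("\\s+");
        Arrays.sort(ips);
        for (int i = 0; i < ips.length; i++) {
            ips[i] = stripZeros(ips[i]);
        }
        return ips;
    }

    public static void main(String[] args) {
        // Padded form: every segment has equal width, so string order equals numeric order.
        System.out.println(padSegments("192.168.0.1 10.4.56.43"));
        // prints 192.168.000.001 010.004.056.043
        System.out.println(Arrays.toString(sortIps("192.168.0.1 10.4.56.43    20.100.178.45")));
        // prints [10.4.56.43, 20.100.178.45, 192.168.0.1]
    }
}
```

The padding works because fixed-width segments make character-by-character comparison agree with numeric comparison of each segment.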
