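A Simple Introduction to Regular Expressions - Chapter 11: Regular Expressions in Java and .NET

1. Introduction to java.util.regex and System.Text.RegularExpressions

Regular expression support is essential on high-level platforms such as Java and .NET. This chapter introduces the regular expression facilities of each and then walks through their operations in detail.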

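To support regular expressions, Java provides the java.util.regex package, which consists mainly of three classes: Pattern, Matcher and PatternSyntaxException. As its name suggests, Pattern represents a compiled regular expression, and it is also the entry point for the basic operations: matching, replacing (through the Matcher it creates) and splitting. Matcher represents the result of applying a compiled pattern to an input and offers a rich set of methods for working with that result. All the state involved in a match lives in the Matcher, so any number of matchers can share the same Pattern. If the engine finds the pattern invalid at compile time, it throws PatternSyntaxException.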

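Before a regular expression can be used, it must be compiled. Pattern has no public constructor; a Pattern object is created through the static method compile, which accepts the pattern string and, optionally, flags such as case-insensitive matching or multiline mode. Like Pattern, Matcher has no public constructor; a Matcher is obtained from the Pattern object's matcher method. A Matcher is in effect the engine that interprets the pattern and performs the match operations against one input, which is why several Matcher objects can share one Pattern object.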
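The compile step described above can be sketched as follows. This is a minimal illustration, not code from the original article: the class name PatternCompileDemo, its method names and the sample pattern are assumptions chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternCompileDemo {
    // One compiled Pattern; flags are combined with the bitwise OR operator.
    static final Pattern HELLO_LINE =
            Pattern.compile("^hello$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Counts the lines of text that consist solely of "hello" in any letter case. */
    static int countHelloLines(String text) {
        // Each call creates an independent Matcher that shares HELLO_LINE.
        Matcher matcher = HELLO_LINE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** Returns true if the pattern string compiles, false if its syntax is invalid. */
    static boolean isValidPattern(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(countHelloLines("Hello\nworld\nHELLO"));
        System.out.println(isValidPattern("(abc")); // unclosed group
    }
}
```

Because the Pattern is immutable and holds no match state, keeping it in a static field and creating a fresh Matcher per input is a common arrangement.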

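In .NET, the System.Text.RegularExpressions namespace likewise provides the types needed for regular expression work: Regex, Match, MatchCollection, Group, GroupCollection, Capture and CaptureCollection, plus the delegate MatchEvaluator. Apart from MatchEvaluator, these play roles similar to the classes in Java's java.util.regex package: Regex corresponds to Pattern and Match roughly to Matcher, while Group, GroupCollection, Capture and CaptureCollection refine the contents of a Match and expose more detail than Java's Matcher does. MatchEvaluator acts as a callback: the method System.Text.RegularExpressions.Regex.Replace invokes it for each match to compute the replacement text.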
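Java has no MatchEvaluator type, but a comparable per-match callback can be sketched with Matcher.appendReplacement and Matcher.appendTail. The class PerMatchReplaceDemo and its method are hypothetical names used only for this illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PerMatchReplaceDemo {
    /** Replaces every match of regex in input with its upper-case form. */
    static String upperCaseMatches(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Compute the replacement for this particular match.
            String replacement = matcher.group().toUpperCase(Locale.ROOT);
            // quoteReplacement keeps "$" and "\" in the computed text literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseMatches("java and .net", "[a-z]+"));
    }
}
```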

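2. Regular expression methods in the String class

For convenience, Java's String class provides four shortcut methods: matches, replaceAll, replaceFirst and split, which respectively test for a match, replace every match, replace the first match, and split the string. The .NET String class has similar methods, but they do not accept regular expressions; the classes in the System.Text.RegularExpressions namespace can be used to build the equivalents.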
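As a quick illustration of the four shortcut methods, here is a small sketch; the class StringRegexShortcuts, its method names and the sample patterns are assumptions made for the example.

```java
class StringRegexShortcuts {
    /** matches: true only if the whole string consists of digits. */
    static boolean isAllDigits(String s) {
        return s.matches("\\d+");
    }

    /** replaceAll: replaces every digit with "#". */
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    /** replaceFirst: replaces only the first digit with "#". */
    static String maskFirstDigit(String s) {
        return s.replaceFirst("\\d", "#");
    }

    /** split: splits on commas, ignoring whitespace around them. */
    static String[] splitOnCommas(String s) {
        return s.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2006"));
        System.out.println(maskDigits("a1b22"));
        System.out.println(maskFirstDigit("a1b22"));
        System.out.println(String.join("|", splitOnCommas("x , y,z")));
    }
}
```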

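3. Using match

We start with two examples of matching: a simple one and a more complex one. For simple matching tasks, the String class provides a matches method that takes a regular expression string and compares the string's own content against it, returning true if the entire string matches the pattern and false otherwise. The static method matches(String, CharSequence) of java.util.regex.Pattern provides exactly the same functionality. For larger and more complex processing, use the advanced features of the java.util.regex package or the System.Text.RegularExpressions namespace.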
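The equivalence of the two matches methods, and the difference between a whole-string match and a substring search, can be sketched like this. WholeStringMatchDemo and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class WholeStringMatchDemo {
    /** Whole-string test through String.matches. */
    static boolean viaString(String input, String regex) {
        return input.matches(regex);
    }

    /** The same whole-string test through the static Pattern.matches. */
    static boolean viaPattern(String input, String regex) {
        return Pattern.matches(regex, input);
    }

    /** Substring search: true if the pattern matches anywhere in the input. */
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(viaString("abc123", "[a-z]+"));     // digits are left over
        System.out.println(viaPattern("abc123", "[a-z]+"));    // same answer
        System.out.println(containsMatch("abc123", "[a-z]+")); // "abc" is found
    }
}
```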

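The following examples show matching in both Java and .NET. First, look at the source of the matches method in Java's String class: it simply delegates to the static method of the Pattern class in java.util.regex. So although .NET's String has no such method, the same functionality can be built on the IsMatch method of the Regex class in System.Text.RegularExpressions.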

 
  

public boolean matches(String regex) {
    return Pattern.matches(regex, this);
}

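The matches method takes a regular expression regex and compares it against the string's own content, this. Following the same idea, we can implement a .NET version of matches. First create a wrapper class StringWrapper around String so that a matches method can be added, then implement matches as follows.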

 
  
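using System;
using System.Text.RegularExpressions;

/// <summary>
/// Wrapper around String that adds regular expression methods
/// </summary>
class StringWrapper
{
    private String value;

    public StringWrapper(String value)
    {
        this.value = value;
    }

    public String Value
    {
        get { return this.value; }
        set { this.value = value; }
    }

    /// <summary>
    /// Tests whether the entire wrapped string matches the pattern,
    /// mirroring Java's String.matches
    /// </summary>
    /// <param name="pattern">regular expression pattern</param>
    /// <returns>true if the whole string matches</returns>
    public bool matches(String pattern)
    {
        // Regex.IsMatch succeeds on any matching substring, so wrap the
        // pattern in \A(?:...)\z to require a whole-string match
        return Regex.IsMatch(this.value, @"\A(?:" + pattern + @")\z");
    }

    public override String ToString()
    {
        return this.value;
    }
}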

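Now let's use matches to build a simple check for JavaScript identifiers. Put simply, a JavaScript identifier is made up of the half-width letters "A" to "Z" in either case, the half-width digits "0" to "9", and the symbols "$" and "_", repeated any number of times, with the restriction that the first character must not be a digit. Before building the pattern, consider letter case: both upper and lower case are allowed in an identifier. There are two options. One is to spell out both cases in the pattern, [A-Za-z$_][0-9A-Za-z$_]*; the other is to use the regex engine's ignore-case option to match case-insensitively. The code is as follows: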

 
  
/**
 * Java version of the test class
 */
public class Main {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // Pattern: the first character must not be a digit
        String pattern = "^[A-Za-z$_][0-9A-Za-z$_]*$";
        // Test strings
        String[] testIdentifiers = new String[] {"$123",
                "hello world", "_int_value", "$_POST", "ok"};
        for (int i = 0; i < testIdentifiers.length; i++) {
            System.out.println("String \"" + testIdentifiers[i] + "\" is" +
                    // append " not" when matches returns false
                    (testIdentifiers[i].matches(pattern) ? "" : " not") +
                    " a valid identifier.");
        }
    }
}

 
  
using System;

/// <summary>
/// C# version of the test class
/// </summary>
class Tester
{
    public static void Main(string[] args)
    {
        // Pattern; the inline (?i) makes the match case-insensitive
        String pattern = "^(?i)[A-Za-z$_][0-9A-Za-z$_]*$";
        // Test strings
        StringWrapper[] testIdentifiers = new StringWrapper[] {new StringWrapper("$123"),
            new StringWrapper("hello world"), new StringWrapper("_int_value"),
            new StringWrapper("$_POST"), new StringWrapper("ok")};
        for (int i = 0; i < testIdentifiers.Length; i++)
        {
            Console.WriteLine("String \"" + testIdentifiers[i].Value + "\" is" +
                // append " not" when matches returns false
                (testIdentifiers[i].matches(pattern) ? "" : " not") +
                " a valid identifier.");
        }
        Console.ReadLine();
    }
}

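There are two ways to set regex engine options: pass them explicitly as an options argument, or specify them inline by embedding them in the pattern. The C# code above uses the inline form: the (?i) near the start of the pattern tells the engine to ignore letter case when matching. Besides case-insensitivity there are options such as multiline mode (?m) and single-line mode (?s); see the relevant documentation for details. The inline option syntax in Java and .NET is (?idmsux-idmsux) and (?imnsx-imnsx) respectively, and each also has a non-capturing group form, (?idmsux-idmsux:) and (?imnsx-imnsx:). Options are listed directly after the "?"; options listed after the "-" sign are turned off. If the same option appears both before and after the "-", the later setting overrides the earlier one and the option is disabled. By default, all options are off.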
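The two ways of setting an option, and turning an option off again with "-", can be compared in a short sketch. The class OptionStyleDemo and its method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

class OptionStyleDemo {
    /** Option passed explicitly as a flag argument to Pattern.compile. */
    static boolean withFlagArgument(String input) {
        return Pattern.compile("abc", Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    /** The same option embedded inline in the pattern. */
    static boolean withInlineOption(String input) {
        return input.matches("(?i)abc");
    }

    /** (?i) applies to "a" only; (?-i) switches it off again for "bc". */
    static boolean withOptionSwitchedOff(String input) {
        return input.matches("(?i)a(?-i)bc");
    }

    public static void main(String[] args) {
        System.out.println(withFlagArgument("ABC"));
        System.out.println(withInlineOption("ABC"));
        System.out.println(withOptionSwitchedOff("ABC")); // "BC" must be lower case
        System.out.println(withOptionSwitchedOff("Abc"));
    }
}
```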

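Both the non-capturing group form and the position of an option affect its scope. With the non-capturing group form, wherever it appears in the pattern, only the sub-pattern inside the group, together with any capturing or non-capturing groups nested in it, is affected; everything outside is matched as if the option had not been set. With the plain form, the option affects the part of the pattern that follows it within the same group, along with any groups nested in that part; the rest of the pattern is matched as if the option had not been set. The following code demonstrates several uses of inline options: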

 
  
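/**
 * Java version of the test class
 */
public class Main {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // Option at the start: everything after it is affected; result: true
        System.out.println("xABCABC".matches("(?i)Xabcabc"));
        // Option at the start of a group: only the rest of that group is affected; result: false
        System.out.println("xABCABC".matches("((?i)Xabc)abc"));
        // Option at the start of a group: only the rest of that group is affected; result: true
        System.out.println("xABCABC".matches("((?i)Xabc)ABC"));
        // Option at the start of a group: the rest of the group and groups nested in it are affected; result: true
        System.out.println("xABCABC".matches("((?i)X(ab)c)ABC"));
        // Option in the middle: everything after it is affected; result: true
        System.out.println("xABCABC".matches("xAB(?i)caBc"));
        // Option in the middle: the "a" before it is still case-sensitive; result: false
        System.out.println("xABCABC".matches("xABCaB(?i)c"));
        // Scoped option: only the content inside (?i:...) is affected; result: true
        System.out.println("xABCABC".matches("x(?i:abc)ABC"));
        // Scoped option: only the content inside (?i:...) is affected; result: false
        System.out.println("xABCABC".matches("x(?i:abc)abc"));
        // Scoped option: only the content inside (?i:...) is affected; result: false
        System.out.println("xABCABC".matches("x(?i:a(Bc)A)bc"));
        // Scoped option: only the content inside (?i:...) is affected; result: true
        System.out.println("xABCABC".matches("x(?i:a((b)C)a)BC"));
        // Scoped option: the content inside and its nested capturing group are affected; result: true
        System.out.println("xABCABC".matches("xAB(?i:C(a)b)C"));
        // Option at the end: nothing follows it, so nothing is affected; result: false
        System.out.println("xABCABC".matches("xabcabc(?i)"));
        // Option at the start of a group: the rest of the group and its nested capturing or non-capturing groups are affected; result: true
        System.out.println("xABCABC".matches("((?i)Xa(?:b)(c))ABC"));
    }
}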

 
  
using System;

/// <summary>
/// C# version of the test class
/// </summary>
class Tester
{
    public static void Main(string[] args)
    {
        // Option at the start: everything after it is affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("(?i)Xabcabc"));
        // Option at the start of a group: only the rest of that group is affected; result: false
        Console.WriteLine(new StringWrapper("xABCABC").matches("((?i)Xabc)abc"));
        // Option at the start of a group: only the rest of that group is affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("((?i)Xabc)ABC"));
        // Option at the start of a group: the rest of the group and groups nested in it are affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("((?i)X(ab)c)ABC"));
        // Option in the middle: everything after it is affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("xAB(?i)caBc"));
        // Option in the middle: the "a" before it is still case-sensitive; result: false
        Console.WriteLine(new StringWrapper("xABCABC").matches("xABCaB(?i)c"));
        // Scoped option: only the content inside (?i:...) is affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("x(?i:abc)ABC"));
        // Scoped option: only the content inside (?i:...) is affected; result: false
        Console.WriteLine(new StringWrapper("xABCABC").matches("x(?i:abc)abc"));
        // Scoped option: only the content inside (?i:...) is affected; result: false
        Console.WriteLine(new StringWrapper("xABCABC").matches("x(?i:a(Bc)A)bc"));
        // Scoped option: only the content inside (?i:...) is affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("x(?i:a((b)C)a)BC"));
        // Scoped option: the content inside and its nested capturing group are affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("xAB(?i:C(a)b)C"));
        // Option at the end: nothing follows it, so nothing is affected; result: false
        Console.WriteLine(new StringWrapper("xABCABC").matches("xabcabc(?i)"));
        // Option at the start of a group: the rest of the group and its nested capturing or non-capturing groups are affected; result: true
        Console.WriteLine(new StringWrapper("xABCABC").matches("((?i)Xa(?:b)(c))ABC"));
        Console.ReadLine();
    }
}

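Now for the second example. Consider a Book class with four properties: publisher, title, edition and publication date. The task is to extract these values from an array of strings that contain the information. Take the string "Addison Wesley - Introduction To SQL - Mastering The Relational Database Language, 4th Edition, Sep 2006": the text before the first "-" is the publisher; the text between the first "-" and the "," that precedes the edition is the title; the text from the end of the title up to "Edition" is the edition; what remains is the publication date. We need a pattern with four capturing groups, one for each Book property, in which the third group, for the edition, is optional. Following this idea we can construct the pattern (?i)(.+?-)(.+,)(.+?Edition)?(.+), with an inline (?i) option at the front so that "Edition" is matched case-insensitively, and then use the API to pull out the content of each group and assign it to the corresponding property. One caveat: a greedy second group (.+,) runs to the last comma and swallows the edition, so the optional third group never matches; writing the second group lazily as (.+?,) avoids this. The code below shows how to use the pattern and each language's API to obtain all the properties.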

 
  

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Java version of the test class
 */
public class Main {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // Test strings
        String[] bookNames = new String[] {
            "Apress - Java Regular Expressions Taming the java.util.regex Engine, 2004",
            "O'Reilly - Learning Perl, 3rd Edition, Jul 2001",
            "O'Reilly - SWT - A Developer's Notebook, Oct 2004",
            "Wiley - Excel 2007 Power Programming with VBA, Apr 2007",
            "O'Reilly - JavaScript The Definitive Guide, 5th Edition, Aug 2006",
            "O'Reilly - CSS The Missing Manual, Aug 2006" };
        // Pattern; the title group (.+?,) is lazy so that the optional edition
        // group can match (a greedy (.+,) would run to the last comma and
        // swallow the edition)
        String bookPattern = "(?i)(.+?-)(.+?,)(.+?Edition)?(.+)";
        // Compile the Pattern
        Pattern pattern = Pattern.compile(bookPattern);
        Matcher matcher = null;
        for (int i = 0; i < bookNames.length; i++) {
            String bookName = bookNames[i];
            // Perform the match
            matcher = pattern.matcher(bookName);
            while (matcher.find()) {
                // Print the capturing groups
                println(matcher.group(1));
                println(matcher.group(2));
                println(matcher.group(3));
                println(matcher.group(4));
                println("==============");
            }
        }
    }

    /**
     * Prints the value, or an empty line if the value is null
     *
     * @param value
     */
    public static void println(String value) {
        if (value == null) {
            System.out.println();
        } else {
            System.out.println(value);
        }
    }
}

 
  
using System;
using System.Text.RegularExpressions;

/// <summary>
/// C# version of the test class
/// </summary>
class Tester
{
    /// <summary>
    /// Entry point
    /// </summary>
    /// <param name="args"></param>
    public static void Main(string[] args)
    {
        // Test strings
        String[] bookNames = new String[] {
            "Apress - Java Regular Expressions Taming the java.util.regex Engine, 2004",
            "O'Reilly - Learning Perl, 3rd Edition, Jul 2001",
            "O'Reilly - SWT - A Developer's Notebook, Oct 2004",
            "Wiley - Excel 2007 Power Programming with VBA, Apr 2007",
            "O'Reilly - JavaScript The Definitive Guide, 5th Edition, Aug 2006",
            "O'Reilly - CSS The Missing Manual, Aug 2006" };
        // Pattern; the title group (.+?,) is lazy so that the optional edition
        // group can match (a greedy (.+,) would run to the last comma and
        // swallow the edition)
        String bookPattern = "(?i)(.+?-)(.+?,)(.+?Edition)?(.+)";
        // Compile the pattern
        Regex regex = new Regex(bookPattern);
        Match match = null;
        for (int i = 0; i < bookNames.Length; i++)
        {
            String bookName = bookNames[i];
            // Perform the match
            match = regex.Match(bookName);
            while (match.Success)
            {
                // Print the capturing groups
                println(match.Groups[1].Value);
                println(match.Groups[2].Value);
                println(match.Groups[3].Value);
                println(match.Groups[4].Value);
                println("==============");
                match = match.NextMatch();
            }
        }
        Console.ReadLine();
    }

    /// <summary>
    /// Prints the value, or an empty line if the value is null
    /// </summary>
    /// <param name="value"></param>
    public static void println(String value)
    {
        if (value == null)
        {
            Console.WriteLine();
        }
        else
        {
            Console.WriteLine(value);
        }
    }
}
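The listings above print the groups rather than filling in a Book object. A minimal sketch of the Book class the example describes might look like the following. The field names, the parse method and the tightened pattern (which keeps the "-" and "," delimiters out of the groups and makes the title group lazy) are assumptions for illustration, not the article's own code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Book {
    // Groups: 1 = publisher, 2 = title, 3 = edition (optional), 4 = publication date.
    private static final Pattern BOOK_PATTERN = Pattern.compile(
            "(?i)(.+?)\\s+-\\s+(.+?),\\s*(?:(.+?Edition),\\s*)?(.+)");

    final String publisher;
    final String title;
    final String edition;   // null when the line has no edition
    final String published;

    private Book(String publisher, String title, String edition, String published) {
        this.publisher = publisher;
        this.title = title;
        this.edition = edition;
        this.published = published;
    }

    /** Parses one line; returns null if the line does not have the expected shape. */
    static Book parse(String line) {
        Matcher matcher = BOOK_PATTERN.matcher(line);
        if (!matcher.matches()) {
            return null;
        }
        return new Book(matcher.group(1), matcher.group(2),
                matcher.group(3), matcher.group(4));
    }

    public static void main(String[] args) {
        Book book = Book.parse("O'Reilly - Learning Perl, 3rd Edition, Jul 2001");
        System.out.println(book.publisher + " | " + book.title + " | "
                + book.edition + " | " + book.published);
    }
}
```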

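The Java and .NET APIs work in very similar ways: first obtain a match object, then use it to retrieve the capturing groups defined in the pattern. Note that when retrieving a group by index, numbering starts at 1, because index 0 stands for the entire match rather than the first capturing group. Next comes another way of using match: applying a pattern repeatedly to a given string and processing each match in turn. Sample code follows: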

 
  

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Java version of the test class
 */
public class Main {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // Test string: one book per line
        String bookNames = "Apress - Java Regular Expressions Taming the java.util.regex Engine, 2004\n" +
            "O'Reilly - Learning Perl, 3rd Edition, Jul 2001\n" +
            "O'Reilly - SWT - A Developer's Notebook, Oct 2004\n" +
            "Wiley - Excel 2007 Power Programming with VBA, Apr 2007\n" +
            "O'Reilly - JavaScript The Definitive Guide, 5th Edition, Aug 2006\n" +
            "O'Reilly - CSS The Missing Manual, Aug 2006";
        // Pattern; "." does not match line breaks, so each match stays on one
        // line. The title group (.+?,) is lazy so that the optional edition
        // group can match
        String bookPattern = "(?i)(.+?-)(.+?,)(.+?Edition)?(.+)";
        // Compile the Pattern
        Pattern pattern = Pattern.compile(bookPattern);
        Matcher matcher = null;
        // Perform the match
        matcher = pattern.matcher(bookNames);
        while (matcher.find()) {
            // Print the capturing groups
            println(matcher.group(1));
            println(matcher.group(2));
            println(matcher.group(3));
            println(matcher.group(4));
            println("==============");
        }
    }

    /**
     * Prints the value, or an empty line if the value is null
     *
     * @param value
     */
    public static void println(String value) {
        if (value == null) {
            System.out.println();
        } else {
            System.out.println(value);
        }
    }
}

 
  
using System;
using System.Text.RegularExpressions;

/// <summary>
/// C# version of the test class
/// </summary>
class Tester
{
    /// <summary>
    /// Entry point
    /// </summary>
    /// <param name="args"></param>
    public static void Main(string[] args)
    {
        // Test string: one book per line
        String bookNames = "Apress - Java Regular Expressions Taming the java.util.regex Engine, 2004\n" +
            "O'Reilly - Learning Perl, 3rd Edition, Jul 2001\n" +
            "O'Reilly - SWT - A Developer's Notebook, Oct 2004\n" +
            "Wiley - Excel 2007 Power Programming with VBA, Apr 2007\n" +
            "O'Reilly - JavaScript The Definitive Guide, 5th Edition, Aug 2006\n" +
            "O'Reilly - CSS The Missing Manual, Aug 2006";
        // Pattern; "." does not match "\n", so each match stays on one line.
        // The title group (.+?,) is lazy so that the optional edition group
        // can match
        String bookPattern = "(?i)(.+?-)(.+?,)(.+?Edition)?(.+)";
        // Compile the pattern
        Regex regex = new Regex(bookPattern);
        Match match = null;
        // Perform the match
        match = regex.Match(bookNames);
        while (match.Success)
        {
            // Print the capturing groups
            println(match.Groups[1].Value);
            println(match.Groups[2].Value);
            println(match.Groups[3].Value);
            println(match.Groups[4].Value);
            println("==============");
            match = match.NextMatch();
        }
        Console.ReadLine();
    }

    /// <summary>
    /// Prints the value, or an empty line if the value is null
    /// </summary>
    /// <param name="value"></param>
    public static void println(String value)
    {
        if (value == null)
        {
            Console.WriteLine();
        }
        else
        {
            Console.WriteLine(value);
        }
    }
}

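From the match examples above we can draw three conclusions. First, the matches method on strings, or the static methods of the regex API, can perform simple match tests. Second, the methods of the match class can retrieve the content of capturing groups by index. Third, a pattern can be applied repeatedly across a whole string to obtain each matching piece in turn; each piece is represented by a match object, so the second point can then be applied to it for finer-grained processing.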

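In Java, the single Matcher class is all that is used to work with matched content. .NET instead refines the result layer by layer, representing it with Match, Group and Capture, and this gives more flexibility. For example, match the pattern (abc)* against the string "abcabcabc": first obtain a match object, from it obtain a group, and from the group obtain its captures. Because the expression has only one pair of parentheses, the group's value is the content of its last successful iteration, that is, the last abc. A capture records the result of each iteration, so the group holds three capture objects, each recording one abc.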
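The Java side of this comparison can be sketched as follows: a repeated group keeps only its last iteration, and, since Java has no Capture objects, one way to recover every iteration is to match the inner pattern repeatedly. The class RepeatedGroupDemo, its method names and the use of \G to keep the matches contiguous are assumptions for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    /** Returns what group 1 of (abc)* holds after a whole-string match, or null. */
    static String lastIteration(String input) {
        Matcher matcher = Pattern.compile("(abc)*").matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    /** Collects each contiguous "abc" from the start of the input. */
    static List<String> allIterations(String input) {
        List<String> parts = new ArrayList<>();
        // \G anchors each match at the end of the previous one.
        Matcher matcher = Pattern.compile("\\Gabc").matcher(input);
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(lastIteration("abcabcabc"));
        System.out.println(allIterations("abcabcabc").size());
    }
}
```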

4. Using replace

Java's String class has two methods that replace text using regular expressions: replaceFirst and replaceAll. They are called as str.replaceAll(regex, replacement) and str.replaceFirst(regex, replacement). Both parameters are strings: the first is the regular expression and the second is the replacement text, and the return value is the string after replacement. .NET does not provide these two methods on String, but we can mirror the implementation of Java's replaceFirst and replaceAll and add both to StringWrapper using the Replace method of the Regex class in System.Text.RegularExpressions (see Regex.Replace(String, String, Int32) and Regex.Replace(String, String)).
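A short sketch of the two methods follows. It also shows that the replacement string may refer to capturing groups with $1, $2 and so on, and that Matcher.quoteReplacement is needed when a literal "$" is wanted. The class ReplaceDemo and the sample data are hypothetical.

```java
import java.util.regex.Matcher;

class ReplaceDemo {
    /** Swaps the two sides of every key=value pair. */
    static String swapAll(String s) {
        return s.replaceAll("(\\w+)=(\\w+)", "$2=$1");
    }

    /** Swaps the two sides of the first key=value pair only. */
    static String swapFirst(String s) {
        return s.replaceFirst("(\\w+)=(\\w+)", "$2=$1");
    }

    /** Replaces "USD" with a literal dollar sign. */
    static String usdToDollarSign(String s) {
        // Without quoteReplacement a bare "$" is read as a group reference.
        return s.replaceAll("USD", Matcher.quoteReplacement("$"));
    }

    public static void main(String[] args) {
        System.out.println(swapAll("a=1, b=2"));
        System.out.println(swapFirst("a=1, b=2"));
        System.out.println(usdToDollarSign("USD 5"));
    }
}
```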

Now for an example: rewrite the multi-line comment in the first Java listing below as the single-line comment shown in the second listing.

Before:

// Sample Code
package info.lession;

public class MainRunner {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // Add a task processor
        TaskProcessorManager.addProcessor(null);

        // Run the tasks
        TaskProcessorManager.processAll();
    }
}

After:

// Sample Code
package info.lession;

public class MainRunner {

    /** @param args */
    public static void main(String[] args) {
        // Add a task processor
        TaskProcessorManager.addProcessor(null);

        // Run the tasks
        TaskProcessorManager.processAll();
    }
}

First construct the regular expression (?s)/[*].*?[*]/ to match the comments to be changed, then use replaceAll to perform the replacement.




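import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void convertCommentFromMultilineToSingle() throws Exception {
    String lineSeparator = System.getProperty("line.separator");
    BufferedReader reader = new BufferedReader(new FileReader(
            new File("C:\\SampleCode.txt")));
    String lineString = null;
    StringBuilder stringBuilder = new StringBuilder();
    // Read the whole file so that a comment spanning several lines can be matched
    while ((lineString = reader.readLine()) != null) {
        stringBuilder.append(lineString).append(lineSeparator);
    }
    reader.close();
    // Find each /* ... */ comment; (?s) lets "." match line breaks
    Matcher matcher = Pattern.compile("(?s)/[*](.*?)[*]/").matcher(stringBuilder);
    StringBuffer result = new StringBuffer();
    while (matcher.find()) {
        // Collapse each line break, together with the indentation and the
        // leading "*" of the next line, into one space; then drop trailing spaces
        String body = matcher.group(1)
                .replaceAll("\\s*[\\r\\n]+\\s*[*]?\\s*", " ")
                .replaceAll("\\s+$", "");
        matcher.appendReplacement(result,
                Matcher.quoteReplacement("/*" + body + " */"));
    }
    matcher.appendTail(result);
    lineString = result.toString();
    // Write the converted source to the output file
    FileWriter fileWriter = new FileWriter("C:\\x.txt");
    fileWriter.write(lineString);
    fileWriter.close();
    System.out.println(lineString);
}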

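5. Using split

The split method breaks the original string apart according to a delimiter pattern and places the resulting pieces in an array, so that each piece can be iterated over. The String class provides two split methods: one extracts the pieces according to the delimiter pattern in the usual way; the other is more flexible and lets us limit the number of times the pattern is applied, which yields different results. The split method of a Pattern object created through java.util.regex.Pattern produces the same results as the split method of String. In the result array, the pieces appear in the same order as in the original string.

Sun's Javadoc contains an example that splits the string "boo:and:foo" on the characters ":" and "o". The table below shows the result of split for different arguments: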

Regex	Limit	Result
:	2	{ "boo", "and:foo" }
:	5	{ "boo", "and", "foo" }
:	-2	{ "boo", "and", "foo" }
o	5	{ "b", "", ":and:f", "", "" }
o	-2	{ "b", "", ":and:f", "", "" }
o	0	{ "b", "", ":and:f" }

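The limit parameter restricts how many times the pattern is applied and therefore the length of the result array. In general, applying a delimiter n times produces n + 1 pieces, provided the string contains at least n delimiters. In Java, however, a positive limit n bounds the number of pieces rather than the number of splits: the pattern is applied at most n - 1 times and the array has at most n elements. Keep this in mind; the first row of the table is such a case. If n is not positive, the string is split as many times as possible, as in the third, fifth and sixth rows; when n is zero, split additionally discards trailing empty strings from the result.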
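The effect of limit can be checked directly with a short sketch that splits the Javadoc string both through String.split and through Pattern.split. The class SplitLimitDemo and its method names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    static final String INPUT = "boo:and:foo";

    /** Splits the sample string with String.split. */
    static String[] viaString(String regex, int limit) {
        return INPUT.split(regex, limit);
    }

    /** Splits the sample string with a compiled Pattern. */
    static String[] viaPattern(String regex, int limit) {
        return Pattern.compile(regex).split(INPUT, limit);
    }

    public static void main(String[] args) {
        int[] limits = {2, 5, -2, 0};
        for (int limit : limits) {
            // Print the element count and the pieces for each limit.
            String[] pieces = viaString("o", limit);
            System.out.println(limit + " -> " + pieces.length + " "
                    + Arrays.toString(pieces));
        }
    }
}
```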

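The table below analyzes the fourth result in the table above step by step. Steps 2 and 4 deserve attention. In step 2, splitting "o:and:foo" yields an empty string and ":and:foo": nothing precedes that "o", so an empty string is assigned. Likewise, in step 4 the "o" has nothing on either side, so splitting it yields two empty strings.

Step	Result
1	{ "b", "o:and:foo" }
2	{ "b", "", ":and:foo" }
3	{ "b", "", ":and:f", "o" }
4	{ "b", "", ":and:f", "", "" }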

.NET offers similar functionality; see System.Text.RegularExpressions.Regex.Split.
