Thursday, May 5, 2011

Using the Pattern Class in Regular Expression Matching

In Java, you compile a regular expression by calling the static factory method Pattern.compile(). It returns an object of type Pattern. E.g.:

Pattern myPattern = Pattern.compile("regex"); 


You can pass matching options as an optional second parameter, combining several flags with the bitwise OR operator (|).

Pattern.compile("regex",
    Pattern.CASE_INSENSITIVE
    | Pattern.DOTALL
    | Pattern.MULTILINE);


This call makes the regex case insensitive for ASCII characters, causes the dot to match line breaks, and causes the start and end of string anchors (^ and $) to match at embedded line breaks as well.
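As a rough illustration of the three flags working together, here is a minimal sketch. The class name PatternFlagsDemo, the helper found() and the sample regex and subject string are made up for this example; only Pattern.compile(), the flag constants and Matcher.find() come from the standard library.

```java
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // Hypothetical sample regex, compiled once with all three flags.
    static final Pattern FLAGGED = Pattern.compile("^hello.world$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);

    // The same regex without any flags, for comparison.
    static final Pattern PLAIN = Pattern.compile("^hello.world$");

    // Returns true if the pattern matches anywhere in the subject.
    static boolean found(Pattern pattern, String subject) {
        return pattern.matcher(subject).find();
    }

    public static void main(String[] args) {
        String subject = "first line\nHELLO\nWORLD\nlast line";

        // MULTILINE lets ^ and $ match at the embedded line breaks,
        // CASE_INSENSITIVE lets "hello" match "HELLO",
        // DOTALL lets the dot match the line break between the two words.
        System.out.println(found(FLAGGED, subject)); // true
        System.out.println(found(PLAIN, subject));   // false
    }
}
```

Dropping any one of the three flags should make this particular match fail, which is a quick way to see what each flag contributes.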

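When working with Unicode strings, specify Pattern.UNICODE_CASE together with Pattern.CASE_INSENSITIVE if you want to make the regex case insensitive for all characters in all languages; UNICODE_CASE only changes how case-insensitive matching is done, so it has no effect on its own. You should also specify Pattern.CANON_EQ so that canonically equivalent Unicode sequences match each other, for example a precomposed accented letter and the same base letter followed by a combining accent. This flag can slow matching down, so leave it out if you are sure your strings contain only ASCII characters and you want to increase performance.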
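The sketch below tries both Unicode flags on small inputs. The class UnicodeFlagsDemo and its matches() helper are hypothetical names for this example. The CANON_EQ case follows the example given in the Pattern Javadoc (the expression "a\u030A" against the string "\u00E5").

```java
import java.util.regex.Pattern;

class UnicodeFlagsDemo {
    // Returns true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lowerE = "\u00e9"; // e with acute accent, lowercase
        String upperE = "\u00c9"; // E with acute accent, uppercase

        // CASE_INSENSITIVE alone only folds case for ASCII letters.
        System.out.println(matches(lowerE, Pattern.CASE_INSENSITIVE, upperE));
        // Adding UNICODE_CASE extends case folding to non-ASCII letters.
        System.out.println(matches(lowerE,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upperE));

        String decomposed = "a\u030A"; // 'a' followed by a combining ring above
        String precomposed = "\u00e5"; // single code point: a with ring above

        // Without CANON_EQ the two spellings are different character sequences.
        System.out.println(matches(decomposed, 0, precomposed));
        // With CANON_EQ they are treated as canonically equivalent.
        System.out.println(matches(decomposed, Pattern.CANON_EQ, precomposed));
    }
}
```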

If you use the same regular expression often in your source code, you should create a Pattern object once and reuse it to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to Pattern.compile(). If you use one of the String convenience methods instead, such as matches(), replaceAll() or split(), the only way to specify options is to embed a mode modifier in the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
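To show the embedded modifiers in use with plain String methods, here is a small sketch. The class EmbeddedFlagsDemo, its method names and the sample regexes are invented for this example; String.matches() and String.replaceAll() are the standard methods.

```java
class EmbeddedFlagsDemo {
    // (?i) plays the role of Pattern.CASE_INSENSITIVE.
    static boolean isHello(String input) {
        return input.matches("(?i)hello");
    }

    // (?s) plays the role of Pattern.DOTALL, so the dot also matches a line break.
    static boolean isAThenAnyCharThenB(String input) {
        return input.matches("(?s)a.b");
    }

    // (?m) plays the role of Pattern.MULTILINE, so ^ matches at the start of every line.
    static String quoteLines(String input) {
        return input.replaceAll("(?m)^", "> ");
    }

    public static void main(String[] args) {
        System.out.println(isHello("HELLO"));             // true
        System.out.println(isAThenAnyCharThenB("a\nb"));  // true
        System.out.println(quoteLines("one\ntwo"));       // "> one" and "> two" on separate lines
    }
}
```

Each call above recompiles its regex internally, so in a hot loop the compiled Pattern approach is likely the better fit.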

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call returns exactly the same result as "subject".split("regex"). The difference is that the former is faster when the pattern is reused, since the regex was already compiled.
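A minimal sketch comparing the two ways of splitting follows. The class SplitDemo, the COMMA constant and the comma-with-optional-whitespace regex are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once and reused for every call to splitCompiled().
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Splits using the precompiled pattern.
    static String[] splitCompiled(String subject) {
        return COMMA.split(subject);
    }

    // Splits using String.split(), which takes the regex as a string on every call.
    static String[] splitViaString(String subject) {
        return subject.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String subject = "a , b,c";
        System.out.println(Arrays.toString(splitCompiled(subject)));  // [a, b, c]
        System.out.println(Arrays.toString(splitViaString(subject))); // [a, b, c]
    }
}
```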
