The first half of this tutorial introduced you to regular expressions and the Regex API. You learned about the Pattern class, then worked through examples demonstrating regex constructs, from basic pattern matching with literal strings to more complex matches using ranges, boundary matchers, and quantifiers.
In Part 2 we'll pick up where we left off, exploring methods associated with the Pattern, Matcher, and PatternSyntaxException classes. You'll also be introduced to two tools that use regular expressions to simplify common coding tasks. The first extracts comments from code for documentation purposes. The second is a reusable library for performing lexical analysis, which is an essential component of assemblers, compilers, and similar software.
Explore the Regex API
Pattern, Matcher, and PatternSyntaxException are the three classes that comprise the Regex API. Each class offers methods that you can use to integrate regexes into your code.
Pattern methods
An instance of the Pattern class describes a compiled regex, also known as a pattern. Regexes are compiled to increase performance during pattern-matching operations. The following static methods support compilation.
- Pattern compile(String regex) compiles regex's contents into an intermediate representation stored in a new Pattern object. This method either returns the object's reference upon success, or throws PatternSyntaxException if it detects invalid syntax in the regex. Any Matcher object used by or returned from this Pattern object adheres to various default settings, such as case-sensitive searching. As an example, Pattern p = Pattern.compile("(?m)^\\."); creates a Pattern object that stores a compiled representation of the regex for matching all lines starting with a period character.
- Pattern compile(String regex, int flags) accomplishes the same task as Pattern compile(String regex), but is able to account for flags: a bitwise-inclusive ORed set of flag constant bit values. Pattern declares CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNICODE_CHARACTER_CLASS, and UNIX_LINES constants that can be bitwise ORed together (e.g., CASE_INSENSITIVE | DOTALL) and passed to flags.
  Except for CANON_EQ, LITERAL, and UNICODE_CHARACTER_CLASS, these constants are an alternative to embedded flag expressions, which were demonstrated in Part 1. For example, Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE); is equivalent to the previous example, where the Pattern.MULTILINE constant and the (?m) embedded flag expression accomplish the same task. The Pattern compile(String regex, int flags) method throws java.lang.IllegalArgumentException when flags contains bit values other than those defined by Pattern's flag constants.
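The equivalence between the embedded flag expression and the flag constant can be checked with a small sketch. The class name CompileFlagsDemo, the countMatches helper, and the sample text are assumptions made for illustration; they are not part of the tutorial's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileFlagsDemo {
    // Hypothetical helper: counts how many matches the pattern finds in the text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find())
            count++;
        return count;
    }

    public static void main(String[] args) {
        String text = ".first\nsecond\n.third"; // assumed sample input: three lines

        Pattern embedded = Pattern.compile("(?m)^\\.");
        Pattern constant = Pattern.compile("^\\.", Pattern.MULTILINE);
        Pattern noFlag = Pattern.compile("^\\.");

        System.out.println(countMatches(embedded, text)); // 2
        System.out.println(countMatches(constant, text)); // 2
        System.out.println(countMatches(noFlag, text));   // 1 (only the start of the text)

        try {
            Pattern.compile("("); // unclosed group: invalid syntax
        } catch (PatternSyntaxException e) {
            System.out.println("Invalid regex: " + e.getDescription());
        }
    }
}
```

Without multiline mode, ^ matches only at the start of the text, which is why the third pattern finds a single match.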
At times you will need to obtain a copy of an original regex string that has been compiled into a Pattern object, along with the flags it is using. You can do this by calling the following methods:
- String pattern() returns the original regex string that was compiled into the Pattern object.
- int flags() returns the Pattern object's flags.
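As a rough illustration, the following sketch reads back the regex string and tests individual flag bits. The class name PatternInfoDemo and the hasFlag helper are hypothetical names introduced here.

```java
import java.util.regex.Pattern;

class PatternInfoDemo {
    // Hypothetical helper: tests whether a given flag bit is set on the pattern.
    static boolean hasFlag(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

        System.out.println(p.pattern());                   // ^\.
        System.out.println(hasFlag(p, Pattern.MULTILINE)); // true
        System.out.println(hasFlag(p, Pattern.DOTALL));    // false
    }
}
```

Because flags() returns a bit set, masking with a flag constant is one straightforward way to check a single setting.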
After obtaining a Pattern object, you'll typically use it to obtain a Matcher object, so that you can perform pattern-matching operations. The Matcher matcher(CharSequence input) method creates a Matcher object that matches the provided input text against the Pattern object's compiled regex, and returns a reference to this Matcher object. For example, Matcher m = p.matcher(args[1]); returns a Matcher for the Pattern object referenced by variable p.
Splitting text
Most developers have written code to break input text into its component parts, such as converting a text-based employee record into a set of fields. Pattern offers a quicker way to handle this tedium, via a pair of text-splitting methods:
- String[] split(CharSequence text, int limit) splits text around matches of the Pattern object's pattern and returns the results in an array. Each entry specifies a text sequence that's separated from the next text sequence by a pattern match (or the text's end). All array entries are stored in the same order as they appear in the text.
  In this method, the number of array entries depends on limit, which also controls the number of matches that occur:
  - A positive value means that at most limit - 1 matches are considered and the array contains no more than limit entries.
  - A negative value means all possible matches are considered, and the array can be of any length.
  - A zero means all possible matches are considered, the array can have any length, and trailing empty strings are discarded.
- String[] split(CharSequence text) invokes the previous method with zero as the limit and returns the method call's result.
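The effect of the three kinds of limit values can be seen in a minimal sketch. The class name SplitLimitDemo, the splitOnComma helper, and the sample input are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Hypothetical helper: splits text around commas using the given limit.
    static String[] splitOnComma(String text, int limit) {
        return Pattern.compile(",").split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,"; // assumed input with two trailing empty fields

        // Positive limit: at most limit - 1 matches are used, so 2 entries.
        String[] positive = splitOnComma(text, 2);
        System.out.println(positive.length + " " + Arrays.toString(positive));

        // Negative limit: all matches are used and trailing empty strings are kept (5 entries).
        String[] negative = splitOnComma(text, -1);
        System.out.println(negative.length + " " + Arrays.toString(negative));

        // Zero: all matches are used and trailing empty strings are discarded (3 entries).
        String[] zero = splitOnComma(text, 0);
        System.out.println(zero.length + " " + Arrays.toString(zero));
    }
}
```

With a limit of 2, the second entry holds the unsplit remainder of the text, b,c,, in this case.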
Here's how split(CharSequence text) handles the task of splitting an employee record into its field components of name, age, street address, and salary:
Pattern p = Pattern.compile(",\\s");
String[] fields = p.split("John Doe, 47, Hillsboro Road, 32000");
for (int i = 0; i < fields.length; i++)
   System.out.println(fields[i]);
The above code specifies a regex that matches a comma character immediately followed by a single whitespace character (\s matches any whitespace character, not only a space). Here's the output:
John Doe
47
Hillsboro Road
32000
Pattern predicates and the Streams API
Java 8 introduced the Predicate<String> asPredicate() method to Pattern. This method creates a predicate (Boolean-valued function) that's used for pattern matching. The code below demonstrates asPredicate():
List<String> progLangs = Arrays.asList("apl", "basic", "c", "c++", "c#", "cobol",
                                       "java", "javascript", "perl", "python",
                                       "scala");
Pattern p = Pattern.compile("^c");
progLangs.stream().filter(p.asPredicate()).forEach(System.out::println);
This code creates a list of programming language names, then compiles a pattern for matching all of the names that start with the lowercase letter c. The last line above obtains a sequential stream with the list as its source. It installs a filter that uses asPredicate()'s Boolean function, which returns true when a name begins with c, and iterates over the stream, outputting matched names to the standard output.
That last line is equivalent to the following traditional loop, which you might remember from the RegexDemo application in Part 1:
for (String progLang: progLangs)
   if (p.matcher(progLang).find())
      System.out.println(progLang);
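To check that the stream and the loop select the same names, the two approaches can be placed side by side in a self-contained sketch. The class name PredicateDemo and the viaStream and viaLoop helpers are hypothetical; they collect results into lists instead of printing them so that the results can be compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PredicateDemo {
    // Hypothetical helper: filters names with the pattern's predicate.
    static List<String> viaStream(Pattern p, List<String> names) {
        return names.stream().filter(p.asPredicate()).collect(Collectors.toList());
    }

    // Hypothetical helper: filters names with an explicit find() loop.
    static List<String> viaLoop(Pattern p, List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names)
            if (p.matcher(name).find())
                result.add(name);
        return result;
    }

    public static void main(String[] args) {
        List<String> progLangs = Arrays.asList("apl", "basic", "c", "c++", "c#", "cobol",
                                               "java", "javascript", "perl", "python",
                                               "scala");
        Pattern p = Pattern.compile("^c");

        System.out.println(viaStream(p, progLangs)); // [c, c++, c#, cobol]
        System.out.println(viaLoop(p, progLangs));   // [c, c++, c#, cobol]
    }
}
```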
Matcher methods
An instance of the Matcher class describes an engine that performs match operations on a character sequence by interpreting a Pattern's compiled regex. Matcher objects support different kinds of pattern-matching operations:
- boolean find() scans input text for the next match. This method starts its scan either at the beginning of the given text, or at the first character following the previous match. The latter option is only possible when the previous method invocation has returned true and the matcher hasn't been reset. In either case, Boolean true is returned when a match is found. You will find an example of this method in the RegexDemo application from Part 1.
- boolean find(int start) resets the matcher and scans text for the next match. The scan begins at the index specified by start. Boolean true is returned when a match is found. For example, m.find(1); scans text beginning at index 1. (Index 0 is ignored.) If start contains a negative value or a value exceeding the length of the matcher's text, this method throws java.lang.IndexOutOfBoundsException.
- boolean matches() attempts to match the entire text against the pattern. This method returns true when the entire text matches. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.matches()); outputs false because the ! symbol isn't a word character.
- boolean lookingAt() attempts to match the given text against the pattern, starting at the beginning of the text. This method returns true when a prefix of the text matches. Unlike matches(), the entire text doesn't need to be matched. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.lookingAt()); outputs true because the beginning of the abc! text consists of word characters only.
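The four operations can be compared in one short sketch. The class name MatcherOpsDemo, its helper methods, and the sample text a1 b22 c333 are assumptions introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherOpsDemo {
    // Hypothetical helper: true only when the entire text matches the regex.
    static boolean matchesAll(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // Hypothetical helper: true when the regex matches at the beginning of the text.
    static boolean matchesPrefix(String regex, String text) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    // Hypothetical helper: first match found at or after index start, or null if none.
    static String findFrom(String regex, String text, int start) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("\\w*", "abc!"));    // false
        System.out.println(matchesPrefix("\\w*", "abc!")); // true

        // Repeated find() calls walk through successive matches.
        Matcher m = Pattern.compile("\\d+").matcher("a1 b22 c333");
        while (m.find())
            System.out.println(m.group() + " at index " + m.start());

        // find(int) resets the matcher and starts scanning at the given index.
        System.out.println(findFrom("\\d+", "a1 b22 c333", 3)); // 22
    }
}
```

The loop prints 1 at index 1, 22 at index 4, and 333 at index 8, because each find() call resumes after the previous match.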
Unlike Pattern objects, Matcher objects record state information. Occasionally, you might want to reset a matcher to clear that information after performing a pattern match. The following methods reset a matcher:
- Matcher reset() resets a matcher's state, including the matcher's append position (which is cleared to zero). The next pattern-match operation begins at the start of the matcher's text. A reference to the current Matcher object is returned. For example, m.reset(); resets the matcher referenced by m.
- Matcher reset(CharSequence text) resets a matcher's state and sets the matcher's text to text. The next pattern-match operation begins at the start of the matcher's new text. A reference to the current Matcher object is returned. For example, m.reset("new text"); resets the m-referenced matcher and also specifies new text as the matcher's new text.
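A brief sketch shows how both reset methods affect where the next find() call starts. The class name MatcherResetDemo, the run helper, and the sample texts are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherResetDemo {
    // Hypothetical helper: records the match seen after each step.
    static String[] run() {
        Matcher m = Pattern.compile("\\d+").matcher("10 20 30");
        String[] seen = new String[4];

        m.find();
        seen[0] = m.group(); // first match in the original text
        m.find();
        seen[1] = m.group(); // second match, continuing after the first

        m.reset();           // back to the start of the same text
        m.find();
        seen[2] = m.group();

        m.reset("7 8");      // same pattern, new text
        m.find();
        seen[3] = m.group();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // [10, 20, 10, 7]
    }
}
```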
Appending text
A matcher's append position identifies the start of the matcher's text that's appended to a java.lang.StringBuffer object. The following methods use the append position:
- Matcher appendReplacement(StringBuffer sb, String replacement) reads the matcher's text characters, starting at the append position, and appends them to the sb-referenced StringBuffer object. This method stops reading after the last character preceding the previous pattern match. Next, the method appends the characters in the replacement-referenced String object to the StringBuffer object. (The replacement string may contain references to text sequences captured during the previous match, via dollar-sign characters ($) and capturing group numbers.) Finally, the method sets the matcher's append position to the index of the last matched character plus one, then returns a reference to the current matcher.
  The Matcher appendReplacement(StringBuffer sb, String replacement) method throws java.lang.IllegalStateException when the matcher hasn't yet made a match, or when the previous match attempt has failed. It throws IndexOutOfBoundsException when replacement specifies a capturing group that doesn't exist in the pattern.
- StringBuffer appendTail(StringBuffer sb) appends the matcher's remaining text, starting at the append position, to the StringBuffer object and returns that object's reference. Following a final call to the appendReplacement(StringBuffer sb, String replacement) method, call appendTail(StringBuffer sb) to copy remaining text to the StringBuffer object.
The following code calls appendReplacement(StringBuffer sb, String replacement) and appendTail(StringBuffer sb) to replace all occurrences of cat with caterpillar in the provided text:
Pattern p = Pattern.compile("(cat)");
Matcher m = p.matcher("one cat, two cats, or three cats on a fence");
StringBuffer sb = new StringBuffer();
while (m.find())
   m.appendReplacement(sb, "$1erpillar");
m.appendTail(sb);
System.out.println(sb);
Placing a capturing group and a reference to the capturing group in the replacement text instructs the program to insert erpillar after each cat match. The above code results in the following output:
one caterpillar, two caterpillars, or three caterpillars on a fence
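Because the replacement string is supplied on each iteration, the same loop can also compute a different replacement for every match. The following is a minimal sketch of that idea; the class name DoubleNumbersDemo, the doubleNumbers helper, and the sample text are assumptions and do not come from the tutorial.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleNumbersDemo {
    // Hypothetical helper: replaces every run of digits with twice its value.
    static String doubleNumbers(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before the match, then the computed replacement.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        m.appendTail(sb); // copies whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 cats and 12 dogs")); // 6 cats and 24 dogs
    }
}
```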
Replacing text
Matcher provides a pair of text-replacement methods that complement appendReplacement(StringBuffer sb, String replacement). These methods let you replace either the first match or all matches:
- String replaceFirst(String replacement) resets the matcher, creates a new String object, copies all of the matcher's text characters (up to the first match) to the string, appends the replacement characters to the string, copies remaining characters to the string, and returns the String object. (The replacement string may contain references to text sequences captured during the previous match, via dollar-sign characters and capturing-group numbers.)
- String replaceAll(String replacement) operates similarly to replaceFirst(String replacement), but replaces all matches with replacement's characters.
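The difference between the two methods can be sketched by reusing the cat pattern from the appending example. The class name ReplaceDemo and its two helper methods are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    // Hypothetical helper: replaces only the first match.
    static String first(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceFirst(replacement);
    }

    // Hypothetical helper: replaces every match.
    static String all(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "one cat, two cats";

        System.out.println(first("(cat)", text, "$1erpillar")); // one caterpillar, two cats
        System.out.println(all("(cat)", text, "$1erpillar"));   // one caterpillar, two caterpillars
    }
}
```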
The \s+ regex detects one or more occurrences of whitespace characters in the input text. Below, we use this regex and call the replaceAll(String replacement) method to remove duplicate whitespace:
Pattern p = Pattern.compile("\\s+");
Matcher m = p.matcher("Remove the \t\t duplicate whitespace. ");
System.out.println(m.replaceAll(" "));
Here is the output: