Friday, September 12, 2008

Regular Expressions in C# - Negative Look-ahead

In my last article on Regular Expressions, we looked at a couple of simple expressions and 3 algorithms to use the expressions for “validating” strings. The main point of the article is that Regular Expression behavior will be confusing if not considered in the context of the algorithm being used. Please take a look at Regular Expressions. What’s that got to do with C#? to get the background for this article.

In this article, we will discuss one of the more difficult Regular Expression problems. How do you use “Negative Look-ahead”? To use Regular Expression “Negative Look-Ahead” or “Negative Look-Behind”, you have to change the way you think about pattern matching. First, the negative look-ahead takes the syntax…

“(?!pattern)”

In the words of the MSDN documentation…

(Zero-width negative lookahead assertion.) Continues match only if the subexpression does not match at this position on the right. For example, \b(?!un)\w+\b matches words that do not begin with un.


This expression will “find” parts of the supplied string that are not followed by the “pattern”. One might erroneously think that this pattern will refuse to match strings that have the “pattern” in them. It does not, in fact, the following example actually “finds” some strange results.

string pattern = @"(?!invalid)"; //negative lookahead
string test = "invalid";
Regex rx = new Regex(pattern);
Console.WriteLine("Match:\t\t\t{0}\t({1})", rx.Match(test).Success.ToString(), test);
Match mx = rx.Match(test);
while (mx.Success)
{
foreach (Group g in mx.Groups)
{
Console.WriteLine("\t\t\t\t({0}) ({1}): {2}", mx.Value, g.Value, g.Index);
}
mx = mx.NextMatch();
}

The result is that there are actually 7 matches in the test string. The confusing matter is that each of the matches is an empty string, and the index for each match increments, from 1 to the number of characters in the test string. To understand this behavior, let’s think about the pattern differently. Let’s break it apart into some simple components. You could consider the pattern to be equivalent to…

“” + “(?!invalid)” + “”

This is a pattern to match an empty string, followed by a pattern to reject the string “invalid”, followed by a pattern to match an empty string. In short, without the negative look-ahead, this is a pattern to match an empty string. If you were to use just the simple empty string pattern on the test string, there would be 8 matches with indices for each match incrementing from 0 to the number of characters in the test string. There are 7 matches with the negative look-ahead and 8 matches without. The indices start at 1 with the negative look-ahead, and they start at 0 without. So, the negative look-ahead is causing one of the potential matches to fail. It’s rejecting the empty string match that occurs just prior to the first character in the test string. I.e. it is rejecting the only empty string match that is followed by “invalid” and matching all the others.

Oh my head is spinning! Why did I eliminate the negative look-ahead above? I did so to understand its effect on the entire pattern matching process. And I learned that the negative is actually finding a pattern to eliminate. In other words, it’s doing its job. It is eliminating from the set of matches the one match that is followed by the string “invalid”. But, because there are many other “empty-string” matches, IsMatch() returns “true” and Match() returns 7 matches. I also eliminated the negative look-ahead because look-around” patterns do not “consume” characters. They will never show up in the match result, so by temporarily removing the non-consuming pattern, I can see the real pattern that the look-around pattern constraints.

The Regular Expression philosophy is to find things, not to eliminate things. Since Regex works so hard at finding matches, it becomes difficult to write patterns whose job is to exclude strings. One could just write the pattern to find the strings that are unwanted and negate the match in program logic. Negative matching syntax is available, but Regex treats such syntax as a means of reducing the number of matches while trying to find ANYTHING that would otherwise match.

So how do I make this work? Consider the requirement to match only those strings that do not have the sequence “invalid” in it. First, you must define a pattern to match any otherwise valid string. You must define the pattern in such a way that it will match the string using the most restrictive form of the 3 programming approaches described in my last article. In particular, your pattern must pass the following test…

Match mx = Regex.Match(test, pattern);
if(mx.Success && (mx.Groups[0].Value == test))

The test above makes sure that not only does the test string “have” a match but the test string “is” the match. The empty string pattern will find many matches, but the whole match will not equal the test string for any test string other than an empty string. You have to watch out for the asterisk (*) and plus (+) which will consume as much as they can, or as little as necessary to achieve a match. Such patterns will work in the logic above, but they can change behavior as you start adding look-around patterns. In the case of our requirement, the all encompassing pattern would be “^.*”. Leave off the “$” because we will be using look-ahead. When looking ahead, we do not want to anchor the end of the string unless absolutely necessary.

Next, define a pattern to find whole strings that you’d like to exclude. “.*invalid.*” works. This pattern matches any string containing the sequence “invalid”. Next, wrap this pattern with the negative look-ahead syntax “(?.*invalid.*)”. And finally, insert it into the first pattern after the first anchor (^ in our case). Our resulting pattern would be “^(?!.*invalid.*).*”. Use the following test jig to prove this pattern to yourself.

string pattern = @"^(?!.*invalid.*).*";//negative look-ahead
string[] tests = {
"invalid",
"",
"this is also invalid",
"but this is okay",
"but this invalid string is not",
};

Regex rx = new Regex(pattern);
foreach (string test in tests)
{
Console.WriteLine("Match:\t\t\t{0}\t({1})", rx.Match(test).Success.ToString(), test);
Match mx = rx.Match(test);
while (mx.Success)
{
foreach (Group g in mx.Groups)
{
Console.WriteLine("\t\t\t\t({0}) ({1}): {2}", mx.Value, g.Value, g.Index);
}
mx = mx.NextMatch();
}
}

I’ll stop here for today and let you digest what’s going on. Certainly, the pattern can be optimized and some characters reduced. Experiment with changes and observe the effects. I am not through with the subject of Regular Expressions and Negative type matching. Check back later as I hope to post on the subject again soon.

2 comments:

jbl said...

Hi,

thanks for this very interesting post on this subject I just discovered.

I have a question :
In your last example, wouldn't this more complex pattern : ^(?!invalid)(.(?!invalid))*$ lead to a more simple (less matches) result ?

Provided it's equivalent, what would you recommand ?

Les Potter - Xalnix Corporation said...

Yes jbl, your pattern is nice and should work as an equivalent. Thanks for the post.