Saturday, September 20, 2008

Regular Expression Double Negatives

In my last article titled "Regular Expression Alternation" we discussed Alternation as one means of performing the AND operation in a regular expression pattern.

(?(expression)yes|no)

But, what if you don't have this pattern in your version of RegEx? Some of you may be reading to gain knowledge that you can apply in other programming languages besides C# or the .NET environment. The pattern I described in the last article does not work in all versions of .NET Regular Expressions. In particular, it does not work in JScript Regular Expression Syntax. And, when you use tools like ASP.NET’s RegularExpressionValidator, you are implicitly using JScript. So then, what do you do?

To restate the original requirement, we want to validate a string that has at least one digit, at least one uppercase letter, and at least 6 characters. The problem statement is clearly an AND problem, so it begs to be written with the Alternation pattern. But in situations where we can’t use the Alternation, we can opt for my second approach. We can use the Negative Look-Ahead pattern.

(?!expression)

The negative look-ahead pattern is “non-consuming” and matches strings (or parts of string) that are not followed with the “expression”. You may wish to brush up on negative look-ahead with my article “Regular Expressions in C# - Negative Look-Ahead”. With negative look-ahead and using the OR pattern, we can use double-negative logic to express the original problem. This approach works based upon the logic axiom…

( A AND B ) == NOT ( (NOT A) OR (NOT B) )

Using this rule, the problem can be stated differently. We can state our problem in terms of ORs instead of ANDs. In other words...


If it is not the case that our string is devoid of digits or devoid of uppercase letters or short of 6 characters, then it is a valid string.


Notice I chose terms that suggest negative tests. Take a moment and think about it.

IF ( NOT ( NOT(Has Digit) OR NOT(Has Uppercase) OR NOT(6 or more chars) ) )

This has the same logic result as…

IF ( (Has Digit) AND (Has Uppercase) AND (6 or more chars) )

If you can’t get your mind around this basic programming concept, then re-writing your pattern requirement in terms of ORs will be difficult. So, experiment on paper and with generic logic variables to prove to yourself that your re-written pattern logic means the same thing as the original.

Now let’s look at the actual Regular Expression pattern. The component patterns from which we build look like this...

(^[^\d]*$) string is devoid of digits
(^[^A-Z]*$) string is devoid of uppercase letters
(^.{0,5}$) string has 5 or fewer characters

String them together to get the inner test ...

(^[^\d]*$)|(^[^A-Z]*$)|(^.{0,5}$)

Sometimes you can pause here and simply test for your string to match. If you supply an invalid string, it should MATCH. If you provide a valid string, it should NOT match. So, at this point, we have the opposite of what we really want. (A validator may not allow you to test this way, you would have to write a C# program to use as a “test jig”). So, we now negate the inner pattern with the negative look-ahead syntax...

(?!(^[^\d]*$)|(^[^A-Z]*$)|(^.{0,5}$))

This would work for the IsMatch() approach, but it won't quite work for the Match() approach. We have to add some bounds...

^.*(?!(^[^\d]*$)|(^[^A-Z]*$)|(^.{0,5}$)).*$

This pattern is much harder to decipher than the AND pattern using Alternation. But now, armed with the knowledge from these last two articles, you should be able to discern the meaning of such patterns. We should be able to plug this pattern in to all of the algorithms we covered in previous article and have it validate strings according to the requirement.

That’s all for now, I hope you enjoyed the discussion and find these examples useful.

No comments: