Saturday, October 25, 2008

Regular Expressions in C# - Password Validator Revisited

Sometimes we make life more difficult than it needs to be.

Lately, I've learned a little more about Regular Expressions and "Negative Look-Around". I've used it a lot, but it only recently dawned upon me that I was not making full use of it.

Case in point, take a look at my earlier regular expression article where I explain how to validate a password with multiple requirements (Regular Expression Alternations). This article discusses the absense of the boolean AND in regular expressions and provides a complex IF-THEN-ELSE approach to test a string for conforming to multiple "password" constraints. And my subsequent article (Regular Expression Double Negatives) gets more complex with an "inside-out" approach to compensate for the lack of an AND operator.

But, let's go back even further to the article where I "explain" Negative Look-ahead. In this article, I provide the simple Negative Look-ahead pattern...


“(?!pattern)”

... and proceed to show the unexpected behavior and put it in perspective so that you can understand the behavior better. Though the pattern looks simple, its behavior is not. And frankly, I'm beginning to consider it a mis-use of look-around. Even my standard pattern for any sort of look-around goes something like this...


[look-behind]pattern[look-ahead]

... where either "look-around" pattern is optional. This is certainly a valid use of the look-around as it creates required bounds around the pattern to be found. Or, to put it another way, the positive look-behind and look-ahead here describes the "context" of the pattern to match.

But in my earlier articles, I suggested that multiple constraints on a pattern was a lot more difficult. Yes, it's more difficult, but not really as difficult as I first believed. Once you understand the look-around pattern better than I did, in particular, once you understand the construction...


[look-ahead]pattern[look-behind]

... you can begin writing more understandable multiple-constraint patterns.

Let's dig in. I've introduced two terms that need explanation. Context and Constraint are the terms I used for describing two important aspects of the pattern matching requirements.

When requirements dictate that we are looking for occurances of a particular pattern in a string, very often the match will be dependent upon the non-matched terms around it. This is the context of the match. For example, if we are looking for the word "an" in a paragraph, the implied context is that the two characters "an" are only a word if it is preceded by whitespace or starts the paragraph, and it is followed by whitespace or ends the paragraph. The simple pattern for the word "an" would be...


\b[Aa]n\b

... but look-around syntax can also be used...


[look-behind]pattern[look-ahead]
(?<=^|\s+)[Aa]n(?=$|\s)

... thus expressing the context of the pattern that makes it a word. (Actually, word boundary must take into account words at the end of phrases and sentences which are followed with punctuation. For now, we'll keep it simple.)

Expanding the context, let's say we want to find the word "an" where ever it is mis-used. That would be every occurance of "an" that is not followed by a word beginning with a vowel. (Again, there are some exceptions. When an acronym begins with an F, H, L, M, N, R, S, or X, and we vocalize the letters as in FTP, we precede the acronym with the article "an" rather than "a". But when vocalizing the acronym as a word as in SCSI (pronounced scuzzy), we use the article "a". Again, let's keep it simple.) So, the pattern would look like this...


\b[Aa]n\b(?=\s*[^AaEeIiOoUu])

... finding all occurances of the word "an" followed by a word that should use the article "a" instead. And a simple Regex.Replace() will fix such occurances as in this example...


string test =
@"This is a test of a string that mis-uses the word ""an"".
An helicopter is incorrect in American english.
But an elephant is correct usage.";
string pattern = @"\b[Aa]n\b(?=\s*[^AaEeIiOoUu])";
Console.WriteLine("{0}", Regex.Replace(test, pattern, "A"));

... Yes, it will replace "An" or "an" with the uppercase "A". An "Evaluator" parameter would make great sense here. But rather than getting bogged down in the details, I want to look at constraints right now.

Constraints generally provide the exceptions to the rule. Your search requirements may be to find words, but not all words. Constraints limit your matches to more specific values or exclude certain values. By their very nature, constraints are often best described with the boolean AND operator. According to my previous articles, you might conclude that there is no hope but to write a very obscure pattern using double negative and OR pattern matching. I'm happy to say, you have a few more options based upon the constraining construction using look-around. That construction would be...


[look-ahead]pattern[look-behind]

... In this construction, when implemented correctly, the look-ahead and look-behind will both examine the same characters that pattern examines. But, look-around is non-consumptive. That means a look-around expression that matches does not appear in the Match.Value property of a Regex.Match() operation. For example...


string test = "the end.";
string pattern = "(?<=the )end"; //[look-behind]pattern
Console.WriteLine("what'd I find ({0})",Regex.Match(test,pattern).Value);
// Outputs: what'd I find (end)

... So, the pattern "the " was found but was not considered part of the match.

The non-consumptive behavior comes in very handy when placing a look-ahead in front of a pattern. That's because the look-ahead AND the pattern must both match to have a match. Let's say we want to validate a string contains upper and lowercase letters and has at least one lowercase letter. You could write this pattern without look-around as follows...


string pattern = "^[a-zA-Z]*[a-z]+[a-zA-Z]*$";

... or with look-ahead as ...


string pattern = "^(?=[a-zA-Z]*$).*[a-z].*$";

... Notice that the look-ahead makes a general check that it only contains letters. But, because it does not consume the letters, the pattern following examines the same characters to make sure that the string contains a lowercase letter. The judicious placement of ^ and $ guarantee that both patterns examine the same set of characters. When arranged like this, the construction says that the look-ahead pattern AND the capture pattern must match or nothing matches.

Now, if you want to validate that the letter string has at least one uppercase and one lowercase letter, you cannot do it without look-around. The single pattern cannot be constructed in a way that does not impose an unwanted ordering. But with look-around, you only need to add additional look-around patterns like so...


string pattern = "^(?=.*[A-Z].*$)(?=[a-zA-Z]*$).*[a-z].*$";

... Here's some code to demonstrate...


string [] tests = {
"abcdEFG",
"ABCDEFG",
"Abcd$fg",
"abcdefg",
};
string pattern1 = "^[a-zA-Z]*[a-z]+[a-zA-Z]*$";
string pattern2 = "^(?=[a-zA-Z]*$).*[a-z].*$";
string pattern3 = "^(?=.*[A-Z].*$)(?=[a-zA-Z]*$).*[a-z].*$";
foreach (string test in tests)
{
Console.WriteLine("pattern1 {0} {1}", Regex.IsMatch(test, pattern1).ToString(), test);
Console.WriteLine("pattern2 {0} {1}", Regex.IsMatch(test, pattern2).ToString(), test);
Console.WriteLine("pattern3 {0} {1}", Regex.IsMatch(test, pattern3).ToString(), test);
}

So then, we do have an AND operation implied in the way we combine the look-ahead and pattern. Similarly, the AND operation is implied when combining a pattern followed by the look-behind. But, in order for these to operate like an AND operation, the patterns in each look-ahead and the final pattern must overlap the string space exactly. I accomplished the overlap by placing the "^" start-of-string and "$" end-of-string in such a way that each individual pattern match refered to the same characters. Of course, you don't have to make them overlap precisely if you are looking for more exotic behavior than AND.

Here is another example of a pattern that's very difficult (if not impossible) to write without look-around. Consider a string with words in it. You want to capture all words except a few choice words, let's say "and", "but" and "or". You need a pattern that finds all words AND a pattern that rejects specific words. You can only reject specific words using negative look-around. Here is the pattern and some code to demonstrate...


string pattern = @"\b\w+(?<!\band|\bbut|\bor)\b";
string test =
"This is a test string, and it is good for finding words, but "
+"it will not find all words. If you want to find all words, "
+"modify this or write your own. Band is a word to make sure "
+"words are words, butter is another and so is order.";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
Console.WriteLine("{0}", m.Value);

... The pattern first finds the words, then the negative look-behind tosses out the ones we don't want.

Finally, let's look at that one more pattern. This would be the password validator pattern I covered in previous articles. But, this time we shall use a straight forward AND construction with look-ahead. A valid password will be only letters, and digits , will have at least one uppercase, one lowercase and one digit, and be at least 6 characters long. Here's the pattern...


string pattern = "^(?=[a-zA-Z0-9]{6,}$)(?=.*[a-z].*$)(?=.*[A-Z].*$).*[0-9].*$";

That it for now. I hope you've found this useful.

No comments: