Saturday, October 25, 2008

Regular Expressions in C# - Password Validator Revisited

Sometimes we make life more difficult than it needs to be.

Lately, I've learned a little more about Regular Expressions and "Negative Look-Around". I've used it a lot, but it only recently dawned upon me that I was not making full use of it.

Case in point, take a look at my earlier regular expression article where I explain how to validate a password with multiple requirements (Regular Expression Alternations). This article discusses the absense of the boolean AND in regular expressions and provides a complex IF-THEN-ELSE approach to test a string for conforming to multiple "password" constraints. And my subsequent article (Regular Expression Double Negatives) gets more complex with an "inside-out" approach to compensate for the lack of an AND operator.

But, let's go back even further to the article where I "explain" Negative Look-ahead. In this article, I provide the simple Negative Look-ahead pattern...


“(?!pattern)”

... and proceed to show the unexpected behavior and put it in perspective so that you can understand the behavior better. Though the pattern looks simple, its behavior is not. And frankly, I'm beginning to consider it a mis-use of look-around. Even my standard pattern for any sort of look-around goes something like this...


[look-behind]pattern[look-ahead]

... where either "look-around" pattern is optional. This is certainly a valid use of the look-around as it creates required bounds around the pattern to be found. Or, to put it another way, the positive look-behind and look-ahead here describes the "context" of the pattern to match.

But in my earlier articles, I suggested that multiple constraints on a pattern was a lot more difficult. Yes, it's more difficult, but not really as difficult as I first believed. Once you understand the look-around pattern better than I did, in particular, once you understand the construction...


[look-ahead]pattern[look-behind]

... you can begin writing more understandable multiple-constraint patterns.

Let's dig in. I've introduced two terms that need explanation. Context and Constraint are the terms I used for describing two important aspects of the pattern matching requirements.

When requirements dictate that we are looking for occurances of a particular pattern in a string, very often the match will be dependent upon the non-matched terms around it. This is the context of the match. For example, if we are looking for the word "an" in a paragraph, the implied context is that the two characters "an" are only a word if it is preceded by whitespace or starts the paragraph, and it is followed by whitespace or ends the paragraph. The simple pattern for the word "an" would be...


\b[Aa]n\b

... but look-around syntax can also be used...


[look-behind]pattern[look-ahead]
(?<=^|\s+)[Aa]n(?=$|\s)

... thus expressing the context of the pattern that makes it a word. (Actually, word boundary must take into account words at the end of phrases and sentences which are followed with punctuation. For now, we'll keep it simple.)

Expanding the context, let's say we want to find the word "an" where ever it is mis-used. That would be every occurance of "an" that is not followed by a word beginning with a vowel. (Again, there are some exceptions. When an acronym begins with an F, H, L, M, N, R, S, or X, and we vocalize the letters as in FTP, we precede the acronym with the article "an" rather than "a". But when vocalizing the acronym as a word as in SCSI (pronounced scuzzy), we use the article "a". Again, let's keep it simple.) So, the pattern would look like this...


\b[Aa]n\b(?=\s*[^AaEeIiOoUu])

... finding all occurances of the word "an" followed by a word that should use the article "a" instead. And a simple Regex.Replace() will fix such occurances as in this example...


string test =
@"This is a test of a string that mis-uses the word ""an"".
An helicopter is incorrect in American english.
But an elephant is correct usage.";
string pattern = @"\b[Aa]n\b(?=\s*[^AaEeIiOoUu])";
Console.WriteLine("{0}", Regex.Replace(test, pattern, "A"));

... Yes, it will replace "An" or "an" with the uppercase "A". An "Evaluator" parameter would make great sense here. But rather than getting bogged down in the details, I want to look at constraints right now.

Constraints generally provide the exceptions to the rule. Your search requirements may be to find words, but not all words. Constraints limit your matches to more specific values or exclude certain values. By their very nature, constraints are often best described with the boolean AND operator. According to my previous articles, you might conclude that there is no hope but to write a very obscure pattern using double negative and OR pattern matching. I'm happy to say, you have a few more options based upon the constraining construction using look-around. That construction would be...


[look-ahead]pattern[look-behind]

... In this construction, when implemented correctly, the look-ahead and look-behind will both examine the same characters that pattern examines. But, look-around is non-consumptive. That means a look-around expression that matches does not appear in the Match.Value property of a Regex.Match() operation. For example...


string test = "the end.";
string pattern = "(?<=the )end"; //[look-behind]pattern
Console.WriteLine("what'd I find ({0})",Regex.Match(test,pattern).Value);
// Outputs: what'd I find (end)

... So, the pattern "the " was found but was not considered part of the match.

The non-consumptive behavior comes in very handy when placing a look-ahead in front of a pattern. That's because the look-ahead AND the pattern must both match to have a match. Let's say we want to validate a string contains upper and lowercase letters and has at least one lowercase letter. You could write this pattern without look-around as follows...


string pattern = "^[a-zA-Z]*[a-z]+[a-zA-Z]*$";

... or with look-ahead as ...


string pattern = "^(?=[a-zA-Z]*$).*[a-z].*$";

... Notice that the look-ahead makes a general check that it only contains letters. But, because it does not consume the letters, the pattern following examines the same characters to make sure that the string contains a lowercase letter. The judicious placement of ^ and $ guarantee that both patterns examine the same set of characters. When arranged like this, the construction says that the look-ahead pattern AND the capture pattern must match or nothing matches.

Now, if you want to validate that the letter string has at least one uppercase and one lowercase letter, you cannot do it without look-around. The single pattern cannot be constructed in a way that does not impose an unwanted ordering. But with look-around, you only need to add additional look-around patterns like so...


string pattern = "^(?=.*[A-Z].*$)(?=[a-zA-Z]*$).*[a-z].*$";

... Here's some code to demonstrate...


string [] tests = {
"abcdEFG",
"ABCDEFG",
"Abcd$fg",
"abcdefg",
};
string pattern1 = "^[a-zA-Z]*[a-z]+[a-zA-Z]*$";
string pattern2 = "^(?=[a-zA-Z]*$).*[a-z].*$";
string pattern3 = "^(?=.*[A-Z].*$)(?=[a-zA-Z]*$).*[a-z].*$";
foreach (string test in tests)
{
Console.WriteLine("pattern1 {0} {1}", Regex.IsMatch(test, pattern1).ToString(), test);
Console.WriteLine("pattern2 {0} {1}", Regex.IsMatch(test, pattern2).ToString(), test);
Console.WriteLine("pattern3 {0} {1}", Regex.IsMatch(test, pattern3).ToString(), test);
}

So then, we do have an AND operation implied in the way we combine the look-ahead and pattern. Similarly, the AND operation is implied when combining a pattern followed by the look-behind. But, in order for these to operate like an AND operation, the patterns in each look-ahead and the final pattern must overlap the string space exactly. I accomplished the overlap by placing the "^" start-of-string and "$" end-of-string in such a way that each individual pattern match refered to the same characters. Of course, you don't have to make them overlap precisely if you are looking for more exotic behavior than AND.

Here is another example of a pattern that's very difficult (if not impossible) to write without look-around. Consider a string with words in it. You want to capture all words except a few choice words, let's say "and", "but" and "or". You need a pattern that finds all words AND a pattern that rejects specific words. You can only reject specific words using negative look-around. Here is the pattern and some code to demonstrate...


string pattern = @"\b\w+(?<!\band|\bbut|\bor)\b";
string test =
"This is a test string, and it is good for finding words, but "
+"it will not find all words. If you want to find all words, "
+"modify this or write your own. Band is a word to make sure "
+"words are words, butter is another and so is order.";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
Console.WriteLine("{0}", m.Value);

... The pattern first finds the words, then the negative look-behind tosses out the ones we don't want.

Finally, let's look at that one more pattern. This would be the password validator pattern I covered in previous articles. But, this time we shall use a straight forward AND construction with look-ahead. A valid password will be only letters, and digits , will have at least one uppercase, one lowercase and one digit, and be at least 6 characters long. Here's the pattern...


string pattern = "^(?=[a-zA-Z0-9]{6,}$)(?=.*[a-z].*$)(?=.*[A-Z].*$).*[0-9].*$";

That it for now. I hope you've found this useful.

Wednesday, October 15, 2008

Callbacks And Delegates

It's been a couple of weeks since last writing. Besides being very busy, I've had a touch of "writer's block". Nothing too serious, I should get over it soon.

Lately, I've been intrigued by the abundant use of callback functions and delegates in the .NET Framework. If you want to do anything interesting, you gotta know callbacks. I'll try to describe callbacks and delegates in simplistic terms. Maybe it will help you over the hump.

Consider the following code...


List list = new List();
//Populate list with some MyClass objects
for(int i=0; i<10; ++i)
list.Add(new MyClass(){ ID=i, strValue=i.ToString() });
//Sort the list
list.Sort();
foreach(MyClass c in list)
Console.WriteLine("{0}",c.strValue);

...

public class MyClass
{
public int ID;
public string strValue;
public override string ToString() { return strValue; }
}


This code throws an exception at the line that says Sort(). That's because the default List.Sort() expects the objects it contains to implement IComparable. Now, we could quickly add IComparable to the class and implement the interface. But, if you don't have control of the class, what would you do then? Or if the built-in IComparable interface didn't sort on the field you want, what then?

You're in luck, because Sort() is overloaded and will accept as a parameter an IComparer or a Comparison object.


public class CompareMyClass : IComparer
{
public int Compare(MyClass x, MyClass y)
{
return x.strValue.CompareTo(y.strValue);
}
}


With this piece of code, you can now call Sort() with a new CompareMyClass object and the exception goes away and the list comes out sorted. This is pretty cool, but it's also pretty static. If I wanted to sort by the ID field instead, I would have to change my class or write a new class. We can reduce some of that by going to the Comparison class. This is a generic class whose constructor takes as an argument a callback function. The Comparison constructor has the calling signature...



Comparison.Comparison(int (T,T) target)

The generic parameter T can be anything. But when we see the signature in the constructor's parameter list, one might start scratching their heads. The best way to read the parameter signature is to start with the word "target". It could have been anything, but "target" suggests that it is the target of some operation. In fact, it is. It is the target function that Sort() will call over and over to determine the correct order of the list elements. The "int (T,T)" is the "type" of the parameter. This means that "target" is a function that takes two parameters of type T and returns an "int" value. Since T is a generic parameter we can replace it with MyClass as we do below. We no longer need the CompareMyClass class, but we do need a function to call.


int MyCompare(MyClass x, MyClass y)
{ return x.strValue.CompareTo(y.strValue); }

// and Sort looks like this...
list.Sort(new Comparison(MyCompare));

You could have two different functions to compare MyClass objects two different ways.



int MyCompare2(MyClass x, MyClass y)
{ return x.ID.CompareTo(y.ID); }

But, then you have to plug in the correct function when you want a different sort behavior. If the comparison type might change at runtime based upon user input, you can set a "delegate" variable and provide that to Comparison() instead.


// at class scope
delegate int MyCompareDelegate(MyClass x, MyClass y); // declares a delegate type

...
// at method scope
MyCompareDelegate dlgt; // declares a delegate variable
if(radioButton1.Checked == true)
dlgt = MyCompare;
else
dlgt = MyCompare2;

list.Sort(new Comparison(dlgt));

The delegate lets us treat the functions as objects. Since a delegate for the callback will work just as well as the callback, we can write the comparison code right at the place where we use it with "anonymous delegates"...


MyCompareDelegate dlgt;
if (radioButton1.Checked == true)
dlgt = delegate(MyClass x, MyClass y)
{
return x.ID.CompareTo(y.ID);
};
else
dlgt = delegate(MyClass x, MyClass y)
{
return x.strValue.CompareTo(y.strValue);
};
//Sort the list
list.Sort(new Comparison(dlgt));

Here we have set the "dlgt" delegate variable to one of two "anonymous" functions. Creating and anonymous function returns a delegate that can be assigned to a variable like any other function, as long as the signatures match. Well if that's the case, then the Lambda syntax should also work, shouldn't it?


MyCompareDelegate dlgt;
if (radioButton1.Checked == true)
dlgt = (x,y) => x.ID.CompareTo(y.ID);
else
dlgt = (x,y) => x.strValue.CompareTo(y.strValue);
//Sort the list
list.Sort(new Comparison(dlgt));

Yes. The Lamda Expression implicitly determines the types of x, y and the return value from the delegate assignment and context. Pretty cool!

I stop there. This is starting to be too much fun. The power of the delegate and callback is huge. Keep playing with these and you'll get the hang of it.