Friday, September 5, 2008

Regular Expressions? What's That Got To Do With C#?

...Only that I often need to know Regular Expressions for my C# work. However, the online help and resources seem to come up a little short. So, today I diverge a little and discuss this cryptic yet valuable ancillary topic to try to help you through your next Regex dilemma.

I'm not going to waste time and internet bandwidth explaining what a Regular Expression is; there are plenty of sites for that. But I will give special thanks here to OmegaMan, who compiled and posted the following on the MSDN Regex Forum...

OmegaMan's .Net Regex Resources Reference

I refer to it often.

So let's dive right in. Whenever you are considering using regular expressions, you need to determine what kind of pattern matching problem you are trying to solve.

1) Do I want a regular expression to check a string for validity?
2) Do I want a regular expression to find certain things in my string?
3) Do I want a regular expression so that I can replace patterns in my string?

There can be some overlap in #1 and #2 since validity may depend upon the string containing "certain things". Certainly, if you want to replace something in #3, you need to find it in the string with #2. But, many of the problems people run into with regular expressions can be traced to not having identified the problem properly.
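To make the three kinds of problem concrete, here is a minimal sketch with one .NET Regex call for each. The class name, the method names, and the sample patterns (a five-digit code and runs of digits) are my own illustrations and not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

static class ThreeKindsOfRegexWork
{
    // 1) Validate: is the whole string exactly five digits?
    public static bool LooksLikeFiveDigitCode(string input)
    {
        return Regex.IsMatch(input, @"^\d{5}$");
    }

    // 2) Find: collect each run of digits that occurs in the string.
    public static string[] FindNumbers(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"\d+");
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            found[i] = matches[i].Value;
        return found;
    }

    // 3) Replace: swap each run of digits for a single '#'.
    public static string MaskNumbers(string input)
    {
        return Regex.Replace(input, @"\d+", "#");
    }

    static void Main()
    {
        Console.WriteLine(LooksLikeFiveDigitCode("12345"));            // True
        Console.WriteLine(string.Join(",", FindNumbers("a1b22c333"))); // 1,22,333
        Console.WriteLine(MaskNumbers("a1b22c333"));                   // a#b#c#
    }
}
```

Notice that #2 and #3 share the same pattern, while #1 needs a pattern that accounts for the whole string.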

Take the problem, "make sure my string contains only letters and digits". Sounds simple enough, and one may write the pattern ...


"[\da-zA-Z]*"
\d = digit
a-z = lowercase letter
A-Z = uppercase letter
[]* = zero or more of the characters in the brackets


This pattern might "find" letters and digits in a string, but it doesn't say that there are ONLY letters and digits in that string. Regex.IsMatch(...) tests a string against a pattern and returns true if the string contains a match. Regex.Match(...) tests a string against a pattern and returns a Match object that indicates whether a match was found and, if so, gives details about the first match.

So if you are doing a validity check with the pattern above, your results will depend on which tools you use and how you use them. Given that pattern, both IsMatch() and Match() will find matches in a string, even if it contains undesirable characters. In fact, because of the asterisk (*), the string doesn't have to contain any of the pattern characters for there to be "a match" (it matches the empty string). These functions, given this pattern, are simply indicating whether or not any part of the string matches the pattern. Here's some code to demonstrate...

Example 1:

// Requires: using System; using System.Text.RegularExpressions;
string pattern = @"[\da-zA-Z]*"; // the @ makes this a verbatim string, so C# leaves the \ alone
string[] tests = {
    "containsOnlyLettersAnd01234",
    "contains letters And 01234, but also spaces",
    "!@#$%", // contains none of the desired characters
    "", // a completely empty string
};
// IsMatch reports whether the pattern matches anywhere in the string.
foreach (string test1 in tests)
    Console.WriteLine("IsMatch:\t{0}\t({1})", Regex.IsMatch(test1, pattern), test1);
// Match(...).Success reports the same thing by way of the first Match object.
foreach (string test2 in tests)
    Console.WriteLine("Match:\t{0}\t({1})", Regex.Match(test2, pattern).Success, test2);

...outputting...

IsMatch: True (containsOnlyLettersAnd01234)
IsMatch: True (contains letters And 01234, but also spaces)
IsMatch: True (!@#$%)
IsMatch: True ()
Match: True (containsOnlyLettersAnd01234)
Match: True (contains letters And 01234, but also spaces)
Match: True (!@#$%)
Match: True ()
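It can help to look at what was actually matched in those cases. The following sketch prints the position, length, and text of the first match. Describe and its containing class are hypothetical names of mine; the Success, Index, Length, and Value properties belong to the .NET Match class.

```csharp
using System;
using System.Text.RegularExpressions;

static class FirstMatchDetails
{
    // Reports where the first match starts, how long it is, and what it contains.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return string.Format("Success={0} Index={1} Length={2} Value=\"{3}\"",
            m.Success, m.Index, m.Length, m.Value);
    }

    static void Main()
    {
        string pattern = @"[\da-zA-Z]*";

        // The first match stops at the first space.
        Console.WriteLine(Describe("contains letters And 01234, but also spaces", pattern));
        // prints: Success=True Index=0 Length=8 Value="contains"

        // No letters or digits at all, yet the empty match at index 0 still succeeds.
        Console.WriteLine(Describe("!@#$%", pattern));
        // prints: Success=True Index=0 Length=0 Value=""
    }
}
```

A successful match, then, says nothing about how much of the string was covered.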


Obviously, several of these strings are not valid by our requirements. So, what went wrong, and how would you "validate" the string? In order to do the desired validity check, one must consider both the pattern and the Regex method that will be used. In fact, the pattern as written will validate if you write the supporting code to accommodate it. For example...

Example 2:

Regex rx = new Regex(pattern);
foreach (string test1 in tests)
{
    // Valid only if the first match covers the entire input string.
    Match m = rx.Match(test1);
    Console.WriteLine("Modified Match:\t{0}\t({1})", m.Success && m.Value == test1, test1);
}

...outputting...

Modified Match: True (containsOnlyLettersAnd01234)
Modified Match: False (contains letters And 01234, but also spaces)
Modified Match: False (!@#$%)
Modified Match: True ()


Now that looks much better. Our pattern match now "works" except for the empty string. The requirement might be considered vague and allow for such a match. Many applications will have an input text box that starts out blank. When the user enters characters, the text is then validated. These apps usually test for empty text explicitly. Case in point, the documentation for the ASP.NET RegularExpressionValidator states that validation succeeds when the input is empty, i.e., empty strings will PASS. It is up to the programmer to require some input (for example, with a RequiredFieldValidator).

By the way, RegularExpressionValidator does pattern matching validation on both the client and the server. On the client, it uses JScript Regular Expression syntax, which supports a smaller feature set than the .NET syntax used on the server. Its client-side check also uses the same construction as the second example:


if(match != null && (match[0] == value)) // valid
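For comparison, here is a rough C# equivalent of that check, including the pass-on-empty behavior. This is only a sketch under my own assumptions: ValidatorStyleCheck, IsValid, and IsValidAndRequired are hypothetical names, not part of ASP.NET, and the real validator may differ in details such as how it treats whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

static class ValidatorStyleCheck
{
    // Empty input passes; otherwise the first match must equal the whole input.
    public static bool IsValid(string value, string pattern)
    {
        if (value.Length == 0)
            return true;
        Match m = Regex.Match(value, pattern);
        return m.Success && m.Value == value;
    }

    // Adds the "some input is required" rule that the pattern check leaves out.
    public static bool IsValidAndRequired(string value, string pattern)
    {
        return value.Length > 0 && IsValid(value, pattern);
    }

    static void Main()
    {
        string pattern = @"[\da-zA-Z]*";
        Console.WriteLine(IsValid("", pattern));            // True
        Console.WriteLine(IsValidAndRequired("", pattern)); // False
        Console.WriteLine(IsValid("abc 123", pattern));     // False
        Console.WriteLine(IsValid("abc123", pattern));      // True
    }
}
```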


If you use other tools, you must know how the validity test is done. With a check built like the last example, where the first match must equal the whole input, patterns that rely only on look-ahead or look-behind will often fail. They look perfectly valid, and they match using IsMatch(), but look-around is zero-width: it does not consume characters, so the matched value comes up shorter than the input. You have to add some additional pattern to consume those characters that look-around does not consume. We'll get into that in a future article.
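As a small preview of the look-around problem, the sketch below uses a hypothetical rule, "must contain at least one digit", written first as a bare look-ahead and then with a consuming part added. The class name, the helper name, and both patterns are illustrative assumptions of mine.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookAroundCheck
{
    // The "first match must equal the whole input" style of validity test.
    public static bool FirstMatchIsWholeInput(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success && m.Value == input;
    }

    static void Main()
    {
        string lookOnly = @"^(?=.*\d)";               // look-ahead only, consumes nothing
        string consuming = @"^(?=.*\d)[\da-zA-Z]+$";  // same test, then consume the input

        Console.WriteLine(Regex.IsMatch("abc1", lookOnly));            // True
        Console.WriteLine(FirstMatchIsWholeInput("abc1", lookOnly));   // False, the match is empty
        Console.WriteLine(FirstMatchIsWholeInput("abc1", consuming));  // True
        Console.WriteLine(FirstMatchIsWholeInput("abcd", consuming));  // False, no digit present
    }
}
```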

Now the pattern could have been written differently. With the next pattern, you can use any of the three methods to validate that the string contains only letters and digits. This time, I also require that the input string not be blank.

"^[\da-zA-Z]+$"
^ = match the beginning of string/line, zero width pattern
$ = match the end of the string/line, zero width pattern
[]+ = 1 or more of the characters in the brackets

Example 3:

string pattern2 = @"^[\da-zA-Z]+$";
Regex rx2 = new Regex(pattern2); // named rx2 so it does not clash with rx from Example 2
foreach (string test in tests)
{
    // Run the match once and reuse the result for the two Match-based checks.
    Match m = rx2.Match(test);
    Console.WriteLine("Modified Match:\t{0}\t({1})", m.Success && m.Value == test, test);
    Console.WriteLine("IsMatch:\t\t{0}\t({1})", rx2.IsMatch(test), test);
    Console.WriteLine("Match:\t\t\t{0}\t({1})", m.Success, test);
}

...outputting...

Modified Match: True (containsOnlyLettersAnd01234)
IsMatch: True (containsOnlyLettersAnd01234)
Match: True (containsOnlyLettersAnd01234)
Modified Match: False (contains letters And 01234, but also spaces)
IsMatch: False (contains letters And 01234, but also spaces)
Match: False (contains letters And 01234, but also spaces)
Modified Match: False (!@#$%)
IsMatch: False (!@#$%)
Match: False (!@#$%)
Modified Match: False ()
IsMatch: False ()
Match: False ()
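One caveat about the $ anchor is worth knowing if a stray line break must also be rejected. In .NET, without RegexOptions.Multiline, $ matches at the very end of the string and also just before a final newline, while \z matches only at the very end. A minimal sketch, with class and method names of my own choosing:

```csharp
using System;
using System.Text.RegularExpressions;

static class EndAnchorCaveat
{
    // Uses $, which also accepts a position just before a trailing "\n".
    public static bool ValidWithDollar(string input)
    {
        return Regex.IsMatch(input, @"^[\da-zA-Z]+$");
    }

    // Uses \z, which accepts only the true end of the string.
    public static bool ValidWithSlashZ(string input)
    {
        return Regex.IsMatch(input, @"^[\da-zA-Z]+\z");
    }

    static void Main()
    {
        string withNewline = "abc123\n";
        Console.WriteLine(ValidWithDollar(withNewline)); // True
        Console.WriteLine(ValidWithSlashZ(withNewline)); // False
        Console.WriteLine(ValidWithSlashZ("abc123"));    // True
    }
}
```

The "Modified Match" approach would reject that input even with $, because the matched value leaves out the newline and so differs from the input.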


In other words, the pattern and the approach must work together. When you have control of both, then solving the problem is easier. But as you can see in the last example, all 3 approaches can validate according to our requirements by tweaking the pattern. The trick is often to find the tweak that works in all cases. When you can't change the underlying programming, like in the case of RegularExpressionValidator, you have to be able to write your pattern to "match" the underlying approach.
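One tweak that often works is to wrap the whole pattern in anchors. The sketch below assumes a hypothetical Anchor helper and a deliberately awkward sample pattern, a|ab. It also assumes the pattern does not end in a comment under RegexOptions.IgnorePatternWhitespace, which would swallow the closing anchor.

```csharp
using System;
using System.Text.RegularExpressions;

static class WholeStringPattern
{
    // Hypothetical helper: wraps a pattern so it can only match the entire input.
    // The (?: ) group keeps any top-level alternation inside the anchors.
    public static string Anchor(string pattern)
    {
        return "^(?:" + pattern + ")$";
    }

    // The "first match must equal the whole input" check from Example 2.
    public static bool FirstMatchIsWholeInput(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success && m.Value == input;
    }

    static void Main()
    {
        string loose = "a|ab"; // sample pattern chosen to show the problem

        // Unanchored, the first alternative wins and the match stops at "a".
        Console.WriteLine(FirstMatchIsWholeInput("ab", loose));         // False
        // Anchored, the engine backtracks into the second alternative.
        Console.WriteLine(FirstMatchIsWholeInput("ab", Anchor(loose))); // True
        // IsMatch now agrees with the stricter check as well.
        Console.WriteLine(Regex.IsMatch("abc", Anchor(loose)));         // False
    }
}
```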

Now these are simple examples and I would have liked to get into some real meaty Regex expressions, but I've run out of time, and this posting is late. I'll be back though with more on Regular Expressions in the next article. For now, I hope this is of some use to you.
