Alexander Beletsky's development blog

My profession is engineering

Regex to match dictionary words in a page body

Using a regex is pretty easy in .NET applications. All you need is the Regex class and a basic understanding of regular expression patterns.

My goal was to write code that answers the question: does this particular text contain some words from a dictionary or not? Regular expressions are an obvious choice when you do this type of operation. In my case, I was trying to understand which technology a job offer demands (Cpp, Java or .NET) and whether TDD skill is demanded. To achieve that, I created a set of “matchers”, small classes, each for its own area. The crawler just uses those matchers to get the actual data.

    protected bool MatchToTdd(string description)
    {
      return new TddMatcher().Match(description);
    }

    protected bool MatchToJava(string description)
    {
      return new JavaMatcher().Match(description);
    }

    protected bool MatchToCpp(string description)
    {
      return new CppMatcher().Match(description);
    }

    protected bool MatchToDotNet(string description)
    {
      return new DotNetMatcher().Match(description);
    }


As you can see, I have 4 matchers to cover my requirements: CppMatcher, DotNetMatcher, JavaMatcher and TddMatcher. All of them implement the simple IMatcher interface.

namespace Crawler.Core.Matchers
{
  public interface IMatcher
  {
    bool Match(string input);
  }
}


Now, let’s review a matcher. Because all the matchers do basically the same thing and differ only in their dictionary contents, each one contains a dictionary of target words and delegates the matching to the MatchUtil class. Let’s look at the C++ matcher, for instance.

using System.Collections.Generic;

namespace Crawler.Core.Matchers
{
  public class CppMatcher : IMatcher
  {
    private static IList<string> _patterns = new List<string>()
      {
        "c\\+\\+",
        "cpp",
        "stl",
        "cppunit"
      };

    public bool Match(string input)
    {
      return MatchUtil.Match(input, _patterns);
    }
  }
}


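The dictionary entries are regex fragments, which is why ‘c++’ has to be written as "c\\+\\+". As a small sketch (not something the original code does), Regex.Escape could produce the escaped form from a plain word. The class name EscapeDemo and the word list are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    static void Main()
    {
        // Plain dictionary words, written without manual escaping (illustrative list).
        var words = new[] { "c++", "cpp", ".net" };
        foreach (var word in words)
        {
            // Regex.Escape puts a backslash before regex metacharacters such as '+' and '.'.
            Console.WriteLine(word + " -> " + Regex.Escape(word));
        }
    }
}
```

The escaped ‘.net’ still starts with "\\.", so a check like pattern.StartsWith("\\.") keeps working on escaped words.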

I wanted MatchUtil.Match to be as universal as possible and not depend on the kind of input words. Matching with the word boundary “\b” works perfectly as long as you have simple words like ‘java’, ‘nunit’, ‘tests’ and so on, but my tests started to fail as soon as I tried ‘c++’ or ‘.net’. The reason is that ‘\b’ matches only at a position between a word character (a letter, digit or underscore) and a non-word character or an end of the string. ‘+’ and ‘.’ are not word characters, so there is no boundary after the final ‘+’ of ‘c++’ or before the ‘.’ of ‘.net’ when a space is next to them. That was a problem for me, so I asked StackOverflow for help. I ended up with the following implementation, which I hope will be useful if you do similar stuff.

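using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace Crawler.Core.Matchers
{
  class MatchUtil
  {
    public static bool Match(string input, IList<string> patterns)
    {
      // Patterns are written in lower case, so lower the input as well.
      var lower = input.ToLower();
      foreach (var pattern in patterns)
      {
        // A leading "\b" would require a word character right before an
        // escaped dot (".net"), so such patterns get a lookahead instead.
        // That lookahead is always satisfied before ".", so ".net" also
        // matches inside "asp.net".
        var start = pattern.StartsWith("\\.") ? "(?!\\w)" : "\\b";
        // The trailing "(?!\w)" requires that no word character follows.
        if (Regex.IsMatch(lower, start + pattern + "(?!\\w)"))
        {
          return true;
        }
      }
      return false;
    }
  }
}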


So the static Regex.IsMatch method is used to perform the match.
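To see the difference between “\b” and “(?!\w)” in isolation, here is a minimal sketch that runs the same kinds of patterns against a sample string. The class name BoundaryDemo and the sample text are made up for illustration; the patterns are the ones MatchUtil.Match builds for ‘c++’, ‘.net’ and ‘cpp’.

```csharp
using System;
using System.Text.RegularExpressions;

static class BoundaryDemo
{
    static void Main()
    {
        // Hypothetical sample text, already lower-cased as MatchUtil.Match does.
        var text = "we need c++ and .net developers";

        // A trailing \b fails: '+' and ' ' are both non-word characters,
        // so there is no word boundary between them.
        Console.WriteLine(Regex.IsMatch(text, @"\bc\+\+\b"));         // False

        // (?!\w) only requires that no word character follows.
        Console.WriteLine(Regex.IsMatch(text, @"\bc\+\+(?!\w)"));     // True

        // A leading \b fails before '.': ' ' and '.' are both non-word characters.
        Console.WriteLine(Regex.IsMatch(text, @"\b\.net(?!\w)"));     // False

        // Without the leading \b the pattern matches.
        Console.WriteLine(Regex.IsMatch(text, @"(?!\w)\.net(?!\w)")); // True

        // The trailing lookahead still rejects a longer word.
        Console.WriteLine(Regex.IsMatch("cppcheck", @"\bcpp(?!\w)")); // False
    }
}
```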

That is it. If you see any issues or possible improvements, please let me know. http://github.com/alexbeletsky/TddDemand