Monday, March 21, 2011

Names of Gaddafi

A question was posted yesterday on StackOverflow asking how to create a regular expression to search for the many names of Libyan leader, Muammar al-Gaddafi. I thought it would be fun to explore this problem using Factor.

Someone helpfully posted an image that demonstrates some of the variations of his full name:


List of Names

One approach would be to list out all his possible names, and then check to see if a given string is in the list:

CONSTANT: names {
    "Gadaffi"
    "Gadafi"
    "Gadafy"
    "Gaddafi"
    "Gaddafy"
    "Gaddhafi"
    "Gadhafi"
    "Gathafi"
    "Ghadaffi"
    "Ghadafi"
    "Ghaddafi"
    "Ghaddafy"
    "Gheddafi"
    "Kadaffi"
    "Kadafi"
    "Kad'afi"
    "Kaddafi"
    "Kadhafi"
    "Kazzafi"
    "Khadaffy"
    "Khadafy"
    "Khaddafi"
    "Qadafi"
    "Qaddafi"
    "Qadhafi"
    "Qadhaafi"
    "Qadhdhafi"
    "Qadthafi"
    "Qathafi"
    "Quathafi"
    "Qudhafi"
}

: gaddafi? ( string -- ? )
    names member? ;

Regular Expressions

If we wanted to speed it up, we could build a regular expression from all of the names (using literals to build the regular expression at compile time):

USE: regexp

: gaddafi? ( string -- ? )
    $[ names "|" join <regexp> ] matches? ;
Note: the first time you call this method it needs to compile the regular expression and is a bit slow. Subsequent calls are much faster.

One problem with that, is that it doesn't take into account that sometimes he is called "Al Qaddafi" or "el-Gaddafi". We could create our own case-insensitive pattern that attempts to capture all the possible variations of his name:

CONSTANT: names-pattern
R/ ((al|el)[-\s]?)?(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aa?ff?[iy]/i

: gaddafi? ( string -- ? )
    names-pattern matches? ;

An advantage of using regular expressions is that it is easy to take a piece of text and normalize all the mentions of his name:

: normalize-gaddafi ( string -- string' )
    names-pattern "Gaddafi" re-replace ;

Soundex

An interesting idea was proposed on the Hacker News discussion to use Soundex to match names. That might look something like this:

USE: soundex

: gaddafi? ( string -- ? )
    soundex { "G310" "K310" "Q310" "Q331" } member? ;

Some disadvantages of this is that it doesn't capture names with prefix "Al" or "El", and misses some names (e.g., "Kazzafi" has a soundex value of "K210", but that would produce a false match against a name like "KOSOFF").

This problem is made even harder if you want to include all the possible variations of his first name (e.g., Muammar, Moamar, Mo'ammar, Mu'amar, Moamma, etc.), and include text that is "FIRSTNAME LASTNAME" or "LASTNAME, FIRSTNAME".

No comments: