A spell Corrector in perl6 part 1
Back to blogging! Ok, I'll try to bring this blog back to his old glory^W^W^Wnormal activity, maybe not too much posting, but enough to be able to say that I actually have a blog in english.
At the moment I'm doing a little bit of perl6 hacking just to get familiar with all the new cool stuff that Larry Wall has been putting inside the language. I think that perl6 could be a lot simpler and use less punctuation, but I believe that a lot of the stuff that is being put into the design is extremely intersting.
So, after solving a couple of the 99problems I looked for something slightly bigger, and settled on a port of Norvig's spell corrector.
The main idea of the thing is quite simple: you build a huge database of existing words and their frequency, then correct("word") simply looks if "word" is in that database, or if there is a variation of it, or a variation of a variation, and returns the one with maximum frequency.
If nothing is found we surrender and return "word" unchanged.
The possible variations are:
- insertion: 'helo' -> 'hello'
- substitution: 'edd' -> 'odd'
- deletion: 'housse' -> 'house'
- transposition: ''hots' -> 'host'
In the python code these are implemented as list comprehensions plus handling string as arrays. Each variation is stored inside a Set object, which is an enumerable object that doesn't hold duplicates.
In perl6 the we could implement each of the operations easily with Str.subst(regexp,replacement), use gather+for for the iteration and result extraction, and use Junction objects in place of Sets.
Let's look at the for operations in detail:
1 2 3 4 |
# 'abc' -> 'ac'
sub deletion($w) {
gather take $w.subst(/<at($_)>./,'') for ^$w.chars
}
|
The gather sets a kind of scope where take can operate. Each call to take puts the argument into an impicit array that is then returnd by gather. The for statement modifer iterates over an iterable object. The object is built from ^ which is the upto operator, which builds a range 0..N from a number N.
The regex is quite simple, too: at(N) is a zero-width assertion that matches at the N-th element, then we capture it with "." and replace it with nothing, deleting it.
Doing substitution is simple too:
1 2 3 4 |
# 'abc' -> 'adc'
sub substitution($w) {
gather take $w.subst(/<at($_[0])>./,$_[1]) for $w.chars X @ALPHA
}
|
here the only difference is that we use the "X" operator, or cross opertator. The cross operator builds a cartesian product of two iterables, so iterating over 'a'..b' X 1..2 menas iterate over (('a',1),('a',2),('b',1),('b',2)). The inner code is then: replace N-th char with each one in the alphabet.
Insertion, on the same line, does the same thing withouth deleting a character, and doing one more iteration to add a character after the end of the string:
1 2 3 4 |
# 'abc' -> 'abbc'
sub insertion($w) {
gather take $w.subst(/<at($_[0])>/,$_[1]) for ^($w.chars+1) X @ALPHA
}
|
Finally, trnasposition ues a slightly different form of subst, by passing a block, because it needs access to the capture objects:
1 2 3 4 |
# 'abc' -> 'acb'
sub transposition($w) {
gather take $w.subst(/<at($_)>(.)(.)/,{"$1$0"}) for ^$w.chars
}
|
Basically just take chars N and N+1, and swap them.
Quite simple and nice implementations, in my opinion. The only problem is that at the moment pugs does not allow interpolation of variables in regex, so all this won't work.
For a working implementation, wait for part two of this article, or try to come up with your own :)
Sorry, comments are closed for this article.