A spell Corrector in perl6 part 3
(See part 1 and part 2 for explanations of the following code).
Now, this code should work. If I copy and paste it into pugs it works as expected. If I run it it has some failures in the final tests for correct() that I can't explain.
The part in the beinning shows how to count different words in a file, it depends on proper handling of unicode in pugs, so it may or may not work at the moment,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 |
sub words($file) { slurp($file).lc.comb(/<alpha>+/) }
sub train(@words) {
my %res;
for @words -> $w { %res{$w}++ }
%res
}
#my %NWORDS = train(words('/home/rff/Desktop/big.txt'));
my %NWORDS={'ciao'=>4,'c'=>3,'cibo'=>1,'ciaao'=>1,'ccc'=>1,'cia'=>1};
my @ALPHA = 'a'..'z';
# 'abc' -> 'ac'
sub deletion($word) {
(^$word.chars).map: {substr(my $tmp = $word,$_,1)='';$tmp};
}
# 'abc' -> 'adc'
sub substitution($word) {
gather {
for (0..$word.chars-1) X @ALPHA {
substr(my $tmp = $word,$_[0],1)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'abbc'
sub insertion($word) {
gather {
for (0..$word.chars) X @ALPHA {
substr(my $tmp = $word,$_[0],0)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'acb'
sub transposition($w) {
gather for ^$w.chars {
my $tmp=$w;
my $removed =(substr($tmp,$_,1)='');
substr($tmp,$_+1,0)=$removed;
take $tmp;
}
}
sub edits1($w) {
# all these are different, no need to use a set
transposition($w),insertion($w),substitution($w),deletion($w)
}
sub known_edits2($words) {
my @ary = gather {
for edits1($words) -> $e1 {
for edits1($e1) -> $e2 {
take $e2 if %NWORDS{$e2}
}
}
}
any(@ary).values
}
sub known(@words) {
gather for @words {take $_ if %NWORDS{$_}} ;
}
sub correct($w) {
my @values = known([$w]) or known(edits1($w)) or known_edits2($w) or [$w];
# single argument max() doesn't work yet
say @values.perl;
@values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}
}
|
A spell Corrector in perl6 part 2
In the first part we saw how we could write routines to generate string mutations by using perl6's regexen, but we also noticed that the implementation won't run on pugs, yet.
So, here is an alternative implementation of the same methods, based on Str.substr. The only thing to keep in mind is that this method does side effet modifying the current string, so in each case we have to create a copy (which for strings in perl we can do via simple assignment). The code should be quite self-explaining:
# 'abc' -> 'ac'
sub deletion($word) {
(^$word.chars).map: {substr(my $tmp = $word,$_,1)=''; $tmp};
}
# 'abc' -> 'adc'
sub substitution($word) {
gather {
for (0..$word.chars-1) X @ALPHA {
substr(my $tmp = $word,$_[0],1)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'abbc'
sub insertion($word) {
gather {
for (0..$word.chars) X @ALPHA {
substr(my $tmp = $word,$_[0],0)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'acb'
sub transposition($w) {
gather for ^$w.chars {
my $tmp=$w;
my $removed =(substr($tmp,$_,1)='');
substr($tmp,$_+1,0)=$removed;
take $tmp;
}
}
Remember that the value returned by the function is just the last computed one, so we always return implicitly the value of the gather (or of mao in the first function.
Now we can define a edits1 function that returns all the possible 1-letter variations of a word like this:
sub edits1($w) {
transposition($w),insertion($w),substitution($w),deletion($w)
}
the comma operator concatenates the arrays so edits1 simply returns a big array of all the variations.
If you compare this code with norvig's you will notice that in edits1 he is already using Sets while we keep using simple lists. Since perl6 doesn't provide a builtin Set type we will use the any() junction which provide basically the same behaviour (i.e. we can check for the presence of an object, there are no duplicates and we can iterate over it).
Now we need a known(@array) routine that selects the elements of the list that actually exist. Rmember that we have an Hash variable %WORDS[word]=>number that we have initialized somehow with the existing english words and their frequency.
Our function is thus very easy:
sub known(@words) {
@words.grep: { %NWORDS{$_}
}
The correct($word) function then becomes:
sub correct($word) {
my @values = known([$word]) or # the word is known
known(edits1($word)) or # a single letter variation is known
[$word]; # we don't have suggestions
# single argument max() doesn't work yet
@values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}
}
The last line coud be prettier, but I have the feeling the code is quite nice and clean.
For the last bit of code please wait a little bit more, but hey, it should be easy enough for you to write :)
See all commentsA spell Corrector in perl6 part 1
Back to blogging! Ok, I'll try to bring this blog back to his old glory^W^W^Wnormal activity, maybe not too much posting, but enough to be able to say that I actually have a blog in english.
At the moment I'm doing a little bit of perl6 hacking just to get familiar with all the new cool stuff that Larry Wall has been putting inside the language. I think that perl6 could be a lot simpler and use less punctuation, but I believe that a lot of the stuff that is being put into the design is extremely intersting.
So, after solving a couple of the 99problems I looked for something slightly bigger, and settled on a port of Norvig's spell corrector.
The main idea of the thing is quite simple: you build a huge database of existing words and their frequency, then correct("word") simply looks if "word" is in that database, or if there is a variation of it, or a variation of a variation, and returns the one with maximum frequency.
If nothing is found we surrender and return "word" unchanged.
The possible variations are:
- insertion: 'helo' -> 'hello'
- substitution: 'edd' -> 'odd'
- deletion: 'housse' -> 'house'
- transposition: ''hots' -> 'host'
In the python code these are implemented as list comprehensions plus handling string as arrays. Each variation is stored inside a Set object, which is an enumerable object that doesn't hold duplicates.
In perl6 the we could implement each of the operations easily with Str.subst(regexp,replacement), use gather+for for the iteration and result extraction, and use Junction objects in place of Sets.
Let's look at the for operations in detail:
1 2 3 4 |
# 'abc' -> 'ac'
sub deletion($w) {
gather take $w.subst(/<at($_)>./,'') for ^$w.chars
}
|
The gather sets a kind of scope where take can operate. Each call to take puts the argument into an impicit array that is then returnd by gather. The for statement modifer iterates over an iterable object. The object is built from ^ which is the upto operator, which builds a range 0..N from a number N.
The regex is quite simple, too: at(N) is a zero-width assertion that matches at the N-th element, then we capture it with "." and replace it with nothing, deleting it.
Doing substitution is simple too:
1 2 3 4 |
# 'abc' -> 'adc'
sub substitution($w) {
gather take $w.subst(/<at($_[0])>./,$_[1]) for $w.chars X @ALPHA
}
|
here the only difference is that we use the "X" operator, or cross opertator. The cross operator builds a cartesian product of two iterables, so iterating over 'a'..b' X 1..2 menas iterate over (('a',1),('a',2),('b',1),('b',2)). The inner code is then: replace N-th char with each one in the alphabet.
Insertion, on the same line, does the same thing withouth deleting a character, and doing one more iteration to add a character after the end of the string:
1 2 3 4 |
# 'abc' -> 'abbc'
sub insertion($w) {
gather take $w.subst(/<at($_[0])>/,$_[1]) for ^($w.chars+1) X @ALPHA
}
|
Finally, trnasposition ues a slightly different form of subst, by passing a block, because it needs access to the capture objects:
1 2 3 4 |
# 'abc' -> 'acb'
sub transposition($w) {
gather take $w.subst(/<at($_)>(.)(.)/,{"$1$0"}) for ^$w.chars
}
|
Basically just take chars N and N+1, and swap them.
Quite simple and nice implementations, in my opinion. The only problem is that at the moment pugs does not allow interpolation of variables in regex, so all this won't work.
For a working implementation, wait for part two of this article, or try to come up with your own :)
See all comments