portable perspectives

a blog about programming, life, universe and everything

Spell Corrector in AliceML

It seems that writing a spell correcor is becoming my new post hello world program, so well, after having this in my hard disk for a while I thought I would publish my new solution, wrote using AliceML.

I'm really not good at ML, so this code must not be considered a good example of how to code, but I thing it's a good thing to keep track of my learning :)


import structure MkRedBlackSet  from "x-alice:/lib/data/MkRedBlackSet";
import structure MkRedBlackMap from "x-alice:/lib/data/MkRedBlackMap";

structure Set = MkRedBlackSet String;
structure Map = MkRedBlackMap String;

val wordlist = "/home/rff/wordlist.txt";
val alphabet = String.explode "abcdefghijklmnopqrstuvwxyz";

open List;

fun take2(i1,i2) = (fn cl => (take (cl,i1), drop (cl,i2)))

fun delete cl i = 
  let
    val (first, second) = take2 (i,i+1) cl
  in
    String.implode (first @ second)
  end

fun deletions cl = tabulate ((length cl), (delete cl));


fun substitute cl i = 
  let 
    val (first,second) = take2 (i, i+1) cl
    fun sub_char c = String.implode (first @ [c] @ second)
  in  
    map sub_char alphabet
  end 

fun substitutions cl =  concat( tabulate ((length cl), (substitute cl)));

fun insert cl i = 
  let 
    val (first, second) = take2 (i,i) cl
    fun ins_char c = String.implode (first @ [c] @ second)
  in  
    map ins_char alphabet
  end 

fun insertions cl =  concat( tabulate ((length cl)+1, (insert cl)));

fun transpose cl i = 
  let 
    val (first,second) = take2 (i,i+2) cl
    val c1 = nth (cl,i)
    val c2 = nth (cl,i+1)
  in  
    String.implode (first @ [c2] @[c1] @ second)
  end 

fun transpositions cl =  tabulate ((length cl)-1, (transpose cl));


fun edits1 word = 
  let 
    val w = String.explode word 
    val ws = deletions w @ transpositions w @ insertions w @  substitutions w
  in 
    foldl (Fn.flip Set.insert) (Set.empty) ws
  end 

fun insertOrUpdate map key = 
  case Map.lookup (map, key) of
     NONE => Map.insert (map, key, ref 1)
    |SOME value => (value := (!value +1) ; map)


(* train *)
fun train io =
  let 
    val chop = fn str => String.substring (str, 0, (String.size str) - 1)
  in
    case TextIO.inputLine io of
       NONE => Map.empty
      |SOME line => insertOrUpdate (train io) (chop line)
    end  

val nwords = train ( TextIO.openIn wordlist);

fun known_edits2  word = 
  let 
    fun set_edits set = Set.fold (fn (str,set) => Set.union (set, known (edits1 str))) Set.empty set
  in
    set_edits (edits1 word)
  end

fun look k = !(valOf (Map.lookup  (nwords, k)));


fun correct word =
  let 
    val candidates = find (not o Set.isEmpty) 
                           [ known (Set.fromList [word]),
                             known (edits1 word),
                             known_edits2 word]
     val sort = sort (fn (a,b)=>Int.compare ((look b), (look a))) o Set.toList
   in
     case candidates of
       SOME words => hd(sort words)
      |NONE     => word
   end  

correct "ciao" (*ciao match*);
correct "cao"  (*ciao edit 1*);
correct "mioo" (*miao edit 1*);
correct "miaoo"(*miao edit 1*);
correct "mooo" (*mao  edit 2 max value*);
correct "quux" (*quux no match*);

The worse part is the string manipulation, I think, and maybe the fact that I use references (aka mutable variables) when building the frequency count Map, but in the end I'm quite satisfied. Feel free to point out obvious errors, if you see them.

See all comments

A spell Corrector in perl6 part 3

(See part 1 and part 2 for explanations of the following code).

Now, this code should work. If I copy and paste it into pugs it works as expected. If I run it it has some failures in the final tests for correct() that I can't explain.

The part in the beinning shows how to count different words in a file, it depends on proper handling of unicode in pugs, so it may or may not work at the moment,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80

sub words($file) { slurp($file).lc.comb(/<alpha>+/) }

sub train(@words) {
  my %res;
  for @words -> $w { %res{$w}++ }
  %res
}


#my %NWORDS = train(words('/home/rff/Desktop/big.txt'));
my %NWORDS={'ciao'=>4,'c'=>3,'cibo'=>1,'ciaao'=>1,'ccc'=>1,'cia'=>1};

my @ALPHA = 'a'..'z';


# 'abc' -> 'ac'
sub deletion($word) {
  (^$word.chars).map: {substr(my $tmp = $word,$_,1)='';$tmp};
}

# 'abc' -> 'adc'
sub substitution($word) {
  gather {
    for (0..$word.chars-1) X @ALPHA {
      substr(my $tmp = $word,$_[0],1)=$_[1];
      take $tmp;
    }
  }
}

# 'abc' -> 'abbc'
sub insertion($word) {
  gather {
    for (0..$word.chars) X @ALPHA {
      substr(my $tmp = $word,$_[0],0)=$_[1];
      take $tmp;
     }
  } 
}

# 'abc' -> 'acb'
sub transposition($w) {
  gather for ^$w.chars {
    my $tmp=$w;
    my $removed =(substr($tmp,$_,1)='');
    substr($tmp,$_+1,0)=$removed;
    take $tmp;
  }
}

sub edits1($w) {
  # all these are different, no need to use a set
  transposition($w),insertion($w),substitution($w),deletion($w)
}

    
sub known_edits2($words) { 
  my @ary = gather {
    for edits1($words) -> $e1 {
      for edits1($e1) -> $e2 {
        take $e2 if %NWORDS{$e2} 
      }
    }
  }
  any(@ary).values
}

sub known(@words) { 
  gather for @words {take $_ if %NWORDS{$_}} ;
}

sub correct($w) {
  my @values = known([$w]) or known(edits1($w)) or known_edits2($w) or [$w];
  # single argument max() doesn't work yet
  say @values.perl;
  @values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}

}

See all comments

A spell Corrector in perl6 part 2

In the first part we saw how we could write routines to generate string mutations by using perl6's regexen, but we also noticed that the implementation won't run on pugs, yet.

So, here is an alternative implementation of the same methods, based on Str.substr. The only thing to keep in mind is that this method does side effet modifying the current string, so in each case we have to create a copy (which for strings in perl we can do via simple assignment). The code should be quite self-explaining:


# 'abc' -> 'ac'
sub deletion($word) {
  (^$word.chars).map: {substr(my $tmp = $word,$_,1)=''; $tmp};
}

# 'abc' -> 'adc'
sub substitution($word) {
  gather {
    for (0..$word.chars-1) X @ALPHA {
      substr(my $tmp = $word,$_[0],1)=$_[1];
      take $tmp;
    }
  }
}

# 'abc' -> 'abbc'
sub insertion($word) {
  gather {
    for (0..$word.chars) X @ALPHA {
      substr(my $tmp = $word,$_[0],0)=$_[1];
      take $tmp;
     }
  } 
}

# 'abc' -> 'acb'
sub transposition($w) {
  gather for ^$w.chars {
    my $tmp=$w;
    my $removed =(substr($tmp,$_,1)='');
    substr($tmp,$_+1,0)=$removed;
    take $tmp;
  }
}

Remember that the value returned by the function is just the last computed one, so we always return implicitly the value of the gather (or of mao in the first function.

Now we can define a edits1 function that returns all the possible 1-letter variations of a word like this:


sub edits1($w) {
  transposition($w),insertion($w),substitution($w),deletion($w)
}

the comma operator concatenates the arrays so edits1 simply returns a big array of all the variations.

If you compare this code with norvig's you will notice that in edits1 he is already using Sets while we keep using simple lists. Since perl6 doesn't provide a builtin Set type we will use the any() junction which provide basically the same behaviour (i.e. we can check for the presence of an object, there are no duplicates and we can iterate over it).

Now we need a known(@array) routine that selects the elements of the list that actually exist. Rmember that we have an Hash variable %WORDS[word]=>number that we have initialized somehow with the existing english words and their frequency.

Our function is thus very easy:


sub known(@words) { 
  @words.grep:  { %NWORDS{$_}
}

The correct($word) function then becomes:


sub correct($word) {
  my @values =  known([$word]) or             # the word is known
                              known(edits1($word)) or # a single letter variation is known
                              [$word];                                # we don't have suggestions

  # single argument max() doesn't work yet
  @values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}
}

The last line coud be prettier, but I have the feeling the code is quite nice and clean.

For the last bit of code please wait a little bit more, but hey, it should be easy enough for you to write :)

See all comments

A spell Corrector in perl6 part 1

Back to blogging! Ok, I'll try to bring this blog back to his old glory^W^W^Wnormal activity, maybe not too much posting, but enough to be able to say that I actually have a blog in english.

At the moment I'm doing a little bit of perl6 hacking just to get familiar with all the new cool stuff that Larry Wall has been putting inside the language. I think that perl6 could be a lot simpler and use less punctuation, but I believe that a lot of the stuff that is being put into the design is extremely intersting.

So, after solving a couple of the 99problems I looked for something slightly bigger, and settled on a port of Norvig's spell corrector.

The main idea of the thing is quite simple: you build a huge database of existing words and their frequency, then correct("word") simply looks if "word" is in that database, or if there is a variation of it, or a variation of a variation, and returns the one with maximum frequency.

If nothing is found we surrender and return "word" unchanged.

The possible variations are:

  • insertion: 'helo' -> 'hello'
  • substitution: 'edd' -> 'odd'
  • deletion: 'housse' -> 'house'
  • transposition: ''hots' -> 'host'

In the python code these are implemented as list comprehensions plus handling string as arrays. Each variation is stored inside a Set object, which is an enumerable object that doesn't hold duplicates.

In perl6 the we could implement each of the operations easily with Str.subst(regexp,replacement), use gather+for for the iteration and result extraction, and use Junction objects in place of Sets.

Let's look at the for operations in detail:

1
2
3
4
# 'abc' -> 'ac'
sub deletion($w) {
    gather take $w.subst(/<at($_)>./,'') for ^$w.chars
}

The gather sets a kind of scope where take can operate. Each call to take puts the argument into an impicit array that is then returnd by gather. The for statement modifer iterates over an iterable object. The object is built from ^ which is the upto operator, which builds a range 0..N from a number N.

The regex is quite simple, too: at(N) is a zero-width assertion that matches at the N-th element, then we capture it with "." and replace it with nothing, deleting it.

Doing substitution is simple too:

1
2
3
4
# 'abc' -> 'adc'
sub substitution($w) {
    gather take $w.subst(/<at($_[0])>./,$_[1]) for $w.chars X @ALPHA
}

here the only difference is that we use the "X" operator, or cross opertator. The cross operator builds a cartesian product of two iterables, so iterating over 'a'..b' X 1..2 menas iterate over (('a',1),('a',2),('b',1),('b',2)). The inner code is then: replace N-th char with each one in the alphabet.

Insertion, on the same line, does the same thing withouth deleting a character, and doing one more iteration to add a character after the end of the string:

1
2
3
4
# 'abc' -> 'abbc'
sub insertion($w) {
    gather take $w.subst(/<at($_[0])>/,$_[1]) for ^($w.chars+1) X @ALPHA
}

Finally, trnasposition ues a slightly different form of subst, by passing a block, because it needs access to the capture objects:

1
2
3
4
# 'abc' -> 'acb'
sub transposition($w) {
    gather take $w.subst(/<at($_)>(.)(.)/,{"$1$0"})  for ^$w.chars
}

Basically just take chars N and N+1, and swap them.

Quite simple and nice implementations, in my opinion. The only problem is that at the moment pugs does not allow interpolation of variables in regex, so all this won't work.

For a working implementation, wait for part two of this article, or try to come up with your own :)

See all comments

AddThis Social Bookmark Button