Spell Corrector in AliceML
It seems that writing a spell correcor is becoming my new post hello world program, so well, after having this in my hard disk for a while I thought I would publish my new solution, wrote using AliceML.
I'm really not good at ML, so this code must not be considered a good example of how to code, but I thing it's a good thing to keep track of my learning :)
import structure MkRedBlackSet from "x-alice:/lib/data/MkRedBlackSet";
import structure MkRedBlackMap from "x-alice:/lib/data/MkRedBlackMap";
structure Set = MkRedBlackSet String;
structure Map = MkRedBlackMap String;
val wordlist = "/home/rff/wordlist.txt";
val alphabet = String.explode "abcdefghijklmnopqrstuvwxyz";
open List;
fun take2(i1,i2) = (fn cl => (take (cl,i1), drop (cl,i2)))
fun delete cl i =
let
val (first, second) = take2 (i,i+1) cl
in
String.implode (first @ second)
end
fun deletions cl = tabulate ((length cl), (delete cl));
fun substitute cl i =
let
val (first,second) = take2 (i, i+1) cl
fun sub_char c = String.implode (first @ [c] @ second)
in
map sub_char alphabet
end
fun substitutions cl = concat( tabulate ((length cl), (substitute cl)));
fun insert cl i =
let
val (first, second) = take2 (i,i) cl
fun ins_char c = String.implode (first @ [c] @ second)
in
map ins_char alphabet
end
fun insertions cl = concat( tabulate ((length cl)+1, (insert cl)));
fun transpose cl i =
let
val (first,second) = take2 (i,i+2) cl
val c1 = nth (cl,i)
val c2 = nth (cl,i+1)
in
String.implode (first @ [c2] @[c1] @ second)
end
fun transpositions cl = tabulate ((length cl)-1, (transpose cl));
fun edits1 word =
let
val w = String.explode word
val ws = deletions w @ transpositions w @ insertions w @ substitutions w
in
foldl (Fn.flip Set.insert) (Set.empty) ws
end
fun insertOrUpdate map key =
case Map.lookup (map, key) of
NONE => Map.insert (map, key, ref 1)
|SOME value => (value := (!value +1) ; map)
(* train *)
fun train io =
let
val chop = fn str => String.substring (str, 0, (String.size str) - 1)
in
case TextIO.inputLine io of
NONE => Map.empty
|SOME line => insertOrUpdate (train io) (chop line)
end
val nwords = train ( TextIO.openIn wordlist);
fun known_edits2 word =
let
fun set_edits set = Set.fold (fn (str,set) => Set.union (set, known (edits1 str))) Set.empty set
in
set_edits (edits1 word)
end
fun look k = !(valOf (Map.lookup (nwords, k)));
fun correct word =
let
val candidates = find (not o Set.isEmpty)
[ known (Set.fromList [word]),
known (edits1 word),
known_edits2 word]
val sort = sort (fn (a,b)=>Int.compare ((look b), (look a))) o Set.toList
in
case candidates of
SOME words => hd(sort words)
|NONE => word
end
correct "ciao" (*ciao match*);
correct "cao" (*ciao edit 1*);
correct "mioo" (*miao edit 1*);
correct "miaoo"(*miao edit 1*);
correct "mooo" (*mao edit 2 max value*);
correct "quux" (*quux no match*);
The worse part is the string manipulation, I think, and maybe the fact that I use references (aka mutable variables) when building the frequency count Map, but in the end I'm quite satisfied. Feel free to point out obvious errors, if you see them.
See all commentsdarcs pull vs git pull
I mean, I'm not going to say that you should use $VCS1 over $VCS2, I'm not proficient enough with either. I just think that:
rff@ut:~/ramaze$ darcs pull
Pulling from \"http://manveru.mine.nu/ramaze\"...
Wed Sep 12 14:52:55 CEST 2007 Michael Fellinger <mail@foo.com>
* Adding path for OSX to tool/tidy and improve readability of the spec for it a bit.
Shall I pull this patch? (1/1) [ynWvpxqadjk], or ? for help: v
[Adding path for OSX to tool/tidy and improve readability of the spec for it a bit.
Michael Fellinger <mail@foo.com>**20070912125255] {
hunk ./lib/ramaze/tool/tidy.rb 18
+ /usr/lib/libtidy.dylib
hunk ./spec/ramaze/tidy.rb 10
- Ramaze::Tool::Tidy.tidy(\"<html></html>\").should =~ %r{<html>\\s+<head>\\s+<meta name=\"generator\" content=\"HTML Tidy (.*?)\" />\\s+<title></title>\\s+</head>\\s+<body></body>\\s+</html>}
+ Ramaze::Tool::Tidy.tidy(\"<html></html>\").
+ should =~ %r{<html>\\s+<head>\\s+<meta name=\"generator\" content=\"HTML Tidy (.*?)\" />\\s+<title></title>\\s+</head>\\s+<body></body>\\s+</html>}
}
Wed Sep 12 14:52:55 CEST 2007 Michael Fellinger <mail@foo.com>
* Adding path for OSX to tool/tidy and improve readability of the spec for it a bit.
Shall I pull this patch? (1/1) [ynWvpxqadjk], or ? for help: y
Finished pulling and applying.
is slightly more clear than:
rff@ut:~/rubinius$ git pull Fetching refs/heads/master from http://git.rubini.us/code using http got 2fb92489ed4a0e58ea3d941a6f2ff8c7b8b44c2d walk 2fb92489ed4a0e58ea3d941a6f2ff8c7b8b44c2d got 6ae85832a699f3048ee260d9091aacc05c686fe4 got 6c0e0e63c0bb5979e2dd2b3df6e511b031656ef4 walk 6c0e0e63c0bb5979e2dd2b3df6e511b031656ef4 Getting alternates list for http://git.rubini.us/code Getting pack list for http://git.rubini.us/code Getting index for pack cf217ebdaa4f745cc827450e9cac23ec258d45f9 Getting pack cf217ebdaa4f745cc827450e9cac23ec258d45f9 which contains cf987cd736618331bf421603c1491030b0fc8e35 got c77671cc8a43049bd61dd22f13fbbbc1ab4f96ec ... many more.. got 933282575047f8924edc6cc426c0a234bdbe16a3 * refs/remotes/origin/master: fast forward to branch \'master\' of http://git.rubini.us/code old..new: e85ecf8..2fb9248 Updating e85ecf8..2fb9248 Fast forward .gitignore | 1 + Makefile | 2 +- bin/sirb.rb | 1 - compiler/bytecode/compiler.rb | 22 +- compiler/bytecode/primitive_names.rb | 2 +- compiler/bytecode/rubinius.rb | 1 + compiler/bytecode/system_hints.rb | 20 +- compiler/translation/local_scoping.rb | 7 +- compiler/translation/normalize.rb | 24 ++ compiler/translation/states.rb | 30 ++- kernel/core/compile.rb | 6 +- kernel/core/compiled_method.rb | 7 +- kernel/core/context.rb | 26 ++- kernel/core/io.rb | 3 + kernel/core/string.rb | 91 ++++++-- lib/bin/compile.rb | 176 ++++++++++++--- lib/bin/sirb.rb | 22 +- lib/ext/syck/build.rb | 6 + lib/ext/syck/rbxext.c | 12 +- lib/stringio.rb | 413 +++++++++++++++++++++++++++++++++ runtime/bootstrap.rba | Bin 29758 -> 29986 bytes runtime/compiler.rba | Bin 79208 -> 81314 bytes runtime/core.rba | Bin 162952 -> 168854 bytes runtime/loader.rbc | Bin 12565 -> 12651 bytes shotgun/lib/cpu.h | 13 +- shotgun/lib/cpu_primitives.c | 18 ++ shotgun/lib/cpu_task.c | 11 + shotgun/lib/instructions.rb | 13 +- shotgun/lib/methctx.c | 5 +- shotgun/lib/methctx.h | 2 +- shotgun/lib/primitives.rb | 75 +++---- spec/mini_rspec.rb | 8 +- 32 files changed, 849 insertions(+), 168 deletions(-) delete mode 120000 bin/sirb.rb create mode 100644 lib/ext/syck/build.rb create mode 100644 lib/stringio.rb
But, hei, maybe it's just me :)
See all comments