Writing a Shakespeare interpreter with Parrot
During the winter holidays I thought it would be fun to start learning something more about Parrot, and see how hard it would be to write an interpreter using it.
I thought that it could been nice to try implementing something simple but fun, and so I decided to try with the Shakespeare Programming Language (SPL).
Shakespeare is a beautiful language, as you can see from the code for printing fibonacci's numbers and the task proved interesting. Since parrot provides tools to generate a language skeleton it was damn fast to be up and running.
Parsing
Writing the grammar with PGE is pretty easy, for example this is the code to parse constants in the form "a pretty lady"
rule immediate {
<article>? [<adjective> ]* <noun> {*}
}
token noun {
|<positive_noun>
|<negative_noun>
|<neutral_noun>
|<nothing>
}
token adjective {
|<positive_adjective>
|<neutral_adjective>
|<negative_adjective>
|<first_person_possessive>
|<second_person_possessive>
|<third_person_possessive>
}
(the grammatical parts (nouns, adjectives etc) were easily generated from a script and some wordlists)
The only problem was understanding that rules and tokens in PGE have the "ratchet" property of never backtracking and that longest-token-matching is (still?) not working in PGE. This mean that for example the string "an empty house" would not parse correctly when the adjective rule is ['a' | 'the' | 'an'] because 'a' would match and the rule would never backtrack. Some reordering fixed the issue.
Generating the AST is again pretty easy: you basically associate a callback method in the actions file to each rule that you want to transform and use the standard PAST::* classes to generate it, which can be as simple as this
# generating constants
method immediate($/) {
my $value := 1;
if $<noun><negative_noun> {
$value := -1;
}
elsif $<noun><nothing> {
$value := 0;
}
for $<adjective> {
# "your" and "my" ignored
unless $_<first_person_possessive> ||
$_<second_person_possessive> ||
$_<third_person_possessive> {
$value:= $value*2;
}
}
make PAST::Val.new( :value( $value ) );
}
method test($/, $k) {
my $test := PAST::Var.new(:name('condition'));
make PAST::Op.new( $test,
$($<sentence>),
:pasttype($k)); # $k = "if" or "unless"
}
The actions file is written in Not Quite Perl which is a pretty basic language although it lacks sme things that would be useful in general IMHO, such list flattening. If needed though they can be used by associating the NQP classes with classes of another parrot language, such as perl6's or ruby's or lua's.
Another small issue I had was trying to conflate capture of single and multiple occurrences of a string, such as
|'exit' <character>**{1} {*} #= exit
|'exeunt' <character> ['and' <character>]+ {*} #= exit
without explicit quantification (**{1}), the action method would get a scalar reference for the first case and a list reference for the second, and since I cannot write @($character) (NQP does not know about this) I needed an additional conditional in the action code. By explicitly quantifying I always get a list and everything is nice.
The runtime
Basic runtime functions can be written using the PIR intermediate representation, a form of high level assembly that knows about objects, namespaces, exceptions and more. Some of the code is slightly involuted because of the shakespeare language nature: most operations actually refer to both the speaker and the other character currently on stage, but it is usually quite simple to follow
.sub 'enter'
.param string char
get_global $P0, char
unless null $P0 goto fin
die "no such character in the cast!"
fin:
$P0['onstage'] = 1
.end
I also tweaked a little the main routine (autogenerated) to add an additional stage to the compiler. Namely, since I could not set a global "case insensitive" flag for the grammar, I introduced a pre-parsing pass to transform the input in lower case. There are probably better ways to do this, but it seems to work.
Testing
Obviously, I needed to test this. Parrot provides some tools to manage this, namely checking that the compiler outputs the right things at each stage, or that the program output is the one expected. I decided to go through another route, again wonderfully supported, of supporting TAP natively in the language.
This means you have test support builtin in the shakespeare language, bringing it up to date to the best software engineering practices.
This was admittedly extremely easy: I just added some syntax and a couple of runtime functions to print the plan ("1...10") and the test result ("ok 2", "ok 3" etc). Parrot's test harness knows how to manage that and so everything works fine.
Since I did not want to copy the existing source code I wrote everything from scratch (bar the wordlists) and used the supplied examples as reference. So I also added those to the test suite.
It seems, though, that parrot's Parrot::Test does not manage program input, so I used a ruby script as a driver. Which gets actually executed from perl, who knows about the "#!/usr/bin/env ruby" header better than me.
I was trying to port it to perl, but I'm not satisfied with the outcome, I must investigate a bit more
Conclusions
You can get the code through bitbucket, and it should work with today's (january '09) Parrot. All in all this was fun, people in #parrot and #perl6 are nice and helpful, parrot already offers great support for writing your own language and, well, you should try it too :)
Oh, and I passed my last university exam. too :D
See all commentsA spell Corrector in perl6 part 3
(See part 1 and part 2 for explanations of the following code).
Now, this code should work. If I copy and paste it into pugs it works as expected. If I run it it has some failures in the final tests for correct() that I can't explain.
The part in the beinning shows how to count different words in a file, it depends on proper handling of unicode in pugs, so it may or may not work at the moment,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 |
sub words($file) { slurp($file).lc.comb(/<alpha>+/) }
sub train(@words) {
my %res;
for @words -> $w { %res{$w}++ }
%res
}
#my %NWORDS = train(words('/home/rff/Desktop/big.txt'));
my %NWORDS={'ciao'=>4,'c'=>3,'cibo'=>1,'ciaao'=>1,'ccc'=>1,'cia'=>1};
my @ALPHA = 'a'..'z';
# 'abc' -> 'ac'
sub deletion($word) {
(^$word.chars).map: {substr(my $tmp = $word,$_,1)='';$tmp};
}
# 'abc' -> 'adc'
sub substitution($word) {
gather {
for (0..$word.chars-1) X @ALPHA {
substr(my $tmp = $word,$_[0],1)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'abbc'
sub insertion($word) {
gather {
for (0..$word.chars) X @ALPHA {
substr(my $tmp = $word,$_[0],0)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'acb'
sub transposition($w) {
gather for ^$w.chars {
my $tmp=$w;
my $removed =(substr($tmp,$_,1)='');
substr($tmp,$_+1,0)=$removed;
take $tmp;
}
}
sub edits1($w) {
# all these are different, no need to use a set
transposition($w),insertion($w),substitution($w),deletion($w)
}
sub known_edits2($words) {
my @ary = gather {
for edits1($words) -> $e1 {
for edits1($e1) -> $e2 {
take $e2 if %NWORDS{$e2}
}
}
}
any(@ary).values
}
sub known(@words) {
gather for @words {take $_ if %NWORDS{$_}} ;
}
sub correct($w) {
my @values = known([$w]) or known(edits1($w)) or known_edits2($w) or [$w];
# single argument max() doesn't work yet
say @values.perl;
@values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}
}
|
A spell Corrector in perl6 part 2
In the first part we saw how we could write routines to generate string mutations by using perl6's regexen, but we also noticed that the implementation won't run on pugs, yet.
So, here is an alternative implementation of the same methods, based on Str.substr. The only thing to keep in mind is that this method does side effet modifying the current string, so in each case we have to create a copy (which for strings in perl we can do via simple assignment). The code should be quite self-explaining:
# 'abc' -> 'ac'
sub deletion($word) {
(^$word.chars).map: {substr(my $tmp = $word,$_,1)=''; $tmp};
}
# 'abc' -> 'adc'
sub substitution($word) {
gather {
for (0..$word.chars-1) X @ALPHA {
substr(my $tmp = $word,$_[0],1)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'abbc'
sub insertion($word) {
gather {
for (0..$word.chars) X @ALPHA {
substr(my $tmp = $word,$_[0],0)=$_[1];
take $tmp;
}
}
}
# 'abc' -> 'acb'
sub transposition($w) {
gather for ^$w.chars {
my $tmp=$w;
my $removed =(substr($tmp,$_,1)='');
substr($tmp,$_+1,0)=$removed;
take $tmp;
}
}
Remember that the value returned by the function is just the last computed one, so we always return implicitly the value of the gather (or of mao in the first function.
Now we can define a edits1 function that returns all the possible 1-letter variations of a word like this:
sub edits1($w) {
transposition($w),insertion($w),substitution($w),deletion($w)
}
the comma operator concatenates the arrays so edits1 simply returns a big array of all the variations.
If you compare this code with norvig's you will notice that in edits1 he is already using Sets while we keep using simple lists. Since perl6 doesn't provide a builtin Set type we will use the any() junction which provide basically the same behaviour (i.e. we can check for the presence of an object, there are no duplicates and we can iterate over it).
Now we need a known(@array) routine that selects the elements of the list that actually exist. Rmember that we have an Hash variable %WORDS[word]=>number that we have initialized somehow with the existing english words and their frequency.
Our function is thus very easy:
sub known(@words) {
@words.grep: { %NWORDS{$_}
}
The correct($word) function then becomes:
sub correct($word) {
my @values = known([$word]) or # the word is known
known(edits1($word)) or # a single letter variation is known
[$word]; # we don't have suggestions
# single argument max() doesn't work yet
@values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}
}
The last line coud be prettier, but I have the feeling the code is quite nice and clean.
For the last bit of code please wait a little bit more, but hey, it should be easy enough for you to write :)
See all commentsA spell Corrector in perl6 part 1
Back to blogging! Ok, I'll try to bring this blog back to his old glory^W^W^Wnormal activity, maybe not too much posting, but enough to be able to say that I actually have a blog in english.
At the moment I'm doing a little bit of perl6 hacking just to get familiar with all the new cool stuff that Larry Wall has been putting inside the language. I think that perl6 could be a lot simpler and use less punctuation, but I believe that a lot of the stuff that is being put into the design is extremely intersting.
So, after solving a couple of the 99problems I looked for something slightly bigger, and settled on a port of Norvig's spell corrector.
The main idea of the thing is quite simple: you build a huge database of existing words and their frequency, then correct("word") simply looks if "word" is in that database, or if there is a variation of it, or a variation of a variation, and returns the one with maximum frequency.
If nothing is found we surrender and return "word" unchanged.
The possible variations are:
- insertion: 'helo' -> 'hello'
- substitution: 'edd' -> 'odd'
- deletion: 'housse' -> 'house'
- transposition: ''hots' -> 'host'
In the python code these are implemented as list comprehensions plus handling string as arrays. Each variation is stored inside a Set object, which is an enumerable object that doesn't hold duplicates.
In perl6 the we could implement each of the operations easily with Str.subst(regexp,replacement), use gather+for for the iteration and result extraction, and use Junction objects in place of Sets.
Let's look at the for operations in detail:
1 2 3 4 |
# 'abc' -> 'ac'
sub deletion($w) {
gather take $w.subst(/<at($_)>./,'') for ^$w.chars
}
|
The gather sets a kind of scope where take can operate. Each call to take puts the argument into an impicit array that is then returnd by gather. The for statement modifer iterates over an iterable object. The object is built from ^ which is the upto operator, which builds a range 0..N from a number N.
The regex is quite simple, too: at(N) is a zero-width assertion that matches at the N-th element, then we capture it with "." and replace it with nothing, deleting it.
Doing substitution is simple too:
1 2 3 4 |
# 'abc' -> 'adc'
sub substitution($w) {
gather take $w.subst(/<at($_[0])>./,$_[1]) for $w.chars X @ALPHA
}
|
here the only difference is that we use the "X" operator, or cross opertator. The cross operator builds a cartesian product of two iterables, so iterating over 'a'..b' X 1..2 menas iterate over (('a',1),('a',2),('b',1),('b',2)). The inner code is then: replace N-th char with each one in the alphabet.
Insertion, on the same line, does the same thing withouth deleting a character, and doing one more iteration to add a character after the end of the string:
1 2 3 4 |
# 'abc' -> 'abbc'
sub insertion($w) {
gather take $w.subst(/<at($_[0])>/,$_[1]) for ^($w.chars+1) X @ALPHA
}
|
Finally, trnasposition ues a slightly different form of subst, by passing a block, because it needs access to the capture objects:
1 2 3 4 |
# 'abc' -> 'acb'
sub transposition($w) {
gather take $w.subst(/<at($_)>(.)(.)/,{"$1$0"}) for ^$w.chars
}
|
Basically just take chars N and N+1, and swap them.
Quite simple and nice implementations, in my opinion. The only problem is that at the moment pugs does not allow interpolation of variables in regex, so all this won't work.
For a working implementation, wait for part two of this article, or try to come up with your own :)
See all comments