Adding the BREAK and CONTINUE statements to the Kandal compiler prototype

 

The name for the pluggable compiler that should emerge from this prototype, is: kandal.

The word 'kandal' means "central" or "middle" in Khmer (Cambodian) language. In comparison to LISP, the main effort in building this compiler, will have gone into the infix to postfix translation. I have been living in Cambodia for the last few years. So, I think this should be an appropriate name. The current implementation in Php should only be a bootstrapping step. That should eventually help silencing the "of all languages in the world, why Php" criticism.

 

1. Download the compiler prototype with break and continue

 

2. Recapitulation of previous blog posts

 

The biggest trouble I have run into, up till now, is the fact that the order of regular expressions in PCRE matters. PCRE does no attempt to match longer patterns, if an alternative specified earlier, has been matched already. For example, ifIdo is an identifier while if is a keyword. The regular expression should not do the following:

 

  • Incorrectly identify the if keyword as an identifier
  • Partially match the if as a keyword in the ifIdo identifier

 

The current regex looks as following:


Show/Hidden php code
View source
 
/(?P<COMMA>,)|(?P<BRACKET_LEFT>\()|(?P<BRACKET_RIGHT>\))|
(?P<CONTINUE>\bcontinue\b)|(?P<BREAK>\bbreak\b)|(?P<WHILE>\bwhile\b)|
(?P<ELSE>\belse\b)|(?P<FOR>\bfor\b)|(?P<IF>\bif\b)|(?P<DO>\bdo\b)|
(?P<WHITESPACE>\s+)|(?P<COMMENT1>\/\/[^\n]*?(?=\n))|(?P<COMMENT2>\/\*.*?\*\/)|
(?P<FALSE>\bfalse\b)|(?P<TRUE>\btrue\b)|(?P<NULL>\bnull\b)|
(?P<STRING_DOUBLE>".*?(?<!\\)")|(?P<STRING_SINGLE>'.*?(?<!\\)')|
(?P<NUMBER>-?\d+(\.\d+)?)|(?P<IDENTIFIER>\b[a-zA-Z_$][a-zA-Z0-9_$]*\b)|
(?P<CBRACKET_LEFT>\{)|(?P<CBRACKET_RIGHT>\})|(?P<NOT>!)|(?P<SEMICOLON>;)|
(?P<GE>\>\=)|(?P<NE>\!\=)|(?P<OR>\|\|)|(?P<LE>\<\=)|(?P<EQ>\=\=)|
(?P<CONCAT>\.)|(?P<MINUS>\-)|(?P<GT>\>)|(?P<DIVIDE>\/)|(?P<AND>&&)|
(?P<ASSIGN>\=)|(?P<PLUS>\+)|(?P<EXPONENTIATE>\^)|(?P<MULTIPLY>\*)|
(?P<LT>\<)|(?P<MODULO>%)/s
length: 761
 

 

I probably need to reform the way in which the regular expressions are specified to PCRE, because it ends up causing errors, depending on the order in which the regular expression alternatives are specified. The order in which the expressions are specified should actually not matter. I am still baffled at the fact that the regex order matters in PCRE.

 

3. What about program correctness?

Program correctness is a very important issue. However, static typing, does not contribute to program correctness. It only contributes to program performance, and even then, only for primitive types. The more complex the type, the less the practice of static typing contributes to performance.

Even though we can infer from the Curry-Howard correspondence that a completely strongly-typed Turing machine would indeed behave as a perfect proof system, and therefore, that every program bug, amounts to a mis-specification or mis-implementation of a type, we also know from Gödel's incompleteness theorems that the laws of nature do not allow us to build such perfect proof system; as soon as our program uses even just basic arithmetic.

Beyond a certain point, any attempt to specify the types in the program more completely or more completely, will not lead to additional program correctness, but only to additional program complexity. The main reason for program bugs is: no-longer-understood complexity. Therefore, there seems to be some kind of maximum attainable correctness. At that point, language features do not help or detract from program correctness any longer.

Ousterhout writes: It might seem that the typeless nature of scripting languages could allow errors to go undetected, but in practice scripting languages are just as safe [or unsafe] as system programming languages.

As we have seen in the C++, Java, and C# histories, the practice of static typing eventually leads to introducing generics. In terms of program correctness, templates are much more part of the problem than part of the solution: Here are three primary drawbacks to the use of templates: compiler support, poor error messages, and code bloat.[...]So the indiscriminate use of templates can lead to code bloat, resulting in excessively large executables.[...] Almost all compilers produce confusing, long, or sometimes unhelpful error messages when errors are detected in code that uses templates. This can make templates difficult to develop.[...] Unfortunately, compilers historically generate somewhat esoteric, long, and unhelpful error messages for this sort of error.

Runaway complexity is the central enemy of program reliability. Generics are clearly a friend of this central enemy.

C++, Java, and C# have kept increasing their complexity without really increasing the expressiveness of the language. They were not able to avoid, however,  to re-introduce scripting-like dynamic typing and LISP-like first-class functions, because that is where the true power lies. Everything else they did, is just a pile of complicating ballast.  Eventually they will end up with very complex languages that are not capable of doing more, or doing faster, that what scripting languages such as Javascript can do out-of-the-box, but then without all the ballast. Adding fix upon fix, to features that were useless in the first place, is not the solution either. Eventually, C++, Java, and C# will simply crash and burn under the weight of the growing useless luggage that they are dragging around with them.

As I mentioned in my previous blog post that, on the road to LISP, we will abstain from doing the following. There will be:

  • no prefix instead of infix notation
  • no pushing recursion as a preferred looping approach
  • no straying from the C-style syntax
  • no Python-style significant whitespace
  • no turning the builtin statements into functions

 

We want to come as close as we can to LISP and to the full power of the lambda calculus, but without rocking the boat.

We do keep, of course, the features that make scripting the flexible beast it truly is: The LISP-like lists of lists, with hash tables being just a particular type of list in which we can lookup efficiently by key. Hash tables are indeed a performance concession:

 

Show/Hidden php code
View source
 
a = {key1:value1, key2:value2, ... keyN:valueN};
 

 

This is a performance hack for:

 

Show/Hidden php code
View source
 
a = [ [key1, value1], [key2, value2], ... [keyN, valueN]];
 

 

The hash table is one of the few concessions in contemporary scripting, made to performance. Just like C generally makes no concessions in terms of performance in order to make things easier, the general rule is that scripting should not make any concessions to flexibility in order to make things faster. Fortunately, the hash table concession does not really make things more complicated.

In line with the Ousterhout philosophy, if speed is the main concern, use C, and not scripting. In all other cases, use scripting. However, this strategy puts a tremendous burden on the scripting-to-C interface, that is, the foreign function interface.

It needs to be convenient to use a C function inside a script. However, the SWIG project shows that interfacing a scripting language with C, takes substantial effort. In the kandal project, I will also make efforts to try to look at the reverse problem too: It needs to be convenient to use scripting functions from C, and from there, in other scripting languages. There is hardly anything more balkanizing in the world of scripting today than the inability to use each others libraries. Just try to "reuse" a Python function from within Perl or the other way around.

 

4. The example program

The following example program uses BREAK and CONTINUE inside existing loop constructs: 


Show/Hidden php code
View source
 
//This is a an example of the BREAK and CONTINUE statements in the FOR, WHILE, and DO WHILE statements
 
for(x=0; x < 10; x=x+1)
{
        if(x==6)
        {
                println('continuing loop at '.x);
                continue;
        }
        else if(x==8)
        {
                println('breaking loop at '.x);
                break;
        }
        println('(for loop) x:'.x);
} 
 
x=0;
while(x < 10)
{
        x=x+1;
        if(x==6)
        {
                println('continuing loop at '.x);
                continue;
        }
        else if(x==8)
        {
                println('breaking loop at '.x);
                break;
        }
        println('(while loop) x:'.x);
}
 
x=0;
do
{
        x=x+1;
        if(x==6)
        {
                println('continuing loop at '.x);
                continue;
        }
        else if(x==8)
        {
                println('breaking loop at '.x);
                break;
        }
        println('(do while loop) x:'.x);
}
while(x < 10);
 

5. The reformed label stack mechanism

The most important change required to implement the BREAK and CONTINUE statements, can be found in the LabelStack mechanism. Every breakable or continuable loop construct must register with the LabelStack mechanism, by jumping to what label the BREAK statement can break out of it , and the CONTINUE statement can continue it. If it is not possible to break or continue the loop construct, the loop construct should simply register a null.

In order to find the label to jump to, the system will descend the label stack and look for the next innermost loop that has registered a non-null value.

The label stack items pushed and popped from the LabelStack now look as following:


Show/Hidden php code
View source
 
class LabelStackItem
{
        var $semantics=null; //for debugging
        var $labelNumber=null;
        var $labelPrefixForBreak=null;
        var $labelPrefixForContinue=null;
 
        function __construct($semantics,$labelNumber,$labelPrefixForBreak,$labelPrefixForContinue)
        {
                $this->semantics=$semantics;
                $this->labelNumber=$labelNumber;
                $this->labelPrefixForBreak=$labelPrefixForBreak;
                $this->labelPrefixForContinue=$labelPrefixForContinue;
        }
}
 

 

The compiler extension function that descends the label stack to find the applicable innermost loop, looks like this:



Show/Hidden php code
View source
 
function labelDescendStack($compiler,$stackItemFieldName)
{
        $labelIndex=lastLabelIndex($compiler);
        $label=lastLabelStackItem($compiler);
        while($label->$stackItemFieldName==null && $labelIndex>=0)
        {
                $labelIndex--;
                $label=$compiler->labelStack[$labelIndex];
        }
        return $label;
}
 

6. No new VM instructions

The BREAK and CONTINUE statements only cause the virtual machine to jump around in the bytecode. The JUMP instruction has already been implemented in the virtual machine earlier on. It is funny to see that the typical built-in language constructs are all just doing some jumping. They are all just proxies for the unconditional JUMP and the conditional IF_TRUE and IF_FALSE stack machine instructions. Since Dijkstra's diatribe in 1968 against the Go To statement, we do no longer (dare to) write those manually. Instead, we use built-in control structures that are guaranteed to nest cleanly.

 

7. The semantic plugins for the BREAK and CONTINUE statements

Using the new label stack mechanism, we can implement the BREAK and CONTINUE statements as following:


Show/Hidden php code
View source
 
<?php
class _Break implements ICompilerSemanticPlugin
{
        function onGeneratingGrammar($generator)
        {
                $generator->addToken('BREAK');
                $generator->addGrammarRule('statement: break_statement');
                $generator->addGrammarRule('break_statement: BREAK SEMICOLON');
        }
 
        function beforeLexing($compiler) 
        {
                $compiler->addLexerKeywordPatternRule(3,'BREAK','break');
        }
 
        function beforeParsing($compiler)
        {
                $compiler->associateRuleNumberForLHSSymbol('BREAK_STATEMENT','break_statement');
 
                $compiler->registerNamedReduceCallback(
                        'BREAK_STATEMENT',
                        function($me,$stackStates)
                        {
                                $label=labelDescendStack($me,'labelPrefixForBreak');
                                if($label==null) $me->error("nothing to break out from");
                                emitJump($me,$stackStates,$label->labelPrefixForBreak,$label->labelNumber);
                        }
                );
        }
}
?>
 

 

The semantic plugin for the CONTINUE statement has pretty much the same structure:

 

Show/Hidden php code
View source
 
<?php
class _Continue implements ICompilerSemanticPlugin
{
        function onGeneratingGrammar($generator)
        {
                $generator->addToken('CONTINUE');
                $generator->addGrammarRule('statement: continue_statement');
                $generator->addGrammarRule('continue_statement: CONTINUE SEMICOLON');
        }
 
        function beforeLexing($compiler) 
        {
                $compiler->addLexerKeywordPatternRule(3,'CONTINUE','continue');
        }
 
        function beforeParsing($compiler)
        {
 
                $compiler->associateRuleNumberForLHSSymbol('CONTINUE_STATEMENT','continue_statement');
 
                $compiler->registerNamedReduceCallback(
                        'CONTINUE_STATEMENT',
                        function($me,$stackStates)
                        {
                                $label=labelDescendStack($me,'labelPrefixForContinue');
                                if($label==null) $me->error("nothing to continue");
                                emitJump($me,$stackStates,$label->labelPrefixForContinue,$label->labelNumber);
                        }
                );
        }
 
}
?>
 

8. Output of the example program

The example program's output becomes:


$ ./tl examples/example3.tl

(for loop) x:0
(for loop) x:1
(for loop) x:2
(for loop) x:3
(for loop) x:4
(for loop) x:5
continuing loop at 6
(for loop) x:7
breaking loop at 8
(while loop) x:1
(while loop) x:2
(while loop) x:3
(while loop) x:4
(while loop) x:5
continuing loop at 6
(while loop) x:7
breaking loop at 8
(do while loop) x:1
(do while loop) x:2
(do while loop) x:3
(do while loop) x:4
(do while loop) x:5
continuing loop at 6
(do while loop) x:7
breaking loop at 8

 

9. Conclusion

The compiler's plugin structure has not suffered under the changes. It has been able to absorb the new plugins quite well using the existing internal APIs. I suspect that the label stack mechanism has now also become quite stable. The next semantic plugin, in my next blog post, will be for the SWITCH statement.