Dissecting the Micro-Soft BASIC Interpreter
I'll be talking about re-implementing the BASIC interpreter from the Commodore C-64 in Ruby at the Baltic Ruby conference in Malmö in less than two weeks (if time permits, I'll try to betatest it at the Ruby Unconf in Hamburg where I'll be MCing).
In this post I want to summarize my high level findings and the challenges.
Reading 6502 assembler code (which is what Paul Allen and Bill Gates used to implement their BASIC interpreter) and trying to understand what's going on is tricky. Assembler code, even from back in the early days of the 6502 is full of mad scientist level tricks that are often necessary to get something relatively complex like a full BASIC interpreter running with the very limited resources available.
Some really good resources
Luckily, there are a ton of resources that explain the code, the best one probably being the multi source commented ROM listing from Pagetable that pulls in commentary from various sources.
There are two main parts to the functionality of Commodore BASIC V2 as this version is called: The editor loop and the interpreter loop.
The editor loop, as the name suggests is, a very simple, yet somewhat effective line editor for both BASIC programs and for executing BASIC code directly. Think of this as if you would be writing your Ruby code in IRB. The line editor takes input and copies it over into a buffer until you hit Enter. After that, the line is “tokenized” which means that keywords are turned into single bytes (sitting in the upper half of the 8 bit spectrum). Everything else is more or less untouched. If the line started with a number, that number is interpreted as a line number and saved to the BASIC memory. If no line number is added, the code is directly thrown at the interpreter itself and executed.
The interpreter
The interpreter itself can be divided into several interesting pieces of code. The inner loop is actually relatively short, it checks for the before mentioned tokens and uses a jump table to run the commands behind the tokens. If non-token characters turn up in the inner loop, there aren't a ton of possibilities: normal letters are probably variable assignments (the LET keyword is optional here), colons divide commands, etc.
Expression evaluation
Commands consume their own arguments and one of the most important routines to do that is the formula evaluator that condenses down expressions to a single value (either a string or a number) in the end. This is the part I am currently working on and it is infuriatingly difficult to trace different inputs through the code. The inner code already starts with this neat trick:
6502 asm
.BYTE $24
PHA
On first run, when the code goes through this, the byte 0x24
(the BIT
opcode) turns the following PHA
into the argument for BIT, thus skipping the PHA
. The interpreter later jumps directly to the PHA
opcode instead, so this is only skipped on the first run.
About the Status Register
One interesting bit about assembler (and specifically about 6502 machine code) is that a ton of instructions set processor flags in the so called Status Register (SR) like the Zero Bit, the Carry bit and so on. This is used a lot and often a ton of tests follow after the initial operation that set the flags. If reading the code, it is important to keep in mind all the different instructions that do change the flags, because if you don't, the code does not make any sense. For example, a simple LDA
that loads a value (or the contents of a memory address) into the accumulator, already sets the Zero flag and the Negative flag depending on the incoming value. So basically, a zero test is “free” in the sense that it does not involve an extra opcode for the test.
That RTS trick.
Another thing that happens regularly is to “misuse” JSR
and RTS
. JSR
is “Jump to Subroutine” and it works by placing the program counter (PC
) on the stack, then jumping to the address given. RTS
takes the address from the stack, jumps back to it (and increases the PC
afterwards so that you end up on the instruction after the JSR
). There are two important things here: When you put things on the stack yourself (with PHA
), you need to make sure you don't accidentally end up at an RTS
instruction and you've made yourself a nice bug jumping into whoknowswhere. The other part is that of course you can use this for some trickery. If you, say, in a subroutine put an address on the stack you created by referencing a lookup table, and you RTS
, then the address gets called and if that subroutine also ends with an RTS
, it will jump directly back to your original call site. Neat. This trick gets used in Micro-Soft BASIC a lot and not always in that rather simple form I just described.
This all does make the formula evaluator really tricky to fully understand, but admittedly, it does have a pretty big job: It does comparisons, math, string concatenation, evaluates functions and needs to respect precedence rules (think: multiplication before addition etc.), so it more or less resolves the various arguments and operations into a form that resembles reverse polish notation on the stack which then can be collapsed into a single result.
Global state of emergency
The last thing I wanted to quickly mention is the amount of global state that is at play here. The 6502 has a concept of a “zero page”, which are the first 256 bytes of memory, which then can be addressed directly. There are separate opcodes for instructions using the zero page and using these as much as possible is beneficial because it saves space, even though they are only sometimes faster. If you look at how the basic interpreter uses the zero page, for example at the memory map at Pagetable, you can see that it is filled to the brim.
It's a really interesting challenge to somehow transfer this to somewhat idiomatic ruby code and still convey some of the trickery. Only time will tell if I did a good job but it definitely feels like I will need to do a few loops of refactoring to make it readable and instructive and still somewhat in the spirit of the original.