Zolatron 64 6502 homebrew – parsing inputs in assembly

Okay, so this is going to be a long one. Grab a coffee, settle down and we’ll begin…

The last time we talked, I was able to send messages to the Zolatron 64 6502 homebrew computer via its serial port. But all it could do was print those messages to the LCD screen. It’s time to actually do something with that input.

That means parsing the incoming text. If I were working in Go or Python, that would be a fairly simple job. Slice up the input using spaces as delimiters, shove the pieces into string arrays and run the whole lot through a big switch statement.

But not so fast, bud. All we have here is 8-bit assembly code. No handy string types or array functions. Life in 6502 assembly is pretty much one byte at a time. Sometimes less.

Bit by bit

When input arrives via the serial port, it sits in a buffer, with a null byte (ASCII 0) marking the end of the input bytes. These bytes will be integer values consisting of the ASCII codes of the characters typed.

I’m making the assumption that the first word (ie, characters up to but not including the first space or terminator) will be a command. We could maybe come up with a method of comparing this first word with a list of words – and that’s almost the method I went for. But I used a tweak that makes things a lot more efficient.

I didn’t invent this method – I discovered it by reading the source code for EhBasic and looking at how it parses commands in Basic programs (see line 8273 onward).

The trick is to start with just the first character, and use that to look up a group of commands with a jump table. Let’s walk through it.

On the table

This is all based around the cunning use of tables – which is to say, collections of data arranged contiguously in memory. You access an item in a table by determining some offset (in bytes) from the start of the table and adding that number to the address at which the table starts. Hopefully, this will all become clearer soon.

There are two key elements to this – a table of ‘tokens’ representing commands and another table of the corresponding character strings. A token is nothing more than a one-byte unsigned integer. Each command has its corresponding number. The commands are numbered in strict alphabetical order, and the token numbers start at value $80 (128 in decimal). There’s a good reason for starting there – it means the most significant bit (bit 7) is always set – something that easily distinguishes a token from a typed character (as we’re only using 7-bit characters on the Zolatron – no fancy extended ASCII for us).

I’ve created tokens for just a few commands so far, and I won’t be using all of these. They are just for playing around with at this stage. Here’s how the tokens are created:

CMD_TKN_NUL = $00 ; This is what happens when you just hit RTN
CMD_TKN_FAIL = $01                 ; syntax error & whatnot
CMD_TKN_STAR = $80                 ; *
CMD_TKN_LM = CMD_TKN_STAR + 1      ; LM - list memory
CMD_TKN_LP = CMD_TKN_LM + 1        ; LP - list memory page
CMD_TKN_PRT = CMD_TKN_LP + 1       ; PRT - print string to LCD
CMD_TKN_VERBOSE = CMD_TKN_PRT + 1  ; VERBOSE - just for testing
CMD_TKN_VERS = CMD_TKN_VERBOSE + 1 ; VERS - show version

The first two are a bit special. As I said, all the others start at $80 and go up from there. As well as being in alphabetical order, if one command shares all of its letters with the first letters of another, then the longer one would have to come first. That’s not the case in my current list, but let’s assume that you want to implement the commands FOO and FOOBAR – FOOBAR would have to come first. This is connected with how the commands are searched.

Command tables

Now let’s look at the main tables. If you want to look at the code in its natural habit, in my Github repo, you’ll find this stuff in the data_tables.asm file.

.cmd_ch1_tbl ; table of command first characters
  equs "*LPV"
  equb EOTBL_MKR          ; end of table marker

.cmd_ptrs                 ; pointers to command table entries
  equw cmd_tbl_STAR       ; commands starting '*'
  equw cmd_tbl_ASCL       ; commands starting 'L'
  equw cmd_tbl_ASCP       ; commands starting 'P'
  equw cmd_tbl_ASCV       ; commands starting 'V'

; Command table
.cmd_tbl_STAR             ; commands starting '*'
  equb CMD_TKN_STAR       ; not sure what I'm using this for yet
  equb EOCMD_SECTION      ; comes at end of each section
.cmd_tbl_ASCL             ; commands starting 'L'
  equs "M", CMD_TKN_LM    ; LM
  equs "P", CMD_TKN_LP    ; LP
.cmd_tbl_ASCP             ; commands starting 'P'
  equs "RT", CMD_TKN_PRT  ; PRT
.cmd_tbl_ASCV             ; commands starting 'V'
  equs "ERS", CMD_TKN_VERS       ; VERS

There are three sections to this. The first is defined with the label .cmd_ch1_tbl. This doesn’t look much like a table – it’s just a string. But it’s a set of characters starting at the address defined by the label. These are the first characters of the available commands, in alphabetical order. Each character need appear only once – for example, we have two commands starting with ‘V’ – VERS and VERBOSE – but we need include V just one time.

What happens is that the parsing code takes the first character in the input buffer and starts comparing it with each letter in this table, keeping track of how far through the table it has moved (numbering, as is usual, from 0). Let’s say we’ve used the command PRT. This will match with P, at which point our counter has the value 2.

Let’s briefly skip to the third section – the command table. You’ll see that each letter of the alphabet (and each special character, like ‘*’) gets its own section. For instance, commands starting with the letter ‘L’ are in the section that starts at an address labelled .cmd_tbl_ASCL (the last bit standing for ‘ASCII L’). Each entry in that section has the letters of the command (minus the first one, because we’ll have already matched on that) followed by the token for that command.

Now we can see how the second table comes into play. This is a table of address pointers – .cmd_ptrs. We have a sequence of two-byte addresses using the labels we just talked about.

So here’s how it works. The program finds a match for the first letter of the command. Because the list it searches isn’t the whole alphabet but just those characters relating to commands actually implemented, this is a quick search. With our PRT command example, we matched the third character (with the offset counter having incremented to 2 by this point).

Now we look up the address for the commands starting with ‘P’. We do this by multiplying the offset counter (2, in our example) by 2. That’s because the entries in the .cmd_ptrs table are each two bytes long. Then we add this number (4 in our example) to the start address of the .cmd_ptrs label. In our example, that sum gives us an address which belongs to the data created by the line that says: equw cmd_tbl_ASCP. The contents of those two bytes are the address defined by the label cmd_tbl_ASCP. Now we have an address where we can continue our search.

The code now starts comparing, one by one, the characters in the input buffer with the characters it finds at the location we’ve just calculated.

At each step, the code pulls the next character from the buffer and stashes it somewhere safe. It then pulls the next character from the command table. If this character has a value of $80 or more, then it’s not a character but the token. We’ve successfully matched.

If the match fails at any point, we go on to the next item in that section of the table (the ‘P’ section in our example). This is done by incrementing our search counter as we run through the remainder of the command currently being searched, until we hit the token value, then starting again.

If the next byte from the command table has the value matching the constant EOCMD_SECTION (which is defined elsewhere as 0), then we know we’ve got to the end of that command section without a match – an error, in other words.

Come again?

In case none of that is clear, let’s run through it again with another example. Let’s say I’ve entered the command VERS, which will make the machine print the current version of the ROM code.

The software picks up the first byte from the input buffer – the ASCII value for the letter ‘V’ (which is 86 in decimal and $56 in hex, if you’re curious). It starts running through the values it finds starting at the address defined by label .cmd_ch1_tbl, incrementing a counter at each step. It finds the letter ‘V’ on its fourth attempt, at which point the counter reads 3.

The code doubles this number (ie, it’s now 6) and adds this to the address defined by the label .cmd_ptrs. That gives a new address at which the program will find two bytes. These are the address of the label cmd_tbl_ASCV.

So now the code starts comparing bytes it finds at the address cmd_tbl_ASCV with the bytes in the buffer. We use two counters – X keeps track of our position in the command table and Y keeps track of where we are in the input buffer, starting at index 1 because we’ve already matched the first char. X will only ever increase, but Y’s a bit different.

So, we match the next character from the command buffer – ‘E’ – with the next from the input buffer, which also turns out to be ‘E’. Great, we increment both X and Y and use them as offsets to fetch the next character. This is ‘R’ in both input buffer and command table, so another match. Increment both counters and try again. Now we have a ‘B’ in the command buffer, but ‘S’ in the input buffer. Oh-oh. This is a fail. But all is not lost.

We reset the input buffer counter, Y, to 1. We also fast-forward the X counter until the character it indicates in the command table has a value of $80 or more. This indicates it’s the token value for the command we’ve just been checking, which didn’t match (VERBOSE in this case). We increment X again once more so that now it’s pointing to the start of the next item in the ‘V’ section.

But first, we check the value of this next item. If it’s 0, we’ve actually reached the end of the section, so that’s a fail. Otherwise, we start the matching process again.

Quick and easy

There’s a couple of things to note. Mismatching happens quickly. What I mean by that is that, if we’ve typed something that isn’t a valid command, we’ll find out pretty quickly. If the first character doesn’t match any command, the process ends very quickly. And the only time we match a whole command is where it succeeds.

There are more steps. If a match occurs, the code checks to see that the next character in the input buffer is a space or a null. The first happens when the command is followed by parameters. The second applies to standalone commands, like VERS. If the next char isn’t a space or null, that indicates an error.

And we need a section of code to implement each command. If that commands takes parameters, they’ll need to be pulled from the input buffer and parsed. I’ll deal with that in a future post but – spoiler alert – it uses jump tables too.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.