Tokenizing in Java

So I'm playing with the idea of making a compiler built on Java. All the reasons why that is a horrible idea aside, I want to.

Problem:
Tokenize a dirty source file.
I've worked with bison and flex for C++ and know the basics of how they work, so I plan on mimicking that functionality.

Situation:
We have a 'source string', which is the code we are tokenizing (analyzing and breaking down into data that's easier to work with), and we have regex-token pairs. The regex determines which part of the source the token represents.

Boring:
I have a file of regexes and the tokens they produce. Just to make some progress, I looped through the regex-token pairs, replaced all the regex matches with their respective tokens, and was left with a string of tokens. My tokens, however, were just a few characters surrounded by sideways carets (yes, that's what I call them), with the literals added, something like "<string_constant value="test">".
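
For the curious, the replace-everything pass described above might look roughly like this. It is only a sketch: the three rules, the class name ReplaceTokenizer, and every token spelling other than string_constant are made up for illustration, and the real rules would come from the regex-token file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReplaceTokenizer {
    // Hypothetical regex -> token-text pairs. LinkedHashMap keeps insertion
    // order, so the rules run in the order they are listed here.
    static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\"([^\"]*)\"", "<string_constant value=\"$1\">");
        RULES.put("\\d+", "<int_constant value=\"$0\">");
        RULES.put(";", "<semicolon>");
    }

    // Applies each rule to the whole string in turn, replacing every match
    // with its token text ($1 / $0 copy the matched literal into the token).
    static String tokenize(String source) {
        String result = source;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: <string_constant value="test"> <int_constant value="42"><semicolon>
        System.out.println(tokenize("\"test\" 42;"));
    }
}
```

One weakness shows up quickly: later rules also see the token text that earlier rules wrote, so a string literal such as "a1" would have its digit rewritten by the integer rule.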

Fun:
All I can say is that the previous part depressed me by how boring it was. So let's have some fun. Instead of repeatedly having to parse text, let's throw each token's data into an object as it's matched by the regex. If we have a collection of these token objects, they need to be sorted by their occurrence in the source string, not by the order in which the regexes evaluate them. Instead of replacing the matched text with a token string, let's remove it and replace it with a char, say, #01 (not null, but also not a printable character). Then we take that token object and put it in a map where the key is the memory location of the dummy char. As the source string gets evaluated, it should break down into a string of #01 characters. After that, all you have to do is iterate through the string, pull out the corresponding token object from the map for each character, and add it to your favorite ordered list (mine's ArrayList). BAM!

Although, this is Java, and Java doesn't like to share addresses. I think it's time to bust out sun.misc.Unsafe.
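
Here is a minimal sketch of the token-object idea that sidesteps addresses entirely. To be clear, this is a substitution and not the Unsafe route: instead of keying the map on the memory location of the dummy char, it keys on the match's start offset, and it overwrites each match with the same number of \u0001 characters so the offsets never shift. The class name OffsetTokenizer, the Token fields, and the four rules are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetTokenizer {
    static final char PLACEHOLDER = '\u0001'; // the "#01" dummy char

    // Hypothetical token object: type name, matched text, offset in the source.
    static final class Token {
        final String type;
        final String text;
        final int start;
        Token(String type, String text, int start) {
            this.type = type;
            this.text = text;
            this.start = start;
        }
        @Override public String toString() {
            return type + "(" + text + ")@" + start;
        }
    }

    // Hypothetical rules, applied in insertion order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("string_constant", Pattern.compile("\"[^\"]*\""));
        RULES.put("identifier", Pattern.compile("[A-Za-z_]\\w*"));
        RULES.put("int_constant", Pattern.compile("\\d+"));
        RULES.put("semicolon", Pattern.compile(";"));
    }

    static List<Token> tokenize(String source) {
        StringBuilder work = new StringBuilder(source);
        // TreeMap iterates keys in ascending order, i.e. source order.
        TreeMap<Integer, Token> byOffset = new TreeMap<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            // Match against a snapshot so blanking does not disturb the matcher.
            Matcher m = rule.getValue().matcher(work.toString());
            while (m.find()) {
                byOffset.put(m.start(), new Token(rule.getKey(), m.group(), m.start()));
                // Blank the match with placeholders of equal length so later
                // rules cannot re-match it and all offsets stay valid.
                for (int i = m.start(); i < m.end(); i++) {
                    work.setCharAt(i, PLACEHOLDER);
                }
            }
        }
        return new ArrayList<>(byOffset.values());
    }

    public static void main(String[] args) {
        // prints: [identifier(print)@0, string_constant("test")@6, int_constant(42)@13, semicolon(;)@15]
        System.out.println(tokenize("print \"test\" 42;"));
    }
}
```

The tokens come out in source order no matter which rule found them first, which is the property the map-of-dummy-chars plan was after.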
