---
slug: lexer
title: "Internals: State Machine Lexer"
authors: jcubic
image: /img/lexer.png
tags: [lips, scheme, lexer, internals]
---

The first version of LIPS Scheme had a regex-based tokenizer. It used a single regular expression
to split the input string into tokens. In this article I will show the internals of the new
[Lexer](https://en.wikipedia.org/wiki/Lexical_analysis) in LIPS Scheme.

<!-- truncate -->

You can still find the first version of what later became LIPS on CodePen:

* [Simple Lisp interpreter in JavaScript](https://codepen.io/jcubic/pen/gvvzdp?editors=0011)

When I started working on version 1.0 (you can read this story in the article: [LIPS Scheme
History](/blog/lips-history)), the code became more and more complex, and the regular expression
became dynamic, mostly because of [syntax extensions](/docs/lips/extension#syntax-extensions) that
needed to update the regular expression and the tokenizer.

You can see this code on GitHub on the
[old-version branch](https://github.com/jcubic/lips/blob/old-version/src/lips.js#L201-L204):

```javascript
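// Build one big token-splitting regular expression; "specials" holds the
// tokens added by syntax extensions.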
function makeTokenRe() {
    var tokens = Object.keys(specials).map(escapeRegex).join('|');
    return new RegExp(`("(?:\\\\[\\S\\s]|[^"])*"|\\/(?! )[^\\/\\\\]*(?:\\\\[\\S\\s][^\\/\\\\]*)*\\/[gimy]*(?=\\s|\\(|\\)|$)|\\(|\\)|'|"(?:\\\\[\\S\\s]|[^"])+|\\n|(?:\\\\[\\S\\s]|[^"])*"|;.*|(?:[-+]?(?:(?:\\.[0-9]+|[0-9]+\\.[0-9]+)(?:[eE][-+]?[0-9]+)?)|[0-9]+\\.)[0-9]|\\.{2,}|${tokens}|[^(\\s)]+)`, 'gim');
}
```
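
For illustration, this is roughly how the old tokenizer was used (a minimal sketch; the `specials`
object and the `escapeRegex` helper here are simplified stand-ins for what the real code defined):

```javascript
// Simplified stand-ins for values defined elsewhere in the real code.
function escapeRegex(str) {
    return str.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}
var specials = { "'": null, '`': null, ',': null };

// The whole input was split into tokens with a single match() call:
'(define (square x) (* x x))'.match(makeTokenRe());
// => ['(', 'define', '(', 'square', 'x', ')', '(', '*', 'x', 'x', ')', ')']
```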

At one point I realized that I needed to change my approach to parsing and tokenization, because
you could not define a new syntax extension and use it in the same file: the whole code was
tokenized at once, before the extension could take effect.

## State Machine Lexer

The limitations of syntax extensions led to the introduction of a new Lexer and a Streaming
Parser (if you're interested in this topic, I will be writing an article about it in the
future).

The new Lexer is much simpler and easier to maintain; it recently had only one bug related to the
Lexer's inner workings ([#433](https://github.com/jcubic/lips/issues/433)).

The new Lexer is a class that has rules for the state machine. This is an example sequence of
rules for a string:

```javascript
Lexer.string = Symbol.for('string');
Lexer.string_escape = Symbol.for('string_escape');
...
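// rule format: [char, prev_char, next_char, starting_state, ending_state]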
Lexer._rules = [
    ...
    [/"/, null, null, Lexer.string, null],
    [/"/, null, null, null, Lexer.string],
    [/"/, null, null, Lexer.string_escape, Lexer.string],
    [/\\/, null, null, Lexer.string, Lexer.string_escape],
    [/./, /\\/, null, Lexer.string_escape, Lexer.string],
    ...
]
```

A single rule consists of a current character, a previous character, and a next character (each
can be a single-character string or a regular expression). If a pattern is null, it matches any
character. The last two elements of the array are the starting and the ending state (they are
symbols, so they are unique values).
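
To make the rule shape concrete, checking a single rule against the input could look like this (a
sketch of the idea, not the actual LIPS code; `matchChar` and `ruleMatches` are hypothetical
helpers):

```javascript
// Hypothetical helper: a pattern matches when it is null (wildcard),
// an equal single-character string, or a regular expression that tests true.
function matchChar(pattern, char) {
    if (pattern === null) return true;
    if (pattern instanceof RegExp) return pattern.test(char);
    return pattern === char;
}

// A rule [char, prev, next, start, end] matches when all three character
// patterns match and the lexer is currently in the rule's starting state.
function ruleMatches(rule, char, prev, next, state) {
    const [charPat, prevPat, nextPat, startState] = rule;
    return matchChar(charPat, char) &&
        matchChar(prevPat, prev) &&
        matchChar(nextPat, next) &&
        startState === state;
}
```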

The Lexer starts in the null state and iterates over every rule for each character until it finds
a match. If a rule enters a state, and a later rule ends in the null state, it means that the rule
sequence was matched and a full token is created.

If no rule matches and the state is not null, the character is collected and will be included in
the final token.
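
Putting it all together, the main loop can be sketched like this (again a simplification that
reuses the hypothetical `ruleMatches` helper from above, not the actual implementation):

```javascript
// Simplified sketch of the state machine loop.
function tokenize(input, rules) {
    const tokens = [];
    let state = null; // the Lexer starts in the null state
    let token = '';
    for (let i = 0; i < input.length; i++) {
        const char = input[i];
        const prev = i > 0 ? input[i - 1] : '';
        const next = i < input.length - 1 ? input[i + 1] : '';
        const rule = rules.find(r => ruleMatches(r, char, prev, next, state));
        if (rule) {
            token += char;
            state = rule[4]; // enter the rule's ending state
            if (state === null) {
                // back in the null state: the rule sequence was matched
                // and a full token is created
                tokens.push(token);
                token = '';
            }
        } else if (state !== null) {
            // no rule matched inside a token: collect the character
            token += char;
        }
        // characters outside of any token (like whitespace) are skipped
    }
    return tokens;
}
```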

That's why in the above example there is no rule like this:

```javascript
[/./, null, null, Lexer.string, Lexer.string]
```

This rule may be added in the future to speed up the Lexer.

### Example

When we have a string like this:

```javascript
"foo\"bar"
```

It matches the second rule, because the first character is a quote, so it enters the
`Lexer.string` state. The first rule doesn't match, because the initial state is null. For the
characters `foo`, the Lexer just collects them, because no rule matches. When it finds the
backslash `\`, it changes state from `Lexer.string` to `Lexer.string_escape`, and on the next
character it enters `Lexer.string` again. Then it consumes the sequence of characters `bar`, and
the last quote matches the first rule. And that's how we get the full token.
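
Assuming the hypothetical `tokenize` sketch from the previous section, the whole run can be
summarized like this:

```javascript
// char:    "    f    o    o    \    "    b    a    r    "
// state:  str  str  str  str  esc  str  str  str  str  null
// (str = Lexer.string, esc = Lexer.string_escape)
tokenize(String.raw`"foo\"bar"`, Lexer._rules);
// => ['"foo\\"bar"'] -- a single, complete string token
```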

### Syntax Extensions and Constants

The static rules are located in `Lexer._rules`, but `Lexer.rules` is a getter that creates the
final rules dynamically, adding all tokens defined as syntax extensions (they are called specials
in the code). This is also where other constants that start with a hash are added, like `#t`,
`#f`, or `#void`. They are added together with the syntax extensions to handle the rule matching
order.
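
Conceptually, the getter could be sketched like this (hypothetical code, not the actual
implementation; in the source the property is defined with `Object.defineProperty(Lexer, 'rules',
...)`):

```javascript
// Hypothetical sketch: specials (syntax extensions) and hash constants
// are expanded into literal rules and combined with the static rules.
Object.defineProperty(Lexer, 'rules', {
    get() {
        const tokens = Object.keys(specials).concat(['#t', '#f', '#void']);
        const literal = tokens.flatMap(token => Lexer.literal_rule(token));
        // literal rules come first so they take precedence when matching
        return literal.concat(Lexer._rules);
    }
});
```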

A syntax extension creates a lexer rule using `Lexer.literal_rule`, which creates an array of
rules that match the literal characters of the token passed as the first argument.
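
A sketch of what that expansion could look like (again hypothetical; the real `Lexer.literal_rule`
is in the source):

```javascript
// Hypothetical sketch: expand a literal token like "#t" into a chain of
// single-character rules threaded through unique intermediate states.
Lexer.literal_rule = function(token) {
    const rules = [];
    let state = null;
    for (let i = 0; i < token.length; i++) {
        const last = i === token.length - 1;
        // a fresh state per position; ending with null means
        // the full token was matched
        const next = last ? null : Symbol(`${token}:${i}`);
        rules.push([token[i], null, null, state, next]);
        state = next;
    }
    return rules;
};
```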

The Lexer is important not only for reading input LIPS Scheme code; it's also used when reading
from I/O ports.

## Conclusion

And that's it, this is the whole Lexer. As you can see from the above, it's very simple and easy
to maintain. If you want to see how it works for yourself, you can jump into
[the source code](https://github.com/jcubic/lips/tree/master/src) and
search for `"class Lexer"`, `"Lexer._rules"`, or `Object.defineProperty(Lexer, 'rules'`.

The source code is in one file, so to navigate it you need to use search. I made an attempt to
split the code into modules, but failed because of Rollup errors about circular dependencies.

This was the first part of a series of articles about [LIPS Scheme Internals](https://github.com/jcubic/lips/issues/437).