---
slug: lexer
title: "Internals: State Machine Lexer"
authors: jcubic
image: /img/lexer.png
tags: [lips, scheme, lexer, internals]
---

The first version of LIPS Scheme had a regex-based tokenizer. It used a single regular expression to split the input string into tokens. In this article I will show the internals of the new [Lexer](https://en.wikipedia.org/wiki/Lexical_analysis) in LIPS Scheme.

<!-- truncate -->

You can still find the first version of what later became LIPS on CodePen:

* [Simple Lisp interpreter in JavaScript](https://codepen.io/jcubic/pen/gvvzdp?editors=0011)

When I started working on version 1.0 (you can read this story in the article [LIPS Scheme History](/blog/lips-history)), the code became more and more complex. The regular expression became dynamic, mostly because of [syntax extensions](/docs/lips/extension#syntax-extensions) that needed to update the regular expression and the tokenizer.

You can see this code on GitHub on the [old-version branch](https://github.com/jcubic/lips/blob/old-version/src/lips.js#L201-L204):

```javascript
function makeTokenRe() {
    // specials is the registry of syntax extensions; their tokens are
    // escaped and injected into the regular expression
    var tokens = Object.keys(specials).map(escapeRegex).join('|');
    return new RegExp(`("(?:\\\\[\\S\\s]|[^"])*"|\\/(?! )[^\\/\\\\]*(?:\\\\[\\S\\s][^\\/\\\\]*)*\\/[gimy]*(?=\\s|\\(|\\)|$)|\\(|\\)|'|"(?:\\\\[\\S\\s]|[^"])+|\\n|(?:\\\\[\\S\\s]|[^"])*"|;.*|(?:[-+]?(?:(?:\\.[0-9]+|[0-9]+\\.[0-9]+)(?:[eE][-+]?[0-9]+)?)|[0-9]+\\.)[0-9]|\\.{2,}|${tokens}|[^(\\s)]+)`, 'gim');
}
```
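
To illustrate how this approach works, a single global regex can split the whole input into tokens in one pass. This is a simplified pattern I wrote for the example, not the original one:

```javascript
// Simplified illustration of regex-based tokenization (not the original
// regex): strings, parentheses, comments, and atoms in one pattern.
const re = /("(?:\\[\S\s]|[^"])*"|\(|\)|;.*|[^(\s)]+)/g;

'(define (square x) (* x x))'.match(re);
// => ['(', 'define', '(', 'square', 'x', ')', '(', '*', 'x', 'x', ')', ')']
```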

At one point I realized that I needed to change my approach to parsing and tokenization, because you could not add new syntax extensions in the same file that used them: the whole code was tokenized at once.

## State Machine Lexer

This limitation of syntax extensions led to the introduction of a new Lexer and a Streaming Parser (if you're interested in this topic, I will be writing an article about it in the future).

The new Lexer is much simpler and easier to maintain. It recently had only one bug related to the Lexer's inner workings ([#433](https://github.com/jcubic/lips/issues/433)).

The new Lexer is a class that has rules for the state machine. This is an example sequence of rules for a string:

```javascript
Lexer.string = Symbol.for('string');
Lexer.string_escape = Symbol.for('string_escape');
...
Lexer._rules = [
    ...
    [/"/, null, null, Lexer.string, null],
    [/"/, null, null, null, Lexer.string],
    [/"/, null, null, Lexer.string_escape, Lexer.string],
    [/\\/, null, null, Lexer.string, Lexer.string_escape],
    [/./, /\\/, null, Lexer.string_escape, Lexer.string],
    ...
]
```

A single rule consists of the current character, the previous character, and the next character (they can be single-character strings or regular expressions). If one of them is null, it matches any character. The last two elements of the array are the starting and the ending state (they are symbols, so they are unique values).

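For example, this is how I read the last rule from the listing above (the comments are mine, not from the source):

```javascript
// [current, previous, next, starting state, ending state]
// in the string_escape state: any character whose previous character
// was a backslash moves the Lexer back into the string state
[/./, /\\/, null, Lexer.string_escape, Lexer.string]
```
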
The Lexer starts with the null state and iterates over every rule for every character until it finds a match. If a rule enters a state and a later rule ends with the null state, it means that the whole rule sequence was matched and a full token is created.

If no rule matches and the state is not null, then the character is collected and will be included in the final token.

That's why in the example above there is no rule like this:

```javascript
[/./, null, null, Lexer.string, Lexer.string]
```

Such a rule may be added in the future to speed up the Lexer.
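
To make the algorithm concrete, here is a minimal sketch of the matching loop. It's my simplification, not the actual LIPS code; the `matches` helper and the `tokenize_one` function are made up for this example:

```javascript
// Minimal sketch of the state machine loop (my simplification, not the
// actual LIPS code). A rule is [current, previous, next, starting state,
// ending state]; null in a character position matches any character.
function matches(pattern, char) {
    if (pattern === null) return true;
    if (char === null) return false;
    if (pattern instanceof RegExp) return pattern.test(char);
    return pattern === char;
}

// Reads a single token from the beginning of the input string.
function tokenize_one(input, rules) {
    let state = null;
    let token = '';
    for (let i = 0; i < input.length; i++) {
        const current = input[i];
        const prev = i > 0 ? input[i - 1] : null;
        const next = i < input.length - 1 ? input[i + 1] : null;
        const rule = rules.find(([c, p, n, from]) => {
            return from === state && matches(c, current) &&
                   matches(p, prev) && matches(n, next);
        });
        if (rule) {
            token += current;
            state = rule[4]; // enter the ending state of the rule
            if (state === null) {
                return token; // the rule sequence was matched: full token
            }
        } else if (state !== null) {
            token += current; // no rule, but inside a token: collect
        }
    }
    return token;
}
```

With the five string rules from the listing above, `tokenize_one('"foo\\"bar"', rules)` returns the whole string token from the example in the next section.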

### Example

When we have a string like this:

```javascript
"foo\"bar"
```

It matches the second rule, because the first character is a quote, so the Lexer enters the `Lexer.string` state. The first rule doesn't match, because the initial state is null and that rule starts from the `Lexer.string` state. For the characters `foo` it just collects them, because no rule matches. When it finds the backslash `\` it changes state from `Lexer.string` to `Lexer.string_escape`, and on the next character it enters `Lexer.string` again. Then it consumes the sequence of characters `bar`, and the last quote matches the first rule. And that's how we get the full token.
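
Here is the same walk written out as a trace (my annotation, not output from the Lexer; rule numbers refer to their order in the listing above):

```javascript
// tokenizing "foo\"bar"
//
// char      state before         matched           state after
// "         null                 rule 2            Lexer.string
// f, o, o   Lexer.string         nothing: collect  Lexer.string
// \         Lexer.string         rule 4            Lexer.string_escape
// "         Lexer.string_escape  rule 3            Lexer.string
// b, a, r   Lexer.string         nothing: collect  Lexer.string
// "         Lexer.string         rule 1            null, token is done
```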

### Syntax Extensions and Constants

The static rules are located in `Lexer._rules`, but `Lexer.rules` is a getter that creates the final rules dynamically by adding all tokens defined as syntax extensions (they are called specials in the code). This is also where other constants that start with a hash are added, like `#t`, `#f`, or `#void`. They are added together with the syntax extensions to handle the matching order of the rules.
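
Conceptually, the getter looks something like this. This is a sketch of the idea, not the actual implementation, and the `specials` object is a stub standing in for the real registry of syntax extensions:

```javascript
// Sketch of the idea behind the Lexer.rules getter (not the real code),
// assuming the Lexer class, Lexer._rules, and Lexer.literal_rule exist.
const specials = { "'": 'quote', ',@': 'unquote-splicing' };

Object.defineProperty(Lexer, 'rules', {
    get() {
        // hash constants are added together with the specials, so the
        // matching order of all dynamic rules is handled in one place
        const tokens = Object.keys(specials).concat(['#t', '#f', '#void']);
        const dynamic = tokens.flatMap(token => Lexer.literal_rule(token));
        return dynamic.concat(Lexer._rules);
    }
});
```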

Each syntax extension creates a lexer rule using `Lexer.literal_rule`, which creates an array of rules that match the literal characters of the token, passed as the first argument.
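
Based on that description, `Lexer.literal_rule` can be reconstructed roughly like this. This is my reconstruction, not the source, and the intermediate state names are made up:

```javascript
// Rough reconstruction of Lexer.literal_rule (not the actual source):
// for a literal token like ",@" it produces one rule per character,
// chained through intermediate states; the last rule ends in the null
// state, which completes the token.
Lexer.literal_rule = function(token) {
    const rules = [];
    let state = null;
    for (let i = 0; i < token.length; i++) {
        const last = i === token.length - 1;
        const next_state = last ? null : Symbol.for(`literal_${token}_${i}`);
        rules.push([token[i], null, null, state, next_state]);
        state = next_state;
    }
    return rules;
};
```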

The Lexer is important not only for reading input LIPS Scheme code; it's also used when reading from I/O ports.

## Conclusion

And that's it, this is the whole Lexer. As you can see from the above, it's very simple and easy to maintain. If you want to see how it works for yourself, you can jump into [the source code](https://github.com/jcubic/lips/tree/master/src) and search for `"class Lexer"`, `"Lexer._rules"`, and `Object.defineProperty(Lexer, 'rules'`.

The source code is in one file, so to navigate it you need to use search. I made an attempt to split the code into modules, but failed because of Rollup errors about circular dependencies.

This was the first part of a series of articles about [LIPS Scheme Internals](https://github.com/jcubic/lips/issues/437).
