---
slug: lexer
title: "Internals: State Machine Lexer"
authors: jcubic
image: /img/lexer.png
tags: [lips, scheme, lexer, internals]
---

The first version of LIPS Scheme had a regex-based tokenizer. It used a single regular expression to split the input string into tokens. In this article I will show the internals of the new [Lexer](https://en.wikipedia.org/wiki/Lexical_analysis) in LIPS Scheme.

<!-- truncate -->

You can still find the first version of what later became LIPS on CodePen:

* [Simple Lisp interpreter in JavaScript](https://codepen.io/jcubic/pen/gvvzdp?editors=0011)

When I started working on version 1.0 (you can read this story in the article [LIPS Scheme History](/blog/lips-history)), the code became more and more complex. The regular expression became dynamic, mostly because of [syntax extensions](/docs/lips/extension#syntax-extensions) that needed to update the regular expression and the tokenizer.

You can see this code on GitHub on the [old-version branch](https://github.com/jcubic/lips/blob/old-version/src/lips.js#L201-L204):

```javascript
function makeTokenRe() {
    // specials is the registry of syntax extensions; their tokens are
    // escaped and injected into the regular expression
    var tokens = Object.keys(specials).map(escapeRegex).join('|');
    return new RegExp(`("(?:\\\\[\\S\\s]|[^"])*"|\\/(?! )[^\\/\\\\]*(?:\\\\[\\S\\s][^\\/\\\\]*)*\\/[gimy]*(?=\\s|\\(|\\)|$)|\\(|\\)|'|"(?:\\\\[\\S\\s]|[^"])+|\\n|(?:\\\\[\\S\\s]|[^"])*"|;.*|(?:[-+]?(?:(?:\\.[0-9]+|[0-9]+\\.[0-9]+)(?:[eE][-+]?[0-9]+)?)|[0-9]+\\.)[0-9]|\\.{2,}|${tokens}|[^(\\s)]+)`, 'gim');
}
```
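
To illustrate how this approach works, a single global regex can split the whole input into tokens in one pass. This is a simplified pattern I wrote for the example, not the original one:

```javascript
// Simplified illustration of regex-based tokenization (not the original
// regex): strings, parentheses, comments, and atoms in one pattern.
const re = /("(?:\\[\S\s]|[^"])*"|\(|\)|;.*|[^(\s)]+)/g;

'(define (square x) (* x x))'.match(re);
// => ['(', 'define', '(', 'square', 'x', ')', '(', '*', 'x', 'x', ')', ')']
```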

At one point I realized that I needed to change my approach to parsing and tokenization, because you could not add new syntax extensions in the same file that used them: the whole code was tokenized at once.

## State Machine Lexer

This limitation of syntax extensions led to the introduction of a new Lexer and a Streaming Parser (if you're interested in this topic, I will be writing an article about it in the future).

The new Lexer is much simpler and easier to maintain. It recently had only one bug related to the Lexer's inner workings ([#433](https://github.com/jcubic/lips/issues/433)).

The new Lexer is a class that has rules for the state machine. This is an example sequence of rules for a string:

```javascript
Lexer.string = Symbol.for('string');
Lexer.string_escape = Symbol.for('string_escape');
...
Lexer._rules = [
    ...
    [/"/, null, null, Lexer.string, null],
    [/"/, null, null, null, Lexer.string],
    [/"/, null, null, Lexer.string_escape, Lexer.string],
    [/\\/, null, null, Lexer.string, Lexer.string_escape],
    [/./, /\\/, null, Lexer.string_escape, Lexer.string],
    ...
]
```

A single rule consists of the current character, the previous character, and the next character (they can be single-character strings or regular expressions). If one of them is null, it matches any character. The last two elements of the array are the starting and the ending state (they are symbols, so they are unique values).

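For example, this is how I read the last rule from the listing above (the comments are mine, not from the source):

```javascript
// [current, previous, next, starting state, ending state]
// in the string_escape state: any character whose previous character
// was a backslash moves the Lexer back into the string state
[/./, /\\/, null, Lexer.string_escape, Lexer.string]
```
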
The Lexer starts with the null state and iterates over every rule for every character until it finds a match. If a rule enters a state and a later rule ends with the null state, it means that the whole rule sequence was matched and a full token is created.

If no rule matches and the state is not null, then the character is collected and will be included in the final token.

That's why in the example above there is no rule like this:

```javascript
[/./, null, null, Lexer.string, Lexer.string]
```

Such a rule may be added in the future to speed up the Lexer.
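
To make the algorithm concrete, here is a minimal sketch of the matching loop. It's my simplification, not the actual LIPS code; the `matches` helper and the `tokenize_one` function are made up for this example:

```javascript
// Minimal sketch of the state machine loop (my simplification, not the
// actual LIPS code). A rule is [current, previous, next, starting state,
// ending state]; null in a character position matches any character.
function matches(pattern, char) {
    if (pattern === null) return true;
    if (char === null) return false;
    if (pattern instanceof RegExp) return pattern.test(char);
    return pattern === char;
}

// Reads a single token from the beginning of the input string.
function tokenize_one(input, rules) {
    let state = null;
    let token = '';
    for (let i = 0; i < input.length; i++) {
        const current = input[i];
        const prev = i > 0 ? input[i - 1] : null;
        const next = i < input.length - 1 ? input[i + 1] : null;
        const rule = rules.find(([c, p, n, from]) => {
            return from === state && matches(c, current) &&
                   matches(p, prev) && matches(n, next);
        });
        if (rule) {
            token += current;
            state = rule[4]; // enter the ending state of the rule
            if (state === null) {
                return token; // the rule sequence was matched: full token
            }
        } else if (state !== null) {
            token += current; // no rule, but inside a token: collect
        }
    }
    return token;
}
```

With the five string rules from the listing above, `tokenize_one('"foo\\"bar"', rules)` returns the whole string token from the example in the next section.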

### Example

When we have a string like this:

```javascript
"foo\"bar"
```

It matches the second rule, because the first character is a quote, so the Lexer enters the `Lexer.string` state. The first rule doesn't match, because the initial state is null and that rule starts from the `Lexer.string` state. For the characters `foo` it just collects them, because no rule matches. When it finds the backslash `\` it changes state from `Lexer.string` to `Lexer.string_escape`, and on the next character it enters `Lexer.string` again. Then it consumes the sequence of characters `bar`, and the last quote matches the first rule. And that's how we get the full token.
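
Here is the same walk written out as a trace (my annotation, not output from the Lexer; rule numbers refer to their order in the listing above):

```javascript
// tokenizing "foo\"bar"
//
// char      state before         matched           state after
// "         null                 rule 2            Lexer.string
// f, o, o   Lexer.string         nothing: collect  Lexer.string
// \         Lexer.string         rule 4            Lexer.string_escape
// "         Lexer.string_escape  rule 3            Lexer.string
// b, a, r   Lexer.string         nothing: collect  Lexer.string
// "         Lexer.string         rule 1            null, token is done
```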

### Syntax Extensions and Constants

The static rules are located in `Lexer._rules`, but `Lexer.rules` is a getter that creates the final rules dynamically by adding all tokens defined as syntax extensions (they are called specials in the code). This is also where other constants that start with a hash are added, like `#t`, `#f`, or `#void`. They are added together with the syntax extensions to handle the matching order of the rules.
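
Conceptually, the getter looks something like this. This is a sketch of the idea, not the actual implementation, and the `specials` object is a stub standing in for the real registry of syntax extensions:

```javascript
// Sketch of the idea behind the Lexer.rules getter (not the real code),
// assuming the Lexer class, Lexer._rules, and Lexer.literal_rule exist.
const specials = { "'": 'quote', ',@': 'unquote-splicing' };

Object.defineProperty(Lexer, 'rules', {
    get() {
        // hash constants are added together with the specials, so the
        // matching order of all dynamic rules is handled in one place
        const tokens = Object.keys(specials).concat(['#t', '#f', '#void']);
        const dynamic = tokens.flatMap(token => Lexer.literal_rule(token));
        return dynamic.concat(Lexer._rules);
    }
});
```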

Each syntax extension creates a lexer rule using `Lexer.literal_rule`, which creates an array of rules that match the literal characters of the token, passed as the first argument.
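
Based on that description, `Lexer.literal_rule` can be reconstructed roughly like this. This is my reconstruction, not the source, and the intermediate state names are made up:

```javascript
// Rough reconstruction of Lexer.literal_rule (not the actual source):
// for a literal token like ",@" it produces one rule per character,
// chained through intermediate states; the last rule ends in the null
// state, which completes the token.
Lexer.literal_rule = function(token) {
    const rules = [];
    let state = null;
    for (let i = 0; i < token.length; i++) {
        const last = i === token.length - 1;
        const next_state = last ? null : Symbol.for(`literal_${token}_${i}`);
        rules.push([token[i], null, null, state, next_state]);
        state = next_state;
    }
    return rules;
};
```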

The Lexer is important not only for reading input LIPS Scheme code; it's also used when reading from I/O ports.

## Conclusion

And that's it, this is the whole Lexer. As you can see from the above, it's very simple and easy to maintain. If you want to see how it works for yourself, you can jump into [the source code](https://github.com/jcubic/lips/tree/master/src) and search for `"class Lexer"`, `"Lexer._rules"`, and `Object.defineProperty(Lexer, 'rules'`.

The source code is in one file, so to navigate it you need to use search. I made an attempt to split the code into modules, but failed because of Rollup errors about circular dependencies.

This was the first part of a series of articles about [LIPS Scheme Internals](https://github.com/jcubic/lips/issues/437).
