Micro Syntax
============

Lightweight, polyglot syntax analysis.

Motivations:

- YSH needs syntax highlighters, and this code is a GUIDE to writing one.
  - The lexer should run on its own.  Generated parsers like TreeSitter
    require such a lexer.  In contrast to recursive descent, grammars can't
    specify lexer modes.

Our own dev tools:

- The Github source viewer is too slow.  We want to publish a fast version of
  our source code to view.
  - Our docs need to link to source code.
  - Github source viewing is APPROXIMATE anyway, because they don't execute
    your build; they don't have your ENV.  They would have to "solve the
    halting problem".
  - So let's be FAST and approximate, not SLOW and approximate.

- Multiple attempts at this polyglot problem
  - github/semantic in Haskell
  - facebook/pfff -- semgrep heritage

- Aesthetics
  - I don't like noisy keyword highlighting.  Just comments and string
    literals look surprisingly good.
  - Can use this on the blog too.
- HTML equivalent of showsh, showpy -- quickly jump to definitions
- I think I can generate better ctags than `devtools/ctags.sh`!  It's a
  simple format.
- I realized that "sloccount" is the same problem as syntax highlighting --
  you exclude comments, whitespace, and lines with only string literals.
  - sloccount is a huge Perl codebase, and we can stop depending on that.

- Could be used to spell check comments?
  - Look at the tool sed in the PR from Martin.

Other:

- Because re2c is fun, and I wanted to experiment with writing it directly.
- Ideas
  - Use this on your blog?
  - Embed in a text editor?  Can it be incremental?

## Related

Positively inspired:

- uchex static analysis paper (2016)
- ctags

(and re2c itself)

Also see my comment on "Rust is the future of JavaScript infrastructure" --
you need Rust/C++ semantics to be fast.  We're using C++ because it's already
in our codebase, but Rust is probably better for collaboration.  (I trust
myself to use ASAN and develop with it on, but I don't want to review code
from people who haven't used ASAN :-P )

Negatively inspired:

- Github source viewer
- tree-sitter-bash, and to some degree seeing semgrep use tree-sitter-bash
- the huge amount of Perl code in sloccount
- to some extent, also ctags -- low-level C code

## TODO

- `--long-flags` in C++, probably
- Export to parser combinators
  - Export to ctags

## Algorithm Notes

Two-pass algorithm with StartLine:

First pass:

- Lexer modes with no lookahead or lookbehind
- This is "pre-structuring", as we do in Oils!

Second pass:

- Python: StartLine WS -> Indent/Dedent
- C++: StartLine MaybePreproc LineCont -> preprocessor
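
The Python rule above can be sketched with a stack of indent widths, similar
to what CPython's tokenize module does.  This is a sketch, not the actual
implementation; the function and token names are illustrative:

```python
def indent_tokens(lines):
    """Second pass: turn leading whitespace at the start of each line
    into Indent/Dedent tokens, using a stack of indent widths."""
    stack = [0]
    tokens = []
    for line in lines:
        if not line.strip():  # blank lines don't affect indentation
            continue
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)
            tokens.append(('Indent', width))
        while width < stack[-1]:
            stack.pop()
            tokens.append(('Dedent', width))
    while len(stack) > 1:  # close any blocks still open at EOF
        stack.pop()
        tokens.append(('Dedent', 0))
    return tokens
```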

Q: Are here docs first pass or second pass?

TODO:

- C++
  - arbitrary raw strings `R"zZXx(`
- Shell
  - YSH multi-line strings
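
Arbitrary raw string delimiters are why this is a TODO: a fixed regex can't
name the delimiter in advance.  One approach (a sketch with illustrative
names) is to capture the open delimiter and search for its matching closer:

```python
import re

# R"delim( ... )delim" -- the delimiter may be empty or up to 16 chars,
# and may not contain parens, backslash, or space.
RAW_START = re.compile(r'R"([^()\\ ]{0,16})\(')

def find_raw_string(s, pos=0):
    """Return (start, end) of a C++ raw string literal in s, or None.
    The closer is )delim" with the same delim as the opener."""
    m = RAW_START.search(s, pos)
    if not m:
        return None
    close = ')%s"' % m.group(1)
    end = s.find(close, m.end())
    if end == -1:
        return None  # unterminated raw string
    return (m.start(), end + len(close))
```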

Parsing:

- Name tokens should also have contents?
  - at least for Python and C++
  - shell: we want these at the start of a line:
    - `proc X`, `func X`, `f()`
    - not `echo proc X`
- Some kind of parser combinator library to match definitions
  - like showpy, showsh, but you can export to HTML with line numbers and
    anchors

### Design Question

- Can they be made incremental?
  - Run on every keystroke?  Supposedly IntelliJ does that.
  - <https://www.jetbrains.com/help/resharper/sdk/ImplementingLexers.html#strongly-typed-lexers>
- But if you reuse Python's lexer, it's probably not incremental
  - see Python's tokenize.py

## Notes

Why not reuse off-the-shelf tools?

1. Because we are a POLYGLOT codebase.
1. Because we care about speed.  (e.g. Github's source viewer is super slow
   now!)
   - and I think we can do a little bit better than `devtools/ctags.sh`.
   - That is, we can generate a better tags file.

We output 2 things:

1. A list of spans
   - type.  TODO: see Vim and textmate types: comment, string, definition
   - location: line, begin:end col
1. A list of "OTAGS"
   - SYMBOL FILENAME LINE
   - generate ctags from this
   - generate HTML or JSON from this
   - recall the Woboq code browser was entirely static, in C++
     - they used `compile_commands.json`
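
Generating a tags file from OTAGS triples is mostly a matter of sorting and
TAB-separating.  A minimal sketch, assuming the classic ctags format where the
third field is an ex command (a bare line number is a valid one):

```python
def otags_to_ctags(otags):
    """otags: list of (symbol, filename, line) triples.
    Returns the text of a minimal tags file: TAB-separated fields,
    sorted by tag name, one tag per line."""
    out = []
    for symbol, filename, line in sorted(otags):
        # Third field is an ex command; a line number suffices.
        out.append('%s\t%s\t%d' % (symbol, filename, line))
    return '\n'.join(out) + '\n'
```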

- Leaving out VARIABLES, because those are local.
  - I think the 'use' lexer is dynamic, sort of like it is in Vim.
  - 'Find uses' can be approximated with `grep -n`?  I think that simplifies
    things a lot.
  - It's good practice for code to be greppable.
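
The `grep -n` approximation of 'find uses' can be done in-process with a line
scan; this sketch is the equivalent of `grep -nw symbol` (names illustrative):

```python
import re

def find_uses(symbol, filename, text):
    """Approximate 'find uses': report every line where symbol occurs
    as a whole word, as (filename, line_number, line) tuples."""
    pat = re.compile(r'\b%s\b' % re.escape(symbol))
    return [(filename, n, line)
            for n, line in enumerate(text.splitlines(), 1)
            if pat.search(line)]
```

This is approximate by design: it can't distinguish shadowed locals or
same-named symbols in different scopes, which is exactly the trade-off the
note above accepts.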

### Languages

Note: All our source code, and generated Python and C++ code, should be
lexable like this.  Put it all in `src-tree.wwz`.

- Shell:
  - comments
  - `'' "" $''` string literals
  - here docs
  - functions
  - understand `{ }` matching?

- YSH
  - strings `j""`
  - multiline strings `''' """ j"""`
  - proc def
  - func def

- Python
  - `#` comments
  - `"" ''` strings
  - multi-line strings
  - these may require INDENT/DEDENT tokens
  - class
  - def
  - does it understand `state.Mem`?  Probably.
    - Vim only understands `Mem`, though.  We might be able to convince it to.
  - Reference:
    - We may also need a fast whole-file lexer for `var_name` and
      `package.Var`, which does dynamic lookup.

- C++
  - `//` comments
  - `/* */` comments
  - preprocessor `#if #define`
  - multi-line strings in generated code
  - Parsing:
    - `class` declarations, with method declarations
    - function declarations (prototypes)
      - these are a bit hard -- do they require parsing?
    - function and method definitions
      - including templates?

- ASDL
  - `#` comments
  - I guess every single type can have a line number
    - it shouldn't jump to the Python file
    - `value_e.Str` and `value.Str` and `value_t` can jump to the right
      definition

- R: `#` comments and `"\n"` strings

### More languages

- JS: `//` and `/* */` comments, and backquoted template literals
- CSS: `/* */`
  - there are no real symbols to extract here
- YAML: `#` comments and strings
  - there's no parsing, just highlighting
- Markdown
  - the headings would be nice -- other stuff is more complex
  - the `==` and `--` styles require lookahead; they're not line-based
    - so it needs a different model than `ScanOne()`

- spec tests
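
The Markdown lookahead problem above can be made concrete: a `==` / `--`
setext heading is only recognizable after seeing the *next* line, so a
strictly line-at-a-time scanner can't classify it.  A sketch with one line of
lookahead (function names are illustrative):

```python
import re

# A setext underline is a line of all = or all -, optional trailing spaces.
UNDERLINE = re.compile(r'^(=+|-+)\s*$')

def setext_headings(lines):
    """Return (line_number, text) for each setext-style heading:
    a non-blank text line followed by an underline of = or -."""
    headings = []
    for i in range(len(lines) - 1):  # stop one early: we peek at i + 1
        if lines[i].strip() and UNDERLINE.match(lines[i + 1]):
            headings.append((i + 1, lines[i].strip()))
    return headings
```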