Micro Syntax
============

Lightweight, polyglot syntax analysis.

Motivations:

- YSH needs syntax highlighters, and this code is a GUIDE to writing one.
- The lexer should run on its own. Generated parsers like TreeSitter require
  such a standalone lexer anyway: unlike recursive descent, their grammars
  can't specify lexer modes.

Our own dev tools:

- The Github source viewer is too slow. We want to publish a fast version of
  our source code to view.
- Our docs need to link to source code.
- Github source viewing is APPROXIMATE anyway, because they don't execute your
  build; they don't have your ENV. They would have to "solve the halting
  problem".
- So let's be FAST and approximate, not SLOW and approximate.

- Multiple attempts at this polyglot problem
  - github/semantic in Haskell
  - facebook/pfff -- semgrep heritage

- Aesthetics
  - I don't like noisy keyword highlighting. Just comments and string
    literals look surprisingly good.
  - Can use this on the blog too.
- HTML equivalent of showsh, showpy -- quickly jump to definitions
- I think I can generate better ctags than `devtools/ctags.sh`! It's a simple
  format.
- I realized that "sloccount" is the same problem as syntax highlighting --
  you exclude comments, whitespace, and lines with only string literals (see
  the sketch after this list).
- sloccount is a huge Perl codebase, and we can stop depending on that.

- could this be used to spell check comments?
  - look at the sed tool in the PR from Martin

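A sketch of that sloccount observation, with made-up `Span` fields: once a
highlighter produces typed spans, counting source lines is a small query over
them.

```cpp
#include <set>
#include <string>
#include <vector>

struct Span {
  std::string type;  // hypothetical names: "comment", "string", "ws", "code"
  int line;
};

// A line counts as code if at least one of its spans is something other than
// a comment, a string literal, or whitespace -- the same exclusions a syntax
// highlighter already computes.
int CountSloc(const std::vector<Span>& spans) {
  std::set<int> code_lines;
  for (const Span& s : spans) {
    if (s.type != "comment" && s.type != "string" && s.type != "ws") {
      code_lines.insert(s.line);
    }
  }
  return static_cast<int>(code_lines.size());
}
```
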
Other:

- Because re2c is fun, and I wanted to experiment with writing it directly.
- Ideas
  - use this on your blog?
  - embed in a text editor? Can it be incremental?

## Related

Positively inspired:

- uchex static analysis paper (2016)
- ctags

(and re2c itself)

Also see my comment on: Rust is the future of JavaScript infrastructure -- you
need Rust/C++ semantics to be fast. We're using C++ because it's already in
our codebase, but Rust is probably better for collaboration. (I trust myself
to use ASAN and develop with it on, but I don't want to review code from
people who haven't used ASAN :-P )

Negatively inspired:

- Github source viewer
- tree-sitter-bash, and to some degree seeing semgrep using tree-sitter-bash
- huge amount of Perl code in sloccount
- to some extent, also ctags -- low-level C code

## TODO

- `--long-flags` in C++, probably
- Export to parser combinators
- Export to ctags

## Algorithm Notes

Two-pass algorithm with StartLine:

First pass:

- Lexer modes with no lookahead or lookbehind (see the sketch below)
- This is "Pre-structuring", as we do in Oils!

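Here's a minimal sketch of what a mode-switching scan over one line can look
like, for shell-ish input. The `Mode` and `Id` names are invented, and this is
plain C++ rather than the real re2c lexer; the point is that every decision
looks at the current byte only.

```cpp
#include <string>
#include <vector>

enum class Mode { Outer, DQ };         // DQ: inside a "..." string
enum class Id { Comment, Str, Other };

struct Span {
  Id id;
  size_t start, end;  // byte offsets, [start, end)
};

std::vector<Span> ScanOne(const std::string& line) {
  std::vector<Span> spans;
  Mode mode = Mode::Outer;
  size_t start = 0;
  for (size_t i = 0; i < line.size(); ++i) {
    if (mode == Mode::Outer) {
      if (line[i] == '#') {  // comment runs to end of line
        if (i > start) spans.push_back({Id::Other, start, i});
        spans.push_back({Id::Comment, i, line.size()});
        return spans;
      } else if (line[i] == '"') {  // switch modes on the current byte only
        if (i > start) spans.push_back({Id::Other, start, i});
        start = i;
        mode = Mode::DQ;
      }
    } else {  // Mode::DQ
      if (line[i] == '\\') {
        ++i;  // skip the escaped character
      } else if (line[i] == '"') {
        spans.push_back({Id::Str, start, i + 1});
        start = i + 1;
        mode = Mode::Outer;
      }
    }
  }
  // An unterminated string continues on the next line; a real lexer returns
  // the final mode so the caller can carry it over.
  if (start < line.size()) {
    spans.push_back(
        {mode == Mode::DQ ? Id::Str : Id::Other, start, line.size()});
  }
  return spans;
}
```
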
Second pass:

- Python - StartLine WS -> Indent/Dedent
- C++ - StartLine MaybePreproc LineCont -> preprocessor

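The Python case can then be a small token-to-token rewrite. The `Token` shape
below is a made-up stand-in for the first pass's output; blank and
comment-only lines, which real Python must skip, aren't handled here, and
inconsistent dedents (an error in Python) aren't detected.

```cpp
#include <vector>

enum class Id { StartLine, WS, Other, Indent, Dedent };

struct Token {
  Id id;
  int line;
  int width;  // for WS: number of leading spaces
};

// Rewrite (StartLine, WS?) pairs into Indent/Dedent tokens by comparing each
// line's indentation against a stack of enclosing levels.
std::vector<Token> Structure(const std::vector<Token>& in) {
  std::vector<Token> out;
  std::vector<int> levels = {0};
  for (size_t i = 0; i < in.size(); ++i) {
    if (in[i].id == Id::StartLine) {
      int width = 0;
      if (i + 1 < in.size() && in[i + 1].id == Id::WS) {
        width = in[i + 1].width;
      }
      while (width < levels.back()) {  // may dedent several levels at once
        levels.pop_back();
        out.push_back({Id::Dedent, in[i].line, 0});
      }
      if (width > levels.back()) {  // at most one indent per line
        levels.push_back(width);
        out.push_back({Id::Indent, in[i].line, 0});
      }
      // StartLine and the WS after it are consumed, not copied through
    } else if (in[i].id != Id::WS) {
      out.push_back(in[i]);
    }
  }
  return out;
}
```
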
Q: Are here docs first pass or second pass?

TODO:

- C++
  - arbitrary raw strings like `R"zZXx(` (see the sketch below)
- Shell
  - YSH multi-line strings

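Arbitrary raw strings are awkward because the closing `)zZXx"` must match the
opening delimiter -- a backreference, which a DFA-based lexer like re2c can't
express. One plausible approach is a hand-written helper that the generated
lexer calls out to; `ScanRawString` is made up for this sketch.

```cpp
#include <cstddef>
#include <string>

// 'pos' points at the R of R"delim( ... )delim".  Returns the index one past
// the closing quote, or npos if the string is unterminated.
size_t ScanRawString(const std::string& src, size_t pos) {
  size_t open = src.find('(', pos + 2);  // delimiter runs from R" to (
  if (open == std::string::npos) return std::string::npos;
  std::string delim = src.substr(pos + 2, open - (pos + 2));
  std::string close = ")" + delim + "\"";  // e.g. )zZXx"
  size_t end = src.find(close, open + 1);
  if (end == std::string::npos) return std::string::npos;
  return end + close.size();
}
```
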
Parsing:

- Name tokens should also have contents?
  - at least for Python and C++
- shell: we want these at the start of a line:
  - `proc X`, `func X`, `f()`
  - not `echo proc X`
- Some kind of parser combinator library to match definitions (a crude version
  is sketched below)
  - like showpy, showsh, but you can export to HTML with line numbers and
    anchors

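A crude stand-in for those combinators, to show why the start-of-line
condition matters. The `Tok` shape with an `at_line_start` flag is an
assumption -- it's the kind of fact the first pass's StartLine tokens make
cheap to compute. Matching `f()` definitions would need one more lookahead
pattern, omitted here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Tok {
  std::string text;
  bool at_line_start;  // derived from the first pass's StartLine tokens
};

// Finds 'proc X' and 'func X' definitions, but only when the keyword is the
// first word on its line -- so 'echo proc X' is not a false positive.
std::vector<std::string> ShellDefs(const std::vector<Tok>& toks) {
  std::vector<std::string> defs;
  for (size_t i = 0; i + 1 < toks.size(); ++i) {
    if (toks[i].at_line_start &&
        (toks[i].text == "proc" || toks[i].text == "func")) {
      defs.push_back(toks[i + 1].text);
    }
  }
  return defs;
}
```
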
### Design Question

- can these lexers be made incremental?
  - run on every keystroke? Supposedly IntelliJ does that.
  - <https://www.jetbrains.com/help/resharper/sdk/ImplementingLexers.html#strongly-typed-lexers>
  - but if you reuse Python's lexer, it's probably not incremental
    - see Python's tokenize.py

## Notes

Why not reuse off-the-shelf tools?

1. Because we are a POLYGLOT codebase.
2. Because we care about speed. (e.g. Github's source viewer is super slow
   now!)
   - and I think we can do a little bit better than `devtools/ctags.sh`.
   - That is, we can generate a better tags file.

We output 2 things:

1. A list of spans
   - type. TODO: see Vim and TextMate types: comment, string, definition
   - location: line, begin:end col
2. A list of "OTAGS"
   - SYMBOL FILENAME LINE
   - generate ctags from this
   - generate HTML or JSON from this
   - recall the Woboq code browser was entirely static, in C++
     - they used `compile_commands.json`

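As a sketch, the two outputs could be structs like these, with the OTAGS
triples translating directly into the tags file format that Vim reads
(`NAME<Tab>FILE<Tab>EXCMD`, sorted by name, where a bare line number is a
valid ex command). Field names are assumptions, not the tool's actual schema.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Span {        // output 1: for highlighting
  std::string type;  // comment, string, definition, ...
  int line;
  int start_col, end_col;
};

struct Otag {  // output 2: for jumping to definitions
  std::string symbol;
  std::string filename;
  int line;
};

void WriteCtags(std::vector<Otag> tags, FILE* f) {
  // tags files must be sorted so editors can binary-search them
  std::sort(tags.begin(), tags.end(),
            [](const Otag& a, const Otag& b) { return a.symbol < b.symbol; });
  for (const Otag& t : tags) {
    fprintf(f, "%s\t%s\t%d\n", t.symbol.c_str(), t.filename.c_str(), t.line);
  }
}
```
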
- Leaving out VARIABLES, because those are local.
  - I think the 'use' lexer is dynamic, sort of like it is in Vim.
  - 'find uses' can be approximated with `grep -n`? I think that simplifies
    things a lot
    - it's good practice for code to be greppable

### Languages

Note: All our source code, and generated Python and C++ code, should be lexable
like this. Put it all in `src-tree.wwz`.

- Shell:
  - comments
  - `'' "" $''` string literals
  - here docs
  - functions
  - understand `{ }` matching?

- YSH
  - strings `j""`
  - multi-line strings `''' """ j"""`
  - proc def
  - func def

- Python
  - `#` comments
  - `"" ''` strings
  - multi-line strings
  - these may require INDENT/DEDENT tokens
  - class
  - def
  - does it understand `state.Mem`? Probably
    - Vim only understands `Mem` though. We might be able to convince it to.
  - Reference:
    - We may also need a fast whole-file lexer for `var_name` and `package.Var`,
      which does dynamic lookup.

- C++
  - `//` comments
  - `/* */` comments
  - preprocessor `#if #define`
  - multi-line strings in generated code
  - Parsing:
    - `class` declarations, with method declarations
    - function declarations (prototypes)
      - these are a bit hard - do they require parsing?
    - function and method definitions
      - including templates?

- ASDL
  - `#` comments
  - I guess every single type can have a line number
  - it shouldn't jump to the Python file
  - `value_e.Str` and `value.Str` and `value_t` can jump to the right
    definition

- R: `#` comments and `"\n"` strings

### More languages

- JS: `//` and `/* */` comments, and `` ` `` for templates
- CSS: `/* */` comments
  - there are no real symbols to extract here
- YAML: `#` comments and strings
  - there's no parsing, just highlighting
- Markdown
  - the headings would be nice -- other stuff is more complex
  - the `==` and `--` styles require lookahead; they're not line-based
  - so it needs a different model than `ScanOne()` (see the sketch at the end)

- spec tests
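
To illustrate the Markdown point above: whether a line is a setext heading
depends on the line after it, so a line-at-a-time `ScanOne()` can't decide by
itself. A sketch with an explicit one-line lookahead (simplified; real
Markdown has more rules, e.g. for `-` underlines vs. list items):

```cpp
#include <cstddef>
#include <string>
#include <vector>

static bool IsUnderline(const std::string& s, char c) {
  if (s.empty()) return false;
  for (char ch : s) {
    if (ch != c) return false;
  }
  return true;
}

// Returns the indices of setext-style headings.  Classifying line i needs
// line i + 1, which is exactly the lookahead ScanOne() doesn't have.
std::vector<size_t> FindHeadings(const std::vector<std::string>& lines) {
  std::vector<size_t> headings;
  for (size_t i = 0; i + 1 < lines.size(); ++i) {
    if (!lines[i].empty() &&
        (IsUnderline(lines[i + 1], '=') || IsUnderline(lines[i + 1], '-'))) {
      headings.push_back(i);
    }
  }
  return headings;
}
```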