doctools/micro-syntax.md

1	Micro Syntax
2	============
3
4	Lightweight, polyglot syntax analysis.
5
6	Motivations:
7
8	- YSH needs syntax highlighters, and this code is a GUIDE to writing one.
9	- The lexer should run on its own. Generated parsers like TreeSitter
10	require such a lexer. In contrast to recursive descent, grammars can't
11	specify lexer modes.
12
13	Our own dev tools:
14
15	- The Github source viewer is too slow. We want to publish a fast version of
16	our source code to view.
17	- Our docs need to link to source code.
18	- Github source viewing is APPROXIMATE anyway, because they don't execute your
19	build; they don't have ENV. They would have to "solve the halting problem".
20	- So let's be FAST and approximate, not SLOW and approximate.
21
22	- Multiple attempts at this polyglot problem
23	- github/semantic in Haskell
24	- facebook/pfff -- semgrep heritage
25
26	- Aesthetics
27	- I don't like noisy keyword highlighting. Just comments and string
28	literals looks surprisingly good.
29	- Can use this on the blog too.
30	- HTML equivalent of showsh, showpy -- quickly jump to definitions
31	- I think I can generate better ctags than `devtools/ctags.sh`! It's a simple
32	format.
33	- I realized that "sloccount" is the same problem as syntax highlighting --
34	you exclude comments, whitespace, and lines with only string literals.
35	- sloccount is a huge Perl codebase, and we can stop depending on that.
36
37	- could be used to spell check comments?
38	- look at the tool sed in the PR from Martin
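
The sloccount observation above can be sketched in a few lines. This is only an illustration, assuming a lexer that yields `(kind, line)` pairs; the kind names `Comm`, `WS`, `Str`, and `Other` are placeholders, not necessarily what the real tool emits.

```python
from typing import List, Tuple

# Hypothetical token kinds that do NOT make a line significant.
INSIGNIFICANT = {'Comm', 'WS', 'Str'}

Token = Tuple[str, int]  # (kind, 1-based line number)


def count_significant(tokens: List[Token]) -> int:
    """Count lines that have at least one token other than a comment,
    whitespace, or a string literal."""
    significant = set()
    for kind, line in tokens:
        if kind not in INSIGNIFICANT:
            significant.add(line)
    return len(significant)
```

The same token stream could drive a highlighter, which is the sense in which the two problems are the same.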
39
40	Other:
41
42	- Because re2c is fun, and I wanted to experiment with writing it directly.
43	- Ideas
44	- use this on your blog?
45	- embed in a text editor? Can it be incremental?
46
47	## Related
48
49	Positively inspired:
50
51	- uchex static analysis paper (2016)
52	- ctags
53
54	(and re2c itself)
55
56	Also see my comment on: Rust is the future of JavaScript infrastructure -- you
57	need Rust/C++ semantics to be fast. We're using C++ because it's already in
58	our codebase, but Rust is probably better for collaboration. (I trust myself
59	to use ASAN and develop with it on, but I don't want to review code from people
60	who haven't used ASAN :-P )
61
62
63	Negatively inspired:
64
65	- Github source viewer
66	- tree-sitter-bash, and to some degree seeing semgrep using tree-sitter-bash
67	- huge amount of Perl code in sloccount
68	- to some extent, also ctags -- low-level C code
69
70	## TODO
71
72	- `--long-flags` in C++, probably
73	- Export to parser combinators
74	- Export to ctags
75
76	## Algorithm Notes
77
78	Two-pass algorithm with StartLine:
79
80	First pass:
81
82	- Lexer modes with no lookahead or lookbehind
83	- This is "Pre-structuring", as we do in Oils!
84
85	Second pass:
86
87	- Python - StartLine WS -> Indent/Dedent
88	- C++ - StartLine MaybePreproc LineCont -> preprocessor
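
A rough sketch of what the Python second pass might look like. It assumes the first pass emits a `StartLine` token at the start of every line, optionally followed by a single `WS` token; the `(kind, text)` tuple shape and the `Indent` / `Dedent` names are assumptions for illustration.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text); kind names are assumptions


def add_indent_dedent(tokens: List[Token]) -> Iterator[Token]:
    """Second pass: turn StartLine + WS into Indent / Dedent tokens.

    Simplifications: blank and comment-only lines are not special-cased,
    a tab counts as one column, inconsistent dedents are not reported, and
    no Dedent tokens are emitted at end of input.
    """
    stack = [0]  # indentation widths of the enclosing blocks
    i = 0
    n = len(tokens)
    while i < n:
        kind, _ = tokens[i]
        if kind != 'StartLine':
            yield tokens[i]
            i += 1
            continue
        # The lookahead lives here, not in the first pass.
        width = 0
        if i + 1 < n and tokens[i + 1][0] == 'WS':
            width = len(tokens[i + 1][1])
            i += 1  # consume the WS token
        i += 1  # consume the StartLine token
        if width > stack[-1]:
            stack.append(width)
            yield ('Indent', '')
        else:
            while width < stack[-1]:
                stack.pop()
                yield ('Dedent', '')
```

The first pass stays a plain lexer with no lookahead; all of the line-structure logic is confined to this pass.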
89
90	Q: Are here docs first pass or second pass?
91
92	TODO:
93
94	- C++
95	- arbitrary raw strings R"zZXx(
96	- Shell
97	- YSH multi-line strings
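
To show why arbitrary C++ raw strings are awkward: the closing delimiter must repeat whatever followed `R"`, so a match needs something like a backreference, which DFA-based lexers generally can't express directly. A minimal Python sketch, for illustration only (encoding prefixes like `u8R` and raw strings inside comments are ignored):

```python
import re

# R"delim( ... )delim" where delim is at most 16 characters and excludes
# parentheses, backslashes, and whitespace.  The backreference \1 ties the
# closing delimiter to the opening one.
RAW_STRING = re.compile(r'R"([^()\\\s]{0,16})\((.*?)\)\1"', re.DOTALL)


def find_raw_strings(code: str):
    """Return (delimiter, body) pairs for each C++ raw string literal found."""
    return [(m.group(1), m.group(2)) for m in RAW_STRING.finditer(code)]
```

In a hand-written or generated lexer, the likely equivalent is to capture the delimiter once and then scan forward for `)delim"` in ordinary code.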
98
99	Parsing:
100
101	- Name tokens should also have contents?
102	- at least for Python and C++
103	- shell: we want these at start of line:
104	- proc X, func X, f()
105	- not echo proc X
106	- Some kind of parser combinator library to match definitions
107	- like showpy, showsh, but you can export to HTML with line numbers, and
108	anchor
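
The "start of line" rule for shell definitions can be approximated with a regex before any parser combinators exist. This is a sketch under assumptions: the name pattern and the `(name, line)` output shape are made up here, and `function foo { }` style definitions are not handled.

```python
import re

# Definitions are only recognized at the start of a line (after optional
# whitespace), so 'echo proc X' does not match.
DEF_RE = re.compile(r'''
    ^[ \t]*
    (?:
        (?P<kw>proc|func)[ \t]+(?P<name1>[A-Za-z_][A-Za-z0-9_-]*)   # proc X, func X
      | (?P<name2>[A-Za-z_][A-Za-z0-9_-]*)[ \t]*\(\)                # f()
    )
''', re.VERBOSE)


def find_defs(text: str):
    """Return (name, 1-based line number) for each definition-like line."""
    defs = []
    for line_num, line in enumerate(text.splitlines(), start=1):
        m = DEF_RE.match(line)
        if m:
            defs.append((m.group('name1') or m.group('name2'), line_num))
    return defs
```

It is line-based and knows nothing about strings or here docs, so a real version would run on the token stream instead of raw text.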
109
110	### Design Question
111
112	- can they be made incremental?
113	- run on every keystroke? Supposedly IntelliJ does that.
114	- <https://www.jetbrains.com/help/resharper/sdk/ImplementingLexers.html#strongly-typed-lexers>
115	- but if you reuse Python's lexer, it's probably not incremental
116	- see Python's tokenize.py
117
118	## Notes
119
120	Why not reuse off-the-shelf tools?
121
122	1. Because we are a POLYGLOT codebase.
123	1. Because we care about speed. (e.g. Github's source viewer is super slow
124	now!)
125	- and I think we can do a little bit better than `devtools/ctags.sh`.
126	- That is, we can generate a better tags file.
127
128	We output 2 things:
129
130	1. A list of spans
131	- type. TODO: see Vim and textmate types: comment, string, definition
132	- location: line, begin:end col
133	2. A list of "OTAGS"
134	- SYMBOL FILENAME LINE
135	- generate ctags from this
136	- generate HTML or JSON from this
137	- recall Woboq code browser was entirely static, in C++
138	- they used `compile_commands.json`
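
Generating ctags from OTAGS is mostly a formatting step, since a basic tags line is just name, file, and address separated by tabs. A minimal sketch, assuming OTAGS records arrive as `(SYMBOL, FILENAME, LINE)` tuples; the function name and the choice to use line numbers as addresses are illustrative, not what the tool does today.

```python
from typing import List, Tuple

Otag = Tuple[str, str, int]  # (SYMBOL, FILENAME, LINE)


def otags_to_ctags(otags: List[Otag]) -> str:
    """Render OTAGS records as a minimal ctags file.

    Each tag line is: name <TAB> file <TAB> address, where the address is a
    line number.  Records are sorted so the header can say the file is sorted.
    """
    lines = ['!_TAG_FILE_SORTED\t1\t/0=unsorted, 1=sorted, 2=foldcase/']
    for symbol, filename, line in sorted(otags):
        lines.append('%s\t%s\t%d' % (symbol, filename, line))
    return '\n'.join(lines) + '\n'
```

For example, `otags_to_ctags([('Mem', 'core/state.py', 100)])` yields the header plus one `Mem<TAB>core/state.py<TAB>100` line (the path here is made up).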
139
140	- Leaving out VARIABLES, because those are local.
141	- I think the 'use' lexer is dynamic, sort of like it is in Vim.
142	- 'find uses' can be approximated with `grep -n`? I think that simplifies
143	things a lot
144	- it's a good practice for code to be greppable
145
146	### Languages
147
148	Note: All our source code, and generated Python and C++ code, should be lexable
149	like this. Put it all in `src-tree.wwz`.
150
151	- Shell:
152	- comments
153	- `'' "" $''` string literals
154	- here docs
155	- functions
156	- understand `{ }` matching?
157
158	- YSH
159	- strings `j""`
160	- multiline strings `''' """ j"""`
161	- proc def
162	- func def
163
164	- Python
165	- # comments
166	- `"" ''` strings
167	- multi-line strings
168	- these may require INDENT/DEDENT tokens
169	- class
170	- def
171	- does it understand `state.Mem`? Probably
172	- vim only understands `Mem` though. We might be able to convince it to.
173	- Reference:
174	- We may also need a fast whole-file lexer for `var_name` and `package.Var`,
175	which does dynamic lookup.
176
177	- C++
178	- `//` comments
179	- `/* */` comments
180	- preprocessor `#if #define`
181	- multi-line strings in generated code
182	- Parsing:
183	- `class` declarations, with method declarations
184	- function declarations (prototypes)
185	- these are a bit hard - do they require parsing?
186	- function and method definition
187	- including templates?
188
189	- ASDL
190	- # comments
191	- I guess every single type can have a line number
192	- it shouldn't jump to Python file
193	- `value_e.Str` and `value.Str` and `value_t` can jump to the right
194	definition
195
196	- R # comments and "\n" strings
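
For the simplest cases in the list above (`#` comments plus single-line `""` and `''` strings, as in Python, shell, or R), a span scanner might look like this. It is a sketch, not the real lexer: multi-line strings, here docs, and `$''` are left out, and the span tuple shape is an assumption based on the "type, line, begin:end col" description.

```python
import re
from typing import List, Tuple

# Scanning left to right means a '#' inside a string literal is consumed as
# part of the string, and a quote inside a comment as part of the comment.
TOKEN_RE = re.compile(r'''
    (?P<Str>   "(?:[^"\\\n]|\\.)*" | '(?:[^'\\\n]|\\.)*' )
  | (?P<Comm>  \#[^\n]* )
  | (?P<Other> [^"'\#\n]+ | . )
''', re.VERBOSE)

Span = Tuple[str, int, int, int]  # (type, line, begin col, end col)


def scan_spans(text: str) -> List[Span]:
    """Return comment and string spans, with 1-based lines and 0-based cols."""
    spans = []
    for line_num, line in enumerate(text.splitlines(), start=1):
        for m in TOKEN_RE.finditer(line):
            if m.lastgroup in ('Str', 'Comm'):
                spans.append((m.lastgroup, line_num, m.start(), m.end()))
    return spans
```

An unterminated quote falls through to `Other` one character at a time, so the scanner always makes progress.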
197
198	### More languages
199
200	- JS `//` and `/* */` and `` ` `` for templates
201	- CSS `/* */`
202	- there are no real symbols to extract here
203	- YAML - `#` and strings
204	- there's no parsing, just highlighting
205	- Markdown
206	- the headings would be nice -- other stuff is more complex
207	- the `==` and `--` styles require lookahead; they're not line-based
208	- so it needs a different model than `ScanOne()`
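
To make the lookahead point concrete: `## Title` headings can be found one line at a time, but `==` and `--` headings are only recognizable once the NEXT line has been seen. A small sketch of that different model, assuming whole-file access; it ignores fenced code blocks and the cases where `---` is a horizontal rule, and the function name is made up.

```python
import re
from typing import List, Tuple

UNDERLINE_RE = re.compile(r'^(=+|-+)[ \t]*$')
ATX_RE = re.compile(r'^(#{1,6})[ \t]+(.*?)[ \t]*$')


def find_headings(text: str) -> List[Tuple[int, int, str]]:
    """Return (level, 1-based line number, title) for each heading."""
    lines = text.splitlines()
    headings = []
    for i, line in enumerate(lines):
        m = ATX_RE.match(line)
        if m:
            headings.append((len(m.group(1)), i + 1, m.group(2)))
            continue
        # One line of lookahead: is the next line all '=' or all '-'?
        if line.strip() and i + 1 < len(lines):
            u = UNDERLINE_RE.match(lines[i + 1])
            if u and not UNDERLINE_RE.match(line):
                level = 1 if u.group(1)[0] == '=' else 2
                headings.append((level, i + 1, line.strip()))
    return headings
```

Run on this document, it would report `Micro Syntax` as a level 1 heading and `Related` as level 2.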
209
210	- spec tests