| 1 | Micro Syntax
 | 
| 2 | ============
 | 
| 3 | 
 | 
| 4 | Lightweight, polyglot syntax analysis.
 | 
| 5 | 
 | 
| 6 | Motivations:
 | 
| 7 | 
 | 
| 8 | - YSH needs syntax highlighters, and this code is a GUIDE to writing one.
 | 
| 9 |   - The lexer should run on its own.  Generated parsers like TreeSitter
 | 
| 10 |     require such a lexer.  In contrast to recursive descent, grammars can't
 | 
| 11 |     specify lexer modes.
 | 
| 12 | 
 | 
| 13 | Our own dev tools:
 | 
| 14 | 
 | 
| 15 | - The Github source viewer is too slow.  We want to publish a fast version of
 | 
| 16 |   our source code to view.
 | 
| 17 |   - Our docs need to link to source code.
 | 
| 18 |   - Github source viewing is APPROXIMATE anyway, because they don't execute your
 | 
| 19 |     build; they don't have ENV.  They would have to "solve the halting problem".
 | 
| 20 |     - So let's be FAST and approximate, not SLOW and approximate.
 | 
| 21 | 
 | 
| 22 | - Multiple attempts at this polyglot problem
 | 
| 23 |   - github/semantic in Haskell
 | 
| 24 |   - facebook/pfff -- semgrep heritage
 | 
| 25 | 
 | 
| 26 | - Aesthetics
 | 
| 27 |   - I don't like noisy keyword highlighting.  Highlighting just comments and
 | 
| 28 |     string literals looks surprisingly good.
 | 
| 29 |   - Can use this on the blog too.
 | 
| 30 | - HTML equivalent of showsh, showpy -- quickly jump to definitions
 | 
| 31 | - I think I can generate better ctags than `devtools/ctags.sh`!  It's a simple
 | 
| 32 |   format.
 | 
| 33 | - I realized that "sloccount" is the same problem as syntax highlighting --
 | 
| 34 |   you exclude comments, whitespace, and lines with only string literals.
 | 
| 35 |   - sloccount is a huge Perl codebase, and we can stop depending on that.
 | 
| 36 | 
 | 
| 37 | - could be used to spell check comments?
 | 
| 38 |   - look at the tool used in the PR from Martin
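
The sloccount observation in the list above can be sketched in a few lines of Python. This is a toy under stated assumptions: a lexer has already produced per-line spans tagged `comment` or `string`, and the function name `count_sloc` and the tuple layout are made up for illustration. They are not this tool's actual output format.

```python
def count_sloc(lines, spans):
    """Count lines that still contain code once comment/string spans are blanked.

    lines: list of source lines, without trailing newlines
    spans: list of (line_index, begin_col, end_col, kind) -- hypothetical layout
    """
    by_line = {}
    for line_index, begin, end, kind in spans:
        if kind in ('comment', 'string'):
            by_line.setdefault(line_index, []).append((begin, end))

    count = 0
    for i, line in enumerate(lines):
        chars = list(line)
        for begin, end in by_line.get(i, []):
            for col in range(begin, min(end, len(chars))):
                chars[col] = ' '  # blank out non-code characters
        if ''.join(chars).strip():  # anything left is counted as code
            count += 1
    return count


SOURCE = [
    '# a comment',        # comment only
    '',                   # blank
    'x = 1  # trailing',  # code plus a comment
    '"""docstring"""',    # string literal only
]
SPANS = [
    (0, 0, 11, 'comment'),
    (2, 7, 17, 'comment'),
    (3, 0, 15, 'string'),
]
print(count_sloc(SOURCE, SPANS))
```

Only the third line counts here, which is the sense in which line counting and highlighting share one lexer.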
 | 
| 39 | 
 | 
| 40 | Other:
 | 
| 41 | 
 | 
| 42 | - Because re2c is fun, and I wanted to experiment with writing it directly.
 | 
| 43 | - Ideas
 | 
| 44 |   - use this on your blog?
 | 
| 45 |   - embed in a text editor?  Can it be incremental?
 | 
| 46 | 
 | 
| 47 | ## Related
 | 
| 48 | 
 | 
| 49 | Positively inspired:
 | 
| 50 | 
 | 
| 51 | - uchex static analysis paper (2016)
 | 
| 52 | - ctags
 | 
| 53 | 
 | 
| 54 | (and re2c itself)
 | 
| 55 | 
 | 
| 56 | Also see my comment on: Rust is the future of JavaScript infrastructure -- you
 | 
| 57 | need Rust/C++ semantics to be fast.  We're using C++ because it's already in
 | 
| 58 | our codebase, but Rust is probably better for collaboration.  (I trust myself
 | 
| 59 | to use ASAN and develop with it on, but I don't want to review code from
 | 
| 60 | people who haven't used ASAN :-P )
 | 
| 61 | 
 | 
| 62 | 
 | 
| 63 | Negatively inspired:
 | 
| 64 | 
 | 
| 65 | - Github source viewer
 | 
| 66 | - tree-sitter-bash, and to some degree seeing semgrep using tree-sitter-bash
 | 
| 67 | - huge amount of Perl code in sloccount
 | 
| 68 | - to some extent, also ctags -- low-level C code
 | 
| 69 | 
 | 
| 70 | ## TODO
 | 
| 71 | 
 | 
| 72 | - `--long-flags` in C++, probably
 | 
| 73 | - Export to parser combinators
 | 
| 74 |   - Export to ctags
 | 
| 75 | 
 | 
| 76 | ## Algorithm Notes
 | 
| 77 | 
 | 
| 78 | Two-pass algorithm with StartLine:
 | 
| 79 | 
 | 
| 80 | First pass:
 | 
| 81 | 
 | 
| 82 | - Lexer modes with no lookahead or lookbehind
 | 
| 83 | - This is "Pre-structuring", as we do in Oils!
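
As a rough illustration of "lexer modes with no lookahead", here is a minimal sketch in Python. The real lexer is written with re2c; Python's `re` module is only a stand-in, and the mode names, token names, and `lex` function are hypothetical. Each mode is a single regex, and certain tokens switch the mode, so the lexer only ever looks at the current position.

```python
import re

# One regex per mode; each alternative is a named token kind.
# Hypothetical token names, not the ones used by the real lexer.
MODES = {
    'outer': re.compile(r'''
        (?P<Comm>\#[^\n]*)
      | (?P<DQ_Begin>")
      | (?P<WS>[ \t\n]+)
      | (?P<Other>[^#" \t\n]+)
    ''', re.VERBOSE),
    'dq': re.compile(r'''
        (?P<DQ_End>")
      | (?P<Str>(?:[^"\\]|\\.)+)
    ''', re.VERBOSE | re.DOTALL),
}

# Tokens that switch the lexer to another mode.
TRANSITIONS = {'DQ_Begin': 'dq', 'DQ_End': 'outer'}


def lex(s):
    """Yield (kind, text) pairs, consulting only the current mode."""
    mode, pos = 'outer', 0
    while pos < len(s):
        m = MODES[mode].match(s, pos)
        if m is None:  # e.g. a lone backslash at the end of the input
            yield ('Error', s[pos:])
            return
        kind = m.lastgroup
        yield (kind, m.group())
        mode = TRANSITIONS.get(kind, mode)
        pos = m.end()


tokens = list(lex('echo "a # b"  # real comment'))
for kind, text in tokens:
    print(kind, repr(text))
```

The `#` inside the double quotes comes out as part of a `Str` token rather than starting a comment, which is the point of having a separate mode for string contents.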
 | 
| 84 | 
 | 
| 85 | Second pass:
 | 
| 86 | 
 | 
| 87 | - Python - StartLine WS -> Indent/Dedent
 | 
| 88 | - C++ - StartLine MaybePreproc LineCont -> preprocessor
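
The Python rule above (`StartLine WS -> Indent/Dedent`) might look roughly like this as a second pass. It is a simplified sketch, not the tool's code: it works on raw lines instead of first-pass tokens, assumes spaces only, and the name `indent_dedent` and the `('Line', n)` marker are invented for the example.

```python
def indent_dedent(lines):
    """Second pass: turn leading whitespace into Indent/Dedent tokens.

    Simplified: spaces only, and blank lines are skipped.  A fuller pass
    would also skip comment-only lines and lines inside brackets or
    multi-line strings, which a first pass would already have identified.
    """
    stack = [0]  # enclosing indentation widths
    out = []
    for line_num, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)
            out.append(('Indent', line_num))
        while width < stack[-1]:
            stack.pop()
            out.append(('Dedent', line_num))
        # Inconsistent dedents (width != stack[-1] here) are ignored.
        out.append(('Line', line_num))
    for _ in stack[1:]:  # close blocks still open at end of file
        out.append(('Dedent', len(lines) + 1))
    return out


EXAMPLE = ['def f():', '  if x:', '    y()', '', 'z()']
for tok in indent_dedent(EXAMPLE):
    print(tok)
```

Because the stack only depends on the whitespace at the start of each line, this pass needs nothing from the first pass beyond knowing which lines really begin a statement.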
 | 
| 89 | 
 | 
| 90 | Q: Are here docs first pass or second pass?
 | 
| 91 | 
 | 
| 92 | TODO:
 | 
| 93 | 
 | 
| 94 | - C++
 | 
| 95 |   - arbitrary raw strings R"zZXx(
 | 
| 96 | - Shell
 | 
| 97 |   - YSH multi-line strings
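
For the C++ raw string TODO, the difficulty is that the closing delimiter has to repeat whatever followed `R"`, which a plain regular lexer cannot express. A minimal sketch of the matching rule, using a Python backreference as a stand-in (the names `RAW_STRING` and `find_raw_strings` are hypothetical, and a re2c lexer would more likely need a small hand-written scan for the closing delimiter):

```python
import re

# R"delim( ... )delim"  -- the delimiter is at most 16 characters and may not
# contain parentheses, backslashes, or whitespace.  \1 repeats the delimiter.
RAW_STRING = re.compile(r'R"([^()\\\s]{0,16})\((.*?)\)\1"', re.DOTALL)


def find_raw_strings(code):
    """Return (start, end, contents) for each raw string literal found."""
    return [(m.start(), m.end(), m.group(2)) for m in RAW_STRING.finditer(code)]


CODE = 'auto s = R"zZXx(one )" two)zZXx"; auto t = R"(x)";'
for start, end, contents in find_raw_strings(CODE):
    print(start, end, repr(contents))
```

The first literal contains `)"`, which would end a raw string with an empty delimiter, but not one that opened with `zZXx`. The sketch ignores encoding prefixes such as `u8R` and does not check that `R` starts a token.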
 | 
| 98 | 
 | 
| 99 | Parsing:
 | 
| 100 | 
 | 
| 101 | - Name tokens should also have contents?
 | 
| 102 |   - at least for Python and C++
 | 
| 103 |   - shell: we want these at start of line:
 | 
| 104 |     - proc X, func X, f()
 | 
| 105 |     - not echo proc X
 | 
| 106 | - Some kind of parser combinator library to match definitions
 | 
| 107 |   - like showpy, showsh, but you can export to HTML with line numbers, and
 | 
| 108 |     anchor
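
The shell rule above (definitions only count at the start of a line) can be sketched with one anchored pattern. This is an assumption-laden toy: `DEF_RE`, `find_defs`, and the `sh-func` label are invented, and it scans raw lines, whereas the real thing would run on tokens so that text inside strings, comments, and here docs is already excluded.

```python
import re

# A definition must begin the line, after optional indentation, so that
# 'echo proc X' is not reported.
DEF_RE = re.compile(r'''
    ^[ \t]*
    (?:
        (?P<kw>proc|func)[ \t]+(?P<name>[A-Za-z_][A-Za-z0-9_-]*)
      | (?P<sh_name>[A-Za-z_][A-Za-z0-9_-]*)[ \t]*\(\)
    )
''', re.VERBOSE)


def find_defs(lines):
    """Return (kind, name, line_number) for each definition-like line."""
    out = []
    for num, line in enumerate(lines, start=1):
        m = DEF_RE.match(line)
        if m:
            kind = m.group('kw') or 'sh-func'
            name = m.group('name') or m.group('sh_name')
            out.append((kind, name, num))
    return out


SCRIPT = [
    'proc X {',
    'func Y(a) {',
    'f() {',
    'echo proc X',
    '  my-func () {',
]
for d in find_defs(SCRIPT):
    print(d)
```

Each result already has the symbol and line number needed for an HTML anchor or a tags entry.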
 | 
| 109 | 
 | 
| 110 | ### Design Question
 | 
| 111 | 
 | 
| 112 | - Can the lexers be made incremental?
 | 
| 113 |   - run on every keystroke?  Supposedly IntelliJ does that.
 | 
| 114 |   - <https://www.jetbrains.com/help/resharper/sdk/ImplementingLexers.html#strongly-typed-lexers>
 | 
| 115 | - but if you reuse Python's lexer, it's probably not incremental
 | 
| 116 |   - see Python's tokenize.py
 | 
| 117 | 
 | 
| 118 | ## Notes
 | 
| 119 | 
 | 
| 120 | Why not reuse off-the-shelf tools?
 | 
| 121 | 
 | 
| 122 | 1. Because we are a POLYGLOT codebase.
 | 
| 123 | 1. Because we care about speed.  (e.g. Github's source viewer is super slow
 | 
| 124 |    now!)
 | 
| 125 |    - and I think we can do a little bit better than `devtools/ctags.sh`.
 | 
| 126 |    - That is, we can generate a better tags file.
 | 
| 127 | 
 | 
| 128 | We output 2 things:
 | 
| 129 | 
 | 
| 130 | 1. A list of spans
 | 
| 131 |    - type. TODO: see Vim and TextMate types: comment, string, definition
 | 
| 132 |    - location: line, begin:end col
 | 
| 133 | 2. A list of "OTAGS"
 | 
| 134 |    - SYMBOL FILENAME LINE
 | 
| 135 |    - generate ctags from this
 | 
| 136 |    - generate HTML or JSON from this
 | 
| 137 |      - recall Woboq code browser was entirely static, in C++
 | 
| 138 |      - they used `compile_commands.json`
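
To make the two outputs concrete, here is a small sketch of the records and of generating a ctags file from the OTAGS list. The record layouts follow the fields listed above, but the names `Span`, `OTag`, and `to_ctags`, and the example file paths, are made up for illustration. The tags format used is the basic `name<TAB>file<TAB>address` form, with a line number as the address.

```python
from collections import namedtuple

# Hypothetical record layouts based on the two outputs listed above.
Span = namedtuple('Span', 'kind line begin_col end_col')
OTag = namedtuple('OTag', 'symbol filename line')


def to_ctags(otags):
    """Render OTAGS as a sorted tags file, using line numbers as addresses."""
    lines = ['!_TAG_FILE_SORTED\t1\t/0=unsorted, 1=sorted/']
    for t in sorted(otags):
        lines.append('%s\t%s\t%d' % (t.symbol, t.filename, t.line))
    return '\n'.join(lines) + '\n'


# Example data; the paths and line numbers are invented.
SPANS = [Span('comment', 1, 0, 20), Span('string', 3, 8, 15)]
OTAGS = [
    OTag('Mem', 'example/state.py', 120),
    OTag('Lex', 'example/lexer.py', 7),
]
print(to_ctags(OTAGS), end='')
```

The same `OTag` list could feed an HTML or JSON writer instead, since the symbol, file, and line are all it takes to build a link.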
 | 
| 139 | 
 | 
| 140 | - Leaving out VARIABLES, because those are local.
 | 
| 141 |   - I think the 'use' lexer is dynamic, sort of like it is in Vim.
 | 
| 142 |   - 'find uses' can be approximated with `grep -n`?  I think that simplifies
 | 
| 143 |     things a lot
 | 
| 144 |     - it's a good practice for code to be greppable
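
The `grep -n` idea can be illustrated directly. The sketch below is a plain-text search over in-memory files, with whole-word matching added as an extra assumption (similar to `grep -n -w`). The function name `find_uses` and the sample files are hypothetical.

```python
import re


def find_uses(symbol, files):
    """Approximate 'find uses' with whole-word text matches.

    files: dict of filename -> source text (kept in memory for the sketch)
    Returns (filename, line_number, line) tuples, like `grep -n` output.
    """
    pat = re.compile(r'\b%s\b' % re.escape(symbol))
    hits = []
    for filename in sorted(files):
        for num, line in enumerate(files[filename].splitlines(), start=1):
            if pat.search(line):
                hits.append((filename, num, line))
    return hits


FILES = {
    'a.py': 'mem = state.Mem()\nx = Memory()\n',
    'b.py': '# Mem is used here\n',
}
for hit in find_uses('Mem', FILES):
    print('%s:%d:%s' % hit)
```

The hit inside a comment in `b.py` shows the approximation: a text search cannot tell a use from a mention, which is the trade-off being accepted.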
 | 
| 145 | 
 | 
| 146 | ### Languages
 | 
| 147 | 
 | 
| 148 | Note: All our source code, and generated Python and C++ code, should be lexable
 | 
| 149 | like this.  Put it all in `src-tree.wwz`.
 | 
| 150 | 
 | 
| 151 | - Shell:
 | 
| 152 |   - comments
 | 
| 153 |   - `'' "" $''` string literals
 | 
| 154 |   - here docs
 | 
| 155 |   - functions
 | 
| 156 |     - understand `{ }` matching?
 | 
| 157 | 
 | 
| 158 | - YSH
 | 
| 159 |   - strings `j""`
 | 
| 160 |   - multiline strings `''' """ j"""`
 | 
| 161 |   - proc def
 | 
| 162 |   - func def
 | 
| 163 | 
 | 
| 164 | - Python
 | 
| 165 |   - # comments
 | 
| 166 |   - `"" ''` strings
 | 
| 167 |   - multi-line strings
 | 
| 168 |   - these may require INDENT/DEDENT tokens
 | 
| 169 |     - class
 | 
| 170 |     - def
 | 
| 171 |   - does it understand `state.Mem`?  Probably
 | 
| 172 |     - vim only understands `Mem` though.  We might be able to convince it to.
 | 
| 173 |   - Reference:
 | 
| 174 |     - We may also need a fast whole-file lexer for `var_name` and `package.Var`,
 | 
| 175 |       which does dynamic lookup.
 | 
| 176 |    
 | 
| 177 | - C++
 | 
| 178 |   - `//` comments
 | 
| 179 |   - `/* */` comments
 | 
| 180 |   - preprocessor `#if #define`
 | 
| 181 |   - multi-line strings in generated code
 | 
| 182 |   - Parsing:
 | 
| 183 |     - `class` declarations, with method declarations
 | 
| 184 |     - function declarations (prototypes)
 | 
| 185 |       - these are a bit hard - do they require parsing?
 | 
| 186 |     - function and method definition
 | 
| 187 |       - including templates?
 | 
| 188 | 
 | 
| 189 | - ASDL
 | 
| 190 |   - # comments
 | 
| 191 |   - I guess every single type can have a line number
 | 
| 192 |     - it shouldn't jump to Python file
 | 
| 193 |     - `value_e.Str` and `value.Str` and `value_t` can jump to the right
 | 
| 194 |       definition
 | 
| 195 | 
 | 
| 196 | - R   # comments and "\n" strings
 | 
| 197 | 
 | 
| 198 | ### More languages
 | 
| 199 | 
 | 
| 200 | - JS  `//` and `/* */` comments, and backtick-quoted template literals
 | 
| 201 | - CSS `/* */`
 | 
| 202 |   - there are no real symbols to extract here
 | 
| 203 | - YAML - `#` and strings
 | 
| 204 |   - there's no parsing, just highlighting
 | 
| 205 | - Markdown 
 | 
| 206 |   - the headings would be nice -- other stuff is more complex
 | 
| 207 |   - the `==` and `--` styles require lookahead; they're not line-based
 | 
| 208 |   - so it needs a different model than `ScanOne()`
 | 
| 209 | 
 | 
| 210 | - spec tests
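
On the Markdown point above: the `==` and `--` heading styles mean a line's role is only known after seeing the next line. A minimal sketch of heading extraction with one line of lookahead follows. It is deliberately naive (it ignores code fences and other rules of real Markdown parsers), and the names `ATX`, `UNDERLINE`, and `headings` are invented for the example.

```python
import re

ATX = re.compile(r'^(#{1,6})[ \t]+(.*?)[ \t]*#*[ \t]*$')
UNDERLINE = re.compile(r'^(=+|-+)[ \t]*$')


def headings(lines):
    """Return (level, title, line_number) for '#'-style and underlined headings."""
    out = []
    for i, line in enumerate(lines):
        m = ATX.match(line)
        if m:
            out.append((len(m.group(1)), m.group(2), i + 1))
            continue
        # Underlined style: the NEXT line decides whether this is a heading.
        if line.strip() and i + 1 < len(lines):
            u = UNDERLINE.match(lines[i + 1])
            if u and not UNDERLINE.match(line):
                level = 1 if u.group(1)[0] == '=' else 2
                out.append((level, line.strip(), i + 1))
    return out


DOC = [
    'Micro Syntax',
    '============',
    '',
    'text',
    '',
    '## Related',
    'Notes',
    '-----',
    '- item',
]
for h in headings(DOC):
    print(h)
```

A token-at-a-time scanner cannot classify `Micro Syntax` when it reads it, which is why this needs a different model than `ScanOne()`.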
 |