| 1 | Lazy, Lossless Lexing Libraries
 | 
| 2 | ===============================
 | 
| 3 | 
 | 
| 4 | Right now we're using this Python code to process HTML in Oil docs.  Logically,
 | 
| 5 | the doc pipeline looks like:
 | 
| 6 | 
 | 
| 7 | (Hand-written Markdown with embedded HTML) <br/>
 | 
| 8 | → CommonMark renderer → <br/>
 | 
| 9 | (HTML) <br/>
 | 
| 10 | → Filter with lazy lexing → <br/>
 | 
| 11 | (HTML) <br/>
 | 
| 12 | → Filter with lazy lexing → <br/>
 | 
| 13 | (HTML) <br/>
 | 
| 14 | <br/>
 | 
| 15 | *... repeat N times ...* <br/>
 | 
| 16 | <br/>
 | 
| 17 | (Final HTML) <br/>
 | 
| 18 | 
 | 
| 19 | Eventually it would be nice to expand this API design to more formats and make
 | 
| 20 | it available to Oil language users.
 | 
| 21 | 
 | 
| 22 | <div id="toc">
 | 
| 23 | </div>
 | 
| 24 | 
 | 
| 25 | ## Why?
 | 
| 26 | 
 | 
| 27 | - To report good parse errors with location info
 | 
| 28 | - To use like `sed` on HTML
 | 
| 29 | - To minimize allocations, i.e. don't construct a DOM, and don't even construct
 | 
| 30 |   substrings
 | 
| 31 | - So we don't "lose the stack", unlike callback-based parsing
 | 
| 32 |   - We get an iterator of events/spans
 | 
| 33 | - A layer to build a "lossless syntax tree" on top of.
 | 
| 34 | 
 | 
| 35 | ## Formats
 | 
| 36 | 
 | 
| 37 | ### HTML
 | 
| 38 | 
 | 
| 39 | HTML has two levels:
 | 
| 40 | 
 | 
| 41 | 1. The `<>` structure, i.e. tags, the DOCTYPE declaration, comments, and processing
 | 
| 42 |    instructions
 | 
| 43 | 2. The `name="value"` attributes inside start tags (and self-closing tags)
 | 
| 44 | 
 | 
| 45 | ### TSV2
 | 
| 46 | 
 | 
| 47 | - This format is **designed** to be read line-by-line (unlike CSV).
 | 
| 48 | - You can skip to any column, and optionally decode the field into an Bool,
 | 
| 49 |   Int, Float, or Str.
 | 
| 50 | 
 | 
| 51 | ### JSON
 | 
| 52 | 
 | 
| 53 | - py-yajl is event-based, but not lazy.  And it's a parser, not a lexer.
 | 
| 54 | - We could LEX `{}` and `[] and `""` `\` in the first step.  This is lexing and
 | 
| 55 |   not parsing.
 | 
| 56 | 
 | 
| 57 | 
 | 
| 58 | ## Links
 | 
| 59 | 
 | 
| 60 | - [pulldown-cmark][].  This is called a "pull parser" in reference to the *XML
 | 
| 61 |   Pull Parsing* API at <http://xmlpull.org>.  However I would call this a
 | 
| 62 |   *lexer* and not a *parser*.
 | 
| 63 |   - Hm I think this indicates a need for **lossless** lexers as well?
 | 
| 64 |     https://github.com/Byron/pulldown-cmark-to-cmark/blob/master/src/fmt.rs
 | 
| 65 |   - It appears to be used in mdbook
 | 
| 66 | - [simdjson][]: This pasrer is designed for extreme speed, and the  first stage
 | 
| 67 |   of it "lazy" and uses integer indices.  (We're only trying to avoid
 | 
| 68 |   allocation; we're not avoiding branches or using SIMD!  We also aim to
 | 
| 69 |   transform the underlying data, not just parse it.)
 | 
| 70 | 
 | 
| 71 | [simdjson]: https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/
 | 
| 72 | 
 | 
| 73 | [pulldown-cmark]: https://github.com/raphlinus/pulldown-cmark
 | 
| 74 | 
 | 
| 75 | ## Design Notes
 | 
| 76 | 
 | 
| 77 | ### Lessons/Claims
 | 
| 78 | 
 | 
| 79 | - You can parse HTML quickly and correctly with regexes!  It has a simple
 | 
| 80 |   syntax that's almost designed for this.
 | 
| 81 |   - Key point: We're not parsing them **only** with regexes.
 | 
| 82 |   - The parser is correct in the sense that its behavior on every input is
 | 
| 83 |     fully-defined.  There are no buffer overflows on edge cases -- that's the
 | 
| 84 |     magic of regexes and the corresponding state machines.  However it doesn't
 | 
| 85 |     recognize every weird document on the web.  It recognizes something close
 | 
| 86 |     to "well-formed XML" (but it's not XHTML.)
 | 
| 87 | - Parsing with spans / integer positions is efficient, **composable**, and
 | 
| 88 |   leads  to better **syntax errors**.
 | 
| 89 |   - Example: spec test format parse errors aren't good.  Info is lost.
 | 
| 90 |     Or ASDL parser?  I guess it has some line info.
 | 
| 91 | - The API is easier to use than SAX because you're not losing the stack.
 | 
| 92 |   (Previous transformations used Python's HTMLParser, which is a SAX API.)
 | 
| 93 | - It's more efficient than a DOM API.  DOM allocates a lot and loses
 | 
| 94 |   whitespace.  (Speed matters when rebuilding the whole site.  Each page has
 | 
| 95 |   multiple stages.  It makes the code cleaner to do multiple quick passes, like
 | 
| 96 |   a compiler.)
 | 
| 97 |   - In many instances, you can MODIFY the HTML doc correctly without
 | 
| 98 |     deserializing something.  For example, adding `<b>` tags around a line
 | 
| 99 |     doesn't require unquoting and quoting `>` within them.
 | 
| 100 | - Preserving whitespace is useful because I want to use 'diff' to test
 | 
| 101 |   correctness against the old pipeline.
 | 
| 102 | - Python's `pat.match(s, start_pos, end_pos)` is very useful for efficient
 | 
| 103 |   lexing.
 | 
| 104 |   - TODO: Convert to re2c to see how it does.  Need YYLIMIT.
 | 
| 105 | - TODO: Issue of non-greedy matches.
 | 
| 106 | - TODO: Issue of unquoting and quoting (escaping).
 | 
| 107 | - The triple backtick extension to Markdown (part of CommonMark) is useful.
 | 
| 108 |   - Although it should generate arbitrary `<div name="val">` instead.  This
 | 
| 109 |     allow natural plugins.  You can write `<div>` in Markdown, but it's
 | 
| 110 |     annoying to manually escape `<` to `>`, e.g. in graphviz or TeX.
 | 
| 111 |   HTML is analogous to shell.  A web site is a directory tree of text files!
 | 
| 112 |   - It's better than the Snip `-->` syntax, which didn't play well with syntax
 | 
| 113 |     highlighting.
 | 
| 114 | - Composable grammars: Is this the basis for Pulp?
 | 
| 115 | 
 | 
| 116 | ### Open Problems
 | 
| 117 | 
 | 
| 118 | - Python generators (`yield`) make the code more natural, but that's not
 | 
| 119 |   possible in C or C++.  (C++20 has coroutines though.)
 | 
| 120 |   - Could write a compiler?  Would be an excuse to clean up the OPy or mycpp
 | 
| 121 |     ASTs.
 | 
| 122 | - Input handling in C/shell:
 | 
| 123 |   - Unlike Python's regex engine, libc `regexec()` doesnt have an `end_pos`,
 | 
| 124 |     requires `NUL` termination.
 | 
| 125 |   - We're also using `re2c` this way.  Can se use `YYLIMIT`?
 | 
| 126 |   - Simple solution: copy the subrange, or temporarily mutate the buffer (would
 | 
| 127 |     cause copy-on-write)
 | 
| 128 |   - Is there a zero-copytehre a 
 | 
| 129 | 
 | 
| 130 | ### Comparisons
 | 
| 131 | 
 | 
| 132 | - mdbook (in Rust, using [pulldown-cmark][]).  Has Rust plugins.
 | 
| 133 | - pandoc.  Supports many formats, not just HTML.  Has plugins that take an AST
 | 
| 134 |   in JSON.  The AST represents pandoc-flavored Markdown.  pandoc differs from
 | 
| 135 |   Markdown in that it discourages HTML extensions.
 | 
| 136 | - ReStructuredText.  Sphinx has plugins.
 | 
| 137 |   - https://www.sphinx-doc.org/en/master/usage/extensions/index.html
 | 
| 138 | - XML Starlet (dormant, understands XPath and CSS Starlet)
 | 
| 139 | - R's Sweave?
 | 
| 140 |   - bookdown?
 | 
| 141 | - Jupyter notebooks have code and data.  Do they have plugins?
 | 
| 142 | - Are there tools in node.js?
 | 
| 143 |   - https://stackoverflow.com/questions/6657216/why-doesnt-node-js-have-a-native-dom
 | 
| 144 |   - jsdom?
 | 
| 145 | 
 | 
| 146 | ## To Build On Top
 | 
| 147 | 
 | 
| 148 | - CSS selectors
 | 
| 149 | - DOM
 | 
| 150 | 
 |