OILS / lazylex / README.md View on Github | oilshell.org

150 lines, 121 significant
1Lazy, Lossless Lexing Libraries
2===============================
3
4Right now we're using this Python code to process HTML in Oil docs. Logically,
5the doc pipeline looks like:
6
7(Hand-written Markdown with embedded HTML) <br/>
8&rarr; CommonMark renderer &rarr; <br/>
9(HTML) <br/>
10&rarr; Filter with lazy lexing &rarr; <br/>
11(HTML) <br/>
12&rarr; Filter with lazy lexing &rarr; <br/>
13(HTML) <br/>
14<br/>
15*... repeat N times ...* <br/>
16<br/>
17(Final HTML) <br/>
18
19Eventually it would be nice to expand this API design to more formats and make
20it available to Oil language users.
21
22<div id="toc">
23</div>
24
25## Why?
26
27- To report good parse errors with location info
28- To use like `sed` on HTML
29- To minimize allocations, i.e. don't construct a DOM, and don't even construct
30 substrings
31- So we don't "lose the stack", unlike callback-based parsing
32 - We get an iterator of events/spans
33- A layer to build a "lossless syntax tree" on top of.
34
35## Formats
36
37### HTML
38
39HTML has two levels:
40
411. The `<>` structure, i.e. tags, the DOCTYPE declaration, comments, and processing
42 instructions
432. The `name="value"` attributes inside start tags (and self-closing tags)
44
45### TSV2
46
47- This format is **designed** to be read line-by-line (unlike CSV).
48- You can skip to any column, and optionally decode the field into an Bool,
49 Int, Float, or Str.
50
51### JSON
52
53- py-yajl is event-based, but not lazy. And it's a parser, not a lexer.
54- We could LEX `{}` and `[] and `""` `\` in the first step. This is lexing and
55 not parsing.
56
57
58## Links
59
60- [pulldown-cmark][]. This is called a "pull parser" in reference to the *XML
61 Pull Parsing* API at <http://xmlpull.org>. However I would call this a
62 *lexer* and not a *parser*.
63 - Hm I think this indicates a need for **lossless** lexers as well?
64 https://github.com/Byron/pulldown-cmark-to-cmark/blob/master/src/fmt.rs
65 - It appears to be used in mdbook
66- [simdjson][]: This pasrer is designed for extreme speed, and the first stage
67 of it "lazy" and uses integer indices. (We're only trying to avoid
68 allocation; we're not avoiding branches or using SIMD! We also aim to
69 transform the underlying data, not just parse it.)
70
71[simdjson]: https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/
72
73[pulldown-cmark]: https://github.com/raphlinus/pulldown-cmark
74
75## Design Notes
76
77### Lessons/Claims
78
79- You can parse HTML quickly and correctly with regexes! It has a simple
80 syntax that's almost designed for this.
81 - Key point: We're not parsing them **only** with regexes.
82 - The parser is correct in the sense that its behavior on every input is
83 fully-defined. There are no buffer overflows on edge cases -- that's the
84 magic of regexes and the corresponding state machines. However it doesn't
85 recognize every weird document on the web. It recognizes something close
86 to "well-formed XML" (but it's not XHTML.)
87- Parsing with spans / integer positions is efficient, **composable**, and
88 leads to better **syntax errors**.
89 - Example: spec test format parse errors aren't good. Info is lost.
90 Or ASDL parser? I guess it has some line info.
91- The API is easier to use than SAX because you're not losing the stack.
92 (Previous transformations used Python's HTMLParser, which is a SAX API.)
93- It's more efficient than a DOM API. DOM allocates a lot and loses
94 whitespace. (Speed matters when rebuilding the whole site. Each page has
95 multiple stages. It makes the code cleaner to do multiple quick passes, like
96 a compiler.)
97 - In many instances, you can MODIFY the HTML doc correctly without
98 deserializing something. For example, adding `<b>` tags around a line
99 doesn't require unquoting and quoting `&gt;` within them.
100- Preserving whitespace is useful because I want to use 'diff' to test
101 correctness against the old pipeline.
102- Python's `pat.match(s, start_pos, end_pos)` is very useful for efficient
103 lexing.
104 - TODO: Convert to re2c to see how it does. Need YYLIMIT.
105- TODO: Issue of non-greedy matches.
106- TODO: Issue of unquoting and quoting (escaping).
107- The triple backtick extension to Markdown (part of CommonMark) is useful.
108 - Although it should generate arbitrary `<div name="val">` instead. This
109 allow natural plugins. You can write `<div>` in Markdown, but it's
110 annoying to manually escape `<` to `&gt;`, e.g. in graphviz or TeX.
111 HTML is analogous to shell. A web site is a directory tree of text files!
112 - It's better than the Snip `-->` syntax, which didn't play well with syntax
113 highlighting.
114- Composable grammars: Is this the basis for Pulp?
115
116### Open Problems
117
118- Python generators (`yield`) make the code more natural, but that's not
119 possible in C or C++. (C++20 has coroutines though.)
120 - Could write a compiler? Would be an excuse to clean up the OPy or mycpp
121 ASTs.
122- Input handling in C/shell:
123 - Unlike Python's regex engine, libc `regexec()` doesnt have an `end_pos`,
124 requires `NUL` termination.
125 - We're also using `re2c` this way. Can se use `YYLIMIT`?
126 - Simple solution: copy the subrange, or temporarily mutate the buffer (would
127 cause copy-on-write)
128 - Is there a zero-copytehre a
129
130### Comparisons
131
132- mdbook (in Rust, using [pulldown-cmark][]). Has Rust plugins.
133- pandoc. Supports many formats, not just HTML. Has plugins that take an AST
134 in JSON. The AST represents pandoc-flavored Markdown. pandoc differs from
135 Markdown in that it discourages HTML extensions.
136- ReStructuredText. Sphinx has plugins.
137 - https://www.sphinx-doc.org/en/master/usage/extensions/index.html
138- XML Starlet (dormant, understands XPath and CSS Starlet)
139- R's Sweave?
140 - bookdown?
141- Jupyter notebooks have code and data. Do they have plugins?
142- Are there tools in node.js?
143 - https://stackoverflow.com/questions/6657216/why-doesnt-node-js-have-a-native-dom
144 - jsdom?
145
146## To Build On Top
147
148- CSS selectors
149- DOM
150