OILS / doc / pretty-printing.md View on Github | oilshell.org

607 lines, 412 significant
1---
2default_highlighter: oils-sh
3---
4
5Up to 4 Pretty Printers?
6========================
7
8*(March 2024)*
9
10Oils **parses** a lot of text, and it's becoming apparent than we need a
11**print** a lot too! The text should be nicely formatted because a shell is a
12user interface.
13
14This doc describes 4 possible pretty printers in Oils. Traditional shells
15don't have appear to have any pretty printing.
16
17[OSH]: $xref
18[YSH]: $xref
19
20<div id="toc">
21</div>
22
23## Screenshots
24
25Let's be concrete first, because there's a lot of brainstorming below.
26
27[YSH]($xref) prints its JSON-like data structures with the `=` keyword, which
28takes a Python-like expression on the right:
29
30![ysh issues.json](https://app.oilshell.org/picdir/resize?name=14qke97__ysh-issues.png&max-width=600)
31
32Right now, it looks bad on big data structures. It should look something like
33`nodejs` or `jq`:
34
35![node.js issues.json](https://app.oilshell.org/picdir/resize?name=13b35jj__nodejs-issues.png&max-width=600)
36
37![jq issues.json](https://app.oilshell.org/picdir/resize?name=11wsgpm__jq-issues.png&max-width=600)
38
39We may want to omit the quotes, like `nodejs`. (This syntax isn't meant to be
40parsed. [JSON8]($xref) may have unquoted dict keys, although it's not
41essential).
42
43## Background - `go fmt` style versus PPL style
44
45To back up a bit, I'm writing this doc organize my thoughts, and to explain
46problems and requirements to contributors.
47
48There are at least two schools of thought on pretty printers, which
49this lobste.rs thread has a good discussion of:
50
51- *Why is Prettier Rock Solid?* <https://lobste.rs/s/aevptj/why_is_prettier_rock_solid>
52 - HN comments: <https://news.ycombinator.com/item?id=39437424>. Aside: a top
53 comment shows why we don't want to take responsibility for *all* formatting
54 decisions in our OSH-YSH printer! Users are opinionated. The JavaScript
55 formatter Prettier is criticized for a bug, and for being **slow**.
56
57More comments on a blog post by Justin Pombrio (which I circulated):
58
59- *A Twist on Wadler's Printer*
60 <https://lobste.rs/s/1r0aak/twist_on_wadler_s_printer>
61
62Let's call the two styles the "`go fmt` style" and the "PPL style" (functional
63pretty printing language).
64
65I was probably "biased" toward `go fmt`, because the two formatters we actually
66**use** in Oils are influenced by it:
67
681. `clang-format` for our C++ code.
69 - This is the best formatter I've used. It's fast, and e.g. does a good job
70 on C macros.
711. `yapf` for our Python code.
72 - It's intentionally a "clone" for `clang-format`. (It's slow, mostly due
73 to being written in Python, and creating lots of tiny objects!)
74
75(Details: they have line wrapping algorithms, while `go fmt` doesn't. I'm not
76calling them "graph search" printers, because I think of line wrapping as a
77**subproblem** of pretty printing.)
78
79<!--
80
81Related: a bunch of Zulip threads where I was learning about pretty printing:
82
83- TODO: Link to threads
84
85(I've written many parsers before, but only one ad hoc pretty printer, which I
86want to get rid of. Discussed below.)
87
88-->
89
90### Why PPL style?
91
92However, Justin's post helped me understand Wadler's printer, a popular example
93of the PPL style. This style might have some advantages for Oils:
94
95- There's no user-provided layout for data structures &mdash; either
96 [YSH]($xref) data or [Zephyr ASDL][ASDL] data. So we need to synthesize a
97 layout from scratch.
98- We have multiple languages to format, and the PPL style separates **policy**
99 (language rules) and **mechanism** (line wrapping). So we should try a more
100 principled architecture, hopefully without sacrificing quality.
101- The two styles may not be as distinct as they seem at first. They may be
102 complementary.
103 - We can probably use a PPL for the **expression subproblem** of the OSH-YSH
104 shell formatter. The rest of the formatter will have rules that **don't**
105 have to do with line wrapping (aligning EOL comments like `go fmt` does,
106 etc.)
107 - I think a **non-wrapping** pretty printer &mdash; an "indenter" &mdash; can
108 use something similar to the PPL IRs. Notes below.
109- PPLs *could be* slower (asymptotically and in practice) than the custom
110 algorithms, but I think that can be solved with a simple rule in practice:
111 1. Compute the total size of the data structure/doc up front.
112 1. If it's small, spend extra time to make it look pretty, by using an
113 **expressive and slow** PPL. We can be quadratic or worse, perhaps. We
114 might want `node.js`-like columnar layout.
115 1. If it's big, use a **simple and fast** PPL subset.
116
117(Does that last idea work?)
118
119[ASDL]: $xref:zephyr-asdl
120
121
122### Warning
123
124Sometimes I pile on too many requirements, which I mentioned in the latest
125release announcement:
126
127- [Oils 0.20.0 > Why I'm Working on JSON](https://www.oilshell.org/blog/2024/02/release-0.20.0.html#zulip-why-am-i-working-on-json)
128
129We should do the simplest things first, and I think the PPL approach will allow
130that.
131
132BTW there are many threads on `#data-languages` on [Zulip]($xref) where I'm
133brainstorming and learning about pretty printing.
134
135## Four Printers - What and Why?
136
137Here's a sketch of what I think we need. It goes from **concrete** &rarr;
138**experimental** and research-y.
139
140### Print YSH data types in a JSON-like format
141
142**What is it?** This is the `=` keyword shown in the screenshots above. (BTW,
143Lua uses the same syntax `=` to evaluate expressions.)
144
145**Motivation**: We should look as good as `node.js` or `jq`.
146
147---
148
149Currently we use our JSON printer with the options `SHOW_NON_DATA |
150SHOW_CYCLES`.
151
152- `SHOW_NOW_DATA` prints non-data objects, like `<Func 0x123>`. This syntax
153 can't be parsed.
154- `SHOW_CYCLES` prints cycles with `-->`, instead of raising an error, like
155 JSON does.
156
157Example:
158
159 ysh$ var d = {}
160 ysh$ setvar d.eggex = /dot+/ # Eggex object
161
162 ysh$ = d
163 (Dict 0x7feb87cb4050) {"eggex":<Eggex 0x7feb87dbfd00>}
164
165 ysh$ setvar d.cycle = d
166
167 ysh$ = d
168 (Dict 0x7feb87cb4050) {"eggex":<Eggex 0x7feb87dbfd00>,"cycle":{ --> 0x7feb87cb4050 }}
169
170We should **replace** this with a new pretty printer that wraps lines, and has
171<span style="color: darkcyan; font-weight: bold">COLOR</span>.
172
173### Replace our ad hoc Zephyr ASDL pretty printer
174
175**What is it?** [Zephyr ASDL][ASDL] is the statically typed schema language we
176use to implement Oils. It's "one level down" from the shell.
177
178We used it to define the syntax of shell with **algebraic data types**. We
179create a [lossless syntax tree]($xref:LST), which is also an **IR** for shell.
180
181**Motivation**: We already wrote an ad hoc pretty printer, and it should be
182replaced! It tries to fit records on a single line, and if that fails, it uses
183multiple lines. I think it's slow.
184
185If we already have a YSH data printer, this printer should "obviously" be
186unified with it. We should have a nice separation of policy and mechanism.
187
188---
189
190Use `osh -n myscript.sh` to see what it does:
191
192![osh -n](https://app.oilshell.org/picdir/resize?name=1lwb0bf__osh-n.png&max-width=600)
193
194Notes:
195
196- The algorithm is in [asdl/format.py]($oils-src), and the "homogeneous IR" is
197 in [asdl/hnode.asdl]($oils-src).
198 - Each generated class definition has a `PrettyTree()` method, which converts
199 the **typed** `self` or `this` to the homogeneous `hnode` tree.
200 - `AbbreviatedTree()` is a bit like the **modular** printers discussed in the
201 `lobste.rs` thread. It makes certain common data structures more
202 readable, with type-specific logic. It's in Python only &mdash; can that
203 logic also be in C++?
204 - The syntax tree is actually a **graph**, and I recently added logic to
205 **omit duplicate** nodes. This is unlike the JSON printer, which prints
206 duplicate nodes, as Python and node.js do.
207- The slowness hasn't really mattered, because this format isn't exposed to
208 users. It's only for **debugging** Oils itself. But it's useful, and we may
209 want to expose it.
210- Also used in `pp asdl (obj)`, another debugging feature.
211- TODO: Add a simple benchmark? The new printer will probably be faster than
212 the old one.
213 - `osh -n benchmarks/testdata/configure-coreutils` tests a huge shell file
214
215<!--
216- current state of `osh -n`
217 - timing of it -- I think this may take awhile. It was never optimized. It
218 produces MANY allocations.
219 - to be honest -- allocations and GC will probably **dominate** in Oils.
220-->
221
222### OSH-YSH Code Formatter
223
224**What is it?** A formatter for shell code. I think it's more essential to
225have a [YSH]($xref) formatter, but an [OSH]($xref) formatter is also possible.
226They both use the same [lossless syntax tree]($xref:LST).
227
228**Motivation** - Formatters make a new language easier to use, and there are
229many users who don't know shell well.
230
231For example, I don't know TypeScript well, and I had a good experience with
232`deno fmt`. It reduced the **mental load** of adopting a new tool.
233
234---
235
236Justin had a nice idea on on `lobste.rs` - we should create **coarse tree**
237with these elements:
238
239- `{ }` affect indentation in [YSH]($xref)
240 - In [OSH]($xref), we should also understand `then elif else fi`, `do done`,
241 etc.
242- `( )` in [YSH]($xref) changes the lexer from [command mode to expression
243 mode](command-vs-expression-mode.html)
244- **Newlines** can't appear in atomic `text()` / chunks
245- Comments need to be preserved at the end of lines
246 - They may also be aligned on consecutive lines (with heuristics)
247- Keywords like `while for if` begin blocks of code
248
249Why? We don't don't want to take responsibility for every formatting decision!
250
251I actually think the command mode / statement formatter should be
252**non-wrapping**, while expressions can wrap. Commands will likely be more
253common than expressions in most YSH code.
254
255---
256
257The formatter will be invoked with `ysh --tool format myfile.ysh`.
258
259It can also be used with the output of `osh --tool ysh-ify`, which roughly
260translates [OSH]($xref) to [YSH]($xref) (doesn't preserve semantics). This may
261help generate **test data**, since there's plenty of shell code in the wild,
262but not much [YSH][] code.
263
264### Experimental: Export the Oils "Syntax Graph" to Users with "NIL8"
265
266**What is it?** A more human AND machine-readable format for the syntax tree,
267which is actually a **graph**.
268
269**Motivation**: The pretty-printed AST could be a **parseable** format. Allow
270users to reuse all the hard work we did [parsing
271shell]($blog-tag:parsing-shell)!
272
273---
274
275Printing raw [ASDL][] data is useful, but it could be more readable with custom
276logic to print the natural **layers** of the graph. There are 4 logical layers
277in [frontend/syntax.asdl]($oils-src):
278
2791. `source_t` describes whether shell code comes from `foo.sh` or `ysh -c 'echo
280 mycode'`
2812. `SourceLine` represents physical lines of code
2823. `Token` refers to portions of lines
2834. The syntax tree of `command_t word_t word_part_t expr_t`. The leaves are
284 tokens.
285
286And instead of a pretty format meant for humans, we may want to print an
287s-expression-like format I'm calling **"NIL8"**.
288
289NIL8 can be parsed. You may want to write tree-shaking code to deploy
290[YSH][] into containers.
291
292More on NIL8 below. Again, it's experimental.
293
294## Implementation Notes
295
296### Do the printers depend on each other?
297
298- The [YSH][] printer (1) naturally comes before the [ASDL][] printer (2).
299- The code formatter (3) is concrete and useful to end users.
300- The NIL8 printer (4) comes after the [ASDL][] printer (2), but it's experimental.
301 - It depends on a bunch of other work in Oils/YSH.
302
303Risks:
304
305- No real risks for (1) and (2)
306 - They're "engineering" &mdash; Justin's blog post is very close! It could
307 be ported almost literally to typed Python. It will translate
308 automatically to C++. (And it would be interesting to compare our
309 garbage-collected C++ with Rust's `Rc<T>`)
310 - [ASDL][] involves code generation in both Python and C++. We have a custom
311 build system for this (using Ninja for C++).
312- The OSH/YSH formatter has non-trivial decisions
313 - End-of-line comments. (Shell doesn't have block comments, which simplifies
314 things.)
315 - Multi-line strings in YSH have a special rule -- the indentation of the
316 closing `'''` is significant
317 - Shell here docs may be tricky
318 - This formatter probably requires the most "elbow grease". This is why I
319 said that the statement formatter should initially be **non-wrapping** &mdash;
320 it reduces the scope of the problem.
321
322### Code Skeleton
323
324I added some stubs in the code:
325
326- [data_lang/pretty.asdl]($oils-src) - How we would express the IR
327- [data_lang/pretty.py]($oils-src) - YSH conversion.
328- [data_lang/pretty-benchmark.sh]($oils-src) - Our naive ASDL pretty printer is
329 slow. It can take more than 3 seconds on a big file, vs. ~100ms to parse it.
330 (It does print over 100 MB of text though.)
331
332To generate Python code from the ASDL schema, run `build/py.sh all`.
333Otherwise, Oils is a plain Python 2 program, with a few C extensions.
334
335C++ translation is a separate step, and it's now pretty polished.
336
337For new contributors:
338
339- [Contributing]($wiki) on the wiki
340- [Where Contributors Have Problems]($wiki)
341
342There is also a stub for the formatter:
343
344- [tools/fmt.py]($oils-src) - Stub file for the formatter.
345 - Code copied from [tools/ysh_ify.py]($oils-src).
346
347## Design Questions
348
349This section has some less concrete thoughts.
350
351### Non-Wrapping Printers aka "Indenters" - same PPL IR?
352
353I think the PPL IRs are also useful if you're not line wrapping? You can just
354fix indentation.
355
356### Columnar Layouts (spending more time)
357
358`nodejs` does this in its `console.log()`.
359
360![python node.js](https://app.oilshell.org/picdir/resize?name=uscdu6__python-nodejs.png&max-width=600)
361
362Future work? We should get the basic pretty printer working first.
363
364### Two Levels of Coarse Parsing for YSH? (not at first)
365
366Making the coarse tree has some similarity to syntax highlighting. I wrote a
367simple syntax highlighter for 5+ languages called `micro-syntax`, and it should
368support [YSH]($xref) too.
369
370Sketch:
371
3721. First make the **really coarse tree**, something like: `Comment | Code |
373 StringLiteral`
3742. Then make a **less coarse** tree for pretty printing:
375 - Lex code into `{} ()`
376 - Categorize comments into `EndLineComment` | `BeginLineComment`
377
378Then do a trivial linear pass to fix up indentation. The `{ }` or `do done`
379tokens determine indentation.
380
381---
382
383Though honestly it's probably better to just **reuse** our elaborate parser at
384first. I like to "compress" different algorithms together, but it may not be
385worth it here.
386
387### NIL8 - Uses cases for both Code and Data
388
389What is "NIL8"? We don't know if it's a good idea yet, but it may be part of
390[J8 Notation](j8-notation.html).
391
392Think:
393
394- A mash-up of [JSON][] and S-expressions
395 - *NIL8 Isn't Lisp*
396 - *Narrow Intermediate Language*
397- WebAssembly text format
398 - An IR for an **imperative** language, with a Lisp-y syntax.
399- An **exterior** S-expression format
400 - Blog: [Oils is Exterior-First](https://www.oilshell.org/blog/2023/06/ysh-design.html)
401 - I posted POSE (portable s-expressions) on lobste.rs for this reason:
402 <https://lobste.rs/s/lwf4jv/pose_portable_s_expressions_pose_spec> (no
403 comments)
404
405[JSON]: $xref
406
407If NIL8 can represent both the [lossless syntax tree]($xref:LST) and a new IR
408for a [mycpp]($xref) rewrite ("yaks"), that's a good test of the design.
409
410Note that the AST is a **statically typed** data structure, which means we may
411also want to export the [ASDL][] **schema** as NIL8!
412
413Links:
414
415- [#data-languages > NIL8 Infix Rule](https://oilshell.zulipchat.com/#narrow/stream/403584-data-languages/topic/NIL8.20Infix.20Rule)
416- [Commit describing infix rule](https://github.com/oilshell/oil/commit/b9cecf88838d4c89ce1dbd8f4bcdd8e92e10d442)
417
418At a high level, we're trying to nudge users toward a **small** set of syntaxes
419for shell-like programming, rather than inventing ad hoc syntax every time.
420String literals are a pain point: people often implement them badly, or not at
421all.
422
423## Conclusion
424
425I think we should have shared infrastructure for 3 printers:
426
4271. [YSH]($xref) data structures
4281. [ASDL][] data structures
4291. [OSH]($xref) and [YSH]($xref) code
430
431And then there's this idea of "replacing" or rationalizing the [ASDL][] syntax
432tree with "NIL8":
433
434- It can be parsed, not just printed.
435 - To parse, you can reuse an existing [JSON]($xref) string lexer. IMO, this
436 is the fiddliest part of parsing.
437- It can export a graph shape, in natural "layers".
438
439## Related
440
441### Docs
442
443- [Parser Architecture](parser-architecture.html) - describes issues like the
444 **"lossless invariant"**, which is affected by *re-parsing*.
445 - I recently updated it, and tested the invariant with
446 [test/lossless.sh]($oils-src).
447- The repo [README]($oils-doc:oils-repo/README.html) has an overview of the
448 code.
449
450### Fun Computer Science Problems in Oils
451
452Pretty printing is adjacent to other fun problems in Oils, like GC performance,
453"boxless" optimization, etc.
454
455- [#help-wanted > Fun Computer Science Problems](https://oilshell.zulipchat.com/#narrow/stream/417617-help-wanted/topic/Fun.20Computer.20Science.20Problems)
456
457Things to think about:
458
459(1) Unified Code Representation for Oils
460
461We want to pack all these tools into Oils:
462
463- The interpreter, for **execution**
464 - prints precise errors, ignores comment tokens
465- `ysh-ify` - a VERY rough **translation** of [OSH]($xref) to [YSH]($xref)
466 - doesn't respect semantics, because you really need static types for that
467 - uses "span ID"
468- **Pretty Printing** (this doc)
469 - will also use "span ID" to detect comment positions, I think
470- Syntax **highlighting**
471 - Interactive highlighting will help users learn the language
472 - It's **recursive**, because sublanguages are mutually recursive: string
473 literals, commands, expressions
474
475Related: Retrospective on the Go `ast` package.
476
477(2) Yaks - *mycpp from the bottom up*
478
479NIL8 leads into **"Yaks"**, which is an IR for garbage collected C++.
480
481- Yaks is of course a big yak shave, but it's **concrete** because we recently
482 **completed** the translation with [mycpp]($xref). (mycpp is a crappy
483 program which produces good results!)
484
485(3) Pretty printing will cause many small **allocations**.
486
487I think that naive implementations should be fast enough. If not, any slowness
488may be due to allocation, not necessarily the pretty printing algorithm itself.
489
490Some solutions:
491
492- We want to move towards a **tagged pointer** runtime, an ambitious
493 transformation of the entire interpreter.
494 - But we'll do it in steps. First step: Small String Optimization.
495 - Yaks goes along with a tagged pointer runtime. Yaks also has a principled
496 IR, which may be represented with many small objects.
497- GC rooting optimization should give a speed boost
498
499## Appendix: Older Notes
500
501This is about ref counting for printing **graph-shaped** data.
502
503I no longer think this is as important. I think we should **manually**
504construct 4 layers of the graph, as described in the section on formatter (4).
505
506### Dynamically Typed YSH Data (`value_t`)
507
508Similar to JSON / JSON8 printing, except we
509
5101. count references, and then print `...` instead of repeating
5111. line wrap
5121. assign colors
513 - for atoms, and possibly for balanced parens, to make it more readable
514
515#### Step 1: Count References
516
517This is a global pass that computes a Dict[int, int]
518
519 object ID -> number of times referenced in the graph
520
521The graph is specified by single root node, e.g. the argument to
522
523 pp line (obj)
524
525Pass this dict into the second step.
526
527#### Step 2: Convert To Homogeneous Representation
528
529 value.List -> hnode.Compound with []
530 value.Dict -> hnode.Compound with {}
531
532 null, true, false -> Atom
533 Cycle detected -> Atom, with { --> 4beef2 }
534 [ --> 4beef2 ]
535
536Repetition:
537
538 { key: { ... 4beef2 }, key2: { ... 4beef2 }
539
540Or maybe omit the type, since strings don't have that:
541
542 { key: ... 4beef2, key2: ... 4beef2 }
543
544#### hnode Schema
545
546The schema looks like this now?
547
548 hnode =
549 Atom(str s, color color) - # External objects can use this?
550 | Compound(hnode* items)
551
552The length of 'str s' is the input to line wrapping.
553
554#### Step 3: Figure out what's on each line
555
556TODO: what's the heuristic here? Is it global?
557
558Dynamic programming?
559
560do we insert hnode.Newline() or something?
561
562### Statically Typed ASDL Data
563
564Reduce it to the case above.
565
566#### Step 1 - Ref Counting / Cycle Detection?
567
568We do this all at once?
569
570Because to convert to value.Record, you have to do cycle detection anyway.
571
572And that's similar to ref counting.
573
574#### Step 2 - ASDL records -> value.Record
575
576 value =
577 ...
578 | Record(str type_name, Dict[str, value_t] fields)
579
580The special "-" key can be used for JSON:
581
582 {"-": "command.Simple, "name": "hi"}
583
584Though this loses some information, and it doesn't solve the problem with
585shared references. We would need Packle for that.
586
587#### Step 2a: Optional Abbreviation?
588
589Is this separate? Or part of step 2.
590
591We need something between value.Record and hnode.Compound
592to do ABBREVIATION:
593
594- Abbreviate type name, or omit it
595- Omit some field names (requires schema to record it)
596- Change () to <>
597
598Also need nodes for
599
600- `...` means already printed
601- `---` means CANNOT print, because it's a cycle
602- `@1f23` - ID if already printed, or in a cycle
603
604#### Steps 3 and 4 - Homogeneous Representation, Line Wrapping
605
606Identical to the dynamically typed case above.
607