OILS / doc / parser-architecture.md View on Github | oilshell.org

218 lines, 154 significant
1Parser Architecture
2===================
3
4This doc has rough notes on the architecture of the parser.
5
6[How to Parse Shell Like a Programming Language][parse-shell] (2019 blog post)
7covers some of the same material. (As of 2024, it's still pretty accurate,
8although there have been minor changes.)
9
10<div id="toc">
11</div>
12
13## The Lossless Invariant
14
15The test suite [test/lossless.sh]($oils-src) invokes `osh --tool lossless-cat
16$file`.
17
18The `lossless-cat` tool does this:
19
201. Parse the file
211. Collect **all** tokens, from 0 to N
221. Print the text of each token.
23
24Now, do the tokens "add up" to the original file? That's what we call the
25*lossless invariant*.
26
27It will be the foundation for tools that statically understand shell:
28
29- `--tool ysh-ify` - change style of `do done` &rarr; `{ }`, etc.
30- `--tool fmt` - fix indentation, maybe some line wrapping
31
32The sections on **re-parsing** explain some obstacles which we had to overcome.
33
34## Lexing
35
36[parse-shell]: https://www.oilshell.org/blog/2019/02/07.html
37
38### List of Regex-Based Lexers
39
40Oils uses regex-based lexers, which are turned into efficient C code with
41[re2c]($xref). We intentionally avoid hand-written code that manipulates
42strings char-by-char, since that strategy is error prone; it's inevitable that
43rare cases will be mishandled.
44
45The list of lexers can be found by looking at [native/fastlex.c]($oils-src).
46
47- The large, modal OSH/YSH lexer in [frontend/lexer_def.py]($oils-src).
48- Lexers for OSH sublanguages
49 - For `echo -e`
50 - For `PS1` backslash escapes.
51 - For history expansion, e.g. `!$`.
52 - For globs, to implement `${x/foo*/replace}` via conversion to ERE. We need
53 position information, and the `fnmatch()` API doesn't provide it, but
54 `regexec()` does.
55 - NOTE: We'll also need one for converting extended globs to EREs, for
56 portability.
57
58[re2c]: http://re2c.org/
59
60### Sublanguages We Don't Lex or Parse
61
62These constructs aren't recognized by the Oils front end. Instead, they're
63punted to [libc]($xref):
64
65- Glob patterns, e.g. `*.py` (in most cases)
66- Extended glob patterns, e.g. `@(*.py|*.sh)`
67- `strftime` format strings, e.g. `printf '%(%Y-%m-%d)T' $timestamp`
68
69### Lexer Unread
70
71[osh/word_parse.py]($oils-src) calls `lexer.MaybeUnreadOne()` to handle right
72parens in this case:
73
74```
75(case x in x) ;; esac )
76```
77
78This is sort of like the `ungetc()` I've seen in other shell lexers.
79
80## Parsing Issues
81
82This section is about extra passes / "irregularities" at **parse time**. In
83the "Runtime Issues" section below, we discuss cases that involve parsing after
84variable expansion, etc.
85
86### Re-parsing - reading text more than once
87
88We try to avoid re-parsing, but it happens in 4 places.
89
90It complicates error messages with source location info. It also implications
91for `--tool ysh-ify` and `--tool fmt`, because it affects the **"lossless invariant"**.
92
93This command is perhaps a quicker explanation than the text below:
94
95 $ grep do_lossless */*.py
96 ...
97 osh/cmd.py: ...
98 osh/word_parse.py: ...
99
100Where re-parse:
101
1021. [Here documents]($xref:here-doc): We first read lines, and then parse them.
103 - `VirtualLineReader` in [osh/cmd_parse.py]($oils-src)
104 - This is re-parsing from **lines**
105
1062. **Array L-values** like `a[x+1]=foo`. bash allows splitting arithmetic
107 expressions across word boundaries: `a[x + 1]=foo`. But I don't see this
108 used, and it would significantly complicate the OSH parser.
109 - `_MakeAssignPair` in [osh/cmd_parse.py]($oils-src) has `do_lossless` condition
110 - This is re-parsing from **tokens**
111
1123. **Backticks**, the legacy form of `$(command sub)`. There's an extra level
113 of backslash quoting that may happen compared with `$(command sub)`.
114 - `_ReadCommandSubPart` in [osh/word_parse.py]($oils-src) has `do_lossless`
115 condition
116 - This is re-parsing from **tokens**
117
118### Re-parsing that doesn't affect the `ysh-ify` or `fmt` tools
119
1204. `alias` expansion
121 - `SnipCodeString` in [osh/cmd_parse.py]($oils-src)
122 - This is re-parsing from **tokens**, but it only happens **after running**
123 something like `alias ls=foo`. So it doesn't affect the lossless
124 invariant that `--tool ysh-ify` and `--tool fmt` use.
125
126### Revisiting Tokens, Not Text
127
128These language constructs are handled statically, but not in a single pass of
129parsing:
130
131- Assignment vs. Env binding detection: `FOO=bar declare a[x]=1`.
132 We make another pass with `_SplitSimpleCommandPrefix()`.
133 - Related: `s=1` doesn't cause reparsing, but `a[x+1]=y` does.
134- Brace Detection in a few places: `echo {a,b}`
135- Tilde Detection: `echo ~bob`, `home=~bob`
136
137This is less problematic, since it doesn't affect error messages
138(`ctx_SourceCode`) or the lossless invariant.
139
140### Lookahead in Recursive Descent Parsers
141
142- `myfunc() { echo hi; }` vs. `myfunc=() # an array`
143- `shopt -s parse_equals`: For `x = 1 + 2*3`
144
145### Where Parsers are Instantiated
146
147- See [frontend/parse_lib.py]($oils-src) and its callers.
148
149## Runtime Parsing
150
151### Where OSH Dynamically Parses
152
1531. **Alias expansion** like `alias foo='ls | wc -l'`. Aliases are like
154"lexical macros".
1552. **Prompt strings**. `$PS1` and family first undergo `\` substitution, and
156then the resulting strings are parsed as words, with `$` escaped to `\$`.
1573. **Builtins**.
158 - `eval`
159 - `trap` builtin
160 - exit
161 - debug
162 - err
163 - signals
164 - `source` — the filename is formed dynamically, but the code is generally
165 static.
166
167### Where Bash Dynamically Parses (perhaps unintentionally)
168
169All of the cases above, plus:
170
171(1) Recursive **Arithmetic Evaluation**:
172
173```sh-prompt
174$ a='1+2'
175$ b='a+3'
176$ echo $(( b ))
1776
178```
179
180This also happens for the operands to `[[ x -eq x ]]`.
181
182Note that `a='$(echo 3)'` results in a **syntax error**. I believe this was
183due to the ShellShock mitigation.
184
185(2) The **`unset` builtin** takes an LValue. (not yet implemented in OSH)
186
187```sh-prompt
188$ a=(1 2 3 4)
189$ expr='a[1+1]'
190$ unset "$expr"
191$ argv "${a[@]}"
192['1', '2', '4']
193```
194
195(3) **printf -v** takes an "LValue".
196
197(4) **Var refs** with `${!x}` takes a "cell". (not yet implemented OSH.
198Relied on by `bash-completion`, as discovered by Greg Price)
199
200```sh-prompt
201$ a=(1 2 3 4)
202$ expr='a[$(echo 2 | tee BAD)]'
203$ echo ${!expr}
2043
205$ cat BAD
2062
207```
208
209(5) **test -v** takes a "cell".
210
211(6) ShellShock (removed from bash): `export -f`, all variables were checked for
212a certain pattern.
213
214### Parse Errors at Runtime (Need Line Numbers)
215
216- `test` / `[`, e.g. `[ -a -a -a ]`
217- Command line flag usage errors
218- [alias]($help) parse errors