Parser Architecture

This doc has rough notes on the architecture of the parser.

How to Parse Shell Like a Programming Language (2019 blog post) covers some of the same material. (As of 2024, it's still pretty accurate, although there have been minor changes.)

Table of Contents
The Lossless Invariant
Lexing
List of Regex-Based Lexers
Sublanguages We Don't Lex or Parse
Lexer Unread
Parsing Issues
Re-parsing - reading text more than once
Re-parsing that doesn't affect the ysh-ify or fmt tools
Revisiting Tokens, Not Text
Lookahead in Recursive Descent Parsers
Where Parsers are Instantiated
Runtime Parsing
Where OSH Dynamically Parses
Where Bash Dynamically Parses (perhaps unintentionally)
Parse Errors at Runtime (Need Line Numbers)

The Lossless Invariant

The test suite test/lossless.sh invokes osh --tool lossless-cat $file.

The lossless-cat tool does this:

  1. Parse the file
  2. Collect all tokens, from 0 to N
  3. Print the text of each token.

Now, do the tokens "add up" to the original file? That's what we call the lossless invariant.

It will be the foundation for tools that statically understand shell:

The sections on re-parsing explain some obstacles which we had to overcome.

Lexing

List of Regex-Based Lexers

Oils uses regex-based lexers, which are turned into efficient C code with re2c. We intentionally avoid hand-written code that manipulates strings char-by-char, since that strategy is error prone; it's inevitable that rare cases will be mishandled.

The list of lexers can be found by looking at native/fastlex.c.

Sublanguages We Don't Lex or Parse

These constructs aren't recognized by the Oils front end. Instead, they're punted to libc:

Lexer Unread

osh/word_parse.py calls lexer.MaybeUnreadOne() to handle right parens in this case:

(case x in x) ;; esac )

This is sort of like the ungetc() I've seen in other shell lexers.

Parsing Issues

This section is about extra passes / "irregularities" at parse time. In the "Runtime Issues" section below, we discuss cases that involve parsing after variable expansion, etc.

Re-parsing - reading text more than once

We try to avoid re-parsing, but it happens in 4 places.

It complicates error messages with source location info. It also implications for --tool ysh-ify and --tool fmt, because it affects the "lossless invariant".

This command is perhaps a quicker explanation than the text below:

$ grep do_lossless */*.py
...
osh/cmd.py: ...
osh/word_parse.py: ...

Where re-parse:

  1. Here documents: We first read lines, and then parse them.

  2. Array L-values like a[x+1]=foo. bash allows splitting arithmetic expressions across word boundaries: a[x + 1]=foo. But I don't see this used, and it would significantly complicate the OSH parser.

  3. Backticks, the legacy form of $(command sub). There's an extra level of backslash quoting that may happen compared with $(command sub).

Re-parsing that doesn't affect the ysh-ify or fmt tools

  1. alias expansion

Revisiting Tokens, Not Text

These language constructs are handled statically, but not in a single pass of parsing:

This is less problematic, since it doesn't affect error messages (ctx_SourceCode) or the lossless invariant.

Lookahead in Recursive Descent Parsers

Where Parsers are Instantiated

Runtime Parsing

Where OSH Dynamically Parses

  1. Alias expansion like alias foo='ls | wc -l'. Aliases are like "lexical macros".
  2. Prompt strings. $PS1 and family first undergo \ substitution, and then the resulting strings are parsed as words, with $ escaped to \$.
  3. Builtins.

Where Bash Dynamically Parses (perhaps unintentionally)

All of the cases above, plus:

(1) Recursive Arithmetic Evaluation:

$ a='1+2'
$ b='a+3'
$ echo $(( b ))
6

This also happens for the operands to [[ x -eq x ]].

Note that a='$(echo 3)' results in a syntax error. I believe this was due to the ShellShock mitigation.

(2) The unset builtin takes an LValue. (not yet implemented in OSH)

$ a=(1 2 3 4)
$ expr='a[1+1]'
$ unset "$expr"
$ argv "${a[@]}"
['1', '2', '4']

(3) printf -v takes an "LValue".

(4) Var refs with ${!x} takes a "cell". (not yet implemented OSH. Relied on by bash-completion, as discovered by Greg Price)

$ a=(1 2 3 4)
$ expr='a[$(echo 2 | tee BAD)]'
$ echo ${!expr}
3
$ cat BAD
2

(5) test -v takes a "cell".

(6) ShellShock (removed from bash): export -f, all variables were checked for a certain pattern.

Parse Errors at Runtime (Need Line Numbers)

Generated on Wed, 10 Jul 2024 02:01:51 +0000