| 1 | ---
|
| 2 | in_progress: yes
|
| 3 | ---
|
| 4 |
|
| 5 | Solutions to the Framing Problem
|
| 6 | ================================
|
| 7 |
|
| 8 | How do you write multiple **records** to a pipe, and how do you read them?
|
| 9 |
|
| 10 | You need a way of delimiting them. Let's call this the "framing problem"
|
| 11 | — a term borrowed from network engineering.
|
| 12 |
|
| 13 | This doc categorizes different formats, and shows how you handle them in YSH.
|
| 14 |
|
| 15 | YSH is meant for writing correct shell programs.
|
| 16 |
|
| 17 | <div id="toc">
|
| 18 | </div>
|
| 19 |
|
| 20 | ## A Length Prefix
|
| 21 |
|
| 22 | [Netstrings][netstring] are a simple format defined by Daniel J. Bernstein.
|
| 23 |
|
| 24 | 3:foo, # ASCII length, colon, byte string, comma
|
| 25 |
|
| 26 | [netstring]: https://en.wikipedia.org/wiki/Netstring
|
| 27 |
|
| 28 | This format is easy to implement, and efficient to read and write.
|
| 29 |
|
| 30 | But the encoded output may contain binary data, which isn't readable by a human
|
| 31 | using a terminal (or GUI). This is significant!
|
| 32 |
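As a minimal sketch of this format in bash (not the planned YSH builtins), the two functions below encode and decode one netstring. The names `netstr_encode`, `netstr_decode`, and the `NETSTR` variable are assumptions made for illustration. Because bash variables can't hold `NUL` bytes, the sketch only handles payloads without them.

```shell
# netstr_encode STRING: print STRING as a netstring.
netstr_encode() {
  local LC_ALL=C          # make ${#s} count bytes rather than characters
  local s=$1
  printf '%d:%s,' "${#s}" "$s"
}

# netstr_decode: read one netstring from stdin into the global NETSTR.
# Returns non-zero at end of input or on malformed data.
netstr_decode() {
  local LC_ALL=C len comma
  NETSTR=''
  IFS= read -r -d ':' len || return 1            # ASCII length up to the colon
  [[ $len =~ ^(0|[1-9][0-9]*)$ ]] || return 1    # digits only, no leading zeros
  if (( len > 0 )); then
    IFS= read -r -N "$len" NETSTR || return 1    # exactly len bytes, newlines included
  fi
  IFS= read -r -N 1 comma || return 1
  [[ $comma == ',' ]]                            # the trailing comma is required
}
```

For example, `printf '3:foo,11:hello\nworld,' | { while netstr_decode; do echo "len=${#NETSTR}"; done; }` should decode two records, the second of which contains a newline.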
|
| 33 | ---
|
| 34 |
|
| 35 | TODO: Implement `read --netstr` and `write --netstr`
|
| 36 |
|
| 37 | <!--
|
| 38 | Like [J8 Notation][], this format is "8-bit clean", but:
|
| 39 |
|
| 40 | - A netstring encoder is easier to write than a QSN encoder. This may be
|
| 41 | useful if you don't have a library handy.
|
| 42 | - It's more efficient to decode, in theory.
|
| 43 | -->
|
| 44 |
|
| 45 | ## Solutions Using a Delimiter
|
| 46 |
|
| 47 | Now let's look at traditional Unix solutions, and their **problems**.
|
| 48 |
|
| 49 | ### Fixed Delimiter: Newline or `NUL` byte
|
| 50 |
|
| 51 | In traditional Unix, newlines delimit "records". Here's how you read them in
|
| 52 | shell:
|
| 53 |
|
| 54 | while IFS='' read -r; do # confusing idiom!
|
| 55 | echo "line=$REPLY"
|
| 56 | break # remaining bytes are still in the pipe
|
| 57 | done
|
| 58 |
|
| 59 | YSH has a simpler idiom:
|
| 60 |
|
| 61 | while read --raw-line { # unbuffered
|
| 62 | echo line=$_reply
|
| 63 | break # remaining bytes are still in the pipe
|
| 64 | }
|
| 65 |
|
| 66 | Or you can read all lines:
|
| 67 |
|
| 68 | for line in (stdin) { # buffered
|
| 69 | echo line=$line
|
| 70 | break # remaining bytes may be lost in a buffer
|
| 71 | }
|
| 72 |
|
| 73 | **However**, in Unix, all of these strings may have newlines:
|
| 74 |
|
| 75 | - filenames
|
| 76 | - items in `argv`
|
| 77 | - values in `environ`
|
| 78 |
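To make the problem concrete, here is a small bash sketch that counts newline-delimited records on stdin. The helper name `count_lines` is hypothetical.

```shell
# count_lines: count newline-terminated records on stdin.
count_lines() {
  local n=0 line
  while IFS= read -r line; do
    n=$((n + 1))
  done
  printf '%d\n' "$n"
}
```

If a writer emits two strings with `printf '%s\n' $'two\nlines' one`, the reader counts three records, because the embedded newline is indistinguishable from a delimiter.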
|
| 79 | ---
|
| 80 |
|
| 81 | But these C-style strings can't contain the `NUL` byte, aka `\0`. So GNU tools
|
| 82 | have evolved support for another format:
|
| 83 |
|
| 84 | find . -print0 # write data
|
| 85 | xargs -0 # read data; also --null
|
| 86 | grep -z # read data; also --null-data
|
| 87 | sort -z # read data; also --zero-terminated
|
| 88 | # (Why are all the names different?)
|
| 89 |
|
| 90 | In Oils, we added a `-0` flag to `read` that understands this format:
|
| 91 |
|
| 92 | $ find . -print0 | { read -0 x; echo "$x"; read -0 x; echo "$x"; }
|
| 93 | foo # could contain newlines!
|
| 94 | bar
|
| 95 |
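For comparison, plain bash can read the same format with `read -r -d ''`. The sketch below collects every `NUL`-terminated record from stdin into an array. The names `read_nul_records` and `RECORDS` are assumptions for illustration, not part of Oils.

```shell
# read_nul_records: read NUL-terminated records from stdin
# into the global array RECORDS.
read_nul_records() {
  RECORDS=()
  local rec
  # -d '' makes read stop at a NUL byte instead of a newline
  while IFS= read -r -d '' rec; do
    RECORDS+=("$rec")
  done
}
```

A typical call would be `read_nul_records < <(find . -print0)`. Process substitution is used rather than a pipe, so that the array is set in the current shell and not in a subshell.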
|
| 96 | ### Chosen Delimiter: Here docs and multipart MIME
|
| 97 |
|
| 98 | Shell has here docs that look like this:
|
| 99 |
|
| 100 | cat <<EOF
|
| 101 | the string EOF
|
| 102 | can't start a line
|
| 103 | EOF
|
| 104 |
|
| 105 | So you **choose** the delimiter, with the "word" you write after `<<`.
|
| 106 |
|
| 107 | ---
|
| 108 |
|
| 109 | Similarly, when your browser POSTs a form, it uses the [MIME multipart message
|
| 110 | format](https://en.wikipedia.org/wiki/MIME#Multipart_messages):
|
| 111 |
|
| 112 | MIME-Version: 1.0
|
| 113 | Content-Type: multipart/mixed; boundary=frontier
|
| 114 |
|
| 115 | This is a message with multiple parts in MIME format.
|
| 116 | --frontier
|
| 117 | Content-Type: text/plain
|
| 118 |
|
| 119 | This is the body of the message.
|
| 120 | --frontier
|
| 121 |
|
| 122 | So again, you **choose** a delimiter with `boundary=frontier`, and then you
|
| 123 | must recognize it later in the message.
|
| 124 |
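The writer's side of this strategy can be sketched in bash: try candidate delimiters until one does not occur in the payload. The function name `choose_boundary` and the `frontierN` naming scheme are hypothetical. Real MIME implementations typically generate long random boundaries instead of scanning the data.

```shell
# choose_boundary PAYLOAD: print a boundary word such that the
# delimiter "--WORD" does not occur anywhere in PAYLOAD.
choose_boundary() {
  local payload=$1 candidate=frontier n=0
  while [[ $payload == *"--$candidate"* ]]; do
    n=$((n + 1))
    candidate="frontier$n"
  done
  printf '%s\n' "$candidate"
}
```

The cost of a chosen delimiter shows up here: the writer has to inspect the whole payload (or rely on randomness) before it can emit the first byte.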
|
| 125 | ## C-Style `\` escaping allows arbitrary bytes
|
| 126 |
|
| 127 | [JSON][] can express strings with newlines:
|
| 128 |
|
| 129 | "line 1 \n line 2"
|
| 130 |
|
| 131 | It can also express the zero code point, which isn't the same as the `NUL` byte:
|
| 132 |
|
| 133 | "zero code point \u0000"
|
| 134 |
|
| 135 | [J8 Notation][] is an extension of JSON that fixes this:
|
| 136 |
|
| 137 | "NUL byte \y00"
|
| 138 |
|
| 139 | (We use `\y00` rather than `\x00`, because Python and JavaScript both confuse
|
| 140 | `\x00` with `U+0000`. In UTF-16 or UTF-32, the zero code point is encoded as 2
|
| 141 | or 4 `NUL` bytes.)
|
| 142 |
|
| 143 | [J8 Notation]: j8-notation.html
|
| 144 | [JSON]: $xref
|
| 145 |
|
| 146 | ### Escaping-Based Records
|
| 147 |
|
| 148 | TSV files are based on delimiters, but they aren't very readable in a terminal.
|
| 149 |
|
| 150 | TODO
|
| 151 |
|
| 152 | So TSV8 offers an "aligned" format:
|
| 153 |
|
| 154 | #.ssv8 flag desc type
|
| 155 | type Str Str Str
|
| 156 | --verbose "do it \t verbosely" bool
|
| 157 | --count "count only" int
|
| 158 |
|
| 159 | So this format combines two strategies:
|
| 160 |
|
| 161 | - Delimiter-based for the **rows** / lines
|
| 162 | - Escaping-based for the **cells**
|
| 163 |
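The combination can be sketched in bash as follows. This is an illustration of the idea only, not an implementation of TSV8 or J8 Notation: the names `json_escape` and `write_row` are hypothetical, every cell is quoted, and only a few characters are escaped (a real encoder must handle all control characters and invalid bytes).

```shell
# json_escape STRING: print STRING as a double-quoted, JSON-style string.
# Sketch only: handles backslash, double quote, newline, tab, and CR.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}       # backslash first, so later escapes aren't doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\t'/\\t}
  s=${s//$'\r'/\\r}
  printf '"%s"' "$s"
}

# write_row CELL...: print one record per line, cells separated by tabs.
write_row() {
  local cell sep=''
  for cell in "$@"; do
    printf '%s' "$sep"
    json_escape "$cell"
    sep=$'\t'
  done
  printf '\n'
}
```

Because newlines and tabs inside a cell are escaped, each call to `write_row` produces exactly one line, and a reader can safely split rows on newlines and cells on tabs.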
|
| 164 | ## Conclusion
|
| 165 |
|
| 166 | Traditional shells mostly support newline-based records. YSH supports:
|
| 167 |
|
| 168 | 1. Length-prefixed records
|
| 169 | 1. Delimiter-based records
|
| 170 | - fixed delimiter: newline or `NUL`
|
| 171 | - chosen delimiter: TODO? with regex capture?
|
| 172 | 1. Escaping-based records with [JSON][] and the [J8 Notation][] extension.
|
| 173 | - But we avoid formats that are purely based on escaping.
|