---
in_progress: yes
---

Solutions to the Framing Problem
================================

How do you write multiple **records** to a pipe, and how do you read them?

You need a way of delimiting them.  Let's call this the "framing problem",
a term borrowed from network engineering.

This doc categorizes different formats, and shows how you handle them in YSH.

YSH is meant for writing correct shell programs.

<div id="toc">
</div>

## A Length Prefix

[Netstrings][netstring] are a simple format defined by Daniel J. Bernstein.

    3:foo,  # ASCII length, colon, byte string, comma

[netstring]: https://en.wikipedia.org/wiki/Netstring

This format is easy to implement, and efficient to read and write.

But the encoded output may contain binary data, which isn't readable by a human
using a terminal (or GUI).  This is significant!

---

TODO: Implement `read --netstr` and `write --netstr`

<!--
Like [J8 Notation][], this format is "8-bit clean", but:

- A netstring encoder is easier to write than a QSN encoder.  This may be
  useful if you don't have a library handy.
- It's more efficient to decode, in theory.
-->

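The netstring format above is small enough to sketch by hand in bash.  This is
an illustration only, not the planned YSH builtins: the function names
`netstr_encode` and `netstr_decode` are made up for this example, and the
sketch assumes payloads without `NUL` bytes, since bash variables can't hold
them.

```shell
# Sketch: netstring helpers for bash.  Function names are hypothetical.
# Assumption: payloads contain no NUL bytes (bash variables can't hold them).

netstr_encode() {
  local LC_ALL=C            # make ${#s} count bytes, not characters
  local s="$1"
  printf '%d:%s,' "${#s}" "$s"
}

# Decode the first netstring in $1 and print its payload.
netstr_decode() {
  local LC_ALL=C            # byte-based indexing below
  local input="$1"
  local len="${input%%:*}"  # digits before the first colon
  local rest="${input#*:}"  # everything after the first colon
  case $len in
    ''|*[!0-9]*) echo 'netstr_decode: bad length' >&2; return 1 ;;
  esac
  len=$((10#$len))          # force base 10, so "08" isn't parsed as octal
  if [ "${rest:len:1}" != ',' ]; then
    echo 'netstr_decode: missing trailing comma' >&2; return 1
  fi
  printf '%s' "${rest:0:len}"
}

netstr_encode 'foo'; echo          # prints 3:foo,
netstr_decode '3:foo,'; echo       # prints foo
```

Because the decoder trusts the length rather than scanning for a delimiter, the
payload may itself contain colons and commas.
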
## Solutions Using a Delimiter

Now let's look at traditional Unix solutions, and their **problems**.

### Fixed Delimiter: Newline or `NUL` byte

In traditional Unix, newlines delimit "records".  Here's how you read them in
shell:

    while IFS='' read -r; do  # confusing idiom!
      echo "line=$REPLY"
      break                   # remaining bytes are still in the pipe
    done

YSH has a simpler idiom:

    while read --raw-line {   # unbuffered
      echo line=$_reply
      break                   # remaining bytes are still in the pipe
    }

Or you can read all lines:

    for line in (stdin) {     # buffered
      echo line=$line
      break                   # remaining bytes may be lost in a buffer
    }

**However**, in Unix, all of these strings may have newlines:

- filenames
- items in `argv`
- values in `environ`

---

But these C-style strings can't contain the `NUL` byte, aka `\0`.  So GNU tools
have evolved support for another format:

    find . -print0  # write data
    xargs -0        # read data; also --null
    grep -z         # read data; also --null-data
    sort -z         # read data; also --zero-terminated
                    # (Why are all the names different?)

In Oils, we added a `-0` flag to `read` that understands this:

    $ find . -print0 | { read -0 x; echo $x; read -0 x; echo $x; }
    foo  # could contain newlines!
    bar

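For comparison, here is a sketch of how `NUL`-terminated records are commonly
read in plain bash, which has no `-0` flag.  It relies on bash's `read -d ''`;
the function name `show_records` is made up for this example.

```shell
# Sketch: read NUL-terminated records in bash (not the Oils `read -0` flag).
# IFS= keeps leading/trailing whitespace, -r keeps backslashes,
# and -d '' makes read stop at a NUL byte instead of a newline.
show_records() {
  local n=0 rec
  while IFS= read -r -d '' rec; do
    n=$((n + 1))
    printf 'record %d: <%s>\n' "$n" "$rec"
  done
}

# Two records; the first one contains a newline.
printf 'foo\nbar\0baz\0' | show_records
```

The embedded newline stays inside the first record, which is the point of
using `NUL` as the delimiter.
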
### Chosen Delimiter: Here docs and multipart MIME

Shell has here docs that look like this:

    cat <<EOF
    the string EOF
    can't start a line
    EOF

So you **choose** the delimiter, with the "word" you write after `<<`.

---

Similarly, when your browser POSTs a form, it uses [MIME multipart message
format](https://en.wikipedia.org/wiki/MIME#Multipart_messages):

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary=frontier

    This is a message with multiple parts in MIME format.
    --frontier
    Content-Type: text/plain

    This is the body of the message.
    --frontier

So again, you **choose** a delimiter with `boundary=frontier`, and then you
must recognize it later in the message.

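A chosen delimiter only works if it doesn't also occur in the payload.  As a
rough bash sketch of that constraint, the hypothetical helper
`choose_boundary` below keeps appending a counter until the candidate no longer
appears in the data.  This is an illustration, not how any particular browser
or MIME library picks its boundary.

```shell
# Sketch: pick a boundary string that does not occur in the payload.
# choose_boundary is a hypothetical helper, not part of any shell or MIME tool.
choose_boundary() {
  local payload="$1"
  local boundary=frontier
  local n=0
  # Append a counter until the candidate no longer appears in the payload.
  while [[ $payload == *"$boundary"* ]]; do
    n=$((n + 1))
    boundary="frontier$n"
  done
  printf '%s\n' "$boundary"
}

choose_boundary 'This is the body of the message.'   # prints frontier
choose_boundary 'a line with --frontier in it'       # prints frontier1
```

Note the cost: the writer has to scan the whole payload before emitting the
first byte, which a length prefix or an escaping scheme avoids.
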
## C-Style `\` escaping allows arbitrary bytes

[JSON][] can express strings with newlines:

    "line 1 \n line 2"

It can also express the zero code point, which isn't the same as the `NUL`
byte:

    "zero code point \u0000"

[J8 Notation][] is an extension of JSON that fixes this:

    "NUL byte \y00"

(We use `\y00` rather than `\x00`, because Python and JavaScript both confuse
`\x00` with `U+0000`.  In UTF-16 or UTF-32, the zero code point is encoded as
2 or 4 `NUL` bytes.)

[J8 Notation]: j8-notation.html
[JSON]: $xref

### Escaping-Based Records

TSV files are based on delimiters, but they aren't very readable in a terminal.

TODO

So TSV8 offers an "aligned" format:

    #.ssv8 flag      desc                 type
    type   Str       Str                  Str
           --verbose "do it \t verbosely" bool
           --count   "count only"         int

So this format combines two strategies:

- Delimiter-based for the **rows** / lines
- Escaping-based for the **cells**

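To illustrate the second strategy, here is a minimal bash sketch of C-style
escaping for a single cell.  The helper name `escape_cell` is made up, and it
only handles backslash, tab, and newline; it is not a J8 or TSV8 encoder.

```shell
# Sketch: C-style escaping so a value with tabs or newlines fits in one cell.
# escape_cell is a hypothetical helper; it is not a J8 or TSV8 encoder.
escape_cell() {
  local s="$1"
  s=${s//\\/\\\\}       # backslash first, so later escapes aren't doubled
  s=${s//$'\t'/\\t}     # literal tab     -> backslash, t
  s=${s//$'\n'/\\n}     # literal newline -> backslash, n
  printf '%s' "$s"
}

# The tab inside the value no longer collides with the row or cell delimiters.
printf '"%s"\n' "$(escape_cell $'do it \t verbosely')"
```

Once the cells are escaped, a newline in the output can only mean "end of
row", so the delimiter-based framing of rows stays unambiguous.
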
## Conclusion

Traditional shells mostly support newline-based records.  YSH supports:

1. Length-prefixed records
1. Delimiter-based records
   - fixed delimiter: newline or `NUL`
   - chosen delimiter: TODO?  with regex capture?
1. Escaping-based records with [JSON][] and the [J8 Notation][] extension.
   - But we avoid formats that are purely based on escaping.