| 1 | --- | 
| 2 | in_progress: yes | 
| 3 | --- | 
| 4 |  | 
| 5 | Solutions to the Framing Problem | 
| 6 | ================================ | 
| 7 |  | 
| 8 | How do you write multiple **records** to a pipe, and how do you read them? | 
| 9 |  | 
| 10 | You need a way of delimiting them.  Let's call this the "framing problem", | 
| 11 | a term borrowed from network engineering. | 
| 12 |  | 
| 13 | This doc categorizes different formats, and shows how you handle them in YSH. | 
| 14 |  | 
| 15 | YSH is meant for writing correct shell programs. | 
| 16 |  | 
| 17 | <div id="toc"> | 
| 18 | </div> | 
| 19 |  | 
| 20 | ## A Length Prefix | 
| 21 |  | 
| 22 | [Netstrings][netstring] are a simple format defined by Daniel J. Bernstein. | 
| 23 |  | 
| 24 |     3:foo,  # ASCII length, colon, byte string, comma | 
| 25 |  | 
| 26 | [netstring]: https://en.wikipedia.org/wiki/Netstring | 
| 27 |  | 
| 28 | This format is easy to implement, and efficient to read and write. | 
| 29 |  | 
| 30 | But the encoded output may contain binary data, which isn't readable by a human | 
| 31 | using a terminal (or GUI).  This is significant! | 
| 32 |  | 
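As a rough illustration of how little code the encoding side needs, here is a minimal sketch in plain shell.  The function name `netstr_encode` is hypothetical (it is not a YSH builtin), and because shell variables can't hold `NUL` bytes, this sketch only handles payloads without them.

```shell
# Hypothetical helper, not a YSH builtin: encode one argument as a netstring.
netstr_encode() {
  # The prefix is a BYTE count, so measure with wc -c rather than ${#1},
  # which counts characters in some shells and locales.
  len=$(( $(printf '%s' "$1" | wc -c) ))
  printf '%s:%s,' "$len" "$1"
}

netstr_encode foo        # 3:foo,
netstr_encode 'two
lines'                   # 9:two<newline>lines,  (still a single record)
```

Note that the second record contains a newline, yet a reader that trusts the length prefix never has to scan for a delimiter.
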
| 33 | --- | 
| 34 |  | 
| 35 | TODO: Implement `read --netstr` and `write --netstr` | 
| 36 |  | 
| 37 | <!-- | 
| 38 | Like [J8 Notation][], this format is "8-bit clean", but: | 
| 39 |  | 
| 40 | - A netstring encoder is easier to write than a QSN encoder.  This may be | 
| 41 | useful if you don't have a library handy. | 
| 42 | - It's more efficient to decode, in theory. | 
| 43 | --> | 
| 44 |  | 
| 45 | ## Solutions Using a Delimiter | 
| 46 |  | 
| 47 | Now let's look at traditional Unix solutions, and their **problems**. | 
| 48 |  | 
| 49 | ### Fixed Delimiter: Newline or `NUL` byte | 
| 50 |  | 
| 51 | In traditional Unix, newlines delimit "records".  Here's how you read them in | 
| 52 | shell: | 
| 53 |  | 
| 54 |     while IFS='' read -r; do  # confusing idiom! | 
| 55 |       echo "line=$REPLY" | 
| 56 |       break                   # remaining bytes are still in the pipe | 
| 57 |     done | 
| 58 |  | 
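To see why the idiom needs both `IFS=` and `-r`, here is a small sketch comparing a plain `read` with the careful form.  The function names `read_naive` and `read_raw` are made up for this illustration.

```shell
# Plain `read` trims leading/trailing IFS whitespace and treats backslash
# as an escape character.
read_naive() { read line; printf '[%s]\n' "$line"; }

# IFS= disables the trimming; -r disables backslash processing.
read_raw() { IFS= read -r line; printf '[%s]\n' "$line"; }

printf '  a\\tb  \n' | read_naive   # [atb]
printf '  a\\tb  \n' | read_raw     # [  a\tb  ]
```

Only the second form returns the bytes of the line unchanged.
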
| 59 | YSH has a simpler idiom: | 
| 60 |  | 
| 61 |     while read --raw-line {   # unbuffered | 
| 62 |       echo line=$_reply | 
| 63 |       break                   # remaining bytes are still in the pipe | 
| 64 |     } | 
| 65 |  | 
| 66 | Or you can read all lines: | 
| 67 |  | 
| 68 |     for line in (stdin) {     # buffered | 
| 69 |       echo line=$line | 
| 70 |       break                   # remaining bytes may be lost in a buffer | 
| 71 |     } | 
| 72 |  | 
| 73 | **However**, in Unix, all of these strings may have newlines: | 
| 74 |  | 
| 75 | - filenames | 
| 76 | - items in `argv` | 
| 77 | - values in `environ` | 
| 78 |  | 
| 79 | --- | 
| 80 |  | 
| 81 | But these C-style strings can't contain the `NUL` byte, aka `\0`.  So GNU tools | 
| 82 | have evolved support for another format: | 
| 83 |  | 
| 84 |     find . -print0  # write data | 
| 85 |     xargs -0        # read data; also --null | 
| 86 |     grep -z         # read data; also --null-data | 
| 87 |     sort -z         # read data; also --zero-terminated | 
| 88 |     # (Why are all the names different?) | 
| 89 |  | 
| 90 | In Oils, we added a `-0` flag to `read` that understands this format: | 
| 91 |  | 
| 92 |     $ find . -print0 | { read -0 x; echo $x; read -0 x; echo $x; } | 
| 93 |     foo  # could contain newlines! | 
| 94 |     bar | 
| 95 |  | 
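The same framing can be checked without Oils.  The sketch below feeds two `NUL`-terminated records, one of which contains a newline, to `xargs -0`; the helper name `show_records` is hypothetical, and it assumes an `xargs` that supports `-0`.

```shell
# Hypothetical helper: print each NUL-terminated record from stdin
# wrapped in angle brackets, one printf invocation per record.
show_records() {
  xargs -0 -n 1 printf '<%s>\n'
}

# Two records; the first one contains a newline.
printf 'a\nb\000c\000' | show_records
# <a
# b>
# <c>
```

A newline-based reader would have seen three records here instead of two.
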
| 96 | ### Chosen Delimiter: Here docs and multipart MIME | 
| 97 |  | 
| 98 | Shell has here docs that look like this: | 
| 99 |  | 
| 100 |     cat <<EOF | 
| 101 |     the string EOF | 
| 102 |     can't start a line | 
| 103 |     EOF | 
| 104 |  | 
| 105 | So you **choose** the delimiter, with the "word" you write after `<<`. | 
| 106 |  | 
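For instance, when the payload itself has a line reading `EOF`, you can pick a word that doesn't occur in it.  This is a minimal sketch; the delimiter `END_RECORD` and the function name `print_record` are arbitrary choices for illustration.

```shell
# The body contains a line reading EOF, so we choose a different delimiter.
# Quoting the delimiter ('END_RECORD') also turns off $ expansion in the body.
print_record() {
  cat <<'END_RECORD'
EOF
cost: $5
END_RECORD
}

print_record
# EOF
# cost: $5
```

The writer still has to know the payload well enough to pick a word that never appears alone on a line.
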
| 107 | --- | 
| 108 |  | 
| 109 | Similarly, when your browser POSTs a form, it uses [MIME multipart message | 
| 110 | format](https://en.wikipedia.org/wiki/MIME#Multipart_messages): | 
| 111 |  | 
| 112 |     MIME-Version: 1.0 | 
| 113 |     Content-Type: multipart/mixed; boundary=frontier | 
| 114 |  | 
| 115 |     This is a message with multiple parts in MIME format. | 
| 116 |     --frontier | 
| 117 |     Content-Type: text/plain | 
| 118 |  | 
| 119 |     This is the body of the message. | 
| 120 |     --frontier | 
| 121 |  | 
| 122 | So again, you **choose** a delimiter with `boundary=frontier`, and then you | 
| 123 | must recognize it later in the message. | 
| 124 |  | 
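Recognizing the delimiter on the reading side can be sketched with `awk`.  The helper `split_parts` is hypothetical, and it simplifies real MIME: it assumes LF line endings and that the message ends with the closing form `--frontier--`.

```shell
# Hypothetical helper: label each line of a multipart body with its part number.
# $1 is the boundary string the sender chose (without the leading "--").
split_parts() {
  awk -v b="--$1" '
    $0 == (b "--") { exit }            # closing delimiter ends the message
    $0 == b        { n++; next }       # delimiter line starts the next part
    n > 0          { print n ": " $0 } # skip the preamble before the first part
  '
}

printf '%s\n' 'preamble' '--frontier' 'part one' '--frontier' 'part two' '--frontier--' |
  split_parts frontier
# 1: part one
# 2: part two
```

If a part's content happened to contain the line `--frontier`, this reader would split it incorrectly, which is why the sender must choose a boundary that doesn't appear in the data.
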
| 125 | ## C-Style `\` Escaping Allows Arbitrary Bytes | 
| 126 |  | 
| 127 | [JSON][] can express strings with newlines: | 
| 128 |  | 
| 129 |     "line 1 \n line 2" | 
| 130 |  | 
| 131 | It can also express the zero code point, which isn't the same as the `NUL` byte: | 
| 132 |  | 
| 133 |     "zero code point \u0000" | 
| 134 |  | 
| 135 | [J8 Notation][] is an extension of JSON that fixes this: | 
| 136 |  | 
| 137 |     "NUL byte \y00" | 
| 138 |  | 
| 139 | (We use `\y00` rather than `\x00`, because Python and JavaScript both confuse | 
| 140 | `\x00` with `U+0000`.  The zero code point may be encoded as 2 `NUL` bytes in | 
| 141 | UTF-16 or 4 `NUL` bytes in UTF-32.) | 
| 142 |  | 
| 143 | [J8 Notation]: j8-notation.html | 
| 144 | [JSON]: $xref | 
| 145 |  | 
| 146 | ### Escaping-Based Records | 
| 147 |  | 
| 148 | TSV files are based on delimiters, but they aren't very readable in a terminal. | 
| 149 |  | 
| 150 | TODO | 
| 151 |  | 
| 152 | So TSV8 offers an "aligned" format: | 
| 153 |  | 
| 154 |     #.ssv8 flag      desc                 type | 
| 155 |     type   Str       Str                  Str | 
| 156 |            --verbose "do it \t verbosely" bool | 
| 157 |            --count   "count only"         int | 
| 158 |  | 
| 159 | So this format combines two strategies: | 
| 160 |  | 
| 161 | - Delimiter-based for the **rows** / lines | 
| 162 | - Escaping-based for the **cells** | 
| 163 |  | 
| 164 | ## Conclusion | 
| 165 |  | 
| 166 | Traditional shells mostly support newline-based records.  YSH supports: | 
| 167 |  | 
| 168 | 1. Length-prefixed records | 
| 169 | 1. Delimiter-based records | 
| 170 |    - fixed delimiter: newline or `NUL` | 
| 171 |    - chosen delimiter: TODO?  with regex capture? | 
| 172 | 1. Escaping-based records with [JSON][] and the [J8 Notation][] extension. | 
| 173 |    - But we avoid formats that are purely based on escaping. |