---
in_progress: yes
---

Solutions to the Framing Problem
================================

How do you write multiple **records** to a pipe, and how do you read them?

You need a way of delimiting them. Let's call this the "framing problem",
a term borrowed from network engineering.

This doc categorizes different formats, and shows how you handle them in YSH.

YSH is meant for writing correct shell programs.

<div id="toc">
</div>

## A Length Prefix

[Netstrings][netstring] are a simple format defined by Daniel J. Bernstein.

    3:foo,  # ASCII length, colon, byte string, comma

[netstring]: https://en.wikipedia.org/wiki/Netstring

This format is easy to implement, and efficient to read and write.

But the encoded output may contain binary data, which isn't readable by a human
using a terminal (or GUI). This is significant!

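As a rough illustration of the `length:bytes,` layout, here is a minimal sketch in plain bash rather than YSH. The function names `netstr_encode` and `netstr_decode` are hypothetical. The sketch assumes bash 4.1 or later (for `read -N`), counts characters rather than bytes (so it is only right for ASCII payloads), and can't carry `NUL` bytes, because bash variables can't hold them.

```shell
# Hypothetical helpers, for illustration only.
# Assumptions: ASCII payloads without NUL bytes, bash 4.1+ (for read -N).

netstr_encode() {
  # ${#1} is a character count, which equals the byte count for ASCII data
  printf '%d:%s,' "${#1}" "$1"
}

netstr_decode() {
  # Reads exactly one netstring from stdin and prints its payload
  local len payload='' comma
  IFS= read -r -d ':' len || return 1            # digits up to the colon
  [[ $len =~ ^(0|[1-9][0-9]*)$ ]] || return 1    # reject malformed lengths
  if [[ $len != 0 ]]; then
    IFS= read -r -N "$len" payload || return 1   # exactly len characters
  fi
  IFS= read -r -N 1 comma || return 1
  [[ $comma == ',' ]] || return 1                # the trailing comma is required
  printf '%s' "$payload"
}

netstr_encode 'foo'; echo    # prints 3:foo,

# Two records on one pipe: prints foo, then hello on the next line
printf '3:foo,5:hello,' | { netstr_decode; echo; netstr_decode; echo; }
```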

---

TODO: Implement `read --netstr` and `write --netstr`

<!--
Like [J8 Notation][], this format is "8-bit clean", but:

- A netstring encoder is easier to write than a QSN encoder. This may be
  useful if you don't have a library handy.
- It's more efficient to decode, in theory.
-->

## Solutions Using a Delimiter

Now let's look at traditional Unix solutions, and their **problems**.

### Fixed Delimiter: Newline or `NUL` byte

In traditional Unix, newlines delimit "records". Here's how you read them in
shell:

    while IFS='' read -r; do  # confusing idiom!
      echo "line=$REPLY"
      break  # remaining bytes are still in the pipe
    done

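To see what "remaining bytes are still in the pipe" means in practice, here is a small bash sketch. The function name `first_then_rest` is made up for this example. Because bash's `read` consumes a pipe one byte at a time, it stops right after the first newline, and a later reader on the same pipe (here `cat`) picks up the rest.

```shell
# Hypothetical demo: one read, then hand the same pipe to another reader.
first_then_rest() {
  printf 'one\ntwo\nthree\n' | {
    IFS= read -r first              # consumes "one" and its newline only
    printf 'first=%s\n' "$first"
    printf 'rest=%s\n' "$(cat)"     # cat still receives the other two lines
  }
}

first_then_rest
# prints:
#   first=one
#   rest=two
#   three
```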

YSH has a simpler idiom:

    while read --raw-line {  # unbuffered
      echo line=$_reply
      break  # remaining bytes are still in the pipe
    }

Or you can read all lines:

    for line in (stdin) {  # buffered
      echo line=$line
      break  # remaining bytes may be lost in a buffer
    }

**However**, in Unix, all of these strings may contain newlines:

- filenames
- items in `argv`
- values in `environ`

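The filename case is easy to reproduce. This bash sketch (the function name `newline_in_filename` is hypothetical) creates a single file whose name contains a newline, then counts it two ways: a glob sees one file, while the newline-delimited output of `find` looks like two records.

```shell
# Hypothetical demo: one file, but two "records" under newline framing.
newline_in_filename() {
  local dir
  dir=$(mktemp -d) || return 1
  touch "$dir/"$'one\ntwo'          # a single file with a newline in its name
  set -- "$dir"/*                   # glob expansion keeps each name intact
  printf 'files=%d\n' "$#"
  printf 'lines=%d\n' "$(find "$dir" -type f | wc -l)"
  rm -rf "$dir"
}

newline_in_filename
# prints:
#   files=1
#   lines=2
```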

---

But these C-style strings can't contain the `NUL` byte, aka `\0`. So GNU tools
have evolved support for another format:

    find . -print0  # write data
    xargs -0        # read data; also --null
    grep -z         # read data; also --null-data
    sort -z         # read data; also --zero-terminated
                    # (Why are all the names different?)

In Oils, we added a `-0` flag to `read` that understands this format:

    $ find . -print0 | { read -0 x; echo $x; read -0 x; echo $x; }
    foo    # could contain newlines!
    bar

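For comparison, bash's `read` has no `-0` flag, but `read -d ''` treats `NUL` as the delimiter. The sketch below (the function name `read_nul_records` and the global array `records` are hypothetical) collects `NUL`-terminated records into an array, and shows that a record containing a newline survives intact.

```shell
# Hypothetical helper: read NUL-terminated records from stdin into an array.
read_nul_records() {
  records=()
  local rec
  while IFS= read -r -d '' rec; do    # -d '' makes NUL the delimiter
    records+=("$rec")
  done
}

# Two records; the first one contains a newline.
read_nul_records < <(printf 'foo\nbar\0baz\0')
printf 'count=%d\n' "${#records[@]}"    # prints count=2
printf 'first=%q\n' "${records[0]}"     # prints first=$'foo\nbar'
```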

### Chosen Delimiter: Here docs and multipart MIME

Shell has here docs that look like this:

    cat <<EOF
    the string EOF
    can't be a whole line
    EOF

So you **choose** the delimiter, with the "word" you write after `<<`.

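One consequence is that a program generating a here doc has to pick a word that doesn't collide with any line of the payload. A minimal bash sketch, assuming the payload arrives on stdin; `pick_delim` and the `EOF_1`, `EOF_2`, ... naming scheme are invented for this example:

```shell
# Hypothetical helper: print a here-doc word that matches no payload line.
pick_delim() {
  local payload delim=EOF n=0
  payload=$(cat)
  # -x: match whole lines only, -F: fixed string, -q: no output
  while grep -qxF -- "$delim" <<< "$payload"; do
    n=$((n + 1))
    delim="EOF_$n"
  done
  printf '%s\n' "$delim"
}

printf 'a\nEOF\nb\n' | pick_delim    # prints EOF_1
printf 'a\nb\n' | pick_delim         # prints EOF
```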

---

Similarly, when your browser POSTs a form, it may use the [MIME multipart
message format](https://en.wikipedia.org/wiki/MIME#Multipart_messages):

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary=frontier

    This is a message with multiple parts in MIME format.
    --frontier
    Content-Type: text/plain

    This is the body of the message.
    --frontier

So again, you **choose** a delimiter with `boundary=frontier`, and then you
must recognize it later in the message.

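Recognizing the delimiter can be sketched with `awk`. This example assumes LF line endings (real MIME messages use CRLF) and relies on the MIME rule that each part starts at a `--boundary` line and the last part ends at `--boundary--`. The function name `list_parts` is hypothetical. It prefixes each line with the number of the part it belongs to, and skips the preamble and epilogue.

```shell
# Hypothetical helper: list_parts BOUNDARY < message
list_parts() {
  awk -v delim="--$1" '
    $0 == (delim "--") { exit }               # closing delimiter: stop
    $0 == delim        { part++; next }       # a new part starts here
    part > 0           { print part ": " $0 } # skip the preamble before part 1
  '
}

printf '%s\n' 'preamble' '--frontier' 'first' '--frontier' 'second' \
  '--frontier--' 'epilogue' | list_parts frontier
# prints:
#   1: first
#   2: second
```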

## C-Style `\` escaping allows arbitrary bytes

[JSON][] can express strings with newlines:

    "line 1 \n line 2"

It can also express the zero code point, which isn't the same as the `NUL`
byte:

    "zero code point \u0000"

[J8 Notation][] is an extension of JSON that fixes this:

    "NUL byte \y00"

(We use `\y00` rather than `\x00`, because Python and JavaScript both confuse
`\x00` with `U+0000`. The zero code point may be encoded as 2 or 4 `NUL`
bytes, for example in UTF-16 or UTF-32.)

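To make the escaping idea concrete, here is a deliberately partial bash sketch that escapes a string so it fits on one line. It handles only backslash, double quote, newline, and tab; a real JSON or J8 encoder must also cover the other control characters, and bash can't represent `NUL` in a variable at all. The name `escape_c_style` is hypothetical.

```shell
# Hypothetical, partial encoder: NOT a complete JSON or J8 implementation.
escape_c_style() {
  local s=$1
  s=${s//\\/\\\\}        # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}        # double quote
  s=${s//$'\n'/\\n}      # newline becomes the two characters \ and n
  s=${s//$'\t'/\\t}      # tab becomes the two characters \ and t
  printf '"%s"\n' "$s"
}

escape_c_style $'line 1 \n line 2'    # prints "line 1 \n line 2"
```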

[J8 Notation]: j8-notation.html
[JSON]: $xref

### Escaping-Based Records

TSV files are based on delimiters, but they aren't very readable in a terminal.

TODO

So TSV8 offers an "aligned" format:

    #.ssv8 flag       desc                  type
    type   Str        Str                   Str
           --verbose  "do it \t verbosely"  bool
           --count    "count only"          int

So this format combines two strategies:

- Delimiter-based for the **rows** / lines
- Escaping-based for the **cells**

## Conclusion

Traditional shells mostly support newline-based records. YSH supports:

1. Length-prefixed records
1. Delimiter-based records
   - fixed delimiter: newline or `NUL`
   - chosen delimiter: TODO? with regex capture?
1. Escaping-based records with [JSON][] and the [J8 Notation][] extension.
   - But we avoid formats that are purely based on escaping.