doc/framing.md

---
in_progress: yes
---

Solutions to the Framing Problem
================================

How do you write multiple records to a pipe, and how do you read them?

You need a way of delimiting them.  Let's call this the "framing problem",
a term borrowed from network engineering.

This doc categorizes different formats, and shows how you handle them in YSH.

YSH is meant for writing correct shell programs.

<div id="toc">
</div>

## A Length Prefix

[Netstrings][netstring] are a simple format defined by Daniel J. Bernstein.

    3:foo,  # ASCII length, colon, byte string, comma

[netstring]: https://en.wikipedia.org/wiki/Netstring

This format is easy to implement, and efficient to read and write.
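To back up the "easy to implement" claim, here is a minimal Python sketch of a netstring encoder and decoder. It is not how YSH does it; the names `netstring_encode` and `netstring_decode` are made up for illustration, and the decoder skips niceties like rejecting leading zeros or capping the length.

```python
def netstring_encode(payload: bytes) -> bytes:
    """Frame payload as <ASCII length>:<bytes>,"""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def netstring_decode(data: bytes):
    """Decode one netstring from the front of data; return (payload, rest)."""
    head, sep, tail = data.partition(b":")
    if not sep or not head.isdigit():
        raise ValueError("missing or malformed length prefix")
    length = int(head)
    # The payload is taken by count, never by scanning for a delimiter.
    if len(tail) < length + 1 or tail[length:length + 1] != b",":
        raise ValueError("truncated payload or missing trailing comma")
    return tail[:length], tail[length + 1:]
```

The reader never scans the payload, so it may contain newlines, `NUL` bytes, or commas: `netstring_encode(b"a\n\0,b")` gives `b"5:a\n\x00,b,"`.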

But the encoded output may contain binary data, which isn't readable by a human
using a terminal (or GUI).  This is significant!

---

TODO: Implement `read --netstr` and `write --netstr`

<!--
Like [J8 Notation][], this format is "8-bit clean", but:

- A netstring encoder is easier to write than a QSN encoder.  This may be
  useful if you don't have a library handy.
- It's more efficient to decode, in theory.
-->

## Solutions Using a Delimiter

Now let's look at traditional Unix solutions, and their problems.

### Fixed Delimiter: Newline or `NUL` byte

In traditional Unix, newlines delimit "records".  Here's how you read them in
shell:

    while IFS='' read -r; do  # confusing idiom!
      echo "line=$REPLY"
      break  # remaining bytes are still in the pipe
    done

YSH has a simpler idiom:

    while read --raw-line {  # unbuffered
      echo line=$_reply
      break  # remaining bytes are still in the pipe
    }

Or you can read all lines:

    for line in (stdin) {  # buffered
      echo line=$line
      break  # remaining bytes may be lost in a buffer
    }
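Why does buffering matter? If a later `read` or another process is supposed to consume the rest of the pipe, the first reader must not take more than one record. Here is a Python sketch of the difference. It says nothing about how YSH is implemented, and `read_line_unbuffered` and `demo` are hypothetical names:

```python
import os


def read_line_unbuffered(fd):
    """Read one newline-terminated record, one byte per read() call."""
    out = bytearray()
    while True:
        byte = os.read(fd, 1)
        if byte in (b"", b"\n"):  # EOF or end of record
            return bytes(out)
        out += byte


def demo(buffered):
    """Read one line from a pipe holding two; return (line, bytes left in pipe)."""
    r, w = os.pipe()
    os.write(w, b"first\nsecond\n")
    os.close(w)
    if buffered:
        f = os.fdopen(r, "rb")  # BufferedReader: reads ahead in big chunks
        line = f.readline().rstrip(b"\n")
        rest = os.read(r, 100)  # what is still in the pipe itself
        f.close()
    else:
        line = read_line_unbuffered(r)
        rest = os.read(r, 100)
        os.close(r)
    return line, rest
```

With the byte-at-a-time reader, `second\n` is still in the pipe afterwards. With the buffered reader the pipe is empty, because both lines were pulled into the process's own buffer. One `read()` call per byte is slow; that is the price of leaving the rest untouched.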

However, in Unix, all of these strings may contain newlines:

- filenames
- items in `argv`
- values in `environ`

---

But these C-style strings can't contain the `NUL` byte, aka `\0`.  So GNU tools
have evolved support for another format:

    find . -print0  # write data
    xargs -0        # read data; also --null
    grep -z         # read data; also --null-data
    sort -z         # read data; also --zero-terminated
                    # (Why are all the names different?)

In Oils, we added a `-0` flag to `read` that understands this format:

    $ find . -print0 | { read -0 x; echo $x; read -0 x; echo $x; }
    foo  # could contain newlines!
    bar
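On the reading side, `NUL`-terminated records are just a split on one byte. Here is a Python sketch, assuming the whole stream fits in memory; `split_nul_records` is a made-up name:

```python
def split_nul_records(data: bytes) -> list:
    """Split NUL-terminated records; a missing final NUL is tolerated."""
    records = data.split(b"\0")
    if records[-1] == b"":
        records.pop()  # nothing follows the final terminator
    return records
```

For example, `split_nul_records(b"foo\nbar\0baz\0")` returns `[b"foo\nbar", b"baz"]`: the embedded newline survives, which a line-based reader can't guarantee.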

### Chosen Delimiter: Here docs and multipart MIME

Shell has here docs that look like this:

    cat <<EOF
    the string EOF
    can't be on a line by itself
    EOF

So you choose the delimiter, with the "word" you write after `<<`.

---

Similarly, when your browser POSTs a form, it uses the [MIME multipart message
format](https://en.wikipedia.org/wiki/MIME#Multipart_messages):

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary=frontier

    This is a message with multiple parts in MIME format.
    --frontier
    Content-Type: text/plain

    This is the body of the message.
    --frontier

So again, you choose a delimiter with `boundary=frontier`, and then you
must recognize it later in the message.
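Both here docs and MIME leave it to the writer to pick a delimiter that doesn't collide with the data. Here is a Python sketch of that contract. It is a simplified framing, not real MIME (no headers, no closing `--frontier--`), and `choose_boundary`, `frame`, and `unframe` are names invented for this example:

```python
import itertools


def choose_boundary(parts, base="frontier"):
    """Return a boundary string that is not a substring of any part."""
    for n in itertools.count():
        boundary = base if n == 0 else "%s%d" % (base, n)
        if not any(boundary in part for part in parts):
            return boundary


def frame(parts, boundary):
    """Write each part followed by a line holding '--' + boundary."""
    return "".join("%s\n--%s\n" % (part, boundary) for part in parts)


def unframe(text, boundary):
    """Recover the parts by recognizing the delimiter lines."""
    chunks = text.split("\n--%s\n" % boundary)
    return chunks[:-1]  # nothing follows the final delimiter
```

If one of the parts already mentions `frontier`, `choose_boundary` skips it and returns `frontier1`, and `unframe(frame(parts, b), b)` gives back the original list. The cost is that the writer must scan every part before writing anything.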

## C-Style `\` escaping allows arbitrary bytes

[JSON][] can express strings with newlines:

    "line 1 \n line 2"

It can also express the zero code point, which isn't the same as the `NUL`
byte:

    "zero code point \u0000"

[J8 Notation][] is an extension of JSON that fixes this:

    "NUL byte \y00"

(We use `\y00` rather than `\x00`, because Python and JavaScript both confuse
`\x00` with `U+0000`.  The zero code point may be encoded as 2 or 4 `NUL`
bytes.)

[J8 Notation]: j8-notation.html
[JSON]: $xref
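Python's `json` module can illustrate the JSON side of this. The sketch below only demonstrates JSON and Python string semantics; it does not implement J8 Notation, which the standard library doesn't know about:

```python
import json

# A newline inside the string becomes the two characters \ and n,
# so the encoded record fits on one line.
encoded = json.dumps("line 1 \n line 2")
fits_on_one_line = "\n" not in encoded

# \u0000 decodes to a code point, and Python's "\x00" means the same thing.
zero = json.loads('"\\u0000"')
same_as_x00 = zero == "\x00"

# How many NUL bytes that code point becomes depends on the encoding.
nul_bytes = {enc: len(zero.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Python's json refuses byte strings: JSON strings are code points, not bytes.
try:
    json.dumps(b"\xff")
    bytes_allowed = True
except TypeError:
    bytes_allowed = False
```

Here `nul_bytes` comes out as 1, 2, and 4 for UTF-8, UTF-16, and UTF-32, and `bytes_allowed` is `False`.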

### Escaping-Based Records

TSV files are based on delimiters, but they aren't very readable in a terminal.

TODO

So TSV8 offers an "aligned" format:

    #.ssv8 flag       desc                   type
    type   Str        Str                    Str
           --verbose  "do it \t verbosely"   bool
           --count    "count only"           int

So this format combines two strategies:

- Delimiter-based for the rows / lines
- Escaping-based for the cells
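Here is a Python sketch of combining the two strategies. It is not a TSV8 implementation: it assumes tab-separated cells and borrows JSON string syntax for the escaped cells, where TSV8 would use J8 strings. `encode_row`, `decode_row`, `encode_rows`, and `decode_rows` are hypothetical helpers:

```python
import json


def encode_row(cells):
    """Join cells with tabs; quote a cell only when it would break framing."""
    out = []
    for cell in cells:
        if any(ch in cell for ch in "\t\n\r") or cell.startswith('"'):
            out.append(json.dumps(cell))  # escapes tabs and newlines
        else:
            out.append(cell)
    return "\t".join(out)


def decode_row(line):
    """Split on tabs, then unescape the cells that were quoted."""
    return [json.loads(f) if f.startswith('"') else f
            for f in line.split("\t")]


def encode_rows(rows):
    """One row per line: the newline is the row delimiter."""
    return "".join(encode_row(row) + "\n" for row in rows)


def decode_rows(text):
    return [decode_row(line) for line in text.split("\n")[:-1]]
```

Because quoted cells never contain a raw tab or newline, the reader can split on those two bytes first and unescape afterwards. A cell like `do it <TAB> verbosely` is written as `"do it \t verbosely"`, as in the table above.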

## Conclusion

Traditional shells mostly support newline-based records.  YSH supports:

1. Length-prefixed records
1. Delimiter-based records
   - fixed delimiter: newline or `NUL`
   - chosen delimiter: TODO?  with regex capture?
1. Escaping-based records with [JSON][] and the [J8 Notation][] extension.
   - But we avoid formats that are purely based on escaping.