QSN: A Familiar String Interchange Format
=========================================

<style>
.q {
  color: darkred;
}
.comment {
  color: green;
  font-style: italic;
}
.terminal {
  color: darkred;
  font-family: monospace;
}
.an {
  color: darkgreen;
}

.attention {
  font-size: x-large;
  background-color: #DEE;
  margin-left: 1em;
  margin-right: 1em;
  padding-left: 1em;
  padding-right: 1em;
}
</style>

&nbsp;

&nbsp;

<div class=attention>

&nbsp;

As of January 2024, QSN has been replaced by [J8 Notation](j8-notation.html).
They're very similar, but J8 Notation is more "harmonized" with JSON.

&nbsp;

</div>

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

QSN ("quoted string notation") is a data format for **byte strings**.
Examples:

<pre>
'' <span class=comment># empty string</span>
'my favorite song.mp3'
'bob<span class=q>\t</span>1.0<span class=q>\n</span>carol<span class=q>\t</span>2.0<span class=q>\n</span>' <span class=comment># tabs and newlines</span>
'BEL = <span class=q>\x07</span>' <span class=comment># byte escape</span>
'mu = <span class=q>\u{03bc}</span>' <span class=comment># Unicode char escape</span>
'mu = &#x03bc;' <span class=comment># represented literally, not escaped</span>
</pre>

It's an adaptation of Rust's string literal syntax, with a few use cases:

- To print filenames to a terminal. Printing arbitrary bytes to a
  terminal is bad, so programs like [coreutils]($xref) already have [informal
  QSN-like formats][coreutils-quotes].
- To exchange data between different programs, like [JSON][] or UTF-8. Note
  that JSON can't express arbitrary byte strings.
- To solve the "[framing problem](framing.html)" over pipes. QSN represents
  newlines as `\n`, so literal newlines can be used to delimit records.

Oil uses QSN because it's well-defined and parsable. It's both human- and
machine-readable.

Any programming language or tool that understands JSON should also understand
QSN.

[JSON]: https://json.org

<div id="toc">
</div>

<!--
### The Terminal Use Case

Filenames may contain arbitrary bytes, including ones that will <span
class=terminal>change your terminal color</span>, and more. Most command line
programs need something like QSN, or they'll have subtle bugs.

For example, as of 2016, [coreutils quotes funny filenames][coreutils] to avoid
the same problem. However, they didn't specify the format so it can be parsed.
In contrast, QSN can be parsed and printed like JSON.

-->

<!--
The quoting only happens when `isatty()`, so it's not really meant
to be parsed.
-->

## Important Properties

- QSN can represent **any byte sequence**.
- Given a QSN-encoded string, any two decoders must produce the same byte string.
  (On the other hand, encoders have flexibility with regard to escaping.)
- An encoded string always fits on a **single line**. Newlines must be encoded as
  `\n`, not literal.
- An encoded string always fits in a **TSV cell**. Tabs must be encoded as `\t`,
  not literal.
- An encoded string can itself be **valid UTF-8**.
  - Example: `'μ \xff'` is valid UTF-8, even though the decoded string is not.
- An encoded string can itself be **valid ASCII**.
  - Example: `'\xce\xbc'` is valid ASCII, even though the decoded string is
    not.

## More QSN Use Cases

- To pack arbitrary bytes on a **single line**, e.g. for line-based tools like
  [grep]($xref), [awk]($xref), and [xargs]($xref). QSN strings never contain
  literal newlines or tabs.
- For `set -x` in shell. Like filenames, Unix `argv` arrays may contain
  arbitrary bytes. There's an example in the appendix.
  - `ps` has to display untrusted `argv` arrays.
  - `ls` has to display untrusted filenames.
  - `env` has to display untrusted byte strings. (Most versions of `env` don't
    handle newlines well.)
- As a building block for larger specifications, like [QTT][].
- To transmit arbitrary bytes over channels that can only represent ASCII or
  UTF-8 (e.g. e-mail, Twitter).

[QTT]: qtt.html
[surrogate pairs]: https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
[coreutils]: https://www.gnu.org/software/coreutils/quotes.html

## Specification

### A Short Description

1. Start with [Rust String Literal Syntax](https://doc.rust-lang.org/reference/tokens.html#string-literals)
2. Use **single quotes** instead of double quotes to surround the string. This
   is mainly to avoid confusion with JSON.

### An Analogy

<pre>

   <span class=an>JavaScript Object Literals</span> are to <span class=an>JSON</span>
as <span class=an>Rust String Literals</span> are to <span class=an>QSN</span>

</pre>

But QSN is **not** tied to either Rust or shell, just like JSON isn't tied to
JavaScript.

It's a **language-independent format** like UTF-8 or HTML. We're only
borrowing a design, so that it's well-specified and familiar.

### Full Spec

TODO: The short description above should be sufficient, but we might want to
write it out.

- Special escapes:
  - `\t` `\r` `\n`
  - `\'` `\"`
  - `\\`
  - `\0`
- Byte escapes: `\x7F`
- Character escapes: `\u{03bc}` or `\u{0003bc}`. A decoder emits the UTF-8
  encoding of the code point.

## Advantages Over JSON Strings

- QSN can represent any byte string, like `'\x00\xff\x00'`. JSON can't
  represent **binary data** directly.
- QSN can represent any code point, like `'\u{01f600}'` for &#x01f600;. JSON
  needs awkward [surrogate pairs][] to represent this code point.
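
Both limitations are easy to observe. The following is a small illustration using Python's standard `json` module, which escapes non-ASCII characters by default; it is only one JSON implementation, but the surrogate-pair form comes from JSON's `\uXXXX` escape, which is limited to four hex digits.

```python
import json

# U+1F600 lies outside the Basic Multilingual Plane.
s = "\U0001f600"

# With the default ensure_ascii=True, the single code point is written
# as a UTF-16 surrogate pair.
encoded = json.dumps(s)
print(encoded)  # "\ud83d\ude00"

# Raw bytes can't be serialized at all without an extra encoding layer.
try:
    json.dumps(b"\x00\xff\x00")
    bytes_ok = True
except TypeError:
    bytes_ok = False
print(bytes_ok)  # False
```

In QSN the same data is written as `'\u{1f600}'` and `'\x00\xff\x00'`.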

## Implementation Issues

### How Does a QSN Encoder Deal with Unicode?

The input to a QSN encoder is a raw **byte string**. However, the string may
have additional structure, like being UTF-8 encoded.

The encoder has three options to deal with this structure:

1. **Don't decode** UTF-8. Walk through bytes one-by-one, showing unprintable
   ones with escapes like `\xce\xbc`. Never emit escapes like `\u{3bc}` or
   literals like <code>&#x03bc;</code>. This option is OK for machines, but
   isn't friendly to humans who can read Unicode characters.

Or **speculatively decode** UTF-8. After decoding a valid UTF-8 sequence,
there are two options:

2. Show **escaped code points**, like `\u{3bc}`. The encoded string is limited
   to the ASCII subset, which is useful in some contexts.

3. Show them **literally**, like <code>&#x03bc;</code>.

QSN encoding should never fail; it should only fall back to byte escapes like
`\xff`. TODO: Show the state machine for detecting and decoding UTF-8.

Note: With strategies 2 and 3, the output indicates whether the input is valid
UTF-8, because a byte escape of `\x80` or above appears only for bytes that
aren't part of a valid sequence.
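
The three strategies can be sketched in a few lines of Python. This is a minimal illustration, not Oil's implementation: the names `qsn_encode`, `escape_byte`, `utf8_len` and the `strategy` parameter are hypothetical, and it leans on Python's built-in strict UTF-8 decoder instead of a hand-written state machine. As a simplification, strategy 3 emits every decoded code point literally, including ones a terminal may not display.

```python
# Hypothetical sketch; names and signatures are not taken from Oil's qsn.py.
SPECIAL = {0x09: "\\t", 0x0A: "\\n", 0x0D: "\\r", 0x27: "\\'", 0x5C: "\\\\"}


def escape_byte(b: int) -> str:
    """Encode one byte that is not part of a decoded UTF-8 sequence."""
    if b in SPECIAL:
        return SPECIAL[b]
    if b < 0x20 or b >= 0x7F:  # unprintable: fall back to a byte escape
        return "\\x%02x" % b
    return chr(b)


def utf8_len(data: bytes, i: int) -> int:
    """Length of a valid multi-byte UTF-8 sequence at data[i], else 0."""
    lead = data[i]
    if 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 0
    try:
        # The strict decoder rejects truncated, overlong and surrogate forms.
        data[i:i + n].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return n


def qsn_encode(data: bytes, strategy: int = 3) -> str:
    """strategy 1: bytes only; 2: \\u{...} escapes; 3: literal code points."""
    out = ["'"]
    i = 0
    while i < len(data):
        n = utf8_len(data, i) if strategy != 1 else 0
        if n:
            ch = data[i:i + n].decode("utf-8")
            out.append(ch if strategy == 3 else "\\u{%x}" % ord(ch))
            i += n
        else:
            out.append(escape_byte(data[i]))  # never fails
            i += 1
    out.append("'")
    return "".join(out)


mu = "\u03bc".encode("utf-8") + b"\xff"
print(qsn_encode(mu, 1))  # '\xce\xbc\xff'
print(qsn_encode(mu, 2))  # '\u{3bc}\xff'
```

With strategy 3, the same input yields the literal character followed by `\xff`. In all three cases the invalid byte `\xff` falls back to a byte escape, so encoding cannot fail.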

### Which Bytes Should Be Hex-Escaped?

The reference implementation has two functions:

- `IsUnprintableLow`: any byte below an ASCII space `' '` is escaped
- `IsUnprintableHigh`: the byte `\x7f` and all bytes above are escaped, unless
  they're part of a valid UTF-8 sequence.

In theory, only escapes like `\'` `\n` `\\` are strictly necessary, and no
bytes need to be hex-escaped. But that strategy would defeat the purpose of
QSN for many applications, like printing filenames in a terminal.

### List of Syntax Errors

QSN decoders must enforce (at least) these syntax errors:

- Literal newline or tab in a string. Should be `\t` or `\n`. (The lack of
  literal tabs and newlines is essential for [QTT][].)
- Invalid character escape, e.g. `\z`
- Invalid hex escape, e.g. `\xgg`
- Invalid unicode escape, e.g. `\u{123` (incomplete)

Separate messages aren't required for each error; the only requirement is that
decoders reject these sequences.
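
A decoder that rejects these sequences can be sketched with a single regular expression. This is a hypothetical illustration, not Oil's lexer: the name `qsn_decode` is made up, it reports every error with one generic `ValueError`, and it assumes the escape set listed under "Full Spec" with one to six hex digits inside `\u{...}`. Literal bytes other than backslash, single quote, tab and newline are passed through unchanged.

```python
import re

# Hypothetical sketch; not taken from Oil's lexer or qsn_native.py.
SIMPLE = {b"t": b"\t", b"r": b"\r", b"n": b"\n", b"'": b"'",
          b'"': b'"', b"\\": b"\\", b"0": b"\x00"}

TOKEN = re.compile(rb"""
    \\x([0-9a-fA-F]{2})          # byte escape
  | \\u\{([0-9a-fA-F]{1,6})\}    # code point escape
  | \\([trn'"\\0])               # special escape
  | ([^\\'\t\n]+)                # run of literal bytes
""", re.VERBOSE)


def qsn_decode(s: bytes) -> bytes:
    if len(s) < 2 or s[:1] != b"'" or s[-1:] != b"'":
        raise ValueError("expected a single-quoted string")
    body = s[1:-1]
    out = bytearray()
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            # Literal tab or newline, unescaped quote, or a bad escape.
            raise ValueError("invalid QSN at byte offset %d" % (pos + 1))
        hex_byte, code_point, special, literal = m.groups()
        if hex_byte is not None:
            out.append(int(hex_byte, 16))
        elif code_point is not None:
            # chr() and encode() raise ValueError subclasses for values
            # above U+10FFFF and for surrogates.
            out += chr(int(code_point, 16)).encode("utf-8")
        elif special is not None:
            out += SIMPLE[special]
        else:
            out += literal
        pos = m.end()
    return bytes(out)


print(qsn_decode(rb"'bob\t1.0\n'"))  # b'bob\t1.0\n'
```

Because every alternative consumes at least one byte and anything unmatched raises, the four error cases above are all rejected without dedicated checks.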

## Reference Implementation in Oil

- Oil's **encoder** is in [qsn_/qsn.py]($oils-src), including the state machine
  for the UTF-8 strategies.
- The **decoder** has a lexer in [frontend/lexer_def.py]($oils-src), and a
  "parser" / validator in [qsn_/qsn_native.py]($oils-src). (Note that QSN is a
  [regular language]($xref:regular-language)).

The encoder has options to emit shell-compatible strings, which you probably
**don't need**. That is, C-escaped strings in bash look `$'like this\n'`.

A **subset** of QSN is compatible with this syntax. Example:

    $'\x01\n' # A valid bash string. Removing $ makes it valid QSN.

Something like `$'\0065'` is never emitted, because QSN doesn't contain octal
escapes. It can be encoded with hex or character escapes.

## Appendices

### Design Notes

The general idea: Rust string literals are like C and JavaScript string
literals, without cruft like octal (`\755` or `\0755` &mdash; which is it?) and
vertical tabs (`\v`).

Comparison with shell strings:

- `'Single quoted strings'` in shell can't represent arbitrary byte strings.
- `$'C-style shell strings\n'` are similar to QSN, but have cruft like
  octal and `\v`.
- `"Double quoted strings"` have unneeded features like `$var` and `$(command
  sub)`.

Comparison with Python's `repr()`:

- A single quote in Python is `"'"`, whereas it's `'\''` in QSN
- Python has both `\uxxxx` and `\Uxxxxxxxx`, whereas QSN has the more natural
  `\u{xxxxxx}`.

### Related Links

- [GNU Coreutils - Quoting File names][coreutils-quotes]. *Starting with GNU
  coreutils version 8.25 (released Jan. 2016), ls's default output quotes
  filenames with special characters*
- [In-band signaling][in-band] is the fundamental problem with filenames and
  terminals. Code (control codes) and data are intermingled.
- [QTT][] is a cleanup of CSV/TSV, built on top of QSN.

[coreutils-quotes]: https://www.gnu.org/software/coreutils/quotes.html

[in-band]: https://en.wikipedia.org/wiki/In-band_signaling

### `set -x` example

When arguments don't have any spaces, there's no ambiguity:

    $ set -x
    $ echo two args
    + echo two args

Here we need quotes to show that the `argv` array has 3 elements:

    $ set -x
    $ x='a b'
    $ echo "$x" c
    + echo 'a b' c

And we want the trace to fit on a single line, so we print a QSN string with
`\n`:

    $ set -x
    $ x=$'a\nb'
    $ echo "$x" c
    + echo $'a\nb' c

Here's an example with unprintable characters:

    $ set -x
    $ x=$'\e\001'
    $ echo "$x"
    + echo $'\x1b\x01'