| 1 | QSN: A Familiar String Interchange Format
 | 
| 2 | =========================================
 | 
| 3 | 
 | 
| 4 | <style>
 | 
| 5 | .q {
 | 
| 6 |   color: darkred;
 | 
| 7 | }
 | 
| 8 | .comment {
 | 
| 9 |   color: green;
 | 
| 10 |   font-style: italic;
 | 
| 11 | }
 | 
| 12 | .terminal {
 | 
| 13 |   color: darkred;
 | 
| 14 |   font-family: monospace;
 | 
| 15 | }
 | 
| 16 | .an {
 | 
| 17 |   color: darkgreen;
 | 
| 18 | }
 | 
| 19 | 
 | 
| 20 | .attention {
 | 
| 21 |   font-size: x-large;
 | 
| 22 |   background-color: #DEE;
 | 
| 23 |   margin-left: 1em;
 | 
| 24 |   margin-right: 1em;
 | 
| 25 |   padding-left: 1em;
 | 
| 26 |   padding-right: 1em;
 | 
| 27 | }
 | 
| 28 | </style>
 | 
| 29 | 
 | 
| 30 |  
 | 
| 31 | 
 | 
| 32 |  
 | 
| 33 | 
 | 
| 34 | <div class=attention>
 | 
| 35 | 
 | 
| 36 |  
 | 
| 37 | 
 | 
| 38 | As of January 2024, QSN has been replaced by [J8 Notation](j8-notation.html).
 | 
| 39 | They're very similar, but J8 Notation is more "harmonized" with JSON.
 | 
| 40 | 
 | 
| 41 |  
 | 
| 42 | 
 | 
| 43 | </div>
 | 
| 44 | 
 | 
| 45 |  
 | 
| 46 | 
 | 
| 47 |  
 | 
| 48 | 
 | 
| 49 |  
 | 
| 50 | 
 | 
| 51 |  
 | 
| 52 | 
 | 
| 53 |  
 | 
| 54 | 
 | 
| 55 | QSN ("quoted string notation") is a data format for **byte strings**.
 | 
| 56 | Examples:
 | 
| 57 | 
 | 
| 58 | <pre>
 | 
| 59 | ''                           <span class=comment># empty string</span>
 | 
| 60 | 'my favorite song.mp3'
 | 
| 61 | 'bob<span class=q>\t</span>1.0<span class=q>\n</span>carol<span class=q>\t</span>2.0<span class=q>\n</span>'     <span class=comment># tabs and newlines</span>
 | 
| 62 | 'BEL = <span class=q>\x07</span>'                 <span class=comment># byte escape</span>
 | 
| 63 | 'mu = <span class=q>\u{03bc}</span>'              <span class=comment># Unicode char escape</span>
 | 
| 64 | 'mu = μ'                     <span class=comment># represented literally, not escaped</span>
 | 
| 65 | </pre>
 | 
| 66 | 
 | 
| 67 | It's an adaptation of Rust's string literal syntax with a few use cases:
 | 
| 68 | 
 | 
| 69 | - To print filenames to a terminal.  Printing arbitrary bytes to a
 | 
| 70 |   terminal is bad, so programs like [coreutils]($xref) already have [informal
 | 
| 71 |   QSN-like formats][coreutils-quotes].
 | 
| 72 | - To exchange data between different programs, like [JSON][] or UTF-8.  Note
 | 
| 73 |   that JSON can't express arbitrary byte strings.
 | 
| 74 | - To solve the "[framing problem](framing.html)" over pipes.  QSN represents
 | 
| 75 |   newlines as `\n`, so literal newlines can be used to delimit records.
 | 
| 76 |   
 | 
| 77 | Oil uses QSN because it's well-defined and parsable.  It's both human- and
 | 
| 78 | machine-readable.
 | 
| 79 | 
 | 
| 80 | Any programming language or tool that understands JSON should also understand
 | 
| 81 | QSN.
 | 
| 82 | 
 | 
| 83 | [JSON]: https://json.org
 | 
| 84 | 
 | 
| 85 | <div id="toc">
 | 
| 86 | </div>
 | 
| 87 | 
 | 
| 88 | <!--
 | 
| 89 | ### The Terminal Use Case
 | 
| 90 | 
 | 
| 91 | Filenames may contain arbitrary bytes, including ones that will <span
 | 
| 92 | class=terminal>change your terminal color</span>, and more.  Most command line
 | 
| 93 | programs need something like QSN, or they'll have subtle bugs.
 | 
| 94 | 
 | 
| 95 | For example, as of 2016, [coreutils quotes funny filenames][coreutils] to avoid
 | 
| 96 | the same problem.  However, they didn't specify the format so it can be parsed.
 | 
| 97 | In contrast, QSN can be parsed and printed like JSON.
 | 
| 98 | 
 | 
| 99 | -->
 | 
| 100 | 
 | 
| 101 | <!--
 | 
| 102 | The quoting only happens when `isatty()`, so it's not really meant
 | 
| 103 | to be parsed.
 | 
| 104 | -->
 | 
| 105 | 
 | 
| 106 | ## Important Properties
 | 
| 107 | 
 | 
| 108 | - QSN can represent **any byte sequence**.
 | 
| 109 | - Given a QSN-encoded string, any two decoders must produce the same byte string.
 | 
| 110 |   (On the other hand, encoders have flexibility with regard to escaping.)
 | 
| 111 | - An encoded string always fits on a **single line**.  Newlines must be encoded as
 | 
| 112 |   `\n`, not literal.
 | 
| 113 | - An encoded string always fits in a **TSV cell**.  Tabs must be encoded as `\t`,
 | 
| 114 |   not literal.
 | 
| 115 | - An encoded string can itself be **valid UTF-8**.
 | 
| 116 |   - Example: `'μ \xff'` is valid UTF-8, even though the decoded string is not.
 | 
| 117 | - An encoded string can itself be **valid ASCII**.
 | 
| 118 |   - Example: `'\xce\xbc'` is valid ASCII, even though the decoded string is
 | 
| 119 |     not.
 | 
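The last two properties can be checked mechanically. The following is a minimal sketch in Python; the variable names are illustrative only and aren't part of QSN or Oil.

```python
# The encoded form 'μ \xff' (with a literal backslash, x, f, f) is valid UTF-8.
encoded = "'μ \\xff'"
assert encoded.encode('utf-8').decode('utf-8') == encoded

# The decoded byte string ends in the byte 0xff, which is never valid UTF-8.
decoded = 'μ '.encode('utf-8') + b'\xff'
try:
    decoded.decode('utf-8')
    decoded_is_utf8 = True
except UnicodeDecodeError:
    decoded_is_utf8 = False

# The encoded form '\xce\xbc' uses only ASCII characters ...
ascii_encoded = "'\\xce\\xbc'"
encoded_is_ascii = ascii_encoded.isascii()

# ... but it stands for the two bytes of UTF-8 encoded μ, which are not ASCII.
ascii_decoded = b'\xce\xbc'
decoded_is_ascii = ascii_decoded.isascii()
```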
| 120 | 
 | 
| 121 | ## More QSN Use Cases
 | 
| 122 | 
 | 
| 123 | - To pack arbitrary bytes on a **single line**, e.g. for line-based tools like
 | 
| 124 |   [grep]($xref), [awk]($xref), and [xargs]($xref).  QSN strings never contain
 | 
| 125 |   literal newlines or tabs.
 | 
| 126 | - For `set -x` in shell.  Like filenames, Unix `argv` arrays may contain
 | 
| 127 |   arbitrary bytes.  There's an example in the appendix.
 | 
| 128 |   - `ps` has to display untrusted `argv` arrays.
 | 
| 129 |   - `ls` has to display untrusted filenames.
 | 
| 130 |   - `env` has to display untrusted byte strings.  (Most versions of `env` don't
 | 
| 131 |     handle newlines well.)
 | 
| 132 | - As a building block for larger specifications, like [QTT][].
 | 
| 133 | - To transmit arbitrary bytes over channels that can only represent ASCII or
 | 
| 134 |   UTF-8 (e.g. e-mail, Twitter).
 | 
| 135 | 
 | 
| 136 | [QTT]: qtt.html
 | 
| 137 | [surrogate pairs]: https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
 | 
| 138 | [coreutils]: https://www.gnu.org/software/coreutils/quotes.html
 | 
| 139 | 
 | 
| 140 | ## Specification
 | 
| 141 | 
 | 
| 142 | ### A Short Description
 | 
| 143 | 
 | 
| 144 | 1. Start with [Rust String Literal Syntax](https://doc.rust-lang.org/reference/tokens.html#string-literals).
 | 
| 145 | 2. Use **single quotes** instead of double quotes to surround the string.  This
 | 
| 146 |    is mainly to avoid confusion with JSON.
 | 
| 147 | 
 | 
| 148 | ### An Analogy
 | 
| 149 | 
 | 
| 150 | <pre>
 | 
| 151 | 
 | 
| 152 |      <span class=an>JavaScript Object Literals</span>   are to    <span class=an>JSON</span>
 | 
| 153 | as   <span class=an>Rust String Literals</span>         are to    <span class=an>QSN</span>
 | 
| 154 | 
 | 
| 155 | </pre>
 | 
| 156 | 
 | 
| 157 | But QSN is **not** tied to either Rust or shell, just like JSON isn't tied to
 | 
| 158 | JavaScript.
 | 
| 159 | 
 | 
| 160 | It's a **language-independent format** like UTF-8 or HTML.  We're only
 | 
| 161 | borrowing a design, so that it's well-specified and familiar.
 | 
| 162 | 
 | 
| 163 | ### Full Spec
 | 
| 164 | 
 | 
| 165 | TODO: The short description above should be sufficient, but we might want to
 | 
| 166 | write it out.
 | 
| 167 | 
 | 
| 168 | - Special escapes:
 | 
| 169 |   - `\t` `\r` `\n`
 | 
| 170 |   - `\'` `\"`
 | 
| 171 |   - `\\`
 | 
| 172 |   - `\0`
 | 
| 173 | - Byte escapes: `\x7F`
 | 
| 174 | - Character escapes: `\u{03bc}` or `\u{0003bc}`.  These are encoded as UTF-8.
 | 
| 175 | 
 | 
| 176 | ## Advantages Over JSON Strings
 | 
| 177 | 
 | 
| 178 | - QSN can represent any byte string, like `'\x00\xff\x00'`.  JSON can't
 | 
| 179 |   represent **binary data** directly.
 | 
| 180 | - QSN can represent any code point, like `'\u{01f600}'` for 😀.  JSON
 | 
| 181 |   needs awkward [surrogate pairs][] to represent this code point.
 | 
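As a rough illustration of both points, here is what Python's standard `json` module does with the same data. This shows one JSON library's behavior, not a statement about every JSON implementation, and the variable names are illustrative.

```python
import json

# By default, json.dumps() escapes non-ASCII characters.  U+1F600 lies outside
# the Basic Multilingual Plane, so it becomes two \uXXXX escapes, i.e. a
# UTF-16 surrogate pair.
json_encoded = json.dumps('\U0001f600')
print(json_encoded)  # "\ud83d\ude00"

# The QSN spelling names the code point directly.
qsn_encoded = "'\\u{01f600}'"
print(qsn_encoded)   # '\u{01f600}'

# Python's json module refuses raw bytes outright, because JSON strings are
# sequences of code points rather than bytes.
try:
    json.dumps(b'\x00\xff\x00')
    bytes_accepted = True
except TypeError:
    bytes_accepted = False
```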
| 182 | 
 | 
| 183 | ## Implementation Issues
 | 
| 184 | 
 | 
| 185 | ### How Does a QSN Encoder Deal with Unicode?
 | 
| 186 | 
 | 
| 187 | The input to a QSN encoder is a raw **byte string**.  However, the string may
 | 
| 188 | have additional structure, like being UTF-8 encoded.
 | 
| 189 | 
 | 
| 190 | The encoder has three options to deal with this structure:
 | 
| 191 | 
 | 
| 192 | 1. **Don't decode** UTF-8.  Walk through bytes one-by-one, showing unprintable
 | 
| 193 |    ones with escapes like `\xce\xbc`.  Never emit escapes like `\u{3bc}` or
 | 
| 194 |    literals like <code>μ</code>.  This option is OK for machines, but
 | 
| 195 |    isn't friendly to humans who can read Unicode characters.
 | 
| 196 | 
 | 
| 197 | Or **speculatively decode** UTF-8.  After decoding a valid UTF-8 sequence,
 | 
| 198 | there are two options:
 | 
| 199 | 
 | 
| 200 | 2. Show **escaped code points**, like `\u{3bc}`.  The encoded string is limited
 | 
| 201 |    to the ASCII subset, which is useful in some contexts.
 | 
| 202 | 
 | 
| 203 | 3. Show them **literally**, like <code>μ</code>.
 | 
| 204 | 
 | 
| 205 | QSN encoding should never fail; it should only fall back to byte escapes like
 | 
| 206 | `\xff`.  TODO: Show the state machine for detecting and decoding UTF-8.
 | 
| 207 | 
 | 
| 208 | Note: With strategies 2 and 3, the output shows whether the input was valid UTF-8: an escape of `\x80` or above marks a byte that isn't part of a valid UTF-8 sequence.
 | 
| 209 | 
 | 
| 210 | ### Which Bytes Should Be Hex-Escaped?
 | 
| 211 | 
 | 
| 212 | The reference implementation has two functions:
 | 
| 213 | 
 | 
| 214 | - `IsUnprintableLow`: any byte below an ASCII space `' '` is escaped
 | 
| 215 | - `IsUnprintableHigh`: the byte `\x7f` and all bytes above are escaped, unless
 | 
| 216 |   they're part of a valid UTF-8 sequence.
 | 
| 217 | 
 | 
| 218 | In theory, only escapes like `\'` `\n` `\\` are strictly necessary, and no
 | 
| 219 | bytes need to be hex-escaped.  But that strategy would defeat the purpose of
 | 
| 220 | QSN for many applications, like printing filenames in a terminal.
 | 
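The three Unicode strategies and the escaping rule above can be sketched together in Python. This is a minimal illustration, not Oil's `qsn.py`: the names `qsn_encode`, `utf8_len`, and the `unicode_mode` values are hypothetical, and Python's strict UTF-8 codec stands in for the state machine mentioned in the TODO.

```python
# Escapes that are always used, keyed by byte value.
SPECIAL = {0x5c: '\\\\', 0x27: "\\'", 0x0a: '\\n', 0x0d: '\\r', 0x09: '\\t'}


def is_unprintable_low(b: int) -> bool:
    return b < 0x20    # below ASCII space


def is_unprintable_high(b: int) -> bool:
    return b >= 0x7f   # DEL and everything above


def utf8_len(lead: int) -> int:
    """Length of the UTF-8 sequence a lead byte announces, or 0 if none."""
    if 0xc2 <= lead <= 0xdf:
        return 2
    if 0xe0 <= lead <= 0xef:
        return 3
    if 0xf0 <= lead <= 0xf4:
        return 4
    return 0


def qsn_encode(data: bytes, unicode_mode: str = 'literal') -> str:
    """unicode_mode: 'bytes' (strategy 1), 'escape' (2), or 'literal' (3)."""
    if unicode_mode not in ('bytes', 'escape', 'literal'):
        raise ValueError('unknown unicode_mode: %r' % unicode_mode)
    parts = ["'"]
    i = 0
    while i < len(data):
        b = data[i]
        if b in SPECIAL:
            parts.append(SPECIAL[b])
        elif is_unprintable_low(b):
            parts.append('\\x%02x' % b)
        elif not is_unprintable_high(b):
            parts.append(chr(b))          # printable ASCII
        else:
            # Speculatively decode one UTF-8 sequence (strategies 2 and 3).
            n = utf8_len(b) if unicode_mode != 'bytes' else 0
            ch = ''
            if n:
                try:
                    ch = data[i:i + n].decode('utf-8')
                except UnicodeDecodeError:
                    ch = ''               # invalid or truncated sequence
            if ch:
                if unicode_mode == 'literal':
                    parts.append(ch)
                else:
                    parts.append('\\u{%x}' % ord(ch))
                i += n
                continue
            parts.append('\\x%02x' % b)   # never fail: fall back to a byte escape
        i += 1
    parts.append("'")
    return ''.join(parts)


sample = b'mu = \xce\xbc \xff'
print(qsn_encode(sample, 'bytes'))    # 'mu = \xce\xbc \xff'
print(qsn_encode(sample, 'escape'))   # 'mu = \u{3bc} \xff'
print(qsn_encode(sample, 'literal'))  # 'mu = μ \xff'
```

In all three modes the stray `\xff` byte survives as a byte escape, so a decoder recovers the original bytes exactly.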
| 221 | 
 | 
| 222 | ### List of Syntax Errors
 | 
| 223 | 
 | 
| 224 | QSN decoders must reject (at least) these syntax errors:
 | 
| 225 | 
 | 
| 226 | - Literal newline or tab in a string.  Should be `\t` or `\n`.  (The lack of
 | 
| 227 |   literal tabs and newlines is essential for [QTT][].)
 | 
| 228 | - Invalid character escape, e.g. `\z`
 | 
| 229 | - Invalid hex escape, e.g. `\xgg`
 | 
| 230 | - Invalid unicode escape, e.g. `\u{123` (incomplete)
 | 
| 231 | 
 | 
| 232 | Separate messages aren't required for each error; the only requirement is that
 | 
| 233 | decoders reject these sequences.
 | 
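A decoder that accepts the escapes from the Full Spec section and rejects the errors listed above might look like the following Python sketch. It is an assumption-laden illustration rather than Oil's lexer-based decoder: `qsn_decode` is a hypothetical name, the input is taken as already-decoded text, and every error is reported as a plain `ValueError`.

```python
import re

SIMPLE_ESCAPES = {
    't': b'\t', 'r': b'\r', 'n': b'\n',
    "'": b"'", '"': b'"', '\\': b'\\', '0': b'\x00',
}

_HEX_RE = re.compile(r'[0-9a-fA-F]{2}')
_UNICODE_RE = re.compile(r'\{([0-9a-fA-F]{1,6})\}')


def qsn_decode(s: str) -> bytes:
    """Decode one single-quoted QSN string into the bytes it represents."""
    if len(s) < 2 or s[0] != "'" or s[-1] != "'":
        raise ValueError('QSN string must be surrounded by single quotes')
    body = s[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c in '\n\t':
            raise ValueError('literal newline or tab; use \\n or \\t')
        if c == "'":
            raise ValueError('unescaped single quote')
        if c != '\\':
            out += c.encode('utf-8')      # literal characters become UTF-8
            i += 1
            continue
        kind = body[i + 1:i + 2]          # '' if the backslash is the last char
        if kind in SIMPLE_ESCAPES:
            out += SIMPLE_ESCAPES[kind]
            i += 2
        elif kind == 'x':
            m = _HEX_RE.match(body, i + 2)
            if m is None:
                raise ValueError('invalid hex escape')
            out.append(int(m.group(), 16))
            i = m.end()
        elif kind == 'u':
            m = _UNICODE_RE.match(body, i + 2)
            if m is None:
                raise ValueError('invalid unicode escape')
            code = int(m.group(1), 16)
            if code > 0x10ffff or 0xd800 <= code <= 0xdfff:
                raise ValueError('code point out of range')
            out += chr(code).encode('utf-8')
            i = m.end()
        else:
            raise ValueError('invalid character escape')
    return bytes(out)


print(qsn_decode("'mu = \\u{03bc}'") == 'mu = μ'.encode('utf-8'))  # True
```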
| 234 | 
 | 
| 235 | ## Reference Implementation in Oil
 | 
| 236 | 
 | 
| 237 | - Oil's **encoder** is in [qsn_/qsn.py]($oils-src), including the state machine
 | 
| 238 |   for the UTF-8 strategies.
 | 
| 239 | - The **decoder** has a lexer in [frontend/lexer_def.py]($oils-src), and a
 | 
| 240 |   "parser" / validator in [qsn_/qsn_native.py]($oils-src).  (Note that QSN is a
 | 
| 241 |   [regular language]($xref:regular-language).)
 | 
| 242 | 
 | 
| 243 | The encoder has options to emit shell-compatible strings, which you probably
 | 
| 244 | **don't need**.  That is, C-escaped strings in bash look `$'like this\n'`.
 | 
| 245 | 
 | 
| 246 | A **subset** of QSN is compatible with this syntax.  Example:
 | 
| 247 | 
 | 
| 248 |     $'\x01\n'  # A valid bash string.  Removing $ makes it valid QSN.
 | 
| 249 | 
 | 
| 250 | Something like `$'\0065'` is never emitted, because QSN doesn't contain octal
 | 
| 251 | escapes.  The same character can be encoded with a hex or character escape instead.
 | 
| 252 | 
 | 
| 253 | ## Appendices
 | 
| 254 | 
 | 
| 255 | ### Design Notes
 | 
| 256 | 
 | 
| 257 | The general idea: Rust string literals are like C and JavaScript string
 | 
| 258 | literals, without cruft like octal (`\755` or `\0755` — which is it?) and
 | 
| 259 | vertical tabs (`\v`).
 | 
| 260 | 
 | 
| 261 | Comparison with shell strings:
 | 
| 262 | 
 | 
| 263 | - `'Single quoted strings'` in shell can't represent arbitrary byte strings.
 | 
| 264 | - `$'C-style shell strings\n'` are similar to QSN, but have cruft like
 | 
| 265 |   octal and `\v`.
 | 
| 266 | - `"Double quoted strings"` have unneeded features like `$var` and `$(command
 | 
| 267 |   sub)`.
 | 
| 268 | 
 | 
| 269 | Comparison with Python's `repr()`:
 | 
| 270 | 
 | 
| 271 | - A single quote in Python is `"'"`, whereas it's `'\''` in QSN.
 | 
| 272 | - Python has both `\uxxxx` and `\Uxxxxxxxx`, whereas QSN has the more natural
 | 
| 273 |   `\u{xxxxxx}`.
 | 
| 274 | 
 | 
| 275 | ### Related Links
 | 
| 276 | 
 | 
| 277 | - [GNU Coreutils - Quoting File names][coreutils-quotes].  *Starting with GNU
 | 
| 278 |   coreutils version 8.25 (released Jan. 2016), ls's default output quotes
 | 
| 279 |   filenames with special characters*
 | 
| 280 | - [In-band signaling][in-band] is the fundamental problem with filenames and
 | 
| 281 |   terminals.  Code (control codes) and data are intermingled.
 | 
| 282 | - [QTT][] is a cleanup of CSV/TSV, built on top of QSN.
 | 
| 283 | 
 | 
| 284 | [coreutils-quotes]: https://www.gnu.org/software/coreutils/quotes.html
 | 
| 285 | 
 | 
| 286 | [in-band]: https://en.wikipedia.org/wiki/In-band_signaling
 | 
| 287 | 
 | 
| 288 | 
 | 
| 289 | ### `set -x` example
 | 
| 290 | 
 | 
| 291 | When arguments don't have any spaces, there's no ambiguity:
 | 
| 292 | 
 | 
| 293 | 
 | 
| 294 |     $ set -x
 | 
| 295 |     $ echo two args
 | 
| 296 |     + echo two args
 | 
| 297 | 
 | 
| 298 | Here we need quotes to show that the `argv` array has 3 elements:
 | 
| 299 | 
 | 
| 300 |     $ set -x
 | 
| 301 |     $ x='a b'
 | 
| 302 |     $ echo "$x" c
 | 
| 303 |     + echo 'a b' c
 | 
| 304 | 
 | 
| 305 | And we want the trace to fit on a single line, so we print a QSN string with
 | 
| 306 | `\n`:
 | 
| 307 | 
 | 
| 308 |     $ set -x
 | 
| 309 |     $ x=$'a\nb'
 | 
| 310 |     $ echo "$x" c
 | 
| 311 |     + echo $'a\nb' c
 | 
| 312 | 
 | 
| 313 | Here's an example with unprintable characters:
 | 
| 314 | 
 | 
| 315 |     $ set -x
 | 
| 316 |     $ x=$'\e\001'
 | 
| 317 |     $ echo "$x"
 | 
| 318 |     + echo $'\x1b\x01'
 |