| 1 | ---
 | 
| 2 | default_highlighter: oils-sh
 | 
| 3 | ---
 | 
| 4 | 
 | 
| 5 | J8 Notation - Fixing the JSON-Unix Mismatch
 | 
| 6 | ===========
 | 
| 7 | 
 | 
| 8 | J8 Notation is a set of text interchange formats.  It's a syntax for:
 | 
| 9 | 
 | 
| 10 | 1. **strings** / bytes
 | 
| 11 | 1. tree-shaped **records** (like [JSON]($xref))
 | 
| 12 | 1. line-based **streams** (like Unix)
 | 
| 13 | 1. **tables** (like TSV)
 | 
| 14 | 
 | 
| 15 | It's part of the Oils project, and is intended to solve the *JSON-Unix
 | 
| 16 | Mismatch*: the Unix kernel deals with bytes, while JSON deals with Unicode
 | 
| 17 | strings (plus UTF-16 errors).
 | 
| 18 | 
 | 
| 19 | It's backward compatible with [JSON]($xref), and built on top of
 | 
| 20 | it.
 | 
| 21 | 
 | 
| 22 | But just like JSON isn't only for JavaScript, J8 Notation isn't only for Oils.
 | 
| 23 | Any language that understands JSON should also understand J8 Notation.
 | 
| 24 | 
 | 
| 25 | (Note: J8 replaced the similar [QSN](qsn.html) design in January
 | 
| 26 | 2024.  QSN wasn't as compatible with both JSON and YSH code.)
 | 
| 27 | 
 | 
| 28 | <div id="toc">
 | 
| 29 | </div>
 | 
| 30 | 
 | 
| 31 | ## Quick Picture
 | 
| 32 | 
 | 
| 33 | <style>
 | 
| 34 |   .uni4 {
 | 
| 35 |     /* color: #111; */
 | 
| 36 |   }
 | 
| 37 |   .dq {
 | 
| 38 |     color: darkred;
 | 
| 39 |   }
 | 
| 40 |   .sq {
 | 
| 41 |     color: #111;
 | 
| 42 |   }
 | 
| 43 | </style>
 | 
| 44 | 
 | 
| 45 | There are 3 styles of J8 strings:
 | 
| 46 | 
 | 
| 47 | <pre style="font-size: x-large;">
 | 
| 48 |  <span class=dq>"</span>hi 🙂 \u<span class=uni4>D83D</span>\u<span class=uni4>DE42</span><span class=dq>"</span>      <span class="sh-comment"># JSON-style, with surrogate pair</span>
 | 
| 49 | 
 | 
| 50 | <span class=sq>b'</span>hi 🙂 \yF0\y9F\y99\y82<span class=sq>'</span>  <span class="sh-comment"># Can be ANY bytes, including UTF-8</span>
 | 
| 51 | 
 | 
| 52 | <span class=sq>u'</span>hi 🙂 \u{1F642}<span class=sq>'</span>         <span class="sh-comment"># nice alternative syntax</span>
 | 
| 53 | </pre>
 | 
| 54 | 
 | 
| 55 | They all denote the same decoded string — "hi" and two `U+1F642` smiley
 | 
| 56 | faces:
 | 
| 57 | 
 | 
| 58 | <pre style="font-size: x-large;">
 | 
| 59 | hi 🙂 🙂
 | 
| 60 | </pre>
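
The equivalence can be checked in Python. This is only a sketch: Python has no built-in J8 decoder, so the `b''` and `u''` forms are rebuilt by hand from the standard `json` module, byte values, and `chr()`. The variable names are illustrative.

```python
import json

# JSON-style: a literal smiley, then the same code point as a surrogate pair
from_json = json.loads('"hi \U0001F642 \\ud83d\\ude42"')

# b''-style: \yF0\y9F\y99\y82 are the UTF-8 bytes of U+1F642
from_bytes = 'hi \U0001F642 '.encode('utf-8') + bytes([0xF0, 0x9F, 0x99, 0x82])

# u''-style: \u{1F642} is a code point escape
from_code_point = 'hi \U0001F642 ' + chr(0x1F642)

assert from_json == from_code_point == from_bytes.decode('utf-8')
```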
 | 
| 61 | 
 | 
| 62 | Why did we add these `u''` and `b''` strings?
 | 
| 63 | 
 | 
| 64 | - We want to represent any string that a Unix kernel can emit (`argv` arrays,
 | 
| 65 |   env variables, filenames, file contents, etc.)
 | 
| 66 |   - J8 encoders emit `b''` strings to avoid losing information.  
 | 
| 67 | - `u''` strings are like `b''` strings, but they can only express valid
 | 
| 68 |   Unicode strings.  
 | 
| 69 | 
 | 
| 70 | <!-- They can't express arbitrary binary data, and there's no such thing as a
 | 
| 71 | surrogate pair or half. -->
 | 
| 72 | 
 | 
| 73 | ---
 | 
| 74 | 
 | 
| 75 | Now, starting with J8 strings, we define the formats JSON8:
 | 
| 76 | 
 | 
| 77 |     { name: "Alice",
 | 
| 78 |       signature: b'\y01 ... \yff',  # binary data
 | 
| 79 |     }
 | 
| 80 | 
 | 
| 81 | J8 Lines:
 | 
| 82 | 
 | 
| 83 |       doc/hello.md
 | 
| 84 |      "doc/with spaces.md"
 | 
| 85 |     b'doc/with byte \yff.md'
 | 
| 86 | 
 | 
| 87 | and TSV8:
 | 
| 88 | 
 | 
| 89 |     !tsv8   size    name
 | 
| 90 |     !type   Int     Str
 | 
| 91 |             42        doc/hello.md
 | 
| 92 |             55       "doc/with spaces.md"
 | 
| 93 |             99      b'doc/with byte \yff.md'
 | 
| 94 | 
 | 
| 95 | Together, these are called *J8 Notation*.
 | 
| 96 | 
 | 
| 97 | (JSON8 and TSV8 are still to be fully implemented in Oils.)
 | 
| 98 | 
 | 
| 99 | ## Goals
 | 
| 100 | 
 | 
| 101 | 1. Fix the **JSON-Unix mismatch**: all text formats should be able to express
 | 
| 102 |    byte strings.
 | 
| 103 |    - But it's OK to use plain JSON in Oils, e.g. when filenames are known to be
 | 
| 104 |      strings.
 | 
| 105 | 1. Provide an option to avoid the surrogate pair / **UTF-16 legacy** of JSON.
 | 
| 106 | 1. Allow expressing metadata about **strings vs. bytes**.
 | 
| 107 | 1. Turn TSV into an **exterior** [data
 | 
| 108 |    frame](https://www.oilshell.org/blog/2018/11/30.html) format.
 | 
| 109 |    - Unix tools like `awk`, `cut`, and `sort` already understand tables
 | 
| 110 |      informally.
 | 
| 111 | 
 | 
| 112 | <!--
 | 
| 113 |    - TSV8 cells can represent arbitrary binary data, including tabs and
 | 
| 114 |      newlines.
 | 
| 115 | -->
 | 
| 116 | 
 | 
| 117 | Non-goals:
 | 
| 118 | 
 | 
| 119 | 1. "Replace" JSON.  JSON8 is backward compatible with JSON, and sometimes the
 | 
| 120 |    lossy encoding is OK.
 | 
| 121 | 1. Resolve the strings vs. bytes dilemma in all situations.
 | 
| 122 |    - Like JSON, our spec is **syntactic**.  We don't specify a mapping from J8
 | 
| 123 |      strings to interior data types in any particular language.
 | 
| 124 | 
 | 
| 125 | <!--
 | 
| 126 | ## J8 Notation in As Few Words As Possible
 | 
| 127 | 
 | 
| 128 | J8 Strings are a superset of JSON strings:
 | 
| 129 | 
 | 
| 130 | Only valid unicode:
 | 
| 131 | 
 | 
| 132 | <pre style="font-size: x-large;">
 | 
| 133 | u'hi 🤦 \u{1f926}'                  → hi 🤦 🤦
 | 
| 134 | </pre>
 | 
| 135 | 
 | 
| 136 | JSON: unicode + surrogate halves:
 | 
| 137 | 
 | 
| 138 | <pre style="font-size: x-large;">
 | 
| 139 |  "hi 🤦 \ud83e\udd26"               → hi 🤦 🤦
 | 
| 140 |  "\ud83e"
 | 
| 141 | </pre>
 | 
| 142 | 
 | 
| 143 | Any byte string:
 | 
| 144 | 
 | 
| 145 | <pre style="font-size: x-large;">
 | 
| 146 | b'hi 🤦 \u{1f926} \yf0\y9f\ya4\ya6' → hi 🤦 🤦 🤦
 | 
| 147 | b'\yff'
 | 
| 148 | </pre>
 | 
| 149 | 
 | 
| 150 | ## Structured Formats
 | 
| 151 | 
 | 
| 152 | ### JSON8
 | 
| 153 | 
 | 
| 154 | ### TSV8
 | 
| 155 | 
 | 
| 156 | 1. Required first row with column names
 | 
| 157 | 1. Optional second row with column types
 | 
| 158 | 1. Gutter Column
 | 
| 159 | 
 | 
| 160 | -->
 | 
| 161 | 
 | 
| 162 | ## Reference
 | 
| 163 | 
 | 
| 164 | See the [Data Notation Table of Contents](ref/toc-data.html) in the [Oils
 | 
| 165 | Reference](ref/index.html).
 | 
| 166 | 
 | 
| 167 | ### TODO / Diagrams
 | 
| 168 | 
 | 
| 169 | - Diagram of Evolution
 | 
| 170 |   - JSON strings → J8 Strings
 | 
| 171 |   - J8 strings as a building block → JSON8 and TSV8
 | 
| 172 | - Venn Diagrams of Data Language Relationships
 | 
| 173 |   - If you add the left "gutter" column, every TSV is valid TSV8.
 | 
| 174 |   - Every TSV8 is also syntactically valid TSV.  For example, you can import it
 | 
| 175 |     into a spreadsheet, and remove/ignore the gutter column and type row.
 | 
| 176 |   - TODO: make a screenshot and test it
 | 
| 177 | - Doc: How to turn a JSON library into a J8 Notation library.
 | 
| 178 |   - Issue: an interior type that can represent byte strings.
 | 
| 179 | 
 | 
| 180 | ## J8 Strings - Unicode and bytes
 | 
| 181 | 
 | 
| 182 | Let's review JSON strings, and then describe J8 strings.
 | 
| 183 | 
 | 
| 184 | ### Review of JSON strings
 | 
| 185 | 
 | 
| 186 | JSON strings are enclosed in double quotes, and may have these escape
 | 
| 187 | sequences:
 | 
| 188 | 
 | 
| 189 |     \"   \\   \/
 | 
| 190 |     \b   \f   \n   \r   \t
 | 
| 191 |     \u1234
 | 
| 192 | 
 | 
| 193 | Properties of JSON:
 | 
| 194 | 
 | 
| 195 | - The encoded form must also be valid UTF-8.
 | 
| 196 | - The encoded form can't contain literal control characters, including literal
 | 
| 197 |   tabs or newlines.  (This is good for TSV8, because it means a literal tab is
 | 
| 198 |   always a field separator.)
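
The second property can be observed with Python's standard `json` module, which is strict by default. This is a small illustration using a stock JSON decoder, not part of any J8 implementation.

```python
import json

# An escaped tab is fine: the encoded form contains a backslash and a 't'
assert json.loads('"a\\tb"') == 'a\tb'

# A literal tab inside the quotes is rejected by a strict JSON decoder
try:
    json.loads('"a\tb"')
    literal_tab_accepted = True
except json.JSONDecodeError:
    literal_tab_accepted = False

assert literal_tab_accepted is False
```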
 | 
| 199 | 
 | 
| 200 | ### J8 Description
 | 
| 201 | 
 | 
| 202 | There are 3 **styles** of J8 strings:
 | 
| 203 | 
 | 
| 204 | 1. JSON strings `j""`, which may be written `""`
 | 
| 205 | 1. `b''` strings
 | 
| 206 | 1. `u''` strings
 | 
| 207 | 
 | 
| 208 | `b''` strings have these escapes:
 | 
| 209 | 
 | 
| 210 |     \yff                # byte escape
 | 
| 211 |     \u{1f926}           # code point escape.  UTF-16 escapes like \u1234
 | 
| 212 |                         # are ILLEGAL
 | 
| 213 |     \'                  # single quote, in addition to \"
 | 
| 214 |     \"  \\  \/          # same as JSON
 | 
| 215 |     \b  \f  \n  \r  \t  
 | 
| 216 | 
 | 
| 217 | (JSON-style double-quoted strings do not add the `\'` escape.  Except for the optional
 | 
| 218 | `j` prefix, they remain the same.)
 | 
| 219 | 
 | 
| 220 | Examples:
 | 
| 221 | 
 | 
| 222 |     b''
 | 
| 223 |     b'hello'
 | 
| 224 |     b'\\'
 | 
| 225 |     b'"double" \'single\''
 | 
| 226 |     b'nul byte \y00, unicode \u{1f642}'
 | 
| 227 | 
 | 
| 228 | `u''` strings have all the same escapes, but **not** `\yff`.  This implies that
 | 
| 229 | they're always valid unicode strings.  (If JSON-style `\u1234` escapes were
 | 
| 230 | allowed, they wouldn't be.)
 | 
| 231 | 
 | 
| 232 | Examples:
 | 
| 233 | 
 | 
| 234 |     u''
 | 
| 235 |     u'hello'
 | 
| 236 |     u'unicode string \u{1f642}' 
 | 
| 237 | 
 | 
| 238 | A string *without* a prefix, like `'foo'`, is equivalent to `u'foo'`:
 | 
| 239 | 
 | 
| 240 |      'this is a u string'  # discouraged, unless the context is clear
 | 
| 241 | 
 | 
| 242 |     u'this is a u string'  # better to be explicit
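
To make the escape rules concrete, here is a minimal decoder sketch in Python. It is not the Oils implementation: `decode_j8` is a hypothetical helper, it only handles the single-quoted styles, and it assumes that `\u{...}` escapes naming surrogates or values above `U+10FFFF` are errors.

```python
import re

# Escapes shared with JSON, plus \' (each maps to its decoded bytes)
_SIMPLE = {
    "'": b"'", '"': b'"', '\\': b'\\', '/': b'/',
    'b': b'\b', 'f': b'\f', 'n': b'\n', 'r': b'\r', 't': b'\t',
}

# One token: \yff | \u{1f926} | other backslash escape | one literal character
_TOKEN = re.compile(
    r"\\y([0-9a-fA-F]{2})"
    r"|\\u\{([0-9a-fA-F]{1,6})\}"
    r"|\\(.)"
    r"|([^\\])",
    re.DOTALL,
)


def decode_j8(encoded: str) -> bytes:
    """Decode a b'' / u'' / '' string (hypothetical helper) to bytes."""
    m = re.fullmatch(r"([bu]?)'(.*)'", encoded, re.DOTALL)
    if m is None:
        raise ValueError('expected a single-quoted J8 string')
    prefix, body = m.group(1), m.group(2)

    out = bytearray()
    pos = 0
    while pos < len(body):
        tok = _TOKEN.match(body, pos)
        if tok is None:
            raise ValueError('dangling backslash')
        byte_hex, code_hex, simple, literal = tok.groups()
        if byte_hex is not None:
            if prefix != 'b':
                raise ValueError(r'\yff escapes are only legal in b strings')
            out.append(int(byte_hex, 16))
        elif code_hex is not None:
            code = int(code_hex, 16)
            if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError('not a Unicode scalar value')  # assumption
            out += chr(code).encode('utf-8')
        elif simple is not None:
            if simple not in _SIMPLE:
                # This also rejects JSON-style \u1234 escapes
                raise ValueError('unknown escape \\' + simple)
            out += _SIMPLE[simple]
        else:
            if literal == "'" or ord(literal) < 0x20:
                raise ValueError('unescaped quote or control character')
            out += literal.encode('utf-8')
        pos = tok.end()
    return bytes(out)
```

For example, `decode_j8(r"b'\yff\yfe'")` returns the two bytes `0xff 0xfe`, while `decode_j8(r"u'\yff'")` raises `ValueError`.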
 | 
| 243 | 
 | 
| 244 | ### What's representable by each style?
 | 
| 245 | 
 | 
| 246 | <style>
 | 
| 247 | #subset {
 | 
| 248 |     text-align: center;
 | 
| 249 |     background-color: #DEE;
 | 
| 250 |     padding-top: 0.5em; padding-bottom: 0.5em;
 | 
| 251 |     margin-left: 3em; margin-right: 3em;
 | 
| 252 | }
 | 
| 253 | .set {
 | 
| 254 |   font-size: x-large;     
 | 
| 255 | }
 | 
| 256 | </style>
 | 
| 257 | 
 | 
| 258 | These relationships might help you understand the 3 styles of strings:
 | 
| 259 | 
 | 
| 260 | <div id="subset">
 | 
| 261 | 
 | 
| 262 | <span class="set">Strings representable by `u''`</span><br/>
 | 
| 263 | = All Unicode Strings (no more and no less)
 | 
| 264 | 
 | 
| 265 | <b>⊂</b>
 | 
| 266 | 
 | 
| 267 | <span class="set">Strings representable by `""`</span> (JSON-style)<br/>
 | 
| 268 | = All Unicode Strings <b>∪</b> Surrogate Half Errors
 | 
| 269 | 
 | 
| 270 | <b>⊂</b>
 | 
| 271 | 
 | 
| 272 | <span class="set">Strings representable by `b''`</span><br/>
 | 
| 273 | = All Byte Strings
 | 
| 274 | 
 | 
| 275 | </div>
 | 
| 276 | 
 | 
| 277 | Examples:
 | 
| 278 | 
 | 
| 279 | - The JSON message `"\udd26"` represents a string that's not Unicode — it
 | 
| 280 |   has a surrogate half error.  This string is **not** representable with `u''`
 | 
| 281 |   strings.
 | 
| 282 | - The J8 message `b'\yff'` represents a byte string.  This string is **not**
 | 
| 283 |   representable with JSON strings or `u''` strings.
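
Both examples can be reproduced with Python's standard library. This is an illustrative sketch; it relies on the fact that Python's `json` module accepts lone surrogate escapes, while its UTF-8 codec rejects them.

```python
import json

# "\udd26" decodes to a lone surrogate, which is not a valid Unicode string
half = json.loads('"\\udd26"')
try:
    half.encode('utf-8')
    half_is_unicode = True
except UnicodeEncodeError:
    half_is_unicode = False

# The byte 0xff never appears in valid UTF-8, so b'\yff' has no string form
try:
    b'\xff'.decode('utf-8')
    ff_is_text = True
except UnicodeDecodeError:
    ff_is_text = False

assert (half_is_unicode, ff_is_text) == (False, False)
```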
 | 
| 284 | 
 | 
| 285 | ### Asymmetry of Encoders and Decoders
 | 
| 286 | 
 | 
| 287 | A few things to notice about J8 **encoders**:
 | 
| 288 | 
 | 
| 289 | 1. They can emit only `""` strings, possibly using the Unicode replacement char
 | 
| 290 |    `U+FFFD`.  This is a strict JSON encoder.
 | 
| 291 | 1. They *must* emit `b''` strings to preserve all information, because `U+FFFD`
 | 
| 292 |    replacement is lossy.
 | 
| 293 | 1. They *never* need to emit `u''` strings.
 | 
| 294 |    - This is because `""` strings (and `b''` strings) can represent all values
 | 
| 295 |      that `u''` strings can.  Still, `u''` strings may be desirable in some
 | 
| 296 |      situations, like when you want `\u{1f642}` escapes, or to assert that a
 | 
| 297 |      value must be a valid Unicode string.
 | 
| 298 | 
 | 
| 299 | On the other hand, J8 **decoders** must accept all 3 kinds of strings.
 | 
| 300 | 
 | 
| 301 | ### YSH has 2 of the 3 styles
 | 
| 302 | 
 | 
| 303 | A nice property of YSH is that the `u''` and `b''` strings are valid code:
 | 
| 304 | 
 | 
| 305 |     echo u'hi \u{1f642}'  # u respected in YSH, but not OSH
 | 
| 306 | 
 | 
| 307 |     var myBytes = b'\yff\yfe'
 | 
| 308 | 
 | 
| 309 | This is useful for correct code generation, and simplifies the language.
 | 
| 310 | 
 | 
| 311 | But JSON-style strings aren't valid in YSH.  The two usages of double quotes
 | 
| 312 | can't really be reconciled, because JSON looks like `"line\n"` and shell looks
 | 
| 313 | like `"x = ${myvar}"`.
 | 
| 314 | 
 | 
| 315 | ### J8 Strings vs. POSIX Shell Strings
 | 
| 316 | 
 | 
| 317 | When the encoded form of a J8 string doesn't contain a **backslash**, it's
 | 
| 318 | identical to a POSIX shell string.  
 | 
| 319 | 
 | 
| 320 | In this case, it can make sense to omit the `u''` prefix.  Example:
 | 
| 321 | 
 | 
| 322 | <pre>
 | 
| 323 | shell_string='hi 🙂'
 | 
| 324 | 
 | 
| 325 | var ysh_str = u'hi 🙂'
 | 
| 326 | 
 | 
| 327 | var ysh_str =  'hi 🙂'  <span class="sh-comment"># same thing</span>
 | 
| 328 | </pre>
 | 
| 329 | 
 | 
| 330 | An encoded J8 string has no backslashes when the original string has all these
 | 
| 331 | properties:
 | 
| 332 | 
 | 
| 333 | 1. Valid Unicode (no non-UTF-8 bytes).
 | 
| 334 | 1. No ASCII control characters.  All bytes are `0x20` and greater.
 | 
| 335 | 1. No backslashes or single quotes.  (All other required escapes are control
 | 
| 336 |    characters.)
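
The three properties translate directly into a small predicate. This Python sketch uses a hypothetical function name and takes the original string as bytes.

```python
def needs_no_backslash(data: bytes) -> bool:
    """True if the J8 encoding of these bytes would contain no backslashes."""
    try:
        text = data.decode('utf-8')             # 1. valid Unicode
    except UnicodeDecodeError:
        return False
    if any(ord(ch) < 0x20 for ch in text):      # 2. no ASCII control characters
        return False
    return '\\' not in text and "'" not in text  # 3. no backslash or quote
```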
 | 
| 337 | 
 | 
| 338 | 
 | 
| 339 | ## JSON8 - Tree-Shaped Records
 | 
| 340 | 
 | 
| 341 | Now that we've defined J8 strings, we can define JSON8, an obvious extension of
 | 
| 342 | JSON.
 | 
| 343 | 
 | 
| 344 | (Not implemented yet.)
 | 
| 345 | 
 | 
| 346 | ### Review of JSON
 | 
| 347 | 
 | 
| 348 | See <https://json.org>
 | 
| 349 | 
 | 
| 350 |     [primitive]     null   true   false
 | 
| 351 |     [number]        42  -1.2e-4
 | 
| 352 |     [string]        "hello\n"
 | 
| 353 |     [array]         [1, 2, 3]
 | 
| 354 |     [object]        {"key": 42}
 | 
| 355 | 
 | 
| 356 | ### JSON8 Description
 | 
| 357 | 
 | 
| 358 | JSON8 is like JSON, but:
 | 
| 359 | 
 | 
| 360 | 1. All strings can be J8 strings, in any of the **3 styles** described
 | 
| 361 |    above.
 | 
| 362 | 1. Object/Dict keys may be **unquoted**, like `{age: 42}`
 | 
| 363 |    - Unquoted keys must be a valid JS identifier name matching the pattern
 | 
| 364 |      `[a-zA-Z_][a-zA-Z0-9_]*`.
 | 
| 365 | 1. **Trailing commas** are allowed on objects and arrays: `{"d": 42,}` and `[42,]`
 | 
| 366 | 1. End-of-line comments.  We use `#` to be consistent with shell.
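
As an illustration of the unquoted-key rule, an encoder could decide per key whether quotes are needed. This Python sketch is an assumption about how one might apply the pattern; `format_key` is a hypothetical helper.

```python
import json
import re

# Pattern for unquoted keys, copied from the rule above
UNQUOTED_KEY = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')


def format_key(key: str) -> str:
    """Emit a key unquoted when allowed, else as a JSON-style string."""
    if UNQUOTED_KEY.fullmatch(key):
        return key
    return json.dumps(key, ensure_ascii=False)
```

Here `format_key('age')` returns `age`, while `format_key('first name')` returns `"first name"`.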
 | 
| 367 | 
 | 
| 368 | <!--
 | 
| 369 | Note that // is consistent with JavaScript / JSON5, but it actually conflicts
 | 
| 370 | with Scheme symbols, which we use for NIL8.  These are both valid Scheme, and
 | 
| 371 | probably NIL8:
 | 
| 372 | 
 | 
| 373 |     (/ 5 3)
 | 
| 374 |     (// 5 3)  # This should not start a comment!
 | 
| 375 | -->
 | 
| 376 | 
 | 
| 377 | Example:
 | 
| 378 | 
 | 
| 379 | ```
 | 
| 380 | { name: "Bob",  # comment
 | 
| 381 |   age: 30,
 | 
| 382 |   sig: b'\y00\y01 ... \yff',  # trailing comma, binary data
 | 
| 383 | }
 | 
| 384 | ```
 | 
| 385 | 
 | 
| 386 | <!--
 | 
| 387 | !json8  # optional prefix to distinguish from JSON
 | 
| 388 | 
 | 
| 389 | I think using unquoted keys is a good enough signal, or MIME type.
 | 
| 390 | 
 | 
| 391 | -->
 | 
| 392 | 
 | 
| 393 | ## J8 Lines - Lines of Text
 | 
| 394 | 
 | 
| 395 | *J8 Lines* is another format built on J8 strings.  Each line is either:
 | 
| 396 | 
 | 
| 397 | 1. An unquoted string, which must be valid UTF-8.  Whitespace is allowed, but
 | 
| 398 |    not other ASCII control chars.
 | 
| 399 | 2. A quoted J8 string (JSON style `""` or J8-style `b'' u''`)
 | 
| 400 | 3. An **ignored** empty line
 | 
| 401 | 
 | 
| 402 | In all cases, leading and trailing whitespace is ignored.
 | 
| 403 | 
 | 
| 404 | ---
 | 
| 405 | 
 | 
| 406 | For example, 6 strings with weird characters could be represented like this:
 | 
| 407 | 
 | 
| 408 |       dir/with spaces.txt       # unquoted string must be UTF-8
 | 
| 409 |      "dir/with newline \n.txt"  # JSON-style 
 | 
| 410 |     b'dir/with bytes \yff.txt'  # J8-style
 | 
| 411 |     u'dir/unicode \u{3bc}'
 | 
| 412 |                                 # ignored empty line
 | 
| 413 |      ''                         # empty string, not ignored
 | 
| 414 |      'dir/unicode \u{3bc}'      # no prefix implies u''
 | 
| 415 | 
 | 
| 416 | Note that J8 strings always occupy **one** physical line, because they can't
 | 
| 417 | contain unescaped control characters, including newlines.
 | 
| 418 | 
 | 
| 419 | *J8 Lines* can be viewed as a simpler case of TSV8, described in the next
 | 
| 420 | section.
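
The line rules can be sketched as a small classifier in Python. This is a simplified assumption, not the Oils reader: `classify_j8_lines` is a hypothetical name, it only decides which style each line uses without decoding it, and it does not treat `#` specially.

```python
from typing import Iterator, Tuple


def classify_j8_lines(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (style, encoded_text) for each non-empty line."""
    for line in text.split('\n'):
        line = line.strip(' \t\r')   # leading/trailing whitespace is ignored
        if not line:
            continue                 # ignored empty line
        if line.startswith('"'):
            yield 'json', line
        elif line.startswith("b'"):
            yield 'b', line
        elif line.startswith(("u'", "'")):
            yield 'u', line          # no prefix implies u''
        else:
            yield 'unquoted', line
```

Splitting on `'\n'` alone is enough here because a J8 string always occupies one physical line.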
 | 
| 421 | 
 | 
| 422 | <!--
 | 
| 423 | 
 | 
| 424 | TODO: show grammar, which disallows anything but significant tabs/newlines, and
 | 
| 425 | insignificant spaces)
 | 
| 426 | -->
 | 
| 427 | 
 | 
| 428 | #### Related
 | 
| 429 | 
 | 
| 430 | - <https://jsonlines.org/> allows not just strings, but any value like `{}` and
 | 
| 431 |   `[]`.  We could define an obvious "JSON8 Lines" format, which is different
 | 
| 432 |   than "J8 Lines".
 | 
| 433 | 
 | 
| 434 | ## TSV8 - Table-Shaped Text
 | 
| 435 | 
 | 
| 436 | Let's review TSV, and then describe TSV8.
 | 
| 437 | 
 | 
| 438 | ### Review of TSV
 | 
| 439 | 
 | 
| 440 | TSV has a very short specification:
 | 
| 441 | 
 | 
| 442 | - <https://www.iana.org/assignments/media-types/text/tab-separated-values>
 | 
| 443 | 
 | 
| 444 | Example:
 | 
| 445 | 
 | 
| 446 | ```
 | 
| 447 | name<TAB>age
 | 
| 448 | alice<TAB>44
 | 
| 449 | bob<TAB>33
 | 
| 450 | ```
 | 
| 451 | 
 | 
| 452 | Limitations:
 | 
| 453 | 
 | 
| 454 | - Fields can't contain tabs or newlines.
 | 
| 455 | - There's no escaping, so unprintable bytes in field values result in an
 | 
| 456 |   unprintable TSV file.
 | 
| 457 | - Spaces are easy to confuse with tabs.
 | 
| 458 | 
 | 
| 459 | ### TSV8 Description
 | 
| 460 | 
 | 
| 461 | TSV8 is like TSV with:
 | 
| 462 | 
 | 
| 463 | 1. A `!tsv8` prefix and required column names.
 | 
| 464 | 2. An optional `!type` line, with types `Bool Int Float Str`.
 | 
| 465 | 3. Other optional column attributes.
 | 
| 466 | 4. Rows of data, each starting with an empty "gutter" column.
 | 
| 467 | 
 | 
| 468 | Example:
 | 
| 469 | 
 | 
| 470 | ```
 | 
| 471 | !tsv8   age     name    
 | 
| 472 | !type   Int     Str     # optional types
 | 
| 473 | !other  x       y       # more column metadata
 | 
| 474 |         44        alice
 | 
| 475 |         33        bob
 | 
| 476 |          1       "a\tb"
 | 
| 477 |          2      b'nul \y00'
 | 
| 478 |          3      u'unicode \u{3bc}'
 | 
| 479 | ```
 | 
| 480 | 
 | 
| 481 | Types:
 | 
| 482 | 
 | 
| 483 | ```
 | 
| 484 | [Bool]      false   true
 | 
| 485 | [Int]       JSON numbers, restricted to [0-9]+
 | 
| 486 | [Float]     same as JSON
 | 
| 487 | [Str]       J8 string (any of the 3 styles)
 | 
| 488 | ```
 | 
| 489 | 
 | 
| 490 | Rules for cells:
 | 
| 491 | 
 | 
| 492 | 1. They can be any of 4 forms in J8 Lines:
 | 
| 493 |    1. Unquoted
 | 
| 494 |    1. JSON-style `""`
 | 
| 495 |    1. `u''`
 | 
| 496 |    1. `b''`
 | 
| 497 | 1. Leading and trailing whitespace must be stripped, as in J8 Lines.
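
The row structure can be sketched in Python as follows. Since TSV8 is not fully implemented yet, this is only an assumption about how a reader might be organized: `read_tsv8` is a hypothetical name, cells are left in their encoded form, and trailing `#` comments like those in the example are not handled.

```python
from typing import Dict, List, Tuple


def read_tsv8(text: str) -> Tuple[List[str], Dict[str, List[str]], List[List[str]]]:
    """Split TSV8 text into column names, '!' attribute rows, and data rows."""
    names: List[str] = []
    attrs: Dict[str, List[str]] = {}
    rows: List[List[str]] = []
    seen_header = False
    for line in text.split('\n'):
        if not line.strip():
            continue
        # A literal tab is always a field separator; spaces are padding
        gutter, *cells = [cell.strip(' ') for cell in line.split('\t')]
        if not seen_header:
            if gutter != '!tsv8':
                raise ValueError('expected a !tsv8 header line')
            names = cells
            seen_header = True
        elif gutter.startswith('!'):
            attrs[gutter[1:]] = cells      # e.g. attrs['type'] = ['Int', 'Str']
        elif gutter == '':
            rows.append(cells)             # cells are still encoded
        else:
            raise ValueError('data rows must start with an empty gutter column')
    return names, attrs, rows
```

Each `Str` cell returned in `rows` would then be decoded as unquoted text or as one of the 3 string styles.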
 | 
| 498 | 
 | 
| 499 | TODO: What about empty cells?  Are they equivalent to `null`?  TSV apparently
 | 
| 500 | can't have empty cells, as the rule is `[character]+`, not `[character]*`.
 | 
| 501 | 
 | 
| 502 | Column attributes:
 | 
| 503 | 
 | 
| 504 | - `!format` could be Instant / Duration?
 | 
| 505 | 
 | 
| 506 | ### Design Notes
 | 
| 507 | 
 | 
| 508 | TODO: This section will be filled in as we implement TSV8.
 | 
| 509 | 
 | 
| 510 | - Null Issues:
 | 
| 511 |   - Are bools nullable?  There seems to be no reason, but a value could be missing.
 | 
| 512 |   - Are ints nullable?  In SQL they probably are
 | 
| 513 |   - Are floats nullable?  Yes, like NA in R.
 | 
| 514 |   - Decoders can use a parallel typed column to indicate nulls?
 | 
| 515 | 
 | 
| 516 | - It's OK to use plain TSV in YSH programs as well.  You don't have to add
 | 
| 517 |   types if you don't want to.
 | 
| 518 | 
 | 
| 519 | 
 | 
| 520 | ## Summary
 | 
| 521 | 
 | 
| 522 | This document described an upgrade of JSON strings:
 | 
| 523 | 
 | 
| 524 | - J8 Strings (in 3 styles)
 | 
| 525 | 
 | 
| 526 | And data formats built on top of these strings:
 | 
| 527 | 
 | 
| 528 | - JSON8 - tree-shaped records
 | 
| 529 | - J8 Lines - Unix streams
 | 
| 530 | - TSV8 - table-shaped data
 | 
| 531 | 
 | 
| 532 | ## Appendix
 | 
| 533 | 
 | 
| 534 | ### Related Links
 | 
| 535 | 
 | 
| 536 | - <https://json.org/>
 | 
| 537 | - JSON extensions
 | 
| 538 |   - <https://json5.org/>
 | 
| 539 |   - [JSON with Commas and
 | 
| 540 |   Comments](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
 | 
| 541 |   - Survey: <https://github.com/json-next/awesome-json-next>
 | 
| 542 | 
 | 
| 543 | ### Future Work
 | 
| 544 | 
 | 
| 545 | We could have an SEXP8 format for:
 | 
| 546 | 
 | 
| 547 | - Concrete syntax trees, with location information
 | 
| 548 | - Textual IRs like WebAssembly
 | 
| 549 | 
 | 
| 550 | ## FAQ
 | 
| 551 | 
 | 
| 552 | ### Why are byte escapes spelled `\yff`, and not `\xff` as in C?
 | 
| 553 | 
 | 
| 554 | Because in JavaScript and Python, `\xff` is a **code point**, not a byte.  That
 | 
| 555 | is, it's a synonym for `\u00ff`, which is encoded in UTF-8 as the 2 bytes `0xc3
 | 
| 556 | 0xbf`.
 | 
| 557 | 
 | 
| 558 | This is **exactly** the confusion we want to avoid, so `\yff` is explicitly
 | 
| 559 | different.
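
The difference between a code point and a byte is easy to see in Python, where `\xff` means different things in a `str` literal and a `bytes` literal:

```python
# In a Python str, '\xff' is the code point U+00FF, the same as '\u00ff'
assert '\xff' == '\u00ff'

# Encoding that code point as UTF-8 gives 2 bytes, not the single byte 0xff
assert '\xff'.encode('utf-8') == b'\xc3\xbf'

# A single 0xff byte is a different value
assert b'\xff' != '\xff'.encode('utf-8')
```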
 | 
| 560 | 
 | 
| 561 | One of Chrome's JSON readers [also has this
 | 
| 562 | confusion](https://source.chromium.org/chromium/chromium/src/+/main:base/json/json_reader.h;l=27;drc=d0919138b7951c1a154cf802a68aad7904b6f4c9).
 | 
| 563 | 
 | 
| 564 | ### Why have both `u''` and `b''` strings, if only `b''` is technically needed?
 | 
| 565 | 
 | 
| 566 | A few reasons:
 | 
| 567 | 
 | 
| 568 | 1. Apps in languages like Python and Rust could make use of the distinction.
 | 
| 569 |    Oils doesn't have a string/bytes distinction (on the "interior"), but many
 | 
| 570 |    languages do.
 | 
| 571 | 1. Using `u''` strings can avoid hacks like
 | 
| 572 |    [WTF-8](http://simonsapin.github.io/wtf-8/), which is often required for
 | 
| 573 |    round-tripping arbitrary JSON messages.  Our `u''` strings don't require
 | 
| 574 |    WTF-8 because they can't represent surrogate halves.
 | 
| 575 | 1. `u''` strings add trivial weight to the spec, since compared to `b''`
 | 
| 576 |    strings, they simply remove `\yff`.  This is true because *encoded* J8 strings
 | 
| 577 |    must be valid UTF-8.
 | 
| 578 | 
 | 
| 579 | ### Why not use double quotes like `u""` and `b""`?
 | 
| 580 | 
 | 
| 581 | J8-style strings could have used double quotes.  But single quotes make the new
 | 
| 582 | styles more visually distinct from `""`, and they allow `''` as a synonym for
 | 
| 583 | `u''`.
 | 
| 584 | 
 | 
| 585 | Compared to `""` strings, `''` strings don't have a UTF-16 legacy.
 | 
| 586 | 
 | 
| 587 | ### How do I write a J8 encoder and decoder?
 | 
| 588 | 
 | 
| 589 | The list of errors at [ref/chap-errors.html](ref/chap-errors.html) may be a
 | 
| 590 | good starting point.
 | 
| 591 | 
 | 
| 592 | TODO: describe the Oils implementation.
 | 
| 593 | 
 | 
| 594 | ### Should a J8 number be mapped to an Int, Float, or Decimal type?
 | 
| 595 | 
 | 
| 596 | J8 Notation is like JSON: it only specifies the syntax of messages on the wire.
 | 
| 597 | 
 | 
| 598 | The mapping of text to types is left to implementers, and depends on the
 | 
| 599 | programming language:
 | 
| 600 | 
 | 
| 601 | - Languages like C, C++, and Rust have different sizes of ints and floats
 | 
| 602 | - Languages like JavaScript favor floats
 | 
| 603 | - It's valid to map to a Decimal type, if the language runtime supports it
 | 
| 604 | 
 | 
| 605 | OSH and YSH happen to use `Int` and `Float`, but this is logically separate
 | 
| 606 | from J8 Notation.
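
Python's standard `json` module shows how the same message can be mapped to different interior types. The example uses a plain JSON message and illustrates the principle only; it is not a J8 library.

```python
import json
from decimal import Decimal

message = '{"price": 0.1, "count": 42}'

# Default mapping: JSON numbers become Python float and int
as_float = json.loads(message)

# Alternative mapping: parse_float is a parameter of json.loads()
as_decimal = json.loads(message, parse_float=Decimal)

assert isinstance(as_float['price'], float)
assert as_decimal['price'] == Decimal('0.1')
assert as_decimal['count'] == 42
```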
 | 
| 607 | 
 | 
| 608 | ## Glossary
 | 
| 609 | 
 | 
| 610 | - **J8 Strings** - the building block for JSON8 and TSV8.  There are 3 similar
 | 
| 611 |   syntaxes: `"foo"` and `b'foo'` and `u'foo'`.
 | 
| 612 | - **JSON strings** - double quoted strings `"foo"`.
 | 
| 613 | - **J8-style strings** - either `b'foo'` or `u'foo'`.
 | 
| 614 | 
 | 
| 615 | Formats built on J8 strings:
 | 
| 616 | 
 | 
| 617 | - **J8 Lines** - unquoted and J8 strings, one per line.
 | 
| 618 | - **JSON8** - An upgrade of JSON.
 | 
| 619 | - **TSV8** - An upgrade of TSV.
 | 
| 620 | 
 | 
| 621 | 
 |