J8 Notation is a set of text interchange formats. It's a syntax for strings (J8 strings), records (JSON8), lines (J8 Lines), and tables (TSV8).
It's part of the Oils project, and is intended to solve the JSON-Unix Mismatch: the Unix kernel deals with bytes, while JSON deals with Unicode strings (plus UTF-16 errors).
It's backward compatible with JSON, and built on top of it.
But just like JSON isn't only for JavaScript, J8 Notation isn't only for Oils. Any language that understands JSON should also understand J8 Notation.
(Note: J8 replaced the similar QSN design in January 2024. QSN wasn't as compatible with both JSON and YSH code.)
There are 3 styles of J8 strings:
"hi 🙂 \uD83D\uDE42"          # JSON-style, with surrogate pair
b'hi 🙂 \yF0\y9F\y99\y82'    # Can be ANY bytes, including UTF-8
u'hi 🙂 \u{1F642}'           # nice alternative syntax
They all denote the same decoded string: "hi" followed by two U+1F642 smiley faces:
hi 🙂 🙂
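The equivalence of the three styles can be checked with a short Python sketch. Python has no J8 decoder in its standard library, so this is only an analogy: the JSON-style string goes through `json.loads`, while the `\yXX` and `\u{...}` escapes are rewritten by hand in Python's own spelling (`\xXX` and `\UXXXXXXXX`).

```python
import json

# JSON-style: the surrogate pair \uD83D\uDE42 decodes to the single code point U+1F642.
json_style = json.loads('"hi 🙂 \\uD83D\\uDE42"')

# b''-style: \yF0\y9F\y99\y82 are the 4 UTF-8 bytes of U+1F642 (Python writes \xNN).
byte_style = "hi 🙂 ".encode("utf-8") + b"\xf0\x9f\x99\x82"

# u''-style: \u{1F642} is a code point escape (Python writes \U0001F642).
code_point_style = "hi 🙂 \U0001F642"

assert json_style == code_point_style == "hi 🙂 🙂"
assert byte_style.decode("utf-8") == code_point_style
```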
Why did we add these u'' and b'' strings?
Unix data consists of bytes (argv arrays, env variables, filenames, file contents, etc.), and JSON strings can't express arbitrary bytes. So we need b'' strings to avoid losing information.
u'' strings are like b'' strings, but they can only express valid Unicode strings.
Now, starting with J8 strings, we define the formats JSON8:
{ name: "Alice",
signature: b'\y01 ... \yff', # binary data
}
J8 Lines:
doc/hello.md
"doc/with spaces.md"
b'doc/with byte \yff.md'
and TSV8:
!tsv8 size name
!type Int Str
42 doc/hello.md
55 "doc/with spaces.md"
99 b'doc/with byte \yff.md'
Together, these are called J8 Notation.
(JSON8 and TSV8 are still to be fully implemented in Oils.)
awk, cut, and sort already understand tables informally.
Non-goals:
See the Data Notation Table of Contents in the Oils Reference.
Let's review JSON strings, and then describe J8 strings.
JSON strings are enclosed in double quotes, and may have these escape sequences:
\" \\ \/
\b \f \n \r \t
\u1234
Properties of JSON:
There are 3 styles of J8 strings: JSON-style j"" strings (which may be written ""), b'' strings, and u'' strings.
b'' strings have these escapes:
\yff # byte escape
\u{1f926} # code point escape. UTF-16 escapes like \u1234
# are ILLEGAL
\' # single quote, in addition to \"
\" \\ \/ # same as JSON
\b \f \n \r \t
(JSON-style double-quoted strings do not add the \' escape. Except for the optional j prefix, they remain the same.)
Examples:
b''
b'hello'
b'\\'
b'"double" \'single\''
b'nul byte \y00, unicode \u{1f642}'
u'' strings have all the same escapes, but not \yff. This implies that they're always valid Unicode strings. (If JSON-style \u1234 escapes were allowed, they wouldn't be.)
Examples:
u''
u'hello'
u'unicode string \u{1f642}'
A string without a prefix, like 'foo', is equivalent to u'foo':
'this is a u string' # discouraged, unless the context is clear
u'this is a u string' # better to be explicit
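The escape rules for single-quoted strings can be sketched as a small Python decoder. This is an illustration, not the Oils implementation: the name `decode_j8` is hypothetical, and two details are assumptions the text above does not state (a limit of 6 hex digits in `\u{...}`, and rejecting surrogate code points there).

```python
import re

# Simple one-character escapes shared by b'' and u'' strings.
_SIMPLE = {'"': b'"', "'": b"'", "\\": b"\\", "/": b"/",
           "b": b"\b", "f": b"\f", "n": b"\n", "r": b"\r", "t": b"\t"}

# \yff byte escape, \u{...} code point escape, or any other escaped character.
_ESCAPE = re.compile(r"\\(?:y([0-9a-fA-F]{2})|u\{([0-9a-fA-F]{1,6})\}|(.))", re.DOTALL)


def decode_j8(text: str) -> bytes:
    """Decode b'...', u'...' or bare '...' into bytes (hypothetical helper)."""
    m = re.fullmatch(r"([bu]?)'(.*)'", text, re.DOTALL)
    if m is None:
        raise ValueError("not a single-quoted J8 string")
    prefix = m.group(1) or "u"  # no prefix is equivalent to u''
    body = m.group(2)
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "'" or ord(ch) < 0x20:
            raise ValueError("unescaped quote or control character")
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
            continue
        esc = _ESCAPE.match(body, i)
        if esc is None:
            raise ValueError("incomplete escape")
        byte_hex, code_hex, simple = esc.groups()
        if byte_hex is not None:
            if prefix != "b":
                raise ValueError("byte escapes are only legal in b'' strings")
            out.append(int(byte_hex, 16))
        elif code_hex is not None:
            code = int(code_hex, 16)
            # Assumption: surrogates and out-of-range values are rejected.
            if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError("invalid code point")
            out += chr(code).encode("utf-8")
        elif simple in _SIMPLE:
            out += _SIMPLE[simple]
        else:
            # This branch also rejects UTF-16 style escapes like \u1234.
            raise ValueError("unknown escape")
        i = esc.end()
    return bytes(out)
```

For example, `decode_j8(r"b'nul byte \y00'")` yields the bytes `nul byte ` followed by a NUL byte, while `decode_j8(r"u'\yff'")` raises `ValueError` because u'' strings have no byte escape.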
These relationships might help you understand the 3 styles of strings:
Strings representable by u'' = All Unicode Strings (no more and no less)
⊂
Strings representable by "" (JSON-style) = All Unicode Strings ∪ Surrogate Half Errors
⊂
Strings representable by b'' = All Byte Strings
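As a rough illustration of these subset relationships, the Python sketch below uses `str` and `bytes` as stand-ins. Python's `json` module happens to accept a lone surrogate escape, which makes the middle set visible. The helper names are hypothetical.

```python
import json


def is_valid_unicode(text: str) -> bool:
    """True if the str has no surrogate halves, i.e. it can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


smiley = "\U0001F642"                  # expressible in all 3 styles
half = json.loads('"\\udd26"')         # surrogate half: JSON-style and b'' only
raw = b"\xff"                          # arbitrary byte: b'' only

assert is_valid_unicode(smiley)
assert not is_valid_unicode(half)
assert not is_valid_utf8(raw)
```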
Examples:
"\udd26" represents a string that's not Unicode: it has a surrogate half error. This string is not representable with u'' strings.
b'\yff' represents a byte string. This string is not representable with JSON strings or u'' strings.
A few things to notice about J8 encoders:
They can emit only "" strings, possibly using the Unicode replacement char U+FFFD. This is a strict JSON encoder.
They can emit b'' strings to preserve all information, because U+FFFD replacement is lossy.
They never need to emit u'' strings, because "" strings (and b'' strings) can represent all values that u'' strings can. Still, u'' strings may be desirable in some situations, like when you want \u{1f642} escapes, or to assert that a value must be a valid Unicode string.
On the other hand, J8 decoders must accept all 3 kinds of strings.
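One possible encoder policy is sketched below in Python: emit a JSON-style string when the bytes are valid UTF-8, and fall back to a b'' string otherwise. The function names are hypothetical and this is not the Oils encoder. Writing control characters as `\yXX` inside b'' strings is a simplification chosen here; the shorter `\n`-style escapes would be equally valid.

```python
import json


def encode_j8(data: bytes) -> str:
    """Encode bytes as "" when they are valid UTF-8, else as b'' (hypothetical)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return _encode_b(data)
    # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes.
    return json.dumps(text, ensure_ascii=False)


def _encode_b(data: bytes) -> str:
    parts = []
    # surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN,
    # so valid UTF-8 sequences stay intact and stray bytes can be picked out.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            parts.append("\\y%02x" % (code - 0xDC00))
        elif code < 0x20:
            parts.append("\\y%02x" % code)
        elif ch in "'\\":
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "b'" + "".join(parts) + "'"
```

A strict JSON encoder would instead call `data.decode("utf-8", errors="replace")`, which substitutes U+FFFD for invalid bytes and therefore loses information.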
A nice property of YSH is that the u'' and b'' strings are valid code:
echo u'hi \u{1f642}' # u respected in YSH, but not OSH
var myBytes = b'\yff\yfe'
This is useful for correct code generation, and simplifies the language.
But JSON-style strings aren't valid in YSH. The two usages of double quotes can't really be reconciled, because JSON looks like "line\n" and shell looks like "x = ${myvar}".
When the encoded form of a J8 string doesn't contain a backslash, it's identical to a POSIX shell string.
In this case, it can make sense to omit the u'' prefix. Example:
shell_string='hi 🙂'
var ysh_str = u'hi 🙂'
var ysh_str = 'hi 🙂' # same thing
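An encoded J8 string has no backslashes when the original string has all these properties: it is valid Unicode (so no \yff escapes are needed), it contains no single quotes or backslashes, and it has no control characters (all bytes are 0x20 and greater).
Now that we've defined J8 strings, we can define JSON8, an obvious extension of JSON.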
(Not implemented yet.)
See https://json.org
[primitive] null true false
[number] 42 -1.2e-4
[string] "hello\n"
[array] [1, 2, 3]
[object] {"key": 42}
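JSON8 is like JSON, but:
Strings may be J8 strings, in any of the 3 styles.
Object keys may be unquoted, like {age: 42}. An unquoted key must match [a-zA-Z_][a-zA-Z0-9_]*.
Trailing commas are allowed, as in {"d": 42,} and [42,].
Comments start with #, to be consistent with shell.
Example: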
{ name: "Bob", # comment
age: 30,
sig: b'\y00\y01 ... \yff', # trailing comma, binary data
}
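To make the differences from JSON concrete, here is a hedged Python sketch that rewrites a subset of JSON8 into plain JSON: it quotes bare keys, drops `#` comments, and removes trailing commas. The function name is hypothetical, and the sketch deliberately does not handle b'' or u'' strings, so it is not a JSON8 parser.

```python
import json
import re

_TOKEN = re.compile(r'''
    (?P<string>"(?:[^"\\]|\\.)*")      # JSON-style string, kept as-is
  | (?P<comment>\#[^\n]*)              # comment to end of line, dropped
  | (?P<name>[a-zA-Z_][a-zA-Z0-9_]*)   # bare word: a key, or null/true/false
  | (?P<other>.)                       # any other single character
''', re.VERBOSE | re.DOTALL)


def json8_subset_to_json(text: str) -> str:
    """Rewrite a JSON8 subset (no b'' or u'' strings) as JSON (hypothetical)."""
    tokens = [(m.lastgroup, m.group()) for m in _TOKEN.finditer(text)
              if m.lastgroup != "comment"]
    out = []
    for i, (kind, value) in enumerate(tokens):
        # Look ahead to the next token that is not whitespace.
        following = next((v for _, v in tokens[i + 1:] if not v.isspace()), "")
        if kind == "name" and following == ":":
            out.append(json.dumps(value))      # quote the bare key
        elif value == "," and following in ("}", "]"):
            continue                           # drop the trailing comma
        else:
            out.append(value)
    return "".join(out)
```

The result can then be passed to `json.loads`. Because strings are matched as whole tokens, a `#` inside a double-quoted string is not mistaken for a comment.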
J8 Lines is another format built on J8 strings. Each line is either:
An unquoted string, which must be valid UTF-8
A quoted string (JSON-style "" or J8-style b'' u'')
An ignored empty line
In all cases, leading and trailing whitespace is ignored.
For example, 6 strings with weird characters could be represented like this:
dir/with spaces.txt # unquoted string must be UTF-8
"dir/with newline \n.txt" # JSON-style
b'dir/with bytes \yff.txt' # J8-style
u'dir/unicode \u{3bc}'
# ignored empty line
'' # empty string, not ignored
'dir/unicode \u{3bc}' # no prefix implies u''
Note that J8 strings always occupy one physical line, because they can't contain unescaped control characters, including newlines.
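A reader for this format might be sketched in Python as follows. The function name is hypothetical. JSON-style lines are decoded with the standard `json` module, while b'' and u'' lines are returned undecoded because the standard library has no J8 string decoder. Using `str.strip()` to define "whitespace" is an assumption, and the trailing `#` annotations in the example above are not treated as part of the format.

```python
import json


def read_j8_lines(text: str):
    """Yield (kind, value) pairs for each non-empty line (hypothetical)."""
    for raw in text.split("\n"):
        line = raw.strip()              # leading and trailing whitespace is ignored
        if not line:
            continue                    # empty lines are ignored
        if line.startswith('"'):
            yield ("json", json.loads(line))
        elif line.startswith(("b'", "u'", "'")):
            yield ("j8", line)          # left for a real J8 string decoder
        else:
            yield ("unquoted", line)
```

Because a quoted J8 string cannot contain an unescaped newline, splitting on newlines first is safe.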
J8 Lines can be viewed as a simpler case of TSV8, described in the next section.
Each line in J8 Lines is a string, not a record or array written with {} and []. We could define an obvious "JSON8 Lines" format, which is different from "J8 Lines".
Let's review TSV, and then describe TSV8.
TSV has a very short specification:
Example:
name<TAB>age
alice<TAB>44
bob<TAB>33
Limitations:
TSV8 is like TSV with:
A !tsv8 prefix and required column names.
An optional !type line, with types Bool Int Float Str.
Example:
!tsv8 age name
!type Int Str # optional types
!other x y # more column metadata
44 alice
33 bob
1 "a\tb"
2 b'nul \y00'
3 u'unicode \u{3bc}'
Types:
[Bool] false true
[Int] JSON numbers, restricted to [0-9]+
[Float] same as JSON
[Str] J8 string (any of the 3 styles)
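The type table can be read as a per-cell conversion rule. The Python sketch below is one interpretation, with a hypothetical function name. Since TSV8 is not fully specified yet, it covers only the four types listed and leaves b'' and u'' cells undecoded.

```python
import json
import re

# JSON number grammar, used for Float cells.
_JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")


def convert_cell(type_name: str, cell: str):
    """Convert one TSV8 cell according to its column type (hypothetical)."""
    if type_name == "Bool":
        if cell not in ("false", "true"):
            raise ValueError("bad Bool: %r" % cell)
        return cell == "true"
    if type_name == "Int":
        # Restricted to [0-9]+, so signs and exponents are rejected.
        if not re.fullmatch(r"[0-9]+", cell):
            raise ValueError("bad Int: %r" % cell)
        return int(cell)
    if type_name == "Float":
        if not _JSON_NUMBER.fullmatch(cell):
            raise ValueError("bad Float: %r" % cell)
        return float(cell)
    if type_name == "Str":
        if cell.startswith('"'):
            return json.loads(cell)     # JSON-style string
        return cell                     # unquoted, or b''/u'' left undecoded
    raise ValueError("unknown type: %r" % type_name)
```

For instance, `convert_cell("Int", "44")` returns `44`, and `convert_cell("Str", '"a\\tb"')` returns a 3-character string containing a tab.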
Rules for cells: a cell may be unquoted, or quoted in one of the 3 J8 styles: "", u'', or b''.
TODO: What about empty cells? Are they equivalent to null? TSV apparently can't have empty cells, as the rule is [character]+, not [character]*.
Column attributes:
!format could be Instant / Duration?
TODO: This section will be filled in as we implement TSV8.
Null Issues:
It's OK to use plain TSV in YSH programs as well. You don't have to add types if you don't want to.
This document described an upgrade of JSON strings: J8 strings, which add the b'' and u'' styles.
And data formats built on top of these strings: JSON8, J8 Lines, and TSV8.
We could have an SEXP8 format for:
Why \yff, and not \xff as in C?
Because in JavaScript and Python, \xff is a code point, not a byte. That is, it's a synonym for \u00ff, which is encoded in UTF-8 as the 2 bytes 0xc3 0xbf.
This is exactly the confusion we want to avoid, so \yff is explicitly different.
One of Chrome's JSON encoders also has this confusion.
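The difference is easy to see in Python, where the same `\xff` spelling means a code point inside a text string and a byte inside a bytes literal:

```python
text = "\xff"            # a str: one code point, U+00FF
raw = b"\xff"            # a bytes object: one byte, 0xff

assert text == "\u00ff"
assert text.encode("utf-8") == b"\xc3\xbf"   # 2 bytes once encoded as UTF-8
assert len(raw) == 1
assert raw != text.encode("utf-8")
```

A J8 `\yff` escape always corresponds to the second case.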
Why have both u'' and b'' strings, if only b'' is technically needed?
A few reasons:
u'' strings can avoid hacks like WTF-8, which is often required for round-tripping arbitrary JSON messages. Our u'' strings don't require WTF-8 because they can't represent surrogate halves.
u'' strings add trivial weight to the spec, since compared to b'' strings, they simply remove \yff. This is true because encoded J8 strings must be valid UTF-8.
Why not use double quotes, like u"" and b""?
J8-style strings could have used double quotes. But single quotes make the new styles more visually distinct from "", and they allow '' as a synonym for u''.
Compared to "" strings, '' strings don't have a UTF-16 legacy.
The list of errors at ref/chap-errors.html may be a good starting point.
TODO: describe the Oils implementation.
J8 Notation is like JSON: it only specifies the syntax of messages on the wire.
The mapping of text to types is left to implementers, and depends on the programming language:
OSH and YSH happen to use Int and Float, but this is logically separate from J8 Notation.
"foo" and b'foo' and u'foo'.
"foo".
b'foo' or u'foo'.
Formats built on J8 strings: