Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.
Oils is UTF-8 centric, unlike bash and other shells.
That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like Python or JavaScript. The former languages internally represent strings as UTF-8, while the latter use arrays of code points or UTF-16 code units.
Shell programs should be encoded in UTF-8 (or its ASCII subset). Unicode characters can be encoded directly in the source:
echo 'μ'
or denoted in ASCII with C-escaped strings:
echo $'\u03bc' # bash style
echo u'\u{3bc}' # YSH style
(Such strings are preferred over echo -e because they're statically parsed.)
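To confirm which bytes a string actually contains, you can dump them with od. The following is a minimal sketch, not something taken from the Oils codebase: hex_bytes is a hypothetical helper, and the string is built with POSIX octal escapes so the result doesn't depend on a shell's \u support or on the locale.

```shell
# hex_bytes is a hypothetical helper, not part of Oils or bash.
# It prints the bytes of its argument as lowercase hex digits.
hex_bytes() {
  printf '%s' "$1" | od -A n -v -t x1 | tr -d ' \n'
}

# U+03BC (μ) is encoded in UTF-8 as the two bytes 0xCE 0xBC,
# written here as the octal escapes \316 \274.
mu=$(printf '\316\274')

echo "$(hex_bytes "$mu")"   # cebc
```

A string written directly as 'μ' in a UTF-8 encoded script should produce the same two bytes.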
Strings in OSH are arbitrary sequences of bytes, which may be valid UTF-8. Details:
- Strings passed to external programs end at the first NUL ('\0') byte. This is a consequence of how Unix and C work.
- The length operator ${#s} and slicing ${s:1:3} require the string to be valid UTF-8. Decoding errors are fatal if shopt -s strict_word_eval is on.
These operations are currently implemented in Python, in osh/string_ops.py:
- ${#s} -- length in code points (buggy in bash)
  - Note: YSH len(s) returns a number of bytes, not code points.
- ${s:1:2} -- index and length are a number of code points
- ${x#glob?} and ${x##glob?} (see below)
More:
- ${foo,} and ${foo^} for lowercase / uppercase
- [[ a < b ]] and [ a '<' b ] for sorting
  - These could use strcoll()?
- printf '%d' \'c where c is an arbitrary character. This is an obscure syntax for ord(), i.e. getting an integer from an encoded character.
Globs have character classes [^a] and ?.
This pattern results in a glob() call:
echo my?glob
These patterns result in fnmatch() calls:
case $x in ?) echo 'one char' ;; esac
[[ $x == ? ]]
${s#?} # remove one character prefix, quadratic loop for globs
This uses our glob to ERE translator for position info:
echo ${s/?/x}
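What ? consumes depends on whether the matcher treats the input as bytes or as UTF-8. As a sketch of the byte-oriented case only, the following forces the C locale, where a case pattern sees μ as two separate units. The classify function is a hypothetical helper for illustration. Under a UTF-8 locale, a shell that decodes code points would be expected to report one unit for μ instead.

```shell
# classify is a hypothetical helper, not part of Oils or bash.
# It reports how many ? wildcards are needed to match its argument.
classify() {
  case $1 in
    ?)  echo 'one' ;;
    ??) echo 'two' ;;
    *)  echo 'other' ;;
  esac
}

# Force byte-oriented matching for this demonstration.
LC_ALL=C
export LC_ALL

# U+03BC (μ) as its two UTF-8 bytes, written as octal escapes.
mu=$(printf '\316\274')

classify a        # one
classify "$mu"    # two: in the C locale each byte is matched separately
```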
Regexes have character classes [^a] and .:
pat='.' # single "character"
[[ $x =~ $pat ]]
- printf also formats time, via %(...)T in bash.
Other:
- wcswidth() doesn't just count code points. It calculates the display width of characters, which is different in general.
YSH:
- mystr ~ / [ \xff ] /
- case (x) { / dot / }
- for offset, rune in (runes(mystr)) decodes UTF-8, like Go
- Str.{trim,trimLeft,trimRight} respect Unicode space, like JavaScript does
- Str.{upper,lower} also need Unicode case folding
- split() respects Unicode space?
Not Unicode-aware:
- strcmp() does byte-wise and UTF-8-wise comparisons?
\yff
Unlike bash and CPython, Oils doesn't call setlocale(). (Although GNU readline may call it.)
It's expected that your locale will respect UTF-8. This is true on most distros. If not, then some string operations will support UTF-8 and some won't.
For example:
- ${#s} is implemented in Oils code, not libc, so it will always respect UTF-8.
- [[ s =~ $pat ]] is implemented with libc, so it is affected by the locale settings. Same with Oils (x ~ pat).
TODO: Oils should support LANG=C for some operations, but not LANG=X for other X.
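Counting code points without relying on the locale is straightforward for valid UTF-8, because every code point contains exactly one byte that is not a continuation byte (0x80 to 0xBF). The sketch below illustrates that idea with standard tools. It is not how Oils implements ${#s}; byte_len and codepoint_len are hypothetical names, and the code assumes valid UTF-8 input and does not detect decoding errors.

```shell
# Hypothetical helpers for illustration, not part of Oils or bash.

# Number of bytes in the argument.
byte_len() {
  printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

# Number of code points, assuming valid UTF-8: delete continuation
# bytes (0x80-0xBF, octal 200-277) and count the bytes that remain.
codepoint_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | LC_ALL=C wc -c | tr -d ' '
}

# U+03BC (μ) as its two UTF-8 bytes, written as octal escapes.
mu=$(printf '\316\274')

byte_len "a${mu}b"        # 4
codepoint_len "a${mu}b"   # 3
```

Because both helpers pin LC_ALL=C for the byte-level tools, the results are the same whatever locale the calling shell runs in.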
libc:
- glob() and fnmatch()
- regexec()
- strcoll() respects LC_COLLATE, which bash probably does
Our own:
- trimLeft() and ${s#prefix} need UTF-8 decoding
- for r in (runes(x)) needs UTF-8 decoding
- \u{3bc} needs UTF-8 encoding (currently in data_lang/j8.py Utf8Encode())
Not sure:
The iconv program converts text from one encoding to another.
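For instance, the following sketch converts the UTF-8 encoding of μ to UTF-16LE and dumps the resulting bytes. It assumes an iconv that accepts the -f and -t options and the encoding names UTF-8 and UTF-16LE, as GNU iconv does; utf8_to_utf16le_hex is a hypothetical helper name.

```shell
# utf8_to_utf16le_hex is a hypothetical helper for illustration.
# It reads UTF-8 on stdin and prints the UTF-16LE bytes as hex.
utf8_to_utf16le_hex() {
  iconv -f UTF-8 -t UTF-16LE | od -A n -v -t x1 | tr -d ' \n'
}

# U+03BC (μ) is the UTF-8 bytes 0xCE 0xBC (octal \316 \274).
# In UTF-16LE it is the single code unit 0x03BC, low byte first.
printf '\316\274' | utf8_to_utf16le_hex   # bc03
```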