Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.

Notes on Unicode in Shell

Table of Contents
Philosophy
A Mental Model
Program Encoding
Data Encoding
List of Features That Respect Unicode
OSH / bash
YSH
Data Languages
Implementation Notes
List of Low-Level UTF-8 Operations
Tips

Philosophy

Oils is UTF-8 centric, unlike bash and other shells.

That is, its Unicode support is like that of Go, Rust, Julia, and Swift, and unlike that of Python or JavaScript. The former languages internally represent strings as UTF-8, while the latter use arrays of code points or UTF-16 code units.

A Mental Model

Program Encoding

Shell programs should be encoded in UTF-8 (or its ASCII subset). Unicode characters can be encoded directly in the source:

echo 'μ'

or denoted in ASCII with C-escaped strings:

echo $'\u03bc'   # bash style

echo u'\u{3bc}'  # YSH style

(Such strings are preferred over echo -e because they're statically parsed.)
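To see what "encoded in UTF-8" means at the byte level, the following sketch dumps the bytes of a string in hex. The helper name hex_bytes is made up for this illustration; it relies only on printf, od, and tr.

```shell
# hex_bytes is a hypothetical helper, not part of Oils or bash.
# It prints the bytes of its first argument as lowercase hex, with no separators.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# A literal μ in a UTF-8 source file is stored as the two bytes 0xCE 0xBC.
hex_bytes 'μ'; echo

# The same two bytes written as octal escapes, which printf interprets
# independently of the locale.
hex_bytes "$(printf '\316\274')"; echo   # cebc
```

Both calls should print the same hex digits as long as the script itself is saved as UTF-8.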

Data Encoding

Strings in OSH are arbitrary sequences of bytes, which may or may not be valid UTF-8. Details:
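Because strings are byte sequences, a string's size in bytes can differ from the number of characters a reader sees. A minimal sketch, using a made-up helper byte_len built on wc -c, which counts bytes regardless of locale:

```shell
# byte_len is a hypothetical helper for illustration only.
# It prints the number of bytes in its first argument.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

byte_len 'abc'                     # 3 bytes, 3 characters
byte_len "$(printf '\316\274')"    # 2 bytes: the UTF-8 encoding of μ
byte_len "$(printf '\200')"        # 1 byte: a lone continuation byte, not valid UTF-8
```

The last call shows that a shell string can hold bytes that do not decode as UTF-8 at all.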

List of Features That Respect Unicode

OSH / bash

These operations are currently implemented in Python, in osh/string_ops.py:

More:

Globs

Globs support character classes such as [^a] and the single-character wildcard ?.

This pattern results in a glob() call:

echo my?glob
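As a rough illustration of how ? behaves during filename expansion, the sketch below creates a few ASCII-named files in a scratch directory and counts how many of them match my?glob. The function name count_glob_matches and the file names are invented for this example.

```shell
# count_glob_matches is a hypothetical demo function.
# It creates four files in a temporary directory and prints how many
# of them the pattern my?glob expands to.
count_glob_matches() {
  (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch my1glob my2glob myglob my12glob
    # ? matches exactly one character, so only my1glob and my2glob qualify.
    set -- my?glob
    echo "$#"
    cd / && rm -rf "$dir"
  )
}

count_glob_matches   # prints 2
```

The example sticks to ASCII names on purpose. For a name such as myμglob, whether ? matches depends on whether the matcher treats μ as one character or as two bytes.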

These patterns result in fnmatch() calls:

case $x in ?) echo 'one char' ;; esac

[[ $x == ? ]]

${s#?}  # remove one character prefix, quadratic loop for globs
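A minimal sketch of the ? pattern in parameter expansion, using ASCII input so that the result does not depend on the locale. The helper names drop_first and drop_last are assumptions made for this example.

```shell
# drop_first and drop_last are hypothetical helper names.
drop_first() { printf '%s\n' "${1#?}"; }   # '#' strips a matching prefix
drop_last()  { printf '%s\n' "${1%?}"; }   # '%' strips a matching suffix

drop_first 'abc'   # bc
drop_last  'abc'   # ab
```

With multi-byte input such as μ, whether ? consumes the whole character or a single byte depends on the shell and the locale.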

This uses our glob to ERE translator for position info:

echo ${s/?/x}
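The same pattern can be tried on ASCII input, where the outcome is the same in any locale. This is only a sketch: replace_first and replace_all are invented names, and the ${s/pat/repl} form is a bash/OSH extension rather than POSIX sh.

```shell
# replace_first and replace_all are hypothetical helper names.
# These expansions are bash/OSH extensions, not POSIX sh.
replace_first() { printf '%s\n' "${1/?/x}"; }    # replace the first character
replace_all()   { printf '%s\n' "${1//?/x}"; }   # replace every character

replace_first 'abc'   # xbc
replace_all   'abc'   # xxx
```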

Regexes (ERE)

Regexes support character classes such as [^a] and the single-character wildcard .:

pat='.'  # single "character"
[[ $x =~ $pat ]]
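A small sketch that anchors the pattern so the whole string must be a single "character". The function name is_one_char is an assumption for this example, and the inputs are ASCII so the result does not depend on the locale.

```shell
# is_one_char is a hypothetical helper: it succeeds when its argument
# consists of exactly one "character" as the regex engine sees it.
is_one_char() {
  local pat='^.$'   # anchored so the whole string must be a single character
  [[ $1 =~ $pat ]]
}

is_one_char 'a'  && echo 'a: one char'
is_one_char 'ab' || echo 'ab: not one char'
```

For a string such as μ, the answer depends on whether the regex engine decodes the two bytes as one character, which is the locale-dependent part.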

Locale-aware operations

Other:

YSH

Not Unicode-aware:

Data Languages

Implementation Notes

Unlike bash and CPython, Oils doesn't call setlocale(). (Although GNU readline may call it.)

It's expected that your locale will respect UTF-8. This is true on most distros. If not, then some string operations will support UTF-8 and some won't.

For example:

TODO: Oils should support LANG=C for some operations, but not LANG=X for other X.
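One way to check whether the current locale uses UTF-8 is the POSIX locale utility. The sketch below is not something Oils itself runs; charmap_is_utf8 is a made-up helper that compares a charmap name against UTF-8.

```shell
# charmap_is_utf8 is a hypothetical helper: it succeeds when the given
# charmap name is UTF-8.
charmap_is_utf8() {
  [ "$1" = 'UTF-8' ]
}

# `locale charmap` prints the character encoding of the current locale,
# for example UTF-8, or ANSI_X3.4-1968 for the C locale on glibc.
if charmap_is_utf8 "$(locale charmap 2>/dev/null)"; then
  echo 'locale uses UTF-8'
else
  echo 'locale does not use UTF-8'
fi
```

Running the same check under LANG=C and under a UTF-8 locale is a quick way to tell which of the two situations a script is in.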

List of Low-Level UTF-8 Operations

libc:

Our own:

Not sure:

Tips

Generated on Wed, 24 Jul 2024 17:20:16 +0000