1	---
2	default_highlighter: oils-sh
3	in_progress: yes
4	---
5
6	Notes on Unicode in Shell
7	=========================
8
9	<div id="toc">
10	</div>
11
12	## Philosophy
13
14	Oils is UTF-8 centric, unlike `bash` and other shells.
15
16	That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
17	Python or JavaScript. The former languages internally represent strings as
18	UTF-8, while the latter use arrays of code points or UTF-16 code units.
19
20	## A Mental Model
21
22	### Program Encoding
23
24	Shell programs should be encoded in UTF-8 (or its ASCII subset). Unicode
25	characters can be encoded directly in the source:
26
27	<pre>
28	echo 'μ'
29	</pre>
30
31	or denoted in ASCII with C-escaped strings:
32
33	    echo $'\u03bc'  # bash style
34
35	    echo u'\u{3bc}'  # YSH style
36
37	(Such strings are preferred over `echo -e` because they're statically parsed.)
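
To check that a literal character and an escaped one denote the same bytes, here is a minimal sketch. The `hex_bytes` helper is hypothetical (not part of Oils). It spells the escaped form with octal `printf` byte escapes rather than `\u`, on the assumption that byte escapes behave the same under any locale:

```shell
# Hypothetical helper: print the bytes of its argument in hex, space-separated.
hex_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -s '\n' ' ' | sed 's/^ //; s/ $//'
  echo
}

literal='μ'                    # encoded directly in the (UTF-8) source
escaped=$(printf '\316\274')   # the UTF-8 bytes of U+03BC as octal escapes

hex_bytes "$literal"
hex_bytes "$escaped"           # ce bc
```

If the source file is saved as UTF-8, both calls should print the same two bytes.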
38
39	### Data Encoding
40
41	Strings in OSH are arbitrary sequences of bytes, which may be valid UTF-8.
42	Details:
43
44	- When passed to external programs, strings are truncated at the first `NUL`
45	(`'\0'`) byte. This is a consequence of how Unix and C work.
46	- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
47	to be valid UTF-8. Decoding errors are fatal if `shopt -s
48	strict_word_eval` is on.
49
50	## List of Features That Respect Unicode
51
52	### OSH / bash
53
54	These operations are currently implemented in Python, in `osh/string_ops.py`:
55
56	- `${#s}` -- length in code points (buggy in bash)
57	  - Note: YSH `len(s)` returns a number of bytes, not code points.
58	- `${s:1:2}` -- index and length are a number of code points
59	- `${x#glob?}` and `${x##glob?}` (see below)
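
Counting code points in valid UTF-8 amounts to counting the bytes that are not continuation bytes (`0x80` to `0xBF`). A minimal sketch with POSIX tools follows. `utf8_len` is a hypothetical helper, not how `osh/string_ops.py` does it, and it does not validate its input:

```shell
# Hypothetical helper: count code points by skipping UTF-8 continuation
# bytes (decimal 128..191). Assumes valid UTF-8; no validation is done.
utf8_len() {
  printf '%s' "$1" | od -An -v -tu1 | awk '
    { for (i = 1; i <= NF; i++) if ($i < 128 || $i > 191) n++ }
    END { print n + 0 }'
}

s=$(printf 'a\316\274b')     # "aμb" as raw UTF-8 bytes
utf8_len "$s"                # 3 code points
printf '%s' "$s" | wc -c     # 4 bytes (some wc implementations pad with spaces)
```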
60
61	More:
62
63	- `${foo,}` and `${foo^}` for lowercase / uppercase
64	- `[[ a < b ]]` and `[ a '<' b ]` for sorting
65	  - these can use libc `strcoll()`?
66	- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
67	syntax for `ord()`, i.e. getting an integer from an encoded character.
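
The `printf` syntax in the last item can be wrapped in a small function. This is a sketch with a hypothetical `ord` helper. For ASCII the result is the same in any locale; for a non-ASCII character it depends on the shell and the locale, so only the ASCII cases are annotated:

```shell
# Hypothetical helper: numeric value of the first character of $1.
ord() {
  printf '%d\n' "'$1"
}

ord A    # 65
ord a    # 97
```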
68
69	#### Globs
70
71	Globs match single characters with bracket expressions like `[^a]` and with the wildcard `?`.
72
73	This pattern results in a `glob()` call:
74
75	    echo my?glob
76
77	These patterns result in `fnmatch()` calls:
78
79	    case $x in ?) echo 'one char' ;; esac
80
81	    [[ $x == ? ]]
82
83	    ${s#?}  # remove one character prefix, quadratic loop for globs
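
Whether `?` consumes a byte or a whole code point depends on the matcher and on the locale. The sketch below forces the C locale in a subshell, on the assumption that the shell (bash here) then matches byte by byte. `matches_one_byte` is a hypothetical name:

```shell
# Hypothetical helper: does the whole string match the glob `?` when the
# locale is forced to C? The function body is a subshell, so the LC_ALL
# assignment doesn't leak into the caller.
matches_one_byte() (
  LC_ALL=C
  case $1 in
    ?) echo yes ;;
    *) echo no ;;
  esac
)

matches_one_byte a                         # yes
matches_one_byte "$(printf '\316\274')"    # no: U+03BC is two bytes in UTF-8
```

In a UTF-8 locale, bash would instead treat the two bytes as one character.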
84
85	This uses our glob to ERE translator for position info:
86
87	    echo ${s/?/x}
88
89	#### Regexes (ERE)
90
91	Regexes match single characters with bracket expressions like `[^a]` and with `.`:
92
93	    pat='.'  # single "character"
94	    [[ $x =~ $pat ]]
95
96	#### Locale-aware operations
97
98	- The prompt string can include the time, whose format is locale-specific.
99	- In bash, `printf` can also format the time, with `%(...)T`.
100
101	Other:
102
103	- The prompt width is calculated with `wcswidth()`, which doesn't just count
104	code points. It calculates the display width of characters, which is
105	different in general.
106
107	### YSH
108
109	- Eggex matching depends on ERE semantics.
110	  - `mystr ~ / [ \xff ] /`
111	  - `case (x) { / dot / }`
112	- `for offset, rune in (runes(mystr))` decodes UTF-8, like Go
113	- `Str.{trim,trimLeft,trimRight}` respect Unicode space, like JavaScript does
114	- `Str.{upper,lower}` also need Unicode case folding
115	- `split()` respects Unicode space?
116
117	Not Unicode aware:
118
119	- `strcmp()` compares byte-wise. For valid UTF-8, this gives the same order as comparing code points.
120
121	### Data Languages
122
123	- Decoding JSON/J8 validates UTF-8
124	- Encoding JSON/J8 decodes and validates UTF-8
125	- So we can distinguish valid UTF-8 and invalid bytes like `\yff`
126
127	## Implementation Notes
128
129	Unlike bash and CPython, Oils doesn't call `setlocale()`. (Although GNU
130	readline may call it.)
131
132	It's expected that your locale will respect UTF-8. This is true on most
133	distros. If not, then some string operations will support UTF-8 and some
134	won't.
135
136	For example:
137
138	- String length like `${#s}` is implemented in Oils code, not libc, so it will
139	always respect UTF-8.
140	- `[[ $s =~ $pat ]]` is implemented with libc, so it is affected by the locale
141	settings. Same with Oils `(x ~ pat)`.
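
By contrast, in bash the result of `${#s}` follows the locale. A common idiom, sketched here with a hypothetical `byte_len` function, forces the C locale in a subshell to get a byte count:

```shell
# Hypothetical helper: length of $1 in bytes, obtained by forcing the C locale.
# The function body is a subshell, so LC_ALL doesn't leak into the caller.
byte_len() (
  LC_ALL=C
  printf '%s\n' "${#1}"
)

mu=$(printf '\316\274')   # U+03BC encoded as UTF-8
byte_len "$mu"            # 2
```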
142
143	TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
144	other `X`.
145
146	### List of Low-Level UTF-8 Operations
147
148	libc:
149
150	- `glob()` and `fnmatch()`
151	- `regexec()`
152	- `strcoll()` respects `LC_COLLATE`; bash probably relies on it
153
154	Our own:
155
156	- Decode next rune from a position, or previous rune
157	  - `trimLeft()` and `${s#prefix}` need this
158	- Decode UTF-8
159	  - J8 encoding and decoding need this
160	  - `for r in (runes(x))` needs this
161	  - respecting surrogate halves
162	    - JSON needs this
163	- Encode integer rune to UTF-8 sequence
164	  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)
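
The "encode integer rune" operation can be sketched in plain shell arithmetic. This is an illustration only, not the code in `data_lang/j8.py`. `utf8_encode` is a hypothetical helper that takes a decimal code point and rejects surrogates and out-of-range values:

```shell
# Hypothetical helper: write one code point (decimal integer) as UTF-8 bytes.
# Note: `cp` and `byte` are globals, since POSIX sh has no `local`.
utf8_encode() {
  cp=$1
  # Reject values outside Unicode and the surrogate range U+D800..U+DFFF
  if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ]; then return 1; fi
  if [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then return 1; fi

  # Split the code point into 1 to 4 byte values
  if [ "$cp" -lt 128 ]; then
    set -- "$cp"
  elif [ "$cp" -lt 2048 ]; then
    set -- $(( 192 | ($cp >> 6) )) $(( 128 | ($cp & 63) ))
  elif [ "$cp" -lt 65536 ]; then
    set -- $(( 224 | ($cp >> 12) )) $(( 128 | (($cp >> 6) & 63) )) \
           $(( 128 | ($cp & 63) ))
  else
    set -- $(( 240 | ($cp >> 18) )) $(( 128 | (($cp >> 12) & 63) )) \
           $(( 128 | (($cp >> 6) & 63) )) $(( 128 | ($cp & 63) ))
  fi

  # Emit each byte through an octal escape in the printf format string
  for byte in "$@"; do
    printf "\\$(printf '%03o' "$byte")"
  done
}

utf8_encode 956 | od -An -tx1     # U+03BC: bytes ce bc
utf8_encode 8364 | od -An -tx1    # U+20AC: bytes e2 82 ac
```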
165
166	Not sure:
167
168	- Case folding
169	  - both OSH and YSH have uppercase and lowercase
170
171	## Tips
172
173	- The GNU `iconv` program converts text from one encoding to another.
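
For example, this sketch converts the UTF-8 encoding of `μ` (U+03BC) to UTF-16LE and dumps the resulting bytes. It assumes an `iconv` that knows both encoding names, which is the case for GNU libc:

```shell
# U+03BC is the two bytes ce bc in UTF-8, and bc 03 in little-endian UTF-16.
printf '\316\274' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
```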
174
175	<!--
176	## Spec Tests
177
178	June 2024 notes:
179
180	- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
181	- ${s//?/a}
182	- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.
183
184	-->
185
186
187
188	<!--
189
190	What libraries are we using?
191
192	TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?
193
194	Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
195	investigate the API more.
196
197	- fnmatch()
198	- glob()
199	- regcomp/regexec()
200
201	- Are we using any re2c unicode? For JSON?
202	- upper() and lower()? isupper() is lower()
203	- Need to sort these out
204
205	-->