mycpp
=====

This is a Python-to-C++ translator based on MyPy. It only
handles the small subset of Python that we use in Oils.

It's inspired by both mypyc and Shed Skin. These posts give background:

- [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
- [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)

As of March 2024, the translation to C++ is **done**. So it's no longer
experimental!

However, it's still pretty **hacky**. This doc exists mainly to explain the
hacks. (We may want to rewrite mycpp as "yaks", although it's low priority
right now.)

---

Source for this doc: [mycpp/README.md]($oils-src). The code is all in
[mycpp/]($oils-src).


<div id="toc">
</div>

## Instructions

### Translating and Compiling `oils-cpp`

Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
run:

    oil$ build/py.sh all

This will give you a working shell:

    oil$ bin/osh -c 'echo hi'  # running interpreted Python
    hi

To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
dependencies. First install packages:

    # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
    oil$ build/deps.sh install-ubuntu-packages

Then fetch data, like the Python 3.10 tarball and MyPy repo:

    oil$ build/deps.sh fetch

Then build from source:

    oil$ build/deps.sh install-wedges

To build oil-native, use:

    oil$ ./NINJA-config.sh
    oil$ ninja  # translate and compile, may take 30 seconds

    oil$ _bin/cxx-asan/osh -c 'echo hi'  # running compiled C++ !
    hi

To run the tests and benchmarks:

    oil$ mycpp/TEST.sh test-translator
    ... 200+ tasks run ...

If you have problems, post a message on `#oil-dev` at
`https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
so I can use your feedback!

Related:

- [Oil Native Quick Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start)
  on the wiki.
- [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start)

## Notes on the Algorithm / Architecture

There are four passes over the MyPy AST.

(1) `const_pass.py`: Collect string constants

Turn the constant in `myfunc("foo")` into top-level `GLOBAL_STR(str1,
"foo")`.

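To make the idea of the constant pass concrete, here is a minimal sketch. It
is NOT the code in `const_pass.py`: it walks Python's stdlib `ast` rather than
the MyPy AST, and the helper names `collect_str_constants()` and
`emit_globals()` are made up for illustration.

```python
import ast


def collect_str_constants(source):
  """Map each distinct string literal to a generated name like str0, str1."""
  names = {}
  for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
      if node.value not in names:
        names[node.value] = 'str%d' % len(names)
  return names


def emit_globals(names):
  """Emit one top-level declaration per constant (no escaping, for brevity)."""
  return ['GLOBAL_STR(%s, "%s")' % (name, value)
          for value, name in names.items()]


names = collect_str_constants('myfunc("foo")\nother("foo", "bar")')
for line in emit_globals(names):
  print(line)
```

The repeated `"foo"` gets a single global, so every use site can refer to the
same name.
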
(2) Three passes in `cppgen_pass.py`.

(a) Forward Declaration Pass.

    class Foo;
    class Bar;

This pass also determines which methods should be declared `virtual` in their
declarations. The `virtual` keyword is written in the next pass.

(b) Declaration Pass.

    class Foo {
      void method();
    };
    class Bar {
      void method();
    };

More work in this pass:

- Collect member variables and write them at the end of the definition
- Collect locals for "hoisting". Written in the next pass.

(c) Definition Pass.

    void Foo::method() {
      ...
    }

    void Bar::method() {
      ...
    }

Note: I really wish we were not using visitors, but that's inherited from MyPy.

## mycpp Idioms / "Creative Hacks"

Oils is written in typed Python 2. It will run under a stock Python 2
interpreter, and it will typecheck with stock MyPy.

However, there are a few language features that don't map cleanly from typed
Python to C++:

- switch statements (unfortunately we don't have the Python 3 match statement)
- C++ destructors - the RAII pattern
- casting - MyPy has one kind of cast; C++ has `static_cast` and
  `reinterpret_cast`. (We don't use C-style casting.)

So this describes the idioms we use. There are some hacks in
[mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
runtime equivalents in `mycpp/mylib.py`.

### `with {,tag,str_}switch` &rarr; Switch statement

We have three constructs that translate to a C++ switch statement. They use a
Python context manager `with Xswitch(obj) ...` as a little hack.

Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
(`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)

Simple switch:

    myint = 99
    with switch(myint) as case:
      if case(42, 43):
        print('forties')
      else:
        print('other')

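For intuition about the Python side of this hack, a context manager with a
`__call__` method is enough to make `case(42, 43)` work at runtime. This is a
minimal sketch under that assumption, not a copy of the class in
`mycpp/mylib.py`; `describe()` is a made-up helper.

```python
class switch(object):
  """Sketch of a runtime stand-in for a C++ switch statement."""

  def __init__(self, value):
    self.value = value

  def __enter__(self):
    return self  # bound to `case` by `with ... as case`

  def __exit__(self, exc_type, exc_value, traceback):
    return False  # never swallow exceptions

  def __call__(self, *cases):
    # case(42, 43) is true if the switched value equals any argument
    return self.value in cases


def describe(myint):
  with switch(myint) as case:
    if case(42, 43):
      return 'forties'
    else:
      return 'other'


print(describe(42))
print(describe(99))
```
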
Switch on **object type**, which goes well with ASDL sum types:

    val = value.Str('foo')  # type: value_t
    with tagswitch(val) as case:
      if case(value_e.Str, value_e.Int):
        print('string or int')
      else:
        print('other')

We usually need to apply the `UP_val` pattern here, described in the next
section.

Switch on **string**, which generates a fast **two-level dispatch** -- first on
length, and then with `str_equals_c()`:

    s = 'foo'
    with str_switch(s) as case:
      if case("foo"):
        print('FOO')
      else:
        print('other')

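The two-level dispatch can be sketched in plain Python. This only illustrates
the idea (compare lengths first, then whole strings); the generated C++ is a
real `switch` on the length, and `build_table()` and `dispatch()` are
hypothetical names.

```python
def build_table(cases):
  """Level 1: group the candidate strings by length."""
  table = {}
  for s in cases:
    table.setdefault(len(s), []).append(s)
  return table


def dispatch(table, s):
  """Return the matching case string, or None for the default branch."""
  for candidate in table.get(len(s), []):  # level 1: length
    if candidate == s:                     # level 2: full comparison
      return candidate
  return None


table = build_table(['foo', 'bar', 'spam'])
print(dispatch(table, 'bar'))
print(dispatch(table, 'eggs'))
```

Most non-matching inputs are rejected by the length check alone, so few full
string comparisons are needed.
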
### `val` &rarr; `UP_val` &rarr; `val` Downcasting pattern

Summary: variable names like `UP_*` are **special** in our Python code.

Consider the downcasts marked BAD:

    val = value.Str('foo')  # type: value_t

    with tagswitch(val) as case:
      if case(value_e.Str):
        val = cast(value.Str, val)  # BAD: conflicts with first declaration
        print('s = %s' % val.s)

      elif case(value_e.Int):
        val = cast(value.Int, val)  # BAD: conflicts with both
        print('i = %d' % val.i)

      else:
        print('other')

MyPy allows this, but it translates to invalid C++ code. C++ can't have a
single variable named `val` with 2 related types, `value_t` and `value::Str`.

So we use this idiom instead, which takes advantage of **local vars in case
blocks** in C++:

    val = value.Str('foo')  # type: value_t

    UP_val = val  # temporary variable that will be cast

    with tagswitch(val) as case:
      if case(value_e.Str):
        val = cast(value.Str, UP_val)  # this works
        print('s = %s' % val.s)

      elif case(value_e.Int):
        val = cast(value.Int, UP_val)  # also works
        print('i = %d' % val.i)

      else:
        print('other')

This translates to something like:

    value_t* val = Alloc<value::Str>(str42);
    value_t* UP_val = val;

    switch (val->tag()) {
      case value_e::Str: {
        // DIFFERENT local var
        value::Str* val = static_cast<value::Str*>(UP_val);
        print(StrFormat(str43, val->s));
      }
        break;
      case value_e::Int: {
        // ANOTHER DIFFERENT local var
        value::Int* val = static_cast<value::Int*>(UP_val);
        print(StrFormat(str44, val->i));
      }
        break;
      default:
        print(str45);
    }

This works because there's no problem having **different** variables with the
same name within each `case { }` block.

Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
the inner blocks will look like:

      case value_e::Str: {
        val = static_cast<value::Str*>(val);  // BAD: val reused
        print(StrFormat(str43, val->s));
      }

And they will fail to compile. It's not valid C++ because `val` still has the
superclass type `value_t*`, which doesn't have a field `s`. Only the subclass
`value::Str` has it.

(Note that Python has a single flat scope per function, while C++ has nested
scopes.)

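On the Python side, the pattern costs nothing: `typing.cast()` returns its
argument unchanged, so `UP_val` and `val` are the same object. Here is a small
sketch with stand-in classes (`value_t` and `Str` below are simplified
placeholders, not the ASDL-generated types).

```python
from typing import cast


class value_t(object):
  """Stand-in for the ASDL sum type."""
  pass


class Str(value_t):
  """Stand-in for value.Str."""

  def __init__(self, s):
    self.s = s


def show(val):
  # type: (value_t) -> str
  UP_val = val
  val = cast(Str, UP_val)  # no runtime effect; only MyPy and mycpp care
  assert val is UP_val     # same object under two names
  return 's = %s' % val.s


print(show(Str('foo')))
```
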
### Python context manager &rarr; C++ constructor and destructor (RAII)

This Python code:

    with ctx_Foo(42):
      f()

translates to this C++ code:

    {
      ctx_Foo tmp(42);
      f();

      // destructor ~ctx_Foo implicitly called
    }

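To show which Python method lines up with which C++ member, here is a sketch
of a context manager written in that style. It assumes setup happens in
`__init__` (the constructor) and cleanup in `__exit__` (the destructor); the
`log` parameter is only there to make the ordering visible.

```python
class ctx_Foo(object):
  """Hypothetical context manager that records the order of events."""

  def __init__(self, log, n):
    self.log = log
    log.append('enter %d' % n)  # corresponds to the C++ constructor

  def __enter__(self):
    pass  # nothing to do; the work is in __init__

  def __exit__(self, type, value, traceback):
    self.log.append('exit')  # corresponds to the destructor ~ctx_Foo


def run(log):
  with ctx_Foo(log, 42):
    log.append('body')
  return log


print(run([]))
```
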
## MyPy "Shimming" Technique

We have an interesting way of "writing Python and C++ at the same time":

1. First, all Python code must pass the MyPy type checker, and run with a stock
   Python 2 interpreter.
   - This is the source of truth &mdash; the source of our semantics.
1. We translate most `.py` files to C++, **except** some files, in particular
   [mycpp/mylib.py]($oils-src) and files starting with `py` like
   `core/{pyos,pyutil}.py`.
1. In C++, we can substitute custom implementations with the properties we
   want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
   `BufWriter` being efficient, etc.

The MyPy type system is very powerful! It lets us do all this.

### NewDict() for ordered dicts

Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
using `mylib.NewDict()`, which returns `collections_.OrderedDict`.

The **static type** is still `Dict[K, V]`, but we change the "spec" to be an
ordered dict.

In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
implement preserving order on deletion, which seems OK.)

- TODO: `iteritems()` could go away

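A sketch of the shim on the Python side might look like this. It uses the
stdlib `collections.OrderedDict` as a stand-in for `collections_.OrderedDict`,
which is an assumption made so the example runs on its own.

```python
from collections import OrderedDict


def NewDict():
  # type: () -> dict
  """Statically a plain dict; at runtime, insertion order is preserved."""
  return OrderedDict()


d = NewDict()
d['b'] = 1
d['a'] = 2
print(list(d.keys()))
```

Callers only see the `Dict[K, V]` interface, so the C++ `Dict<K, V>` can be
swapped in without changing them.
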
### StackArray[T]

TODO: describe this when it works.

### BigInt

- In Python, it's simply defined as a class with an integer, in
  [mylib/mops.py]($oils-src).
- In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
  integer.

### ByteAt(), ByteEquals(), ...

Hand optimization to reduce the number of 1-byte strings. Used for the IFS
algorithm, `LooksLikeGlob()`, and `GlobUnescape()`.

### File / LineReader / BufWriter

TODO: describe how this works.

Can it be more type safe? I think we can cast `File` to both `LineReader` and
`BufWriter`.

Or can we invert the relationship, so `File` derives from **both** LineReader
and BufWriter?

### Fast JSON - avoid intermediate allocations

- `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
  only to throw them away and write to `mylib.BufWriter`. Instead, we append
  encoded strings **directly** to the `BufWriter`.
- Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
  when writing indents.
  - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
- We may also want `BufWriter::write_slice()`

## Limitations Requiring Source Rewrites

mycpp itself may limit expressiveness, or the C++ language may not be able to
express what we want.

- C++ doesn't have `try / except / else`, or `finally`
  - Use the `with ctx_Foo` pattern instead.
- `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
  non-empty test
- Functions can have at most one keyword / optional argument.
  - We generate two methods: `f(x)` which calls `f(x, y)` with the default
    value of `y`
  - If there are two or more optional arguments:
    - For classes, you can use the "builder pattern", i.e. add an
      `Init_MyMember()` method
    - If the arguments are booleans, translate it to a single bitfield argument
- C++ has nested scope and Python has flat function scope. This can cause name
  collisions.
  - Could enforce this if it becomes a problem

Also see `mycpp/examples/invalid_*` for Python code that fails to translate.

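Two of the rewrites above, sketched in Python. The flag names and functions
are hypothetical; the point is that several boolean options collapse into one
optional `int`, and that emptiness is tested with `len()`.

```python
# Hypothetical flag names, for illustration only
OPT_VERBOSE = 1 << 0
OPT_DRY_RUN = 1 << 1


def run(cmd, flags=0):
  # type: (str, int) -> str
  """One optional bitfield argument instead of two optional booleans."""
  parts = [cmd]
  if flags & OPT_VERBOSE:
    parts.append('verbose')
  if flags & OPT_DRY_RUN:
    parts.append('dry-run')
  return ' '.join(parts)


def first_or_default(mylist, default):
  # type: (list, int) -> int
  if len(mylist):  # not `if mylist`, which only tests the pointer in C++
    return mylist[0]
  return default


print(run('ls', OPT_VERBOSE | OPT_DRY_RUN))
print(first_or_default([], 7))
```
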
## WARNING: Assumptions Not Checked

### Global Constants Can't Be Mutated

We translate top level constants to statically initialized C data structures
(zero startup cost):

    gStr = 'foo'
    gList = [1, 2]  # type: List[int]
    gDict = {'bar': 42}  # type: Dict[str, int]

Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
these global instances! The C++ code will break at runtime.

### Gotcha about Returning Variants (Subclasses) of a Type

MyPy will accept this code:

```
if cond:
  sig = proc_sig.Open  # type: proc_sig_t
  # bad because mycpp HOISTS this
else:
  sig = proc_sig.Closed.CreateNull()
  sig.words = words  # assignment fails
return sig
```

It will translate to C++, but fail to compile. Instead, rewrite it like this:

```
sig = None  # type: proc_sig_t
if cond:
  sig = proc_sig.Open
else:
  closed = proc_sig.Closed.CreateNull()
  closed.words = words  # OK: `closed` has the subclass type
  sig = closed
return sig
```

### Exceptions Can't Leave Destructors / Python `__exit__`

Context managers like `with ctx_Foo():` translate to C++ constructors and
destructors.

In C++, an exception can't "leave" a destructor. It results in a runtime
error.

You can throw and CATCH an exception WITHIN a destructor, but you can't let it
propagate outside.

This means you must be careful when coding the `__exit__` method. For example,
in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
caught when restoring/popping redirects.

To fix the bug, we rewrote the code to use an out param
`List[IOError_OSError]`.

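The shape of that fix can be sketched as follows. This is not the code in
`vm::ctx_Redirect`: `ctx_Restore`, `failing_restore()`, and `demo()` are
made-up names, and passing a callable is only for illustration under plain
Python.

```python
class ctx_Restore(object):
  """Hypothetical context manager whose cleanup step can fail."""

  def __init__(self, restore, err_out):
    self.restore = restore
    self.err_out = err_out  # out param: errors are appended here

  def __enter__(self):
    pass

  def __exit__(self, type, value, traceback):
    try:
      self.restore()
    except (IOError, OSError) as e:
      # Caught WITHIN __exit__, so nothing propagates out of the
      # corresponding C++ destructor
      self.err_out.append(e)


def failing_restore():
  raise IOError('restore failed')


def demo():
  err_out = []
  with ctx_Restore(failing_restore, err_out):
    pass
  return err_out  # the caller inspects errors after the block


print(len(demo()))
```
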
Related:

- <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>

## More Translation Notes

### Hacky Heuristics

- `callable(arg)` to either:
  - function call `f(arg)`
  - instantiation `Alloc<T>(arg)`
- `name.attr` to either:
  - `obj->member`
  - `module::Func`
- `cast(MyType, obj)` to either:
  - `static_cast<MyType*>(obj)`
  - `reinterpret_cast<MyType*>(obj)`

### Hacky Hard-Coded Names

These are signs of coupling between mycpp and Oils, which ideally shouldn't
exist.

- `mycpp_main.py`
  - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
    runtime.
    - TODO: Pea can respect parameter order? So we do that outside the project?
    - Another ordering constraint comes from **inheritance**. The forward
      declaration is NOT sufficient in that case.
- `cppgen_pass.py`
  - `_GetCastKind()` has some hard-coded names
  - `AsdlType::Create()` is special cased to `::`, not `->`
  - Default arguments e.g. `scope_e::Local` need a repeated `using`.

Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>

### Major Features

- Python `int` and `bool` &rarr; C++ `int` and `bool`
  - `None` &rarr; `nullptr`
- Statically Typed Python Collections
  - `str` &rarr; `Str*`
  - `List[T]` &rarr; `List<T>*`
  - `Dict[K, V]` &rarr; `Dict<K, V>*`
  - tuples &rarr; `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
- Collection literals turn into initializer lists
  - And there is a C++ type inference issue which requires an explicit
    `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
- Python's polymorphic iteration &rarr; `StrIter`, `ListIter<T>`,
  `DictIter<K, V>`
  - `d.iteritems()` is rewritten to `mylib.iteritems()` &rarr; `DictIter`
    - TODO: can we be smarter about this?
  - `reversed(mylist)` &rarr; `ReverseListIter`
- Python's `in` operator:
  - `s in mystr` &rarr; `str_contains(mystr, s)`
  - `x in mylist` &rarr; `list_contains(mylist, x)`
- Classes and inheritance
  - `__init__` method becomes a constructor. Note: initializer lists aren't
    used.
  - Detect `virtual` methods
  - TODO: could we detect `abstract` methods? (`NotImplementedError`)
- Python generators `Iterator[T]` &rarr; eager `List<T>` accumulators
- Python Exceptions &rarr; C++ exceptions
- Python Modules &rarr; C++ namespace (we assume a 2-level hierarchy)
  - TODO: mycpp needs real modules, because our `oils_for_unix.mycpp.cc`
    translation unit is getting big.
  - And `cpp/preamble.h` is a hack to work around the lack of modules.

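The generator translation in the list above is easiest to see side by side.
This sketch writes the "eager accumulator" form by hand in Python; the
function names are made up, and the real output is C++ that appends to a
`List<T>`.

```python
def evens_lazy(n):
  """What we write: a Python generator."""
  for i in range(n):
    if i % 2 == 0:
      yield i


def evens_eager(n):
  """Roughly what the translation does: fill a list, then return it."""
  out = []
  for i in range(n):
    if i % 2 == 0:
      out.append(i)  # each `yield` becomes an append
  return out


print(list(evens_lazy(7)))
print(evens_eager(7))
```

The results are the same, but the eager form does all the work before the
caller sees the first item, so it only suits finite sequences.
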
### Minor Translations

- `s1 == s2` &rarr; `str_equals(s1, s2)`
- `'x' * 3` &rarr; `str_repeat(globalStr, 3)`
- `[None] * 3` &rarr; `list_repeat(nullptr, 3)`
- Omitted:
  - If the LHS of an assignment is `_`, then the statement is omitted
    - This is for `_ = log`, which shuts up Python lint warnings for 'unused
      import'
  - Code under `if __name__ == '__main__'`

### Optimizations

- Returning Tuples by value. To reduce GC pressure, we return
  `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and `Tuple4`.

### Rooting Policy

The translated code roots local variables in every function:

    StackRoots _r({&var1, &var2});

We have two kinds of hand-written code:

1. Methods like `Str::strip()` in `mycpp/`
2. OS bindings like `stat()` in `cpp/`

Neither of them needs any rooting! This is because we use **manual collection
points** in the interpreter, and these functions don't call any functions that
can collect. They are "leaves" in the call tree.

## The mycpp Runtime

The mycpp translator targets a runtime that's written from scratch. It
implements garbage-collected data structures like:

- Typed records
  - Python classes
  - ASDL product and sum types
- `Str` (immutable, as in Python)
- `List<T>`
- `Dict<K, V>`
- `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...

It also has functions based on CPython's:

- `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
  module, e.g. `int()` and `str()`
- `mycpp/gc_mylib.{h,cc}` corresponds to `mylib.py`
  - `mylib.BufWriter` is a bit like `cStringIO.StringIO`

### Differences from CPython

- Integers are either C `int` or `mylib.BigInt`, not Python's arbitrary size
  integers
- `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
  CPython
- `s.strip()` is defined in terms of ASCII whitespace, which does not include
  say `\v`.
  - This is done to be consistent with JSON and J8 Notation.

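The `strip()` difference can be reproduced in plain Python. The sketch below
assumes the whitespace set is the four JSON whitespace bytes (space, tab, CR,
LF); the exact set used by the runtime is not spelled out here, and
`ascii_strip()` is a made-up name.

```python
# Assumption: the four JSON whitespace bytes
ASCII_SPACE = ' \t\r\n'


def ascii_strip(s):
  # type: (str) -> str
  """Strip only the assumed ASCII whitespace set from both ends."""
  return s.strip(ASCII_SPACE)


print(repr(ascii_strip(' \t foo\n')))
print(repr(ascii_strip('\vfoo\v')))   # \v is kept
print(repr('\vfoo\v'.strip()))        # CPython's default strip removes \v
```
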
## C++ Notes

### Gotchas

- C++ classes can have 2 member variables of the same name! One from the base
  class and one from the derived class.
- Failing to declare methods `virtual` can result in the wrong one being called
  at runtime

### Minor Features Used

In addition to classes, templates, exceptions, etc. mentioned above, we use:

- `static_cast` and `reinterpret_cast`
- `enum class` for ASDL
- Function overloading
  - For equality and hashing?
- `offsetof` for introspection of field positions for garbage collection
- `std::initializer_list` for `StackRoots()`
  - Should we get rid of this?

### Not Used

- I/O Streams, RTTI, etc.
- `const`
- Smart pointers
