1 | mycpp
|
2 | =====
|
3 |
|
4 | This is a Python-to-C++ translator based on MyPy. It only
|
5 | handles the small subset of Python that we use in Oils.
|
6 |
|
7 | It's inspired by both mypyc and Shed Skin. These posts give background:
|
8 |
|
9 | - [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
|
10 | - [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)
|
11 |
|
12 | As of March 2024, the translation to C++ is **done**. So it's no longer
|
13 | experimental!
|
14 |
|
15 | However, it's still pretty **hacky**. This doc exists mainly to explain the
|
16 | hacks. (We may want to rewrite mycpp as "yaks", although it's low priority
|
17 | right now.)
|
18 |
|
19 | ---
|
20 |
|
21 | Source for this doc: [mycpp/README.md]($oils-src). The code is all in
|
22 | [mycpp/]($oils-src).
|
23 |
|
24 |
|
25 | <div id="toc">
|
26 | </div>
|
27 |
|
28 | ## Instructions
|
29 |
|
30 | ### Translating and Compiling `oils-cpp`
|
31 |
|
32 | Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
|
33 | instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
|
34 | the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
|
35 | run:
|
36 |
|
37 | oil$ build/py.sh all
|
38 |
|
39 | This will give you a working shell:
|
40 |
|
41 | oil$ bin/osh -c 'echo hi' # running interpreted Python
|
42 | hi
|
43 |
|
44 | To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
|
45 | dependencies. First install packages:
|
46 |
|
47 | # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
|
48 | oil$ build/deps.sh install-ubuntu-packages
|
49 |
|
50 | Then fetch data, like the Python 3.10 tarball and MyPy repo:
|
51 |
|
52 | oil$ build/deps.sh fetch
|
53 |
|
54 | Then build from source:
|
55 |
|
56 | oil$ build/deps.sh install-wedges
|
57 |
|
58 | To build oil-native, use:
|
59 |
|
60 | oil$ ./NINJA-config.sh
|
61 | oil$ ninja # translate and compile, may take 30 seconds
|
62 |
|
63 | oil$ _bin/cxx-asan/osh -c 'echo hi' # running compiled C++ !
|
64 | hi
|
65 |
|
66 | To run the tests and benchmarks:
|
67 |
|
68 | oil$ mycpp/TEST.sh test-translator
|
69 | ... 200+ tasks run ...
|
70 |
|
71 | If you have problems, post a message on `#oil-dev` at
|
72 | `https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
|
73 | so I can use your feedback!
|
74 |
|
75 | Related:
|
76 |
|
77 | - [Oil Native Quick
|
78 | Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start) on the
|
79 | wiki.
|
80 | - [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Dev-Cheat-Sheet)
|
81 |
|
82 | ## Notes on the Algorithm / Architecture
|
83 |
|
84 | There are four passes over the MyPy AST.
|
85 |
|
86 | (1) `const_pass.py`: Collect string constants
|
87 |
|
88 | Turn the constant in `myfunc("foo")` into top-level `GLOBAL_STR(str1,
|
89 | "foo")`.
|
90 |
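As a rough illustration of what a constant-collection pass does, here is a minimal sketch. It is not the code in `const_pass.py`: the function name `collect_string_constants`, the de-duplication of repeated literals, the numbering from `str0`, and the simplified escaping are all assumptions made for this example.

```python
def collect_string_constants(literals):
    """Assign one generated global name per distinct string literal.

    Hypothetical helper: returns (literal -> name, list of C++ declarations).
    """
    names = {}  # literal -> generated name, in first-seen order
    decls = []
    for s in literals:
        if s in names:
            continue  # assumption: repeated literals share one global
        name = 'str%d' % len(names)
        names[s] = name
        # Simplified escaping: only backslash and double quote are handled
        escaped = s.replace('\\', '\\\\').replace('"', '\\"')
        decls.append('GLOBAL_STR(%s, "%s");' % (name, escaped))
    return names, decls


names, decls = collect_string_constants(['foo', 'bar', 'foo'])
for d in decls:
    print(d)
```

Later passes can then replace each literal in the function bodies with the generated name.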
|
91 | (2) Three passes in `cppgen_pass.py`.
|
92 |
|
93 | (a) Forward Declaration Pass.
|
94 |
|
95 | class Foo;
|
96 | class Bar;
|
97 |
|
98 | This pass also determines which methods should be declared `virtual` in their
|
99 | declarations. The `virtual` keyword is written in the next pass.
|
100 |
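One plausible rule for detecting `virtual` methods is "a method is virtual if a subclass overrides it". The sketch below implements that rule over a toy class table. The table format and the name `find_virtual_methods` are hypothetical, and the real pass in `cppgen_pass.py` works on the MyPy AST and may use different criteria.

```python
def find_virtual_methods(classes):
    """classes maps class name -> (base name or None, list of method names).

    Returns a set of (class name, method name) pairs that need `virtual`.
    """
    virtual = set()
    for name, (base, methods) in classes.items():
        ancestor = base
        while ancestor is not None:  # walk up the inheritance chain
            ancestor_base, ancestor_methods = classes[ancestor]
            for m in methods:
                if m == '__init__':
                    continue  # constructors are never virtual
                if m in ancestor_methods:
                    virtual.add((ancestor, m))
                    virtual.add((name, m))
            ancestor = ancestor_base
    return virtual


classes = {
    'Base': (None, ['__init__', 'Run', 'Name']),
    'Derived': ('Base', ['__init__', 'Run']),
}
result = find_virtual_methods(classes)
print(sorted(result))
```

Here `Name` is never overridden, so it stays non-virtual.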
|
101 | (b) Declaration Pass.
|
102 |
|
103 | class Foo {
|
104 | void method();
|
105 | };
|
106 | class Bar {
|
107 | void method();
|
108 | };
|
109 |
|
110 | More work in this pass:
|
111 |
|
112 | - Collect member variables and write them at the end of the definition
|
113 | - Collect locals for "hoisting". Written in the next pass.
|
114 |
|
115 | (c) Definition Pass.
|
116 |
|
117 | void Foo::method() {
|
118 | ...
|
119 | }
|
120 |
|
121 | void Bar::method() {
|
122 | ...
|
123 | }
|
124 |
|
125 | Note: I really wish we were not using visitors, but that's inherited from MyPy.
|
126 |
|
127 | ## mycpp Idioms / "Creative Hacks"
|
128 |
|
129 | Oils is written in typed Python 2. It will run under a stock Python 2
|
130 | interpreter, and it will typecheck with stock MyPy.
|
131 |
|
132 | However, there are a few language features that don't map cleanly from typed
|
133 | Python to C++:
|
134 |
|
135 | - switch statements (unfortunately we don't have the Python 3 match statement)
|
136 | - C++ destructors - the RAII pattern
|
137 | - casting - MyPy has one kind of cast; C++ has `static_cast` and
|
138 | `reinterpret_cast`. (We don't use C-style casting.)
|
139 |
|
140 | So this describes the idioms we use. There are some hacks in
|
141 | [mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
|
142 | runtime equivalents in `mycpp/mylib.py`.
|
143 |
|
144 | ### `with {,tag,str_}switch` → Switch statement
|
145 |
|
146 | We have three constructs that translate to a C++ switch statement. They use a
|
147 | Python context manager `with Xswitch(obj) ...` as a little hack.
|
148 |
|
149 | Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
|
150 | (`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)
|
151 |
|
152 | Simple switch:
|
153 |
|
154 | myint = 99
|
155 | with switch(myint) as case:
|
156 |     if case(42, 43):
|
157 |         print('forties')
|
158 |     else:
|
159 |         print('other')
|
160 |
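On the Python side, such a context manager needs very little code. The class below is a minimal sketch of how a `switch` object could behave under the interpreter. It is an assumption about the shape of the runtime equivalent, not a copy of what is in `mycpp/mylib.py`, and the function `Classify` is a hypothetical caller.

```python
class switch(object):
    """Sketch of a context manager whose `case` callable tests membership."""

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        return self  # bound to `case` by `with switch(x) as case`

    def __exit__(self, exc_type, exc_value, traceback):
        return False  # never swallow exceptions

    def __call__(self, *cases):
        return self.value in cases


def Classify(myint):
    with switch(myint) as case:
        if case(42, 43):
            return 'forties'
        else:
            return 'other'


print(Classify(42))
print(Classify(99))
```

The `with` block does no real work in Python. It exists so the translator can recognize the pattern and emit a C++ `switch`.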
|
161 | Switch on **object type**, which goes well with ASDL sum types:
|
162 |
|
163 | val = value.Str('foo') # type: value_t
|
164 | with tagswitch(val) as case:
|
165 |     if case(value_e.Str, value_e.Int):
|
166 |         print('string or int')
|
167 |     else:
|
168 |         print('other')
|
169 |
|
170 | We usually need to apply the `UP_val` pattern here, described in the next
|
171 | section.
|
172 |
|
173 | Switch on **string**, which generates a fast **two-level dispatch** -- first on
|
174 | length, and then with `str_equals_c()`:
|
175 |
|
176 | s = 'foo'
|
177 | with str_switch(s) as case:
|
178 |     if case("foo"):
|
179 |         print('FOO')
|
180 |     else:
|
181 |         print('other')
|
182 |
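To show the idea behind the two-level dispatch, here is the same control flow written by hand in Python. This only illustrates the shape of the generated code: the function `Classify` and the extra cases `'bar'` and `'hello'` are made up for the example, and the real output is a C++ `switch` on the length followed by `str_equals_c()` calls.

```python
def Classify(s):
    # type: (str) -> str
    n = len(s)
    if n == 3:  # level 1: dispatch on length
        if s == 'foo':  # level 2: compare only candidates of this length
            return 'FOO'
        if s == 'bar':
            return 'BAR'
    elif n == 5:
        if s == 'hello':
            return 'HELLO'
    return 'other'


print(Classify('foo'))
print(Classify('food'))
```

A string whose length matches no case is rejected without any byte comparisons.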
|
183 | ### `val` → `UP_val` → `val` Downcasting pattern
|
184 |
|
185 | Summary: variable names like `UP_*` are **special** in our Python code.
|
186 |
|
187 | Consider the downcasts marked BAD:
|
188 |
|
189 | val = value.Str('foo') # type: value_t
|
190 |
|
191 | with tagswitch(val) as case:
|
192 |     if case(value_e.Str):
|
193 |         val = cast(value.Str, val) # BAD: conflicts with first declaration
|
194 |         print('s = %s' % val.s)
|
195 |
|
196 |     elif case(value_e.Int):
|
197 |         val = cast(value.Int, val) # BAD: conflicts with both
|
198 |         print('i = %d' % val.i)
|
199 |
|
200 |     else:
|
201 |         print('other')
|
202 |
|
203 | MyPy allows this, but it translates to invalid C++ code. C++ can't have a
|
204 | variable named `val`, with 2 related types `value_t` and `value::Str`.
|
205 |
|
206 | So we use this idiom instead, which takes advantage of **local vars in case
|
207 | blocks** in C++:
|
208 |
|
209 | val = value.Str('foo') # type: value_t
|
210 |
|
211 | UP_val = val # temporary variable that will be cast
|
212 |
|
213 | with tagswitch(val) as case:
|
214 |     if case(value_e.Str):
|
215 |         val = cast(value.Str, UP_val) # this works
|
216 |         print('s = %s' % val.s)
|
217 |
|
218 |     elif case(value_e.Int):
|
219 |         val = cast(value.Int, UP_val) # also works
|
220 |         print('i = %d' % val.i)
|
221 |
|
222 |     else:
|
223 |         print('other')
|
224 |
|
225 | This translates to something like:
|
226 |
|
227 | value_t* val = Alloc<value::Str>(str42);
|
228 | value_t* UP_val = val;
|
229 |
|
230 | switch (val->tag()) {
|
231 | case value_e::Str: {
|
232 | // DIFFERENT local var
|
233 | value::Str* val = static_cast<value::Str*>(UP_val);
|
234 | print(StrFormat(str43, val->s));
|
235 | }
|
236 | break;
|
237 | case value_e::Int: {
|
238 | // ANOTHER DIFFERENT local var
|
239 | value::Int* val = static_cast<value::Int*>(UP_val);
|
240 | print(StrFormat(str44, val->i));
|
241 | }
|
242 | break;
|
243 | default:
|
244 | print(str45);
|
245 | }
|
246 |
|
247 | This works because there's no problem having **different** variables with the
|
248 | same name within each `case { }` block.
|
249 |
|
250 | Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
|
251 | the inner blocks will look like:
|
252 |
|
253 | case value_e::Str: {
|
254 | val = static_cast<value::Str*>(val); // BAD: val reused
|
255 | print(StrFormat(str43, val->s));
|
256 | }
|
257 |
|
258 | And they will fail to compile. It's not valid C++ because the superclass
|
259 | `value_t` doesn't have a field `val->s`. Only the subclass `value::Str` has
|
260 | it.
|
261 |
|
262 | (Note that Python has a single flat scope per function, while C++ has nested
|
263 | scopes.)
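Under the Python interpreter the idiom is harmless, because `typing.cast()` returns its argument unchanged and rebinding `val` in each branch is legal in a flat function scope. The following self-contained sketch demonstrates that. The classes `value_t`, `Str`, and `Int` are hypothetical stand-ins for the ASDL-generated types, and a plain `if` on `tag()` replaces `tagswitch` to keep the example short.

```python
from typing import cast


class value_t(object):
    """Hypothetical stand-in for an ASDL-generated base class."""

    def tag(self):
        raise NotImplementedError()


class Str(value_t):
    def __init__(self, s):
        self.s = s

    def tag(self):
        return 1


class Int(value_t):
    def __init__(self, i):
        self.i = i

    def tag(self):
        return 2


def Describe(val):
    # type: (value_t) -> str
    UP_val = val  # keeps the base-class view of the object
    if val.tag() == 1:
        val = cast(Str, UP_val)  # no-op at runtime; narrows the static type
        return 's = %s' % val.s
    elif val.tag() == 2:
        val = cast(Int, UP_val)
        return 'i = %d' % val.i
    else:
        return 'other'


print(Describe(Str('foo')))
print(Describe(Int(42)))
```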
|
264 |
|
265 | ### Python context manager → C++ constructor and destructor (RAII)
|
266 |
|
267 | This Python code:
|
268 |
|
269 | with ctx_Foo(42):
|
270 |     f()
|
271 |
|
272 | translates to this C++ code:
|
273 |
|
274 | {
|
275 | ctx_Foo tmp(42);
|
276 | f();
|
277 |
|
278 | // destructor ~ctx_Foo implicitly called
|
279 | }
|
280 |
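For this mapping to work, the Python class has to be written so that `__init__` plays the role of the constructor and `__exit__` plays the role of the destructor. Here is a minimal sketch of that shape. The class `ctx_Push` and its stack argument are hypothetical and only illustrate the convention.

```python
class ctx_Push(object):
    """Pushes a value on entry and pops it on exit, like an RAII guard."""

    def __init__(self, stack, item):
        stack.append(item)  # acquire: corresponds to the C++ constructor
        self.stack = stack

    def __enter__(self):
        pass  # nothing to do; the work happened in __init__

    def __exit__(self, exc_type, exc_value, traceback):
        self.stack.pop()  # release: corresponds to the C++ destructor


stack = []
with ctx_Push(stack, 42):
    inside = list(stack)  # snapshot while the guard is active
print(inside, stack)
```

The pop runs even if the body raises, which mirrors how a C++ destructor runs during stack unwinding.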
|
281 | ## MyPy "Shimming" Technique
|
282 |
|
283 | We have an interesting way of "writing Python and C++ at the same time":
|
284 |
|
285 | 1. First, all Python code must pass the MyPy type checker, and run with a stock
|
286 | Python 2 interpreter.
|
287 | - This is the source of truth — the source of our semantics.
|
288 | 1. We translate most `.py` files to C++, **except** some files, in particular
|
289 | [mycpp/mylib.py]($oils-src) and files starting with `py` like
|
290 | `core/{pyos,pyutil}.py`.
|
291 | 1. In C++, we can substitute custom implementations with the properties we
|
292 | want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
|
293 | `BufWriter` being efficient, etc.
|
294 |
|
295 | The MyPy type system is very powerful! It lets us do all this.
|
296 |
|
297 | ### NewDict() for ordered dicts
|
298 |
|
299 | Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
|
300 | using `mylib.NewDict()`, which returns `collections_.OrderedDict`.
|
301 |
|
302 | The **static type** is still `Dict[K, V]`, but we change the "spec" to be an
|
303 | ordered dict.
|
304 |
|
305 | In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
|
306 | implement preserving order on deletion, which seems OK.)
|
307 |
|
308 | - TODO: `iteritems()` could go away
|
309 |
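A shim of this kind can be tiny. The sketch below uses the standard library's `collections.OrderedDict` and should be read as an assumption about the general shape, not as the actual contents of `mylib.py`.

```python
from collections import OrderedDict


def NewDict():
    """Return a dict that preserves insertion order under any Python version."""
    return OrderedDict()


d = NewDict()
d['z'] = 1
d['a'] = 2
d['m'] = 3
print(list(d.keys()))
```

Callers still annotate the result as `Dict[K, V]`, so the type checker and the translator see an ordinary dict.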
|
310 | ### StackArray[T]
|
311 |
|
312 | TODO: describe this when it works.
|
313 |
|
314 | ### BigInt
|
315 |
|
316 | - In Python, it's simply defined as a class with an integer, in
|
317 | [mylib/mops.py]($oils-src).
|
318 | - In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
|
319 | integer.
|
320 |
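The point of wrapping the integer in a class is that MyPy then treats it as a distinct static type, so a `BigInt` cannot be mixed with a plain `int` by accident. A minimal sketch of the Python side follows. The helper names `Add` and `ToStr` are illustrative assumptions and may not match the real module.

```python
class BigInt(object):
    """Wrapper that gives big integers their own static type."""

    def __init__(self, i):
        self.i = i  # a Python integer, which has arbitrary size


def Add(a, b):
    # type: (BigInt, BigInt) -> BigInt
    return BigInt(a.i + b.i)


def ToStr(b):
    # type: (BigInt) -> str
    return str(b.i)


total = Add(BigInt(2 ** 62), BigInt(2 ** 62))
print(ToStr(total))
```

This sum is `2 ** 63`, which Python handles exactly but which is outside the range of a signed 64-bit integer, so it is the kind of value that a true big integer in C++ would need to support.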
|
321 | ### ByteAt(), ByteEquals(), ...
|
322 |
|
323 | Hand optimizations that avoid creating 1-byte strings. They are used in the IFS
|
324 | algorithm, `LooksLikeGlob()`, and `GlobUnescape()`.
|
325 |
|
326 | ### File / LineReader / BufWriter
|
327 |
|
328 | TODO: describe how this works.
|
329 |
|
330 | Can it be more type safe? I think we can cast `File` to both `LineReader` and
|
331 | `BufWriter`.
|
332 |
|
333 | Or can we invert the relationship, so `File` derives from **both** LineReader
|
334 | and BufWriter?
|
335 |
|
336 | ### Fast JSON - avoid intermediate allocations
|
337 |
|
338 | - `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
|
339 | only to throw them away and write to `mylib.BufWriter`. Instead, we append
|
340 | the encoded string **directly** to the `BufWriter`.
|
341 | - Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
|
342 | when writing indents.
|
343 | - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
|
344 | - We may also want `BufWriter::write_slice()`
|
345 |
|
346 | ## Limitations Requiring Source Rewrites
|
347 |
|
348 | mycpp itself may cause limitations on expressiveness, or the C++ language may
|
349 | not be able to express what we want.
|
350 |
|
351 | - C++ doesn't have `try / except / else`, or `finally`
|
352 | - Use the `with ctx_Foo` pattern instead.
|
353 | - `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
|
354 | non-empty test
|
355 | - Functions can have at most one keyword / optional argument.
|
356 | - We generate two methods: `f(x)` which calls `f(x, y)` with the default
|
357 | value of `y`
|
358 | - If there are two or more optional arguments:
|
359 | - For classes, you can use the "builder pattern", i.e. add an
|
360 | `Init_MyMember()` method
|
361 | - If the arguments are booleans, translate it to a single bitfield argument
|
362 | - C++ has nested scope and Python has flat function scope. This can cause name
|
363 | collisions.
|
364 | - Could enforce this if it becomes a problem
|
365 |
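Two of the rewrites above can be sketched in a few lines. The first replaces two optional boolean arguments with a single bitfield argument, and the second uses `len()` for the non-empty test. The names `FLAG_VERBOSE`, `FLAG_DRY_RUN`, `Run`, and `Describe` are hypothetical.

```python
# Hypothetical flag constants replacing two optional boolean arguments
FLAG_VERBOSE = 1 << 0
FLAG_DRY_RUN = 1 << 1


def Run(cmd, flags=0):
    # type: (str, int) -> str
    """Only one optional argument, so this fits the limit described above."""
    parts = [cmd]
    if flags & FLAG_VERBOSE:
        parts.append('verbose')
    if flags & FLAG_DRY_RUN:
        parts.append('dry-run')
    return ' '.join(parts)


def Describe(mylist):
    # type: (List[str]) -> str
    if len(mylist):  # explicit non-empty test, same meaning in Python and C++
        return 'non-empty'
    return 'empty'


print(Run('ls', FLAG_VERBOSE | FLAG_DRY_RUN))
print(Describe([]))
```

Writing `if mylist` would give the expected answer in Python, but the translated C++ would test only whether the pointer is non-NULL, so an empty list would count as true.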
|
366 | Also see `mycpp/examples/invalid_*` for Python code that fails to translate.
|
367 |
|
368 | ## WARNING: Assumptions Not Checked
|
369 |
|
370 | ### Global Constants Can't Be Mutated
|
371 |
|
372 | We translate top level constants to statically initialized C data structures
|
373 | (zero startup cost):
|
374 |
|
375 | gStr = 'foo'
|
376 | gList = [1, 2] # type: List[int]
|
377 | gDict = {'bar': 42} # type: Dict[str, int]
|
378 |
|
379 | Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
|
380 | these global instances! The C++ code will break at runtime.
|
381 |
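If a function needs a modified version of a global collection, one safe approach is to build a fresh local collection and leave the global untouched. The sketch below shows the pattern in plain Python. The function name `WithExtra` is hypothetical, and the explicit loop is used only because it relies on nothing beyond `append()`.

```python
gList = [1, 2]  # type: List[int]


def WithExtra(x):
    # type: (int) -> List[int]
    result = []  # type: List[int]
    for item in gList:  # read the global, never write to it
        result.append(item)
    result.append(x)  # mutate only the local copy
    return result


print(WithExtra(3))
print(gList)
```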
|
382 | ### Gotcha about Returning Variants (Subclasses) of a Type
|
383 |
|
384 | MyPy will accept this code:
|
385 |
|
386 | ```
|
387 | if cond:
|
388 |     sig = proc_sig.Open # type: proc_sig_t
|
389 |     # bad because mycpp HOISTS this
|
390 | else:
|
391 |     sig = proc_sig.Closed.CreateNull()
|
392 |     sig.words = words # assignment fails: hoisted `sig` has the base type
|
393 | return sig
|
394 | ```
|
395 |
|
396 | It will translate to C++, but fail to compile. Instead, rewrite it like this:
|
397 |
|
398 | ```
|
399 | sig = None # type: proc_sig_t
|
400 | if cond:
|
401 |     sig = proc_sig.Open
|
402 |     # OK: `sig` is declared once above, with the base type
|
403 | else:
|
404 |     closed = proc_sig.Closed.CreateNull()
|
405 |     closed.words = words # OK: `closed` has the subclass type
|
406 |     sig = closed
|
407 | return sig
|
408 | ```
|
409 |
|
410 | ### Exceptions Can't Leave Destructors / Python `__exit__`
|
411 |
|
412 | Context managers like `with ctx_Foo():` translate to C++ constructors and
|
413 | destructors.
|
414 |
|
415 | In C++, a destructor can't "leave" an exception. It results in a runtime error.
|
416 |
|
417 | You can throw and CATCH an exception WITHIN a destructor, but you can't let it
|
418 | propagate outside.
|
419 |
|
420 | This means you must be careful when coding the `__exit__` method. For example,
|
421 | in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
|
422 | caught when restoring/popping redirects.
|
423 |
|
424 | To fix the bug, we rewrote the code to use an out param
|
425 | `List[IOError_OSError]`.
|
426 |
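The out-param approach can be sketched as follows: `__exit__` catches the error itself and records it in a list supplied by the caller, who checks the list after the block. The classes `ctx_Restore` and `FlakyResource` are hypothetical and only illustrate the pattern. They are not the code in `vm::ctx_Redirect`.

```python
class FlakyResource(object):
    """Hypothetical resource whose cleanup step fails."""

    def restore(self):
        raise IOError('restore failed')


class ctx_Restore(object):
    def __init__(self, resource, err_out):
        self.resource = resource
        self.err_out = err_out  # out param: errors are appended here

    def __enter__(self):
        pass

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            self.resource.restore()
        except (IOError, OSError) as e:
            self.err_out.append(e)  # record it; do NOT let it propagate


err_out = []
with ctx_Restore(FlakyResource(), err_out):
    pass

if len(err_out):
    print('error during cleanup: %s' % err_out[0])
```

Because nothing escapes `__exit__`, the translated destructor never lets an exception leave it.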
|
427 | Related:
|
428 |
|
429 | - <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>
|
430 |
|
431 | ## More Translation Notes
|
432 |
|
433 | ### Hacky Heuristics
|
434 |
|
435 | - `callable(arg)` to either:
|
436 | - function call `f(arg)`
|
437 | - instantiation `Alloc<T>(arg)`
|
438 | - `name.attr` to either:
|
439 | - `obj->member`
|
440 | - `module::Func`
|
441 | - `cast(MyType, obj)` to either
|
442 | - `static_cast<MyType*>(obj)`
|
443 | - `reinterpret_cast<MyType*>(obj)`
|
444 |
|
445 | ### Hacky Hard-Coded Names
|
446 |
|
447 | These are signs of coupling between mycpp and Oils, which ideally shouldn't
|
448 | exist.
|
449 |
|
450 | - `mycpp_main.py`
|
451 | - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
|
452 | runtime.
|
453 | - TODO: Pea can respect parameter order? So we do that outside the project?
|
454 | - Another ordering constraint comes from **inheritance**. The forward
|
455 | declaration is NOT sufficient in that case.
|
456 | - `cppgen_pass.py`
|
457 | - `_GetCastKind()` has some hard-coded names
|
458 | - `AsdlType::Create()` is special cased to `::`, not `->`
|
459 | - Default arguments e.g. `scope_e::Local` need a repeated `using`.
|
460 |
|
461 | Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>
|
462 |
|
463 | ### Major Features
|
464 |
|
465 | - Python `int` and `bool` → C++ `int` and `bool`
|
466 | - `None` → `nullptr`
|
467 | - Statically Typed Python Collections
|
468 | - `str` → `Str*`
|
469 | - `List[T]` → `List<T>*`
|
470 | - `Dict[K, V]` → `Dict<K, V>*`
|
471 | - tuples → `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
|
472 | - Collection literals turn into initializer lists
|
473 | - And there is a C++ type inference issue which requires an explicit
|
474 | `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
|
475 | - Python's polymorphic iteration → `StrIter`, `ListIter<T>`, `DictIter<K,
|
476 | V>`
|
477 | - `d.iteritems()` is rewritten to `mylib.iteritems()` → `DictIter`
|
478 | - TODO: can we be smarter about this?
|
479 | - `reversed(mylist)` → `ReverseListIter`
|
480 | - Python's `in` operator:
|
481 | - `s in mystr` → `str_contains(mystr, s)`
|
482 | - `x in mylist` → `list_contains(mylist, x)`
|
483 | - Classes and inheritance
|
484 | - `__init__` method becomes a constructor. Note: initializer lists aren't
|
485 | used.
|
486 | - Detect `virtual` methods
|
487 | - TODO: could we detect `abstract` methods? (`NotImplementedError`)
|
488 | - Python generators `Iterator[T]` → eager `List<T>` accumulators
|
489 | - Python Exceptions → C++ exceptions
|
490 | - Python Modules → C++ namespace (we assume a 2-level hierarchy)
|
491 | - TODO: mycpp needs real modules, because our `oils_for_unix.mycpp.cc`
|
492 | translation unit is getting big.
|
493 | - And `cpp/preamble.h` is a hack to work around the lack of modules.
|
494 |
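To make the generator item concrete, the sketch below shows a lazy Python generator next to the eager, list-accumulating form that the translation corresponds to. Both functions are hypothetical examples, and the second is written by hand to show the semantics, not produced by mycpp.

```python
def EvensLazy(n):
    # type: (int) -> Iterator[int]
    for i in range(n):
        if i % 2 == 0:
            yield i  # produced one at a time, on demand


def EvensEager(n):
    # type: (int) -> List[int]
    acc = []  # type: List[int]
    for i in range(n):
        if i % 2 == 0:
            acc.append(i)  # every `yield` becomes an append
    return acc  # the whole list exists before the caller sees any item


print(list(EvensLazy(6)))
print(EvensEager(6))
```

The two forms give the same items, but the eager one does all the work up front, so code that relies on stopping a generator early behaves differently after translation.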
|
495 | ### Minor Translations
|
496 |
|
497 | - `s1 == s2` → `str_equals(s1, s2)`
|
498 | - `'x' * 3` → `str_repeat(globalStr, 3)`
|
499 | - `[None] * 3` → `list_repeat(nullptr, 3)`
|
500 | - Omitted:
|
501 | - If the LHS of an assignment is `_`, then the statement is omitted
|
502 | - This is for `_ = log`, which shuts up Python lint warnings for 'unused
|
503 | import'
|
504 | - Code under `if __name__ == '__main__'`
|
505 |
|
506 | ### Optimizations
|
507 |
|
508 | - Returning Tuples by value. To reduce GC pressure, we return
|
509 | `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and `Tuple4`.
|
510 |
|
511 | ### Rooting Policy
|
512 |
|
513 | The translated code roots local variables in every function:
|
514 |
|
515 | StackRoots _r({&var1, &var2});
|
516 |
|
517 | We have two kinds of hand-written code:
|
518 |
|
519 | 1. Methods like `Str::strip()` in `mycpp/`
|
520 | 2. OS bindings like `stat()` in `cpp/`
|
521 |
|
522 | Neither of them needs any rooting! This is because we use **manual collection
|
523 | points** in the interpreter, and these functions don't call any functions that
|
524 | can collect. They are "leaves" in the call tree.
|
525 |
|
526 | ## The mycpp Runtime
|
527 |
|
528 | The mycpp translator targets a runtime that's written from scratch. It
|
529 | implements garbage-collected data structures like:
|
530 |
|
531 | - Typed records
|
532 | - Python classes
|
533 | - ASDL product and sum types
|
534 | - `Str` (immutable, as in Python)
|
535 | - `List<T>`
|
536 | - `Dict<K, V>`
|
537 | - `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...
|
538 |
|
539 | It also has functions based on CPython's:
|
540 |
|
541 | - `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
|
542 | module, e.g. `int()` and `str()`
|
543 | - `mycpp/gc_mylib.{h,cc}` corresponds to `mylib.py`
|
544 | - `mylib.BufWriter` is a bit like `cStringIO.StringIO`
|
545 |
|
546 | ### Differences from CPython
|
547 |
|
548 | - Integers are either C `int` or `mylib.BigInt`, not Python's arbitrary size
|
549 | integers
|
550 | - `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
|
551 | CPython
|
552 | - `s.strip()` is defined in terms of ASCII whitespace, which does not include
|
553 | say `\v`.
|
554 | - This is done to be consistent with JSON and J8 Notation.
|
555 |
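The `strip()` difference can be observed from Python. The sketch below assumes that "ASCII whitespace" means exactly space, tab, carriage return, and newline, which is the JSON definition. That character set and the helper name `AsciiStrip` are assumptions for illustration.

```python
# Assumption: the whitespace set matches JSON's (space, \t, \r, \n)
ASCII_WHITESPACE = ' \t\r\n'


def AsciiStrip(s):
    # type: (str) -> str
    """Strip only the assumed ASCII whitespace set from both ends."""
    return s.strip(ASCII_WHITESPACE)


s = '\v foo \n'
print(repr(s.strip()))  # CPython also removes the vertical tab
print(repr(AsciiStrip(s)))  # the vertical tab survives
```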
|
556 | ## C++ Notes
|
557 |
|
558 | ### Gotchas
|
559 |
|
560 | - C++ classes can have 2 member variables of the same name! From the base
|
561 | class and derived class.
|
562 | - Failing to declare methods `virtual` can result in the wrong one being called
|
563 | at runtime
|
564 |
|
565 | ### Minor Features Used
|
566 |
|
567 | In addition to classes, templates, exceptions, etc. mentioned above, we use:
|
568 |
|
569 | - `static_cast` and `reinterpret_cast`
|
570 | - `enum class` for ASDL
|
571 | - Function overloading
|
572 | - For equality and hashing?
|
573 | - `offsetof` for introspection of field positions for garbage collection
|
574 | - `std::initializer_list` for `StackRoots()`
|
575 | - Should we get rid of this?
|
576 |
|
577 | ### Not Used
|
578 |
|
579 | - I/O Streams, RTTI, etc.
|
580 | - `const`
|
581 | - Smart pointers
|
582 |
|