| 1 | --- | 
| 2 | in_progress: yes | 
| 3 | default_highlighter: oils-sh | 
| 4 | --- | 
| 5 |  | 
| 6 | Streams, Tables and Processes - awk, R, xargs | 
| 7 | ============================================= | 
| 8 |  | 
| 9 | *(July 2024)* | 
| 10 |  | 
| 11 | This is a long, "unified/orthogonal" design for: |
| 12 |  | 
| 13 | - Streams: [awk]($xref) delimited lines, regexes | 
| 14 | - Tables: like data frames with R's dplyr or Pandas, but with the "exterior" | 
| 15 | TSV8 format | 
| 16 | - Processes: xargs -P in parallel | 
| 17 |  | 
| 18 | There's also a relation to: | 
| 19 |  | 
| 20 | - Trees: `jq`, which will be covered elsewhere. | 
| 21 |  | 
| 22 | It's a layered design.  That means we need some underlying mechanisms: | 
| 23 |  | 
| 24 | - `eval` and positional args `$1 $2 $3` | 
| 25 | - `ctx` builtin | 
| 26 | - Data languages: TSV8 |
| 27 | - Process pool / event loop primitive | 
| 28 |  | 
| 29 | It will link to: | 
| 30 |  | 
| 31 | - Oils blog posts (old) | 
| 32 | - Zulip threads (recent) | 
| 33 | - Other related projects (many of them) | 
| 34 |  | 
| 35 | <div id="toc"> | 
| 36 | </div> | 
| 37 |  | 
| 38 | ## Background / References | 
| 39 |  | 
| 40 | - Shell, Awk, and Make Should be Combined (2016) | 
| 41 | - this is the Awk part! | 
| 42 |  | 
| 43 | - What is a Data Frame?  (2018) | 
| 44 |  | 
| 45 | - Sketches of YSH Features (June 2023) - can we express things in YSH? | 
| 46 | - Zulip: Oils Layering / Self-hosting | 
| 47 |  | 
| 48 | - Language Compositionality Test: J8 Lines | 
| 49 | - This whole thing is a compositionality test | 
| 50 |  | 
| 51 | - read --split | 
| 52 | - more feedback from Aidan and Samuel | 
| 53 |  | 
| 56 | - jq in jq thread | 
| 57 |  | 
| 58 | Old wiki pages: | 
| 59 |  | 
| 60 | - <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil> | 
| 61 | - uxy - closest I think - <https://github.com/sustrik/uxy> | 
| 62 | - relies on to-json and jq for querying | 
| 63 | - miller - I don't like their language - <https://github.com/johnkerl/miller> |
| 64 | - jc - <https://github.com/kellyjonbrazil/jc> | 
| 65 | - nushell | 
| 66 | - extremely old thing - | 
| 67 |  | 
| 68 | We're doing **all of these**. | 
| 69 |  | 
| 70 | ## Concrete Use Cases | 
| 71 |  | 
| 72 | - benchmarks/* with dplyr | 
| 73 | - wedge report | 
| 74 | - oilshell.org analytics job uses dplyr and ggplot2 | 
| 75 |  | 
| 76 | ## Intro | 
| 77 |  | 
| 78 | ### How much code is it? | 
| 79 |  | 
| 80 | - I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests) | 
| 81 | - It should be small | 
| 82 |  | 
| 83 | ### Thanks | 
| 84 |  | 
| 85 | - Samuel - two big hints | 
| 86 | - do it in YSH | 
| 87 | - `table` with the `ctx` builtin | 
| 88 | - Aidan | 
| 89 | - `read --split` feedback | 
| 90 |  | 
| 91 |  | 
| 92 | ## Tools | 
| 93 |  | 
| 94 | - awk | 
| 95 | - streams of records - row-wise | 
| 96 | - R | 
| 97 | - column-wise operations on tables | 
| 98 | - `find . -printf '%s %P\n'`  - size and path | 
| 99 | - generate text that looks like a table | 
| 100 | - xargs | 
| 101 | - operate on tabular text -- it has a bespoke splitting algorithm | 
| 102 | - Opinionated guide to xargs | 
| 103 | - table in, table out | 
| 104 | - jq - "awk for JSON" | 
| 105 |  | 
| 106 |  | 
| 107 | ## Concepts | 
| 108 |  | 
| 109 | - TSV8 | 
| 110 | - aligned format SSV8 | 
| 111 | - columns have types, and attributes | 
| 112 | - Lines | 
| 113 | - raw lines like shell | 
| 114 | - J8 lines (which can represent any filename, any unicode or byte string) | 
| 115 | - Tables - can be thought of as: | 
| 116 | - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]` | 
| 117 | - this is actually <https://jsonlines.org> , and it fits well with `jq` | 
| 118 | - Columns - shape `{bytes: [], path: []}` |
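
The two table shapes above hold the same data, so converting between them is mechanical. Here is a minimal Python sketch, used only to illustrate the shapes. The function names `rows_to_columns` and `columns_to_rows` are hypothetical and not part of Oils, and every row is assumed to have the same keys.

```python
def rows_to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    if not rows:
        return {}
    # Assumption: the first row defines the column names for all rows.
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name in columns:
            columns[name].append(row[name])
    return columns


def columns_to_rows(columns):
    """Convert a dict of equal-length column lists into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]
```

The row shape suits streaming (one record at a time), while the column shape suits whole-column operations like the ones dplyr performs.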
| 119 |  | 
| 120 | ## Underlying Mechanisms in Oils / Primitives | 
| 121 |  | 
| 122 | - blocks `value.Block` - `^()` and `{ }` | 
| 123 | - expressions `value.Expr` - `^[]` and 'compute [] where []' | 
| 124 |  | 
| 125 | - eval (b, vars={}, positional=[]) | 
| 126 |  | 
| 127 | - Buffered for loop | 
| 128 | - YSH is now roughly as fast as Awk! | 
| 129 | - `for x in <>` | 
| 130 |  | 
| 131 | - "magic awk loop" | 
| 132 |  | 
| 133 | with chop { | 
| 134 | for <README.md *.py> { | 
| 135 | echo $_line_num $_line $_filename $1 $2 |
| 136 | } | 
| 137 | } | 
| 138 |  | 
| 139 | - positional args $1 $2 $3 | 
| 140 | - currently mean "argv stack" | 
| 141 | - or "the captures" | 
| 142 | - this can probably be generalized | 
| 143 |  | 
| 144 | - `ctx` builtin | 
| 145 | - `value.Place` | 
| 146 |  | 
| 147 | TODO: | 
| 148 |  | 
| 149 | - split() like Python, not like shell IFS algorithm | 
| 150 |  | 
| 151 | - string formatting ${bytes %.2f} | 
| 152 | - ${bytes %.2f M} Megabytes | 
| 153 | - ${bytes %.2f Mi} Mebibytes | 
| 154 |  | 
| 155 | - ${timestamp +'%Y-%m-%d'} and strftime |
| 156 |  | 
| 157 | - this is for | 
| 158 |  | 
| 159 | - floating point %e %f %g and printf and strftime | 
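
The difference between the `M` and `Mi` suffixes in the proposed `${bytes %.2f M}` syntax is only the divisor: 10^6 for megabytes and 2^20 for mebibytes. Below is a small Python sketch of that behavior. The `format_bytes` helper and the `UNITS` table are hypothetical names for illustration, not a proposed API.

```python
UNITS = {
    "": 1,
    "M": 10**6,   # megabyte (decimal)
    "Mi": 2**20,  # mebibyte (binary)
}


def format_bytes(num_bytes, unit=""):
    """Scale a byte count by the unit, then format it like printf %.2f."""
    return "%.2f" % (num_bytes / UNITS[unit])


print(format_bytes(1048576, "M"))   # 1.05
print(format_bytes(1048576, "Mi"))  # 1.00
```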
| 160 |  | 
| 161 | ### Process Pool or Event Loop Primitive? | 
| 162 |  | 
| 163 | - if you want to display progress, then you might need an event loop | 
| 164 | - test framework might display progress | 
| 165 |  | 
| 166 | ## Matrices - Orthogonal design in these dimensions | 
| 167 |  | 
| 168 | - input: lines vs. rows | 
| 169 | - output: string (Str, Template) vs. row vs. block execution (also a row) | 
| 170 | - execution: serial vs. parallel | 
| 171 | - representation: interior vs. exterior !!! | 
| 172 | - Dicts and Lists are interior, but TSV8 is exterior | 
| 173 | - and we have row-wise format, and column-wise format -- this always bugged me | 
| 174 | - exterior: human vs. machine readable | 
| 175 | - TSV8 is both human and machine-readable | 
| 176 | - the "aligned" .ssv8 format is also both |
| 177 | - they are one format named TSV8, with different file extensions.  This is | 
| 178 | because it doesn't make too much sense to implement SSV8 without TSV8.  The | 
| 179 | latter becomes trivial.  So we call the whole thing TSV8. | 
| 180 |  | 
| 181 | This means we consider all these conversions: |
| 182 |  | 
| 183 | - Line -> Line | 
| 184 | - Line -> Row | 
| 185 | - Row -> Line | 
| 186 | - Row -> Row | 
| 187 |  | 
| 188 | ## Concrete Decisions - Matrix cut off | 
| 189 |  | 
| 190 | The design might seem very general, but we did make some hard choices. |
| 191 |  | 
| 192 | - push vs. pull | 
| 193 | - everything is "push" style I think | 
| 194 | - buffered vs. unbuffered, everything | 
| 195 |  | 
| 196 | - List vs. iterators |
| 197 | - everything is either iterable pipelines, or a List | 
| 198 |  | 
| 199 |  | 
| 200 | [OSH]: $xref | 
| 201 | [YSH]: $xref | 
| 202 |  | 
| 203 |  | 
| 204 | ## String World | 
| 205 |  | 
| 206 | **THESE ARE ALL THE SAME ALGORITHM**.  They just have different names. | 
| 207 |  | 
| 208 | - each-line | 
| 209 | - each-row | 
| 210 | - split-by (/d+/, cols=:|a b c|) | 
| 211 | - chop | 
| 212 | - if-match | 
| 213 | - must-match | 
| 214 | - todo | 
| 215 |  | 
| 216 | Should we also have `if-split-by`, in case there aren't enough columns? |
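
To make the Line -> Row idea concrete, here is a rough Python model of what `split-by` and a possible `if-split-by` might do. It sketches the semantics only and is not the YSH implementation. Treating the regex as a field delimiter, raising `ValueError` on a column-count mismatch, and the Python function names are all assumptions.

```python
import re


def split_by(line, pattern, cols):
    """Split a line on a delimiter regex and name the resulting fields.

    Raises ValueError when the number of fields doesn't match cols.
    """
    fields = re.split(pattern, line.rstrip("\n"))
    if len(fields) != len(cols):
        raise ValueError(f"expected {len(cols)} fields, got {len(fields)}")
    return dict(zip(cols, fields))


def if_split_by(line, pattern, cols):
    """Like split_by, but return None instead of raising on a mismatch."""
    try:
        return split_by(line, pattern, cols)
    except ValueError:
        return None
```

With this split, the strict variant plays the role of `must-match` and the lenient one the role of `if-match`, which is why they can share one algorithm.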
| 217 |  | 
| 218 | They all take: | 
| 219 |  | 
| 220 | - string arg ' ' | 
| 221 | - template arg (^"") - `value.Expr` | 
| 222 | - block arg | 
| 223 |  | 
| 224 | For the block arg, this applies: |
| 225 |  | 
| 226 | -j 4 | 
| 227 | --max-jobs 4 | 
| 228 |  | 
| 229 | --max-jobs $(cached-nproc) | 
| 230 | --max-jobs $[_nproc - 1] | 
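
The `--max-jobs` flag bounds how many block invocations run at once, in the spirit of `xargs -P`. A rough Python analogy is a bounded worker pool. This sketch uses threads for brevity, whereas a shell would use processes, and the `each_line` function name and its signature are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def each_line(lines, block, max_jobs=1):
    """Call block(line) for every line, with at most max_jobs running at once.

    Results are returned in input order, even if the calls finish out of order.
    """
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(block, lines))
```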
| 231 |  | 
| 232 | ### Awk Issues | 
| 233 |  | 
| 234 | So we have this: |
| 235 |  | 
| 236 | echo begin | 
| 237 | var d = {} | 
| 238 | cat -- @files | split-by (ifs=IFS) { | 
| 239 | echo $2 $1 | 
| 240 | call d->accum($1, $2) | 
| 241 | } | 
| 242 | echo end | 
| 243 |  | 
| 244 | But then how do we express conditionals? |
| 245 |  | 
| 246 | Filter foo {  # does this define a proc?  Or a data structure | 
| 247 |  | 
| 248 | split-by (ifs=IFS)  # is this possible?  We register the proc itself? | 
| 249 |  | 
| 250 | config split-by (ifs=IFS)  # register it | 
| 251 |  | 
| 252 | BEGIN { | 
| 253 | var d = {sum: 0} |
| 254 | } |
| 255 | END { |
| 256 | echo $[d.sum] |
| 257 | } | 
| 258 |  | 
| 259 | when [$1 ~ /d+/] { | 
| 260 | setvar d.sum += $1 | 
| 261 | } | 
| 262 |  | 
| 263 | } | 
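
One way to read the question "does this define a proc, or a data structure?" is to treat the filter as data: a begin function, a list of (predicate, action) rules, and an end function, all run by a generic driver loop. The Python sketch below models that reading. It is an illustration under that assumption, not the YSH design, and all names in it are hypothetical. It also uses a full match on the first field, which is stricter than an Awk-style `~` search.

```python
import re


def run_filter(lines, begin, rules, end):
    """Run an Awk-style filter.

    begin: function returning the initial state
    rules: list of (predicate, action) pairs, each called with (state, fields)
    end:   function called with the final state; its result is returned
    """
    state = begin()
    for line in lines:
        fields = line.split()  # whitespace splitting, like Python's str.split()
        for predicate, action in rules:
            if predicate(state, fields):
                action(state, fields)
    return end(state)


def is_number(state, fields):
    """True when the first field consists only of digits."""
    return bool(fields) and re.fullmatch(r"\d+", fields[0]) is not None


def add_first(state, fields):
    """Accumulate the first field into state['sum']."""
    state["sum"] += int(fields[0])


total = run_filter(
    ["10 apples", "x pears", "32 plums"],
    begin=lambda: {"sum": 0},
    rules=[(is_number, add_first)],
    end=lambda state: state["sum"],
)
print(total)  # 42
```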
| 264 |  | 
| 265 | ## Table World | 
| 266 |  | 
| 267 | ### `table` to construct | 
| 268 |  | 
| 269 | Actions: | 
| 270 |  | 
| 271 | table cat | 
| 272 | table align / table tabify | 
| 273 | table header (cols) | 
| 274 | table slice (1, -1)   or (-1, -2) etc. | 
| 275 |  | 
| 276 | Subcommands | 
| 277 |  | 
| 278 | cols | 
| 279 | types | 
| 280 | attr units | 
| 281 |  | 
| 282 | Partial Parsing  / Lazy Parsing - TSV8 is designed for this | 
| 283 |  | 
| 284 | # we only decode the columns that are necessary | 
| 285 | cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|) | 
| 286 |  | 
| 287 | ## Will writing it in YSH be slow? | 
| 288 |  | 
| 289 | - We concentrate on semantics first | 
| 290 | - We can rewrite in Python | 
| 291 | - Better: users can use **exterior** tools with the same interface | 
| 292 | - in some cases | 
| 293 | - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms | 
| 294 | - Most data will be small at first | 
| 295 |  | 
| 296 |  | 
| 297 | ## Applications | 
| 298 |  | 
| 299 | - Shell is shared nothing | 
| 300 | - Scaling to infinity on the biggest clouds | 
| 301 |  | 
| 302 |  | 
| 303 | ## Extra: Tree World? | 
| 304 |  | 
| 305 | This is sort of "expanding the scope" of the project, when we want to reduce scope. | 
| 306 |  | 
| 307 | But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice **bridge** between them. | 
| 308 |  | 
| 309 | Streams of Trees (jq) | 
| 310 |  | 
| 311 | empty | 
| 312 | this | 
| 313 | this[] | 
| 314 | => | 
| 315 | select() | 
| 316 | a & b  # more than one | 
| 317 |  | 
| 318 |  | 
| 319 | ## Pie in the Sky | 
| 320 |  | 
| 321 | Four types of Data Languages: | 
| 322 |  | 
| 323 | - flat strings | 
| 324 | - JSON8 - tree | 
| 325 | - TSV8 - table | 
| 326 | - NIL8 - Lisp Tree | 
| 327 | - HTML/XML - doc tree -- attributed text (similar to Emacs data model) | 
| 328 | - 8ml | 
| 329 |  | 
| 330 | Four types of query languages: |
| 331 |  | 
| 332 | - regex | 
| 333 | - jq / jshape | 
| 334 | - tsv8 | 
| 335 |  | 
| 336 |  | 
| 337 | ## Appendix | 
| 338 |  | 
| 339 | ### Notes on Naming | 
| 340 |  | 
| 341 | Considering columns and then rows: | 
| 342 |  | 
| 343 | - SQL is "select ... where" | 
| 344 | - dplyr is "select ... filter" | 
| 345 | - YSH is "pick ... where" | 
| 346 | - select is a legacy shell keyword, and pick is shorter | 
| 347 | - or it could be elect in OSH, elect/select in YSH | 
| 348 | - OSH wouldn't support mutate [average = bytes/total] anyway | 
| 349 |  | 
| 350 | dplyr: | 
| 351 |  | 
| 352 | - summarise vs. summarize vs. summary | 
| 353 |  | 
| 354 |  | 
| 355 |  | 
| 356 |  | 
| 357 |  |