| 1 | ---
 | 
| 2 | in_progress: yes
 | 
| 3 | default_highlighter: oils-sh
 | 
| 4 | ---
 | 
| 5 | 
 | 
| 6 | Streams, Tables and Processes - awk, R, xargs
 | 
| 7 | =============================================
 | 
| 8 | 
 | 
| 9 | *(July 2024)*
 | 
| 10 | 
 | 
| 11 | This is a long, "unified/orthogonal" design  for:
 | 
| 12 | 
 | 
| 13 | - Streams: [awk]($xref) delimited lines, regexes
 | 
| 14 | - Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
 | 
| 15 |   TSV8 format
 | 
| 16 | - Processes: xargs -P in parallel
 | 
| 17 | 
 | 
| 18 | There's also a relation to:
 | 
| 19 | 
 | 
| 20 | - Trees: `jq`, which will be covered elsewhere.
 | 
| 21 | 
 | 
| 22 | It's a layered design.  That means we need some underlying mechanisms:
 | 
| 23 | 
 | 
| 24 | - `eval` and positional args `$1 $2 $3`
 | 
| 25 | - `ctx` builtin
 | 
| 26 | - Data languages: TSV8
 | 
| 27 | - Process pool / event loop primitive
 | 
| 28 | 
 | 
| 29 | It will link to:
 | 
| 30 | 
 | 
| 31 | - Oils blog posts (old)
 | 
| 32 | - Zulip threads (recent)
 | 
| 33 | - Other related projects (many of them)
 | 
| 34 | 
 | 
| 35 | <div id="toc">
 | 
| 36 | </div> 
 | 
| 37 | 
 | 
| 38 | ## Intro With Code Snippets
 | 
| 39 | 
 | 
| 40 | Let's introduce this with a text file:
 | 
| 41 | 
 | 
| 42 |     $ seq 4 | xargs -n 2 | tee test.txt
 | 
| 43 |     1 2
 | 
| 44 |     3 4 
 | 
| 45 | 
 | 
| 46 | xargs does splitting:
 | 
| 47 | 
 | 
| 48 |     $ echo 'alice bob' | xargs -n 1 -- echo hi | tee test2.txt
 | 
| 49 |     hi alice
 | 
| 50 |     hi bob
 | 
| 51 | 
 | 
| 52 | Oils:
 | 
| 53 | 
 | 
| 54 |     # should we use $_ for _word _line _row?  $[_.age] instead of $[_row.age]
 | 
| 55 |     $ echo 'alice bob' | each-word { echo "hi $_" } | tee test2.txt
 | 
| 56 |     hi alice
 | 
| 57 |     hi bob
 | 
| 58 | 
 | 
| 59 | Normally this should be balanced
 | 
| 60 | 
 | 
| 61 | ### Streams - awk
 | 
| 62 | 
 | 
| 63 | Now let's use awk:
 | 
| 64 | 
 | 
| 65 |     $ cat test.txt | awk '{ print $2 " " $1 }'
 | 
| 66 |     2 1
 | 
| 67 |     4 3
 | 
| 68 | 
 | 
| 69 | In YSH:
 | 
| 70 | 
 | 
| 71 |     $ cat test.txt | chop '$2 $1'
 | 
| 72 |     2 1
 | 
| 73 |     4 3
 | 
| 74 | 
 | 
| 75 | It's shorter!  `chop` is an alias for `split-by (space=true, template='$2 $1')`
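
To make the intended semantics concrete, here is a minimal Python sketch of what `chop` with a string template might do. It is not the Oils implementation: the function name `chop`, the whitespace-splitting rule, and the `$N` substitution rule are assumptions made for illustration.

```python
import re


def chop(lines, template):
    """Split each line on whitespace runs and fill $1, $2, ... into template.

    Hypothetical stand-in for the proposed `chop` command.
    """
    out = []
    for line in lines:
        fields = line.split()  # no argument: split on any run of whitespace

        def field(match):
            # $1 refers to the first field, so subtract 1 for the list index
            return fields[int(match.group(1)) - 1]

        out.append(re.sub(r'\$(\d+)', field, template))
    return out


swapped = chop(['1 2', '3 4'], '$2 $1')
print(swapped)
```

The sketch collects results in a list; a real stream filter would presumably emit each line as it is produced.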
 | 
| 76 | 
 | 
| 77 | With a template, for static parsing:
 | 
| 78 | 
 | 
| 79 |     $ cat test.txt | chop (^"$2 $1")
 | 
| 80 |     2 1
 | 
| 81 |     4 3
 | 
| 82 | 
 | 
| 83 | With a block, which runs once per input line:
 | 
| 84 | 
 | 
| 85 |     $ cat test.txt | chop { mkdir -v -p $2/$1 }
 | 
| 86 |     mkdir: created directory '2/1'
 | 
| 87 |     mkdir: created directory '4/3'
 | 
| 88 | 
 | 
| 89 | With no argument, it prints a table:
 | 
| 90 | 
 | 
| 91 |     $ cat test.txt | chop
 | 
| 92 |     #.tsv8 $1 $2
 | 
| 93 |            1  2
 | 
| 94 |            3  4
 | 
| 95 | 
 | 
| 96 |     $ cat test.txt | chop (names = :|a b|)
 | 
| 97 |     #.tsv8 a  b
 | 
| 98 |            1  2
 | 
| 99 |            3  4
 | 
| 100 | 
 | 
| 101 | Longer examples with split-by:
 | 
| 102 | 
 | 
| 103 |     $ cat test.txt | split-by (space=true, template='$2 $1')
 | 
| 104 |     $ cat test.txt | split-by (space=true, template=^"$2 $1")
 | 
| 105 |     $ cat test.txt | split-by (space=true) { mkdir -v -p $2/$1 }
 | 
| 106 |     $ cat test.txt | split-by (space=true)
 | 
| 107 |     $ cat test.txt | split-by (space=true, names= :|a b|)
 | 
| 108 |     $ cat test.txt | split-by (space=true, names= :|a b|) {
 | 
| 109 |         mkdir -v -p $a/$b
 | 
| 110 |       }
 | 
| 111 | 
 | 
| 112 | With must-match:
 | 
| 113 | 
 | 
| 114 |     $ var p = /<capture d+> s+ <capture d+>/
 | 
| 115 |     $ cat test.txt | must-match (p, template='$2 $1')
 | 
| 116 |     $ cat test.txt | must-match (p, template=^"$2 $1")
 | 
| 117 |     $ cat test.txt | must-match (p) { mkdir -v -p $2/$1 }
 | 
| 118 |     $ cat test.txt | must-match (p)
 | 
| 119 | 
 | 
| 120 | With names:
 | 
| 121 | 
 | 
| 122 |     $ var p = /<capture d+ as a> s+ <capture d+ as b>/
 | 
| 123 |     $ cat test.txt | must-match (p, template='$b $a')
 | 
| 124 |     $ cat test.txt | must-match (p)
 | 
| 125 |     #.tsv8 a b
 | 
| 126 |            1 2
 | 
| 127 |            3 4
 | 
| 128 | 
 | 
| 129 |     $ cat test.txt | must-match (p) {
 | 
| 130 |         mkdir -v -p $a/$b
 | 
| 131 |       }
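
The named-capture variant can be sketched in Python as follows. This is an illustration under assumptions, not the proposed implementation: `must_match` is a hypothetical function, the pattern is a Python regex standing in for the eggex, and the choice to search anywhere in the line and to raise on a non-matching line is a guess at what "must" means.

```python
import re

# Python equivalent of the eggex with captures named a and b (assumed)
PATTERN = re.compile(r'(?P<a>\d+)\s+(?P<b>\d+)')


def must_match(lines, pattern):
    """Turn each line into a row (dict of named captures); fail on a mismatch."""
    rows = []
    for line_num, line in enumerate(lines, start=1):
        m = pattern.search(line)
        if m is None:
            raise ValueError(f'line {line_num} does not match: {line!r}')
        rows.append(m.groupdict())
    return rows


rows = must_match(['1 2', '3 4'], PATTERN)
print(rows)
```

Each row is a dict keyed by capture name, which corresponds to the `a` and `b` columns of the table output above.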
 | 
| 132 | 
 | 
| 133 | Doing it in parallel:
 | 
| 134 | 
 | 
| 135 |     $ cat test.txt | must-match --max-jobs 4 (p) {
 | 
| 136 |         mkdir -v -p $a/$b
 | 
| 137 |       }
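
One way to picture `--max-jobs` is a bounded worker pool that starts one child process per row. The Python sketch below assumes that reading; `run_each` and `make_argv` are hypothetical names, and a harmless child command stands in for `mkdir -v -p $a/$b`.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_each(rows, make_argv, max_jobs=4):
    """Run one child process per row, at most max_jobs at a time.

    Returns the exit codes in input order.
    """
    def run(row):
        return subprocess.run(make_argv(row)).returncode

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run, rows))


def make_argv(row):
    # Stand-in for the block body: print the path instead of creating it
    script = 'import sys; print(sys.argv[1])'
    return [sys.executable, '-c', script, row['a'] + '/' + row['b']]


rows = [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
codes = run_each(rows, make_argv, max_jobs=4)
print(codes)
```

Threads are used only to wait on child processes, so the parallelism comes from the processes themselves, as with `xargs -P`.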
 | 
| 138 | 
 | 
| 139 | ### Tables - Data frames with dplyr (R)
 | 
| 140 | 
 | 
| 141 |     $ cat table.txt
 | 
| 142 |     size path
 | 
| 143 |     3    foo.txt
 | 
| 144 |     20   bar.jpg
 | 
| 145 | 
 | 
| 146 |     $ R
 | 
| 147 |     > t=read.table('table.txt', header=T)
 | 
| 148 |     > t
 | 
| 149 |       size    path
 | 
| 150 |     1    3 foo.txt
 | 
| 151 |     2   20 bar.jpg
 | 
| 152 | 
 | 
| 153 | ### Processes - xargs
 | 
| 154 | 
 | 
| 155 | We already saw this, because we "compressed" awk and xargs together.
 | 
| 156 | 
 | 
| 157 | What's not in the streams / awk example above:
 | 
| 158 | 
 | 
| 159 | - `BEGIN END` - that can be separate
 | 
| 160 | - `when [$1 ~ /d+/] { }`
 | 
| 161 | 
 | 
| 162 | ## Background / References
 | 
| 163 | 
 | 
| 164 | - Shell, Awk, and Make Should be Combined (2016)
 | 
| 165 |   - this is the Awk part!
 | 
| 166 | 
 | 
| 167 | - What is a Data Frame?  (2018)
 | 
| 168 | 
 | 
| 169 | - Sketches of YSH Features (June 2023) - can we express things in YSH?
 | 
| 170 |   - Zulip: Oils Layering / Self-hosting
 | 
| 171 | 
 | 
| 172 | - Language Compositionality Test: J8 Lines
 | 
| 173 |   - This whole thing is a compositionality test
 | 
| 174 | 
 | 
| 175 | - read --split
 | 
| 176 |   - more feedback from Aidan and Samuel
 | 
| 177 | 
 | 
| 180 | - jq in jq thread
 | 
| 181 | 
 | 
| 182 | Old wiki pages:
 | 
| 183 | 
 | 
| 184 | - <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
 | 
| 185 |   - uxy - closest I think - <https://github.com/sustrik/uxy>
 | 
| 186 |     - relies on to-json and jq for querying
 | 
| 187 |   - miller - I don't like their language - <https://github.com/johnkerl/miller>
 | 
| 188 |   - jc - <https://github.com/kellyjonbrazil/jc>
 | 
| 189 | - nushell
 | 
| 190 | - extremely old thing -
 | 
| 191 | 
 | 
| 192 | We're doing **all of these**.
 | 
| 193 | 
 | 
| 194 | ## Concrete Use Cases
 | 
| 195 | 
 | 
| 196 | - benchmarks/* with dplyr
 | 
| 197 | - wedge report
 | 
| 198 | - oilshell.org analytics job uses dplyr and ggplot2
 | 
| 199 | 
 | 
| 200 | ## Intro
 | 
| 201 | 
 | 
| 202 | ### How much code is it?
 | 
| 203 | 
 | 
| 204 | - I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
 | 
| 205 |   - It should be small
 | 
| 206 | 
 | 
| 207 | ### Thanks
 | 
| 208 | 
 | 
| 209 | - Samuel - two big hints 
 | 
| 210 |   - do it in YSH
 | 
| 211 |   - `table` with the `ctx` builtin
 | 
| 212 | - Aidan
 | 
| 213 |   - `read --split` feedback
 | 
| 214 | 
 | 
| 215 | 
 | 
| 216 | ## Tools
 | 
| 217 | 
 | 
| 218 | - awk 
 | 
| 219 |   - streams of records - row-wise
 | 
| 220 | - R
 | 
| 221 |   - column-wise operations on tables
 | 
| 222 | - `find . -printf '%s %P\n'`  - size and path
 | 
| 223 |   - generate text that looks like a table
 | 
| 224 | - xargs
 | 
| 225 |   - operate on tabular text -- it has a bespoke splitting algorithm
 | 
| 226 |   - Opinionated guide to xargs
 | 
| 227 |   - table in, table out
 | 
| 228 | - jq - "awk for JSON"
 | 
| 229 | 
 | 
| 230 | 
 | 
| 231 | ## Concepts
 | 
| 232 | 
 | 
| 233 | - TSV8
 | 
| 234 |   - aligned format SSV8
 | 
| 235 |   - columns have types, and attributes
 | 
| 236 | - Lines 
 | 
| 237 |   - raw lines like shell
 | 
| 238 |   - J8 lines (which can represent any filename, any unicode or byte string)
 | 
| 239 | - Tables - can be thought of as:
 | 
| 240 |   - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
 | 
| 241 |     - this is actually <https://jsonlines.org> , and it fits well with `jq` 
 | 
| 242 |   - Columns - shape `{bytes: [], path: []}`
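
The two shapes hold the same data, and converting between them is mechanical. Here is a minimal Python sketch, assuming every row has the same keys; the function names are made up for illustration.

```python
def rows_to_columns(rows):
    """[{bytes: 3, path: 'foo'}, ...] -> {bytes: [3, ...], path: ['foo', ...]}"""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """{bytes: [3, ...], path: ['foo', ...]} -> [{bytes: 3, path: 'foo'}, ...]"""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {'bytes': 3, 'path': 'foo.txt'},
    {'bytes': 20, 'path': 'bar.jpg'},
]
columns = rows_to_columns(rows)
print(columns)
print(columns_to_rows(columns) == rows)
```

Row-wise fits streaming tools like awk and `jq`; column-wise fits R-style operations that touch a whole column at once.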
 | 
| 243 | 
 | 
| 244 | ## Underlying Mechanisms in Oils / Primitives
 | 
| 245 | 
 | 
| 246 | - blocks `value.Block` - `^()` and `{ }` 
 | 
| 247 | - expressions `value.Expr` - `^[]` and 'compute [] where []'
 | 
| 248 | 
 | 
| 249 | - eval (b, vars={}, positional=[])
 | 
| 250 | 
 | 
| 251 | - Buffered for loop
 | 
| 252 |   - YSH is now roughly as fast as Awk!
 | 
| 253 |   - `for x in <>`
 | 
| 254 | 
 | 
| 255 | - "magic awk loop"
 | 
| 256 | 
 | 
| 257 |     with chop {
 | 
| 258 |       for <README.md *.py> {
 | 
| 259 |         echo $_line_num $_line $_filename $1 $2
 | 
| 260 |       }
 | 
| 261 |     }
 | 
| 262 | 
 | 
| 263 | - positional args $1 $2 $3
 | 
| 264 |   - currently mean "argv stack"
 | 
| 265 |   - or "the captures"
 | 
| 266 |   - this can probably be generalized
 | 
| 267 | 
 | 
| 268 | - `ctx` builtin
 | 
| 269 | - `value.Place`
 | 
| 270 | 
 | 
| 271 | TODO:
 | 
| 272 | 
 | 
| 273 | - split() like Python, not like shell IFS algorithm
 | 
| 274 | 
 | 
| 275 | - string formatting ${bytes %.2f}
 | 
| 276 |   - ${bytes %.2f M} Megabytes
 | 
| 277 |   - ${bytes %.2f Mi} Mebibytes
 | 
| 278 | 
 | 
| 279 |   - ${timestamp +'%Y-%m-%d'}  and strftime
 | 
| 280 | 
 | 
| 281 |   - this is for
 | 
| 282 | 
 | 
| 283 |   - floating point %e %f %g and printf and strftime
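
For reference, this is how the TODO items above behave in Python, which the list names as the model for `split()`. The `format_bytes` helper is hypothetical, and its reading of `M` as 10**6 and `Mi` as 2**20 follows the Megabytes / Mebibytes labels above; whether the formatted string should include the unit suffix is left open.

```python
import time

# Python's split() with no argument collapses whitespace runs and drops
# empty fields; splitting on an explicit separator keeps them.
print('a  b '.split())     # ['a', 'b']
print('a  b '.split(' '))  # ['a', '', 'b', '']


def format_bytes(n, unit=''):
    """Format a byte count with two decimals, optionally scaled to M or Mi."""
    scale = {'': 1, 'M': 10**6, 'Mi': 2**20}[unit]
    return '%.2f' % (n / scale)


print(format_bytes(1500000, 'M'))   # 1.50
print(format_bytes(1048576, 'Mi'))  # 1.00

# strftime-style timestamp formatting, using the Unix epoch in UTC
print(time.strftime('%Y-%m-%d', time.gmtime(0)))  # 1970-01-01
```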
 | 
| 284 | 
 | 
| 285 | ### Process Pool or Event Loop Primitive?
 | 
| 286 | 
 | 
| 287 | - if you want to display progress, then you might need an event loop
 | 
| 288 | - test framework might display progress
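
A rough Python sketch of why progress display pushes toward an event loop: the caller has to react to each completion as it happens, rather than wait for the whole batch. The names `run_with_progress` and `report` are hypothetical, and a thread pool with a completion loop is only one possible shape for this primitive.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(tasks, max_jobs=4, report=print):
    """Run named tasks in a pool and report each one as it finishes.

    tasks maps a name to a zero-argument callable.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = {pool.submit(task): name for name, task in tasks.items()}
        # as_completed yields futures in completion order, not input order
        for done, future in enumerate(as_completed(futures), start=1):
            name = futures[future]
            results[name] = future.result()
            report(f'[{done}/{len(futures)}] {name} finished')
    return results


messages = []
results = run_with_progress(
    {'a': lambda: 1, 'b': lambda: 2},
    report=messages.append,
)
print(results, messages)
```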
 | 
| 289 | 
 | 
| 290 | ## Matrices - Orthogonal design in these dimensions
 | 
| 291 | 
 | 
| 292 | - input: lines vs. rows
 | 
| 293 | - output: string (Str, Template) vs. row vs. block execution (also a row)
 | 
| 294 | - execution: serial vs. parallel
 | 
| 295 | - representation: interior vs. exterior !!!
 | 
| 296 |   - Dicts and Lists are interior, but TSV8 is exterior
 | 
| 297 |   - and we have row-wise format, and column-wise format -- this always bugged me
 | 
| 298 | - exterior: human vs. machine readable
 | 
| 299 |   - TSV8 is both human and machine-readable
 | 
| 300 |   - "aligned" #.ssv8 format is also
 | 
| 301 |   - they are one format named TSV8, with different file extensions.  This is
 | 
| 302 |     because it doesn't make too much sense to implement SSV8 without TSV8.  The
 | 
| 303 |     latter becomes trivial.  So we call the whole thing TSV8.
 | 
| 304 | 
 | 
| 305 | This means we consider all these conversions:
 | 
| 306 | 
 | 
| 307 | - Line -> Line
 | 
| 308 | - Line -> Row
 | 
| 309 | - Row -> Line
 | 
| 310 | - Row -> Row
 | 
| 311 | 
 | 
| 312 | ## Concrete Decisions - Matrix cut off
 | 
| 313 | 
 | 
| 314 | The design might seem very general, but we did make some hard choices.
 | 
| 315 | 
 | 
| 316 | - push vs. pull
 | 
| 317 |   - everything is "push" style I think
 | 
| 318 | - buffered vs. unbuffered, everything
 | 
| 319 | 
 | 
| 320 | - List vs iterators
 | 
| 321 |   - everything is either iterable pipelines, or a List
 | 
| 322 | 
 | 
| 323 | 
 | 
| 324 | [OSH]: $xref
 | 
| 325 | [YSH]: $xref
 | 
| 326 | 
 | 
| 327 | 
 | 
| 328 | ## String World
 | 
| 329 | 
 | 
| 330 | **THESE ARE ALL THE SAME ALGORITHM**.  They just have different names.
 | 
| 331 | 
 | 
| 332 | - each-line
 | 
| 333 | - each-row
 | 
| 334 | - split-by (/d+/, cols=:|a b c|)
 | 
| 335 |   - chop
 | 
| 336 | - if-match
 | 
| 337 | - must-match
 | 
| 338 |   - todo
 | 
| 339 | 
 | 
| 340 | Should we also have `if-split-by`, in case there aren't enough columns?
 | 
| 341 | 
 | 
| 342 | They all take:
 | 
| 343 | 
 | 
| 344 | - string arg ' '
 | 
| 345 | - template arg (^"") - `value.Expr`
 | 
| 346 | - block arg
 | 
| 347 | 
 | 
| 348 | For the block arg, these job-control flags apply:
 | 
| 349 | 
 | 
| 350 |     -j 4
 | 
| 351 |     --max-jobs 4
 | 
| 352 | 
 | 
| 353 |     --max-jobs $(cached-nproc)
 | 
| 354 |     --max-jobs $[_nproc - 1]
 | 
| 355 | 
 | 
| 356 | ### Awk Issues
 | 
| 357 | 
 | 
| 358 | So we have this:
 | 
| 359 | 
 | 
| 360 |     echo begin
 | 
| 361 |     var d = {}
 | 
| 362 |     cat -- @files | split-by (ifs=IFS) {
 | 
| 363 |       echo $2 $1
 | 
| 364 |       call d->accum($1, $2)
 | 
| 365 |     }
 | 
| 366 |     echo end
 | 
| 367 | 
 | 
| 368 | But then how do we express conditionals?
 | 
| 369 | 
 | 
| 370 |     Filter foo {  # does this define a proc?  Or a data structure
 | 
| 371 | 
 | 
| 372 |       split-by (ifs=IFS)  # is this possible?  We register the proc itself?
 | 
| 373 | 
 | 
| 374 |       config split-by (ifs=IFS)  # register it
 | 
| 375 | 
 | 
| 376 |       BEGIN {
 | 
| 377 |         var d = {}
 | 
| 378 |       }
 | 
| 379 |       END {
 | 
| 380 |         echo $[d.sum]
 | 
| 381 |       }
 | 
| 382 | 
 | 
| 383 |       when [$1 ~ /d+/] {
 | 
| 384 |         setvar d.sum += $1
 | 
| 385 |       }
 | 
| 386 | 
 | 
| 387 |     }
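
For comparison, the awk-style BEGIN / when / END structure sketched above reduces to the following in Python. It only pins down the intended behavior and says nothing about how `Filter` should be defined; `run_filter` is a hypothetical name, and it is an assumption that the `when` condition means the whole first field is digits.

```python
import re


def run_filter(lines):
    """Sum the first field of every line whose first field is all digits."""
    state = {'sum': 0}  # BEGIN { var d = {} }, with sum initialized

    for line in lines:
        fields = line.split()
        # when [$1 ~ /d+/] { setvar d.sum += $1 }
        if fields and re.fullmatch(r'\d+', fields[0]):
            state['sum'] += int(fields[0])

    return state['sum']  # END { echo the sum }


total = run_filter(['1 x', 'foo y', '3 z', ''])
print(total)
```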
 | 
| 388 | 
 | 
| 389 | ## Table World
 | 
| 390 | 
 | 
| 391 | ### `table` to construct
 | 
| 392 | 
 | 
| 393 | Actions:
 | 
| 394 | 
 | 
| 395 |     table cat
 | 
| 396 |     table align / table tabify
 | 
| 397 |     table header (cols)
 | 
| 398 |     table slice (1, -1)   or (-1, -2) etc.
 | 
| 399 | 
 | 
| 400 | Subcommands
 | 
| 401 | 
 | 
| 402 |     cols
 | 
| 403 |     types
 | 
| 404 |     attr units
 | 
| 405 | 
 | 
| 406 | Partial Parsing  / Lazy Parsing - TSV8 is designed for this
 | 
| 407 | 
 | 
| 408 |     # we only decode the columns that are necessary
 | 
| 409 |     cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)
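
The idea of decoding only the requested columns can be sketched in Python like this. The sketch reads plain tab-separated text with a header row, not real TSV8 (no type row, no J8 quoting); `read_columns` and the per-column decoder dict are assumptions for illustration.

```python
def read_columns(lines, wanted):
    """Read a tab-separated table column-wise, decoding only wanted columns.

    wanted maps a column name to a decoder function, e.g. {'bytes': int}.
    Cells in other columns are split off but never decoded.
    """
    it = iter(lines)
    header = next(it).rstrip('\n').split('\t')
    indexes = {name: header.index(name) for name in wanted}
    out = {name: [] for name in wanted}
    for line in it:
        cells = line.rstrip('\n').split('\t')
        for name, decode in wanted.items():
            out[name].append(decode(cells[indexes[name]]))
    return out


table = [
    'bytes\tpath\tmode\n',
    '3\tfoo.txt\tnot-a-number\n',
    '20\tbar.jpg\t0644\n',
]
out = read_columns(table, {'bytes': int, 'path': str})
print(out)
```

The `mode` column contains a cell that would fail integer decoding, but since it was not requested it is never touched.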
 | 
| 410 | 
 | 
| 411 | ## Will writing it in YSH be slow?
 | 
| 412 | 
 | 
| 413 | - We concentrate on semantics first
 | 
| 414 | - We can rewrite in Python
 | 
| 415 | - Better: users can use **exterior** tools with the same interface
 | 
| 416 |   - in some cases
 | 
| 417 |   - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms
 | 
| 418 | - Most data will be small at first
 | 
| 419 | 
 | 
| 420 | 
 | 
| 421 | ## Applications
 | 
| 422 | 
 | 
| 423 | - Shell is shared nothing
 | 
| 424 | - Scaling to infinity on the biggest clouds
 | 
| 425 | 
 | 
| 426 | 
 | 
| 427 | ## Extra: Tree World?
 | 
| 428 | 
 | 
| 429 | This is sort of "expanding the scope" of the project, when we want to reduce scope.
 | 
| 430 | 
 | 
| 431 | But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice **bridge** between them.
 | 
| 432 | 
 | 
| 433 | Streams of Trees (jq)
 | 
| 434 | 
 | 
| 435 |     empty
 | 
| 436 |     this
 | 
| 437 |     this[]
 | 
| 438 |     =>
 | 
| 439 |     select()
 | 
| 440 |     a & b  # more than one
 | 
| 441 | 
 | 
| 442 | 
 | 
| 443 | ## Pie in the Sky
 | 
| 444 | 
 | 
| 445 | Four types of Data Languages:
 | 
| 446 | 
 | 
| 447 | - flat strings
 | 
| 448 | - JSON8 - tree
 | 
| 449 | - TSV8 - table
 | 
| 450 | - NIL8 - Lisp Tree
 | 
| 451 | - HTML/XML - doc tree -- attributed text (similar to Emacs data model)
 | 
| 452 |   - 8ml
 | 
| 453 | 
 | 
| 454 | Four types of query languages:
 | 
| 455 | 
 | 
| 456 | - regex
 | 
| 457 | - jq / jshape
 | 
| 458 | - tsv8
 | 
| 459 | 
 | 
| 460 | 
 | 
| 461 | ## Appendix
 | 
| 462 | 
 | 
| 463 | ### Notes on Naming
 | 
| 464 | 
 | 
| 465 | Considering columns and then rows:
 | 
| 466 | 
 | 
| 467 | - SQL is "select ... where"
 | 
| 468 | - dplyr is "select ... filter"
 | 
| 469 | - YSH is "pick ... where"
 | 
| 470 |   - select is a legacy shell keyword, and pick is shorter
 | 
| 471 |   - or it could be elect in OSH, elect/select in YSH
 | 
| 472 |     - OSH wouldn't support mutate [average = bytes/total] anyway
 | 
| 473 | 
 | 
| 474 | dplyr:
 | 
| 475 | 
 | 
| 476 | - summarise vs. summarize vs. summary
 | 
| 477 | 
 | 
| 478 | 
 | 
| 479 | 
 | 
| 480 | 
 | 
| 481 | 
 |