| 1 | ---
 | 
| 2 | in_progress: yes
 | 
| 3 | default_highlighter: oils-sh
 | 
| 4 | ---
 | 
| 5 | 
 | 
| 6 | Streams, Tables and Processes - awk, R, xargs
 | 
| 7 | =============================================
 | 
| 8 | 
 | 
| 9 | *(July 2024)*
 | 
| 10 | 
 | 
| 11 | This is a long, "unified/orthogonal" design for:
 | 
| 12 | 
 | 
| 13 | - Streams: [awk]($xref) delimited lines, regexes
 | 
| 14 | - Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
 | 
| 15 |   TSV8 format
 | 
| 16 | - Processes: xargs -P in parallel
 | 
| 17 | 
 | 
| 18 | There's also a relation to:
 | 
| 19 | 
 | 
| 20 | - Trees: `jq`, which will be covered elsewhere.
 | 
| 21 | 
 | 
| 22 | It's a layered design.  That means we need some underlying mechanisms:
 | 
| 23 | 
 | 
| 24 | - `eval` and positional args `$1 $2 $3`
 | 
| 25 | - `ctx` builtin
 | 
| 26 | - Data languages: TSV8
 | 
| 27 | - Process pool / event loop primitive
 | 
| 28 | 
 | 
| 29 | It will link to:
 | 
| 30 | 
 | 
| 31 | - Oils blog posts (old)
 | 
| 32 | - Zulip threads (recent)
 | 
| 33 | - Other related projects (many of them)
 | 
| 34 | 
 | 
| 35 | <div id="toc">
 | 
| 36 | </div> 
 | 
| 37 | 
 | 
| 38 | ## Background / References
 | 
| 39 | 
 | 
| 40 | - Shell, Awk, and Make Should be Combined (2016)
 | 
| 41 |   - this is the Awk part!
 | 
| 42 | 
 | 
| 43 | - What is a Data Frame?  (2018)
 | 
| 44 | 
 | 
| 45 | - Sketches of YSH Features (June 2023) - can we express things in YSH?
 | 
| 46 |   - Zulip: Oils Layering / Self-hosting
 | 
| 47 | 
 | 
| 48 | - Language Compositionality Test: J8 Lines
 | 
| 49 |   - This whole thing is a compositionality test
 | 
| 50 | 
 | 
| 51 | - read --split
 | 
| 52 |   - more feedback from Aidan and Samuel
 | 
| 53 | 
 | 
| 55 | 
 | 
| 56 | - jq in jq thread
 | 
| 57 | 
 | 
| 58 | Old wiki pages:
 | 
| 59 | 
 | 
| 60 | - <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
 | 
| 61 |   - uxy - closest I think - <https://github.com/sustrik/uxy>
 | 
| 62 |     - relies on to-json and jq for querying
 | 
| 63 |   - miller - I don't like their language - <https://github.com/johnkerl/miller>
 | 
| 64 |   - jc - <https://github.com/kellyjonbrazil/jc>
 | 
| 65 | - nushell
 | 
| 66 | - extremely old thing -
 | 
| 67 | 
 | 
| 68 | We're doing **all of these**.
 | 
| 69 | 
 | 
| 70 | ## Concrete Use Cases
 | 
| 71 | 
 | 
| 72 | - benchmarks/* with dplyr
 | 
| 73 | - wedge report
 | 
| 74 | - oilshell.org analytics job uses dplyr and ggplot2
 | 
| 75 | 
 | 
| 76 | ## Intro
 | 
| 77 | 
 | 
| 78 | ### How much code is it?
 | 
| 79 | 
 | 
| 80 | - I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
 | 
| 81 |   - It should be small
 | 
| 82 | 
 | 
| 83 | ### Thanks
 | 
| 84 | 
 | 
| 85 | - Samuel - two big hints 
 | 
| 86 |   - do it in YSH
 | 
| 87 |   - `table` with the `ctx` builtin
 | 
| 88 | - Aidan
 | 
| 89 |   - `read --split` feedback
 | 
| 90 | 
 | 
| 91 | 
 | 
| 92 | ## Tools
 | 
| 93 | 
 | 
| 94 | - awk 
 | 
| 95 |   - streams of records - row-wise
 | 
| 96 | - R
 | 
| 97 |   - column-wise operations on tables
 | 
| 98 | - `find . -printf '%s %P\n'`  - size and path
 | 
| 99 |   - generate text that looks like a table
 | 
| 100 | - xargs
 | 
| 101 |   - operate on tabular text -- it has a bespoke splitting algorithm
 | 
| 102 |   - Opinionated guide to xargs
 | 
| 103 |   - table in, table out
 | 
| 104 | - jq - "awk for JSON"
 | 
| 105 | 
 | 
| 106 | 
 | 
| 107 | ## Concepts
 | 
| 108 | 
 | 
| 109 | - TSV8
 | 
| 110 |   - aligned format SSV8
 | 
| 111 |   - columns have types, and attributes
 | 
| 112 | - Lines 
 | 
| 113 |   - raw lines like shell
 | 
| 114 |   - J8 lines (which can represent any filename, any unicode or byte string)
 | 
| 115 | - Tables - can be thought of as:
 | 
| 116 |   - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
 | 
| 117 |     - this is actually <https://jsonlines.org> , and it fits well with `jq` 
 | 
| 118 |   - Columns - shape `{bytes: [], path: []}`
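
The two table shapes above carry the same data, so converting between them is mechanical. Here is a minimal Python sketch of that idea (not the planned YSH implementation). The function names `rows_to_columns` and `columns_to_rows` are made up for illustration, and the sketch assumes every row has the same keys.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists.

    Assumes every row has the same keys (hypothetical simplification).
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn a dict of equal-length column lists back into row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"bytes": 123, "path": "foo"},
    {"bytes": 456, "path": "bar"},
]
columns = rows_to_columns(rows)
print(columns)                           # {'bytes': [123, 456], 'path': ['foo', 'bar']}
print(columns_to_rows(columns) == rows)  # True
```

The row shape suits streaming (one JSON line per row), while the column shape suits R-style column-wise operations.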
 | 
| 119 | 
 | 
| 120 | ## Underlying Mechanisms in Oils / Primitives
 | 
| 121 | 
 | 
| 122 | - blocks `value.Block` - `^()` and `{ }` 
 | 
| 123 | - expressions `value.Expr` - `^[]` and 'compute [] where []'
 | 
| 124 | 
 | 
| 125 | - eval (b, vars={}, positional=[])
 | 
| 126 | 
 | 
| 127 | - Buffered for loop
 | 
| 128 |   - YSH is now roughly as fast as Awk!
 | 
| 129 |   - `for x in <>`
 | 
| 130 | 
 | 
| 131 | - "magic awk loop"
 | 
| 132 | 
 | 
| 133 |     with chop {
 | 
| 134 |       for <README.md *.py> {
 | 
| 135 |         echo _line_num _line _filename $1 $2
 | 
| 136 |       }
 | 
| 137 |     }
 | 
| 138 | 
 | 
| 139 | - positional args $1 $2 $3
 | 
| 140 |   - currently mean "argv stack"
 | 
| 141 |   - or "the captures"
 | 
| 142 |   - this can probably be generalized
 | 
| 143 | 
 | 
| 144 | - `ctx` builtin
 | 
| 145 | - `value.Place`
 | 
| 146 | 
 | 
| 147 | TODO:
 | 
| 148 | 
 | 
| 149 | - split() like Python, not like shell IFS algorithm
 | 
| 150 | 
 | 
| 151 | - string formatting ${bytes %.2f}
 | 
| 152 |   - ${bytes %.2f M} Megabytes
 | 
| 153 |   - ${bytes %.2f Mi} Mebibytes
 | 
| 154 | 
 | 
| 155 |   - ${timestamp +'%Y-%m-%d'} and strftime
 | 
| 156 | 
 | 
| 157 |   - this is for
 | 
| 158 | 
 | 
| 159 |   - floating point %e %f %g and printf and strftime
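
To make the formatting TODO concrete, here is a rough Python sketch of what the `M` / `Mi` suffixes and the timestamp format could compute. The helper names are hypothetical. It assumes `M` means 10^6 bytes and `Mi` means 2^20 bytes, and that the unit is not printed; the real `${...}` syntax may well behave differently.

```python
import time


def format_bytes(num_bytes, unit=""):
    """Format a byte count with two decimals, scaled by an optional unit.

    Assumption: "M" is megabytes (10**6), "Mi" is mebibytes (2**20).
    """
    divisors = {"": 1, "M": 10**6, "Mi": 2**20}
    return "%.2f" % (num_bytes / divisors[unit])


def format_timestamp(seconds, fmt="%Y-%m-%d"):
    """Format seconds since the epoch in UTC using strftime codes."""
    return time.strftime(fmt, time.gmtime(seconds))


print(format_bytes(5_000_000, "M"))   # 5.00
print(format_bytes(5_242_880, "Mi"))  # 5.00
print(format_timestamp(0))            # 1970-01-01
```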
 | 
| 160 | 
 | 
| 161 | ### Process Pool or Event Loop Primitive?
 | 
| 162 | 
 | 
| 163 | - if you want to display progress, then you might need an event loop
 | 
| 164 | - test framework might display progress
 | 
| 165 | 
 | 
| 166 | ## Matrices - Orthogonal design in these dimensions
 | 
| 167 | 
 | 
| 168 | - input: lines vs. rows
 | 
| 169 | - output: string (Str, Template) vs. row vs. block execution (also a row)
 | 
| 170 | - execution: serial vs. parallel
 | 
| 171 | - representation: interior vs. exterior !!!
 | 
| 172 |   - Dicts and Lists are interior, but TSV8 is exterior
 | 
| 173 |   - and we have row-wise format, and column-wise format -- this always bugged me
 | 
| 174 | - exterior: human vs. machine readable
 | 
| 175 |   - TSV8 is both human and machine-readable
 | 
| 176 |   - "aligned" #.ssv8 format is also
 | 
| 177 |   - they are one format named TSV8, with different file extensions.  This is
 | 
| 178 |     because it doesn't make too much sense to implement SSV8 without TSV8.  The
 | 
| 179 |     latter becomes trivial.  So we call the whole thing TSV8.
 | 
| 180 | 
 | 
| 181 | This means we consider all these conversions:
 | 
| 182 | 
 | 
| 183 | - Line -> Line
 | 
| 184 | - Line -> Row
 | 
| 185 | - Row -> Line
 | 
| 186 | - Row -> Row
 | 
| 187 | 
 | 
| 188 | ## Concrete Decisions - Matrix cut off
 | 
| 189 | 
 | 
| 190 | The design might seem very general, but we did make some hard choices.
 | 
| 191 | 
 | 
| 192 | - push vs. pull
 | 
| 193 |   - everything is "push" style I think
 | 
| 194 | - buffered vs. unbuffered, everything
 | 
| 195 | 
 | 
| 196 | - List vs iterators
 | 
| 197 |   - everything is either iterable pipelines, or a List
 | 
| 198 | 
 | 
| 199 | 
 | 
| 200 | [OSH]: $xref
 | 
| 201 | [YSH]: $xref
 | 
| 202 | 
 | 
| 203 | 
 | 
| 204 | ## String World
 | 
| 205 | 
 | 
| 206 | **THESE ARE ALL THE SAME ALGORITHM**.  They just have different names.
 | 
| 207 | 
 | 
| 208 | - each-line
 | 
| 209 | - each-row
 | 
| 210 | - split-by (/d+/, cols=:|a b c|)
 | 
| 211 |   - chop
 | 
| 212 | - if-match
 | 
| 213 | - must-match
 | 
| 214 |   - todo
 | 
| 215 | 
 | 
| 216 | Should we also have `if-split-by`, in case there aren't enough columns?
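
A small Python sketch of the Line -> Row step may help with the "not enough columns" question. It is not the proposed YSH builtin. It assumes the pattern is a *delimiter* regex, and it returns `None` for short lines, which is roughly what an `if-split-by` variant would let the caller test for. The name `split_by` and its signature are illustrative only.

```python
import re


def split_by(line, pattern, cols):
    """Split a line on a delimiter regex and name the fields.

    Returns a row dict, or None when the line has fewer fields than
    column names.  Extra trailing fields are dropped (an assumption).
    """
    fields = re.split(pattern, line.strip())
    if len(fields) < len(cols):
        return None
    return dict(zip(cols, fields))


print(split_by("123 foo.py", r"\s+", ["bytes", "path"]))
# {'bytes': '123', 'path': 'foo.py'}
print(split_by("123", r"\s+", ["bytes", "path"]))
# None
```

A `must-match`-style variant would raise an error where this sketch returns `None`.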
 | 
| 217 | 
 | 
| 218 | They all take:
 | 
| 219 | 
 | 
| 220 | - string arg ' '
 | 
| 221 | - template arg (^"") - `value.Expr`
 | 
| 222 | - block arg
 | 
| 223 | 
 | 
| 224 | For the block arg, these flags apply:
 | 
| 225 | 
 | 
| 226 |     -j 4
 | 
| 227 |     --max-jobs 4
 | 
| 228 | 
 | 
| 229 |     --max-jobs $(cached-nproc)
 | 
| 230 |     --max-jobs $[_nproc - 1]
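
As a rough illustration of what `--max-jobs` controls, here is a Python sketch that runs a callable over each line with a bounded number of workers. It uses threads as a stand-in for the process pool discussed in this doc. `each_line_parallel` and `default_max_jobs` are hypothetical names, and "CPU count minus one" mirrors the `_nproc - 1` example above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_max_jobs():
    """Leave one CPU free, but always allow at least one job."""
    return max(1, (os.cpu_count() or 1) - 1)


def each_line_parallel(lines, func, max_jobs=None):
    """Apply func to each line with at most max_jobs running at once.

    Results come back in input order, because Executor.map preserves it.
    """
    if max_jobs is None:
        max_jobs = default_max_jobs()
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(func, lines))


results = each_line_parallel(["a", "bb", "ccc"], len, max_jobs=4)
print(results)  # [1, 2, 3]
```

Displaying progress while jobs run would need completion callbacks or an event loop instead of a blocking `map`.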
 | 
| 231 | 
 | 
| 232 | ### Awk Issues
 | 
| 233 | 
 | 
| 234 | So we have this:
 | 
| 235 | 
 | 
| 236 |     echo begin
 | 
| 237 |     var d = {}
 | 
| 238 |     cat -- @files | split-by (ifs=IFS) {
 | 
| 239 |       echo $2 $1
 | 
| 240 |       call d->accum($1, $2)
 | 
| 241 |     }
 | 
| 242 |     echo end
 | 
| 243 | 
 | 
| 244 | But then how do we express conditionals?
 | 
| 245 | 
 | 
| 246 |     Filter foo {  # does this define a proc?  Or a data structure
 | 
| 247 | 
 | 
| 248 |       split-by (ifs=IFS)  # is this possible?  We register the proc itself?
 | 
| 249 | 
 | 
| 250 |       config split-by (ifs=IFS)  # register it
 | 
| 251 | 
 | 
| 252 |       BEGIN {
 | 
| 253 |         var d = {}
 | 
| 254 |       }
 | 
| 255 |       END {
 | 
| 256 |         echo $[d.sum]
 | 
| 257 |       }
 | 
| 258 | 
 | 
| 259 |       when [$1 ~ /d+/] {
 | 
| 260 |         setvar d.sum += $1
 | 
| 261 |       }
 | 
| 262 | 
 | 
| 263 |     }
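
One way to think about the `Filter` question is that BEGIN, END and each `when` clause are just registered callables. The Python sketch below shows that "data structure" reading. All names are hypothetical, fields are split on whitespace, and the condition uses a full match on the first field, which is stricter than Awk's `~`.

```python
import re


def run_filter(lines, begin, rules, end):
    """Run an Awk-like filter: begin once, rules per line, end once.

    rules is a list of (condition, action) pairs, like Awk's
    pattern-action statements.
    """
    state = begin()
    for line in lines:
        fields = line.split()
        for condition, action in rules:
            if condition(fields):
                action(state, fields)
    return end(state)


def first_field_is_number(fields):
    """Rough analogue of `when [$1 ~ /d+/]`, using a full match."""
    return bool(fields) and re.fullmatch(r"\d+", fields[0]) is not None


def add_to_sum(state, fields):
    """Rough analogue of `setvar d.sum += $1`."""
    state["sum"] += int(fields[0])


total = run_filter(
    ["10 apples", "x pears", "32 plums"],
    begin=lambda: {"sum": 0},
    rules=[(first_field_is_number, add_to_sum)],
    end=lambda state: state["sum"],
)
print(total)  # 42
```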
 | 
| 264 | 
 | 
| 265 | ## Table World
 | 
| 266 | 
 | 
| 267 | ### `table` to construct
 | 
| 268 | 
 | 
| 269 | Actions:
 | 
| 270 | 
 | 
| 271 |     table cat
 | 
| 272 |     table align / table tabify
 | 
| 273 |     table header (cols)
 | 
| 274 |     table slice (1, -1)   or (-1, -2) etc.
 | 
| 275 | 
 | 
| 276 | Subcommands
 | 
| 277 | 
 | 
| 278 |     cols
 | 
| 279 |     types
 | 
| 280 |     attr units
 | 
| 281 | 
 | 
| 282 | Partial Parsing  / Lazy Parsing - TSV8 is designed for this
 | 
| 283 | 
 | 
| 284 |     # we only decode the columns that are necessary
 | 
| 285 |     cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)
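
Here is a Python sketch of the "decode only what you need" idea. It assumes a plain tab-separated file with a single header row, which is simpler than real TSV8. The `read_columns` helper and its `wanted` argument (column name to converter) are made up for illustration. Cells in unwanted columns are split off but never converted.

```python
def read_columns(lines, wanted):
    """Read selected columns from tab-separated lines into column lists.

    wanted maps a column name to a converter such as int or str.
    Only the wanted cells are converted; the rest are ignored.
    """
    it = iter(lines)
    header = next(it).rstrip("\n").split("\t")
    indices = {name: header.index(name) for name in wanted}
    out = {name: [] for name in wanted}
    for line in it:
        cells = line.rstrip("\n").split("\t")
        for name, convert in wanted.items():
            out[name].append(convert(cells[indices[name]]))
    return out


lines = ["bytes\tpath\tmode\n", "123\tfoo\t0644\n", "456\tbar\t0755\n"]
print(read_columns(lines, {"bytes": int, "path": str}))
# {'bytes': [123, 456], 'path': ['foo', 'bar']}
```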
 | 
| 286 | 
 | 
| 287 | ## Will writing it in YSH be slow?
 | 
| 288 | 
 | 
| 289 | - We concentrate on semantics first
 | 
| 290 | - We can rewrite in Python
 | 
| 291 | - Better: users can use **exterior** tools with the same interface
 | 
| 292 |   - in some cases
 | 
| 293 |   - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms
 | 
| 294 | - Most data will be small at first
 | 
| 295 | 
 | 
| 296 | 
 | 
| 297 | ## Applications
 | 
| 298 | 
 | 
| 299 | - Shell is shared nothing
 | 
| 300 | - Scaling to infinity on the biggest clouds
 | 
| 301 | 
 | 
| 302 | 
 | 
| 303 | ## Extra: Tree World?
 | 
| 304 | 
 | 
| 305 | This is sort of "expanding the scope" of the project, when we want to reduce scope.
 | 
| 306 | 
 | 
| 307 | But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice **bridge** between them.
 | 
| 308 | 
 | 
| 309 | Streams of Trees (jq)
 | 
| 310 | 
 | 
| 311 |     empty
 | 
| 312 |     this
 | 
| 313 |     this[]
 | 
| 314 |     =>
 | 
| 315 |     select()
 | 
| 316 |     a & b  # more than one
 | 
| 317 | 
 | 
| 318 | 
 | 
| 319 | ## Pie in the Sky
 | 
| 320 | 
 | 
| 321 | Four types of Data Languages:
 | 
| 322 | 
 | 
| 323 | - flat strings
 | 
| 324 | - JSON8 - tree
 | 
| 325 | - TSV8 - table
 | 
| 326 | - NIL8 - Lisp Tree
 | 
| 327 | - HTML/XML - doc tree -- attributed text (similar to Emacs data model)
 | 
| 328 |   - 8ml
 | 
| 329 | 
 | 
| 330 | Four types of query languages:
 | 
| 331 | 
 | 
| 332 | - regex
 | 
| 333 | - jq / jshape
 | 
| 334 | - tsv8
 | 
| 335 | 
 | 
| 336 | 
 | 
| 337 | ## Appendix
 | 
| 338 | 
 | 
| 339 | ### Notes on Naming
 | 
| 340 | 
 | 
| 341 | Considering columns and then rows:
 | 
| 342 | 
 | 
| 343 | - SQL is "select ... where"
 | 
| 344 | - dplyr is "select ... filter"
 | 
| 345 | - YSH is "pick ... where"
 | 
| 346 |   - select is a legacy shell keyword, and pick is shorter
 | 
| 347 |   - or it could be elect in OSH, elect/select in YSH
 | 
| 348 |     - OSH wouldn't support mutate [average = bytes/total] anyway
 | 
| 349 | 
 | 
| 350 | dplyr:
 | 
| 351 | 
 | 
| 352 | - summarise vs. summarize vs. summary
 | 
| 353 | 
 | 
| 354 | 
 | 
| 355 | 
 | 
| 356 | 
 | 
| 357 | 
 |