Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.

Streams, Tables and Processes - awk, R, xargs

(July 2024)

This is a long, "unified/orthogonal" design for:

There's also a relation to:

It's a layered design. That means we need some underlying mechanisms:

It will link to:

Table of Contents
Intro With Code Snippets
Streams - awk
Tables - Data frames with dplyr (R)
Processes - xargs
Background / References
Concrete Use Cases
Intro
How much code is it?
Thanks
Tools
Concepts
Underlying Mechanisms in Oils / Primitives
Process Pool or Event Loop Primitive?
Matrices - Orthogonal design in these dimensions
Concrete Decisions - Matrix cut off
String World
Awk Issues
Table World
table to construct
Will writing it in YSH be slow?
Applications
Extra: Tree World?
Pie in the Sky
Appendix
Notes on Naming

Intro With Code Snippets

Let's introduce this with a text file:

$ seq 4 | xargs -n 2 | tee test.txt
1 2
3 4 

xargs does splitting:

$ echo 'alice bob' | xargs -n 1 -- echo hi | tee test2.txt
hi alice
hi bob

Oils:

# should we use $_ for _word _line _row?  $[_.age] instead of $[_row.age]
$ echo 'alice bob' | each-word { echo "hi $_" } | tee test2.txt
hi alice
hi bob

Normally this should be balanced

Streams - awk

Now let's use awk:

$ cat test.txt | awk '{ print $2 " " $1 }'
2 1
4 3

In YSH:

$ cat test.txt | chop '$2 $1'
2 1
4 3

It's shorter! chop is an alias for split-by (space=true, template='$2 $1')

With a template, for static parsing:

$ cat test.txt | chop (^"$2 $1")
2 1
4 3

With a block:

$ cat test.txt | chop { mkdir -v -p $2/$1 }
mkdir: created directory '2/1'
mkdir: created directory '4/3'
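
For comparison, here is a minimal sketch of what the block form replaces in today's POSIX shell: a while-read loop that splits each line into two fields and runs a command on them. The function name make_dirs and the temporary root directory are illustrative assumptions, not part of the proposal.

```shell
# Hypothetical helper: read "first second" pairs on stdin and create the
# directory second/first under the root given as $1, mirroring
# `chop { mkdir -v -p $2/$1 }`.
make_dirs() {
  root=$1
  while read -r first second; do
    # Skip blank lines rather than calling mkdir with empty arguments.
    [ -n "$first" ] || continue
    mkdir -p -- "$root/$second/$first"
  done
}

tmp=$(mktemp -d)
printf '1 2\n3 4\n' | make_dirs "$tmp"
ls "$tmp"    # lists the top-level directories that were created
```

The loop has to name its fields and quote them by hand, which is the boilerplate the block form is meant to remove.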

With no argument, it prints a table:

$ cat test.txt | chop
#.tsv8 $1 $2
       1  2
       3  4

$ cat test.txt | chop (names = :|a b|)
#.tsv8 a  b
       1  2
       3  4

Longer examples with split-by:

$ cat test.txt | split-by (space=true, template='$2 $1')
$ cat test.txt | split-by (space=true, template=^"$2 $1")
$ cat test.txt | split-by (space=true) { mkdir -v -p $2/$1 }
$ cat test.txt | split-by (space=true)
$ cat test.txt | split-by (space=true, names= :|a b|)
$ cat test.txt | split-by (space=true, names= :|a b|) {
    mkdir -v -p $a/$b
  }

With must-match:

$ var p = /<capture d+> s+ <capture d+>/
$ cat test.txt | must-match (p, template='$2 $1')
$ cat test.txt | must-match (p, template=^"$2 $1")
$ cat test.txt | must-match (p) { mkdir -v -p $2/$1 }
$ cat test.txt | must-match (p)

With names:

$ var p = /<capture d+ as a> s+ <capture d+ as b>/
$ cat test.txt | must-match (p, template='$b $a')
$ cat test.txt | must-match (p)
#.tsv8 a b
       1 2
       3 4

$ cat test.txt | must-match (p) {
    mkdir -v -p $a/$b
  }
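
As a rough point of reference, the "every line must match" behavior can be approximated with awk today. This is only a sketch under assumptions: the name must_match_swap is hypothetical, the pattern is anchored to the whole line, and failing on the first non-matching line is one possible reading of must-match.

```shell
# Hypothetical helper approximating must-match: every input line must be two
# runs of digits separated by blanks; otherwise the filter exits non-zero.
must_match_swap() {
  awk '
    !/^[0-9]+[ \t]+[0-9]+$/ {
      print "line " NR " does not match: " $0 > "/dev/stderr"
      exit 1
    }
    { print $2, $1 }   # the two fields play the role of the two captures
  '
}

printf '1 2\n3 4\n' | must_match_swap
printf '1 2\noops\n' | must_match_swap 2>/dev/null || echo "rejected"
```

awk's fields are positional only, so named captures such as a and b have no direct equivalent here.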

Doing it in parallel:

$ cat test.txt | must-match --max-jobs 4 (p) {
    mkdir -v -p $a/$b
  }
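
The closest existing tool for the parallel case is xargs with -P. The sketch below assumes an xargs that supports -P (GNU and BSD do; it is not in POSIX), and the ROOT variable and temporary directory are illustrative.

```shell
# Run up to 4 mkdir processes at once. Each sh invocation receives two
# whitespace-separated tokens from stdin as $1 and $2.
tmp=$(mktemp -d)
printf '1 2\n3 4\n' |
  ROOT="$tmp" xargs -n 2 -P 4 sh -c 'mkdir -p -- "$ROOT/$1/$2"' _
ls "$tmp"
```

Note that xargs -n 2 groups tokens, not lines, so a line with three fields would shift every pair after it. The pattern-based form above would reject such a line instead.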

Tables - Data frames with dplyr (R)

$ cat table.txt
size path
3    foo.txt
20   bar.jpg

$ R
> t=read.table('table.txt', header=T)
> t
  size    path
1    3 foo.txt
2   20 bar.jpg

Processes - xargs

We already saw this, because we "compressed" awk and xargs together.

What's not in the streams / awk example above:

Background / References

Old wiki pages:

We're doing all of these.

Concrete Use Cases

Intro

How much code is it?

Thanks

Tools

Concepts

Underlying Mechanisms in Oils / Primitives

TODO:

Process Pool or Event Loop Primitive?

Matrices - Orthogonal design in these dimensions

This means we consider all these conversions

Concrete Decisions - Matrix cut off

The design might seem very general, but we did make some hard choices.

String World

THESE ARE ALL THE SAME ALGORITHM. They just have different names.

Should we also have if-split-by, in case there aren't enough columns?

They all take:

For the block arg, this applies:

-j 4
--max-jobs 4

--max-jobs $(cached-nproc)
--max-jobs $[_nproc - 1]
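
Until something like cached-nproc or _nproc exists, the job count can be derived in plain shell. This sketch assumes either nproc (GNU coreutils) or getconf is available and falls back to 1; the names cpu_count and max_jobs are illustrative.

```shell
# Number of online CPUs: try nproc, then getconf, else assume 1.
cpu_count() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Leave one CPU free, but never go below one job.
max_jobs=$(( $(cpu_count) - 1 ))
[ "$max_jobs" -ge 1 ] || max_jobs=1

echo "running with $max_jobs jobs"
seq 8 | xargs -n 1 -P "$max_jobs" echo item
```

The clamp matters on single-CPU machines, where subtracting one would otherwise give zero jobs.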

Awk Issues

So we have this

echo begin
var d = {}
cat -- @files | split-by (ifs=IFS) {
  echo $2 $1
  call d->accum($1, $2)
}
echo end

But then how do we express conditionals?

Filter foo {  # does this define a proc?  Or a data structure

  split-by (ifs=IFS)  # is this possible?  We register the proc itself?

  config split-by (ifs=IFS)  # register it

  BEGIN {
    var d = {sum: 0}
  }
  END {
    echo $[d.sum]
  }

  when [$1 ~ /d+/] {
    setvar d.sum += $1
  }

}
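
For reference, this is the awk program that the Filter sketch above mirrors, with the same BEGIN, condition, and END parts. The function name sum_first_field is illustrative, and anchoring the pattern to the whole field is an assumption about what the condition is meant to check.

```shell
# Sum the first field of every line whose first field is all digits.
sum_first_field() {
  awk '
    BEGIN { sum = 0 }
    $1 ~ /^[0-9]+$/ { sum += $1 }
    END { print sum }
  '
}

printf '1 2\nx 9\n3 4\n' | sum_first_field   # prints 4
```

In awk the state (sum) is implicitly global to the program. The open question in the Filter sketch is where the equivalent of d lives.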

Table World

table to construct

Actions:

table cat
table align / table tabify
table header (cols)
table slice (1, -1)   or (-1, -2) etc.

Subcommands

cols
types
attr units

Partial Parsing / Lazy Parsing - TSV8 is designed for this

# we only decode the columns that are necessary
cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)
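
To illustrate the column projection only, here is a sketch in awk that selects columns by header name from plain tab-separated input. It is not a TSV8 parser (no type row, no quoted cells), awk still splits every field so nothing is actually lazy, and the name select_cols is hypothetical.

```shell
# Hypothetical helper: print only the named columns of a tab-separated table.
# Usage: select_cols 'bytes path' < file.tsv
select_cols() {
  awk -F '\t' -v want="$1" '
    BEGIN { OFS = "\t"; n = split(want, names, " ") }
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) index_of[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(names[j] in index_of)) {
          print "no such column: " names[j] > "/dev/stderr"
          exit 1
        }
      }
    }
    {
      # Emit the requested columns, in the requested order (header included).
      line = $(index_of[names[1]])
      for (j = 2; j <= n; j++) line = line OFS $(index_of[names[j]])
      print line
    }
  '
}

printf 'bytes\tmode\tpath\n3\t644\tfoo.txt\n20\t600\tbar.jpg\n' |
  select_cols 'bytes path'
```

A real implementation could skip decoding the cells of unselected columns entirely, which is the point of the partial parsing described above.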

Will writing it in YSH be slow?

Applications

Extra: Tree World?

This is sort of "expanding the scope" of the project, when we want to reduce scope.

But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice bridge between them.

Streams of Trees (jq)

empty
this
this[]
=>
select()
a & b  # more than one

Pie in the Sky

Four types of Data Languages:

Four types of query languages:

Appendix

Notes on Naming

Considering columns and then rows:

dplyr:

Generated on Wed, 24 Jul 2024 05:19:10 +0000