The goal of this tool is to filter file lines with a logic-like language.
That is, there are some built-in predicates that apply operations on lines.
Then, the results can also be aggregated and plotted.
It combines features of different well-known tools such as grep, head, tail, etc., and extends them.
Each predicate takes as input at least one variable/constant holding a string (the considered line) and either unifies a variable with the result of the operation or simply succeeds/fails.
Why? Because I often have to write the same Python script to scan log files with a lot of text and extract results having a specific structure.
Assume you have a file called log.txt which contains something like
----- CV 1 -----
Train on: f2,f3,f4,f5, Test on: f1
AUCPR1: 0.720441984486102
AUCROC1: 0.735
LL1: -12.753700137597306
----- CV 2 -----
Train on: f1,f3,f4,f5, Test on: f2
AUCPR2: 0.9423737373737374
AUCROC2: 0.94
LL2: -7.4546895016487245
----- CV 3 -----
Train on: f1,f2,f4,f5, Test on: f3
AUCPR3: 0.7111492673992674
AUCROC3: 0.71
LL3: -11.836606612796992
----- CV 4 -----
Train on: f1,f2,f3,f5, Test on: f4
AUCPR4: 0.9536004273504273
AUCROC4: 0.95
LL4: -6.458126608595575
----- CV 5 -----
Train on: f1,f2,f3,f4, Test on: f5
AUCPR: 0.6554753579753579
AUCROC: 0.765
LL5: -12.149458117122595
and you want to extract the AUCPR results and average them. With take you can quickly do so:
take -f log.txt -c "line(L), startswith(L,'AUCPR'), split_select(L,':',1,L1), strip(L1,L2), println(L2)" -a average
Output
0.720441984486102
0.9423737373737374
0.7111492673992674
0.9536004273504273
0.6554753579753579
[average] 0.7966081549169783
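If you only care about the aggregate, a hedged variant of the same command (using the -so flag described later) would suppress the five intermediate values and print only the average:
take -f log.txt -c "line(L), startswith(L,'AUCPR'), split_select(L,':',1,L1), strip(L1,L2), println(L2)" -a average -so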
Another example: extract the real value reported by the bash time command and convert it into seconds. Suppose you have a file log.txt of the form
Started at:
size 7
[RESULT] [4. 4.]
real 0m8.853s
user 0m8.231s
sys 0m0.110s
size 8
[RESULT] [4. 4.]
real 0m31.248s
user 0m30.784s
sys 0m0.177s
size 9
[RESULT] [4. 4.]
real 2m42.765s
user 2m41.007s
sys 0m1.205s
To do so:
take -f log.txt -c "line(L), startswith(L,'real'), split_select(L,tab,1,T), time_to_seconds(T,TS), println(TS)"
Output
8.853
31.248
162.765
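To also total these runtimes, a hedged variant of the same command could append an aggregate and suppress the intermediate output:
take -f log.txt -c "line(L), startswith(L,'real'), split_select(L,tab,1,T), time_to_seconds(T,TS), println(TS)" -a sum -so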
Install uv, clone the repo, and run uv pip install . If you want to edit the project, add the -e flag after install (i.e., uv pip install -e .).
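A minimal sketch of these steps, assuming a local clone (the repository URL and directory name below are placeholders):
git clone <repository-url>
cd take
uv pip install .    # or: uv pip install -e . for an editable install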
Variables start with an uppercase letter while constants start with a lowercase letter, are numbers, or are enclosed within single quotes.
The execution idea is simple: each command starts with a line/1 predicate, which unifies its argument with the content of the current file line. For instance, line(L) binds L to the content of the current line, since L is a variable; if the argument is a constant instead, it checks whether the current line is equal to that constant.
Then, the subsequent predicates are applied iteratively, in order of appearance, until one of them fails.
To print results, you can use the print/1 or println/1 predicates.
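For example, a minimal sketch (reusing the log.txt from the first example above) that keeps only the cross-validation separators:
take -f log.txt -c "line(L), startswith(L,'-----'), println(L)"
Here line(L) binds each line, startswith/2 fails on lines that do not start with -----, and println/1 prints the surviving ones.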
Available predicates:
- line(L): unifies L with the current file line. Note: each command must have line/1 in it
- print(L)/println(L): print the content of L (println/1 also adds a newline)
- startswith(L,P): true if L starts with P
- startswith_i(L,P): true if L starts with P, case insensitive
- endswith(L,P): as startswith/2, but checks the end of the string
- endswith_i(L,P): as startswith/2, but checks the end of the string, case insensitive
- length(L,N): true if L is of length N
- lt(L,N): true if L < N
- gt(L,N): true if L > N
- leq(L,N): true if L <= N
- geq(L,N): true if L >= N
- eq(L,N): true if L == N
- neq(L,N): true if L != N
- capitalize(L,C): C is the capitalized version of L, i.e., the first character is made uppercase and the rest lowercase
- split_select(L,V,P,L1): splits L at each occurrence of V, then unifies L1 with the split at position P, starting from 0. Fails if P is larger than the number of splits. Special split delimiters: V = space and V = tab
- replace(L,A,B,L1): replaces the occurrences of the string A in L with B and unifies L1 with the result
- contains(L,A): true if the string unified with L contains the string unified with A, false otherwise
- contains_i(L,A): as contains/2, but case insensitive
- strip(L,L1): removes leading and trailing whitespace from L and unifies L1 with the result
- time_to_seconds(L,L1): converts a bash time of the form AmBs into seconds (example: L = 2m42.765s into L1 = 162.765)
- add(A,B,C): C is the result of A + B
- sub(A,B,C): C is the result of A - B
- mul(A,B,C): C is the result of A * B
- div(A,B,C): C is the result of A / B
- pow(A,B,C): C is the result of A**B
- mod(A,B,C): C is the result of A % B
- abs(A,B): B is |A|
- substring(S,Start,End,ST): ST is the substring of S from position Start (included) to position End (excluded)
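As a quick illustration of how these predicates compose, here is a hedged sketch (f.txt is just a placeholder file name) that prints the first 20 characters of every line longer than 20 characters:
take -f f.txt -c "line(L), length(L,N), gt(N,20), substring(L,0,20,S), println(S)"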
You can also prepend not to predicates (except to line/1, print/1, and println/1) to flip the result.
You can pass arguments as strings by enclosing them in single quotes (e.g., 'Hello' will be treated as a string and not as a variable).
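For instance, a hedged sketch (assuming not is written directly before the predicate, as above, and that f.txt is a placeholder file) that prints every non-empty line not containing the string 'error':
take -f f.txt -c "line(L), not length(L,0), not contains(L,'error'), println(L)"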
You can also aggregate the results of applying the predicates to the file with the -a/--aggregate option.
Available aggregates (some are self-explanatory):
- count: counts the lines
- sum
- product
- average or mean
- stddev
- variance
- min
- max
- range: max value - min value
- summary: computes summary statistics (count, sum, mean, median, std dev, min, max, and range)
- concat: concatenates the lines
- unique: filters unique lines
- first
- last
- sort_ascending
- sort_descending
- median
- word_count
If you want only the result of the aggregation, suppressing the other output, you can use the flag -so/--suppress-output.
You can specify multiple aggregates by repeating the flag.
You can also aggregate data while keeping it separated by file, using the flag -ks.
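For instance, a hedged sketch (a.txt and b.txt are placeholder files) that reports, separately for each file, the number of matching lines and their word count:
take -f a.txt b.txt -c "line(L), contains(L,'ok'), println(L)" -a count -a word_count -so -ks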
Assume the file is called f.txt.
Count the empty lines from a file: take -f f.txt -c "line(L), length(L,N), lt(N,1), println(L)" -a count -so
Assuming you have a file where each line contains results separated by spaces and you want to pick the second element of each line and sum them all: take -f f.txt -c "line(L), split_select(L,space,1,L1), println(L1)" -a sum -so
These are the available options. This list may not be up to date, so you should check take --help.
options:
-h, --help show this help message and exit
-f FILENAME [FILENAME ...], --filename FILENAME [FILENAME ...]
Filename(s) to process
-c COMMAND, --command COMMAND
Command to process
-r, --recursive Process directories recursively
-so, --suppress-output
Suppress output, only show the result of the aggregation
-p, --plot Plot the results
-m MAX_COUNT, --max-count MAX_COUNT
Maximum number of lines to process overall (0 for no limit)
-H, --with-filename Print the filename in the output lines
--uncolored Disable colored output
--stats Show statistics about the processed files
--debug Show debug information
--max-columns MAX_COLUMNS
Maximum text length (0 for no limit)
-ks, --keep-separated
Keep the file data separated during aggregation and plotting
-a {count,sum,product,average,mean,stddev,variance,median,min,max,range,summary,concat,unique,first,last,sort_ascending,sort_descending,word_count}, --aggregate {count,sum,product,average,mean,stddev,variance,median,min,max,range,summary,concat,unique,first,last,sort_ascending,sort_descending,word_count}
Aggregation function to apply to the results
The following list shows some well-known commands and their counterparts using take.
Scan all the files with extension .txt, search for 'pos', and print line numbers and at most 5 matches per file:
- grep "pos" *.txt -m 5 -n
- take -f *.txt -c "line(L), contains(L,'pos'), line_number(L,N), print(N), print(':'), println(L)" -m 5 -H -ks
Print the first 10 lines of a file called filename.txt:
- head -n 10 filename.txt
- take -f filename.txt -c "line(L), line_number(L,I), leq(I,10), println(L)"
or take -f filename.txt -c "line(L), println(L)" -m 10
We ran a small experimental evaluation to benchmark the tool.
We considered files called stress_N.txt with N lines of the form Iteration IT: R, data where R is a random number.
The goal is to extract all the values of R and compute statistics on them, so we ran the command
take -f stress_N.txt -c "line(L), split_select(L, space, 2, L1), replace(L1, ',', '',L2), println(L2)" -a summary -H -so
We considered 10 runs of the same command, measured with multitime.
Overall, we ran multitime -n 10 take -f stress_N.txt -c "line(L), split_select(L, space, 2, L1), replace(L1, ',', '',L2), println(L2)" -a summary -H -so.
Results:
N = 100 (size 4.2K)
Mean Std.Dev. Min Median Max
real 0.476+/-0.0234 0.023 0.445 0.465 0.516
user 2.245+/-0.0715 0.071 2.108 2.243 2.369
sys 0.075+/-0.0152 0.015 0.051 0.074 0.099
N = 1_000 (size 42K)
Mean Std.Dev. Min Median Max
real 0.475+/-0.0240 0.024 0.445 0.474 0.515
user 2.283+/-0.1124 0.112 2.080 2.296 2.479
sys 0.086+/-0.0300 0.030 0.052 0.076 0.134
N = 10_000 (size 419K)
Mean Std.Dev. Min Median Max
real 0.593+/-0.0176 0.018 0.566 0.595 0.618
user 2.421+/-0.1013 0.101 2.213 2.443 2.592
sys 0.092+/-0.0253 0.025 0.052 0.092 0.127
N = 100_000 (size 4.1M)
Mean Std.Dev. Min Median Max
real 1.942+/-0.1521 0.152 1.668 1.981 2.148
user 3.750+/-0.1404 0.140 3.449 3.768 3.915
sys 0.104+/-0.0297 0.030 0.057 0.107 0.155
N = 1_000_000 (size 41M)
Mean Std.Dev. Min Median Max
real 14.054+/-1.1232 1.121 12.893 13.547 16.527
user 15.702+/-1.0858 1.083 14.650 15.145 17.965
sys 0.225+/-0.0231 0.023 0.182 0.232 0.258
Suggestions, issues, pull requests, etc., are welcome.
The program is provided as is and may contain bugs.