SQL-style map/reduce programming in bash
- Published on Thursday, 22 December 2011 12:50
- Hits: 1310
1. Download the bash map/reduce script
2. Processing collections
Quite a few shell scripting problems can be handled by applying the following pattern:
- [init]: statements to initialize the processing
- [list_produce]: statements to produce the list of lines to process
- [line_process]: statements to process the input line
For example, the following bash statements will convert a collection of CD audio files sitting on a CD to mp3 files on a connected mp3 player:
In the example above, the list is produced by the ls command, while each line is processed by the ffmpeg command.
3. The map() functional
A functional, a higher-order function, or a first-class function, is a function that accepts another function as argument or returns another function as result. For example, the input tag in HTML is a functional:
The HTML input tag accepts a function as argument for its onchange parameter. Therefore, it is a higher-order function. There are many definitions in circulation for functional programming, but in the end, functional programming always ends up solving programming problems through the use of functionals.
Map/Reduce programming is mostly mentioned in connection with distributed computing, and the desire to spread a particular workload over more than one CPU or machine. There are good examples available of how to achieve this goal in bash. If you'd like to look at a few more examples of map/reduce programming, you can also read the joel on software article.
In bash, the map functional processes an entire collection of lines on stdin and (optionally) outputs the resulting collection on stdout:
If we rewrite the example above using the map functional, we obtain the following expression:
Note: In general terms, the map functional can be used as following, C2=map(C1,f). The functional accepts a collection C1 as input and a function f, in order to produce the collection C2. If item is an element of C1, then f(item) will be an element of C2.
In its simplest incarnation, a map shell script can be implemented as following:
It is indeed as simple as that. At the basis, the map functional is just a foreach, also called a fold looping construct, that you abstract away. Looping over a collection can indeed be achieved, simply by invoking a functional with as argument, the function to execute for each element in the input.
4. SQL on one table with one column
Map/Reduce programming is, a much more useful programming construct than just a technique for achieving infrastructure scalability.
In shell programming, stdin consists of lines separated by newline characters. Each line is one processing unit, that is, one record. At the basis, stdin looks like a table with just one column.
SQL is simply a functional programming language. The map/reduce functionals can be used to provide functionality that looks pretty much like what you would be doing in SQL.
Note that the map, select, and execute functionals are synonyms. They all refer to exactly the same program.
- map: use this name, when the emphasis is on producing a second collection on stdin
- select: use this name, when you want to express that the expression will have no side effects on the system (won't change it)
- execute: use this name, when you want to express that the expression is primarily used to create side effects, and will not return useful data
You can download the bash script that implements these functionals at the top of this blog post, and play with it yourself. Feel free to let me know what your own findings are.
In the example above, we should have used the execute name instead of map, because the main purpose for invoking the functional, is to create side effects, and not to return data:
An example for the select synonym:
In the example above, the select functional returns all items that start with the letter "T". Because it is so often done, I added the option --first to only return the first result. You could achieve the same by piping the output to the head -n 1 command:
Now, let's return the result in uppercase, by using the tr a-z A-Z command:
Even though we are only querying one table at the same time, and although this table currently only supports one column, the command looks already much more like SQL than like a shell command. This is no coincidence. The SQL select is only a slightly extended map functional.
5. The reduce() functional
The reduce functional initializes an accumulator to an initial value, and updates the accumulator with each line in the input. At the end of the input, the functional can optionally output the value of the accumulator.
Note: In general terms, the reduce functional, r=reduce(C,finit,f) accepts a collection C as input and two functions, finit and f, in order to produce the output r.
For example, let's count the number of files in the parent directory:
There are 22 files and folders in the parent directory. Of course, we could only count the ones starting with the letter "T":
Note: If we managed to sort a multifield output and fire a conditional reduce statement on each group in the output, we could execute SQL GROUP BY and HAVING functionality in Bash. But then again, I will handle this possibility in a future blog post.
6. The exists() functional
The reduce functional is much more useful that it looks at first glance. For example, The exists functional can be defined as:
We can use the exists functional directly, as following:
The exists functional returns "0" when the condition is true and empty string "" when the condition is false.
In the following example, we can see a relatively useless embedded exists statement (because it is always true anyway), but it is just an example:
In this example, we also managed to get ourselves into an embedded quotes nightmare. In order to truly correlate both queries, we will also need to export the item variable from the parent shell to the embedded exists subshell.
Even though the expression ends up looking quite ugly, the construct truly represents a variant of the SQL EXISTS functionality.
7. The underlying script
The reduce shell script that you can download in this blog post, takes the following arguments:
Show/Hidden bash code$ ./reduce --help Usage: reduce [OPTION] [ACTION] synonyms: map, select, execute specific version: exists maps or reduces the stream on stdin and outputs the reduced stream to stdout using the commands in [ACTION] for each unit on stdin. in the action you can set or refer to the following variables: VARIABLES $init_value initial value for the accumulator $accumulator the accumulator $i the loop counter $item the current unit being processed (char or unit) $char the current character being processed, in character mode $line the current line being processed, in line mode the action defaults to echo $item. none of the following options is mandatory: STANDARD OPTIONS -h,--help display this help and exit MODUS OPTIONS -l,--line read line by line; this is the default; processing unit=line -c,--char read char by char; processing unit=char -f,--first return first result and then exit ACCUMULATOR OPTIONS -o,--output-accumulator print the value of the accumulator variable after finishing looping over the input REDUCE OPTIONS -i,--init-value 'value' set the initial value for the accumulator variable to 'value' -w,--where 'condition' only execute the action for the unit, if 'condition' is true -s,--condition-stop 'condition' prematurely stop processing if 'condition' is true
As you can see, we can already get quite far with one simple script, when we want to do SQL-style programming in bash. You may ask yourself the question what it would take to create an entire relational database in shell scripting?
In my impression, the maximum it would take, is:
- a relatively primitive mechanism that can manage to index a tabular text file with fields in its columns
- a program to read from these files along its indexes, and output the resulting rows as lines on stdout
Since Bash already provides the embedded scripting engine that any relational database engine needs, in order to execute its SQL or SQL-style commands, it would therefore be possible to do very SQL-like things with such textual query sources, while handily intermixing this functionality with more traditional shell script programming, as well as with traditional SQL queries against traditional relational databases.