The library (called cl-cluster) uses SSH to get a remote shell, start a Lisp there, and execute the given code directly in it. I chose SSH because I didn't want to write my own TCP server and client. I also didn't want to write a fault-tolerant remote object broker with migration between hosts and so on, so the transport level can be really simple.
Besides the not very interesting stuff like making a connection and sending and receiving data, there were a few notable moments:
One problem was how to prevent the remote Lisp from falling into the debugger. There are two kinds of errors to trap: parse errors and execution errors. The latter are relatively simple: I just need to wrap the main body of the request in (handler-case ...). Parse errors are different, because they are signaled by the reader before handler-case gets a chance to catch them. The best solution I found was to pass the body as a string, parse it on the remote side with read-from-string, and eval the result. This makes the toplevel input form error-free and fully controlled by the library. Of course, handler-case and read-from-string eat some speed, but that's not important for my proof of concept :)
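A minimal sketch of this idea, not the library's actual code: the function name safe-eval-string and the two-value return convention are assumptions made up for illustration.

```lisp
;; Hypothetical helper: evaluate a request that arrives as a string.
;; Both reading and evaluation happen inside HANDLER-CASE, so neither
;; a malformed request nor a runtime error can reach the debugger.
(defun safe-eval-string (request)
  "Return the value of the form in REQUEST and NIL,
or NIL and a description of the error that was caught."
  (handler-case
      (values (eval (read-from-string request)) nil)
    (error (condition)
      (values nil (princ-to-string condition)))))

(safe-eval-string "(+ 1 2)")  ; first value is 3, second value is NIL
(safe-eval-string "(+ 1 2")   ; unbalanced form: the reader error is caught
(safe-eval-string "(/ 1 0)")  ; the execution error is caught
```

Because the toplevel form sent over the wire is always the same call with a string argument, the only thing the remote reader ever has to parse at toplevel is a form that the library itself generated.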
Another problem: not all Lisp objects can be read back from their printed representation. In particular, I ran into the problem of parsing remote exceptions on the local side. I ended up returning exceptions as a list with the symbol "error" as the first element and the error description as the second. It is very unlikely that regular user-provided code will produce such a result. On the local side I raise an exception with the received description, noting the node where the exception really happened. In practice, it works well.
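The error convention could look roughly like this. It is only a sketch: the condition type remote-error and the functions encode-error and decode-reply are hypothetical names, not taken from cl-cluster.

```lisp
;; Hypothetical condition for an error that happened on a remote node.
(define-condition remote-error (error)
  ((node :initarg :node :reader remote-error-node)
   (description :initarg :description :reader remote-error-description))
  (:report (lambda (condition stream)
             (format stream "Error on node ~a: ~a"
                     (remote-error-node condition)
                     (remote-error-description condition)))))

;; Remote side: turn a condition object (which does not print readably)
;; into a plain list of a symbol and a string (which does).
(defun encode-error (condition)
  (list 'error (princ-to-string condition)))

;; Local side: re-signal an encoded error, pass any other reply through.
(defun decode-reply (node reply)
  (if (and (consp reply)
           (eq (first reply) 'error)
           (consp (rest reply))
           (stringp (second reply))
           (null (cddr reply)))
      (error 'remote-error :node node :description (second reply))
      reply))
```

A reply such as 3 passes through decode-reply unchanged, while a two-element list of the symbol error and a string is signaled locally as a remote-error that carries the node name.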
One more trap was passing package names to the remote side: the format directive ~a simply drops package prefixes, producing (oos 'load-op :asdf) instead of (asdf:oos 'asdf:load-op :asdf). This is not so tricky for an experienced Lisper, but I spent some time understanding what was going on and then switched the directive to ~s.
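To illustrate the difference, here is a small sketch. The helpers form-to-request and request-to-form and the throwaway package DEMO-PKG are made up for the example, and wrapping the printer in with-standard-io-syntax is my assumption, not necessarily what the library does.

```lisp
;; Print a form so that the remote reader can reconstruct it exactly.
;; ~S prints like PRIN1: symbols keep their package prefixes where
;; needed and strings keep their quotes.
(defun form-to-request (form)
  (with-standard-io-syntax
    (format nil "~s" form)))

;; What the receiving side would do with the request string.
(defun request-to-form (request)
  (with-standard-io-syntax
    (read-from-string request)))

;; A form containing a symbol from a package that CL-USER does not use.
(defparameter *demo-form*
  (let ((pkg (or (find-package "DEMO-PKG")
                 (make-package "DEMO-PKG" :use '()))))
    (list (intern "RUN" pkg) "file.lisp")))

;; ~A prints for humans: the string loses its quotes, so this text
;; no longer reads back as the original form.
(format nil "~a" *demo-form*)

;; With ~S the form survives the round trip.
(equal *demo-form* (request-to-form (form-to-request *demo-form*)))
```

As a side effect, with-standard-io-syntax binds *print-readably* to true, so an object that cannot be printed readably signals an error on the sending side instead of producing a request that the remote reader would choke on.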
In the end I wanted a let-like macro that evaluates its bindings in parallel on remote machines, gathers the results, and passes the computed bindings to another machine. For example:
;; calculate some stuff on remote lisps and aggregate answers on another remote
((a (with-remote node1
(+ 1 2)))
(b (with-remote node2
(* a b))
A and B will be computed in parallel on node1 and node2 respectively; after that, node3 will get the request:
(let ((a someval)
      (b someval))
  (* a b))
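How such a request could be assembled is sketched below. Everything here is hypothetical: run-on-node is a local stand-in for the SSH transport, remote-let is not the macro's real name, and the bindings are evaluated one after another rather than in parallel.

```lisp
;; Stand-in for the real transport: pretend that NODE evaluated FORM.
(defun run-on-node (node form)
  (declare (ignore node))
  (eval form))

;; Build the LET form that the aggregating node receives. The gathered
;; values are quoted so that the receiver does not evaluate them again.
(defun build-let-request (names values body)
  `(let ,(mapcar (lambda (name value) `(,name ',value)) names values)
     ,@body))

;; Let-like macro: compute every binding first, then ship the
;; assembled LET form to NODE.
(defmacro remote-let (node bindings &body body)
  `(run-on-node ',node
                (build-let-request ',(mapcar #'first bindings)
                                   (list ,@(mapcar #'second bindings))
                                   ',body)))

;; node1 and node2 compute the bindings, node3 aggregates them.
(remote-let node3
    ((a (run-on-node 'node1 '(+ 1 2)))
     (b (run-on-node 'node2 '(+ 3 4))))
  (* a b))
;; => 21
```

In a real implementation the forms inside (list ...) would be dispatched concurrently and their results collected before the final request is built.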
To dispatch the parallel tasks I used the cl-pmap library, written by my friend Alexey Voznyuk a.k.a. Swizard. It's not the best solution, because it uses a thread per task, whereas all I needed was a simple I/O multiplexer that writes data to several output streams and reads the results back from the input streams. It would be a good idea to use iolib instead.
Recently I tried to make the library portable across the popular Lisps, but there seems to be no existing wrapper for asynchronous shell execution that allows communicating with the process via streams. The closest library, trivial-shell, doesn't have this capability. A quick look through the trivial-shell sources told me that doing this on some implementations would be problematic, due to bugs or other reasons.
The latest idea is to get rid of SSH and use ZeroMQ for transport :) A real-life application would help to produce good zmq bindings, and zmq's subscriber model could help cl-cluster propagate common changes across all slave nodes. Not to mention that zmq can use advanced networking techniques and save some microseconds ;)
As usual, the sources are hosted at repo.or.cz: http://repo.or.cz/w/cl-cluster.git