EARTH was originally conceived for a machine that allowed the EARTH runtime system to manage the inter-node network directly, including network-generated interrupts. In earlier implementations of EARTH, therefore, the runtime system decided when and how often to poll the network for incoming data or synchronization signals.
We designed and implemented an EARTH runtime system for a cluster of processors. For portability, we built this runtime system on top of Pthreads under Linux. This implementation enables the overlapping of communication and computation on a cluster of Symmetric Multi-Processors (SMPs), and lets the interrupts generated by the arrival of new data drive the system, rather than relying on network polling. An interesting research question for the implementation of a multi-threading system on a multi-processor, multi-node machine is how to arrange the execution and synchronization activities to make the best use of the available resources, and how to design the interaction between local processing and network activity.
Our new organization of the EARTH runtime system is composed of several modules. The Execution Module executes fibers and also takes on the responsibilities of intra-node scheduling, synchronization, communication, and load balancing, tasks that were performed strictly by the SU in previous implementations of EARTH. The Receiver (resp. Sender) Module handles incoming (resp. outgoing) messages. The Token Queue (TQ) contains work that may either be performed by a processor within the local node or be sent to a different node for execution. The Ready Queue (RQ) contains fibers that must be executed locally, and the Sender Queue (SQ) holds outgoing messages.
For our experimental work we used Ecgtheow, a cluster at Michigan Technological University comprising 64 dual-processor nodes. For one particular benchmark, Another Tool for Genome Comparison (ATGC), we obtained a relative speedup of 58.7 on 32 dual-processor nodes, and of 89.8 on 60 dual-processor nodes.