EARTH was originally conceived for a machine that allowed the EARTH runtime system to manage the inter-node network directly, including network-generated interrupts. In earlier implementations of EARTH, therefore, the runtime system decided when and how often to poll the network for incoming data or synchronization signals.
We designed and implemented an EARTH runtime system for a cluster of processors. For portability, we built this runtime system on top of Pthreads under Linux. This implementation enables the overlapping of communication and computation on a cluster of Symmetric Multi-Processors (SMPs), and lets the interrupts generated by the arrival of new data drive the system, rather than relying on network polling. An interesting research question for the implementation of a multi-threading system on a multi-processor, multi-node machine is how to arrange the execution and synchronization activities to make the best use of the available resources, and how to design the interaction between local processing and network activity.
Our new organization of the EARTH runtime system is composed of several modules. The Execution Module executes fibers and also takes on the responsibilities of intra-node scheduling, synchronization, communication, and load balancing, tasks that were performed strictly by the SU in previous implementations of EARTH. The Receiver (resp. Sender) Module handles incoming (resp. outgoing) messages. The Token Queue (TQ) contains work that may either be performed by a processor within the local node or be sent to a different node for execution. The Ready Queue (RQ) contains fibers that must be executed locally, and the Sender Queue (SQ) holds outgoing messages.
For our experimental work we used Ecgtheow, a cluster at Michigan Technological University comprising 64 dual-processor nodes. For one particular benchmark, Another Tool for Genome Comparison (ATGC), we obtained a relative speedup of 58.7 on 32 dual-processor nodes, and of 89.8 on 60 dual-processor nodes.