Parallel Computation on the Cluster

This document outlines methods for writing C and C++ code that exploits the potential of Drexel ECE’s statistical signal processing computer cluster.

1 Cluster Composition and Layout

As of August 6, 2007, the cluster consisted of 12 computation nodes (Apple Xserves), an Xserve RAID 2.05TB file system hosted by the file server, an ethernet switch, and a power backup unit. We discuss each of these components presently.

1.1 Xserve Nodes

The computation nodes in the cluster have domain names and IP addresses:

As you can find out by logging into a node with VNC, clicking the apple in the upper left hand corner, and selecting “about this mac,” each node contains:

1.2 Xserve RAID File System

A Xserve RAID array of 2.05 TB total space of disk storage, is mounted using NFS on each machine under


A second set of drives (in the same XserveRAID) containing 1.4 TB of storage is mounted using NFS on each compute node under


Information about administering Xserve RAID array can be found in its datasheet and its technology overview. To communicate with the host Xserve, the Xserve RAID uses the fibre channel bus. The host Xserve then communicates to the rest of the network nodes haydnXX via the gigabit ethernet. Note that is an older Xserve, and thus has a different hardware configuration, including

Since it is a slower computer to begin with, and it also must handle the file requests and LDAP information for all of the nodes in the cluster, schubert should not be used as a compute node.

1.3 SMC Ethernet Switch

The Xserves and the Xserve RAID file system are all connected to a SMC Networks EZ Switch 10/100/1000 (SMCGS24C-Smart). Information about the switch can be found in its data sheet and its manual.

2 Remote Login

2.1 Secure Command Line Interface (SSH) and File Transfer (SFTP)

2.2 X Windows Remote Graphics

2.3 Remote Desktop

If you would prefer to be just as if you were the operator at a keyboard and display plugged into on of the network nodes, you can use a remote desktop service. If anyone else is logged into the computer, even with a different user name and different privileges, however, you will both be seeing and manipulating the same desktop. For this reason all users are recommended to uses the X windows option above instead. However, remote desktop will remain enabled (with a non-public password) just in case it is needed for maintenance tasks.

3 Batch Job Submission

There are several signal processing computations which can be performed on parallel processors. These include cases where an algorithm can be decomposed into independent computations or when a large data set can be subdivided and each subset processed independently.

The Signal processing lab has a cluster of twelve computation nodes each of which consists of 2 dual core processors. The processors have individual cache memory and share the common storage on an Xserve RAID. This facility allows us to run multiple instances of the same code on several of the nodes simultaneously or to optimize a single instance of a program to take advantage of the parallelism inherent in the problem.

There are two main approaches to parallel programming. The message passing model where processes pass messages to communicate with other processes and the directives based data parallel model where programming languages make serial programs parallel by the use of directives which tell the compiler how to distribute work and data among the processors.

4 openMPI

In order to efficiently run code on the cluster, the code written makes use of both openMPI and openMP as well as numerical libraries like Lapack. OpenMPI is a message passing interface that allows communication between processes running on different nodes.It consists of a set of C functions or Fortran subroutines. For instance, if certain variable computed on one node needs to be passed to another node, appropriate MPI directives are included in the code. Each process is identified by rank and communication between processes is effected by calls to MPI communication routines.

In MPI processes are assigned work based on their rank. At run time, one need only specify the number of nodes on which the multiple processes are to be run. For instance, if a simulation is to be performed multiple times, it can be run on the nodes simultaneously.

5 openMP

OpenMP is an API for parallelizing C,C++ and Fortran programs on shared memory architectures. When using openMP, compiler directives are inserted in the code so that the executable code is ideal for parallel processing. OpenMP can be used in conjunction with openMPI where openMPI interconnects the machines in the cluster and openMP ensure that the codes is efficient to run in parallel each of the nodes (with each node having multiple processors).

In all applications of parallel processing careful decomposition of the problem is imperative. This division is either domain decomposition or functional decomposition. Also, it is important to ensure that the time saving due to parallel execution is not lost due to the need to communicate between multiple instances of the program.

Useful introductory courses on openMP,and openMPI can be found at