Adaptive Signal Processing and Information Theory Research Group: Contact

Parallel Computation on the Cluster

This document outlines methods for writing C and C++ code that exploits the potential of Drexel ECE’s statistical signal processing computer cluster.

1 Cluster Composition and Layout

As of August 6, 2007, the cluster consisted of 12 computation nodes (Apple Xserves), a 2.05 TB Xserve RAID file system hosted by the file server schubert.ece.drexel.edu, an Ethernet switch, and a power backup unit. We discuss these components below.

1.1 Xserve Nodes

The computation nodes in the cluster have domain names and IP addresses:

haydn01.ece.drexel.edu 129.25.60.169
haydn02.ece.drexel.edu 129.25.60.170
haydn03.ece.drexel.edu 129.25.60.171
haydn04.ece.drexel.edu 129.25.60.172
haydn05.ece.drexel.edu 129.25.60.173
haydn06.ece.drexel.edu 129.25.60.174
haydn07.ece.drexel.edu 129.25.60.175
haydn08.ece.drexel.edu 129.25.60.176
haydn09.ece.drexel.edu 129.25.60.177
haydn10.ece.drexel.edu 129.25.60.178
haydn11.ece.drexel.edu 129.25.60.179
haydn12.ece.drexel.edu 129.25.60.180

As you can find out by logging into a node with VNC, clicking the Apple menu in the upper left-hand corner, and selecting “About This Mac,” each node contains:

4 GB RAM (DDR2 FB-DIMM, 667 MHz) composed of 4 × 512 MB sticks and 2 × 1 GB sticks.
2 dual-core 2.0 GHz Intel Xeon 5130 CPUs, for a total of 4 cores per node. Additional specifications of these processors, which may be deduced from the output of the command sysctl -A and from Intel’s product page, include:
- 2 GHz clock speed.
- Intel (R) 64 technology (see EM64T below). Do not confuse this with IA-64, which includes, for instance, the Itanium processors! Our processors can also execute IA-32 instructions.
- 32 kB L1 instruction cache and 32 kB L1 data cache per core.
- 4 MB L2 cache per processor, shared between the two cores in the processor.
- Supports MMX, SSE, SSE2, SSE3, XD, and EM64T (Extended Memory 64 Technology).
- 1333 MHz Front Side Bus.
- Execute Disable bit support, Intel Virtualization Technology Support.
- FPU and MMX units use 128-bit wide registers.
Serial Attached SCSI (SAS) bus attached to 2 × 76 GB local HDD (Seagate SATA). These have been configured into a single 76 GB mirrored RAID array for increased data reliability (see Disk Utility).
Two on-board 1000baseT Ethernet Connections. (As of August 17, 2007 only one of these was being used per node.)

1.2 Xserve RAID File System

An Xserve RAID array with 2.05 TB of total disk storage is mounted using NFS on each machine under

/Network/Servers/schubert.ece.drexel.edu/Volumes/Lab318

A second set of drives (in the same Xserve RAID) containing 1.4 TB of storage is mounted using NFS on each compute node under

/Network/Servers/schubert.ece.drexel.edu/Volumes/MET-lab

Information about administering the Xserve RAID array can be found in its datasheet and its technology overview. The Xserve RAID communicates with its host Xserve, schubert.ece.drexel.edu, over a Fibre Channel bus. The host then communicates with the compute nodes haydnXX via gigabit Ethernet. Note that schubert.ece.drexel.edu is an older Xserve, and thus has a different hardware configuration, including

2 GHz PowerPC G5 processor with 512k L2 Cache.
1 GB DDR SDRAM
2 × 76.69 GB SATA HDDs. These are combined into a single 76.6 GB mirrored RAID drive for data redundancy.

Since it is a slower computer to begin with, and it also must handle the file requests and LDAP information for all of the nodes in the cluster, schubert should not be used as a compute node.

1.3 SMC Ethernet Switch

The Xserves and the Xserve RAID file system are all connected to an SMC Networks EZ Switch 10/100/1000 (SMCGS24C-Smart). Information about the switch can be found in its data sheet and its manual.

2 Remote Login

2.1 Secure Command Line Interface (SSH) and File Transfer (SFTP)

Mac Users and Linux Users: Your Linux distribution or Mac OS X installation probably came with ssh installed by default. If not, install a copy with either yum (Fedora) or apt-get (Mac OS X). Once ssh is installed and on your path, you can connect with user name xxx using commands of the form ssh xxx@haydnXX.ece.drexel.edu.
Microsoft Windows Users: An SSH client is available free through Drexel at the IRT software website. It provides an easy-to-use GUI for file transfer and a terminal.

2.2 X Windows Remote Graphics

Mac Users and Linux Users: You will need an X Window System server running on your local machine. Almost all Linux desktop environments (e.g. GNOME and KDE) are built on top of one, and Mac OS X users will usually have one such as XFree86 installed. You can then use the ssh -X option to tunnel an X connection through ssh. After that, graphical windows (e.g. plots from MATLAB) started on the remote machine should be displayed on the local display.
Microsoft Windows Users: Make sure that you enable X11 tunneling in your ssh/sftp client. The program X-Win32 is an X server for Windows available for free via Drexel at the IRT software website. To enable X11 tunneling in the ssh client, go to Edit -> Settings, highlight Tunneling under Profile Settings, and click the check box to enable X11 connections under both the Outgoing and Incoming tabs. You will need to have the X server running as well as the ssh client to have the remote graphics displayed locally.

2.3 Remote Desktop

If you would prefer to work as if you were the operator at a keyboard and display plugged into one of the nodes, you can use a remote desktop service. However, if anyone else is logged into the computer, even with a different user name and different privileges, you will both be seeing and manipulating the same desktop. For this reason, all users are encouraged to use the X windows option above instead. Remote desktop will remain enabled (with a non-public password) in case it is needed for maintenance tasks.

Mac Users: Users running Mac OS X can email Dr. Youngmoo Kim (ykim@drexel.edu) for the Apple Remote Desktop client if they don’t have it already.
Windows or Linux Users: Windows or Linux users can run TightVNC’s vncviewer (free) tunneled through SSH.
- If you have ssh (secure shell) installed and available in the current path, use the command ssh username@haydnXX.ece.drexel.edu -L 5900:127.0.0.1:5900 to connect to haydnXX (with XX replaced by the appropriate number). This sets up the tunnel through SSH. You can then start TightVNC and connect to 127.0.0.1 to view the remote desktop.
- If you instead use PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/) for your SSH connection, you can find directions on how to forward the necessary ports through SSH at http://members.shaw.ca/nicholas.fong/vnc/.

3 Batch Job Submission

There are several signal processing computations which can be performed on parallel processors. These include cases where an algorithm can be decomposed into independent computations or when a large data set can be subdivided and each subset processed independently.

The signal processing lab has a cluster of twelve computation nodes, each of which contains two dual-core processors. Each processor has its own cache memory, and all nodes share common storage on an Xserve RAID. This facility allows us either to run multiple instances of the same code on several nodes simultaneously, or to optimize a single instance of a program to take advantage of the parallelism inherent in the problem.

There are two main approaches to parallel programming: the message passing model, in which processes communicate by passing messages to one another, and the directive-based data parallel model, in which directives added to a serial program tell the compiler how to distribute work and data among the processors.

4 Open MPI

To run efficiently on the cluster, code written for it makes use of both Open MPI and OpenMP, as well as numerical libraries such as LAPACK. Open MPI is an implementation of the Message Passing Interface (MPI), which allows communication between processes running on different nodes. It consists of a set of C functions or Fortran subroutines. For instance, if a variable computed on one node needs to be passed to another node, the appropriate MPI calls are included in the code. Each process is identified by its rank, and communication between processes is effected by calls to MPI communication routines.

In MPI, processes are assigned work based on their rank. At run time, one need only specify the number of processes to launch and the nodes on which they are to run. For instance, if a simulation is to be performed multiple times, the runs can be executed on the nodes simultaneously.
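
The rank-based assignment of work described above can be sketched as follows. This is a minimal illustration rather than code from the cluster: the function run_trial is a hypothetical stand-in for one independent simulation run, and the trial count of 1000 is an assumed value. Each process handles the trials whose index matches its rank modulo the number of processes, and the partial sums are then passed to rank 0 with MPI_Reduce.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for one independent simulation run. */
static double run_trial(int trial)
{
    return 1.0 / (trial + 1.0);
}

int main(int argc, char **argv)
{
    const int num_trials = 1000;   /* assumed workload size */
    int rank, size;
    double local_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Cyclic distribution: rank r runs trials r, r + size, r + 2*size, ... */
    for (int t = rank; t < num_trials; t += size)
        local_sum += run_trial(t);

    /* Sum the partial results from all ranks onto rank 0. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("mean over %d trials on %d processes: %f\n",
               num_trials, size, total / num_trials);

    MPI_Finalize();
    return 0;
}
```

With Open MPI installed, a program like this would typically be compiled with the mpicc wrapper and started with something like mpirun -np 4 ./trials, where the executable name trials is only a placeholder.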

5 OpenMP

OpenMP is an API for parallelizing C, C++, and Fortran programs on shared-memory architectures. When using OpenMP, compiler directives are inserted into the code so that the compiler can generate an executable that runs in parallel. OpenMP can be used in conjunction with Open MPI: Open MPI handles communication among the machines in the cluster, while OpenMP ensures that the code runs efficiently in parallel within each node (each node having multiple processors).

In all applications of parallel processing, careful decomposition of the problem is imperative. This division is either a domain decomposition (dividing the data) or a functional decomposition (dividing the tasks). It is also important to ensure that the time saved by parallel execution is not lost to communication between multiple instances of the program.

Useful introductory courses on OpenMP and Open MPI can be found at http://ci-tutor.ncsa.uiuc.edu/.