Molecular simulations are one class of applications of high-performance computing (HPC). HPC generally refers to the hardware and software environment that allows users to run simulations with many hundreds of thousands of degrees of freedom (or more) distributed across multiple processing elements (PEs), and even across multiple nodes, in a shared, general-purpose cluster. Essentially all HPC clusters nowadays run the Linux operating system, and users are (mostly) expected to interact with such clusters via the command line. For this class, we will mostly restrict our explorations to systems that are small enough NOT to require HPC hardware. However, we will focus on building skills with the Linux command line. If you are already familiar with the Linux command line, you can skip this subsection.
Linux is an operating system. The most basic (and sufficient) way to interact with Linux is via the command line. The program responsible for monitoring the command line, allowing the user and the operating system to interact, is called a shell. There are many shell programs you can choose to run, but the default for most Linux distributions nowadays is bash. This is also the default shell when you install Ubuntu on WSL2 in Windows; in macOS, the default shell is zsh, but this can easily be changed to bash (though this is not strictly necessary). I will demonstrate some simple exercises in bash here. In all the examples, the $ refers to the bash prompt.
Let’s first create a subdirectory for holding all your work in this course in your WSL:
$ cd
$ mkdir cheT580
$ cd cheT580
The cd command alone sets the current working directory to your home directory (/home/username/). The
mkdir command makes a new subdirectory under the current working directory, and the second cd changes
the current working directory to be that directory. (rmdir can remove an empty directory.)
Now, let’s create a simple file here, just to play around with.
$ echo "Hello, world!" > my_file.txt
$ ls
my_file.txt
$ cat my_file.txt
Hello, world!
$ rm my_file.txt
$ cat my_file.txt
cat: my_file.txt: No such file or directory
What did we do here? We created a file by redirecting the output of the echo command to my_file.txt.
(There are many, many ways to create a file; this is just one.) We then used the ls command to show all files
and subdirectory names in the current working directory; my_file.txt just happens to be the only one. We
then displayed the contents of this file to the terminal using the cat command. Finally, we removed the file
using rm, and when we then tried to cat it, we got an error message indicating that the file no longer
exists.
Enough playing around. Let’s make a directory called assignment1:
$ mkdir assignment1
Maybe you don’t like that name; you can destroy it with rmdir:
$ rmdir assignment1
Let’s not actually destroy this directory. If you just destroyed it, recreate it. Let’s cd into it, and then clone the
GitHub repository for assignment1:
$ cd assignment1
$ git clone github.com:<repository-name>
Here, <repository-name> should be replaced with the actual name. Now, you can follow the instructions
in the README.md you are viewing on GitHub.
Some other things: You can cd “up” to the parent directory of the current working directory like this:
$ cd ..
You can always ask the shell to tell you what the current working directory is using pwd:
$ pwd
/home/<username>/cheT580
$
No matter what your current working directory is, you can cd to your home directory like this:
$ cd
Go to your home directory, and let’s play with files a bit more. Let’s create a new text file with the cat
command. Type the following:
$ cat > my_file
This is a test file. Don’t panic.
<Ctrl-D>
$
<Ctrl-D> means perform the “control-D” key sequence, which signifies to the cat command that you are
finished writing to the file. The cat command on the first line waits for you to type some file contents into the
terminal, and the > redirects the output of cat (a copy of what you typed) to my_file. Now, we can list the contents of the current
directory (which is your home directory here) with the command ls. Guess what we will see?
$ ls
cheT580 my_file
$
If we use the -F flag with ls, we can easily see which files are files and which are directories:
$ ls -F
cheT580/ my_file
$
See the “/” after cheT580? That means it is a directory. Now, make a copy of the file my_file called
my_file2 using the cp command:
$ cp my_file my_file2
$ ls -F
cheT580/ my_file my_file2
$
We can rename a file with the mv command. Rename my_file2 to my_file3:
$ mv my_file2 my_file3
$ ls -F
cheT580/ my_file my_file3
Notice that my_file2 no longer exists. Now, move my_file3 into the cheT580/ directory with mv:
$ mv my_file3 cheT580
$ ls -F
cheT580/ my_file
$ ls -F cheT580
assignment1/ my_file3
$
Notice that the last command lists the contents of the cheT580 directory. We could also cd into that
directory and just type ls -F; we would see the same thing.
Those are all the basic file handling skills you will need to work with code for this course.
An important concept that arises because of the directory structure of Linux filesystems is that of relative and
absolute pathnames. A “relative” pathname is always interpreted starting from the current working directory, while an “absolute” pathname always starts from
the root directory. Suppose that in the assignment1 subdirectory of your cheT580 subdirectory of your home
directory, there is a file called my_file. That file can be referred to from any other directory using either a
relative or an absolute pathname. Suppose you are in your home directory and you want to view the contents of
that file using cat:
$ cd
$ cat cheT580/assignment1/my_file
The string cheT580/assignment1/my_file is the pathname of that specific file relative to your home
directory. Now, no matter what directory you are in, you can always refer to a file using its unique absolute
pathname:
$ cat /home/<username>/cheT580/assignment1/my_file
Absolute pathnames are a pain to type, but they have the benefit of being completely unambiguous.
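For readers who already know a little Python (introduced later in these notes), the relationship between the two kinds of pathname can be sketched as follows. This is only an illustration: the home directory /home/username is a placeholder, and the helper name to_absolute is hypothetical.

```python
from pathlib import PurePosixPath


def to_absolute(relative, start):
    """Join a relative pathname onto the directory it is relative to."""
    return PurePosixPath(start) / relative


# The relative pathname from the text, and a placeholder home directory.
rel = PurePosixPath("cheT580/assignment1/my_file")
home = PurePosixPath("/home/username")  # assumption: substitute your own

full = to_absolute(rel, home)
print(rel.is_absolute())   # False: meaning depends on where you are
print(full.is_absolute())  # True: starts at the root directory /
print(full)                # /home/username/cheT580/assignment1/my_file
```

The relative pathname only identifies the file once you know which directory it is relative to; the absolute pathname carries that information itself.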
The Ubuntu 20.04 distribution you installed from the Microsoft Store is a stable release version, but individual components
of the operating system are constantly being upgraded, sometimes to fix security issues. You should
get in the habit of keeping your Ubuntu up to date. This is done using apt in superuser mode:
$ sudo apt upgrade
[sudo] password for <username>:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
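(In this case, no updates are shown because my installation is already up to date.) Note that apt upgrade works from the package lists stored on your machine; running sudo apt update first refreshes those lists from the repositories. Performing this check once a week is a good idea; the default login message you see when you launch a WSL/Ubuntu terminal also informs you when updates are available.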
apt is Ubuntu’s “package manager” program, and it maintains a database of all packages installed and
which ones are upgradable. To do this, it connects periodically to remote repositories (hosted by Ubuntu) in
which updated packages are published. You can learn a lot about apt by looking at its manual pages using the
man command:
$ man apt
Although you will likely not need to do this in this course, a key concept in scientific computing using Linux is
the ability to remote in to other computers. This is very easy using the command-line, and is normally done
using the “secure shell” protocol’s ssh command:
$ ssh <username>@hostname.domain.edu
Here, hostname.domain.edu is the fully qualified domain name of a remote computer on which <username> has
login privileges. The next thing one normally needs to do is provide a password (and, if necessary, some kind of
two-factor authentication, like a one-time code or responding to a push notification on your phone). Typically,
once you are logged in you have a command-line interface just like you do locally. Typical workflows for remote
work involve uploading data and input files for simulation runs, running the simulations, and then downloading
output data back to a local machine.
Often, the “other computers” are actually login nodes that front enormous clusters of “compute nodes”. In these settings, execution of simulations is actually scheduled using a batch scheduler, and “running a simulation” actually amounts to submitting the commands necessary to run the simulation to the scheduler. The job of the scheduler is to decide when to run your program based on the availability of system resources. This kind of “batch” processing is typical of high-performance computing. If you are provided an account on a cluster, you will be trained on how to submit jobs to the scheduler (among other things), and a basic working knowledge of Linux is typically assumed for this kind of training.
Many universities maintain their own HPC facilities. Drexel’s University Research Computing Facility
(URCF) has two main clusters: proteus.urcf.drexel.edu and picotte.urcf.drexel.edu.
This is covered in Assignment 1, and I just go over basics here.
Programs written in C, Fortran, and some other languages must be compiled to generate executable
programs. Most programs we will work with are in C, and the default compiler for C in Linux is gcc. I’ll
demonstrate a typical workflow for writing, compiling, and running a C program here.
First, cd to your cheT580 subdirectory, and create and cd into a subdirectory called examples, then launch
VSCode:
$ cd
$ cd cheT580
$ mkdir examples
$ cd examples
$ code .
You should see the code window appear something like this:
Using the explorer panel, I can click on the new file icon and create a new file called hello_pi.c:
Now, let’s create a little C program:
Notice that two header files are included: stdio.h and math.h. I need stdio.h to use the printf() function, and I need
math.h to access the constant M_PI.
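Saving that with Ctrl-S, I can now launch a new Terminal inside VSCode (or just go back to the WSL
terminal), and compile and run. Based on the switches described next, the commands take this form:
$ gcc -o hello_pi hello_pi.c -lm
$ ./hello_pi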
The compile command is gcc and its main argument is the name of the C program hello_pi.c. The -o
switch is used to identify the name of the output of the compilation; here, that is the name of the executable,
and we’re choosing to call that hello_pi. If we do not include a -o switch, gcc calls the output a.out. The -lm
switch instructs gcc to include the precompiled standard math library; try omitting this switch and see what
happens.
The executable hello_pi lives in the same directory as the source code hello_pi.c. We can run it by just
typing the name of the executable, prepended with ./. This instructs the shell NOT to go looking in any
standard system directories for the name of the command (which it normally does), but instead to run the
command whose executable is found in the current directory. The current directory is always signified by ./.
Running this program provides the anticipated result.
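The “standard system directories” the shell searches are listed in the PATH environment variable. For readers who already know some Python, the lookup can be mimicked with the standard library’s shutil.which. This is a rough sketch of the idea, not a description of how bash itself is implemented, and the helper name find_on_path is hypothetical.

```python
import os
import shutil


def find_on_path(command):
    """Return the full pathname found for command by searching PATH, or None."""
    return shutil.which(command)


# The directories that are searched, in order.
for directory in os.environ.get("PATH", "").split(os.pathsep):
    print(directory)

# A system command such as ls is normally found in one of those directories.
print(find_on_path("ls"))

# hello_pi is not found unless some PATH directory happens to contain it,
# which is why we run it as ./hello_pi instead.
print(find_on_path("hello_pi"))
```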
Unlike C, Python is an interpreted programming language. This just means that you don’t have to compile it yourself before running it. Instead, you feed the program to the Python interpreter, which compiles and runs it for you and then exits.
Keeping that instance of VSCode running inside the examples subdirectory, let’s create a new file called
hello_pi.py:
In red, I’ve circled the little message indicating which Python interpreter VSCode will use if
you choose to run this program using VSCode. You may instead see a message here indicating
that you have to select a Python interpreter. (Windows users: If you did not install Python inside
your WSL/Ubuntu already, as instructed in Assignment 1, you can do so now using apt at the
command-line.)
Here is Python that does exactly the same thing as hello_pi.c:
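A minimal sketch of such a program is given here, assuming that hello_pi simply prints a short greeting together with the value of pi. The exact message and the helper name pi_message are assumptions made for illustration and may differ from the original listing.

```python
import math  # provides the constant math.pi, much like M_PI from math.h in C


def pi_message():
    """Build the line to print (hypothetical helper; the wording is an assumption)."""
    return f"Hello, pi! pi = {math.pi:.6f}"


if __name__ == "__main__":
    print(pi_message())
```

Note that there is no compile step and no counterpart of the -lm switch: importing the math module is all that is needed.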
Now, notice that little green “play” button in the upper-right? I can just click that to run the Python
program:
And we still get the anticipated result. We need not run the Python program inside VSCode; we are free to
run it at the command-line, but notice the command that VSCode issued to run the program: the program that
VSCode runs is actually the interpreter /bin/python3, and the argument of that command is the full pathname
of the Python script. You could alternatively issue that command at the bash prompt and get the same
result.
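To see these two pieces for yourself, a short script can report which interpreter is running it and the absolute pathname of a script. This is a sketch using only the standard library; the function name describe_invocation is hypothetical.

```python
import os
import sys


def describe_invocation(script_name):
    """Return (interpreter pathname, absolute pathname of the script)."""
    interpreter = sys.executable             # e.g. /bin/python3 or /usr/bin/python3
    script = os.path.abspath(script_name)    # resolved against the current directory
    return interpreter, script


if __name__ == "__main__":
    interpreter, script = describe_invocation(sys.argv[0])
    # Together these form the kind of command VSCode issues.
    print(interpreter, script)
```

Running it from the command line and from the VSCode play button lets you compare the interpreter each one uses.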