ODU Research Computing Documentation

Getting Started

Welcome to the ODU Research Computing Services (RCS) Documentation site! We hope this documentation helps you start your journey in High Performance Computing.

Before we dive into the important parts, such as the job scheduler, environment management, and the details of our clusters, we cover some basic concepts and answer a few frequently asked questions: how to get an account on our cluster, how to connect, and how to reach us when you need help. Whether you have questions or run into issues, we are always glad to help. Finally, we list free, easily available resources on using Linux; we hope you find them useful.

Getting An Account

Use your Monarch Identification and Authorization System (MIDAS) account to access High Performance Computing resources. Email itshelp@odu.edu to request activation of the HPC service on your account.

If you do not have a MIDAS account yet, please visit here.

Getting Help

When you need help, the Research Computing Services group is always here for you. Reach us by email at itshelp@odu.edu or contact our members individually.

Name              Phone     Email             Role
John Pratt        683-3088  jpratt@odu.edu    HPC Supervisor
Wirawan Purwanto  683-3586  wpurwant@odu.edu  Computational Scientist
Terry Stilwell    683-7145  tstilwel@odu.edu  HPC System Engineer
Min Dong          683-7149  mdong@odu.edu     HPC System Engineer
Adrian Jones      683-3678  axjones@odu.edu   HPC System Engineer

Feel free to visit us directly at:

  Engineering & Computational Sciences Bldg, 4700 Elkhorn Avenue
  Suite 4300
  Norfolk VA, 23529

Connecting to the Cluster

There are two ways to remotely access our resources:

  • Secure Shell (SSH)

    SSH is an encrypted network protocol. It enables users to log in remotely to our servers and operates securely over the Internet.

    Using SSH, you can connect to our resources from on campus or from home.

  • X Window System (X11)

    X11 is the foundation of UNIX graphical interfaces. It allows users to reach our resources graphically and remotely.

    We support X11 and X11 over SSH on all of our resources and we support X11 over Remote Desktop Protocol (RDP) for the Turing Cluster.

In addition to the above, we also support Secure Shell File Transfer Protocol (SFTP) for file access to our cluster.
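
As a minimal sketch, the commands below show one way to move files over SFTP from a UNIX-like workstation; the file names are placeholders, and WinSCP (described later) provides the same functionality on Windows.

# copy a file to your cluster home directory
$ scp mydata.tar.gz YOUR_MIDAS_USERNAME@turing.hpc.odu.edu:~/

# or start an interactive SFTP session
$ sftp YOUR_MIDAS_USERNAME@turing.hpc.odu.edu
sftp> put mydata.tar.gz
sftp> get results.txt
sftp> quit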

Cluster Access Matrix

Cluster  Cluster Hostname      SSH Support  X11 Support  X11 over RDP Support  File Transfer Support
Wahab    wahab.hpc.odu.edu     yes          yes          yes                   yes
Turing   turing.hpc.odu.edu    yes          yes          yes                   yes
Hadoop   namenode.hpc.odu.edu  yes          yes          no                    yes

SSH Clients for Microsoft Windows

Windows does not come with an SSH client, but there are many great, free options available. We list a few that are popular among SSH users on Windows:

  1. PuTTY
  2. KiTTY - an enhanced PuTTY
  3. Cygwin

SSH Clients for Linux, Mac, or other flavors of UNIX

SSH is installed on most UNIX-like systems by default. You can use the following command in your terminal to connect to our resources:

ssh YOUR_MIDAS_USERNAME@CLUSTER_HOSTNAME

In case you cannot find the ssh command, please try installing the following package with the package manager that ships with your OS.

  • Debian/Ubuntu/Mint - openssh-client
  • Redhat/CentOS/Fedora - openssh-clients
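
For example, on a typical installation the commands below would install the client; the exact invocation depends on your distribution, and you need administrative privileges.

# Debian/Ubuntu/Mint
$ sudo apt-get install openssh-client

# Redhat/CentOS/Fedora
$ sudo yum install openssh-clients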

SSH with PuTTY/KiTTY

PuTTY/KiTTY does not require installation. You can simply download the executable and double-click it to run.

We suggest using PuTTY or KiTTY for SSH access from Windows. Please download PuTTY or KiTTY first.

  • Step 1: Start PuTTY/KiTTY.

    step-0.png

  • Step 2: Enter the cluster hostname.

    step-1.png

  • Step 3: Click yes if this is your first time connecting to the cluster.

    step-2.png

  • Step 4: Enter your credentials.

    step-3.png

X11 Graphical User Interface for Microsoft Windows

We suggest X11 over RDP since every Windows installation includes a built-in Remote Desktop Connection application. This approach requires minimal configuration.

  • Step 1: Start Remote Desktop Connection.

    step-0.png

    You can find it by searching in your Start menu.

  • Step 2: Enter the cluster information.

    step-1.png

  • Step 3: Enter your credentials.

    step-2.png

X11 Graphical User Interface over SSH for Linux, Mac, or other flavors of UNIX

ssh -C -X YOUR_MIDAS_USERNAME@CLUSTER_HOSTNAME

# -X enables X11 forwarding
# -C compresses the data to enhance graphical performance

Using X11 this way is normally much easier and requires no configuration, so we highly recommend X11 over SSH.
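
Once connected, you can do a quick sanity check that X11 forwarding is active; this sketch assumes a simple X client such as xclock is installed on the login node.

$ echo $DISPLAY     # should print a forwarded display such as localhost:10.0
$ xclock            # a small clock window should appear on your local screen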

X11 Graphical User Interface over RDP for Mac Users

By default, macOS does not come with an RDP client, but you can download the Microsoft Remote Desktop app from the App Store. The app enables you to connect to our resources over RDP.

Once installed, please perform the following steps to establish a connection to the cluster.

  • Step 1: Start the app and click the New button.

    step-0.png

  • You will see a dialog asking for your connection information. You only need to fill this in once.

    step-1.png

  • Input your information and click the Close button; there is no Save button.

    step-2.png

  • Double-click the entry you just created to establish your connection.

    step-3.png

X11 Graphical User Interface over RDP for Linux or other flavors of UNIX

You have several choices. Usually, a remote desktop viewer comes with your distribution. If it is not available, you may want to install one of the following clients:

  • Vinagre - for GNOME/Unity/any GTK+ based environment
  • KRDC - for KDE/any Qt based environment

File Transfers

We recommend using WinSCP to transfer files to and from the cluster; it can be downloaded from the WinSCP website. After you download and install the application, it needs to be configured to connect to the cluster (either Turing or Wahab).

  • Go to File -> Site Manager -> New Site

    Step-1.png

  • Input your information. This only needs to be done once.

    Step-2.png

    You can click Login once you are finished.

FAQ

Q: Can I use any of the resources offered by the Research Computing Group?

A: Yes, if you are a current Old Dominion University student, faculty member, or staff member. Also yes, if you are conducting research in collaboration with ODU faculty or staff.

Q: Does the Research Computing Group offer software X or service Y?

A: If you cannot find it in our documentation, you can always contact us and let us know what you need. We try our best to accommodate every user's request.

Slurm Workload Manager

The Slurm Workload Manager (formerly known as the Simple Linux Utility for Resource Management), or Slurm, is a very successful job scheduler that enjoys wide popularity in the HPC world. More than 60% of the TOP500 supercomputers use Slurm, and we have adopted Slurm on ODU's clusters as well.

Slurm, like most job schedulers, brings the following capabilities to the cluster:

  • Load Balancing

    Slurm distributes computational tasks across the nodes within a cluster. It avoids the situation where certain nodes are underutilized while others are overloaded.

  • Scheduling

    Slurm lets you submit tasks to the cluster, and starts the computation for you once resources become available.

  • Monitoring

    Slurm monitors job status, node status, and also keeps a historical record of them for further study.

Basic Concepts and Terminology

Cluster

A cluster is the set of resources wired together for the purpose of high performance computing: computational devices (servers), networking devices (switches), and storage devices combined.

Node

A node is a single computational device, usually a server.

Job

When you want the scheduler to execute a program and perform a computation on your behalf, the work has to be wrapped in an abstraction called a "job".

Partition

A partition is a set of compute nodes, grouped logically. We separate our computational resources based on their hardware features and the nature of the job.

For instance, there is a regular compute partition main and a CUDA enabled GPU based partition gpu.

Task

It may be confusing, but a task in Slurm means a processor resource.

By default, 1 task uses 1 core. However, this behavior can be altered.

Feature

Each node comes with some differences, for instance a different generation of processor, and each such difference is recorded as a feature of that node. You can require your job to execute only on a set of machines that share certain features.

General Work Flow of Using Slurm

  • Gather Information
  • Submit job
    • sbatch a job script with complex instructions
    • salloc an interactive shell
    • srun a single command
  • Periodically gather information and check job output if you wish

Gather Information

There are two commands in Slurm that are often used to gather information regarding the cluster.

  • the sinfo command gives an overview of the resources offered by the cluster
  • the squeue command shows jobs currently running and pending on the cluster
$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*         up   infinite      2  down* coreV2-22-[012,026]
main*         up   infinite      1    mix coreV1-22-016
main*         up   infinite     11  alloc coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
main*         up   infinite     18   idle coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]
timed-main    up    2:00:00      2  down* coreV2-22-[012,026]
timed-main    up    2:00:00      1    mix coreV1-22-016
timed-main    up    2:00:00     11  alloc coreV1-22-[001,004,007,011-013],coreV2-22-[028,030],coreV2-25-[006-008]
timed-main    up    2:00:00     18   idle coreV1-22-[017-024],coreV2-22-[001,005,010,014,016-017],coreV2-25-[011,018-020]

sinfo shows the partitions and the nodes that are available or occupied within each partition.

$ sinfo -N -l
Thu May 25 14:36:08 2017
NODELIST       NODES  PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
coreV1-22-001      1      main*   allocated   16    2:8:1 126976        0      1   (null) none
coreV1-22-001      1 timed-main   allocated   16    2:8:1 126976        0      1   (null) none
coreV1-22-004      1      main*   allocated   16    2:8:1 126976        0      1   (null) none
coreV1-22-004      1 timed-main   allocated   16    2:8:1 126976        0      1   (null) none
coreV1-22-007      1      main*   allocated   16    2:8:1 126976        0      1   (null) none
coreV1-22-007      1 timed-main   allocated   16    2:8:1 126976        0      1   (null) none
coreV1-22-011      1      main*   allocated   16    2:8:1 126976        0      1   (null) none
...

You can try out some variations of sinfo; for instance, sinfo -N -l (shown above) arranges the information in a node-oriented fashion.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               357      main migr0000 XXXXXXXX  R 1-01:27:02      7 coreV1-22-[001,004,007,011-013,016]
               356      main migr0001 XXXXXXXX  R 1-01:28:02      5 coreV2-22-[028,030],coreV2-25-[006-008]
               358      main migr0002 XXXXXXXX PD       0:00      8 (Resources)

squeue shows two running jobs and one pending job, along with information such as the partition each job runs in, the user who submitted it, and the time and resources it has consumed.

It is worth noting that the job ID (the JOBID column) is a very useful property, since many Slurm commands take it as an argument. For example, to cancel job migr0000 the user can run scancel 357.

The squeue output above also shows that job 358 is in the PD state, which means pending; the scheduler will start it once the job's resource requirements can be met.
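
For example, since most Slurm commands accept the job ID, you can inspect a single job directly; a quick sketch using job 358 from the listing above (output omitted):

# show the queue entry for one job only
$ squeue -j 358

# show the full record Slurm keeps for that job
$ scontrol show job 358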

Composition of Job Submission

A job submission consists of a resource request and a script that together instruct the cluster what to compute.

You can list all resource requests at the beginning of a job script or pass them as command-line arguments.

For example, you can submit a job with the script below:

#!/bin/bash -l

#SBATCH --job-name=test
#SBATCH --output=output.txt

#SBATCH --ntasks=1


hostname

then use the command:

$ sbatch job_script.sh
Submitted batch job 390

or you can use the command:

$ salloc --ntasks=1 --job-name=test hostname
salloc: Pending job allocation 391
salloc: job 391 queued and waiting for resources
salloc: job 391 has been allocated resources
salloc: Granted job allocation 391
turing2
salloc: Relinquishing job allocation 391

In the above example, --ntasks=1 --job-name=test is the resource request, and hostname is the computational task.

Resource requests in Slurm can be very simple or very specific. Please take a look at the following situations:

A simple MPI job

A typical MPI job requires a certain number of processes, and each process requires 1 core. Therefore, the resource request can be as simple as:

#SBATCH --ntasks=8

A simple multi-threading job such as OpenMP

A typical multi-threading job requires 1 process running on multiple cores. Therefore, the resource request would look like the following:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
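
Inside the job script, it is common to pass the allocated core count on to the threading runtime. A minimal sketch, using the standard SLURM_CPUS_PER_TASK variable and a placeholder application name:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to --cpus-per-task
./your_threaded_application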

A Hybrid multi-processes and multi-threading job such as OpenMPI + OpenMP

When a hybrid program requires both multiple processes and multiple threads, you can make such a request in Slurm as in the example below: 8 processes will be launched, each process will execute on 8 cores, and 64 cores in total will be allocated to this job.

#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8
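
A corresponding launch step might look like the sketch below; the application name is a placeholder, and how environment variables propagate to remote ranks depends on your MPI installation.

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # 8 threads per process
mpirun -np $SLURM_NTASKS ./your_hybrid_application   # 8 MPI processes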

Job Submission

Typically, there are two kinds of jobs on a cluster: interactive and non-interactive.

sbatch - # for a non-interactive job, used more frequently than the others
salloc - # for running a single task or launching an interactive shell
  srun - # for running a single task

You will have to use different commands to submit them, but the good part is that they all accept the same arguments.

command structure:

salloc/srun [options] path-to-your-executable [arguments-for-your-executable]

example:

salloc/srun --ntasks=1 --job-name=matlab /usr/local/bin/matlab -singleCompThread -nodesktop

options:
    --job-name   job name (optional)
    --ntasks     number of processes requested

If you submit the same job often or the job is relatively complex, you can use a submission script and submit it with the sbatch command.

#!/bin/bash -l

#SBATCH --ntasks=4

enable_lmod
module load gcc/4
module load openmpi/2.0

mpirun your_application

Now, please look at the lines starting with #SBATCH. They append options to the sbatch command, so this script:

#!/bin/bash -l
#SBATCH --ntasks=4

/usr/bin/true

is the same as this script:

#!/bin/bash -l

/usr/bin/true

executed with the command:

$ sbatch --ntasks=4 job_script.sh

Secondly, please look at the module command. It changes your environment variables based on pre-defined module files. We will discuss modules more in the section Dynamic Module Loading.

Finally, give the instructions for executing your code. In this example, that section is:

mpirun your_application

Useful environment variables for submission scripts

Sometimes, it is very useful to have some information regarding your Slurm job when you are writing your script. Let’s take a look at an example:

#!/bin/bash -l

#SBATCH --job-name=SBATCH_EXAMPLE
#SBATCH --output=output
#SBATCH --ntasks=64


cat << EOF
this job is called $SLURM_JOB_NAME and its ID is $SLURM_JOB_ID
job $SLURM_JOB_ID has been allocated $SLURM_NTASKS cores across $SLURM_NNODES hosts
job $SLURM_JOB_ID will be running on the following machines:

EOF

echo $SLURM_NODELIST

cat << EOF

the working directory for job $SLURM_JOB_NAME is $SLURM_SUBMIT_DIR
what is inside?

EOF

ls -l "$SLURM_SUBMIT_DIR"

If we take a look at the output after running this script, we will see something like:

this job is called SBATCH_EXAMPLE and its ID is 397
job 397 has been allocated 64 cores across 4 hosts
job 397 will be running on the following machines:

coreV2-25-[011,018-020]

the working directory for job SBATCH_EXAMPLE is /home/XXXXXXX/Testing - Slurm
what is inside?

total 480
....

These environment variables are very useful and self-explanatory. Please take a look at the cheat sheet below to see more.

Useful options for submission

For convenience, Slurm supports both a long form and a short form for most options. Both forms are listed below where available.

  • --partition=partition_name

    -p partition_name

    specify which partition the job needs

  • --job-name=name

    -J name

    give a name to your job. This will make managing your job easier

  • --output=filename

    -o filename

    redirect your stdout to a file with the specified filename

  • --mail-type=type

    Slurm will send you an email when certain events happen. The events are defined as follows:

    • BEGIN Mail is sent at the beginning of the job
    • END Mail is sent at the end of the job
    • FAIL Mail is sent when the job is aborted or rescheduled
    • ALL All of the above
  • --mail-user=your@email.address

    Slurm will send email to the address given here when the events defined in --mail-type occur. A short example follows this list.
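
A job script header requesting email notification might contain the following two lines (the address is a placeholder):

#SBATCH --mail-type=ALL
#SBATCH --mail-user=your@email.address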

Deleting a Job from the Queue

When you decide that you no longer want a job to continue executing, you can use the scancel job_number command to remove it from the cluster.

You can only remove your own jobs.
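
A few common invocations are sketched below; the job ID, job name, and username are placeholders.

# cancel a single job by its job ID
$ scancel 740

# cancel a job by the name you gave it with --job-name
$ scancel --name=simple_mpi

# cancel all of your own jobs
$ scancel -u YOUR_MIDAS_USERNAME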

Slurm Cheat Sheet

Job Scripts Header

Resource Request    Long Form                   Short Form
Job name            --job-name=name             -J
Stdout              --output=file_name          -o
Stderr              --error=file_name           -e
Email notification  --mail-type=type
Email address       --mail-user=address
Partition           --partition=partition_name  -p
Tasks               --ntasks=number             -n
Cores per Task      --cpus-per-task=number      -c
Job array           --array=indices             -a

Job Scripts

Meaning                      Environment Variable
Job ID                       $SLURM_JOB_ID
Job Name                     $SLURM_JOB_NAME
Partition Name               $SLURM_JOB_PARTITION
Node List                    $SLURM_NODELIST
Number of Tasks              $SLURM_NTASKS
Number of Nodes              $SLURM_NNODES
Submit Directory             $SLURM_SUBMIT_DIR
Submit Host                  $SLURM_SUBMIT_HOST
Task ID in Job Array         $SLURM_ARRAY_TASK_ID
First Task ID in Job Array   $SLURM_ARRAY_TASK_MIN
Last Task ID in Job Array    $SLURM_ARRAY_TASK_MAX
Task Step Size in Job Array  $SLURM_ARRAY_TASK_STEP

Slurm Command Line

Action                Command
Job submission        sbatch script_file
Job deletion          scancel job_id
Job status (by job)   scontrol show job job_id
Job status (by user)  squeue -u username
Job hold              scontrol hold job_id
Job release           scontrol release job_id
Cluster status        sinfo
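
As a brief illustration of the hold and release entries above (the job ID is a placeholder; the job must still be pending for hold to have an effect):

# prevent a pending job from starting
$ scontrol hold 740

# allow it to be considered for scheduling again
$ scontrol release 740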

Dynamic Module Loading

The module command sets the appropriate environment variables independently of the user’s shell. You can use it within the shell command line or job submission script.

Basic module examples

A user can list the modules loaded by:

$ module list

To find out which modules are available to be loaded, type:

$ module avail

To load packages, type

$ module load package1 package2 ...

To unload packages, type:

$ module unload package1 package2 ...

Advanced module examples

Start the Lmod based Module System:

$ enable_lmod

Modulefiles contain helpful messages. To access this information for a modulefile, type:

$ module help packageName

The module avail command has search capabilities.

$ module avail cc

Modulefiles have a description section known as "whatis". It is accessed by:

$ module whatis gcc

Another way to search for modules is with the “module spider” command. Spider searches all modulefiles rather than the current hierarchy. To learn more about Module Hierarchy, please continue reading.

$ module spider gcc

Module Hierarchy

Module Hierarchy solves a very simple problem. Please take a look at a typical legacy module workflow:

You want to load gcc, openmpi, and netcdf for your code.

# Legacy Mode:

$ module avail gcc
---------------------------- /cm/shared/modulefiles ----------------------------
gcc/4.8.4       gcc/4.9.0       gcc/5.4.0
gcc/4.8.5       gcc/4.9.3       gcc/6.1.0
gcc/4.8.5-no-qm gcc/5.3.0       gcc/6.2.0

$ module avail openmpi

---------------------------- /cm/shared/modulefiles ----------------------------
openmpi/gcc/64/1.10.2       openmpi/icc/64/1.6.5
openmpi/gcc/64/1.6.5        openmpi/open64/64/1.6.5
openmpi/gcc/64/4.9.3/1.10.2 openmpi/openacc/64/1.6.5
openmpi/gcc/64/6.1.0/1.10.2

$ module avail netcdf

---------------------------- /cm/shared/modulefiles ----------------------------
netcdf/gcc/64/4.3.1.1    netcdf/gcc/64/6.1/4.4.0  netcdf/open64/64/4.3.1.1
netcdf/gcc/64/4.3.3.1    netcdf/gcc/64/6.1/4.4.1  netcdf/open64/64/4.3.3.1
netcdf/gcc/64/4.9/4.4.0  netcdf/gcc/64/test
netcdf/gcc/64/5.4/4.4.1  netcdf/intel/64/4.4.0

You want to load gcc/4.9.3, openmpi/1.10.2, and netcdf/4.4.0. You can use the following commands both on the command line and in job submission scripts.

# Legacy Mode:

$ module load gcc/4.9.3
$ module load openmpi/gcc/64/4.9.3/1.10.2
$ module load netcdf/gcc/64/4.9/4.4.0

If you wish to load gcc/6.1 instead, you need to modify your script in three places, as follows:

# Legacy Mode:

$ module unload gcc/4.9.3
$ module unload openmpi/gcc/64/4.9.3/1.10.2
$ module unload netcdf/gcc/64/4.9/4.4.0
# above steps can be ignored if used in script

$ module load gcc/6.1.0
$ module load openmpi/gcc/64/6.1.0/1.10.2
$ module load netcdf/gcc/64/6.1/4.4.0

This creates unnecessary work and invites some very low-level mistakes. For instance, if you forget to modify one of those lines, you may encounter some odd errors much later.

You can spot some easy errors right away, such as:

Fatal Error: Cannot read module file 'netcdf.mod' opened at (1),
because it was created by a different version of GNU Fortran

To make your life easier, we introduced Lmod with Module Hierarchy. Let's take a look at the workflow:

First enable Lmod

$ enable_lmod

Check availability of gcc, openmpi and netcdf (optional):

$ module spider gcc
----------------------------------------------------------------------------------------------------------------------
  gcc:
----------------------------------------------------------------------------------------------------------------------
    Description:
      The GNU Compiler Suit and Support Files

     Versions:
        gcc/4
        gcc/5
        gcc/6

----------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "gcc" module (including how to load the modules) use the module's full name.
  For example:

     $ module spider gcc/6
----------------------------------------------------------------------------------------------------------------------

$ module spider openmpi

----------------------------------------------------------------------------------------------------------------------
  openmpi:
----------------------------------------------------------------------------------------------------------------------
    Description:
      A powerful implementation of MPI

     Versions:
        openmpi/2.0

----------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "openmpi" module (including how to load the modules) use the module's full name.
  For example:

     $ module spider openmpi/2.0
----------------------------------------------------------------------------------------------------------------------

$ module spider netcdf

----------------------------------------------------------------------------------------------------------------------
  netcdf:
----------------------------------------------------------------------------------------------------------------------
    Description:
      Libraries for the Unidata network Common Data Form (NetCDF)

     Versions:
        netcdf/4.4

----------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "netcdf" module (including how to load the modules) use the module's full name.
  For example:

     $ module spider netcdf/4.4
----------------------------------------------------------------------------------------------------------------------

Check detailed information regarding any package (optional):

$ module spider netcdf/4.4

----------------------------------------------------------------------------------------------------------------------
  netcdf: netcdf/4.4
----------------------------------------------------------------------------------------------------------------------
    Description:
      Libraries for the Unidata network Common Data Form (NetCDF)

    You will need to load all module(s) on any one of the lines below before the "netcdf/4.4" module is available to load.

      gcc/4
      gcc/5
      gcc/6
      icc/17
      pgi/16

    Help:
      Module Purpose
      -------
      This module file defines the system paths and environment variables
      needed to use netcdf version 4.4.1.

      Module Description
      -------
        NetCDF is a set of software libraries and self-describing, machine-independent
        data formats that support the creation, access, and sharing of array-oriented
        scientific data. NetCDF was developed and is maintained at Unidata. Unidata
        provides data and software tools for use in geoscience education and research.
        Unidata is part of the University Corporation for Atmospheric Research (UCAR)
        Community Programs (UCP).  Unidata is funded primarily by the National Science
        Foundation.

      Additional Information
      -------
        For more information about netcdf, see the following URL:

        http://www.unidata.ucar.edu/software/netcdf/
        http://www.unidata.ucar.edu/software/netcdf/docs/

      Updated at 21 Jun 2016

Load modules:

$ module load gcc/4
$ module load openmpi/2.0
$ module load netcdf/4.4
$ module list

Currently Loaded Modules:
  1) gcc/4   2) netcdf/4.4   3) openmpi/2.0

As you can see, each module path becomes much shorter, and the correct packages are loaded for you. You can verify this as shown below:

$ which gcc
/cm/shared/applications/gcc/4.9.4/bin/gcc
$ which mpicc
/cm/shared/applications/openmpi/2.0.1/gcc-4/bin/mpicc
$ which nf-config
/cm/shared/applications/netcdf/4.4.1/gcc-4/bin/nf-config

Now, let's say you want to switch to gcc 6.2. All you need to do is the following:

$ module load gcc/6

Due to MODULEPATH changes, the following have been reloaded:
  1) netcdf/4.4  2) openmpi/2.0

The following has been reloaded with a version change:
  1) gcc/4 => gcc/6

$ module list

Currently Loaded Modules:
  1) gcc/6   2) netcdf/4.4   3) openmpi/2.0

You can verify it below:

$ which gcc
/cm/shared/applications/gcc/6.2.0/bin/gcc
$ which mpicc
/cm/shared/applications/openmpi/2.0.1/gcc-6/bin/mpicc
$ which nf-config
/cm/shared/applications/netcdf/4.4.1/gcc-6/bin/nf-config

As you can see, you simply type module load gcc/6; the Lmod module command automatically unloads the old packages and loads the new packages with the correct dependencies.

Additional user experience improvements:

  • Lmod is faster:

    module avail and module spider both use caches and take very little time to run.

  • We created better modulefiles for Lmod, with some helpful information inside:

$ module help gcc

------------------------------------------ Module Specific Help for "gcc/6.2" -------------------------------------------
Module Purpose
-------
This module file defines the system paths and environment variables
needed to use gcc version 6.2.0.

Module Description
-------
  The GNU Compiler Collection includes front ends for C, C++, Objective-C,
  Fortran, Java, Ada, and Go, as well as libraries for these languages
  (libstdc++, libgcj,...). GCC was originally written as the compiler for
  the GNU operating system. The GNU system was developed to be 100%
  free software, free in the sense that it respects the user`s freedom.

Additional Information
-------
  For more information about gcc, see the following URL:

  GCC online documentation - https://gcc.gnu.org/onlinedocs/
  GNU Fortran Manual       - https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gfortran/
  GCC STL Manual           - https://gcc.gnu.org/onlinedocs/gcc-6.2.0/libstdc++/manual/
  GCC STL Reference Manual - https://gcc.gnu.org/onlinedocs/gcc-6.2.0/libstdc++/api/
  GCC OpenMP Manual        - https://gcc.gnu.org/onlinedocs/gcc-6.2.0/libgomp/
  GCC Quad-Precision Math  - https://gcc.gnu.org/onlinedocs/gcc-6.2.0/libquadmath/

Updated at 19 May 2016

$ module help mkl

------------------------------------------- Module Specific Help for "mkl/17" -------------------------------------------
Module Purpose
-------
This module file defines the system paths and environment variables
needed to use mkl version 17.1.

Module Description
-------
  Intel Math Kernel Library (Intel MKL) accelerates math processing routines
  that increase application performance and reduce development time. Intel MKL
  includes highly vectorized and threaded Linear Algebra, Fast Fourier Transforms
  (FFT), Vector Math and Statistics functions.  The easiest way to take advantage
  of all of that processing power is to use a carefully optimized computing math
  library. Even the best compiler can't compete with the level of performance
  possible from a hand-optimized library. If your application already relies on
  the BLAS or LAPACK functionality, simply re-link with Intel MKL to get better
  performance on Intel and compatible architectures.

Additional Information
-------
  For more information about mkl, see the following URL:

  Intel Modern Code         - https://software.intel.com/en-us/modern-code
  Intel Math Kernel Library - https://software.intel.com/en-us/intel-mkl/documentation
  Intel MKL Link Advisor    - https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor


Updated at 04 Nov 2016
  • Every module brings in some helpful environment variables. For example:
$ module load gcc openmpi netcdf
# You can omit the version; this loads the latest version of each package.

$ env | grep ^NETCDF
NETCDF_ROOT=/cm/shared/applications/netcdf/4.4.1/gcc-6
NETCDF_VER=4.4
NETCDF_VERSION=4.4.1

# useful when compiling code:

$ gcc -I$NETCDF_ROOT/include source.c
$ gcc -L$NETCDF_ROOT/lib -lnetcdf source.o

Turing HPC cluster

Turing is a traditional, computation-targeted cluster. It is interconnected with an Infiniband network and is typically well suited to MPI and similar distributed applications.

You should consider running your application on Turing if the following conditions match your needs:

  1. It runs on a Linux Platform

    We do not support any other platform at this point, and it is unlikely that we will in the foreseeable future.

  2. It benefits from parallel computing

    Your application can utilize more than one CPU core through multi-processing, multi-threading, MPI, or a similar technique.

  3. It does not require Hadoop or a Hadoop based data analysis platform

    We have a separate big data analysis cluster available at ODU. Please read Hadoop Big Data Cluster.

Hardware resources

The following hardware resources are available on the Turing cluster:

Node Type         Total Nodes  Slots Per Node  Additional Resource             Memory Per Node  Node Name Prefix                                                           Total Cores
Login             1            20              none                            128GB            Turing                                                                     20
Standard Compute  220          16-32           none                            128GB            coreV1-22-###, coreV2-22-###, coreV2-25-###, coreV3-23-###, coreV4-21-###  5456
GPU               16           28-32           Nvidia K40 GPU, Nvidia K80 GPU  128GB            coreV2-28-k40-###, coreV3-23-k40-###, coreV4-21-k80-###                    512
Xeon Phi          10           20              Intel 2250 Xeon Phi MICs        128GB            coreV2-25-knc-###                                                          200
High Memory       7            32              none                            512GB-768GB      coreV2-23-himem-###, coreV4-21-himem-###                                   224

Network resources

Turing provides two kinds of communication networks:

  • Infiniband

    • The FDR Infiniband network provides high speed message passing between compute nodes.
    • It is much faster than Ethernet on the cluster and we strongly encourage you to use it whenever possible.
    • The easiest way to utilize it is using MPI.

      If you can adjust your application to use MPI, you should do it.

    • Another way to utilize Infiniband is to rely directly on OFED libraries, such as libverbs or libfabric

      Please feel free to contact us if you need any help enabling Infiniband for your application.

  • Ethernet

    A standard Ethernet network, which is much slower than Infiniband. If using Infiniband is not an option, you may have to settle for this network.

Partition Policies

Partitions on Turing are defined by machine type. You can find detailed machine information in the Technical Specifications section below. Below is the list of partitions on Turing:

Name   Use Case                                                                       Slurm Submission Options
main   no specialized hardware requirement; requires less than 128G memory per node   -p main, or nothing (all jobs are submitted to main by default)
himem  requires more than 128G memory per node                                        -p himem
gpu    requires a GPU accelerator                                                     -p gpu
phi    requires a Phi coprocessor                                                     -p phi

For GPUs, we allow 16 processes to share a single GPU on each node. Please request only the number of processes you plan to use. We also allow a user to reserve an entire GPU node; to do so, please use the exclusive option. Whole-node reservations usually require a longer wait before the job starts.

In addition, Turing has a timed partition for each partition above: timed-main, timed-himem, timed-gpu, and timed-phi. The differences between the timed partitions and their non-timed counterparts are listed below:

  1. Any job can execute on a timed partition for a maximum of 2 hours

    It will be terminated after 2 hours, so timed partitions are best suited for short jobs

  2. Any job submitted to a timed partition has a higher priority to start its execution

    We believe that if your job is very short, you should not have to wait long

  3. Timed partitions are larger than their non-timed counterparts

    Our investors kindly lend their resources for community use. They agree to let all Turing users run short tasks on their hardware.

To submit a job to a timed partition, the following options in your submission are mandatory:

  • -p timed-main|timed-himem|timed-gpu|timed-phi # pick one
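
For example, a short test job aimed at the timed main partition might use a script like the sketch below; the job name and application are placeholders, and remember that the job will be terminated after 2 hours.

#!/bin/bash -l

#SBATCH --job-name=quick_test
#SBATCH --output=output
#SBATCH --partition=timed-main   # 2 hour limit, higher start priority
#SBATCH --ntasks=4

enable_lmod
module load gcc/4
module load openmpi/2.0

mpirun your_application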

Storage information

Turing has several mounted storage resources available:

Mount Point Purpose Backup
/home User Home Directory, primary storage for user yes, once per day
/RC Archive and long term storage, only available to login node yes, once per day
/scratch Scratch Space (slow) no
/scratch-lustre Scratch Space (fast) no
/lustre Scratch Space (fast,deprecated) no

Mass Storage

/RC can be accessed both on Turing and as a shared drive on your workstation. You do not get /RC by default; a request for individual or group storage may be filed at http://ww2.odu.edu/forms_admin/viewform.php?formid=17391.

Once a request is granted, you will have access to:

  • /RC/home/YOUR_USERNAME for individual requests
  • /RC/group/YOUR_GROUP_NAME for group requests
  • The initial space limit is 500 GB
  • Additional space can be requested by sending an email to hpc@odu.edu
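
Because /RC is mounted on the login node, you can also stage data into it there with ordinary copy tools; a minimal sketch, assuming rsync is available and with placeholder paths:

# run on the Turing login node
$ rsync -av ~/project_results/ /RC/home/YOUR_USERNAME/project_results/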

The following are ways to access /RC as a shared drive on your workstation:

  • For Windows:

    • If your workstation is managed by ODU ITS and is on the ts.odu.edu domain, you should see an R: drive under your Computer
    • If that is not the case, or you are accessing ODU through the VPN, you may map the \\research1.ts.odu.edu\RC drive with the following steps:
  • Step 1: Navigate to Computer

    step-0.png

  • Step 2: Click Map network drive

    step-1.png

  • Step 3: Input drive information

    step-2.png

    You may choose a drive letter other than R

  • Step 4: Click Finish

    step-3.png

  • Step 5: Input your credentials

    step-4.png

  • For macOS
  • Step 1: Select Finder and use key ⌘-K

    step-0.png

  • Step 2: Input drive information and click + (you only need to do this once)

    step-1.png

  • Step 3: Select the drive and click connect

    step-2.png

  • Step 4: Input your credentials

    step-3.png

Technical Specifications

Compute Platform  # Hosts  Host Names           CPU                                                    Memory  Special Features
General Purpose   30       coreV1-22-###        Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (16 slots)    128G
                  40       coreV2-22-###        Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz (20 slots)   128G
                  80       coreV2-25-###        Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz (20 slots)   128G
                  50       coreV3-23-###        Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz (32 slots)   128G
                  30       coreV4-21-###        Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 slots)   128G
High Memory       4        coreV2-23-himem-###  Intel(R) Xeon(R) CPU E5-4610 v2 @ 2.30GHz (32 slots)   768G
                  3        coreV4-21-himem-###  Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (32 slots)   512G
GPU               10       coreV2-28-k40-###    Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz (28 slots)   128G    Tesla™ K40m (12GB, 745 MHz, 2880 cores)
                  5        coreV4-21-k80-###    Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz (32 slots)   128G    Tesla™ K80m (24GB, 562 MHz, 4992 cores)
MIC               10       coreV2-25-knc-###    Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz (20 slots)   128G    Xeon Phi™ Coprocessor 5120D (8GB, 1.053 GHz, 60 cores)

Wahab HPC cluster

Wahab is a traditional, computation-targeted cluster. It is interconnected with an Infiniband network and is typically well suited to MPI and similar distributed applications.

You should consider running your application on Wahab if the following conditions match your needs:

  1. It runs on a Linux Platform

    We do not support any other platform at this point, and it is unlikely that we will in the foreseeable future.

  2. It benefits from parallel computing

    Your application can utilize more than one CPU core through multi-processing, multi-threading, MPI, or a similar technique.

  3. It does not require Hadoop or a Hadoop based data analysis platform

    We have a separate big data analysis cluster available at ODU. Please read Hadoop Big Data Cluster.

Hardware resources

The following hardware resources are available on the Wahab cluster:

Node Type         Total Nodes  Slots Per Node  Additional Resource  Memory Per Node  Node Name Prefix                                        Total Cores
Login             2            20              none                 128GB            Wahab                                                   20
Standard Compute  158          40              none                 384GB            d1-6420a-###, d4-6420b-###, d5-6420b-###, d6-6420b-###  6320
GPU               18           28-32           Nvidia V100 GPU      128GB            d2-w4140a-###, d3-w4140b-###, d4-w4140b-###             512

Network resources

Wahab provides two kinds of communication networks:

  • Infiniband

    • The EDR Infiniband network provides high speed message passing between compute nodes.
    • It is much faster than Ethernet on the cluster and we strongly encourage you to use it whenever possible.
    • The easiest way to utilize it is using MPI.

      If you can adjust your application to use MPI, you should do it.

    • Another way to utilize Infiniband is to rely directly on OFED libraries, such as libverbs or libfabric

      Please feel free to contact us if you need any help enabling Infiniband for your application.

  • Ethernet

    A standard Ethernet network, which is much slower than Infiniband. If using Infiniband is not an option, you may have to settle for this network.

Partition Policies

Partitions on Wahab are defined by machine type. You can find detailed machine information in the Technical Specifications section below. Below is the list of partitions on Wahab:

Name  Use Case                                                                       Slurm Submission Options
main  no specialized hardware requirement; requires less than 384G memory per node   -p main, or nothing (all jobs are submitted to main by default)
gpu   requires a GPU accelerator                                                     -p gpu

For GPUs, we allow 16 processes to share a single GPU on each node. Please request only the number of processes you plan to use. We also allow a user to reserve an entire GPU node; to do so, please use the exclusive option. Whole-node reservations usually require a longer wait before the job starts.

In addition, Wahab has a timed partition for each partition above: timed-main and timed-gpu. The differences between the timed partitions and their non-timed counterparts are listed below:

  1. Any job can execute on a timed partition for a maximum of 2 hours

    It will be terminated after 2 hours, so timed partitions are best suited for short jobs

  2. Any job submitted to a timed partition has a higher priority to start its execution

    We believe that if your job is very short, you should not have to wait long

  3. Timed partitions are larger than their non-timed counterparts

    Our investors kindly lend their resources for community use. They agree to let all Wahab users run short tasks on their hardware.

To submit a job to a timed partition, the following options in your submission are mandatory:

  • -p timed-main|timed-gpu # pick one

Storage information

Wahab has several mounted storage resources available:

Mount Point Purpose Backup
/home User Home Directory, primary storage for user yes, once per day
/RC Archive and long term storage, only available to login node yes, once per day
/scratch Scratch Space (fast) no

Mass Storage

/RC can be accessed both on Wahab and as a shared drive on your workstation. You do not get /RC by default; a request for individual or group storage may be filed at http://ww2.odu.edu/forms_admin/viewform.php?formid=17391.

Once a request is granted, you will have access to:

  • /RC/home/YOUR_USERNAME for individual requests
  • /RC/group/YOUR_GROUP_NAME for group requests
  • The initial space limit is 500 GB
  • Additional space can be requested by sending an email to hpc@odu.edu

The following are ways to access /RC as a shared drive on your workstation:

  • For Windows:

    • If your workstation is managed by ODU ITS and is on the ts.odu.edu domain, you should see an R: drive under your Computer
    • If that is not the case, or you are accessing ODU through the VPN, you may map the \\research1.ts.odu.edu\RC drive with the following steps:
  • Step 1: Navigate to Computer

    step-0.png

  • Step 2: Click Map network drive

    step-1.png

  • Step 3: Input drive information

    step-2.png

    You may choose a drive letter other than R

  • Step 4: Click Finish

    step-3.png

  • Step 5: Input your credentials

    step-4.png

  • For macOS
  • Step 1: Select Finder and use key ⌘-K

    step-0.png

  • Step 2: Input drive information and click + (you only need to do this once)

    step-1.png

  • Step 3: Select the drive and click connect

    step-2.png

  • Step 4: Input your credentials

    step-3.png

Technical Specifications

Compute Platform  # Hosts  Host Names                                                 CPU                                                 Memory  Special Features
General Purpose   158      d1-w6420a-[01-24], d4-w6420b-[01-12], d5-w6420b-[01-24],   Intel(R) Xeon(R) Gold 6148 CPU @ 2.4GHz (40 slots)  384G
                           d6-w6420b-[01-24], e1-w6420b-[02-05,07-24],
                           e2-w6420b-[01-24], e3-w6420b-[01-24]
GPU               10       d2-w4140a-###, d3-w4140b-###, d4-w4140b-###                Intel(R) Xeon(R) Gold 6130 @ 2.1GHz (32 slots)      128G    4 x Nvidia Tesla V100 GPU

High Performance Computing Case Studies

In this section of case studies, we will follow some basic conventions.

  • All of our code will be written in C to keep it generic; it can be translated into another programming language
  • If a line starts with # or //, then it is a comment
  • If a line starts with $, then it is user input. You can type it into your console, just without the $
  • If a line starts with no such special symbol, then it is output
  • There is a link to download the example programs at the beginning of each case

MPI Programming

Goal

  • Compile and submit a MPI job on Turing

Prerequisites

  • Basic Unix/Linux experience
  • Basic knowledge regarding MPI programming

What to expect

  • Create a very simple MPI program
  • Manage the compiling with a basic generic Makefile
  • Submit to Slurm
  • Query status
  • Review output

Source code

#include <mpi.h>
#include <stdio.h>


int main(int argc, char **argv)
{
        char host[1024];
        int rank, procs, host_len;

        MPI_Init(&argc, &argv);

        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Get_processor_name(host, &host_len);

        printf("started %s at %s with rank: %d of %d", argv[0], host, rank, procs);

        MPI_Finalize();
}

Setup environment for compilation

  • Before compiling your program, you should always check that you have the correct modules loaded.
  • You do not want modules missing, nor do you want to load unnecessary modules.
  • Sometimes, you may have loaded certain modules at login through your .profile, .tcshrc, or .bashrc. You should check them from time to time to make sure they are still valid for your current work.
# Step 0: Load New Module System
$ enable_lmod

# Step 1: Check modules
$ module list

Currently Loaded Modules:
  1) slurm/17.02

# Step 2: Load correct modules
$ module load gcc/4
$ module load openmpi/2.0

# Step 3: Check again
$ module list

Currently Loaded Modules:
  1) slurm/17.02   2) gcc/4   3) openmpi/2.0

Compile your code

For a simple program, you can compile it manually:

$ mpicc simple_mpi.c -o simple_mpi      

For a more complicated program, the make command is recommended to manage the compilation process. In addition, you may wish to create separate directories to hold source code, object files, etc., so that as the number of files in your program grows, everything stays neat. For instance, you may create a simple directory structure like the one below:

$ tree
.
├── build
│   └── simple_mpi.o        # object files
├── Makefile                # generic Makefile
├── simple_mpi              # final binary
├── src
│   └── simple_mpi.c        # source code files
└── submission.sh           # submission script for Slurm

2 directories, 5 files

Here is a generic Makefile. You can expand on it for your own program.

# This file assumes you are writing your code in C. You can change the EXT variable to change this behavior
# to make it work for other source code as well.
# For example:
# to C++:     EXT = cpp
# to Fortran: EXT = F
EXT  = c

SRCS = $(shell find src -name '*.$(EXT)')
OBJS = $(SRCS:src/%.$(EXT)=build/%.o)

BIN  = simple_mpi

# You need to change your compiler for different languages as well
# to C++: CC = mpiCC
#         LD = mpiCC
# to Fortran: CC = mpifort
#             LD = mpifort

CC = mpicc
LD = mpicc

# CFLAGS  is given to the compiler when compiling each object file
# LDFLAGS is given to the compiler at the linking stage

CFLAGS  = -O2
LDFLAGS =

all: $(BIN)

$(BIN): $(OBJS)
        $(LD) $(LDFLAGS) $(OBJS) -o $(BIN)

build/%.o: src/%.$(EXT)
        $(CC) $(CFLAGS) -c $< -o $@

clean:
        rm build/*.o
        rm $(BIN)

With this Makefile, your compilation process will be as simple as below:

$ make
mpicc -O2 -c src/simple_mpi.c -o build/simple_mpi.o
mpicc build/simple_mpi.o -o simple_mpi

Prepare the submission script for Slurm

This is a very basic and generic submission script. You can expand on it for your own program.

#!/bin/bash -l

#SBATCH --job-name=simple_mpi # set job name to simple_mpi
#SBATCH --output=output       # set stdout and stderr output to same file
#SBATCH --ntasks=4            # launch 4 tasks, which will use 4 cores by default

enable_lmod                   # enable new module system
module load gcc/4             # load run time (gcc/openmpi) for your code
module load openmpi/2.0

mpirun simple_mpi

Submit to Slurm and review job status

To submit to Slurm, we will use sbatch command.

$ sbatch  submission.sh
Submitted batch job 740

# the job is submitted, and job number is 740
# at this point, this job should be in PD stage

#replace xxxxxxx with your username
$ squeue -u xxxxxxx 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               740      main simple_m mdong003 PD       0:00      1 (None)

# after a couple of seconds, the job starts executing
$ squeue -u xxxxxxx 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               740      main simple_m mdong003  R       0:01      1 coreV2-22-001

# you can also check your job status with `scontrol show job`, which shows more information

$ scontrol show job 740
JobId=740 JobName=simple_mpi
   UserId=mdong003(30290) GroupId=users(14514) MCS_label=N/A
   Priority=10002 Nice=0 Account=odu QOS=turing_default_qos
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=134:0
   RunTime=00:00:05 TimeLimit=365-00:00:00 TimeMin=N/A
   SubmitTime=2017-05-30T10:47:54 EligibleTime=2017-05-30T10:47:54
   StartTime=2017-05-30T10:47:55 EndTime=2017-05-30T10:48:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=turing2:164977
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=coreV2-22-001
   BatchHost=coreV2-22-001
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/mdong003/Source - HPCDocs/simple_mpi/submission.sh
   WorkDir=/home/mdong003/Source - HPCDocs/simple_mpi
   StdErr=/home/mdong003/Source - HPCDocs/simple_mpi/output
   StdIn=/dev/null
   StdOut=/home/mdong003/Source - HPCDocs/simple_mpi/output
   Power=

# you can check the status of your job even after your job finished its execution with `sacct -j 740`

740          simple_mpi       main        odu          4  COMPLETED      0:0
740.batch         batch                   odu          4  COMPLETED      0:0

# Checking your output file is a relatively simple method of monitoring your job
# Be aware, the output does not have a fixed order due to the nature of parallel execution

$ cat output
started simple_mpi at coreV2-22-001 with rank: 0 of 4
started simple_mpi at coreV2-22-001 with rank: 2 of 4
started simple_mpi at coreV2-22-001 with rank: 1 of 4
started simple_mpi at coreV2-22-001 with rank: 3 of 4

CUDA Programming

Goal

  • Compile and submit a CUDA job on Turing

Prerequisites

  • Basic knowledge regarding CUDA programming
  • Tried submitting a regular MPI job on Turing at least once
  • If not, please read MPI Programming

What to expect

  • Create a very simple CUDA program
  • Compile the code
  • Understand GPU sharing on Turing
  • Submit to Slurm with correct process thread configuration

Source Code

// saxpy example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/

#include <stdio.h>
#include <stdlib.h>

__global__
void saxpy(int n, float a, float  * x, float  * y)
{
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
        int N = 1<<20;
        float *x, *y, *d_x, *d_y;
        x = (float*) malloc(N * sizeof(float));
        y = (float*) malloc(N * sizeof(float));

        cudaMalloc(&d_x, N * sizeof(float));
        cudaMalloc(&d_y, N * sizeof(float));

        for (int i = 0; i < N; i++) {
                x[i] = 1.0f;
                y[i] = 2.0f;
        }

        cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

        // Perform SAXPY on 1M elements
        saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

        cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

        float maxError = 0.0f;
        for (int i = 0; i < N; i++)
                maxError = max(maxError, abs(y[i] - 4.0f));
        printf("Max error: %f\n", maxError);
}

Setup environment for compilation

To compile a CUDA program, please load the following modules:

$ module list

Currently Loaded Modules:
  1) Slurm/17.02   2) gcc/4   3) cuda/8.0

If you are planning to use the Fortran programming language with CUDA, please load the PGI compiler instead:

$ module load pgi/16

Lmod is automatically replacing "gcc/4" with "pgi/16"


Due to MODULEPATH changes the following have been reloaded:
  1) cuda/8.0

$ module list

Currently Loaded Modules:
  1) Slurm/17.02   2) binutils/2.28   3) pgi/16   4) cuda-supplement/8.0   5) cuda/8.0

Compile your code

You can still compile it manually:

$ nvcc simple_cuda.cu -o simple_cuda

For a larger project, we still recommend using a Makefile to manage your build process

EXT  = cu
SRCS = $(shell find src -name '*.$(EXT)')
OBJS = $(SRCS:src/%.$(EXT)=build/%.o)

BIN  = simple_cuda

CC = nvcc
LD = nvcc

CFLAGS  = -O2
LDFLAGS =

all: $(BIN)

$(BIN): $(OBJS)
        $(LD) $(LDFLAGS) $(OBJS) -o $(BIN)

build/%.o: src/%.$(EXT)
        $(CC) $(CFLAGS) -c $< -o $@

clean:
        rm build/*.o
        rm $(BIN)

Turing GPU sharing

Before you submit jobs to Turing, we need to take a look at the Turing GPU sharing policy.

  • CUDA can only be executed on a GPU enabled host
  • There are 15 GPU enabled hosts on Turing

    • Each host allows up to 16 processes to share a single GPU
  • or

    • Each host can be reserved exclusively for one job
  • There is no fixed number of hosts that perform shared GPU access or exclusive GPU access. Hosts are in an undecided state until a job is scheduled on them

This can be explained by the table below:

Shared GPU slots used Exclusive GPU slot used Undecided Hosts Shared GPU slots available Exclusive GPU slot available
0 0 15 240 15
1 0 14 239 14
0 1 14 224 14
1 1 13 223 13
2 (on two nodes) 0 13 238 13
2 (on same node) 0 14 238 14

The purpose of this configuration is to avoid resource fragmentation in the cluster.

Shared GPU access gets scheduled faster, since shared GPU slots are usually more available. However, depending on the load of the host, the computation may take longer to complete, so shared access is not recommended when accurate timing measurements are required.

Exclusive GPU access does not suffer from this issue, but it does take longer to get scheduled. It is therefore in the user's best interest to request exclusive access only for mature production code; researchers should use shared GPU access during testing and other phases of development.

Prepare the submission script for Slurm

Below is a list of changes to the submission script that are worth noting. This submission works for a single-host CUDA program.

  • To use all GPU resources on a single node

    • --nodes=1 request 1 node

    • --exclusive request exclusive access to a GPU host.

      When your code is expected to utilize all GPU resources, it is best to ask for 1 node and have full access to the node.

      It is possible that requesting exclusive access to a node requires longer waiting time, but it provides the best performance.

  • To share GPU resources with another user

    • --ntasks=1 request 1 task

      When developing & debugging, you may wish to share GPU resources with another user. Shared requests generally require less waiting in the queue.

#!/bin/bash -l

#SBATCH --job-name=simple_cuda
#SBATCH --output=output
#SBATCH --partition=gpu

# only keep one method of request below
# to share resource
#SBATCH --ntasks=1

# exclusive access resource 
#SBATCH --nodes=1
#SBATCH --exclusive

module load gcc/4
module load cuda/8.0

./simple_cuda

Submit to Slurm, and review job status

Submitting and reviewing your job is the same procedure as the simple MPI program section.
Please read MPI Programming

OpenACC Programming

Goal

  • Compile and submit an OpenACC job on Turing

Prerequisites

  • Basic knowledge regarding OpenACC programming
  • Tried submitting regular MPI jobs on Turing at least once
  • If not, please read MPI Programming

What to expect

  • Create a very simple OpenACC program
  • Compile the code
  • Submit to Slurm with correct process thread configuration

Source Code

// modified saxpy example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void saxpy(int n, float a, float *x, float *y)
{
#pragma acc kernels
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

int main(void)
{
        int N = 1<<20;
        float *x, *y;

        x = (float*) malloc(N * sizeof(float));
        y = (float*) malloc(N * sizeof(float));

        for (int i = 0; i < N; i++) {
                x[i] = 1.0f;
                y[i] = 2.0f;
        }

        saxpy(N, 2.0f, x, y);

        float maxError = 0.0f;
        for (int i = 0; i < N; i++)
                maxError = fmax(maxError, abs(y[i] - 4.0f));
        printf("Max error: %f\n", maxError);
}

Setup environment for compilation

To compile an OpenACC program, please load the following modules:

$ module list

Currently Loaded Modules:
  1) slurm/17.02   2) binutils/2.28   3) pgi/16   4) cuda-supplement/8.0   5) cuda/8.0

Compile your code

You can still compile it manually:

$ pgcc -acc simple_acc.c -o simple_acc

The flag -acc instructs the compiler to compile the code into an OpenACC program. Please do not forget it: without the flag the compilation may still succeed, but the compiler will ignore all OpenACC directives.

For a larger project, we still recommend using a Makefile to manage your build process

EXT  = c
SRCS = $(shell find src -name '*.$(EXT)')
OBJS = $(SRCS:src/%.$(EXT)=build/%.o)

BIN  = simple_acc

CC = pgcc
LD = pgcc

CFLAGS  = -acc -O2
LDFLAGS = -acc

all: $(BIN)

$(BIN): $(OBJS)
        $(LD) $(LDFLAGS) $(OBJS) -o $(BIN)

build/%.o: src/%.$(EXT)
        $(CC) $(CFLAGS) -c $< -o $@

clean:
        rm build/*.o
        rm $(BIN)

Prepare the submission script for Slurm

Since OpenACC requires GPU resources, the submission procedure is identical to that of a CUDA program. Please read the CUDA Programming section for details.

#!/bin/bash -l

#SBATCH --job-name=simple_acc
#SBATCH --output=output
#SBATCH --partition=gpu

# keep only one of the two request methods below
# to share resources:
#SBATCH --ntasks=1

# for exclusive access:
#SBATCH --nodes=1
#SBATCH --exclusive

module load pgi/16
module load cuda/8.0

./simple_acc

Submit to Slurm and review job status

Submitting and reviewing your job follows the same procedure as for the simple MPI program.
Please read MPI Programming

OpenMP Programming

Goal

  • Compile and submit an OpenMP job on Turing

Prerequisites

  • Basic knowledge regarding OpenMP programming
  • Tried submitting a regular MPI job on Turing at least once
  • If not, please read MPI Programming

What to expect

  • Create a very simple OpenMP program
  • Compile the code
  • Submit to Slurm with correct process thread configuration

Source code

// modified saxpy example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void saxpy(int n, float a, float *x, float *y)
{
#pragma omp parallel for
        for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
}

int main(void)
{
        int N = 1<<20;
        float *x, *y;

        x = (float*) malloc(N * sizeof(float));
        y = (float*) malloc(N * sizeof(float));

        for (int i = 0; i < N; i++) {
                x[i] = 1.0f;
                y[i] = 2.0f;
        }

        saxpy(N, 2.0f, x, y);

        float maxError = 0.0f;
        for (int i = 0; i < N; i++)
                maxError = fmax(maxError, fabsf(y[i] - 4.0f));
        printf("Max error: %f\n", maxError);
}

Setup environment for compilation

To compile an OpenMP program, please load the following modules:

$ module list

Currently Loaded Modules:
  1) slurm/17.02   2) gcc/4

Compile your code

You can still compile it manually:

$ gcc -fopenmp -std=c99 simple_omp.c -o simple_omp -lm

The flag -fopenmp instructs the compiler to compile the code as an OpenMP program. Please do not forget it: the compilation may still succeed, but the compiler will ignore all OpenMP directives.

Different compilers require different options to enable OpenMP. Here is a list of the compilers you may encounter on Turing; an example with the Intel compiler follows the table.

compiler options
GCC -fopenmp
LLVM -fopenmp
ICC -qopenmp
PGI -mp
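
For example, building the same program with the Intel compiler would look like the following sketch (it assumes an Intel compiler module such as icc/17 is loaded in place of gcc/4):

$ icc -qopenmp -std=c99 simple_omp.c -o simple_omp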

For a larger project, we still recommend using a Makefile to manage your build process.

EXT  = c
SRCS = $(shell find src -name '*.$(EXT)')
OBJS = $(SRCS:src/%.$(EXT)=build/%.o)

BIN  = simple_omp

CC = gcc
LD = gcc

CFLAGS  = -fopenmp -std=c99 -O2
LDFLAGS = -fopenmp -lm

all: $(BIN)

$(BIN): $(OBJS)
        $(LD) $(LDFLAGS) $(OBJS) -o $(BIN)

build/%.o: src/%.$(EXT)
        $(CC) $(CFLAGS) -c $< -o $@

clean:
        rm build/*.o
        rm $(BIN)

Prepare the submission script for Slurm

Below is a list of changes to the submission script that are worth noting:

  • CPUs Per Task --cpus-per-task

    An OpenMP program requires a single process but multiple threads. This configuration can be requested with the --cpus-per-task option to Slurm.

  • Environment Variable OMP_NUM_THREADS

    OpenMP uses a number of environment variables to control its behavior. OMP_NUM_THREADS is perhaps the most important one. It tells your application how many threads it can use, and it should always be equal to --cpus-per-task.

    You can find more detailed information regarding OpenMP Environment Variables here.

#!/bin/bash -l

#SBATCH --job-name=simple_omp
#SBATCH --output=output
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

enable_lmod
module load gcc/4

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./simple_omp

Submit to Slurm, and review job status

The same procedure for submitting a job and reviewing a job’s status is listed under the MPI section. Please read MPI Programming

Phi Programming

Goal

  • Compile and submit a job utilizing an Intel® Xeon Phi™ Coprocessor on Turing

Prerequisites

  • Basic knowledge regarding OpenMP programming
  • Basic knowledge regarding MPI programming
  • Basic knowledge regarding Phi programming
  • Tried submitting a regular MPI job on Turing at least once
  • If not, please read MPI Programming

What to expect

  • Compile an OpenMP MIC program in native, offload, and hybrid modes
  • Submit to Slurm with correct process thread configuration

Intel® Xeon Phi™ Coprocessors support both the OpenMP and MPI programming models. This section focuses on OpenMP; for MPI, please read Hybrid: MPI + Phi. When describing Intel® Xeon Phi™ Coprocessors, the abbreviations Phi and MIC will be used.

Native, Offload, or Hybrid

There are three modes (native, offload, and hybrid) for programming with the Phi Coprocessor:

Mode architecture use CPU use MIC
Native (CPU) x86_64 Yes Yes
Native (MIC) k1om No Yes
Offload x86_64 Yes Yes
Hybrid x86_64 Yes Yes

All modes use the same compilation environment, makefile, and submission scripts.

Setup environment for compilation

To compile programs that utilize MICs, please load the following modules:

$ module list

Currently Loaded Modules:
  1) slurm/17.02   2) binutils/2.28   3) icc/17

Makefile for OpenMP based Phi project

  • Compiler: icc

    Please use Intel's C/C++/Fortran compiler for code that utilizes Phi Coprocessors. Although several other compilers provide MIC offloading in their OpenMP implementations, none were production-ready when this documentation was written.

  • Compiler Flag: -qopenmp

    This flag instructs the compiler to parse OpenMP directives.

  • Compiler Flag: -mmic

    This flag instructs the compiler to build a k1om binary instead of an x86_64 binary.

  • Compiler Flag: -std=c99

    This flag is optional but highly recommended. It enables several useful C99 features; for example, the samples here declare loop variables directly inside the for statement. This works nicely with OpenMP and reduces the need for the private clause in some cases.

EXT  = c
SRCS = $(shell find src -name '*.$(EXT)')
OBJS = $(SRCS:src/%.$(EXT)=build/%.o)

BIN  = simple_phi

CC = icc
LD = icc

CFLAGS  = -qopenmp -std=c99 -O2
LDFLAGS = -qopenmp

# uncomment the two lines below when MIC native mode is needed; otherwise -mmic should not be used
#CFLAGS  = -qopenmp -mmic -std=c99 -O2
#LDFLAGS = -qopenmp -mmic

all: $(BIN)

$(BIN): $(OBJS)
        $(LD) $(LDFLAGS) $(OBJS) -o $(BIN)

build/%.o: src/%.$(EXT)
        $(CC) $(CFLAGS) -c $< -o $@

clean:
        rm build/*.o
        rm $(BIN)

Submission Script

  • --nodes request number of nodes required by job

    It is assumed that OpenMP-based Phi programs utilize as many threads as possible, so using --nodes instead of --ntasks is an easier way to request resources.

  • --exclusive request exclusive access to a Phi host.

    Due to the nature of Phi programming, we suggest always running Phi jobs in exclusive mode.

#!/bin/bash -l

#SBATCH --job-name=simple_phi_native
#SBATCH --output=output
#SBATCH --partition=phi
#SBATCH --nodes=1
#SBATCH --exclusive


enable_lmod
module load icc/17

# keep only one of the two launch methods below
# if running a native MIC program
micnativeloadex simple_phi -e "OMP_NUM_THREADS=60"

# if running an offload or hybrid program
export OMP_NUM_THREADS=4
./simple_phi

Source Code: Non-parallelized Version

// modified saxpy example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/

#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <math.h>

void saxpy(int n, float a, float *x, float *y)
{
        for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
}

void greeting()
{
        struct utsname sys_info;

        uname(&sys_info);
        printf("on host %s  1 threads will be launched\n", sys_info.nodename);
}

int main(void)
{
        int N = 1<<20;
        float *x, *y;

        x = (float*) malloc(N * sizeof(float));
        y = (float*) malloc(N * sizeof(float));

        for (int i = 0; i < N; i++) {
                x[i] = 1.0f;
                y[i] = 2.0f;
        }

        greeting();
        saxpy(N, 2.0f, x, y);

        float maxError = 0.0f;
        for (int i = 0; i < N; i++)
                maxError = fmax(maxError, fabsf(y[i] - 4.0f));
        printf("Max error: %f\n", maxError);
}

Source Code: Native Mode for CPU or MIC

// only the sections changed from the non-parallelized version are shown
// these versions additionally require #include <omp.h>

void saxpy(int n, float a, float *x, float *y)
{
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
}

void greeting()
{
        int thread_rank;
        struct utsname sys_info;

        uname(&sys_info);

        #pragma omp parallel private(thread_rank)
        {

                thread_rank = omp_get_thread_num();
                if (thread_rank == 0)
                        printf("on host %s  %d threads will be launched\n", 
							sys_info.nodename, omp_get_num_threads());

                printf("  Hey from thread %d on %s\n", thread_rank, sys_info.nodename);
        }
}

Compile and Execute: Native Mode for CPU or MIC

# to compile for CPU only
$ icc -qopenmp -std=c99 -O2 simple_phi.c -o simple_phi.cpu

# to compile for MIC only
$ icc -qopenmp -std=c99 -O2 -mmic simple_phi.c -o simple_phi.mic

# to execute CPU only version
$ export OMP_NUM_THREADS=4      # set to use 4 threads
$ ./simple_phi.cpu              # nothing special here
on host crphi-008  4 threads will be launched
  Hey from thread 0 on crphi-008
  Hey from thread 1 on crphi-008
  Hey from thread 3 on crphi-008
  Hey from thread 2 on crphi-008
Max error: 0.000000

# to execute MIC only version
$ micnativeloadex simple_phi -e "OMP_NUM_THREADS=4"
on host crphi-008-mic0  4 threads will be launched
  Hey from thread 2 on crphi-008-mic0
  Hey from thread 0 on crphi-008-mic0
  Hey from thread 1 on crphi-008-mic0
  Hey from thread 3 on crphi-008-mic0
Max error: 0.000000

# direct execution is not possible
$ ./simple_phi.mic
./simple_phi: Exec format error. Wrong Architecture.

Source Code: Offload Mode

void saxpy(int n, float a, float *x, float *y)
{
        #pragma offload target(mic) in(x:length(n)) inout(y:length(n))
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
}

void greeting()
{
        int thread_rank;
        struct utsname sys_info;

        #pragma offload target(mic)
        uname(&sys_info);

        #pragma offload target(mic)
        #pragma omp parallel private(thread_rank)
        {

                thread_rank = omp_get_thread_num();
                if (thread_rank == 0)
                        printf("on host %s  %d threads will be launched\n", 
							sys_info.nodename, omp_get_num_threads());

                printf("  Hey from thread %d on %s\n", thread_rank, sys_info.nodename);
        }
}

Compile and Execute: Offload Mode

$ icc -qopenmp -std=c99 -O2 simple_phi.c -o simple_phi.offload

# to execute offload mode
$ export OMP_NUM_THREADS=4
$ ./simple_phi.offload
Max error: 0.000000
on host crphi-008-mic0  4 threads will be launched
  Hey from thread 0 on crphi-008-mic0
  Hey from thread 2 on crphi-008-mic0
  Hey from thread 1 on crphi-008-mic0
  Hey from thread 3 on crphi-008-mic0

Source Code: OpenMP Hybrid Mode

// divides half of the load between the MIC and the CPU
// not practical; for demonstration only
void saxpy(int n, float a, float *x, float *y)
{
        #pragma omp sections
        {
                #pragma omp section
                {
                        #pragma omp parallel for
                        for (int i = n/2; i < n; ++i)
                                y[i] = a*x[i] + y[i];
                }

                #pragma omp section
                #pragma offload target(mic) in(x:length(n/2)) inout(y:length(n/2))
                {
                        #pragma omp parallel for
                        for (int i = 0; i < n/2; ++i)
                                y[i] = a*x[i] + y[i];
                }
        }
}

// enable hybrid execution of this function
void __attribute__((target(mic)))
greeting()
{
        int thread_rank;
        struct utsname sys_info;

        uname(&sys_info);
        #pragma omp parallel private(thread_rank)
        {

                thread_rank = omp_get_thread_num();
                if (thread_rank == 0)
                        printf("on host %s  %d threads will be launched\n", 
							sys_info.nodename, omp_get_num_threads());

                printf("  Hey from thread %d on %s\n", thread_rank, sys_info.nodename);
        }
}

// when calling the function:

#pragma omp sections
{
        #pragma omp section
        {
                greeting();
        }
        #pragma omp section
        #pragma offload target(mic)
        {
                greeting();
        }
}

Compile and Execute: Hybrid Mode

# to compile for hybrid mode
$ icc -qopenmp -std=c99 -O2 simple_phi.c -o simple_phi.hybrid

# to execute hybrid mode
$ export OMP_NUM_THREADS=8
$ ./simple_phi.hybrid
  Hey from thread 4 on crphi-008
  Hey from thread 5 on crphi-008
on host crphi-008  8 threads will be launched
  Hey from thread 0 on crphi-008
  Hey from thread 3 on crphi-008
  Hey from thread 1 on crphi-008
  Hey from thread 2 on crphi-008
  Hey from thread 7 on crphi-008
  Hey from thread 6 on crphi-008
Max error: 0.000000
  Hey from thread 1 on crphi-008-mic0
on host crphi-008-mic0  8 threads will be launched
  Hey from thread 0 on crphi-008-mic0
  Hey from thread 7 on crphi-008-mic0
  Hey from thread 2 on crphi-008-mic0
  Hey from thread 3 on crphi-008-mic0
  Hey from thread 5 on crphi-008-mic0
  Hey from thread 4 on crphi-008-mic0
  Hey from thread 6 on crphi-008-mic0

Using multiple cards

Each Phi machine on Turing comes with two coprocessors; you can utilize both in offload or hybrid mode:

// one way to use both cards
void greeting(int card)
{
        int thread_rank;
        struct utsname sys_info;

        #pragma offload target(mic:card)
        uname(&sys_info);

        #pragma offload target(mic:card)
        #pragma omp parallel private(thread_rank)
        {

                thread_rank = omp_get_thread_num();
                if (thread_rank == 0)
                        printf("on host %s  %d threads will be launched\n", 
							sys_info.nodename, omp_get_num_threads());

                printf("  Hey from thread %d on %s\n", thread_rank, sys_info.nodename);
        }
}
// when calling
greeting(0);
greeting(1);

// another way
void saxpy(int n, float a, float *x, float *y)
{
        int mid = n/2;

        #pragma omp sections
        {
                #pragma omp section
                #pragma offload target(mic:0) in(x[0:mid]) inout(y[0:mid])
                {
                        #pragma omp parallel for
                                for (int i = 0; i < mid; ++i)
                                        y[i] = a*x[i] + y[i];
                }

                #pragma omp section
                #pragma offload target(mic:1) in(x[mid:mid]) inout(y[mid:mid])
                {
                        #pragma omp parallel for
                                for (int i = mid; i < n; ++i)
                                        y[i] = a*x[i] + y[i];
                }
        }
}

Compile and Execute

# to compile for multiple cards, same as offload or hybrid
$ icc -qopenmp -std=c99 -O2 simple_phi.c -o simple_phi

# to show the use of both cards
$ export OMP_NUM_THREADS=2
$ ./simple_phi
Max error: 0.000000
on host crphi-008-mic1  2 threads will be launched
  Hey from thread 0 on crphi-008-mic1
  Hey from thread 1 on crphi-008-mic1
on host crphi-008-mic0  2 threads will be launched
  Hey from thread 0 on crphi-008-mic0
  Hey from thread 1 on crphi-008-mic0

Submit to Slurm, and review job status

Submitting and reviewing your job follows the same procedure as for the simple MPI program.
Please read MPI Programming

Hybrid: MPI + OpenMP

Goal

  • Compile and submit an MPI + OpenMP hybrid job on Turing
  • Understand Linux Processor Affinity
  • Understand how to apply this process to any MPI + multithreaded program within Slurm

Prerequisites

  • Basic knowledge regarding OpenMP programming
  • Tried submitting a regular MPI job on Turing at least once
  • If not, please read MPI Programming
  • Tried submitting a regular OpenMP job on Turing at least once
  • If not, please read OpenMP

What to expect

  • Create a very simple MPI + OpenMP program
  • Compile the code
  • Submit to Slurm with correct process thread configuration

Additional information

Sample Code

Source Code

#include <omp.h>
#include <mpi.h>

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// a very simple Fibonacci number generator using OpenMP tasks
// not the best way to compute Fibonacci numbers; this is just an example
int simple_fib(int n)
{
        int i, j;

        if (n<2)
                return n;
        else {
#pragma omp task shared(i)
                i = simple_fib(n-1);

#pragma omp task shared(j)
                j = simple_fib(n-2);

#pragma omp taskwait
                return i+j;
        }
}

// a simple demonstration of threading from OpenMP, useful to understand
// Slurm and mpirun submission options
void greeting(const char* host)
{
        int thread_rank;

#pragma omp parallel private(thread_rank)
        {

                thread_rank = omp_get_thread_num();
                if (thread_rank == 0)
                        printf("on host %s  %d threads will be launched\n", host, omp_get_num_threads());

                printf("  Hey from thread %d on %s\n", thread_rank, host);
        }
}

int main (int argc, char *argv[])
{
        char host[1024];
        int rank, procs, host_len;

        int fib;
        int* fibs;

        MPI_Init(&argc, &argv);

        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Get_processor_name(host, &host_len);

        greeting(host);


        // demonstrate simple MPI communication, computes a fib number in each process
        // and send all results to rank 0 for display
        fib = simple_fib(rank + 10);
        printf("on host %s, compute fib of %d and it is %d\n", host, rank + 10, fib);

        if (rank == 0) {
                fibs = malloc(sizeof(int) * procs);
        }

        MPI_Gather(&fib, 1, MPI_INT, fibs, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
                for (int i = 0; i < procs; i++) {
                        printf("* from rank %d, gathered FIB number %d \n", i, fibs[i]);
                }
        }

        MPI_Finalize();
}

Setup environment for compilation

You should have the same environment as the simple MPI Example.

$ module list

Currently Loaded Modules:
  1) slurm/17.02   2) gcc/4   3) openmpi/2.0

Compile your code

You can still compile it manually:

$ mpicc -fopenmp mpi_omp.c -o mpi_omp

The flag -fopenmp instructs the compiler to compile the code as an OpenMP program. Please do not forget it: the compilation may still succeed, but the compiler will ignore all OpenMP directives.

Different compilers require different options to enable OpenMP. Here is a list of the compilers you may encounter on Turing.

compiler options
GCC -fopenmp
LLVM -fopenmp
ICC -qopenmp
PGI -mp

For a larger project, we still recommend using a Makefile to manage your build process.

EXT  = c
SRCS = $(shell find src -name '*.$(EXT)')
OBJS = $(SRCS:src/%.$(EXT)=build/%.o)

BIN  = mpi_omp

CC = mpicc
LD = mpicc

CFLAGS  = -fopenmp -O2
LDFLAGS = -fopenmp

all: $(BIN)

$(BIN): $(OBJS)
        $(LD) $(LDFLAGS) $(OBJS) -o $(BIN)

build/%.o: src/%.$(EXT)
        $(CC) $(CFLAGS) -c $< -o $@

clean:
        rm build/*.o
        rm $(BIN)

The only change to this file from the simple MPI example is adding -fopenmp to the compiler flags and linker flags.

Prepare the submission script for Slurm

Below is a list of changes to the submission script that are worth noting. This submission script works for MPI + multithreaded programs in general, such as MPI+OpenACC or MPI+Pthreads.

  • CPUs Per Task --cpus-per-task

    Each MPI task runs as a single process with multiple OpenMP threads. This configuration can be requested with the --cpus-per-task option to Slurm.

  • Environment Variable OMP_NUM_THREADS

    OpenMP uses a number of environment variables to control its behavior. OMP_NUM_THREADS is, perhaps, the most important one as it tells your application how many threads it can have.

    There are many more. You can find detailed information here.

#!/bin/bash -l

#SBATCH --job-name=simple_mpi_omp
#SBATCH --output=output
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=10

enable_lmod
module load gcc/4
module load openmpi/2.0

mpirun -x OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK ./mpi_omp

Slurm and Linux Processor Affinity

Slurm manages processor affinity on the user's behalf. It will not allow a job to use more CPU resources than it was scheduled with. A quick way to verify the binding from inside a job is shown at the end of this section.

For instance:

  • The configuration below launches 2 processes, and each process is given access to 10 cores
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=10
  • If the user launches 10 threads per process (OMP_NUM_THREADS set to 10), the correct configuration is achieved

  • If the user launches fewer than 10 threads per process (OMP_NUM_THREADS set to less than 10), CPU resources are wasted.

  • If the user launches more than 10 threads per process (OMP_NUM_THREADS set to more than 10), each process still has access to only 10 cores. Some threads must be suspended to let other threads run and then resumed; this frequent context switching will likely cause a decrease in performance.
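
To verify the binding Slurm actually applies, one option is to print each task's affinity list from inside the job script before launching the program. This is only a sketch; it assumes the taskset utility (part of util-linux) is available on the compute nodes:

# add to the submission script, before the mpirun line
srun bash -c 'echo "$(hostname): $(taskset -cp $$)"'

Each task should report the list of CPU cores it is allowed to run on.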

Hybrid: MPI + Phi

Goal

  • Compile and submit an MPI job that also utilizes an Intel® Xeon Phi™ Coprocessor on Turing

Prerequisites

  • Basic knowledge regarding OpenMP programming
  • Basic knowledge regarding MPI programming
  • Tried submitting a regular MPI job on Turing at least once
  • If not, please read MPI Programming
  • Tried submitting a regular Phi job on Turing at least once
  • If not, please read Running Phi job on Turing

What to expect

  • Compile an MPI MIC program in native, offload, and hybrid modes
  • Submit to Slurm with correct process thread configuration

Additional information

Sample Code

Native, Offload, or Hybrid

  • Standard MPI

    The Phi supports standard MPI. Any MPI code can be recompiled with the Intel compiler and executed on the Phi alone or on the Phi + CPU together.

  • Standard MPI + Phi Offloading

    Similar to MPI + OpenMP, an MPI program can use offload (and OpenMP) directives to utilize the Phis in the process. It is not possible to offload when the code is running in Phi native mode or hybrid mode, since the Phi cannot offload to itself.

Mode architecture use CPU use MIC Offloadable
Native (CPU) x86_64 Yes Yes Yes
Native (MIC) k1om No Yes No
Hybrid x86_64 & k1om Yes Yes No

Source Code

#include <stdio.h>
#include <sys/utsname.h>

#include <mpi.h>
#include <omp.h>


int main(int argc, char **argv)
{
        char host[1024];
        int rank, procs, host_len;
        struct utsname sys_info;

        MPI_Init(&argc, &argv);

        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma offload target(mic)
        uname(&sys_info);

        MPI_Get_processor_name(host, &host_len);

        printf("Host: %s Rank: %d/%d Program: %s Computing on %s\n", 
			host, rank, procs, argv[0], sys_info.nodename);

        MPI_Finalize();
}

Setup Environment for Compilation

To compile a program that utilizes MICs, please load the following modules:

$ module list

Currently Loaded Modules:
  1) slurm/17.02   2) binutils/2.28   3) icc/17

Compile And Execute Your Code

# to compile for CPU only
$ mpiicc -qno-offload -O2 simple_phi.c -o simple_phi

# to compile for CPU + Offload
$ mpiicc -O2 simple_phi.c -o simple_phi.offload

# to compile for MIC only
$ mpiicc -O2 -mmic simple_phi.c -o simple_phi.mic

# to execute on CPU only
$ mpiexec.hydra -np 4 ./simple_phi
Host: crphi-008 Rank: 2/4 Program: ./simple_phi Computing on crphi-008
Host: crphi-008 Rank: 3/4 Program: ./simple_phi Computing on crphi-008
Host: crphi-008 Rank: 1/4 Program: ./simple_phi Computing on crphi-008
Host: crphi-008 Rank: 0/4 Program: ./simple_phi Computing on crphi-008

# to execute CPU + Offload
$ export I_MPI_MIC=enable
$ mpiexec.hydra -np 4 ./simple_phi.offload
Host: crphi-008 Rank: 3/4 Program: ./simple_phi.offload Computing on crphi-008-mic0
Host: crphi-008 Rank: 2/4 Program: ./simple_phi.offload Computing on crphi-008-mic0
Host: crphi-008 Rank: 0/4 Program: ./simple_phi.offload Computing on crphi-008-mic0
Host: crphi-008 Rank: 1/4 Program: ./simple_phi.offload Computing on crphi-008-mic0

# to execute on MIC only
$ export I_MPI_MIC=enable
$ mpiexec.hydra -np 4 -host mic0,mic1 ./simple_phi.mic
Host: crphi-008-mic0 Rank: 1/4 Program: ./simple_phi.mic Computing on crphi-008-mic0
Host: crphi-008-mic0 Rank: 2/4 Program: ./simple_phi.mic Computing on crphi-008-mic0
Host: crphi-008-mic0 Rank: 3/4 Program: ./simple_phi.mic Computing on crphi-008-mic0
Host: crphi-008-mic0 Rank: 0/4 Program: ./simple_phi.mic Computing on crphi-008-mic0

# to execute on CPU + MIC hybrid
$ export I_MPI_MIC_POSTFIX=.mic
$ export I_MPI_MIC=enable
$ mpiexec.hydra -np 4 -host crphi-008,mic0,mic1 ./simple_phi
Host: crphi-008 Rank: 2/4 Program: ./simple_phi Computing on crphi-008
Host: crphi-008 Rank: 3/4 Program: ./simple_phi Computing on crphi-008
Host: crphi-008-mic1 Rank: 1/4 Program: ./simple_phi.mic Computing on crphi-008-mic1
Host: crphi-008-mic0 Rank: 0/4 Program: ./simple_phi.mic Computing on crphi-008-mic0

For a larger project, we still recommend using a Makefile to manage your build process. The following Makefile builds all three kinds of binaries; in your project, you most likely do not need to build all of them.

EXT  = c
SRCS = $(shell find src -name '*.$(EXT)')
OBJS = $(SRCS:src/%.$(EXT)=build/%.o)
BIN  = simple_phi

MIC_OBJS = $(OBJS:%.o=%.o.mic) 
MIC_BIN  = $(BIN).mic

OFFLOAD_OBJS = $(OBJS:%.o=%.o.offload) 
OFFLOAD_BIN  = $(BIN).offload

CC = mpiicc
LD = mpiicc

CFLAGS  = -O2 
LDFLAGS = 

all: $(BIN) $(MIC_BIN) $(OFFLOAD_BIN)

$(BIN): $(OBJS)
        $(LD) $(LDFLAGS) -qno-offload $(OBJS) -o $(BIN)

$(MIC_BIN): $(MIC_OBJS)
        $(LD) $(LDFLAGS) -mmic $(MIC_OBJS) -o $(MIC_BIN)

$(OFFLOAD_BIN): $(OFFLOAD_OBJS)
        $(LD) $(LDFLAGS) $(OFFLOAD_OBJS) -o $(OFFLOAD_BIN)

build/%.o: src/%.$(EXT)
        $(CC) $(CFLAGS) -qno-offload -c $< -o $@

build/%.o.mic: src/%.$(EXT)
        $(CC) $(CFLAGS) -mmic -c $< -o $@

build/%.o.offload: src/%.$(EXT)
        $(CC) $(CFLAGS) -c $< -o $@

clean:
        rm build/*.o*
        rm $(BIN)
        rm $(MIC_BIN)
        rm $(OFFLOAD_BIN)

Submission Script

The following submission script demonstrates all three ways to run MPI. In your project, you most likely do not need to do this.

#!/bin/bash -l

#SBATCH --job-name=simple_mpi_mic
#SBATCH --output=output
#SBATCH --partition=phi
#SBATCH --nodes=2
#SBATCH --exclusive


enable_lmod
module load icc/17

export I_MPI_MIC=enable

# pick one mode and comment out others
#mode="native"
#mode="offload"
mode="hybrid"

hosts_count=$SLURM_NNODES
hosts=$(srun hostname -s)
mics=$(echo $hosts | xargs -n1 echo |  awk '{print $1 "-mic0\n" $1 "-mic1" }')


case $mode in
        "native")
                hosts="$mics"
                slots=$(( hosts_count * 2 * 60 ))
                binary=simple_phi.mic
                ;;
        "offload")
                slots=$hosts_count
                binary=simple_phi.offload
                ;;
        "hybrid")
                hosts="$hosts $mics"
                export I_MPI_MIC_POSTFIX=.mic
                slots=$(($hosts_count * 2 * 60 + $hosts_count * $(nproc)))
                binary=simple_phi
                ;;
esac

echo $hosts | tr " " "\n" > machinefile.$SLURM_JOB_ID

mpiexec.hydra -bootstrap ssh -np $slots -machine machinefile.$SLURM_JOB_ID ./$binary

rm machinefile.$SLURM_JOB_ID

Hadoop Big Data Cluster

Hadoop cluster hardware resources

The Hadoop cluster is made up of a resource manager, a name node, and 6 data nodes. The cluster has the following hardware resources available:

Node Type Count RAM Storage
Resource Manager 1 n/a n/a
Name Node 1 n/a n/a
Data Node 6 128GB 3 x 400GB SSD

The HDFS file system comprises 7.71TB of total storage. The data nodes are connected over an FDR InfiniBand network for high-speed internode communication, and the name node and resource manager are connected via 10GB interfaces.

The name node handles the management of the HDFS file system while the resource manager schedules jobs on the data nodes.

Hadoop cluster software resources

The Hadoop cluster runs the Hortonworks Hadoop software packages. Hortonworks provides a reliable base of software that can be used for many data analytics purposes.

The cluster provides traditional MapReduce in addition to Spark, Hive, and Flume. The cluster uses the YARN resource scheduler.

Connecting to and using the Hadoop cluster

Users connect to the cluster using standard SSH and authenticate with their MIDAS username and password.

$ ssh user@namenode.hpc.odu.edu

At this point you are ready to run a Hadoop program or browse HDFS.

To view data in the HDFS file system, use the hdfs dfs executable followed by a command. For example, to list the current directory, use hdfs dfs -ls:

$ hdfs dfs -ls
Found 32 items
drwx------   - user hdfs          0 2016-04-18 08:00 .Trash
drwx------   - user hdfs          0 2016-05-03 14:16 .staging
drwxr-xr-x   - user hdfs          0 2016-05-23 14:58 tweets

Here is a list of commands that can be used in HDFS; a short worked example follows the table:

Command Function
hdfs dfs -mkdir <directory name> creates a directory in HDFS
hdfs dfs -cat <filename> prints the contents of the specified file
hdfs dfs -put <filename> copies the specified file to HDFS
hdfs dfs -get <filename> copies the specified file from HDFS to the local directory
hdfs dfs -cp <source filename> <destination filename> copies the source file to the destination file in HDFS
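
For example, a typical round trip with these commands looks like the following (the directory and file names are placeholders):

$ hdfs dfs -mkdir mydata                  # create a directory in HDFS
$ hdfs dfs -put localfile.txt mydata      # copy a local file into it
$ hdfs dfs -cat mydata/localfile.txt      # print the file's contents
$ hdfs dfs -get mydata/localfile.txt .    # copy it back to the local working directory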

Sample Job

We will now show a simple example job running on the Hadoop cluster. A simple but powerful application is a word count. Below is an example of a word count that uses the MapReduce functionality on the Hadoop cluster:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A MapReduce program has several stages: map, combine/sort, and reduce. The example above contains a mapper function called TokenizerMapper, which takes the input data and maps each word to a key/value pair. The key/value pairs are then combined and sorted, and the reducer counts the matching words in the IntSumReducer function.
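
As a concrete illustration, suppose file01 contains the single line Hello World Bye World (an assumption for illustration, consistent with the counters and output shown below). The stages then produce:

map output:       (Hello, 1) (World, 1) (Bye, 1) (World, 1)
combine/sort:     (Bye, [1]) (Hello, [1]) (World, [1, 1])
reduce output:    Bye 1   Hello 1   World 2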

  1. First the text to be counted is copied into HDFS.
$ hdfs dfs -put file01
  2. The wordcount application is then compiled.
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
  3. The application can now be run on the cluster. Input files should be located in /user/<username>.
$ hadoop jar wc.jar WordCount file01 wcout
WARNING: Use "yarn jar" to launch YARN applications.
16/05/24 11:52:55 INFO impl.TimelineClientImpl: Timeline service address: http://rm.ib.cluster:8188/ws/v1/timeline/
16/05/24 11:52:55 INFO client.RMProxy: Connecting to ResourceManager at rm.ib.cluster/172.25.30.157:8050
16/05/24 11:52:56 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 930 for user on 172.25.30.156:8020
16/05/24 11:52:56 INFO security.TokenCache: Got dt for hdfs://namenode.ib.cluster:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 172.25.30.156:8020, Ident: (HDFS_DELEGATION_TOKEN token 930 for user)
16/05/24 11:52:56 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/05/24 11:52:56 INFO input.FileInputFormat: Total input paths to process : 1
16/05/24 11:52:56 INFO mapreduce.JobSubmitter: number of splits:1
16/05/24 11:52:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1461083135942_0013
16/05/24 11:52:56 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 172.25.30.156:8020, Ident: (HDFS_DELEGATION_TOKEN token 930 for user)
16/05/24 11:52:57 INFO impl.YarnClientImpl: Submitted application application_1461083135942_0013
16/05/24 11:52:57 INFO mapreduce.Job: The url to track the job: http://rm.ib.cluster:8088/proxy/application_1461083135942_0013/
16/05/24 11:52:57 INFO mapreduce.Job: Running job: job_1461083135942_0013
16/05/24 11:53:04 INFO mapreduce.Job: Job job_1461083135942_0013 running in uber mode : false
16/05/24 11:53:04 INFO mapreduce.Job:  map 0% reduce 0%
16/05/24 11:53:11 INFO mapreduce.Job:  map 100% reduce 0%
16/05/24 11:53:16 INFO mapreduce.Job:  map 100% reduce 100%
16/05/24 11:53:16 INFO mapreduce.Job: Job job_1461083135942_0013 completed successfully
16/05/24 11:53:16 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=40
                FILE: Number of bytes written=272941
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=137
                HDFS: Number of bytes written=22
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=5045
                Total time spent by all reduces in occupied slots (ms)=4596
                Total time spent by all map tasks (ms)=5045
                Total time spent by all reduce tasks (ms)=2298
                Total vcore-seconds taken by all map tasks=5045
                Total vcore-seconds taken by all reduce tasks=2298
                Total megabyte-seconds taken by all map tasks=67159040
                Total megabyte-seconds taken by all reduce tasks=61181952
        Map-Reduce Framework
                Map input records=1
                Map output records=4
                Map output bytes=38
                Map output materialized bytes=40
                Input split bytes=115
                Combine input records=4
                Combine output records=3
                Reduce input groups=3
                Reduce shuffle bytes=40
                Reduce input records=3
                Reduce output records=3
                Spilled Records=6
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=110
                CPU time spent (ms)=2800
                Physical memory (bytes) snapshot=2827038720
                Virtual memory (bytes) snapshot=38793658368
                Total committed heap usage (bytes)=3375366144
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=22
        File Output Format Counters
                Bytes Written=22

The output of the job is located in wcout:

$ hdfs dfs -cat wcout/*
Bye     1
Hello   1
World   2

Invest in Computing Program

If you would like to invest in the RCS Computing Program, please contact us at hpc@odu.edu.