Xeon Phi User Guide

1. Introduction

1.1. Description

Intel Xeon Phi coprocessors are add-on hardware designed specifically to accelerate the performance and scalability of parallel applications. Each coprocessor, also called a MIC, is built on Intel's Many Integrated Core architecture. Each core has a slower clock speed than a typical compute core on an HPC system but can support up to 4 execution threads at once. Taken together, the cores of one coprocessor deliver up to just over 1 TFLOPS, or 1 trillion floating point operations per second.

From a hardware perspective, the Intel Xeon Phi coprocessors are attached to a subset of the standard compute nodes on Haise and Kilrain. Each of these Intel Xeon Phi nodes is a standard compute node augmented by two coprocessors. Each coprocessor can support up to 240 concurrent threads of execution, which allows for up to 480 threads of parallel execution per Intel Xeon Phi node. These nodes also have 64 GBytes of total memory to provide space for the overhead incurred when transferring data between the compute node and the coprocessors.

1.2. Usage and Availability

The Intel Xeon Phi accelerated nodes are available exclusively via the phi64 queue. Queue attributes can be displayed with the qstat command.

qstat -Qf phi64

The attributes will show there are 12 Intel Xeon Phi nodes available on both Haise and Kilrain. These nodes are intended for users to begin testing the applicability of the Intel Xeon Phi coprocessor technology to their codes.

Please Note: At this time, users who would like to run jobs utilizing the Intel Xeon Phi nodes are asked to request access by sending a note to dsrchelp@navydsrc.hpc.mil.

2. Modes of Operation

The Intel Xeon Phi coprocessors allow for several modes of operation. These are:

  • Offload
  • Native
  • Native/Symmetric

2.1. Offload Mode

In Offload Mode, a user must modify code to explicitly direct the compiler to generate code that will run on an Intel Xeon Phi. One potentially quick way to target the coprocessors is to wrap blocks of OpenMP code with Intel Xeon Phi offload directives. The OpenMP code then executes on the Intel Xeon Phi coprocessor. Since each core on an Intel Xeon Phi can execute 4 threads, users may be able to utilize up to 240 threads per coprocessor (4 threads on each of 60 cores). With 2 coprocessors available to each Intel Xeon Phi node, up to 480 threads may be executed in total per node. How much of this can be exploited depends on the scalability of the code and on the size of the data structures being transferred to the coprocessor.

Below is a table outlining offload directives for both C/C++ and Fortran code.

Offload Clauses
C/C++ | Fortran | Description
#pragma offload target(mic) | !dir$ offload target(mic) | Tells the compiler to generate code for the MIC device; can be used to offload single statements or blocks of code.
in(var[:modifiers]) | in(var[:modifiers]) | Tells the compiler which data to move to the MIC and attributes about the data.
out(var[:modifiers]) | out(var[:modifiers]) | Tells the compiler which data to move from the MIC and attributes about the data.
inout(var[:modifiers]) | inout(var[:modifiers]) | Tells the compiler which data to move into and out of the MIC and attributes about the data.
nocopy(var[:modifiers]) | nocopy(var[:modifiers]) | Tells the compiler to create persistent data on the MIC.
if(test), where test evaluates to 0 or 1 | if(test), where test evaluates to .true. or .false. | Tests for a condition; the code is offloaded only if the test is true.
signal(&var) | signal(var) | Allows for asynchronous execution of offload code; gives a signal.
wait(&var) | wait(var) | Allows for asynchronous execution of offload code; waits for a signal.

Modifiers for in, out, inout and nocopy
C/C++ | Fortran | Description
length(num_elements) | length(num_elements) | The number of elements of the data to transfer.
alloc_if(test), where test evaluates to 0 or 1 | alloc_if(test), where test evaluates to .true. or .false. | Allocates space for data based on the condition.
free_if(test), where test evaluates to 0 or 1 | free_if(test), where test evaluates to .true. or .false. | Frees space used by data based on the condition.
align(val) | align(val) | Aligns data on boundaries based on val.
alloc([first:last]) | alloc([first:last]) | Allocates space on the Intel Xeon Phi.
into(var) | into(var) | Copies data into the specified location.

2.2. Native Mode

To run either MPI-based or threaded OpenMP code in Native Mode, the code must first be built to run directly on the Intel Xeon Phi node. The Intel Compiler Suite is the only compiler currently available that will build native MIC code. To do this, users must load or swap to the Intel Compiler module and add the "-mmic" option to the compiler flag list. For information on loading or swapping modules, please see the Modules User Guide.

The Intel Xeon Phi nodes currently available to users are attached to standard compute nodes. As such, in order to run native MIC code, users must first start an interactive PBS job in the phi64 queue. From there, users may SSH into the Intel Xeon Phi node(s) attached to the standard compute node. The Intel Xeon Phi node name will be "standard_compute_node_name-mic[0|1]" (e.g., k15n31-mic0 or h15n31-mic0). For more information on starting an interactive PBS job, please see Section 6.2.1 Interactive Batch Shell in the PBS Guides for Haise and Kilrain.

2.3. Native/Symmetric Mode

Native/Symmetric Mode involves running multiple MPI executables at once. To take advantage of Native/Symmetric Mode, users must compile the application's code both for the standard compute node and for the Intel Xeon Phi node.

NOTE: Native/Symmetric Mode is available on the Intel Xeon Phi nodes on Haise and Kilrain. However, MPI communication between multiple Intel Xeon Phi nodes requires an update to the OFED software stack. Until that update is in place, Native/Symmetric Mode is only available within a single Intel Xeon Phi node.

3. Code Optimizations

3.1. General Optimizations

In general, optimizations applied to code designed or built to run on traditional x86-64 processors will also improve the performance of code intended to run on Intel Xeon Phi coprocessors, since the Intel Xeon Phi core architecture is also x86-64. Users should try to vectorize code as much as possible, as the multiple threads available to each Intel Xeon Phi core will take advantage of that and potentially provide massive parallelism. The "-vec_report" option, listed in the table in Section 4 Compiler Options, generates a report of the vectorization performed by the compiler. Examining this report should help identify the areas of code on which to focus optimization efforts.

Data targeted for the Intel Xeon Phi coprocessors should be aligned on 64-byte boundaries, because the SIMD width of the Intel Xeon Phi is 512 bits (64 bytes). Aligning data on 64-byte boundaries can be accomplished either directly via data structure declarations or with the "-align array64byte" compiler option available in the Intel Fortran compiler. For more information, see Intel's article "Data Alignment to Assist Vectorization".

For more information on optimizing code in general, please see Section 5.6 on Code Profiling and Optimization in the Haise and Kilrain User Guides.

3.2. Library Optimizations

Intel's Math Kernel Library (MKL) will take advantage of the Intel Xeon Phi nodes. By default, MKL is able to offload segments of its computational work to the coprocessors automatically. Users who build code for Offload Mode while linking with MKL do not have to make any modifications to Makefiles or compilation options to take advantage of this.

Users who develop custom libraries for specific applications may also wish to implement Offload Mode directives. Source code areas for consideration would include OpenMP blocks and array-traversing loops.

4. Compiler Options

When using the Intel Compiler Suite, users may invoke a number of compiler options related to the Intel Xeon Phi. The table below lists each option, its default value, and a brief description.

Note: Users do not have to specify any additional compiler options when building code for Offload Mode. The Intel Compiler Suite will handle any offload directives found within the source code automatically.

Compiler Options
Option | Default Value | Description
-mmic | Not set by default | Compile code to run natively on Intel Xeon Phi.
-no-offload | Not set by default | Disable any offload directives.
-vec_report {0...6} | 0 | Display vectorization information about the compile process. 0 displays no information; 6 displays the most.
-openmp-report={0...2} | 0 | (Fortran only) Display information about compilation of OpenMP code regions. 0 displays no information; 2 displays the most.

5. Running Xeon Phi Jobs

5.1. Queue and Queue Resources

The Intel Xeon Phi nodes are available exclusively via the phi64 queue. PBS jobs that run within the phi64 queue are charged under the same policies that cover the general compute pool of the systems (i.e., # of nodes * # of walltime hours * # of standard compute cores per node). For example, assuming 16 standard compute cores per node, a 30-minute job on 1 node would be charged 1 * 0.5 * 16 = 8 core-hours.

Below is an example of a job script targeting 1 Intel Xeon Phi node via the phi64 queue for 30 minutes:

#!/bin/bash
#PBS -N phi_test_job
#PBS -q phi64
#PBS -A NAVOS96390NTS
#PBS -j oe
#PBS -l walltime=30:00
#PBS -l select=1:ncpus=16:mpiprocs=16
...
...
...

5.2. Environment Variables

Users may modify the environment an application uses on the Intel Xeon Phi via a handful of environment variables. Some of the variables may look familiar, as their names are based on variables often used in hybrid computing.

Environment Variables
Variable | Default Value | Description
MIC_ENV_PREFIX | Not defined by default | Sets the prefix for Intel Xeon Phi environment variables.
MIC_OMP_NUM_THREADS | Not defined by default | Sets the number of threads to utilize per Intel Xeon Phi.
MIC_KMP_AFFINITY | Not defined by default | Sets the thread layout on the Intel Xeon Phi. balanced: places threads on separate cores until all cores have at least one thread; compact: compresses thread placement; scatter: similar to balanced, but separates consecutively numbered threads if possible.
MIC_LD_LIBRARY_PATH | Not defined by default | Sets the LD_LIBRARY_PATH value for the Intel Xeon Phi environment.
OFFLOAD_REPORT | 0 | Sets the level of offload reporting. 0: no report; 1: name of function using Automatic Offload, effective work division, time spent on host during call, and time spent on Xeon Phi during call; 2: everything in level 1 plus the amount of data transferred to and from the coprocessor.
MKL_MIC_ENABLE | 1 | Enables automatic offload within MKL routines.

5.3. Launching Executables

Launching an executable intended for either Offload Mode or Native Mode use of the Intel Xeon Phi is no different from launching a CPU-based executable.

5.3.1. Example Launch Script (2-node MPI with offload directives)

Here's an example of launching a 2-node MPI executable that has offload directives, is linked with MKL, and will generate an offload report:

# Set up the environment
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=60
export MIC_KMP_AFFINITY=balanced
export MKL_MIC_ENABLE=1
export OFFLOAD_REPORT=2

# Launch executable
mpirun -np 32 ./myexe.off

5.3.2. Example Launch Script (1-node Native Mode threaded OpenMP)

Here's an example of launching a single node Native Mode threaded OpenMP executable directly on the Intel Xeon Phi:

# Set up the environment
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=60
export MIC_KMP_AFFINITY=balanced
export MKL_MIC_ENABLE=1

# Launch executable
./myexe.mic

6. Links to Vendor Documentation

Intel Xeon Phi Main Page: http://software.intel.com/mic-developer
Intel Xeon Phi Webinar: http://software.intel.com/en-us/articles/intel-xeon-phi-webinar
Programming and Compiling for Intel MIC Architecture: http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture