Xeon Phi User Guide

Table of Contents

1. Introduction

1.1. Description

Intel Xeon Phi coprocessors are a complementary piece of hardware designed specifically to accelerate the performance and scalability of parallel applications. Each coprocessor, or MIC, is made up of Many Integrated Cores. Each core has a slower clock speed than a typical compute core on an HPC system but can support up to 4 execution threads at once. Each Xeon Phi coprocessor can support up to 240 total concurrent threads of execution. Taken together, the cores of a single coprocessor deliver just over 1 TFLOPS, or 1 trillion floating-point operations per second.

From a hardware perspective, the Intel Xeon Phi coprocessors are accessible on a subset of standard compute nodes on the Navy DSRC IBM iDataPlexes, Haise and Kilrain, as well as the Cray XC series systems, Armstrong, Shepard, Conrad, Gordon, and Bean. On the IBM iDataPlexes, each Intel Xeon Phi node is a standard compute node augmented with two coprocessors, allowing up to 480 threads of parallel execution per node. On the Cray XC systems, a single Intel Xeon processor is paired with a single Intel Xeon Phi coprocessor.

1.2. Usage and Availability

The Intel Xeon Phi accelerated nodes are available exclusively via the phi queue. Queue attributes can be displayed with the qstat command.

qstat -Qf phi

The attributes will show there are 12 Intel Xeon Phi nodes available on both Haise and Kilrain, 124 on Armstrong and Shepard, and 168 on Conrad and Gordon. Bean contains 24 Intel Xeon Phi nodes.

These nodes are intended for users to begin testing the applicability of the Intel Xeon Phi coprocessor technology to their codes.

2. Modes of Operation

Navy DSRC systems currently support the following Intel Xeon Phi coprocessor modes of operation:

  • Offload
  • Native

2.1. Offload Mode

In Offload Mode, a user must modify code to specifically direct the compiler to generate code that will run on an Intel Xeon Phi. One potentially quick way to target the coprocessors is to wrap blocks of OpenMP code with Intel Xeon Phi offload directives; the OpenMP code then executes on the coprocessor. Since each core on an Intel Xeon Phi can execute 4 threads, users may be able to utilize up to 240 threads per coprocessor, and with up to 2 coprocessors per Intel Xeon Phi compute node, up to 480 threads may execute in total per node. Key to this, of course, is the scalability of the code as well as the size of the data structures being pushed to the Intel Xeon Phi.
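
For example, a minimal sketch of wrapping an existing OpenMP loop with an offload directive might look like the following, mirroring the directive style used in the example in Section 5.3.3 (the array names x, y, and loop bound n are illustrative only):

! offload the following OpenMP construct to the first coprocessor
!dir$ offload target(mic)
!$omp parallel do
do i = 1, n
    y(i) = 2.0 * x(i) + y(i)
end do
!$omp end parallel do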

It is also possible to utilize high-level programming models such as OpenMP, OpenACC, or OpenCL to program for the Intel Xeon Phi. The OpenMP 4.0 specification includes constructs to support the Intel Xeon Phi. The Cray and Intel compilers support OpenCL. The Cray and PGI compilers support OpenACC directives.
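
As a point of comparison, the same loop expressed with the portable OpenMP 4.0 target construct might look like the sketch below (again, x, y, and n are illustrative):

! OpenMP 4.0 device construct: map data to/from the coprocessor and run the loop there
!$omp target map(to: x) map(tofrom: y)
!$omp parallel do
do i = 1, n
    y(i) = 2.0 * x(i) + y(i)
end do
!$omp end parallel do
!$omp end target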

Below is a table outlining Intel Xeon Phi offload directives for both C/C++ and Fortran code.

Offload Clauses

#pragma offload target(mic) (C/C++) / !dir$ offload target(mic) (Fortran)
    Tells the compiler to generate code for the MIC device; can be used to offload single statements or blocks of code.

in(var[:modifiers])
    Tells the compiler which data to move to the MIC and attributes about the data.

out(var[:modifiers])
    Tells the compiler which data to move from the MIC and attributes about the data.

inout(var[:modifiers])
    Tells the compiler which data to move into and out of the MIC and attributes about the data.

nocopy(var[:modifiers])
    Tells the compiler to create persistent data on the MIC.

if(test)
    Tests for a condition; test evaluates to 0 or 1 in C/C++ and to .true. or .false. in Fortran.

signal(&var) (C/C++) / signal(var) (Fortran)
    Allows for asynchronous execution of offload code; gives a signal.

wait(&var) (C/C++) / wait(var) (Fortran)
    Allows for asynchronous execution of offload code; waits for a signal.

Modifiers for in, out, inout, and nocopy

length(num_elements)
    The number of data elements to transfer.

alloc_if(test)
    Allocates space for the data based on a condition; test evaluates to 0 or 1 in C/C++ and to .true. or .false. in Fortran.

free_if(test)
    Frees the space used by the data based on a condition; test evaluates to 0 or 1 in C/C++ and to .true. or .false. in Fortran.

align(val)
    Aligns data on boundaries based on val.

alloc([first:last])
    Allocates space on the Intel Xeon Phi.

into(var)
    Copies data into a specified location.
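
As a sketch of how several of these clauses combine, persistent data can be staged once on the coprocessor and reused by a later offload. The buffer name and subroutines below are hypothetical, and such subroutines would themselves need to be compiled for the coprocessor (e.g. declared with !dir$ attributes offload:mic):

! first offload: send buf to the coprocessor and leave it allocated there
!dir$ offload target(mic:0) in(buf : free_if(.false.))
call stage_data(buf)

! later offload: reuse the resident buf, retrieve results, and free it when done
!dir$ offload target(mic:0) out(buf : alloc_if(.false.))
call compute_on_data(buf)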

2.2. Native Mode

To run code in Native Mode, whether MPI-based or threaded OpenMP, the code must first be built to run directly on the Intel Xeon Phi. The Intel Compiler Suite is the only compiler currently available that builds native MIC code. To do so, users must load or swap to the Intel Compiler module and add the "-mmic" option to the compiler flag list. For information on loading or swapping modules, please see the Modules User Guide.
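
For example, once the Intel Compiler module is loaded, a native build of a threaded OpenMP code might look like the following (the source and executable names are illustrative):

ifort -openmp -mmic -o myexe.mic mycode.f90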

On the IBM iDataPlex systems, the Intel Xeon Phi nodes currently available to users are attached to standard compute nodes. As such, in order to run native MIC code, users must first start an interactive PBS job in the phi queue. From there, users may SSH into the Intel Xeon Phi node(s) attached to the standard compute node. The Intel Xeon Phi node name will be "standard_compute_node_name-mic[0|1]" (e.g. k15n31-mic0 or h15n31-mic0). For more information on starting an interactive PBS job, please see Section 6.2.1 Interactive Batch Shell in the PBS Guides for Haise and Kilrain.
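
For example, a session might look like the following sketch, reusing the example node name from above (the node actually assigned to your job will differ):

qsub -I -q phi -A Project_ID -l select=1:ncpus=16:mpiprocs=16 -l walltime=30:00
ssh k15n31-mic0
./myexe.mic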

On the Cray XC systems, users should follow the steps below to run natively on an Intel Xeon Phi; a consolidated example follows the list:

  1. Swap to the Intel Programming Environment
    module swap PrgEnv-cray PrgEnv-intel
  2. Unload the ATP and libsci modules
    module unload cray-libsci atp
  3. Set the target processor to Phi
    module swap craype-[non-Phi processor] craype-intel-knc
  4. Add -mmic and -openmp to the compiler options
  5. Add -k to your aprun command line to run code on the Phi portion of the node
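
Putting these steps together, a build-and-run sequence on a Cray XC might look like the following sketch (the craype module to swap out depends on the host processor, e.g. craype-ivybridge or craype-haswell; the file names are illustrative):

module swap PrgEnv-cray PrgEnv-intel
module unload cray-libsci atp
module swap craype-ivybridge craype-intel-knc
ftn -mmic -openmp -o myexe.mic mycode.f90
aprun -cc none -k ./myexe.mic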

3. Code Optimizations

3.1. General Optimizations

In general, optimizations applied to code designed or built to run on traditional x86-64 processors will also improve performance on the Intel Xeon Phi coprocessors, since the Intel Xeon Phi core architecture is also x86-64. Users should try to vectorize code as much as possible; each Intel Xeon Phi core has a wide 512-bit SIMD unit, and combining vectorized loops with the many available threads can provide massive parallelism. The "-vec-report" option listed in the table in Section 4 Compiler Options generates a listing of the vectorization efforts taken by the compiler. Examining the output of a compilation using "-vec-report" should aid in finding areas of code on which to focus optimization efforts.
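
For example, recompiling a suspect source file at the highest report level lists which loops were vectorized and why others were not (the file name is illustrative):

ifort -O3 -vec-report6 -c kernels.f90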

Efforts should be made to align data targeted for the Intel Xeon Phi coprocessors on 64-byte boundaries. This is due to the 512-bit SIMD width of the Intel Xeon Phis. Aligning data on 64-byte boundaries can be accomplished either directly via data structure declarations or with the "-align array64byte" compiler option available in the Intel Fortran compiler. For more information, see Intel's article "Data Alignment to Assist Vectorization".
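
As a brief sketch, 64-byte alignment can also be requested directly in the declaration of a statically sized Fortran array (the array name and size are illustrative):

real :: fa(8192)
!dir$ attributes align : 64 :: fa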

For more information on optimizing code in general, please see Section 5.6 on Code Profiling and Optimization in the relevant system User Guide available on the documentation page.

3.2. Library Optimizations

Intel's Math Kernel Library (MKL) can take advantage of the Intel Xeon Phi coprocessors: its Automatic Offload capability moves segments of computational work to the coprocessor by default. Users who build code for Offload Mode while linking with MKL do not have to modify their Makefiles or compilation options to take advantage of this.
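
For example, a code that calls MKL routines can be built exactly as it would be for the host; linking against MKL with the Intel compiler's -mkl option is sufficient (the file names are illustrative), and offload behavior is then controlled at run time by the environment variables in Section 5.2:

ifort -O2 -o myexe.off mycode.f90 -mkl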

Users who develop custom libraries for specific applications may also wish to implement Offload Mode directives. Source code areas for consideration would include OpenMP blocks and array-traversing loops.

4. Compiler Options

When using the Intel Compiler Suite, there are a number of compiler options that users may invoke related to Intel Xeon Phis. The table below contains a list of the options, the default value, and a brief description of the option.

Note: Users do not have to specify any additional compiler options when building code for Offload Mode. The Intel Compiler Suite will handle any offload directives found within the source code automatically.

Compiler Options

-mmic (not set by default)
    Compiles code to run natively on the Intel Xeon Phi.

-no-offload (not set by default)
    Disables any offload directives.

-vec-report{0...6} (default value = 0)
    Displays vectorization information about the compile process; 0 displays no information, and 6 displays the most.

-openmp-report={0...2} (default value = 0)
    (Fortran only) Displays information about compilation of OpenMP code regions; 0 displays no information, and 2 displays the most.

5. Running Xeon Phi Jobs

5.1. Queue and Queue Resources

The Intel Xeon Phi nodes are exclusively available via the phi queue. PBS jobs that run within the phi queue are charged under the same policies that cover the general compute pool of the systems (i.e. # of nodes * # of walltime hours * # of standard compute cores per node).

Note: The IBM iDataPlex systems Haise and Kilrain have 16 standard compute cores on each Intel Xeon Phi node (i.e. #PBS -l select=1:ncpus=16). The Cray XC30 systems Armstrong and Shepard have 10 standard compute cores per Intel Xeon Phi node (i.e. #PBS -l select=1:ncpus=10), and the Cray XC40 systems Conrad and Gordon have 12 (i.e. #PBS -l select=1:ncpus=12).
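For example, a 4-hour job using 2 Intel Xeon Phi nodes on Haise or Kilrain would be charged 2 nodes * 4 hours * 16 cores = 128 core-hours.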

Below is an example job script targeting 1 Intel Xeon Phi node for 30 minutes via the phi queue on the iDataPlex systems Haise and Kilrain:

#!/bin/bash
#PBS -N phi_test_job
#PBS -q phi
#PBS -A Project_ID
#PBS -j oe
#PBS -l walltime=30:00
#PBS -l select=1:ncpus=16:mpiprocs=16
...
...
...

5.2. Environment Variables

Users may modify the environment an application utilizes on the Intel Xeon Phi via a handful of environment variables. Some of the variables may look familiar, as the names are based on variables often used in hybrid computing.

Environment Variables

MIC_ENV_PREFIX (not defined by default)
    Sets the prefix for Intel Xeon Phi environment variables.

MIC_OMP_NUM_THREADS (not defined by default)
    Sets the number of threads to utilize per Intel Xeon Phi.

MIC_KMP_AFFINITY (not defined by default)
    Sets the thread layout on the Intel Xeon Phi:
    balanced - places threads on separate cores until all cores have at least one thread.
    compact - compresses thread placement.
    scatter - similar to balanced but separates consecutively numbered threads if possible.

MIC_LD_LIBRARY_PATH (not defined by default)
    Sets the $LD_LIBRARY_PATH value for the Intel Xeon Phi environment.

OFFLOAD_REPORT (default value = 0)
    Controls reporting of offload activity:
    0 - No report.
    1 - Name of the function using Automatic Offload, effective work division, time spent on the host during the call, and time spent on the Xeon Phi during the call.
    2 - All of level 1 plus the amount of data transferred to and from the coprocessor.

MKL_MIC_ENABLE (default value = 1)
    Enables automatic offload within MKL routines.

5.3. Launching Executables

Launching an executable intended for either Offload Mode or Native Mode use of the Intel Xeon Phi is not much different than launching a CPU-based executable.

Note: Although Intel Xeon Phis may support up to 240 threads (4 threads per core x 60 cores), it is recommended that users leave at least 1 core available for the Intel Xeon Phi operating system to utilize. Therefore, when setting environment variables such as $MIC_OMP_NUM_THREADS or $MIC_KMP_PLACE_THREADS, it is generally better to use 236 total threads and 59 cores. See the examples below for reference.

5.3.1. Example Launch Script (2-node MPI with offload directives)

Here's an example of launching a 2-node MPI executable that has offload directives, is linked with MKL, and will generate an offload report:

# Set up the environment
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=236
export MIC_KMP_AFFINITY="granularity=fine,compact"
export MIC_KMP_PLACE_THREADS="59c,4t"
export MKL_MIC_ENABLE=1     #MKL is only available on IBM iDataPlex systems
export OFFLOAD_REPORT=2

# Select one of the following launch options
# On IBM iDataPlex systems, the Intel MPI or OpenMPI launcher should be used
# On Cray XC systems, the Cray aprun launcher should be used

# Launch IBM iDataPlex executable
mpirun -np 32 ./myexe.off

# Launch Cray XC executable
aprun -cc none ./myexe.off

5.3.2. Example Launch Script (1-node Native Mode threaded OpenMP)

Here's an example of launching a single-node Native Mode threaded OpenMP executable directly on the Intel Xeon Phi:

# Set up the environment
export OMP_NUM_THREADS=236
export KMP_AFFINITY="granularity=fine,compact"
export KMP_PLACE_THREADS="59c,4t"
export MKL_MIC_ENABLE=1      #MKL is only available on IBM iDataPlex systems

# Select one of the below launch options
# On IBM iDataPlex systems, the executable may be called directly or launched with 
#    Intel MPI or OpenMPI's mpirun
# On Cray XC systems, the Cray aprun launcher should be used

# Launch IBM iDataPlex executable
 ./myexe.mic

# Launch Cray XC executable
aprun -cc none -k ./myexe.mic

5.3.3. Example Source Code

Below is example code that has been modified to take advantage of Intel Xeon Phi coprocessors. It contains both OpenMP and Xeon Phi offload directives. The code may be compiled for either Offload Mode or Native Mode using the appropriate compiler options: given the inclusion of OpenMP directives, users should add the "-openmp" compiler option to enable OpenMP threading in either scenario, and for Native Mode compilation, users should also add the "-mmic" option.

! scale the calculation across the threads requested
! the environment variables OMP_NUM_THREADS and KMP_AFFINITY must be set

! the next directive is needed for Offload Mode only
!dir$ offload target (mic : 0)
!$omp parallel do private(i, j, k, offset)
do i = 1, numthreads
    ! each thread works on its own array section;
    ! calculate the offset into the correct section
    offset = i * LOOP_COUNT
    ! loop many times to generate lots of calculations
    do j = 1, MAXFLOPS_ITERS
        ! scale the 1st array and add in the 2nd array
        !$omp simd aligned(fa, fb, a : 64)
        do k = 1, LOOP_COUNT
            fa(k+offset) = a * fa(k+offset) + fb(k+offset)
        end do
        !$omp end simd
    end do
end do
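
Assuming the snippet above is embedded in a complete program in a file such as example.f90 (the file and executable names are illustrative), it could be built for either mode as follows:

# Offload Mode: host executable containing offloaded regions
ifort -openmp -o myexe.off example.f90

# Native Mode: executable that runs entirely on the coprocessor
ifort -openmp -mmic -o myexe.mic example.f90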

6. Links to Vendor Documentation

Intel Xeon Phi Main Page: http://software.intel.com/mic-developer
Intel Xeon Phi Webinar: http://software.intel.com/en-us/articles/intel-xeon-phi-webinar
Programming and Compiling for Intel MIC Architecture: http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture