Xeon Phi User Guide

Table of Contents

1. Introduction

1.1. Description

Intel Xeon Phi coprocessors are a complementary piece of hardware designed specifically to accelerate the performance and scalability of parallel applications. Each coprocessor is built on Intel's Many Integrated Core (MIC) architecture. Each core has a slower clock speed than a typical compute core on an HPC system but can support up to 4 execution threads at once, so each Xeon Phi coprocessor can support up to 240 concurrent threads of execution. As a whole, these cores deliver a combined peak performance of just over 1 TFLOP, or 1 trillion floating-point operations per second.

From a hardware perspective, the Intel Xeon Phi coprocessors are accessible on a subset of standard compute nodes on the Navy DSRC Cray XC series systems, Conrad and Gordon. On these systems, a single Intel Xeon processor is paired with a single Intel Xeon Phi coprocessor.

1.2. Usage and Availability

The Intel Xeon Phi accelerated nodes are available exclusively via the phi queue. Queue attributes can be displayed with the qstat command.

qstat -Qf phi

The attributes will show there are 168 Intel Xeon Phi nodes available on Conrad and Gordon.

These nodes are intended for users to begin testing the applicability of the Intel Xeon Phi coprocessor technology to their codes.

2. Modes of Operation

Navy DSRC systems currently support the following Intel Xeon Phi coprocessor modes of operation:

  • Offload
  • Native

2.1. Offload Mode

In Offload Mode, a user must modify code to specifically direct the compiler to generate code that will run on an Intel Xeon Phi. One potentially quick means of targeting the coprocessors is to wrap blocks of OpenMP code with Intel Xeon Phi offload directives; the OpenMP code then executes on the Intel Xeon Phi coprocessor. Since each core on an Intel Xeon Phi can execute 4 threads, users may be able to utilize up to 240 threads per Intel Xeon Phi. Key to this, of course, is the scalability of the code as well as the size of the data structures being pushed to the coprocessor.

It is also possible to program for Intel Xeon Phis using high-level programming models such as OpenMP 4.0, OpenACC, or OpenCL. The OpenMP 4.0 specification has constructs to support the Intel Xeon Phi. The Cray and Intel compilers support OpenCL. The Cray and PGI compilers support the OpenACC directives.

Below is a table outlining Intel Xeon Phi offload directives for both C/C++ and Fortran code.

Offload Clauses

#pragma offload target(mic) (C/C++)
!dir$ offload target(mic) (Fortran)
    Tells the compiler to generate code for the MIC device; can be used to offload single statements or blocks of code.
in(var[:modifiers])
    Tells the compiler which data to move to the MIC and attributes of the data.
out(var[:modifiers])
    Tells the compiler which data to move from the MIC and attributes of the data.
inout(var[:modifiers])
    Tells the compiler which data to move both into and out of the MIC and attributes of the data.
nocopy(var[:modifiers])
    Tells the compiler to create persistent data on the MIC.
if(test)
    Offloads only when the condition holds. In C/C++, test evaluates to 0 or 1; in Fortran, to .true. or .false.
signal(&var) (C/C++); signal(var) (Fortran)
    Allows for asynchronous execution of offload code; gives a signal.
wait(&var) (C/C++); wait(var) (Fortran)
    Allows for asynchronous execution of offload code; waits for a signal.

Modifiers for in, out, inout, and nocopy

length(num_elements)
    The number of data elements to transfer.
alloc_if(test)
    Allocates space for the data on the MIC when the condition holds. In C/C++, test evaluates to 0 or 1; in Fortran, to .true. or .false.
free_if(test)
    Frees the space used by the data on the MIC when the condition holds. In C/C++, test evaluates to 0 or 1; in Fortran, to .true. or .false.
align(val)
    Aligns data on boundaries based on val.
alloc([first:last])
    Allocates space on the Intel Xeon Phi for the given array section.
into(var)
    Copies data into the specified location.

2.2. Native Mode

To run code in Native Mode, the code, either MPI-based or threaded OpenMP, must first be built to run directly on the Intel Xeon Phi. The Intel Compiler Suite is currently the only compiler available that will build native MIC code. To do this, users must load or swap to the Intel Compiler module and add the "-mmic" option to the compiler flag list. For information on loading or swapping modules, please see the Modules User Guide.

On Cray XC systems, users should follow the steps below to run natively on an Intel Xeon Phi:

  1. Swap to the Intel Programming Environment
    module swap PrgEnv-cray PrgEnv-intel
  2. Unload the ATP and libsci modules
    module unload cray-libsci atp
  3. Set the target processor to Phi
    module swap craype-[non-Phi processor] craype-intel-knc
  4. Add -mmic and -openmp to the compiler options
  5. Add -k to your aprun command line to run code on the Phi portion of the node
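Put together, a native-mode build and launch might look like the following sketch. The file names myprog.f90 and myprog.mic are hypothetical, and the bracketed module name stands for whichever host-CPU targeting module is currently loaded on the system.

```
# compile-time environment
module swap PrgEnv-cray PrgEnv-intel
module unload cray-libsci atp
module swap craype-[non-Phi processor] craype-intel-knc
ftn -mmic -openmp -o myprog.mic myprog.f90

# inside a phi-queue batch job, run on the Phi portion of the node
aprun -k ./myprog.mic
```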

3. Code Optimizations

3.1. General Optimizations

In general, optimizations applied to code designed or built to run on traditional x86-64 processors will also improve the performance of code intended to run on Intel Xeon Phi coprocessors, since the Intel Xeon Phi core architecture is also x86-64. Users should try to vectorize code as much as possible, as the multiple threads available to each Intel Xeon Phi core can take advantage of vectorized loops and potentially provide massive parallelism. The "-vec_report" option listed in the table in Section 4, Compiler Options, will generate a listing of the vectorization efforts taken by the compiler. Examining the results of a compilation using "-vec_report" should aid in finding areas of code on which to focus optimization efforts.

Efforts should be made to align data targeted for the Intel Xeon Phi coprocessors on 64-byte boundaries. This is due to the 512-bit SIMD width of the Intel Xeon Phis. Aligning data on 64-byte boundaries can be accomplished either directly via data structure declarations or with the "-align array64byte" compiler option available in the Intel Fortran compiler. For more information, see Intel's article "Data Alignment to Assist Vectorization".

For more information on optimizing code in general, please see Section 5.6 on Code Profiling and Optimization in the relevant system User Guide available on the documentation page.

Note: It is strongly recommended that users compiling code for execution on Intel Xeon Phi nodes utilize the latest Intel compiler suite available on the system.

3.2. Library Optimizations

Intel's Math Kernel Library (MKL) will take advantage of the Intel Xeon Phi nodes. MKL provides Automatic Offload, which moves segments of computational work to the coprocessor by default. Users who build code for Offload Mode while linking with MKL do not have to make any modifications to Makefiles or compilation options to take advantage of this.
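For example, a build and run using MKL automatic offload might look like the sketch below. The names myprog.f90 and myexe.mkl are hypothetical; "-mkl" is the Intel compiler's shortcut for linking against MKL.

```
# link with MKL as usual; no source changes are needed for automatic offload
ftn -mkl -o myexe.mkl myprog.f90

# automatic offload is controlled through the environment at run time
export MKL_MIC_ENABLE=1    # enable automatic offload in MKL routines
export OFFLOAD_REPORT=2    # optional: report offload activity and data movement
aprun -cc none ./myexe.mkl
```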

Users who develop custom libraries for specific applications may also wish to implement Offload Mode directives. Source code areas for consideration would include OpenMP blocks and array-traversing loops.

4. Compiler Options

When using the Intel Compiler Suite, there are a number of compiler options that users may invoke related to Intel Xeon Phis. The table below contains a list of the options, the default value, and a brief description of the option.

Note: Users do not have to specify any additional compiler options when building code for Offload Mode. The Intel Compiler Suite will handle any offload directives found within the source code automatically.

Compiler Options

-mmic (not set by default)
    Compile code to run natively on the Intel Xeon Phi.
-no-offload (not set by default)
    Disable any offload directives.
-vec_report{0...6} (default value = 0)
    Display vectorization information about the compile process; 0 displays no information, 6 displays the most.
-openmp-report={0...2} (default value = 0)
    (Fortran only) Displays information about the compilation of OpenMP code regions; 0 displays no information, 2 displays the most.

Note: It is strongly recommended that users compiling code for execution on Intel Xeon Phi nodes utilize the latest Intel compiler suite available on the system.

5. Running Xeon Phi Jobs

5.1. Queue and Queue Resources

The Intel Xeon Phi nodes are exclusively available via the phi queue. PBS jobs that run within the phi queue are charged under the same policies that cover the general compute pool of the systems (i.e., # of nodes x # of walltime hours x # of standard compute cores per node).

Note: The Cray XC40 systems, Conrad and Gordon, have 12 standard compute cores per Intel Xeon Phi node (i.e. #PBS -l select=1:ncpus=12).
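To make the charging formula concrete, here is a small Python sketch (phi_job_charge is a hypothetical helper, not a site-provided tool):

```python
def phi_job_charge(nodes, walltime_hours, cores_per_node=12):
    """Charge in core-hours: # of nodes x # of walltime hours x
    # of standard compute cores per node (12 per Phi node on
    Conrad and Gordon)."""
    return nodes * walltime_hours * cores_per_node

# a 1-node, 30-minute phi-queue job on Conrad
print(phi_job_charge(nodes=1, walltime_hours=0.5))  # -> 6.0
```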

Below is an example job script targeting 1 Intel Xeon Phi node for 30 minutes via the phi queue on the Cray XC40, Conrad:

#PBS -N phi_test_job
#PBS -q phi
#PBS -A Project_ID
#PBS -j oe
#PBS -l walltime=30:00
#PBS -l select=1:ncpus=12:mpiprocs=12

5.2. Environment Variables

Users may modify the environment an application utilizes on the Intel Xeon Phi via a handful of environment variables. Some of the variables may look familiar, as the names are based off of variables often used in hybrid computing.

Environment Variables

MIC_ENV_PREFIX (not defined by default)
    Sets the prefix for Intel Xeon Phi environment variables.
MIC_OMP_NUM_THREADS (not defined by default)
    Sets the number of threads to utilize per Intel Xeon Phi.
MIC_KMP_AFFINITY (not defined by default)
    Sets the thread layout on the Intel Xeon Phi:
    balanced - places threads on separate cores until all cores have at least one thread;
    compact - compresses thread placement;
    scatter - similar to balanced but separates consecutively numbered threads if possible.
MIC_LD_LIBRARY_PATH (not defined by default)
    Sets the $LD_LIBRARY_PATH value for the Intel Xeon Phi environment.
OFFLOAD_REPORT (not defined by default)
    Generates a report on offload activity:
    1 - name of the function using Automatic Offload, effective work division, time spent on the host during the call, and time spent on the Xeon Phi during the call;
    2 - reports all of level 1 plus the amount of data transferred to and from the coprocessor.
MKL_MIC_ENABLE (default value = 1)
    Enables Automatic Offload within MKL routines.

5.3. Launching Executables

Launching an executable intended for either Offload Mode or Native Mode use of the Intel Xeon Phi is not much different than launching a CPU-based executable.

Note: Although Intel Xeon Phis may support up to 240 threads (4 threads per core x 60 cores), it is recommended that users leave at least 1 core available for the Intel Xeon Phi operating system to utilize. Therefore, when setting environment variables such as $MIC_OMP_NUM_THREADS or $MIC_KMP_PLACE_THREADS, it is generally better to use 236 total threads and 59 cores. See the examples below for reference.

5.3.1. Example Launch Script (2-node MPI with offload directives)

Here's an example of launching a 2-node MPI executable that has offload directives, is linked with MKL, and will generate an offload report:

# Set up the environment
# forward MIC_-prefixed variables to the coprocessor (stripped of the prefix)
export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY="granularity=fine,compact"
export MIC_KMP_PLACE_THREADS="59c,4t"
# request an offload report (level 2: timings plus data transferred)
export OFFLOAD_REPORT=2
# The Cray aprun launcher should be used
# Launch Cray XC executable
aprun -cc none ./myexe.off

5.3.2. Example Launch Script (1-node Native Mode threaded OpenMP)

Here's an example of launching a single-node Native Mode threaded OpenMP executable directly on the Intel Xeon Phi:

# Set up the environment
export OMP_NUM_THREADS=236
export KMP_AFFINITY="granularity=fine,compact"
export KMP_PLACE_THREADS="59c,4t"
# The Cray aprun launcher should be used
# Launch Cray XC executable
aprun -cc none -k ./myexe.mic

5.3.3. Example Source Code

Below is example code that has been modified to take advantage of Intel Xeon Phi coprocessors. It contains both OpenMP and Xeon Phi offload directives. This code may be compiled for use as an Offload or Native mode code, using the appropriate compiler options. Given the inclusion of OpenMP directives, users should add the "-openmp" compiler option to utilize OpenMP threading in either scenario. For native mode compilation, users should add the "-mmic" option to a compilation command.

! scale the calculation across the threads requested;
! need to set environment variables OMP_NUM_THREADS and KMP_AFFINITY
! the next directive is needed for Offload Mode only
!dir$ offload target (mic : 0)
!$OMP PARALLEL DO PRIVATE(i,j,k,offset)
    do i=1, numthreads
        ! each thread will work on its own array section;
        ! calculate the offset into the right section
        offset = i*LOOP_COUNT
        ! loop many times to get lots of calculations
        do j=1, MAXFLOPS_ITERS
            ! scale the 1st array and add in the 2nd array
            !$omp simd aligned(fa,fb:64)
            do k=1, LOOP_COUNT
                fa(k+offset) = a * fa(k+offset) + fb(k+offset)
            end do
            !$omp end simd
        end do
    end do
!$OMP END PARALLEL DO

6. Links to Vendor Documentation

Intel Xeon Phi Main Page: http://software.intel.com/mic-developer
Intel Xeon Phi Webinar: http://software.intel.com/en-us/articles/intel-xeon-phi-webinar
Programming and Compiling for Intel MIC Architecture: http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture