For IA-32 processors, Intel 64 processors, and IA-64 processors.
Application Performance
A Step-by-Step Approach to Application Tuning with Intel Compilers
Before you begin performance tuning, you may want to check correctness of your application by building it without optimization using /Od (-O0).
1. Use the General Optimization Options (Windows* /O1, /O2, or /O3; Linux* and Mac OS* -O1, -O2, or -O3) and determine which one works best for your application by measuring performance with each. Most users should start at /O2 (-O2), the default, before trying more advanced optimizations. Next, try /O3 (-O3) for loop-intensive applications, especially on IA-64-based systems.
2. Fine-tune performance to target systems based on IA-32 and Intel 64 with processor-specific options such as /QxT (-xT) for the Intel Core2 processor family. For a complete list of recommended options for specific processors, see the table Recommended Processor-Specific Optimization Options for IA-32 and Intel 64 Architectures. For Dual-Core Intel Itanium 2 9000 Sequence processors, set /G2-p9000 (-mtune=itanium2-p9000).
3. Use the Intel VTune Performance Analyzer to help you identify performance hotspots so that you know which specific parts of your application could benefit from further tuning. The Intel Compilers' optimization reports also help by showing where the compiler could benefit from your help.
4. Add in interprocedural optimization (IPO), /Qipo (-ipo), and/or profile-guided optimization (PGO), /Qprof-gen and /Qprof-use (-prof-gen and -prof-use), then measure performance again to determine whether your application benefits from one or both of them.
5. Optimize your application for multi-core, multi-processor, or Hyper-Threading Technology (HT Technology)-capable systems using the parallel performance options (/Qparallel (-parallel), /Qopenmp (-openmp)), or by using Intel Performance Libraries or the Intel Threading Building Blocks.
6. Use Intel Thread Profiler to help you understand the structure of your threaded applications and maximize their performance. Use Intel Thread Checker to reduce the time to market for threaded applications by diagnosing threading errors and speeding up the development process. Both threading tools work with binary instrumentation. Using the Intel Compiler with source code instrumentation will give you more complete source code information.

Please consult the Compiler Documentation and the Optimizing Applications with the Intel C++ & Fortran Compilers white paper for more details.
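The first step above depends on measuring performance at each optimization level. The following is a minimal timing sketch in C, not taken from the Intel documentation: the kernel sum_squares and the helper time_kernel are hypothetical names standing in for an application hotspot and its measurement loop.

```c
#include <time.h>

/* Hypothetical stand-in for an application hotspot:
   the sum of i*i for i = 1..n. */
double sum_squares(long n)
{
    double total = 0.0;
    for (long i = 1; i <= n; i++)
        total += (double)i * (double)i;
    return total;
}

/* Returns the average CPU time, in seconds, of one call to sum_squares(n),
   measured over `repeats` calls with the standard clock() function. */
double time_kernel(long n, int repeats)
{
    volatile double sink = 0.0;   /* keeps the optimizer from deleting the calls */
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        sink += sum_squares(n);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}
```

Building the same source with /Od (-O0), /O1 (-O1), /O2 (-O2), and /O3 (-O3) in turn and comparing the values returned by time_kernel gives a first indication of which level suits the code. Real applications should be timed on representative workloads rather than on a toy kernel like this one.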
Intel 64 = Intel Processors with Extended Memory 64 Technology [EM64T]
IA-64 = Intel Itanium Processors
Parallel Performance
For systems with Hyper-Threading Technology, multi-core, and/or multiple processors, Intel compilers support development of multi-threaded applications through two mechanisms, /Qparallel (-parallel) or /Qopenmp (-openmp). If you are using Intel Thread Profiler and Intel Thread Checker to tune your threaded application, use /Qtcheck (-tcheck) to enable source instrumentation for Intel Thread Checker and /Qtprofile (-tprofile) to enable source instrumentation for Intel Thread Profiler.
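As an illustration of the OpenMP mechanism, the sketch below marks a reduction loop with a standard OpenMP directive. It is an assumed example rather than code from the Intel documentation, and the function name dot is hypothetical.

```c
/* Dot product of two arrays. The OpenMP directive asks the compiler to
   split the loop iterations across threads and to combine the per-thread
   partial sums. When OpenMP is not enabled, for example when building
   without /Qopenmp (-openmp), the directive is ignored and the loop
   runs serially with the same result. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the directive is only a hint to an OpenMP-aware compiler, the same source can be built serially first to check correctness and then rebuilt with /Qopenmp (-openmp) to measure the threaded version.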
/O1 (-O1)
/O3 (-O3)
/debug:full (-debug full)
Parallel Performance

Windows* (Linux*, Mac OS*)
    Comment

/Qopenmp (-openmp)
    Enables the parallelizer to generate multi-threaded code based on the OpenMP* directives.

/Qopenmp-report{0|1|2} (-openmp-report{0|1|2})
    Controls the OpenMP parallelizer's diagnostic levels.

/Qparallel (-parallel)
    Detects simply structured loops capable of being executed safely in parallel and automatically generates multi-threaded code for these loops.

/Qpar-report{0|1|2|3} (-par-report{0|1|2|3})
    Controls the auto-parallelizer's diagnostic levels as follows:
    0  Displays no diagnostic information.
    1  Indicates loops successfully parallelized (default).
    2  Adds information on loops that were not parallelized.
    3  Adds information about any proven or assumed dependencies inhibiting auto-parallelization (reasons for not parallelizing).

/Qpar-threshold[n] (-par-threshold[n])
    Sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel, n=0 to 100. Default: n=100.
    0    Parallelize loops regardless of computation work volume.
    100  Parallelize loops only if profitable parallel execution is almost certain.
    Must be used in conjunction with /Qparallel (-parallel).

/Qtprofile (-tprofile)
    Enables source instrumentation to capture information about the structure of threaded applications for use in tuning them to maximize performance. This option creates a binary which will generate results that can be viewed with Intel Thread Profiler.

/Qtcheck (-tcheck)
    Enables source instrumentation to capture information for diagnosing threading errors in threaded applications. This option creates a binary which will generate diagnostics that can be viewed with Intel Thread Checker.

/Qopt-mem-bandwidth{0|1|2} (-opt-mem-bandwidth{0|1|2})
    Restricts certain optimizations that may increase memory bandwidth requirements.
    /Qopt-mem-bandwidth0 (-opt-mem-bandwidth0): no restriction (default for serial compilation).
    /Qopt-mem-bandwidth1 (-opt-mem-bandwidth1): restricts optimizations for loops in OpenMP parallel regions (default with /Qparallel (-parallel) or /Qopenmp (-openmp)).
    /Qopt-mem-bandwidth2 (-opt-mem-bandwidth2): restricts optimizations for all loops. May be useful for MPI or other parallel applications.
    Note: For Mac OS*, this option is not supported.
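To make the auto-parallelizer descriptions concrete, the sketch below contrasts two loops. It is an illustration under stated assumptions, not compiler output: the function names are hypothetical, and whether a given compiler version actually parallelizes the first loop depends on its own analysis and on the threshold setting.

```c
/* Independent iterations: each pass writes only a[i] and reads only b[i]
   and c[i]. The restrict qualifiers state that the arrays do not overlap.
   This is the kind of simply structured loop that /Qparallel (-parallel)
   is described as looking for. */
void add_arrays(double *restrict a, const double *restrict b,
                const double *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependency: iteration i reads a[i - 1], which iteration
   i - 1 has just written. Running iterations in parallel as written would
   change the result, so a report at level 3 would be expected to list a
   dependency as the reason for not parallelizing. */
void running_sum(double *a, const double *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```

Compiling such a file with /Qparallel /Qpar-report3 (-parallel -par-report3) is one way to see which loops the compiler accepted and why it rejected the others.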
/G2 (-mtune=itanium2)
    Targets optimization for the Intel Itanium 2 processor. Generated code is also compatible with the older IA-64 processor (default).

/G2-p9000 (-mtune=itanium2-p9000)
    Targets optimizations for Dual-Core Intel Itanium 2 9000 Sequence processors. Generated code is also compatible with all IA-64 processors, unless the user program calls intrinsic functions specific to the Dual-Core Intel Itanium 2 9000 Sequence processors.

/QIPF-fma[-] (-IPF-fma[-])
    Enables [disables] the combining of floating-point multiply operations and add/subtract operations. (Enabled by default)

/Qivdep-parallel (-ivdep-parallel)
    Indicates that there is no forward or backward loop-carried memory dependency in the loop where the IVDEP directive is specified. Typically used in conjunction with /Qparallel (-parallel).

/Qprefetch[-] (-prefetch[-])
    Enables or disables prefetch insertion.
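The IVDEP directive mentioned above is written as #pragma ivdep in C and C++ source for the Intel compilers. The sketch below shows where it would go; the function name shift_add_one and the precondition on k are assumptions made for this example, and compilers that do not recognize the pragma simply ignore it.

```c
/* Copies a[i + k] + 1.0 into a[i] for i = 0..n-1.
   ASSUMPTION (must be guaranteed by the caller): k >= n, and the array
   holds at least n + k elements. Under that assumption the elements read
   (a[k] .. a[k + n - 1]) never overlap the elements written
   (a[0] .. a[n - 1]), so there is no loop-carried memory dependency.
   The compiler cannot know the value of k, so without the pragma it has
   to assume that a dependency may exist. */
void shift_add_one(double *a, int n, int k)
{
    int i;
#pragma ivdep
    for (i = 0; i < n; i++)
        a[i] = a[i + k] + 1.0;
}
```

The pragma is a promise from the programmer, not a check: if the caller breaks the stated precondition, the compiler is free to generate code that produces wrong results.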
/Qip (-ip)
    Single file optimization. Interprocedural optimizations, including selective inlining, within the current source file. Caution: For large files, this option may sometimes significantly increase compile time and code size.

/Qipo[value] (-ipo[value])
    Permits inlining and other interprocedural optimizations among multiple source files. The optional value argument controls the maximum number of link-time compilations (or number of object files) spawned. Default for value is 0 (the compiler chooses). Caution: This option can in some cases significantly increase compile time and code size.

/Qipo-jobs[n] (-ipo-jobs[n])
    Specifies the number of commands (jobs) to be executed simultaneously during the link phase of Interprocedural Optimization (IPO). The default is 1 job.

/Ob2 (-finline-functions -finlinelevel=2)
    This option enables function inlining within the current source file at the compiler's discretion. This option is enabled by default at /O2 and /O3 (-O2 and -O3). Caution: For large files, this option may sometimes significantly increase compile time and code size. It can be disabled by /Ob0 (-fno-inline-functions on Linux* and Mac OS*).

/Qinline-factor=n (-inline-factor=n)
    This option scales the total and maximum sizes of functions that can be inlined. The default value of n is 100, i.e., 100% or a scale factor of one.

/Qprof-gen (-prof-gen)
    Instruments a program for profiling.

/Qprof-use (-prof-use)
    Enables the use of profiling information during optimization.

/Qprof-dir dir (-prof-dir dir)
    Specifies a directory for the profiling output files, *.dyn and *.dpi.
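The inlining options above act on code shaped like the following sketch, where a small helper is called once per loop iteration from the same source file. The names clamp and clamp_all are hypothetical, and whether the call is actually inlined remains at the compiler's discretion.

```c
/* Small helper defined in the same source file as its caller: a typical
   candidate for single-file inlining. */
static double clamp(double v, double lo, double hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* Calls clamp once per element. If the compiler inlines the call, the
   loop body no longer pays function call overhead on every iteration. */
void clamp_all(double *a, int n, double lo, double hi)
{
    for (int i = 0; i < n; i++)
        a[i] = clamp(a[i], lo, hi);
}
```

If clamp lived in a different source file, single-file inlining could not see its body, which is the situation that the multi-file option /Qipo (-ipo) is described as addressing.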
/fp:name (-fp-model name)
    This method of controlling the consistency of floating point results by restricting certain optimizations is recommended in preference to the /Op (-mp) and /Qprec (-mp1) switches. The possible values of name are:
    precise  Enables only value-safe optimizations on floating point code.
    double/extended/source  Implies precise and causes intermediates to be computed in double, extended, or source precision. The double and extended options are not available for Intel Fortran.
    fast=[1|2]  Allows more aggressive optimizations at a slight cost in accuracy or consistency (fast=1 is the default).
    except  Enables floating point exception semantics.
    strict  Strictest mode of operation; enables both the precise and except options and disables fma contractions.
    Recommendation: When enhanced floating point consistency and reproducibility are needed, /fp:source (-fp-model source) is the recommended form for the majority of situations on IA-64 processors, on processors supporting Intel 64, and on IA-32 when SSE instructions are enabled with /QxW (-xW) or higher.

/Qfp-speculation mode (-fp-speculation mode)
    Enables floating-point speculations with one of the following modes:
    fast    Speculate floating-point operations (default).
    off     Disables speculation of floating-point operations.
    safe    Do not speculate if this could expose a floating-point exception.
    strict  This is the same as specifying off.

/Qftz[-] (-ftz[-])
    When the main program or DLL main is compiled with this option, denormal results are flushed to zero for the whole program (DLL). Setting this option does not guarantee that all denormals in a program are flushed to zero. It only causes denormals generated at run time to be flushed to zero. On IA-64-based systems, the default is off except at /O3 (-O3). On IA-32-based systems and Intel 64-based systems, the default is on except at /Od (-O0), but only denormals resulting from SSE instructions are flushed to zero.
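Two of the effects these floating-point options control can be seen in a few lines of C. The sketch below is an illustration only, with hypothetical function names: it pins the evaluation order by hand to show why reordering a sum is not value-safe, and it produces a denormal value of the kind that flush-to-zero would replace with zero.

```c
#include <float.h>

/* Sums three values left to right: (a + b) + c. The volatile temporary
   pins the evaluation order regardless of optimization settings. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Sums the same three values right to left: a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}

/* Returns 1 if halving the smallest normal double yields a nonzero
   denormal value, and 0 if the result came back as zero (as it would
   if denormal results were being flushed to zero). */
int halving_min_is_denormal(void)
{
    volatile double smallest_normal = DBL_MIN;
    volatile double half = smallest_normal / 2.0;
    return half > 0.0 && half < DBL_MIN;
}
```

With IEEE double precision, sum_left(1e20, -1e20, 1.0) returns 1.0 while sum_right(1e20, -1e20, 1.0) returns 0.0, because 1.0 is lost when added to -1e20. A compiler that is allowed to reassociate the sum may therefore change the result, which is the kind of transformation the precise model is described as excluding.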
-fnoexceptions
Recommended Processor-Specific Optimization Options for IA-32 and Intel 64 Architectures

/QxS (-xS)
    A future Intel processor that supports SSE4 Vectorizing Compiler and Media Accelerators

/QxT (-xT)
    Intel Core2 Extreme processor
    Intel Core2 Duo processor
    Dual-Core Intel Xeon 5300, 5100, and 3000 series processors
    Quad-Core Intel Xeon processors

/QxP, /QaxP (-xP, -axP)
    Intel Core Duo, Intel Core Solo processor
    Intel Pentium 4 processor with Streaming SIMD Extension 3 (SSE3) instruction support
    Intel Pentium D processor
    Intel Xeon processor (only on processors that support SSE3)
    Intel Pentium dual-core processor T2060
    Intel Pentium Extreme Edition processor
    Dual-Core Intel Xeon 7000, 5000, and 3200 Sequence processors
    Dual-Core Intel Xeon ULV and LV processor
    Dual-Core Intel Xeon 2.8 processor

/QxO (-xO)
    Intel processor-based systems supporting SSE2 and SSE*
    Non-Intel processor-based systems supporting SSE3, SSE2, and SSE*, such as AMD processors

/QxN, /QaxN (-xN, -axN)
    Intel Pentium 4 processor
    Intel Pentium M processor
    Intel Xeon processors without SSE3 support (IA-32 only)

/QxW, /QaxW (-xW, -axW)
    Intel processor-based systems supporting SSE*
    Non-Intel processor-based systems supporting SSE2 and SSE*, such as AMD processors

/QxK, /QaxK (-xK, -axK)
    Intel Pentium III processors
    Intel Pentium III Xeon processors
    Non-Intel x86 processor-based systems supporting SSE*, such as AMD processors

/G2 (-mtune=itanium2)
    Intel Itanium 2 processor

/G2-p9000 (-mtune=itanium2-p9000)
    Dual-core Intel Itanium 2 9000 Sequence processors
* The option values O, W, and K produce binaries that should run on processors not made by Intel such as AMD processors that implement the same capabilities as the corresponding Intel processors. P and N option values perform additional optimizations that are not enabled with option values O and W.
For product and purchase information, visit the Intel Software Development Products site at: www.intel.com/software/products/compilers
Intel, the Intel logo, Itanium, Pentium, Intel Centrino, Intel Xeon, Intel XScale, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. * Other names and brands may be claimed as the property of others.
0306/DAM/OMD/PP/3000
254349-006