
Intel® Integrated Performance Primitives (Intel® IPP) Functions Optimized for Intel® Advanced Vector Extensions (Intel® AVX)

  • The tables below list the Intel® AVX support provided in the Intel® IPP 7.0.2 library release.
  • Optimized Intel® AVX code is provided in both the 32-bit and 64-bit versions of the 7.0 library.
  • The 6.1 library contains only very limited Intel® AVX support; if you intend to use Intel® IPP on an Intel® AVX platform, you should upgrade to the Intel® IPP 7.0 library.

Intel® AVX (Intel® Advanced Vector Extensions) is a set of 256-bit extensions to SSE that provides higher performance for floating-point-intensive applications. Intel® AVX adds new functionality to the existing Intel® SIMD instruction set (based on SSE) and uses a more compact SIMD encoding format. A large number (more than 200) of the Intel SSEx instructions have been upgraded to take advantage of features such as distinct destination operands and flexible memory alignment. About 100 of the original 128-bit Intel SSEx instructions can now operate on 256-bit vector data, and approximately 100 entirely new data-manipulation and arithmetic operations have been added that are not part of the legacy Intel SSEx SIMD instruction set.

The key benefits of Intel® AVX are:

  • Support for wider vector data (up to 256 bits).
  • An efficient instruction-encoding scheme that supports three- and four-operand instruction syntax.
  • A flexible programming environment, from branch handling to relaxed memory-alignment requirements.
  • New data-manipulation and arithmetic-computation primitives, including broadcast, permute, and fused multiply-add.



ippGetCpuFeatures() reports information about the SIMD features available on your processor. In addition, ippGetCpuType() detects the processor type in your system; a return value of ippCpuAVX indicates that your processor supports the Intel® AVX instruction set. These functions are declared in ippcore.h.

Mask the value returned by ippGetCpuFeatures() with ippCPUID_AVX (0x0100) to determine whether the Intel® AVX SIMD instruction set is supported by your processor (ippGetCpuFeatures() AND ippCPUID_AVX is TRUE). To determine whether your operating system also supports the Intel® AVX instructions (that is, saves the extended SIMD registers), mask the value returned by ippGetCpuFeatures() with ippAVX_ENABLEDBYOS (0x0200). Both conditions (CPU support and OS support) must be met before your application uses Intel® AVX SIMD instructions.
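A minimal sketch of this check is shown below; the exact ippGetCpuFeatures() prototype (and whether a NULL second argument is accepted) should be verified against the ippcore.h shipped with your Intel® IPP version.

#include <stdio.h>
#include "ippcore.h"

int main(void)
{
    Ipp64u features = 0;

    /* Query the CPU feature mask; NULL for the CPUID register output is assumed acceptable here. */
    if (ippGetCpuFeatures(&features, NULL) != ippStsNoErr)
        return -1;

    int cpuHasAvx  = (features & ippCPUID_AVX)       != 0;  /* 0x0100: CPU supports Intel AVX     */
    int osSavesYmm = (features & ippAVX_ENABLEDBYOS) != 0;  /* 0x0200: OS saves the YMM registers */

    if (cpuHasAvx && osSavesYmm)
        printf("Intel AVX can be used: supported by both the CPU and the OS.\n");
    else
        printf("Intel AVX cannot be used on this system.\n");
    return 0;
}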



The Intel® IPP library is optimized for a variety of SIMD instruction sets. Automatic "dispatching" detects the SIMD instruction sets available on the running processor and selects the optimal SIMD instructions for that processor. For more information about dispatching, see Understanding CPU Dispatching in the Intel® IPP Library.

The Intel® AVX optimizations in the Intel® IPP library include both "hand-optimized" and "compiler-tuned" functions, that is, code optimized directly for the Intel® AVX instruction set. Because of the large number of primitives in the Intel® IPP library, it is not possible to directly optimize every Intel® IPP function for the many new instructions represented by the Intel® AVX instruction set within a single product release or update (processor-specific optimization may also have to account for cache sizes and the number of cores/threads). The functions in the tables below are therefore those that gain the most from the new Intel® AVX instructions, or those most widely used by Intel® IPP customers.

If specific Intel® IPP functions you use are not listed in the tables below and you would like them added to the priority list for future AVX optimization, please create a thread on the IPP forum describing which functions you would like added to the AVX optimization priority list.

As functions directly optimized for Intel® AVX are added in each new release or update of the library, they will be added to the tables below.

The following conventions are used in the tables to indicate multiple similar functions on a single line:

  • {x} – braces enclose required elements of a function name.
  • [x] – square brackets enclose optional elements of a function name.
  • | – a vertical bar indicates a mutually exclusive choice within a set of optional or required elements.
  • {x|y|z} – example of three mutually exclusive choices within a required element of a function name.
  • [x|y|z] – example of three mutually exclusive choices within an optional element of a function name.

Signal Processing

ippsAbs_{16s|32s|32f|64f}[_I] 
ippsAdd_{32f|32fc|64f|64fc}[_I] 
ippsAddC_{32f|64f}[_I] 
ippsAddProductC_32f 
ippsAddProduct_{32fc|64f|64fc} 
ippsAutoCorr_{32f|64f}
ippsConv_32f 
ippsConvert_{8s|8u|16s|16u|32s|64f}32f 
ippsConvert_{32s|32f}64f 
ippsConvert_32f{8s|8u|16s|16u}_Sfs 
ippsConvert_64f32s_Sfs 
ippsCopy_{16s|32s|32f|64f} 
ippsCrossCorr_{32f|64f} 
ippsDFTFwd_CToC_{32f|32fc|64f|64fc} 
ippsDFTFwd_RTo{CCS|Pack|Perm}_{32f|64f} 
ippsDFTInv_CCSToR_{32f|64f} 
ippsDFTInv_CToC_{32f|32fc|64f|64fc} 
ippsDFTInv_{Pack|Perm}ToR_{32f|64f} 
ippsDFTOutOrd{Fwd|Inv}_CToC_{32fc|64fc} 
ippsDiv[C]_32f[_I] 
ippsDotProd_32f64f 
ippsFFTFwd_CToC_{32f|32fc|64f|64fc}[_I] 
ippsFFTFwd_RTo{CCS|Pack|Perm}_{32f|64f}[_I] 
ippsFFTInv_CCSToR_{32f|64f}[_I] 
ippsFFTInv_CToC_{32f|32fc|64f|64fc}[_I] 
ippsFFTInv_{Pack|Perm}ToR_{32f|64f}[_I] 
ippsFIR64f_32f[_I] 
ippsFIR64fc_32fc[_I] 
ippsFIRLMS_32f 
ippsFIR_{32f|32fc|64f|64fc}[_I] 
ippsIIR32fc_16sc_[I]Sfs 
ippsIIR64fc_32fc[_I] 
ippsIIR_32f[_I] 
ippsLShiftC_16s_I 
ippsMagnitude_16sc_Sfs 
ipps{Min|Max}Indx_{32f|64f} 
ippsMul_32fc[_I] 
ippsMul[C]_{32f|32fc|64f|64fc}[_I] 
ippsMulC_64f64s_ISfs 
ipps{Not|Or}_8u 
ippsPhase_{16s|16sc|32sc}_Sfs 
ippsPowerSpectr_{32f|32fc} 
ippsRShiftC_16u_I 
ippsSet_{8u|16s|32s} 
ippsSqr_{8u|16s|16u|16sc}_[I]Sfs 
ippsSqr_{32f|32fc|64f|64fc}[_I] 
ippsSqrt_32f[_I] 
ippsSub_{32f|32fc|64f|64fc}[_I] 
ippsSubC_{32f|32fc|64f|64fc}[_I] 
ippsSubCRev_{32f|32fc|64f|64fc}[_I] 
ippsSum_{32f|64f} 
ippsThreshold_{32f|GT_32f|LT_32f}_[_I] 
ippsThreshold_{GT|LT}Abs_{32f|64f}[_I] 
ippsThreshold_GTVal_32f[_I] 
ippsWinBartlett_{32f|32fc|64f|64fc}[_I] 
ippsWinBlackman_{32f|64f|64fc}[_I] 
ippsWinBlackmanOpt_{32f|64f|64fc}[_I] 
ippsWinBlackmanStd_{32f|64f|64fc}[_I] 
ippsWinKaiser_{32f|64f|64fc}[_I] 
ippsZero_{8u|16s|32f}

 

SPIRAL (GEN) Functions

ippgDFTFwd_CToC_8_64fc ippgDFTFwd_CToC_12_64fc 
ippgDFTFwd_CToC_16_{32fc|64fc}
ippgDFTFwd_CToC_20_64fc
ippgDFTFwd_CToC_24_64fc
ippgDFTFwd_CToC_28_64fc 
ippgDFTFwd_CToC_32_{32fc|64fc}
ippgDFTFwd_CToC_36_64fc
ippgDFTFwd_CToC_40_64fc
ippgDFTFwd_CToC_44_64fc 
ippgDFTFwd_CToC_48_{32fc|64fc}
ippgDFTFwd_CToC_52_64fc 
ippgDFTFwd_CToC_56_64fc 
ippgDFTFwd_CToC_60_64fc 
ippgDFTFwd_CToC_64_{32fc|64fc} 
ippgDFTInv_CToC_8_64fc 
ippgDFTInv_CToC_12_64fc 
ippgDFTInv_CToC_16_{32fc|64fc} 
ippgDFTInv_CToC_20_64fc 
ippgDFTInv_CToC_24_64fc 
ippgDFTInv_CToC_28_64fc 
ippgDFTInv_CToC_32_{32fc|64fc} 
ippgDFTInv_CToC_36_64fc 
ippgDFTInv_CToC_40_64fc 
ippgDFTInv_CToC_44_64fc 
ippgDFTInv_CToC_48_{32fc|64fc} 
ippgDFTInv_CToC_52_64fc 
ippgDFTInv_CToC_56_64fc 
ippgDFTInv_CToC_60_64fc 
ippgDFTInv_CToC_64_{32fc|64fc}

 

Audio Coding

ippsDeinterleave_32f

 

Speech Coding

ippsAdaptiveCodebookSearch_RTA_32f
ippsFixedCodebookSearch_RTA_32f
ippsFixedCodebookSearchRandom_RTA_32f
ippsHighPassFilter_RTA_32f
ippsLSPQuant_RTA_32f
ippsLSPToLPC_RTA_32f
ippsPostFilter_RTA_32f_I
ippsQMFDecode_RTA_32f
ippsSynthesisFilter_G729_32f

 

Color Conversion

ippiRGBToHLS_8u_AC4R
ippiRGBToHLS_8u_C3R

 

Realistic Rendering

ipprCastEye_32f
ipprCastShadowSO_32f
ipprDot_32f_P3C1M
ipprHitPoint3DEpsM0_32f_M
ipprHitPoint3DEpsS0_32f_M
ipprMul_32f_C1P3IM

 

Computer Vision

ippiEigenValsVecs_[8u]32f_C1R 
ippiFilterGaussBorder_32f_C1R 
ippiMinEigenVal_[8u]32f_C1R 
ippiNorm_Inf_{8u|8s|16u|32f}_C{1|3C}MR 
ippiNorm_L1_{8u|8s|16u|32f}_C{1|3C}MR 
ippiNorm_L2_{8u|8s|16u|32f}_C{1|3C}MR 
ippiNormRel_L2_32f_C3CMR 
ippiUpdateMotionHistory_[8u|16u]32f_C1IR

 

Image Processing

ippiAddC_32f_C1[I]R 
ippiConvert_32f* 
ippiCopy_16s* 
ippiCopy_8u* 
ippiConvFull_32f_{AC4|C1|C3}R 
ippiConvValid_32f_{AC4|C1|C3}R 
ippiCrossCorrFull_NormLevel_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_64f_C1R 
ippiCrossCorrFull_NormLevel_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrFull_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrSame_NormLevel_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrSame_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrValid_{8u32f|8s32f|16u32f|32f}_C1R 
ippiCrossCorrValid_NormLevel_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_64f_C1R 
ippiCrossCorrValid_NormLevel_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrValid_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiDCT8x8FwdLS_8u16s_C1R 
ippiDCT8x8Fwd_16s_C1[I|R] 
ippiDCT8x8Fwd_32f_C1[I] 
ippiDCT8x8Fwd_8u16s_C1R 
ippiDCT8x8InvLSClip_16s8u_C1R 
ippiDCT8x8Inv_16s8u_C1R 
ippiDCT8x8Inv_16s_C1[I|R] 
ippiDCT8x8Inv_2x2_16s_C1[I] 
ippiDCT8x8Inv_32f_C1[I] 
ippiDCT8x8Inv_4x4_16s_C1[I] 
ippiDCT8x8Inv_A10_16s_C1[I] 
ippiDCT8x8To2x2Inv_16s_C1[I] 
ippiDCT8x8To4x4Inv_16s_C1[I] 
ippiDFTFwd_CToC_32fc_C1[I]R 
ippiDFTFwd_RToPack_32f_{AC4|C1|C3|C4}[I]R 
ippiDFTFwd_RToPack_8u32s_{AC4|C1|C3|C4}RSfs 
ippiDFTInv_CToC_32fc_C1[I]R 
ippiDFTInv_PackToR_32f_{AC4|C1|C3|C4}[I]R 
ippiDFTInv_PackToR_32s8u_{AC4|C1|C3|C4}RSfs 
ippiDilate3x3_32f_C1[I]R 
ippiDilate3x3_64f_C1R 
ippiDivC_32f_C1[I]R 
ippiDiv_32f_{C1|C3}[I]R 
ippiDotProd_32f64f_{C1|C3}R 
ippiErode3x3_64f_C1R 
ippiFFTFwd_CToC_32fc_C1[I]R 
ippiFFTFwd_RToPack_32f_{AC4|C1|C3|C4}[I]R 
ippiFFTFwd_RToPack_8u32s_{AC4|C1|C3|C4}RSfs 
ippiFFTInv_CToC_32fc_C1[I]R 
ippiFFTInv_PackToR_32f_{AC4|C1|C3|C4}[I]R 
ippiFFTInv_PackToR_32s8u_{AC4|C1|C3|C4}RSfs 
ippiFilter_32f_{C1|C3|C4}R 
ippiFilter_32f_AC4R 
ippiFilter_64f_{C1|C3}R 
ippiFilter32f_{8s|8u|16s|16u|32s}_C{1|3|4}R 
ippiFilter32f_{8u|16s|16u}_AC4R 
ippiFilter32f_{8s|8u}16s_C{1|3|4}R 
ippiFilterBox_8u_{C1|C3}R 
ippiFilterBox_32f_{C1|C4|AC4}R 
ippiFilterColumn32f_{8u|16s|16u}_{C1|C3|C4|AC4}R 
ippiFilterColumn_32f_{C1|C3|C4|AC4}R 
ippiFilterGauss_32f_{C1|C3}R 
ippiFilterHipass_32f_{C1|C3|C4|AC4}R 
ippiFilterLaplace_32f_{C1|C3|C4|AC4}R 
ippiFilterLowpass_32f_{C1|C3|AC4}R 
ippiFilterMax_32f_{C1|C3|C4|AC4}R 
ippiFilterMedian_32f_C1R 
ippiFilterMin_32f_{C1|C3|C4|AC4}R 
ippiFilterRow_32f_{C1|C3|C4|AC4}R 
ippiFilterRow32f_{8u|16s|16u}_{C1|C3|C4|AC4}R 
ippiFilterSobelHoriz_32f_{C1|C3}R 
ippiFilterSobelVert_32f_{C1|C3}R 
ippiMean_32f_{C1|C3}R 
ippiMulC_32f_C1[I]R 
ippiMul_32f_{C1|C3|C4}[I]R 
ippiResizeSqrPixel_{32f|64f}_{C1|C3|C4|AC4}R 
ippiResizeSqrPixel_{32f|64f}_{P3|P4}R 
ippiSqrDistanceFull_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiSqrDistanceSame_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiSqrDistanceValid_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiSqrt_32f_C1R 
ippiSqrt_32f_C3IR 
ippiSubC_32f_C1[I]R 
ippiSub_32f_{C1|C3|C4}[I]R 
ippiSum_32f_C{1|3}R 
ippiTranspose_32f_C1R

 

Image Compression

ippiPCTFwd_JPEGXR_32f_C1IR 
ippiPCTFwd16x16_JPEGXR_32f_C1IR 
ippiPCTFwd8x16_JPEGXR_32f_C1IR 
ippiPCTFwd8x8_JPEGXR_32f_C1IR 
ippiPCTInv_JPEGXR_32f_C1IR_128 
ippiPCTInv16x16_JPEGXR_32f_C1IR 
ippiPCTInv8x16_JPEGXR_32f_C1IR 
ippiPCTInv8x8_JPEGXR_32f_C1IR

 

Those functions that are not directly optimized for AVX (that is, functions that do not appear in the tables above) have been compiled with the Intel compiler's AVX ("xG") switch, which enables AVX optimizations. Additional performance is gained by following the AVX ABI (Application Binary Interface) convention, which inserts the AVX-specific "vzeroupper" instruction after any function containing AVX code, to avoid AVX transition penalties.

For functions that are not directly optimized for AVX, the g9/e9 libraries reuse optimizations from earlier compatible SSE-optimized functions, such as the optimizations tuned for the p8/y8 libraries and previous SIMD optimizations (e.g., SSE4.x, AES-NI, and SSE2/3). Functions not listed in the tables above therefore contain the highest applicable level of directly optimized code, based on the AES-NI, SSE4.x, SSSE3, SSE3, and SSE2 SIMD instruction sets, as appropriate.

For more information about the g9/e9 optimization layers and Intel AVX in the Intel® IPP library, see the Intel® Integrated Performance Primitives for Windows* OS on Intel® 64 Architecture User's Guide.

For more information, see "How to Compile for Intel® AVX," and visit the Intel Parallel Studio web site to learn more about the tools available for developing, debugging, and optimizing multithreaded applications.

Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel® Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101


Intel® AVX Optimized Code in Intel® MKL


Intel® AVX (Intel® Advanced Vector Extensions) is the next step in the evolution of Intel processors. Intel® MKL has included Intel® AVX optimized code since Intel® MKL 10.2; however, to activate the Intel® AVX code in version 10.2, users had to call mkl_enable_instructions(). Starting with Intel® MKL 10.3, the Intel® AVX code is dispatched automatically and no longer needs to be activated explicitly. In Intel® MKL 10.3, the Intel® AVX optimizations have been extended to DGEMM/SGEMM, radix-2 complex-to-complex FFTs, most of the real-valued VML functions, and the VSL distribution generators.
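Because dispatching is automatic in Intel® MKL 10.3, no AVX-specific code is needed in the calling application. The sketch below (matrix size and contents are illustrative) shows an ordinary CBLAS DGEMM call; on an AVX-capable processor and operating system, MKL 10.3 runs its AVX-optimized kernel for this call.

#include "mkl.h"

int main(void)
{
    const MKL_INT n = 1024;   /* illustrative problem size */
    double *A = (double*)mkl_malloc(n * n * sizeof(double), 64);
    double *B = (double*)mkl_malloc(n * n * sizeof(double), 64);
    double *C = (double*)mkl_malloc(n * n * sizeof(double), 64);
    for (MKL_INT i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0*A*B + 0.0*C; the best code path (including Intel AVX) is selected at run time. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}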

With Intel® MKL 10.3, processors that support Intel® AVX, running an operating system that supports Intel® AVX, achieve the following speedups relative to the Nehalem (NHM) architecture:

Intel® AVX DGEMM (M, N, K = 8K x 4K x 128) runs 1.8x faster than on NHM, and Intel® AVX DGEMM/SGEMM reaches 88-90% of machine peak.

Radix-2 1D complex FFT (N = 1024) runs 1.8x faster with Intel® AVX than on NHM.

The Intel® Optimized LINPACK benchmark with Intel® AVX optimized code runs 1.86x faster on 4 cores (N = 20000), for an overall efficiency above 80%.


Using Intel® AVX Without Writing AVX


Download Article

Download Using Intel® AVX Without Writing AVX [PDF 326KB]

Introduction and Tools

Intel® Advanced Vector Extensions (Intel® AVX) is a 256-bit extension of the Intel® Streaming SIMD Extensions (Intel® SSE) designed for floating-point-intensive applications. Intel® SSE and Intel® AVX are both examples of Single Instruction Multiple Data (SIMD) instruction sets. Intel® AVX was released as part of the 2nd generation Intel® Core™ processor family. Intel® AVX delivers a significant performance boost through wider 256-bit vectors, a new extensible instruction format (Vector Extension, or VEX), and rich functionality. The instruction set architecture supports three operands, which improves instruction programming flexibility and allows non-destructive source operands. Legacy 128-bit SIMD instructions have also been extended to support three operands and the new instruction-encoding format (VEX). An instruction-encoding format describes the way opcodes and prefixes are used to express higher-level instructions in a form the processor can understand. The result is better handling of data and of general applications such as image, audio/video processing, scientific simulation, financial analytics, and 3D modeling and analysis.

This article discusses the ways developers can integrate Intel® AVX into applications without explicitly coding in low-level assembly language. For C/C++ developers, the most direct way to access Intel® AVX is through C-compatible intrinsics. These intrinsic functions provide access to the Intel® AVX instruction set and to higher-level math functions in the Intel® Short Vector Math Library (SVML). They are declared in the immintrin.h and ia32intrin.h header files, respectively. Application programmers can also use Intel® AVX in other ways without adding any Intel® AVX instructions to their source code. This article surveys these methods using Intel® C++ Composer XE 2011 targeting execution on a Sandy Bridge system. Intel® C++ Composer XE is supported on Linux*, Windows*, and Mac OS* X platforms; this article uses the command-line switches for the Windows* platform. Table 1 – Summary of Intel® Compiler Command-Line Switches, at the end of this article, lists the switches for each platform.
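For reference, this is roughly what the direct intrinsics approach looks like; the sketch below (the function name, array names, and sizes are illustrative) uses _mm256_add_ps from immintrin.h. The rest of this article shows how to get equivalent Intel® AVX code generated without writing such intrinsics by hand.

#include <immintrin.h>

// Adds two float arrays eight elements at a time using 256-bit Intel AVX intrinsics.
// Assumes n is a multiple of 8 and that the pointers are 32-byte aligned.
void addArraysAvx(const float *b, const float *c, float *a, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 vb = _mm256_load_ps(&b[i]);              // load 8 floats from b
        __m256 vc = _mm256_load_ps(&c[i]);              // load 8 floats from c
        _mm256_store_ps(&a[i], _mm256_add_ps(vb, vc));  // a[i..i+7] = b + c
    }
}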


A Brief Review of Single Instruction Multiple Data (SIMD) Concepts

A processor with vector or SIMD support can perform one operation on multiple data operands simultaneously in a single instruction. The process of performing an operation on one number by another number to produce a single result is known as a scalar process. Performing the operation on N numbers simultaneously to produce N results (N > 1) is known as a vector process. The technology is supported on Intel processors and on compatible non-Intel processors that support the SIMD or AVX instructions. The process of converting an algorithm from scalar to vector form is called vectorization.


A generic multiplyValues example helps illustrate the difference between a scalar process and a vector process (using Intel® AVX).

void multiplyValues(float *a, float *b, float *c, int size)
{
	int i;
	for (i = 0; i < size; i++) {
		a[i] = b[i] * c[i];
	}
}




Recompile for Intel® AVX

The first method is simply to recompile with the /QaxAVX compiler switch. No source-code changes are required. The Intel® compiler generates the appropriate 128-bit and 256-bit VEX-encoded Intel® AVX instructions. When it is beneficial for performance, the Intel® compiler generates multiple processor-specific, auto-dispatched code paths for Intel processors, and the most appropriate code is executed at run time.
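On Windows* this is a one-line change to the build command (the source file name is illustrative; on Linux* and Mac OS* X the equivalent switch is -axAVX):

    icl /QaxAVX /O2 Multiply.c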


Compiler Auto-Vectorization

Compiling the application with the appropriate architecture switch is the first step toward building an Intel® AVX-ready application. With auto-vectorization, the compiler does most of the vectorization work on behalf of the software developer. Auto-vectorization is an optimization the compiler performs when specific conditions are met, and the Intel® C++ compiler can perform it automatically during code generation. Vectorization is covered in detail in The Intel® C++ Compiler Vectorization Guide. The Intel compiler looks for vectorization opportunities whenever the optimization level is /O2 or higher.

Consider a simple matrix-vector multiplication example, shipped with Intel® C++ Composer XE, that illustrates the concepts of vectorization. The following code fragment is from the matvec function in Multiply.c of the vec_samples archive:

void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[], FTYPE x[])
{
	int i, j;
	for (i = 0; i < size1; i++) {
		b[i] = 0;

		for (j = 0; j < size2; j++) {
			b[i] += a[i][j] * x[j];
		}
	}
}

Without vectorization, the outer loop executes size1 times and the inner loop executes size1*size2 times. After vectorization with the /QaxAVX switch, the inner loop can be unrolled because four multiplies and four adds can be performed in a single instruction per operation. The vectorized loop is much more efficient than the scalar loop. The Intel® AVX advantage also applies to single-precision floating-point numbers, since eight single-precision floating-point operands fit in a ymm register.

A loop must meet certain criteria to be vectorized. The loop trip count must be known when the loop is entered at run time; it can be a variable, but it must be constant while the loop executes. The loop must have a single entry and a single exit, and the exit cannot depend on the input data. There are some branching criteria as well; for example, switch statements are not allowed. If statements are allowed provided they can be implemented as masked assignments. Innermost loops are the most likely candidates for vectorization, and the use of function calls within the loop can limit vectorization. Inlined functions and intrinsic SVML functions increase vectorization opportunities.

It is a good idea to review vectorization information during the implementation and debugging phases of application development. The Intel® compiler provides a vectorization report that helps you understand what was and was not vectorized. The report is enabled via the /Qvec-report=<n> command-line option, where n specifies the level of detail; the detail increases with the value of n. With n=3, dependency information, loops that were vectorized, and loops that were not vectorized are reported. Developers can modify the implementation based on the information in the report; the reason a loop was not vectorized is particularly helpful.
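For example, the following command compiles for Intel® AVX and prints the most detailed vectorization report (the source file name is illustrative):

    icl /QaxAVX /Qvec-report=3 Multiply.c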

Developers have detailed knowledge of their specific applications and can sometimes override the default auto-vectorization behavior. Pragmas provide additional information to assist the auto-vectorization process. Some examples include: always vectorize a loop, assert that data within the loop is aligned, ignore potential data dependencies, and so on. The addFloats example illustrates some important points. You need to inspect the generated assembly-language instructions to see what the compiler generated; the Intel compiler produces an assembly file in the current working directory when the /S command-line option is specified.

void addFloats(float *a, float *b, float *c, float *d, float *e, int n)	{
	int i;
#pragma simd
#pragma vector aligned
	for(i = 0; i < n; ++i)	{
		a[i] = b[i] + c[i] + d[i] + e[i];
	}
}


Note the use of the simd and vector pragmas. They play an important role in achieving the desired 256-bit Intel® AVX vectorization. When addFloats is compiled with the /QaxAVX option but without the lines containing the simd and vector pragmas, the following code is generated.

Without #pragma simd and #pragma vector aligned

.B46.3::
        vmovss    xmm0, DWORD PTR [rdx+r10*4]
        vaddss    xmm1, xmm0, DWORD PTR [r8+r10*4]  
        vaddss    xmm2, xmm1, DWORD PTR [r9+r10*4]  
        vaddss    xmm3, xmm2, DWORD PTR [rax+r10*4]
        vmovss    DWORD PTR [rcx+r10*4], xmm3
        inc       r10
        cmp       r10, r11
        jl        .B46.3


The assembly code shows a scalar version that uses 128-bit VEX-encoded Intel® AVX instructions. The goal is a packed ("vector" is another way of saying it) version using 256-bit Intel® AVX instructions. The ss in vaddss indicates that only a single set of single-precision floating-point operands is added together: a scalar operation. The algorithm will be more efficient if vaddps is used; ps indicates a packed operation on single-precision floating-point operands.

Adding just "#pragma simd" to the code helps generate a packed version using 128-bit Intel® AVX instructions. The compiler also unrolls the loop, reducing the number of instructions executed for the end-of-loop test. Since each instruction still operates on only four operands, there is room for further optimization.

With #pragma simd

.B46.11::
        vmovups   xmm0, XMMWORD PTR [rdx+r10*4]
        vaddps    xmm1, xmm0, XMMWORD PTR [r8+r10*4]
        vaddps    xmm2, xmm1, XMMWORD PTR [r9+r10*4]
        vaddps    xmm3, xmm2, XMMWORD PTR [rax+r10*4]
        vmovups   XMMWORD PTR [rcx+r10*4], xmm3
        vmovups   xmm4, XMMWORD PTR [16+rdx+r10*4]
        vaddps    xmm5, xmm4, XMMWORD PTR [16+r8+r10*4]
        vaddps    xmm0, xmm5, XMMWORD PTR [16+r9+r10*4]
        vaddps    xmm1, xmm0, XMMWORD PTR [16+rax+r10*4]
        vmovups   XMMWORD PTR [16+rcx+r10*4], xmm1
        add       r10, 8
        cmp       r10, rbp
        jb        .B46.11

指定 "pragma vector aligned"有助于编译器针对所有阵列参考使用一致的数据移动指令。使用 "pragma simd"和 "pragma vector aligned."可生成期望的 256 位英特尔® AVX 指令。英特尔® 编译器选择 vmovups,这是因为当访问第二代英特尔®酷睿TM处理器上的一致内存时,不会出现任何问题。

With #pragma simd and #pragma vector aligned

.B46.4::
        vmovups   ymm0, YMMWORD PTR [rdx+rax*4]
        vaddps    ymm1, ymm0, YMMWORD PTR [r8+rax*4]
        vaddps    ymm2, ymm1, YMMWORD PTR [r9+rax*4]
        vaddps    ymm3, ymm2, YMMWORD PTR [rbx+rax*4]
        vmovups   YMMWORD PTR [rcx+rax*4], ymm3
        vmovups   ymm4, YMMWORD PTR [32+rdx+rax*4]
        vaddps    ymm5, ymm4, YMMWORD PTR [32+r8+rax*4]
        vaddps    ymm0, ymm5, YMMWORD PTR [32+r9+rax*4]
        vaddps    ymm1, ymm0, YMMWORD PTR [32+rbx+rax*4]
        vmovups   YMMWORD PTR [32+rcx+rax*4], ymm1
        add       rax, 16
        cmp       rax, r11
        jb        .B46.4

This demonstrates some of the auto-vectorization capabilities of the Intel® compiler. Vectorization can be confirmed through the vectorization report (for example, via the simd pragma's assert clause) or by inspecting the generated assembly-language instructions. Pragmas can provide further guidance to the compiler when developers have detailed knowledge of their application. Please refer to The Intel® C++ Compiler Vectorization Guide for more information on vectorization with the Intel compiler. The Intel® C++ Compiler XE 12.0 User and Reference Guide provides additional information on using vectorization, pragmas, and compiler switches. The Intel compiler can do much of the vectorization work for you, so your application is ready to take advantage of Intel® AVX.


Intel® Cilk™ Plus C/C++ Extensions for Array Notation

The Intel® Cilk™ Plus C/C++ language extensions for array notation are Intel-specific language extensions that apply when an algorithm operates on arrays and does not require a specific order of operations among the array elements. If the algorithm is expressed using array notation and compiled with the AVX switch, the Intel® compiler generates Intel® AVX instructions. The C/C++ language extensions for array notation are intended to let users directly express high-level parallel vector array operations in their programs. This assists the compiler in performing data-dependence analysis, vectorization, and auto-parallelization. From the developer's point of view, the result is more predictable vectorization, improved performance, and better hardware resource utilization. The combination of the C/C++ language extensions for array notation with the other Intel® Cilk™ Plus language extensions simplifies parallel and vectorized application development.

To realize these benefits, a developer writes a standard C/C++ elemental function that expresses the operation in scalar syntax. When called without the C/C++ language extensions for array notation, the elemental function operates on a single element. The elemental function must be declared with "__declspec(vector)" so that it can also be invoked through the C/C++ language extensions for array notation.

The multiplyValues example is shown as an elemental function:

__declspec(vector) float multiplyValues(float b, float c)	
{
	return b*c;
}

The scalar invocation is illustrated by this simple example:

	float a[12], b[12], c[12];
a[j] = multiplyValues(b[j], c[j]);	

In addition, with the C/C++ language extensions for array notation, the function can operate on a whole array or on a section of an array. The section operator describes the portion of the array on which to operate. The syntax is: [<lower bound> : <length> : <stride>]

The lower bound is the starting index in the source array, the length is the length of the resulting array, and the stride expresses the stride through the source array. The stride is optional and defaults to one.

These array-section examples help illustrate the usage:

	float a[12];

	a[:] refers to the entire a array
	a[0:2:3] refers to elements 0 and 3 of array a.  
	a[2:2:3] refers to elements 2 and 5 of array a
	a[2:3:3] refers to elements 2, 5, and 8 of array a

The notation also supports multidimensional arrays.

	float a2d[12][4];

	a2d[:][:] refers to the entire a2d array
	a2d[:][0:2:2] refers to elements 0 and 2 of the columns for all rows of a2d.

With array notation, it is easy to invoke multiplyValues using arrays. The Intel® compiler provides a vectorized version of the function and invokes it as appropriate. Two cases are shown below: the first operates on entire arrays, and the second operates on a subset, or section, of the arrays.

This example invokes the function for the entire arrays:

a[:] = multiplyValues(b[:], c[:]);	

This example invokes the function for a subset (section) of the arrays:

a[0:5] = multiplyValues(b[0:5], c[0:5]);	

The C/C++ language extensions for array notation simplify the development of array-based applications. As with auto-vectorization, you should inspect the instructions generated by the Intel® compiler to confirm that Intel® AVX instructions are being used. Compiling with the /S switch produces the assembly file; searching it for multiplyValues reveals both the scalar and the vectorized versions.

The scalar implementation uses the scalar (ss) form of the VEX-encoded 128-bit Intel® AVX instructions:

        vmulss    xmm0, xmm0, xmm1
        ret  

The vector implementation uses the packed (ps) form of the VEX-encoded 256-bit Intel® AVX instructions:

	  sub       rsp, 40
        vmulps    ymm0, ymm0, ymm1
        mov       QWORD PTR [32+rsp], r13
        lea       r13, QWORD PTR [63+rsp]
        and       r13, -32
        mov       r13, QWORD PTR [32+rsp]
        add       rsp, 40
        ret

These simple examples show how the C/C++ language extensions for array notation take advantage of Intel® AVX without requiring the developer to explicitly use any Intel® AVX instructions. The C/C++ language extensions for array notation can be used with or without elemental functions. The technology provides developers with flexibility and choice while taking advantage of the latest Intel® AVX instruction set architecture. Please refer to the Intel® C++ Compiler XE 12.0 User and Reference Guide for more information on the Intel® Cilk™ Plus C/C++ language extensions for array notation.

Developers can use the compiler to generate Intel® AVX instructions and auto-vectorize their applications, or they can develop with the Intel® Cilk™ Plus C/C++ language extensions for array notation to take advantage of Intel® AVX. There is yet another way for a development team to use Intel® AVX without writing assembly language: the Intel® Integrated Performance Primitives (Intel® IPP) and Intel® Math Kernel Library (Intel® MKL) libraries offer developers many benefits, including support for the latest Intel technologies such as Intel® AVX.


Use the Intel® IPP and Intel® MKL Libraries

With the Intel® Integrated Performance Primitives and the Intel® Math Kernel Library, Intel provides thousands of highly optimized software functions for multimedia, data processing, cryptography, and communications applications. These thread-safe libraries support multiple operating systems, and the fastest code runs on each given platform. They offer an easy way to add multicore parallelization and vectorization to an application and to execute code that uses the latest processor instructions. Intel® IPP 7.0 includes roughly 175 functions optimized for Intel® AVX, covering operations such as FFTs, filtering, convolution, and resizing. Intel® MKL 10.2 introduced Intel® AVX support for BLAS (dgemm), FFTs, and VML (exp, log, pow). The implementation was simplified in Intel® MKL 10.3 by removing the initial requirement to call mkl_enable_instructions. Intel® MKL 10.3 extends Intel® AVX support to DGEMM/SGEMM, radix-2 complex FFTs, most of the real-valued VML functions, and the VSL distribution generators.

If you are already using, or considering using, these versions of the libraries, then your application is ready to take advantage of the Intel® AVX instruction set. The libraries execute Intel® AVX instructions when running on a Sandy Bridge platform and are supported on the Linux*, Windows*, and Mac OS* X platforms.
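As an illustration of the library approach, the sketch below calls one of the Intel® IPP signal-processing primitives listed in the AVX-optimized tables earlier in this article. The caller does nothing AVX-specific; the IPP dispatcher selects the AVX code path when the CPU and OS support it. The array contents and length are illustrative, and the prototypes should be checked against the headers of your IPP version.

#include <stdio.h>
#include "ipp.h"

int main(void)
{
    Ipp32f src1[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    Ipp32f src2[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    Ipp32f dst[8];

    /* ippsAdd_32f is listed in the Signal Processing table above. */
    IppStatus st = ippsAdd_32f(src1, src2, dst, 8);
    if (st != ippStsNoErr)
        printf("IPP error: %s\n", ippGetStatusString(st));
    else
        printf("dst[0] = %.1f\n", dst[0]);
    return 0;
}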

More information about Intel® IPP functions optimized for Intel® AVX is available at: /zh-cn/articles/intel-ipp-intel-avx

More information about Intel® MKL AVX support is available at: Intel® AVX Optimization in Intel® MKL V10.3


Summary

The demand for greater computing performance continues to drive Intel's innovation in microarchitectures and instruction sets. Application developers want to ensure that their products take advantage of these technological advances without additional investment in development resources. The methods, tools, and libraries described in this article help developers benefit from the advances introduced by Intel® Advanced Vector Extensions without writing Intel® AVX assembly language.


Additional Information and References

Intel® Advanced Vector Extensions

Intel® Compilers

Intel® C++ Composer XE 2011 Documentation

How to Compile for Intel® AVX

The Intel® C++ Compiler Vectorization Guide

Intel® Integrated Performance Primitives Functions Optimized for Intel® Advanced Vector Extensions

Intel® AVX (Advanced Vector Extensions) Optimization in Intel® MKL V10.3


Table 1 – Summary of Intel® Compiler Command-Line Switches

Description | Windows* | Linux* | Mac OS* X
Generate multiple processor-specific auto-dispatch code paths for Intel processors when there is a performance benefit | /Qax | -ax | -ax
Generate a vectorization report | /Qvec-report<n> | -vec-report<n> | -vec-report<n>
Generate an assembly-language file | /S | -S | -S

Using AVX Without Writing AVX Code


Using AVX Without Writing AVX Code (PDF 260KB)

Abstract

Intel® Advanced Vector Extensions (Intel® AVX) is a 256-bit extension of the Intel® Streaming SIMD Extensions (Intel® SSE) designed for floating-point-intensive applications. Intel® SSE and Intel® AVX are both examples of Single Instruction Multiple Data (SIMD) instruction sets. Intel® AVX was released as part of the 2nd generation Intel® Core™ processor family. Intel® AVX delivers a significant performance boost through wider 256-bit vectors, a new extensible instruction format (Vector Extension, or VEX), and rich functionality.

The instruction set architecture supports three operands, which improves instruction programming flexibility and allows non-destructive source operands. Legacy 128-bit SIMD instructions have also been extended to support three operands and the new instruction-encoding format (VEX). An instruction-encoding format describes the way opcodes and prefixes are used to express higher-level instructions in a form the processor can understand. The result is better handling of data and of general applications such as image, audio/video processing, scientific simulation, financial analytics, and 3D modeling and analysis.

This article discusses the ways developers can integrate Intel® AVX into applications without explicitly coding in low-level assembly language. For C/C++ developers, the most direct way to access Intel® AVX is through C-compatible intrinsics. These intrinsic functions provide access to the Intel® AVX instruction set and to higher-level math functions in the Intel® Short Vector Math Library (SVML). They are declared in the immintrin.h and ia32intrin.h header files, respectively. Application programmers can also use Intel® AVX in other ways without adding any Intel® AVX instructions to their source code. This article surveys these methods using Intel® C++ Composer XE 2011 targeting execution on a Sandy Bridge system. Intel® C++ Composer XE is supported on Linux*, Windows*, and Mac OS* X platforms; this article uses the command-line switches for the Windows* platform.

This article is part of the larger series "The Intel® Guide for Developing Multithreaded Applications," which provides guidelines for developing efficient multithreaded applications for Intel® platforms.

Background

A processor with vector or SIMD support can perform one operation on multiple data operands simultaneously in a single instruction. The process of performing an operation on one number by another number to produce a single result is known as a scalar process. Performing the operation on N numbers simultaneously to produce N results (N > 1) is known as a vector process. The technology is supported on Intel processors and on compatible non-Intel processors that support the SIMD or AVX instructions. The process of converting an algorithm from scalar to vector form is called vectorization.

Advice

Recompile for Intel® AVX

The first method is simply to recompile with the /QaxAVX compiler switch. No source-code changes are required. The Intel® compiler generates the appropriate 128-bit and 256-bit VEX-encoded Intel® AVX instructions. When it is beneficial for performance, the Intel® compiler generates multiple processor-specific, auto-dispatched code paths for Intel processors, and the most appropriate code is executed at run time.

Compiler Auto-Vectorization

Compiling the application with the appropriate architecture switch is the first step toward building an Intel® AVX-ready application. With auto-vectorization, the compiler does most of the vectorization work on behalf of the software developer. Auto-vectorization is an optimization the compiler performs when specific conditions are met, and the Intel® C++ compiler can perform it automatically during code generation. Vectorization is covered in detail in The Intel® C++ Compiler Vectorization Guide. The Intel compiler looks for vectorization opportunities whenever the optimization level is /O2 or higher.

Consider a simple matrix-vector multiplication example, shipped with Intel® C++ Composer XE, that illustrates the concepts of vectorization. The following code fragment is from the matvec function in Multiply.c of the vec_samples archive:
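void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[], FTYPE x[])
{
	int i, j;
	for (i = 0; i < size1; i++) {
		b[i] = 0;

		for (j = 0; j < size2; j++) {
			b[i] += a[i][j] * x[j];
		}
	}
}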

Without vectorization, the outer loop executes size1 times and the inner loop executes size1*size2 times. After vectorization with the /QaxAVX switch, the inner loop can be unrolled because four multiplies and four adds can be performed in a single instruction per operation. The vectorized loop is much more efficient than the scalar loop. The Intel® AVX advantage also applies to single-precision floating-point numbers, since eight single-precision floating-point operands fit in a ymm register.

A loop must meet certain criteria to be vectorized. The loop trip count must be known when the loop is entered at run time; it can be a variable, but it must be constant while the loop executes. The loop must have a single entry and a single exit, and the exit cannot depend on the input data. There are some branching criteria as well; for example, switch statements are not allowed. If statements are allowed provided they can be implemented as masked assignments. Innermost loops are the most likely candidates for vectorization, and the use of function calls within the loop can limit vectorization. Inlined functions and intrinsic SVML functions increase vectorization opportunities.

It is a good idea to review vectorization information during the implementation and debugging phases of application development. The Intel® compiler provides a vectorization report that helps you understand what was and was not vectorized. The report is enabled via the /Qvec-report=<n> command-line option, where n specifies the level of detail; the detail increases with the value of n. With n=3, dependency information, loops that were vectorized, and loops that were not vectorized are reported. Developers can modify the implementation based on the information in the report; the reason a loop was not vectorized is particularly helpful.

Developers have detailed knowledge of their specific applications and can sometimes override the default auto-vectorization behavior. Pragmas provide additional information to assist the auto-vectorization process. Some examples include: always vectorize a loop, assert that data within the loop is aligned, ignore potential data dependencies, and so on. The addFloats example, shown below, illustrates some important points. You need to inspect the generated assembly-language instructions to see what the compiler generated; the Intel compiler produces an assembly file in the current working directory when the /S command-line option is specified.
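void addFloats(float *a, float *b, float *c, float *d, float *e, int n)	{
	int i;
#pragma simd
#pragma vector aligned
	for(i = 0; i < n; ++i)	{
		a[i] = b[i] + c[i] + d[i] + e[i];
	}
}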

Note the use of the simd and vector pragmas. They play an important role in achieving the desired 256-bit Intel® AVX vectorization. Adding "#pragma simd" to the code helps generate a packed version using 128-bit Intel® AVX instructions; the compiler also unrolls the loop, reducing the number of instructions executed for the end-of-loop test. Specifying "pragma vector aligned" tells the compiler to use aligned data-movement instructions for all array references. Using both "pragma simd" and "pragma vector aligned" generates the desired 256-bit Intel® AVX instructions. The Intel® compiler chose vmovups because there is no penalty for using the unaligned move instruction when accessing aligned memory on 2nd generation Intel® Core™ processors.

With #pragma simd and #pragma vector aligned (the corresponding assembly listings are shown in the full article above)

This demonstrates some of the auto-vectorization capabilities of the Intel® compiler. Vectorization can be confirmed through the vectorization report (for example, via the simd pragma's assert clause) or by inspecting the generated assembly-language instructions. Pragmas can provide further guidance to the compiler when developers have detailed knowledge of their application. Please refer to The Intel® C++ Compiler Vectorization Guide for more information on vectorization with the Intel compiler. The Intel® C++ Compiler XE 12.0 User and Reference Guide provides additional information on using vectorization, pragmas, and compiler switches. The Intel compiler can do much of the vectorization work for you, so your application is ready to take advantage of Intel® AVX.

Intel® Cilk™ Plus C/C++ Extensions for Array Notation

The Intel® Cilk™ Plus C/C++ language extensions for array notation are Intel-specific language extensions that apply when an algorithm operates on arrays and does not require a specific order of operations among the array elements. If the algorithm is expressed using array notation and compiled with the AVX switch, the Intel® compiler generates Intel® AVX instructions. The C/C++ language extensions for array notation are intended to let users directly express high-level parallel vector array operations in their programs. This assists the compiler in performing data-dependence analysis, vectorization, and auto-parallelization. From the developer's point of view, the result is more predictable vectorization, improved performance, and better hardware resource utilization. The combination of the C/C++ language extensions for array notation with the other Intel® Cilk™ Plus language extensions simplifies parallel and vectorized application development.

To realize these benefits, a developer writes a standard C/C++ elemental function that expresses the operation in scalar syntax. When called without the C/C++ language extensions for array notation, the elemental function operates on a single element. The elemental function must be declared with "__declspec(vector)" so that it can also be invoked through the C/C++ language extensions for array notation.

The multiplyValues example is shown as an elemental function:
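__declspec(vector) float multiplyValues(float b, float c)
{
	return b*c;
}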

The scalar invocation is illustrated by this simple example:
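	float a[12], b[12], c[12];
	a[j] = multiplyValues(b[j], c[j]);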

In addition, with the C/C++ language extensions for array notation, the function can operate on a whole array or on a section of an array. The section operator describes the portion of the array on which to operate. The syntax is: [<lower bound> : <length> : <stride>]

The lower bound is the starting index in the source array, the length is the length of the resulting array, and the stride expresses the stride through the source array. The stride is optional and defaults to one.

These array-section examples help illustrate the usage:
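	float a[12];

	a[:] refers to the entire a array
	a[0:2:3] refers to elements 0 and 3 of array a.
	a[2:2:3] refers to elements 2 and 5 of array a
	a[2:3:3] refers to elements 2, 5, and 8 of array a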

The notation also supports multidimensional arrays.
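	float a2d[12][4];

	a2d[:][:] refers to the entire a2d array
	a2d[:][0:2:2] refers to elements 0 and 2 of the columns for all rows of a2d.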

With array notation, it is easy to invoke multiplyValues using arrays. The Intel® compiler provides a vectorized version of the function and invokes it as appropriate. Two cases are shown below: the first operates on entire arrays, and the second operates on a subset, or section, of the arrays.

This example invokes the function for the entire arrays:

a[:] = multiplyValues(b[:], c[:]);

This example invokes the function for a subset (section) of the arrays:

a[0:5] = multiplyValues(b[0:5], c[0:5]);

These simple examples show how the C/C++ language extensions for array notation take advantage of Intel® AVX without requiring the developer to explicitly use any Intel® AVX instructions. The C/C++ language extensions for array notation can be used with or without elemental functions. The technology provides developers with flexibility and choice while taking advantage of the latest Intel® AVX instruction set architecture. Please refer to the Intel® C++ Compiler XE 12.0 User and Reference Guide for more information on the Intel® Cilk™ Plus C/C++ language extensions for array notation.

Use the Intel® IPP and Intel® MKL Libraries

With the Intel® Integrated Performance Primitives and the Intel® Math Kernel Library, Intel provides thousands of highly optimized software functions for multimedia, data processing, cryptography, and communications applications. These thread-safe libraries support multiple operating systems, and the fastest code runs on each given platform. They offer an easy way to add multicore parallelization and vectorization to an application and to execute code that uses the latest processor instructions. Intel® IPP 7.0 includes roughly 175 functions optimized for Intel® AVX, covering operations such as FFTs, filtering, convolution, and resizing. Intel® MKL 10.2 introduced Intel® AVX support for BLAS (dgemm), FFTs, and VML (exp, log, pow). The implementation was simplified in Intel® MKL 10.3 by removing the initial requirement to call mkl_enable_instructions. Intel® MKL 10.3 extends Intel® AVX support to DGEMM/SGEMM, radix-2 complex FFTs, most of the real-valued VML functions, and the VSL distribution generators.

If you are already using, or considering using, these versions of the libraries, then your application is ready to take advantage of the Intel® AVX instruction set. The libraries execute Intel® AVX instructions when running on a Sandy Bridge platform and are supported on the Linux*, Windows*, and Mac OS* X platforms.

More information about Intel® IPP functions optimized for Intel® AVX is available at: /en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions. More information about Intel® MKL AVX support is available at: Intel® AVX Optimization in Intel® MKL V10.3

Usage Guidelines

The demand for greater computing performance continues to drive Intel's innovation in microarchitectures and instruction sets. Application developers want to ensure that their products take advantage of these technological advances without additional investment in development resources. The methods, tools, and libraries described in this article help developers benefit from the advances introduced by Intel® Advanced Vector Extensions without writing Intel® AVX assembly language.

Additional Resources

Wiener Filtering Using Intel® Advanced Vector Extensions


1 Introduction
Intel® Advanced Vector Extensions (Intel® AVX) is a 256-bit extension of the Intel® Streaming SIMD Extensions (Intel® SSE) designed for floating-point-intensive applications. The instructions provide a way to accelerate applications that rely heavily on floating-point computation, such as 3D geometry, video processing, image processing, and spatial (3D) audio. This application note describes Wiener filtering and includes a code example optimized with Intel® AVX. The source code and application note referenced in this document are AP-807, Wiener Filtering Using Streaming SIMD Extensions [4]. The original article contains optimizations using the Streaming SIMD Extensions; this article additionally describes how the code was migrated to the 256-bit extensions, i.e., Intel® AVX.


2 The Wiener Filter Algorithm
Wiener filtering, also known as least-mean-square filtering, is a technique for removing unwanted noise from an image. The description of the algorithm is taken from The Pocket Handbook of Image Processing Algorithms in C by Harley R. Myler and Arthur R. Weeks [1]. The algorithm takes four (Fourier-transformed) vector inputs representing (a portion of) the original image (Img), the degraded image (Guv), the noise image spectrum (Noise), and the degradation function (Huv). Each input is a complex vector with row*col elements; each complex number is represented as two consecutive floats holding the real and imaginary parts. An additional parameter, gamma, is included in the computation. When gamma is 1.0 the filter is considered non-parametric; the filter parameter may be adjusted until the filtered image is satisfactory.


2.1 Applications That Use the Wiener Filter
The Wiener filter is commonly used in image processing applications to remove noise from a reconstructed image, and it is often used to restore blurred images. The Wiener filter has also proven important in adaptive filtering, has been applied to wavelet transforms, and is used in communications and other DSP-related fields. The reader should also recognize that the Fourier transform is a key element of all areas of signal processing; for more information on implementing Fourier transforms, refer to the Intel application note Split-Radix FFT (AP-808) [3].


2.2 Implementing the Wiener Filter
As described in section 2.1, the inputs to the function are four sets of complex numbers. For each portion of the image, the following operations are performed using complex arithmetic. The complex variables D and Hs are intermediate values used in the computation, and the function complex_conj takes the complex conjugate of a complex number. Each division must be guarded by an if statement to ensure that the denominator is not zero; when the denominator is zero, the result should be set to zero.
1. complex Noise = gamma * ( Noise * complex_conj( Noise ) )
2. complex D = Img * complex_conj( Img )
3. complex D = Noise / D
4. complex Hs = Huv * complex_conj( Huv )
5. complex Num = complex_conj( Huv ) * Guv
6. complex Img = Num / ( Hs + D )


3 Vectorizing the Code with 128-Bit SIMD
The reference source code for this article (the C version and the 128-bit SIMD version) comes from Intel application note AP-807. The following describes the original port of the scalar code to 128-bit SIMD. The first, straightforward optimization comes from observing that many steps of the computation multiply a number by its complex conjugate. Because the result of such a multiplication has no imaginary part, many of the steps listed in section 2.2 can be simplified; the resulting C code is given in section 7.

Before the code can be optimized for Intel® SSE, it must be transformed into a form suitable for SIMD execution. Four iterations of the original C code are grouped together and processed in a single iteration of the new loop, so each pass through the new loop does the work of four original iterations. Based on the operations listed in section 2.2, three divisions are needed per iteration: one in step 3 (since the imaginary part is zero, no division is needed for it) and two in step 6. One of these can be eliminated by noting that the two divisions in step 6 share the same denominator. The if statements that check for a zero denominator can be removed with a masking technique, and a further improvement is obtained by replacing the divisions with reciprocal approximations. These techniques are described below.

Once the code has been converted to SIMD form, the zero-denominator check (the if statement) is eliminated as follows: create a mask of the elements with non-zero denominators and AND it with the result of the division, which clears the elements that were divided by zero. Floating-point divide-by-zero exceptions are avoided by masking the SIMD floating-point exceptions in the MXCSR register; this technique assumes that a QNAN, rather than a SNAN, is produced. For example, suppose you want to compute the quantity ( N / D ), where N and D are floats. A typical code sequence looks like this:

If ( D != 0 )
Result = N / D;
Else
Result = 0.0;

The result can be computed without the if statement as ( and( div( N, D ), cmp_neq( D, 0 ) ) ). Expressed with intrinsics, this is:

_mm_and_ps ( _mm_div_ps ( N, D ), _mm_cmpneq_ps ( D, zero ) )

This technique is specific to the vector code and is implemented in both the intrinsics and the assembly-language versions of the code.

The Newton-Raphson method is a classic technique for approximating functions. First, an initial "guess" at the reciprocal is computed with the rcpps instruction; the guess is then refined with a Newton-Raphson iteration. The result is not as accurate as that produced by the divide instruction, but it is obtained much faster. (The programmer must determine whether the application can tolerate the lower-precision answer.) A detailed description of this technique can be found in the Newton-Raphson application note [2]. The specific code sequence used by the Wiener filter is shown below. (The denominator must still be checked to guard against division by zero.)

RC    = _mm_rcp_ps( D );
RECIP = _mm_sub_ps( _mm_add_ps( RC, RC ), _mm_mul_ps( RC, _mm_mul_ps( RC, D ) ) );


4 Vectorizing the Code with 256-Bit SIMD
The 128-bit vector code ports easily to 256-bit SIMD with a few migration changes. The 256-bit code performs 8 iterations per loop pass, whereas the 128-bit Intel® SSE code performs only 4. The load/store instructions and the rcp/mul/add/sub instructions now operate on 8 data points. The branch-elimination technique described above is retained in the Intel® AVX code. As the code shows, the port was done with the corresponding Intel® AVX intrinsics (for example, _mm256_mul_ps for 256-bit versus _mm_mul_ps for 128-bit). The complete 256-bit Intel® AVX code is given in section 7.3.


5 Grouping the Input/Output Arrays
Performance of the 256-bit SIMD code can be improved further by grouping the input and output arrays in sequential order. Accessing memory contiguously through the cache/memory hierarchy reduces potential cache way conflicts and presents a simpler access pattern to the CPU hardware prefetchers, improving their accuracy.


6 Conclusion

For a large number of iterations, the performance (in CPU clock cycles) of the 128-bit SIMD code and the 256-bit SIMD code compares as follows:

 

                                    Intel® AVX (256-bit)    Intel® SSE (128-bit)    Intel® AVX vs. Intel® SSE
Wiener filter                       45871                   66933                   1.46x
Wiener filter with grouped arrays   42464                   64473                   1.51x


Overall, Wiener filtering with Intel® AVX is 1.46x faster than with Intel® SSE, so Intel® AVX substantially improves the performance of the Wiener filter algorithm over the 128-bit SIMD implementation. With the input/output arrays grouped, Wiener filtering with Intel® AVX is 1.51x faster overall than with Intel® SSE. The gains reported here come from several techniques: using Intel® AVX (vectorized code) and eliminating conditional branch instructions (if statements) with the masking operations provided by Intel® SSE and Intel® AVX. If results of lower numerical precision are acceptable, further performance can be gained by replacing the divisions with reciprocal approximations (the Newton-Raphson technique). The article also highlights how easily existing floating-point code can be ported to Intel® AVX.


7 Code Examples

7.1 C Code

/*
* Wiener Filter (also known as the Least Mean Square filter)
*
* Reference: The Pocket Handbook of Image Processing Algorithms in C
* by Harley R Myler & Arthur R. Weeks
* 1993 Prentice-Hall, ISBN 0-13-642240-3 p260-3.
*
* The data is several arrays of complex floats in row major order.
* The description for the algorithm from p260 states:
*
* The algorithm computes a parametric Wiener filter on the
* Fourier transform of a degraded image, Guv, with noise
* spectra N, degradation function Huv, and original image Img.
* The computation is in place, so that the filtered version of
* the input is returned in the original image variable. The
* original and noise images are either estimations form some
* predictive function or ad hoc approximations. If the noise
* image is zero, the process reduces to the inverse filter.
*
* The Weiner parameter gamma is passed to the algorithm.
* If this parameter is 1.0, the filter is non-parametric.
* Methods exist in the literature to derive the parameter value;
* however, it is sometimes determined from trial and error.
*
*NOTE!!!! The code on page 263 has an error. In cxml, the complex
* multiply routine, the imaginary part of the computation should be
* a*d + b*c, not a*d - b*c.
*
*NOTE! (another error) The *complex* array length is rows*cols, so the
* *float* array length should be 2*rows*cols. Also, note that the
* algorithm operates on one component of the pixel.
*/
void wiener_filter ( float *Img,
float *Huv,
float *No,
float *Guv,
float gamma,
int rows,
int cols)
{
int i, sz;
float numr, numi, dr, hsr;
sz = 2 * rows * cols;
for (i = 0; i < sz; i += 2)
{
/* Compute (in place) the noise spectral density with Wiener gamma*/
No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
No[i+1] = (float) 0.0;
/* Compute image spectral density */
dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
/* Compute denominator spectral density term */
if (dr != 0.0)
dr = (float) (No[i] / dr) ;
/* Compute degradation power spectrum */
hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
/* Compute numerator term */
numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
/* Final computation */
if ( (hsr + dr) != 0.0 )
{
Img[i] = (float) (numr / (hsr + dr));
Img[i+1] = (float) (numi / (hsr + dr));
}
else
{
Img[i] = (float) 0.0;
Img[i+1] = (float) 0.0;
}
}
} /* wiener_filter */



7.2 128-Bit Intrinsics Code

/* Header file names were stripped from the original posting; <assert.h> and
   <xmmintrin.h> are the headers this code requires. */
#include <assert.h>
//#define MM_FUNCTIONALITY
#include <xmmintrin.h>
void intrin_wiener_rcp_sse( float *Img,
float *Huv,
float *No,
float *Guv,
float gamma,
int rows,
int cols )
{
int i, sz;
__m128 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;
__m128 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;
__m128 rc, denom;
__m128 zero = _mm_set_ps1 (0.0);
sz = 2 * rows * cols;
assert( (sz > 3) & !(sz & 3) );
assert( !( ((int)Img) & 15 ) ); /* Assume alignment */
assert( !( ((int)Huv) & 15 ) );
assert( !( ((int)No) & 15 ) );
assert( !( ((int)Guv) & 15 ) );
for (i = 0; i < sz; i += 8)
{
/*
* Compute (in place) the noise spectral density with Wiener gamma
*
* complex Noise = gamma * (Noise * complex conj Noise)
*
* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
* No[i+1] = (float) 0.0;
*/
first2 = _mm_load_ps ( &No[i] );
next2 = _mm_load_ps ( &No[i+4] );
nor4 = _mm_shuffle_ps( first2, next2, 0x88 );
noi4 = _mm_shuffle_ps( first2, next2, 0xdd );
nr4 = _mm_mul_ps ( _mm_set_ps1( gamma ) ,
_mm_add_ps ( _mm_mul_ps( nor4 , nor4 ),
_mm_mul_ps( noi4 , noi4 ) ) );
_mm_store_ps( &No[i ], _mm_unpacklo_ps ( nr4, zero ) );
_mm_store_ps( &No[i+4], _mm_unpackhi_ps ( nr4, zero ) );
/*
* Compute image spectral density
*
* Complex D = Image * complex conj Image
*
* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
*/
first2 = _mm_load_ps ( &Img[i] );
next2 = _mm_load_ps ( &Img[i+4] );
inr4 = _mm_shuffle_ps( first2, next2, 0x88 );
ini4 = _mm_shuffle_ps( first2, next2, 0xdd );
dr4 = _mm_add_ps ( _mm_mul_ps( inr4 , inr4),
_mm_mul_ps( ini4 , ini4) );
/*
* Compute denominator spectral density term
*
* Complex D = noise / D
*
* if (dr != 0.0)
* dr = (float) (No[i] / dr) ;
*
* Do that reciprical division thing!
*/
rc = _mm_rcp_ps(dr4);
rc = _mm_sub_ps( _mm_add_ps( rc, rc),
_mm_mul_ps( rc, _mm_mul_ps( rc, dr4) ) );
dr4 = _mm_and_ps ( _mm_mul_ps ( nr4 , rc ),
_mm_cmpneq_ps( dr4, zero ) );
/*
* Compute degradation power spectrum
*
* Complex Hs = Huv * complex conj Huv
*
* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
*/
first2 = _mm_load_ps ( &Huv[i] );
next2 = _mm_load_ps ( &Huv[i+4] );
hr4 = _mm_shuffle_ps( first2, next2, 0x88 );
hi4 = _mm_shuffle_ps( first2, next2, 0xdd );
hsr4 = _mm_add_ps ( _mm_mul_ps (hr4 , hr4 ),
_mm_mul_ps (hi4 , hi4 ) );
/*
* Compute numerator term
*
* Complex Num = complex conj Huv * Guv
*
* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
*/
first2 = _mm_load_ps ( &Guv[i] );
next2 = _mm_load_ps ( &Guv[i+4] );
gr4 = _mm_shuffle_ps( first2, next2, 0x88 );
gi4 = _mm_shuffle_ps( first2, next2, 0xdd );
numr4 = _mm_add_ps ( _mm_mul_ps (hr4 , gr4),
_mm_mul_ps (hi4 , gi4) );
numi4 = _mm_sub_ps ( _mm_mul_ps (hr4 , gi4),
_mm_mul_ps (hi4 , gr4) );
/*
* Final computation
*
* Complex Image = Num / (Hs + D)
*
* if ( (hsr + dr) != 0.0 )
* {
* Img[i] = (float) (numr / (hsr + dr));
* Img[i+1] = (float) (numi / (hsr + dr));
* }
* else
* {
* Img[i] = (float) 0.0;
* Img[i+1] = (float) 0.0;
* }
*
* Do the reciprical division thing
*/
denom = _mm_add_ps( hsr4, dr4 );
rc = _mm_rcp_ps(denom);
rc = _mm_sub_ps( _mm_add_ps( rc, rc),
_mm_mul_ps( rc, _mm_mul_ps( rc, denom) ) );
inr4 = _mm_and_ps( _mm_mul_ps ( numr4 , rc ) ,
_mm_cmpneq_ps( denom, zero ) );
ini4 = _mm_and_ps( _mm_mul_ps ( numi4 , rc ) ,
_mm_cmpneq_ps( denom, zero ) );
_mm_store_ps( &Img[i ], _mm_unpacklo_ps ( inr4, ini4 ) );
_mm_store_ps( &Img[i+4], _mm_unpackhi_ps ( inr4, ini4 ) );
}
} /* intrin_wiener_rcp */


    


7.3 256-Bit Intrinsics Code

void intrin_wiener_rcp_avx( float *Img,
					   float *Huv,
					   float *No,
					   float *Guv,
					   float gamma,
					   int rows,
					   int cols )
{
	int i, sz;
	__m256 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;
	__m256 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;
	__m256 rc, denom;
	__m256 zero = _mm256_setzero_ps();
	sz = 2 * rows * cols;
	assert( (sz > 3) & !(sz & 3) );
	assert( !( ((int)Img) & 31 ) ); /* Assume 32-byte alignment, required by _mm256_load_ps/_mm256_store_ps */
	assert( !( ((int)Huv) & 31 ) );
	assert( !( ((int)No) & 31 ) );
	assert( !( ((int)Guv) & 31 ) );
	for (i = 0; i < sz; i += 16)
	{
		/*
		* Compute (in place) the noise spectral density with Wiener gamma
		*
		* complex Noise = gamma * (Noise * complex conj Noise)
		*
		* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
		* No[i+1] = (float) 0.0;
		*/
		first2 = _mm256_load_ps ( &No[i] );
		next2 = _mm256_load_ps ( &No[i+4*2] );
		nor4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		noi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		nr4 = _mm256_mul_ps ( _mm256_set1_ps( gamma ) ,
		_mm256_add_ps ( _mm256_mul_ps( nor4 , nor4 ),
		_mm256_mul_ps( noi4 , noi4 ) ) );
		_mm256_store_ps( &No[i ], _mm256_unpacklo_ps ( nr4, zero ) );
		_mm256_store_ps( &No[i+4*2], _mm256_unpackhi_ps ( nr4, zero ) );
		
		/*
		* Compute image spectral density
		*
		* Complex D = Image * complex conj Image
		*
		* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
		*/
		first2 = _mm256_load_ps ( &Img[i] );
		next2 = _mm256_load_ps ( &Img[i+4*2] );
		inr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		ini4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		dr4 = _mm256_add_ps ( _mm256_mul_ps( inr4 , inr4),
		_mm256_mul_ps( ini4 , ini4) );
		/*
		* Compute denominator spectral density term
		*
		* Complex D = noise / D
		*
		* if (dr != 0.0)
		* dr = (float) (No[i] / dr) ;
		*
		* Do that reciprical division thing!
		*/
		rc = _mm256_rcp_ps(dr4);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, dr4) ) );
		dr4 = _mm256_and_ps ( _mm256_mul_ps ( nr4 , rc ),
		_mm256_cmpneq_ps( dr4, zero ) );
		/*
		* Compute degradation power spectrum
		*
		* Complex Hs = Huv * complex conj Huv
		*
		* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
		*/
		first2 = _mm256_load_ps ( &Huv[i] );
		next2 = _mm256_load_ps ( &Huv[i+4*2] );
		hr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		hi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		hsr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , hr4 ),
		_mm256_mul_ps (hi4 , hi4 ) );
		/*
		* Compute numerator term
		*
		* Complex Num = complex conj Huv * Guv
		*
		* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
		* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
		*/
		first2 = _mm256_load_ps ( &Guv[i] );
		next2 = _mm256_load_ps ( &Guv[i+4*2] );
		gr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		gi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		numr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , gr4),
		_mm256_mul_ps (hi4 , gi4) );
		numi4 = _mm256_sub_ps ( _mm256_mul_ps (hr4 , gi4),
		_mm256_mul_ps (hi4 , gr4) );
		/*
		* Final computation
		*
		* Complex Image = Num / (Hs + D)
		*
		* if ( (hsr + dr) != 0.0 )
		* {
		* Img[i] = (float) (numr / (hsr + dr));
		* Img[i+1] = (float) (numi / (hsr + dr));
		* }
		* else
		* {
		* Img[i] = (float) 0.0;
		* Img[i+1] = (float) 0.0;
		* }
		*
		* Do the reciprical division thing
		*/
		denom = _mm256_add_ps( hsr4, dr4 );
		rc = _mm256_rcp_ps(denom);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, denom) ) );
		inr4 = _mm256_and_ps( _mm256_mul_ps ( numr4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );
		ini4 = _mm256_and_ps( _mm256_mul_ps ( numi4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );
		_mm256_store_ps( &Img[i ], _mm256_unpacklo_ps ( inr4, ini4 ) );
		_mm256_store_ps( &Img[i+4*2], _mm256_unpackhi_ps ( inr4, ini4 ) );

	}
} /* intrin_wiener_rcp */


    


7.4 256-Bit Intrinsics Code with Grouped Arrays

The blockHNG array groups the inputs into consecutive, interleaved blocks of 16 floats each:

Huv[0]..Huv[15] | No[0]..No[15] | Guv[0]..Guv[15] | Huv[16]..Huv[31] | No[16]..

void intrin_wiener_rcp_avx ( float *Img,
					float *_blockHNG,
					float gamma,
					int rows,
					int cols)
{
	int sz;
	__m256 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;
	__m256 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;
	__m256 rc, denom;
	__m256 zero = _mm256_setzero_ps();
	sz = 2 * rows * cols;

	assert( (sz > 3) & !(sz & 3) );
	assert( !( ((int)Img) & 31 ) ); // Assume 32-byte alignment, required by _mm256_load_ps/_mm256_store_ps
	assert( !( ((int)_blockHNG) & 31 ) ); // Assume alignment

	float *Huv;
	float *No;
	float *Guv;
	
	int j = 0;	// img index
	for (int _blockHNG_tracker = 0; _blockHNG_tracker < 2 * rows * cols * 3; _blockHNG_tracker += 48)
	{
		Huv = &(_blockHNG[_blockHNG_tracker]);
		No = &(_blockHNG[_blockHNG_tracker + 16]);
		Guv = &(_blockHNG[_blockHNG_tracker + 32]);

		/*
		* Compute (in place) the noise spectral density with Wiener gamma
		*
		* complex Noise = gamma * (Noise * complex conj Noise)
		*
		* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
		* No[i+1] = (float) 0.0;
		*/
		first2 = _mm256_load_ps ( &No[0] );
		next2 = _mm256_load_ps ( &No[8] );
		nor4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		noi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		nr4 = _mm256_mul_ps ( _mm256_set1_ps( gamma ) ,
		_mm256_add_ps ( _mm256_mul_ps( nor4 , nor4 ),
		_mm256_mul_ps( noi4 , noi4 ) ) );

		_mm256_store_ps( &No[0], _mm256_unpacklo_ps ( nr4, zero ) );
		_mm256_store_ps( &No[8], _mm256_unpackhi_ps ( nr4, zero ) );

		/*
		* Compute image spectral density
		*
		* Complex D = Image * complex conj Image
		*
		* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
		*/
		first2 = _mm256_load_ps ( &Img[j] );
		next2 = _mm256_load_ps ( &Img[j+8] );
		inr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		ini4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		dr4 = _mm256_add_ps ( _mm256_mul_ps( inr4 , inr4),
		_mm256_mul_ps( ini4 , ini4) );
		/*
		* Compute denominator spectral density term
		*
		* Complex D = noise / D
		*
		* if (dr != 0.0)
		* dr = (float) (No[i] / dr) ;
		*
		* Do that reciprical division thing!
		*/
		rc = _mm256_rcp_ps(dr4);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, dr4) ) );
		dr4 = _mm256_and_ps ( _mm256_mul_ps ( nr4 , rc ),
		_mm256_cmpneq_ps( dr4, zero ) );
		/*
		* Compute degradation power spectrum
		*
		* Complex Hs = Huv * complex conj Huv
		*
		* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
		*/
		first2 = _mm256_load_ps ( &Huv[0] );
		next2 = _mm256_load_ps ( &Huv[8] );
		hr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		hi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		hsr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , hr4 ),
		_mm256_mul_ps (hi4 , hi4 ) );
		/*
		* Compute numerator term
		*
		* Complex Num = complex conj Huv * Guv
		*
		* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
		* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
		*/
		first2 = _mm256_load_ps ( &Guv[0] );
		next2 = _mm256_load_ps ( &Guv[8] );
		gr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		gi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		numr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , gr4),
		_mm256_mul_ps (hi4 , gi4) );
		numi4 = _mm256_sub_ps ( _mm256_mul_ps (hr4 , gi4),
		_mm256_mul_ps (hi4 , gr4) );
		/*
		* Final computation
		*
		* Complex Image = Num / (Hs + D)
		*
		* if ( (hsr + dr) != 0.0 )
		* {
		* Img[i] = (float) (numr / (hsr + dr));
		* Img[i+1] = (float) (numi / (hsr + dr));
		* }
		* else
		* {
		* Img[i] = (float) 0.0;
		* Img[i+1] = (float) 0.0;
		* }
		*
		* Do the reciprical division thing
		*/
		denom = _mm256_add_ps( hsr4, dr4 );
		rc = _mm256_rcp_ps(denom);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, denom) ) );
		inr4 = _mm256_and_ps( _mm256_mul_ps ( numr4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );
		ini4 = _mm256_and_ps( _mm256_mul_ps ( numi4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );

		_mm256_store_ps( &Img[j ], _mm256_unpacklo_ps ( inr4, ini4 ) );
		_mm256_store_ps( &Img[j+8], _mm256_unpackhi_ps ( inr4, ini4 ) );
		j+=16;
	}
} /* Intrin_wiener_rcp_avx */



Acknowledgements
The author would like to thank Phil Kerly, Raghu Muthyalampalli, and Justin Landon for their significant contributions to this white paper, including evaluating code performance, offering performance suggestions, and reviewing the paper.

References
The following documents are referenced in this application note and provide background or supporting information for understanding the topics presented in this document.

1. The Pocket Handbook of Image Processing Algorithms in C, Harley R. Myler and Arthur R. Weeks. ISBN 0-13-642240-3.
2. Increasing the Accuracy of the Results from the Reciprocal and Reciprocal Square Root Instructions using the Newton-Raphson Method, Intel Application Note (AP-803, Order Number: 243637-001).
3. Split-Radix FFT, Intel Application Note (AP-808, Order Number: 243642-001).
4. Wiener Filtering Using Streaming SIMD Extensions, Intel Application Note (AP-807).

Image Processing Acceleration Techniques Using Intel® Streaming SIMD Extensions and Intel® Advanced Vector Extensions


Introduction


Modern Intel processors accelerate computation by using SIMD (single instruction, multiple data) instruction sets, which include the widely available Intel® Streaming SIMD Extensions (Intel® SSE) and the new Intel® Advanced Vector Extensions (Intel® AVX). Image processing data structures and algorithms are often ideal candidates for optimization with these instruction sets. Combined with the Intel® C++ compiler's ability to auto-vectorize loops, they provide an efficient way to improve the performance of image processing applications.
In this article we describe some well-known conversion techniques and provide code samples that show how to take full advantage of Intel® SSE and Intel® AVX when converting image data, together with compiler auto-vectorization information for image processing algorithms. The article details how to optimize the implementation of data conversions and algorithms (for various data types and sizes), how to analyze and compare their performance, and provides a fast way to measure the Intel® SSE optimized code and to estimate the Intel® AVX optimized code.

Intel® AVX is a 256-bit extension of Intel® SSE designed for floating-point-intensive applications. Intel® AVX extends all 16 XMM registers to 256-bit YMM registers, doubling register bandwidth and thereby providing higher performance and energy efficiency than 128-bit SIMD instructions. Intel® AVX also reduces the number of register copies, uses registers more efficiently, and reduces code size.

As the following speedup summary shows, the proposed techniques achieve solid performance gains.


Filter                   Intel® SSE speedup    Intel® AVX speedup
Sepia (int base)         2.6x                  3.1x
Sepia (float base)       1.9x                  2.2x
Crossfade (int base)     2.7x                  3.6x
Crossfade (float base)   1.9x                  2.4x

The results above were measured on an Intel® Core™ i7 processor using the recommended data block size of about 50,000 pixels. Note that the Intel® AVX performance was estimated with a simulator and does not take future architecture improvements into account.


Overview


The code samples in this article assume the use of the Intel® C++ compiler and require a basic understanding of SIMD, Intel® SSE instruction intrinsics, and how auto-vectorization is performed. The compiler features, options, and pragmas apply to Intel® C++ Compiler 11.1.35 or later versions that support new instruction sets such as Intel® AVX.
The code samples were built with the C++ compiler and analyzed on Microsoft Windows* (Vista and XP).

Scope and assumptions:

  1. Images are represented as uncompressed RGBA pixel values, where each color channel is represented by an integer (8 bits) or a floating-point number (32 bits).
  2. To simplify the conversions, color values are represented by numbers between 0 and 255 stored in 8-bit integers or 32-bit floats.
  3. Data is allocated on 16-byte boundaries, except where Intel® AVX processing requires 32-byte alignment.

Notes on performance/speedup:

  • Because Intel processors with Intel® AVX were not yet available, the performance of the Intel® AVX functions in this article is an estimate. An architecture emulator (the Intel® Software Development Emulator) and a simulator (the Intel® Architecture Code Analyzer) were used to verify behavior and to estimate Intel® AVX performance.
  • Actual performance depends on processor architecture, cache configuration and size, frequency, and other factors.
  • This article covers only a limited set of image processing filters (algorithms). Whether the speedups carry over to other filters depends on filter complexity and the dependencies between pixels; there is no guarantee that other filters will gain performance from the techniques described.

Download the PDF

To read the rest of this article, click here to download the PDF (size: 1 MB).

Intel® Compiler Options for Intel® SSE and Intel® AVX Generation (SSE2, SSE3, SSE3_ATOM, SSSE3, SSE4.1, SSE4.2, AVX, AVX2) and Processor-Specific Optimizations


Which options target IA-32 and Intel® 64 processors in the 11.1, 12.0, and 12.1 compilers?
There are three main categories of processor-specific optimization options:

  1. Processor-specific options of the form /arch:<code> (Windows*) (-m<code> on Linux* or Mac OS* X) generate specialized code for the processors specified by <code>. Executables built with these options can run on the specified or later Intel® processors, and on compatible non-Intel® processors that support the same instruction set. The executable may contain optimizations specific to those processors and may use a particular version of the Intel® Streaming SIMD Extensions (SSE) and/or Intel® Advanced Vector Extensions (AVX) instruction set; on earlier processors that do not support the corresponding instruction set, it may fail with an illegal-instruction or similar error.
  2. The possible values of <code> are:

    AVX: may generate Intel® AVX, Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions.
    SSE4.2: may generate Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions.
    SSE4.1: may generate Intel® SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions.
    SSSE3: may generate Intel® SSSE3, SSE3, SSE2, and SSE instructions.
    SSE3: may generate Intel® SSE3, SSE2, and SSE instructions.
    SSE2: may generate Intel® SSE2 and SSE instructions. /arch:SSE2 is the default on Windows*, and -msse2 is the default on Linux*.
    IA32: generates generic IA-32 compatible code. Can be used only with the /arch: or -m switch (IA-32 compiler only).

  3. Processor-specific options of the form /Qx<code> (Windows*) (-x<code> on Linux* or Mac OS* X) generate specialized code for the processors specified by <code>. Because the resulting executable may contain optimizations specific to those processors and may use a particular version of the Streaming SIMD Extensions (SSE) and/or Intel® Advanced Vector Extensions (AVX) instruction set, it runs only on the specified or later Intel® processors. This switch enables some optimizations that are not available with the corresponding /arch:<code> or -m<code> switch. A run-time check is inserted into the generated executable; if the executable is run on an incompatible processor, the check stops the application. This helps you quickly discover that a program was not built for the processor on which it is running, potentially avoiding an illegal-instruction error. For the check to take effect, the source file containing the main program, or the main function of a dynamic library, should be compiled with this option.

  4. The possible values of <code> are:

    CORE-AVX2: may generate Intel® AVX2, Intel® AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for future Intel processors.
    CORE-AVX-I (core-avx-i): may generate Intel® AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors, including instructions for 3rd generation Intel® Core™ processors. Optimizes for 3rd generation Intel® Core™ processors.
    AVX: may generate Intel® AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for the 2nd generation Intel® Core™ i7, i5, and i3 processor families and the Intel® Xeon® processor E5 and E3 families.
    SSE4.2: may generate Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for the Intel® Core™ i7, i5, and i3 processor families, the Intel® Xeon® 55XX, 56XX, and 75XX series, and the Intel® Xeon® processor E7 family.
    SSE4.1: may generate Intel® SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for the 45nm Hi-k next generation Intel® Core™ microarchitecture.
    SSSE3: may generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for the Intel® Core™ microarchitecture. -xSSSE3 is the default for the Intel® 64 compiler on Mac OS* X.
    SSE3_ATOM: may generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for the Intel® Atom™ processor family and Intel® Centrino® Atom™ processor technology.
    SSE3: may generate Intel® SSE3, SSE2, and SSE instructions. Optimizes for the enhanced Pentium® M processor microarchitecture and the Intel NetBurst® microarchitecture. -xSSE3 is the default for the IA-32 compiler on Mac OS* X.
    SSE2: may generate Intel® SSE2 and SSE instructions. Optimizes for the Intel NetBurst® microarchitecture.

  5. Processor-dispatch options of the form /Qax<code> (Windows*) (-ax<code> on Linux* or Mac OS* X) enable generation of multiple Intel® processor-specific code paths. Processor dispatch performs a run-time check to determine which processor the application is running on and uses the most suitable code path for that processor. Compatible non-Intel processors use an optimized default code path. The switches described in 1. and 2. above can be used to modify the optimized default code path.

  6. The possible values of <code> are:

    CORE-AVX2: may generate Intel® AVX2, Intel® AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors.
    CORE-AVX-I (core-avx-i): may generate Intel® AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors, including instructions for 3rd generation Intel® Core™ processors.
    AVX: may generate Intel® AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors.
    SSE4.2: may generate Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors.
    SSE4.1: may generate Intel® SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors.
    SSSE3: may generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors.
    SSE3: may generate Intel® SSE3, SSE2, and SSE instructions for Intel® processors.
    SSE2: may generate Intel® SSE2 and SSE instructions for Intel® processors.

Which processor-specific option is best for my processor?

CORE-AVX2: future Intel processors
CORE-AVX-I: 3rd generation Intel® Core™ i7 processors; 3rd generation Intel® Core™ i5 processors
AVX: 2nd generation Intel® Core™ i7, i5, and i3 processors; Intel® Xeon® processor E5 family; Intel® Xeon® processor E3 family
SSE4.2: previous-generation Intel® Core™ i7, i5, and i3 processors; Intel® Xeon® 55XX, 56XX, and 75XX series; Intel® Xeon® processor E7 family
SSE4.1: Intel® Xeon® 74XX series; Quad-Core Intel® Xeon® 54XX and 33XX series; Dual-Core Intel® Xeon® 52XX and 31XX series; Intel® Core™ 2 Extreme 9XXX series; Intel® Core™ 2 Quad 9XXX series; Intel® Core™ 2 Duo 8XXX series; Intel® Core™ 2 Duo E7200
SSSE3: Quad-Core Intel® Xeon® 73XX, 53XX, and 32XX series; Dual-Core Intel® Xeon® 72XX, 53XX, 51XX, and 30XX series; Intel® Core™ 2 Extreme 7XXX and 6XXX series; Intel® Core™ 2 Quad 6XXX series; Intel® Core™ 2 Duo 7XXX (excluding E7200), 6XXX, 5XXX, and 4XXX series; Intel® Core™ 2 Solo 2XXX series; Intel® Pentium® dual-core processors E2XXX and T23XX series
SSE3_ATOM: Intel® Atom™ processors
SSE3: Dual-Core Intel® Xeon® 70XX, 71XX, and 50XX series; Dual-Core Intel® Xeon® processors (ULV and LV) 1.66, 2.0, and 2.16; Dual-Core Intel® Xeon® 2.8; Intel® Xeon® processors with SSE3 instruction set support; Intel® Core™ Duo; Intel® Core™ Solo; Intel® Pentium® dual-core processors T21XX and T20XX series; Intel® Pentium® processor Extreme Edition; Intel® Pentium® D; Intel® Pentium® 4 processors with SSE3 instruction set support
SSE2 (default): Intel® Xeon® processors; Intel® Pentium® 4 processors; Intel® Pentium® M
IA32: Intel® Pentium® III processors; Intel® Pentium® II processors; Intel® Pentium® processors


Which processor is targeted by default?

  • On IA-32 systems running Windows*, /arch:SSE2 is on by default; on IA-32 systems running Linux*, -msse2 is on by default. The generated code path should run on Intel Pentium 4 and Intel Xeon processors that support SSE2, and on later Intel processors or compatible non-Intel processors that support SSE2.
  • On IA-32 systems running Mac OS* X, -xSSE3 is on by default. The compiler may generate SSE3, SSE2, and SSE instructions, and the code is optimized for the enhanced Pentium M processor microarchitecture.
  • On Intel 64 systems running Mac OS* X, -xSSSE3 is on by default. The compiler may generate SSSE3, SSE3, SSE2, and SSE instructions, and the code is optimized for the Intel® Core™ microarchitecture.

To target older IA-32 systems that do not support SSE2 instructions (such as systems based on the Intel® Pentium® III processor), use the /arch:ia32 (Windows*) or -mia32 (Linux*) switch.
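For example (the source file name is illustrative), the following command lines contrast a processor-specific build with a build that uses processor dispatch:

    icl /QxAVX app.cpp            (Windows*: runs only on Intel processors that support Intel® AVX)
    icl /QaxAVX app.cpp           (Windows*: adds an Intel® AVX code path in addition to the default SSE2 code path)
    icc -axAVX -msse2 app.cpp     (Linux*: the equivalent of the previous command)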

For information about older processor-targeting options and how they relate to the options recommended above, please see
/en-us/articles/ia-32-and-intel64-processor-targeting-overview


Other Frequently Asked Questions
(to be addressed in follow-up articles)

 

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

What's New in Intel® MKL


What's New in Intel® MKL 10.3

Intel AVX is the next step in the evolution of Intel processors. Intel AVX optimizations have been extended to more MKL functions to deliver better performance on future Intel architectures.

  • Summary Statistics Library: an optimized parallel library that incorporates recent advances in statistics, providing advanced algorithms that improve the accuracy and performance of statistical computations.

Extended MKL C language support: C interfaces to LAPACK; zero-based indexing of C-style arrays in PARDISO.

Dynamic interface libraries for Windows: new dynamic interface libraries have been added to improve linking from C# or Java on Windows.

Function-level mode control in VML: users can now control or set the accuracy of each VML function individually through a new argument on each function.
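A minimal sketch of this per-call control is shown below. It assumes the vm-prefixed VML variants that take a trailing mode argument (for example vmdExp) and the VML_HA/VML_LA accuracy constants declared in mkl_vml.h; check the VML chapter of the Intel® MKL reference manual for the exact prototypes in your MKL version.

#include "mkl_vml.h"

void exp_two_accuracies(MKL_INT n, const double *a, double *high_acc, double *low_acc)
{
    /* Same math function, with the accuracy selected independently for each call. */
    vmdExp(n, a, high_acc, VML_HA);   /* high accuracy for this call only           */
    vmdExp(n, a, low_acc,  VML_LA);   /* lower accuracy, higher speed, for this one */
}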

Intel® MKL 10.3 directory structure changes


What's New in Intel® MKL 10.2

 

 

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


Webinar: Get Ready for Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors


Intel recently unveiled the new Intel® Xeon Phi™ product – a coprocessor based on the Intel® Many Integrated Core architecture. Intel® Math Kernel Library (Intel® MKL) 11.0 introduces high-performance and comprehensive math functionality support for the Intel® Xeon Phi™ coprocessor. You can download the audio recording of the webinar and the presentation slides from the links below.

  • Webinar video recording (Link)
  • Webinar presentation slides (Link)

More information can be found on our "Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors" central page. If you have questions, please ask them either on the public Intel MKL forum or through private, secure Intel® Premier Support.

Also, please visit this page for replays of the highly popular webinar series that introduces the other Intel software tools for the Intel® Xeon Phi™ coprocessor.

Questions and Answers from the webinar

  • Is anyone using the Intel Xeon Phi product? What kinds of applications do they run on it?
    Many users have successfully benefited from it. For example, seven supercomputers on the most recent Top 500 list already use Intel Xeon Phi coprocessors in combination with Intel Xeon processors. A lot of HPC applications, for example, those in the areas of new drug discovery, weather prediction, global financial analysis, oil exploration, Hollywood movie special effects, can make good use of all the power provided by Intel Xeon Phi.

  • Is Intel® Cluster Studio XE 2013 or Intel® Parallel Studio XE 2013 required in order to use Intel Xeon Phi coprocessors?
    Intel Cluster Studio XE 2013 and Intel Parallel Studio XE 2013 are bundle products that contain necessary tools for programming the coprocessor. For example, Intel compilers (FORTRAN or C/C++) are required to build code for native execution on the coprocessor. The pragmas and directives used to offload computations to the coprocessor are only supported by Intel compilers. Intel MKL provides highly optimized math functions for the coprocessor. Intel MPI (a component of Intel Cluster Studio XE) enables building code scalable to multiple coprocessors and hosts. These bundle products also provide tools for threading assistant, performance and thread profiling, memory and threading error detection, etc.


  • What if a system has multiple coprocessors? Does Intel MKL try to balance the tasks across them?
    In the case of automatic offload, MKL will try to make use of multiple coprocessors for a computation. Users can also pick which coprocessors to use. In the case of compiler-assisted offload, it is up to the user to specify which coprocessors to use and to orchestrate the work division among them.

  • Do the performance charts published online include cost of data transfer between host and coprocessors?
    The performance charts compare native execution performance on the coprocessor with host execution performance on the host processor. Hence, data transfer cost is not reflected.

  • Do the performance charts published online compare the dual-socket E5-2680 CPU performance against single coprocessor performance?
    Yes. The host CPU used to obtain the performance charts is an Intel Xeon E5-2680 CPU with 2 sockets and 8 cores per socket. The coprocessor is an Intel Xeon Phi SE10, with 61 cores. Each of the online performance charts has detailed configuration listed at the bottom.

  • What happens if multiple user processes or threads call Intel MKL functions with automatic offload?
    Currently, a process/thread doing automatic offload is not aware of other processes/threads that may also be offloading at the same time. In this scenario, all processes/threads will offload to a coprocessor. This leads to the risks of thread oversubscription and running out of memory on the coprocessor. It is possible, however, with careful memory management and thread affinity settings, to have multiple offloading processes/threads use different group of cores on the coprocessor at the same time.

  • Will more routines get automatic offload support in future?
    Automatic offload works well when there is enough computation in a function to offset the data transfer overhead. Currently, only GEMM, TRSM, TRMM and LU, QR, Cholesky are supported with this model. There might be other functions in Intel MKL that can be good candidates for automatic offload. We are investigating all opportunities. Please contact us via our support channels if you see more needs for automatic offload.

  • Can you show us in detail the configurations of running the LINPACK benchmark natively on the coprocessor?
    Intel optimized SMP LINPACK benchmark is included in Intel MKL 11.0 installation packages. Please find it in $MKLROOT/benchmarks/linpack. See the execution scripts in this location for the default configuration.

  • Is the memory for the arguments of an Intel MKL routine allocated on the coprocessor or on the host?
    Unless input data already exists on the coprocessor or output data is not needed on the host, MKL routine input arguments are allocated on the host and then copied to the coprocessor. Enough space needs to be allocated on the coprocessor to receive the data. Output arguments are copied back to the host. The offload pragmas offers a rich set of controls for data transfer and memory management on the coprocessor. In the case of MKL automatic offload, however, the MKL runtime system handles all these transparently.

  • If data population between the host and the coprocessor is transparent, you now have two copies of the data. What about data synching?
    In the case of Intel MKL automatic offload, data synching is taken care of transparently by the Intel MKL runtime. If a function call is offloaded using pragmas, then the user needs to rely on the facilities provided by the pragmas to synch/merge data. The Intel Xeon Phi coprocessor also supports a shared memory programming model called MYO (“mine”, “yours”, “ours”). Data synching between host processors and coprocessors is taken care of implicitly in this model.
    Refer to this article for more information.
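
    As a rough sketch of those pragma facilities (the function and buffer names are illustrative assumptions), the alloc_if/free_if modifiers let a buffer stay resident on the card between offloads instead of being re-allocated and re-copied each time:

    /* Illustrative persistence sketch with compiler-assisted offload. */
    void reuse_buffer_on_coprocessor(float *buf, int n)
    {
        int i;

        /* First offload: allocate on the card and copy data in, but keep the
           allocation alive on exit (free_if(0)). */
        #pragma offload target(mic:0) in(buf : length(n) alloc_if(1) free_if(0))
        {
            for (i = 0; i < n; i++)
                buf[i] += 1.0f;
        }

        /* Second offload: reuse the buffer already resident on the card
           (alloc_if(0), no host-to-card copy), copy the result back, then free it. */
        #pragma offload target(mic:0) out(buf : length(n) alloc_if(0) free_if(1))
        {
            for (i = 0; i < n; i++)
                buf[i] *= 2.0f;
        }
    }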
  • If I have two automatic offload function calls, and a non-automatic offload function call in between them, suppose these functions reuse data, will the data persist on the coprocessor to be reused?
    Data persistence on coprocessor is currently not supported for function calls using Intel MKL automatic offload. The input data is always copied from host to coprocessor in the beginning of an automatic offload execution and output data is always copied back at the end.

  • Can PARDISO and other sparse solvers make use of the coprocessor? How does the performance compare with, say, running on an 8-core Xeon processor?
    Yes. Intel MKL sparse solvers, including PARDISO, can make use of the coprocessor. However, our optimization effort has so far been focused on dense matrices (BLAS and LAPACK), and sparse solvers at present are not optimized to the same extent. Performance of sparse solvers, on the processor or on the coprocessor, largely depends on the properties of the sparse systems. It is hard to make a performance comparison without a particular sparse system as context.

  • Is Intel® Integrated Performance Primitives (Intel® IPP) supported on Intel Xeon Phi product?
    Support for Intel IPP is still to be determined. If you have a request for supporting Intel IPP on the Intel Xeon Phi coprocessor, please follow the regular Intel IPP support channel to submit a formal request.

  • There are a lot of pragmas to set. Are there any preprocessors to scan one's FORTRAN code for LAPACK calls and automatically insert all the appropriate pragmas?
    There is no such tool to automatically scan your code and insert pragmas. But if you use MKL automatic offload (when applicable), you can take advantage of computation offloading without using pragmas.

  • The offload pragmas from Intel compilers are very different from OpenACC. Can users use either one for the Intel Xeon Phi coprocessor?
    Intel compilers do not have plans to support OpenACC.

  • What is the difference, if any, between using the Intel specific compiler directives to offload to the coprocessor and using the newly proposed OpenMP coprocessor/accelerator directives? Am I correct that these new OpenMP directives will be added to the Intel compilers next year?
    Intel compiler offload directives offer a much richer set of features than OpenMP offload directives. Intel Compiler 13.0 update 2 (both FORTRAN and C/C++) will add support for OpenMP offload directives.

  • Does GCC support Intel Xeon Phi?
    Please see this page for information on third-party tools available with support for Intel Xeon Phi coprocessor.
    Our changes to the GCC tool chain, available as of June 2012, allow it to build the coprocessor’s Linux environment, including our drivers, for the Intel Xeon Phi coprocessor. The changes do not include support for vector instructions and related optimization improvements. GCC for Intel Xeon Phi is really only for building the kernel and related tools; it is not for building applications. Using GCC to build an application for the Intel Xeon Phi coprocessor will most often result in low-performance code due to its current inability to vectorize for the new Knights Corner vector instructions. Future changes to give full usage of Knights Corner vector instructions would require work on the GCC vectorizer to utilize those instructions’ masking capabilities.

  • Is debugging supported on the coprocessor?
    Yes. Debugging is supported. At this time, the Intel debugger, GDB, TotalView, and Allinea DDT are the debuggers available with support for the Intel Xeon Phi coprocessor. See this page for more information.

  • Is the 8GB memory shared by all cores on the coprocessor? Are there memory hierarchies on the Intel Xeon Phi coprocessor?
    Yes. All cores on a coprocessor share 8GB of memory. The memory hierarchy includes the shared 8GB memory and, for each core, a 32KB L1 instruction cache, a 32KB L1 data cache, and a 512KB unified L2 cache. The caches are fully coherent and implement the x86 memory order model. See here for a description of the Intel Many Integrated Core architecture.

  • How lightweight are threads on the coprocessor? Is context switching expensive?
    Context switching is more expensive on Intel Xeon Phi coprocessors than on Intel Xeon processors. This is because the coprocessor has more vector register state to save and restore, and a coprocessor core is typically slower than a processor core.

  • What MPI implementations are supported?
    At present, Intel MPI and MPICH2 are the two implementations that support Intel Xeon Phi coprocessors.

  • Can I put an MPI rank on the host processor and another MPI rank on the coprocessor to have a 2-node MPI environment?
    Yes. This usage model has been supported since Intel MPI 4.1. Please refer to the Intel MPI product page for more information on Intel MPI support for Intel Xeon Phi coprocessors.

  • Can you explain the motherboard requirements for Intel Xeon Phi coprocessors, e.g. power, BIOS, PCI bandwidth?
    Please contact your OEMs for information on system configurations for Intel Xeon Phi coprocessors. Find a list of OEMs that support the coprocessor on this page.
  • What is the estimated price of Intel Xeon Phi coprocessor?
    Please contact your OEMs or your Intel field representatives to get estimated pricing of Intel Xeon Phi coprocessor.

  • Where can I buy Intel software tools that support the Intel Xeon Phi coprocessor?
    Please contact your local Intel® Software Development Products Resellers for more details.

VecAnalysis Python* Script for Annotating Intel C++ & Fortran Compilers Vectorization Reports


 

This is the Python* script used to annotate Intel® C++ and Fortran compiler 13.1 (Intel® C++/Fortran/Visual Fortran Composer XE 2013 Update 2 and later) vectorization reports produced at -vec-report7.  The attached zip file contains:

  • vecanalysis.py 
  • vecmessages.py
  • README-vecanalysis.txt

NOTE: You will need Python* version 2.6.5 or higher. For more information and download instructions, please click here.

The new -vec-report7 (for Linux*) (/Qvec-report7 for Windows*) compiler option available in Intel® C++ and Fortran compilers version 13.1 allows the compiler to emit vector code quality messages, the corresponding message IDs, and data values for vectorized loops.  The messages provide information such as the expected speedup, memory access patterns, and the number of vector idioms for vectorized loops.  Below is a sample of the type of messages the compiler will emit at -vec-report7:

  • loop was vectorized (with peel / with remainder)
  • unmasked aligned unit stride loads: 4
  • unmasked aligned unit stride stores: 2
  • saturating add/subtract: 3
  • estimated potential speedup: 6.270000

The attached Python* script takes the message IDs produced by the compiler as input and produces a .txt file that includes the original source code annotated with -vec-report7 messages.  The information gives more insight into the generated vector code quality without the need to analyze the assembly code. The naming convention for the output file is (filename_extension_vr.txt).  For example the output file corresponding to satSub.c would be satSub_c_vr.txt. The compiler does not invoke the Python script automatically.  The user needs to apply the Python script manually to the output file produced by the compiler as shown below.  The below command assumes the vecanalysis Python script files are located in the "vecanalysis" directory:

Example: icc -c -vec-report7 satSub.c 2>&1 | ./vecanalysis/vecanalysis.py --list

For more information please see the README-vecanalysis.txt file provided.

$ python
Python 2.6.5 (r265:79063, Jul  5 2010, 11:46:13)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.

$ icc -c -vec-report7 satSub.c 2>&1 | ./vecanalysis/vecanalysis.py --list
satSub.c(9): (col. 3) remark: SIMD LOOP WAS VECTORIZED.
satSub.c(9): (col. 3) remark: VEC#00001WPWR 1.
satSub.c(9): (col. 3) remark: VEC#00052 1.
satSub.c(9): (col. 3) remark: VEC#00101UASL 4.
satSub.c(9): (col. 3) remark: VEC#00101UASS 2.
satSub.c(9): (col. 3) remark: VEC#00101UUSL 2.
satSub.c(9): (col. 3) remark: VEC#00101UUSS 1.
satSub.c(9): (col. 3) remark: VEC#00201 5.
satSub.c(9): (col. 3) remark: VEC#00202 0.310000.
satSub.c(9): (col. 3) remark: VEC#00203 6.270000.
satSub.c(9): (col. 3) remark: VEC#00204 15.
satSub.c(9): (col. 3) remark: VEC#00405 3.
Writing satSub_c_vr.txt ... done
Statistics for all files

// Below is the vectorization summary for satSub.c
                                                                   Source Locations
Message                                                                 Count     %

// This line says there were 3 saturating add/subtract operations.
// 100% means the message refers to a single location/loop in the program.
// (Count = 1) means there is one instance of this message for the loops in the program.
saturating add/subtract: 3.                                            1 100.0%    
unmasked unaligned unit stride loads: 2.                     1 100.0%    
loop was vectorized (with peel/with remainder)            1 100.0%     
unmasked aligned unit stride stores: 2.                        1 100.0%     

// 100% of all loops (in this case a single loop) in the program were vectorized
// If there were 10 loops out of which 6 got vectorized, the % would be 60%

SIMD LOOP WAS VECTORIZED.                               1 100.0%    
unmasked aligned unit stride loads: 4.                         1 100.0%
scalar loop cost: 5.                                                       1 100.0%
lightweight vector operations: 15.                                 1 100.0%
vector loop cost: 0.310000.                                           1 100.0%
loop inside vectorized loop at nesting level: 1.              1 100.0%
unmasked unaligned unit stride stores: 1.                     1 100.0%
estimated potential speedup: 6.270000.                        1 100.0%
Total Source Locations:                                                 1

$ more satSub_c_vr.txt
VECRPT satSub.c
VECRPT                                                                    Source Locations
VECRPT Message                                                                 Count     %
VECRPT saturating add/subtract: 3.                                             1 100.0%
VECRPT unmasked unaligned unit stride loads: 2.                      1 100.0%
VECRPT loop was vectorized (with peel/with remainder)             1 100.0%
VECRPT unmasked aligned unit stride stores: 2.                         1 100.0%
VECRPT scalar loop cost: 5.                                                         1 100.0%
VECRPT unmasked aligned unit stride loads: 4.                           1 100.0%
VECRPT SIMD LOOP WAS VECTORIZED.                                 1 100.0%
VECRPT lightweight vector operations: 15.                                   1 100.0%
VECRPT vector loop cost: 0.310000.                                            1 100.0%
VECRPT loop inside vectorized loop at nesting level: 1.               1 100.0%
VECRPT unmasked unaligned unit stride stores: 1.                      1 100.0%
VECRPT estimated potential speedup: 6.270000.                         1 100.0%
VECRPT Total Source Locations:                                                               1

   1: #define SAT_U8(x) ((x) < 0 ? 0 : (x))
   2: void satsub(
   3:   unsigned char *a,
   4:   unsigned char *b,
   5:   int n
   6: ){
   7:   int i;
   8: #pragma simd
VECRPT (col. 3) SIMD LOOP WAS VECTORIZED.
VECRPT (col. 3) estimated potential speedup: 6.270000.
VECRPT (col. 3) lightweight vector operations: 15.
VECRPT (col. 3) loop inside vectorized loop at nesting level: 1.
VECRPT (col. 3) loop was vectorized (with peel/with remainder)
VECRPT (col. 3) saturating add/subtract: 3.
VECRPT (col. 3) scalar loop cost: 5.
VECRPT (col. 3) unmasked aligned unit stride loads: 4.
VECRPT (col. 3) unmasked aligned unit stride stores: 2.
VECRPT (col. 3) unmasked unaligned unit stride loads: 2.
VECRPT (col. 3) unmasked unaligned unit stride stores: 1.
VECRPT (col. 3) vector loop cost: 0.310000.
   9:   for (i=0; i<n; i++){
  10:     a[i] = SAT_U8(a[i] - b[i]);
  11:   }
  12: }
$

Samples for Intel® C++ Composer XE


Intel® C++ compiler is an industry-leading C/C++ compiler, including optimization features like auto-vectorization and auto-parallelization, OpenMP*, and Intel® Cilk™ Plus multithreading capabilities; plus the highly optimized performance libraries.

We have created a list of articles with samples explaining the features in detail and how or when to use them in source code. They are:

By installing or copying all or any part of the sample source code, you agree to the terms of the Intel(R) Sample Source Code License Agreement.

Auto-vectorization articles and samples
Article Name | Description | Download
A Guide to Auto-vectorization with Intel® C++ Compilers | This article provides guidelines for enabling Intel C++ compiler auto-vectorization using the sample source code; it targets the Intel® processors or compatible non-Intel processors that support SIMD instructions such as Intel® Streaming SIMD Extensions (Intel® SSE). | Source Code in C/C++

Intel® Cilk™ Plus

Please visit Intel® C++ Compiler Code Samples page.

Building Open Source Applications using Intel C++ Compiler
Article Name | Description | Platforms

Building POV-Ray* with Intel C++ Compiler on Windows | Provides detailed instructions on building POV-Ray* using the Intel® C++ Compiler for Windows. | Windows
Version information:
  • Povray* beta version 3.7
  • Intel(R) C++ for Windows: 11.0

Building Boost C++ Libraries with Intel® C++ Compiler on Windows XP | Boost is a set of libraries for the C++ language; visit www.boost.org for more information. Provides detailed instructions on how to build the Boost* library with the Intel C++ Compiler on Windows. | Windows
Version information:
  • Boost: v1.39.0
  • Intel C++ Compiler for Windows: 11.1

Building Open MPI* with the Intel compilers | Helps Intel® compiler customers build and use the Open MPI* library with the Intel C++ and Fortran Compilers for Linux and OS X. | Linux*, OS X*
Version information:
  • Open MPI: 1.2
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.0

Building UPC* to utilize the Intel C++ Compiler | The Berkeley* Unified Parallel C* (UPC) is a programming language that adds parallelization extensions to the C language. Explains how to build the UPC* compiler with the Intel C++ Compiler and configure it for use with symmetric multiprocessing (SMP) machines. | Linux
Version information:
  • UPC: version 2.4.0
  • Intel(R) C++ Compiler for Linux*: 10.0

Building Quantlib with Intel C++ Compiler | QuantLib is a free/open-source library for modeling, trading, and risk management in real life, written in C++. Explains how to configure and build the Quantlib* library (http://quantlib.org/) and an example provided with Quantlib. | Linux
Version information:
  • Quantlib: Quantlib-0.3.13.tar.gz
  • Boost*: boost_1_33_1
  • Intel(R) C++ Compiler for Linux*: 10.0

Building Xerces with Intel C++ Compiler | Describes how to build Xerces-C++ with the Intel® C++ Compiler for Linux*. | Linux
Version information:
  • Xerces: 2.7.0
  • Intel(R) C++ Compiler for Linux*: 10.0

Building FFTW* with the Intel Compilers | The FFTW library is used for high-performance computation of the Discrete Fourier Transform (DFT). Describes how to build the FFTW* library on Linux* using the Intel C++ Compiler for Linux. | Linux
Version information:
  • FFTW* library v3.1.2
  • Intel(R) C++ Compiler for Linux*: 10.0

Building PGPLOT* with the Intel compilers | PGPLOT is a library for creating two-dimensional plots and graphs. Provides instructions on how to build the PGPLOT* graphics library using the Intel C++ and Fortran Compilers for Linux. | Linux
Version information:
  • PGPLOT* graphics library v5.2.2
  • Intel(R) C++ and Fortran Compilers for Linux*: 10.x

Building WRF v2.x with the Intel compilers | The Weather Research and Forecasting (WRF) Model (http://wrf-model.org/index.php) is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. Helps users of WRF make use of the Intel C++ and Fortran compilers. | Linux
Version information:
  • WRF: version 2.2 and 2.2.1
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.x, 11.x

Building WRF v3.1.1 with the Intel compilers | Helps users of WRF v3.1.1 make use of the Intel C++ and Fortran compilers. | Linux
Version information:
  • WRF: version 3.1.1
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 11.1

Building the HPCC* benchmark with Intel C++ and Fortran Compilers | The HPC Challenge (HPCC) benchmark is used to evaluate and test a wide variety of performance parameters for high-performance computing systems. Provides instructions on how to build the HPCC* benchmark. | Linux, OS X
Version information:
  • HPCC: 1.0.0
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.0
  • Intel(R) Math Kernel Library: 9.1

Building HDF5* with Intel® compilers | Provides instructions on how to build and use the HDF5 library with the Intel C++ and Fortran Compilers on Linux* or OS X*. HDF5 (http://www.hdfgroup.org/HDF5/) is the latest generation of the HDF libraries, a general-purpose library and associated file formats for storing and sharing scientific data. | Linux, OS X
Version information:
  • HDF5 1.8.9
  • Intel C++ and Fortran Compiler for Linux* or Mac OS* X: 13.0 Update 1

Linux* ABI


by Milind Girkar, Hongjiu Lu, David Kreitzer, and Vyacheslav Zakharin (Intel)

Description of the Intel® AVX, Intel® AVX2, Intel® AVX-512 and Intel® MPX extensions required for the Intel® 64 architecture application binary interface.

How to detect New Instruction support in the 4th generation Intel® Core™ processor family


Downloads


How to detect New Instruction support in the 4th generation Intel® Core™ processor family [PDF 342.3KB]

The 4th generation Intel® Core™ processor family (codenamed Haswell) introduces support for many new instructions that are specifically designed to provide better performance to a broad range of applications such as: media, gaming, data processing, hashing, cryptography, etc.  The new instructions can be divided into the following categories:

  • Intel® Advanced Vector Extensions 2 (Intel® AVX2)
  • Fused Multiply Add (FMA)
  • Bit Manipulation New Instructions (BMI)
  • MOVBE instruction (previously supported by the Intel® Atom™ processor)
  • Intel® Transactional Synchronization Extensions (Intel® TSX) (available in some models)

The details of these instructions can be found in Intel® 64 and IA-32 Architectures Software Developer Manuals and Intel® Advanced Vector Extensions Programming Reference manual.

In order to correctly use the new instructions and avoid runtime crashes, applications must properly detect hardware support for the new instructions using CPUID checks. It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.

Software developers can take advantage of the new instructions via writing assembly code, using intrinsic functions, or relying on compiler automatic code generation. In the latter case, it is crucial to understand what instructions the compiler(s) can generate with given switches and implement proper CPUID feature checks accordingly.

Generally, compilers and libraries generating code for 4th generation Intel Core processors are expected and allowed to use all the instructions listed above, with the exception of Intel TSX. Below is the complete list of CPUID flags that generally must be checked:

CPUID.(EAX=01H, ECX=0H):ECX.FMA[bit 12]==1 &&
CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1 &&
CPUID.(EAX=07H, ECX=0H):EBX.BMI1[bit 3]==1 &&
CPUID.(EAX=07H, ECX=0H):EBX.BMI2[bit 8]==1 &&
CPUID.(EAX=80000001H):ECX.LZCNT[bit 5]==1 &&
CPUID.(EAX=01H, ECX=0H):ECX.MOVBE[bit 22]==1

Note: Applications using instructions from the RTM subset of Intel TSX extension need to guard the code by checking the CPUID.(EAX=07H, ECX=0H).EBX.RTM[bit 11]==1. Applications can also, but are not required to, check CPUID.(EAX=07H, ECX=0H).EBX.HLE[bit 4]==1 for HLE, because legacy processors ignore HLE hints.

For example, Intel® Composer XE 2013 can automatically generate all the new instructions guarded by the CPUID features in the above list, using the -QaxCORE-AVX2 and -QxCORE-AVX2 switches on Microsoft Windows* (on Linux*: -axCORE-AVX2 and -xCORE-AVX2). The compiler switch -[Q]axCORE-AVX2 generates an automatic CPUID check and dispatch to the code using new instructions, while the -[Q]xCORE-AVX2 switch assumes the new instructions are supported and thus requires a manual implementation of the CPUID check for all the features in the list above. The Microsoft Visual C++* 2012 compiler supports these new instructions via intrinsics as well as the 32-bit inline assembler, while the GCC compiler supports both auto-generation and intrinsics with the -march=core-avx2 switch starting with version 4.7, thus requiring a check of the complete list of CPUID features above whenever such code is called.

Additionally, libraries such as Intel® Integrated Performance Primitives (Intel® IPP), beginning with version 7.1, may also use these new instructions. In the case of Intel IPP, two types of interfaces are available: an automatically dispatched interface, which is the default, and a CPU-specific interface available via prefixes like ‘h9_’ (32-bit) or ‘l9_’ (64-bit). In the case of functions optimized for the 4th generation Intel Core processor family, applications must check for the support of all the features in the list above before calling these functions.

And finally, new instructions using VEX prefixes and operating on vector YMM/XMM registers continue to require checking for OS support of the YMM state before use, the same check as for Intel AVX instructions.

Below is a code example you can use to detect the support of new instructions:

#if defined(__INTEL_COMPILER) && (__INTEL_COMPILER >= 1300)

#include <immintrin.h>

int check_4th_gen_intel_core_features()
{
    const int the_4th_gen_features = 
        (_FEATURE_AVX2 | _FEATURE_FMA | _FEATURE_BMI | _FEATURE_LZCNT | _FEATURE_MOVBE);
    return _may_i_use_cpu_feature( the_4th_gen_features );
}

#else /* non-Intel compiler */

#include <stdint.h>
#if defined(_MSC_VER)
# include <intrin.h>
#endif

void run_cpuid(uint32_t eax, uint32_t ecx, uint32_t* abcd)
{
#if defined(_MSC_VER)
    __cpuidex(abcd, eax, ecx);
#else
    uint32_t ebx, edx;
# if defined( __i386__ ) && defined ( __PIC__ )
     /* in case of PIC under 32-bit EBX cannot be clobbered */
    __asm__ ( "movl %%ebx, %%edi \n\t cpuid \n\t xchgl %%ebx, %%edi" : "=D" (ebx),
# else
    __asm__ ( "cpuid" : "+b" (ebx),
# endif
              "+a" (eax), "+c" (ecx), "=d" (edx) );
    abcd[0] = eax; abcd[1] = ebx; abcd[2] = ecx; abcd[3] = edx;
#endif
}     

int check_xcr0_ymm() 
{
    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);  /* min VS2010 SP1 compiler is required */
#else
    __asm__ ("xgetbv" : "=a" (xcr0) : "c" (0) : "%edx" );
#endif
    return ((xcr0 & 6) == 6); /* checking if xmm and ymm state are enabled in XCR0 */
}


int check_4th_gen_intel_core_features()
{
    uint32_t abcd[4];
    uint32_t fma_movbe_osxsave_mask = ((1 << 12) | (1 << 22) | (1 << 27));
    uint32_t avx2_bmi12_mask = (1 << 5) | (1 << 3) | (1 << 8);

    /* CPUID.(EAX=01H, ECX=0H):ECX.FMA[bit 12]==1   && 
       CPUID.(EAX=01H, ECX=0H):ECX.MOVBE[bit 22]==1 && 
       CPUID.(EAX=01H, ECX=0H):ECX.OSXSAVE[bit 27]==1 */
    run_cpuid( 1, 0, abcd );
    if ( (abcd[2] & fma_movbe_osxsave_mask) != fma_movbe_osxsave_mask ) 
        return 0;

    if ( ! check_xcr0_ymm() )
        return 0;

    /*  CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1  &&
        CPUID.(EAX=07H, ECX=0H):EBX.BMI1[bit 3]==1  &&
        CPUID.(EAX=07H, ECX=0H):EBX.BMI2[bit 8]==1  */
    run_cpuid( 7, 0, abcd );
    if ( (abcd[1] & avx2_bmi12_mask) != avx2_bmi12_mask ) 
        return 0;

    /* CPUID.(EAX=80000001H):ECX.LZCNT[bit 5]==1 */
    run_cpuid( 0x80000001, 0, abcd );
    if ( (abcd[2] & (1 << 5)) == 0)
        return 0;

    return 1;
}

#endif /* non-Intel compiler */


static int can_use_intel_core_4th_gen_features()
{
    static int the_4th_gen_features_available = -1;
    /* test is performed once */
    if (the_4th_gen_features_available < 0 )
        the_4th_gen_features_available = check_4th_gen_intel_core_features();

    return the_4th_gen_features_available;
}

#include <stdio.h>

int main(int argc, char** argv)
{
    if ( can_use_intel_core_4th_gen_features() )
        printf("This CPU supports ISA extensions introduced in Haswell\n");
    else
        printf("This CPU does not support all ISA extensions introduced in Haswell\n");

    return 1;
}

 
 

Intel, the Intel logo, Atom, and Core are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

Webinar -"Intel® System Studio: Embedded application development and debugging tools"


Abstract

The Intel® System Studio is a flexible, complete software development studio that allows you to optimize Intel® Architecture based intelligent embedded systems and devices. It combines Eclipse* CDT integrated optimizing compiler solutions and signal and media processing libraries, whole-platform power and performance tuning capabilities, in-depth memory and thread checking, instruction trace and data race detection enabled application debug, and the deep insight of a JTAG-based system software debug solution.

This session gives an overview of Intel® System Studio and introduces the key features, covering the following topics:

  • Debugging with the GDB* debugger with enhanced instruction trace and data race detection support, and the Intel® JTAG Debugger with deep insight into processor architecture, flashing, and source-level debug from EFI* to OS kernel and driver development
  • Building with the Intel® C++ Compiler in a cross-build environment to extract the best performance for your embedded target
  • Tuning with Intel® VTune™ Amplifier to find performance hotspots, identify architectural performance bottlenecks, and analyze system power and frequency
  • Verifying with Intel® Inspector to find memory and threading issues in your embedded applications

Presenters

Naveen Gv & Sukruth H V

The following is a selected list of questions and answers from the webinar "Intel® System Studio: Embedded application development and debugging tools", we thought these may be useful to other developers as reference.

Q: We have XDP Debug interface on our board. Is Intel® JTAG Debugger interface different?

A: The Intel® JTAG Debugger 3.0 provides Linux* hosted cross-debug solutions for software developers to debug the Linux* kernel sources and dynamically loaded drivers and kernel modules on Intel® Atom™ Processor based devices. It does so using the In-Target Probe eXtended Debug Port (ITP-XDP) on Intel® Atom™ Processor (N2xxx, D2xxx, E6xx, CE42xx, and CE 53xx) based platforms. 

For more information, refer to Intel® System Studio Installation guide and Release Notes Intel® JTAG Debugger 3.0 : http://software.intel.com/sites/default/files/article/365160/jtag-release-install.pdf

Q: Does Intel System Studio support Menlow and Baytrail platforms?

A: Intel® System Studio 2013 supports Intel® Atom™ Z5xx series processors (Menlow). The upcoming release of Intel® System Studio 2014 Beta will support the Intel® Atom™ processors E3xxx and Z3xxx code-named “Baytrail” from a Windows* host with the Intel® ITP-XDP3 device.

Q: Do you have an auto-configuration option for connecting the target device? We are using American Arium, which does not support auto configuration.

A: We do have script files for specific Intel® Atom™ based target boards which will set the environment variables and bring up the Intel® JTAG debugger GUI, provided you have connected either an Intel® ITP-XDP3 or Macraigor* usb2demon onto the target board.

Q: Our focus is primarily on board bring-up and BIOS/UEFI porting.

A: The Intel® JTAG Debugger included with Intel® System Studio supports BIOS/UEFI debugging. As UEFI code is usually compiled in Microsoft* COFF PDB format, our debugger relies on Microsoft Visual Studio* redistributables for symbol info resolution, and thus the Windows*-hosted version of the Intel® JTAG Debugger should be used.

Q: Is the mac probe available with the purchase of the System Studio?

A: Intel System Studio is just a software tools suite and NO hardware is shipped along with this product. To order the Intel® ITP-XDP3 device, please contact the Hibbert Group* at Intelvtg@hibbertgroup.com and request the VTG order form.

To order Macraigor* usb2Demon*, Go to http://www.macraigor.com/usbDemon.htm and select the Intel® Atom™ Processor target with the appropriate 24, 31 or 60 pin connector for your target device.

Q: What does SoC mean?

A: SoC - System on Chip, Please refer to http://www.intel.com/pressroom/kits/soc/ for more information.

Q: How do we collect data from the target device to the host?

A: Target board analysis can be done using Intel® VTune™ Amplifier; you have to set up an SSH (Secure Shell) connection, and the analysis data is automatically copied back onto the host machine. Please refer to the remote collection article for more details:

http://software.intel.com/en-us/articles/how-to-use-remote-collection-on-intel-vtune-amplifier-2013-for-systems

Q: Does the Intel JTAG debugger support all controllers?

A: The Intel® System Studio JTAG debugger supports only the Intel® Atom™ x86 architecture as of today. With the 2014 release it will also support Intel® Core™ processors code-named “Haswell” and newer.

Q: Does Intel System Studio support the GCC environment/GCC commands?

A: Yes, Intel® System Studio tools are compatible with GCC compiled binaries and the Intel® C++ Compiler accepts GCC command-line options.

Q: Can we develop bare-metal applications with this studio, i.e., when the target does not have any operating system?

A: The build tools included with Intel® System Studio are targeted towards a variety of embedded Linux* flavors like the Yocto Project* and Wind River Linux*. The Intel® C++ Compiler relies on the presence of the GNU binutils. It is not intended for bare-metal applications. You can, however, use the Intel® JTAG debugger to debug and analyze your bare-metal code.

Q: What about Ivy Bridge and Haswell support?

A: The build tools, including the Intel® C++ Compiler, already support the latest generation of Intel® Core™ processors. Optimizations will be further improved in future versions of Intel® System Studio. With Intel® System Studio 2014 we add support for the processor code-named “Haswell” in the analysis and system debug tools as well.

Q: Is the studio helpful in developing Linux drivers?

A: Yes, it can be used to build, optimize, and debug Linux drivers and kernel modules.

Q: What do you mean by an SSH connection?

A: http://en.wikipedia.org/wiki/Secure_Shell 

Q: What is the Yocto Project?

A: Yocto Project* is a Linux* foundation open source framework for embedded Linux* development. It is Open Embedded compatible and provides reference OS builds as well as the setup and build environment to build your own compatible custom embedded Linux* - https://www.yoctoproject.org/ 

Q: Is the Wind River* ICE JTAG probe supported?

A: No, as of now we are not supporting the Wind River* ICE JTAG probe.

Q: If we want to implement our own load balancing mechanism, how can we bypass the Cilk auto load balancing?

A: Load balancing is done by default by Cilk Plus (which is part of the Intel compiler). Since this is done by the Cilk runtime automatically, it is not advisable to change this behavior. However, you can get the source of this runtime under the GPL v3 license: https://www.cilkplus.org/which-license.
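
A minimal Cilk Plus sketch is shown below for reference: the work-stealing runtime balances the cilk_for iterations automatically, and the grainsize pragma only controls how many iterations form one stealable chunk; it does not replace the scheduler. The chunk size here is an arbitrary illustrative value.

/* Compile with the Intel compiler (icc), which includes the Cilk Plus runtime. */
#include <cilk/cilk.h>

void scale(float *a, int n)
{
    #pragma cilk grainsize = 256   /* illustrative chunk size */
    cilk_for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}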

Q: Does this support all versions of Linux, like Fedora and Ubuntu?

A: We have validated and listed some of the supported host and target OSs here: http://software.intel.com/en-us/articles/intel-system-studio-system-requirements. You may want to ensure that the Linux kernel version is 2.6.32 or above.

Q: Does the trial version have limited code compilation?

A: The feature set of the evaluation version is identical to the commercial version; the only limitation is that the license expires after 30 days.

Q: Does this studio support C language programming, or does it include another C compiler?

A: The Intel C/C++ Compiler, which is part of Intel® System Studio, supports the C and C++ languages.

Q: How can you analyze multi-core processing?

A: Intel® System Studio’s VTune™ Amplifier for Systems and Inspector for Systems can help find the CPU usage on a multi-core processor using concurrency analysis and threading analysis, respectively.

Q: Will OpenMP be supported by this compiler?

A: No, OpenMP is not supported as a language extension by the Intel® C++ Compiler for Embedded Linux* OS included with Intel® System Studio. Pre-existing OpenMP based binaries and shared objects that rely on OpenMP runtimes will, however, execute correctly. We are open to considering requests if more developers ask for OpenMP support in the future.

Q: We observed a situation of 100% CPU usage and wanted to analyze the root cause.

A: You can use Intel® VTune Amplifier for Systems.

Q: Does the studio software take care of cross compilation?

A: Yes, the Intel® C++ Compiler supports sysroot and chroot based cross-build setups. For cross compilation using sysroot we offer a "-platform" compiler option on the host that takes care of the cross-build integration for multiple target Linux* OSs. The provided cross-build integration can also be used as a template for other cross-build environments. Please refer to the detailed procedure on cross compilation here: http://software.intel.com/en-us/articles/using-intel-c-compiler-for-embedded-system

Using Intel® SDE's chip-check feature


Intel® SDE includes a software validation mechanism to restrict executed instructions to a particular microprocessor. This is intended to be a helpful diagnostic tool for use when deploying new software. Use chip check when you want to make sure that your program is not using instruction features that are not present on a specific microarchitecture implementation.

In the output of "sde -long-help" there is a section describing the controls for this feature:

-chip_check  [default ]
        Restrict to a specific XED chip.
-chip_check_die  [default 1]
        Die on errors. 0=warn, 1=die
-chip_check_disable  [default 0]
        Disable the chip checking mechanism.
-chip_check_emit_file  [default 0]
        Emit messages to a file. 0=no file, 1=file
-chip_check_file  [default sde-chip-check.txt]
        Output file chip-check errors.
-chip_check_jit  [default 0]
        Check during JIT'ing only. Checked code might not be executed due to
        speculative JIT'ing, but this mode is a little faster.
-chip_check_list  [default 0]
        List valid chip names and exit.
-chip_check_stderr  [default 1]
        Try to emit messages to stderr. 0=no stderr, 1=stderr
-chip_check_vsyscall  [default 0]
        Enable the chip checking checking in the vsyscall area.

To list all the chips that Intel SDE knows about, you can use "sde -chip-check-list". The output will vary depending on the version of Intel SDE you use. For the current version, you will see this output:

% kits/current/sde -chip-check-list -- /bin/ls
        INVALID             I86           I86FP            I186 
         I186FP        I286REAL            I286         I2186FP 
       I386REAL            I386          I386FP        I486REAL 
           I486     PENTIUMREAL         PENTIUM  PENTIUMMMXREAL 
     PENTIUMMMX         ALLREAL      PENTIUMPRO        PENTIUM2 
       PENTIUM3        PENTIUM4      P4PRESCOTT   P4PRESCOTT642 
   P4PRESCOTT2M           CORE2          PENRYN        PENRYN_E 
        NEHALEM        WESTMERE         BONNELL        SALTWELL 
     SILVERMONT             AMD             KNL       IVYBRIDGE 
    SANDYBRIDGE         SKYLAKE       BROADWELL         HASWELL 
       GOLDMONT             ALL 

To limit instructions to the Intel Westmere microarchitecture, use "sde -chip-check WESTMERE -- yourapp". If you do not want to limit instructions to a particular chip, use "-chip-check ALL". To limit the allowed instructions to just those implemented on the current Intel(R) Quark processors, use "-chip-check PENTIUM".

By default, Intel SDE emits warnings to a file called sde-chip-check.txt and also to stderr (if the application has not closed stderr). This behavior can be customized using the above knobs.

On Linux, there are instructions in the virtual system call area that are not under direct user control. To avoid flagging those instructions, the chip-check mechanism defaults to ignoring instructions in that region. If you want to check the instructions in the vsyscall area for some reason, use "-chip-check-vsyscall".

There is a performance cost for using the chip-check feature. At instrumentation (JIT) time, we must do an extra check on each instruction. At run time, every instruction that is not valid for the particular chip gets code inserted before it to trigger the error (or warning). There is also JIT-time code for finding the function symbols (if any) associated with the unwanted instructions.

Using the "-chip-check-jit" option, the JIT instrumentor can report disallowed instructions at JIT instrumentation time. This may be too aggressive as the JIT speculates and the JITted code may never execute due to the dynamic control flow in the program. It is more conservative though if you want to be sure there are no unwanted instructions.

Example

Here is a little example of the error message you get when your program does not have symbols:

% kits/current/sde -chip-check PENTIUM -- /bin/ls
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (PENTIUM): 0x2b3db3fdc447: cmovnbe rdx, rax
Instruction bytes are: 48 0f 47 d0 

If your program was compiled in debug mode or has function symbols, Intel SDE tries to provide additional information when it reports problems. This can be very useful for figuring out where the unwanted instructions are coming from.

kits/current/sde -chip-check IVYBRIDGE -- tests/a.out
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (IVYBRIDGE): 0x400623: vfmadd231sd xmm1, xmm2, xmm3

Function: main
File Name: /tmp/fma1.c:36
Instruction bytes are: c4 e2 e9 b9 cb 

In binaries without debug symbols, sde will still show the function name when it can be located.

Finding more errors

By default, Intel SDE issues the above error message and terminates when it encounters an unwanted instruction. Sometimes there is more than one unwanted instruction in a large program. By using the "-chip-check-die 0" option, Intel SDE will continue to execute after reporting the error.

If an unwanted instruction is not executed, it will not be flagged by Intel SDE. Intel SDE is built upon the Pin dynamic binary instrumentation system. Pin is a JIT for the application being run. As with all path-based checking mechanisms, you must exercise any code paths you want to be checked.


Intel® Xeon® Processor E5-2600 V2 Product Family Technical Overview


Download Article


Intel® Xeon® Processor E5-2600 V2 Product Family Technical Overview [PDF 780KB]

Contents

  1. Executive Summary
  2. Introduction
  3. Intel Xeon processor E5-2600 V2 product family enhancements
    1. Intel® Secure Key (DRNG)
    2. Intel® OS Guard (SMEP)
    3. Intel® Advanced Vector Extensions (Intel® AVX): Float 16 Format Conversion
    4. Advanced Programmable Interrupt Controller (APIC) Virtualization (APICv)
    5. PCI Express Enhancements
  4. Conclusion
  5. About the Author

1. Executive Summary


The Intel® Xeon® processor E5-2600 V2 product family, codenamed “Ivy Bridge EP”, is a 2-socket platform based on Intel’s most recent microarchitecture. Ivy Bridge is the 22-nanometer shrink of the Intel® Xeon® processor E5-2600 (codenamed “Sandy Bridge EP”) microarchitecture. This product brings additional capabilities for data centers: more cores and more memory bandwidth. As a result, platforms based on the Intel Xeon processor E5-2600 V2 product family will yield up to 50% improvement in performance1 compared to the previous generation “Sandy Bridge EP”.

2. Introduction


The Intel Xeon processor E5-2600 V2 product family is based on Ivy Bridge EP microarchitecture, an enhanced version of the Sandy Bridge EP microarchitecture (http://software.intel.com/en-us/articles/intel-xeon-processor-e5-26004600-product-family-technical-overview). The platform supporting the Intel Xeon processor E5-2600 V2 product family is named “Romley.” This paper discusses the new features available in the Intel Xeon processor E5-2600 V2 product family compared to the Intel Xeon processor E5-2600 product family. Each section includes information about what developers need to do to take advantage of new features for improving application performance and security.

3. Intel Xeon processor E5-2600 V2 product family enhancements


Some of the new features that come with the Intel Xeon processor E5-2600 V2 product family include:

  1. 22-nm process technology
  2. Security: Intel® Secure Key (DRNG)
  3. Security: Intel® OS Guard (SMEP)
  4. Intel® Advanced Vector Extensions (Intel® AVX): Float 16 Format Conversion
  5. Virtualization: APIC Virtualization (APICv)
  6. PCI Express* (PCIe): Support for atomic operation, x16 Non Transparent Bridge


Figure 1. The Intel® Xeon® processor E5-2600 V2 product family Microarchitecture

Figure 1 shows a block diagram of the Intel Xeon processor E5-2600 V2 product family microarchitecture. All processors in the family have up to 12 cores (compared to 8 cores in its predecessor), which bring additional computing power to the table. They also have 50% additional cache (30 MB) and more memory bandwidth. With the 22-nm process technology, the Intel Xeon processor E5-2600 V2 product family has less idle power and is capable of delivering 25% more performance2 while consuming less power compared to the earlier version.

Table 1 shows a comparison of the Intel Xeon processor E5-2600 V2 product family features compared to its predecessor, the Intel Xeon processor E5-2600.

Table 1. Comparison of the Intel® Xeon® processor E5–2600 product family to the Intel® Xeon® processor E5–2600 V2 product family

1 Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo

The rest of this paper discusses some of the main enhancements in this product family.

a. Intel® Secure Key (DRNG)

Intel Secure Key (Digital Random Number Generator: DRNG) is a hardware approach to high-quality and high-performance entropy and random number generation. The entropy source is thermal noise within the silicon.


Figure 2. Digital Random Number Generator using RDRAND instruction

Figure 2 shows a block diagram of the Digital Random Number Generator. The entropy source outputs a random stream of bits at the rate of 3 GHz that is sent to the conditioner for further processing. The conditioner takes pairs of 256-bit raw entropy samples generated by the entropy source and reduces them to a single 256-bit conditioned entropy sample. This is passed to a deterministic random bit generator (DRBG) that spreads the sample into a large set of random values, thus increasing the amount of random numbers available from the module. DRNG is compliant with ANSI X9.82, NIST, and SP800-90 and certifiable to FIPS-140-2.

Since DRNG is implemented in hardware as a part of the processor chip, both the entropy source and DRBG execute at processor clock speeds. There is no system I/O required to obtain entropy samples and no off-chip bus latencies to slow entropy transfer. DRNG is scalable enough to support heavy server application workloads and multiple VMs.

DRNG can be accessed through a new instruction named RDRAND. RDRAND takes the random value generated by DRNG and stores it in a 16-bit, 32-bit, or 64-bit destination register (the size of the destination register determines the size of the random value). RDRAND can be enumerated via CPUID.1.ECX[30] and is available at all privilege levels and operating modes. Performance of the RDRAND instruction is dependent on the bus infrastructure; it varies between processor generations and families.

Software developers can use the RDRAND instruction either through cryptographic libraries (OpenSSL* 1.0.1) or through direct application use (assembly functions). Intel® Compiler (starting with version 12.1), Microsoft Visual Studio* 2012, and GCC* 4.6 support the RDRAND instruction.
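
For illustration, here is a minimal sketch of that direct use (not the only way to do it): it checks the RDRAND CPUID bit and then draws one 32-bit value with the _rdrand32_step intrinsic. It assumes a GCC/Clang-style build (cpuid.h, gcc -mrdrnd); the Intel compiler supports the same intrinsic.

#include <stdio.h>
#include <immintrin.h>
#include <cpuid.h>

static int rdrand_supported(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx & (1u << 30)) != 0;      /* CPUID.1:ECX.RDRAND[bit 30] */
}

int main(void)
{
    unsigned int value;

    if (!rdrand_supported()) {
        printf("RDRAND is not supported on this CPU\n");
        return 1;
    }
    /* The instruction can transiently return no data; retry until it succeeds. */
    while (!_rdrand32_step(&value))
        ;
    printf("random value: 0x%08x\n", value);
    return 0;
}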

Microsoft Windows* 8 uses the DRNG as an entropy source to improve the quality of output from its cryptographically secure random number generator. Linux* distributions based on the 3.2 kernel use DRNG inside the kernel for random timings. Linux distributions based on the 3.3 kernel use it to improve the quality of random numbers coming from /dev/random and /dev/urandom, but not the quantity. That being said, Red Hat Fedora* Core 18 ships with the rngd daemon enabled by default, which will use DRNG to increase both the quality and quantity of random numbers in /dev/random and /dev/urandom.

For more details on DRNG and RDRAND instruction, refer to the Intel DRNG Software Implementation Guide.

b. Intel® OS Guard (SMEP)

Intel OS Guard (Supervisor Mode Execution Protection: SMEP) prevents execution out of untrusted application memory while operating at a more privileged level. By doing this, Intel OS Guard helps prevent Escalation of Privilege (EoP) security attacks. Intel OS Guard is available in both 32-bit and 64-bit operating modes and can be enumerated via CPUID.7.0.EBX[7].


Figure 3. Pictorial description of Intel® OS Guard operation

Support for Intel OS Guard needs to be in the operating system (OS) or Virtual Machine Monitor (VMM) you are using. Please contact your OS or VMM providers to determine which versions include this support. No changes are required in the BIOS or application level to use this feature.

c. Intel® Advanced Vector Extensions (Intel® AVX): Float 16 Format Conversion

The “Sandy Bridge” microarchitecture introduced Intel AVX, a new-256 bit instruction set extension to Intel® SSE designed for applications that are floating-point (FP) intensive. The “Ivy Bridge” microarchitecture enhances this with the addition of float 16 format conversion instructions.


Figure 4. Intel® Advanced Vector Extensions Instruction Format

The Intel Xeon processor E5-2600 V2 product family supports half-precision (16-bit) floating-point data types. Half-precision floating-point data types provide 2x more compact data representation than the single-precision (32-bit) floating-point data format, but sacrifice data range and accuracy. In particular, half-floats may provide better performance than 32-bit floats when the 32-bit float data does not fit into the L1 cache. This format is widely used in graphics and imaging applications to reduce dataset size and memory bandwidth consumption.

Because the half-precision floating-point format is a storage format, the only operation performed on half-floats is conversion to and from 32-bit floats. The Intel Xeon processor E5-2600 V2 product family introduces two half-float conversion instructions: vcvtps2ph for converting from 32-bit float to half-float (4x speedup compared to alternative Intel AVX code implementation), and vcvtph2ps for converting from half-float to 32-bit float (2.5x speedup compared to alternative Intel AVX implementation). A developer can utilize these instructions without writing assembly by using the corresponding intrinsics instructions: _mm256_cvtps_ph for converting from 32-bit float to half-float, and _mm256_cvtph_ps for converting from half-float to 32-bit float (_mm_cvtps_ph and _mm_cvtph_ps for 128-bit vectors).
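
As a small illustration of these intrinsics (a sketch, not taken from the product documentation), the following converts eight floats to the half-precision storage format and back; build for a CPU with the F16C conversion instructions, for example with gcc -mf16c or the Intel compiler targeting CORE-AVX-I:

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    float in[8]  = { 0.5f, 1.0f, 1.5f, 2.0f, 2.5f, 3.0f, 3.5f, 4.0f };
    unsigned short half[8];               /* packed 16-bit storage format */
    float out[8];
    int i;

    __m256  f32 = _mm256_loadu_ps(in);
    /* 32-bit floats -> 16-bit half floats (round to nearest) */
    __m128i f16 = _mm256_cvtps_ph(f32, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    _mm_storeu_si128((__m128i *)half, f16);

    /* 16-bit half floats -> 32-bit floats */
    __m256 back = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)half));
    _mm256_storeu_ps(out, back);

    for (i = 0; i < 8; i++)
        printf("%f -> %f\n", in[i], out[i]);
    return 0;
}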

The compilers that support these instructions include the Intel Compiler (starting with version 12.1), Visual Studio 2012, and GCC 4.6. To direct the Intel Compiler to produce the conversion instructions for execution on the Intel Xeon processor E5-2600 V2 product family (or later), a developer can either compile the entire application with the -xCORE-AVX-I flag (/QxCORE-AVX-I on Windows), or use the Intel®-specific optimization pragma with target_arch=CORE-AVX-I for the individual function(s).

For more details on half precision floating point instructions, refer to: http://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

d. Advanced Programmable Interrupt Controller (APIC) Virtualization (APICv)

A significant amount of performance overhead in machine virtualization is due to Virtual Machine (VM) exits. Every VM exit can cause a penalty of approximately 2,000 – 7,000 CPU cycles (see Figure 5), and a significant portion of these exits are for APIC and interrupt virtualization. Whenever a guest operating system tries to read an APIC register, the VM has to exit and the Virtual Machine Monitor (VMM) has to fetch and decode the instruction.

The Intel Xeon processor E5-2600 V2 product family introduces support for APIC virtualization (APICv); in this context, the guest OS can read most APIC registers without requiring VM exits. Hardware and microcode emulate (virtualize) the APIC controller, thus saving thousands of CPU cycles and improving VM performance.


Figure 5. APIC Virtualization

This feature must be enabled at the VMM layer: please contact your VMM supplier for their roadmap on APICv support. No application-level changes are required to take advantage of this feature.

e. PCI Express Enhancements

The Intel Xeon processor E5-2600 V2 product family supports PCIe atomic operations (as a completer). Today, message-based transactions are used for PCIe devices, and these use interrupts that can experience long latency, unlike CPU updates to main memory that use atomic transactions. An Atomic Operation (AtomicOp) is a single PCIe transaction that targets a location in memory space, reads the location’s value, potentially writes a new value back to the location, and returns the original value. This “read-modify-write” sequence to the location is performed atomically. This is a new operation added per PCIe Specification 3.0. FetchAdd, Swap, and CAS (Compare and Swap) are the new atomic transactions.

The benefits of atomic operations include:

  • Lower overhead for synchronization
  • Lock-free statistics (e.g. counter updates)
  • Performance enhancement for device drivers

The Intel Xeon processor E5-2600 V2 product family also supports an x16 non-transparent bridge. All these contribute to better I/O performance.

These PCIe features are inherently transparent and require no application changes.

For more details on these PCIe features, refer to:

4. Conclusion


In summary, the Intel Xeon processor E5-2600 V2 product family combined with the Romley platform provides many new and improved features that could significantly change your performance and power experience on enterprise platforms. Developers can make use of most of these new features without making any changes to their applications.

5. About the Author

Sree Syamalakumari is a software engineer in the Software & Service Group at Intel Corporation. Sree holds a Master's degree in Computer Engineering from Wright State University, Dayton, Ohio.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

1 Baseline Configuration and Score on SPECVirt_sc2013* benchmark: Platform with two Intel® Xeon® Processor E5-2690, 256GB memory, RHEL 6.4(KVM). Baseline source as of July 2013. Score: 624.9 @ 37 VMs. New Configuration: IBM System x3650 M4* platform with two Intel® Xeon® Processor E5-2697 v2, 512GB memory, RHEL 6.4(KVM). Source as of Sept. 2013. Score: 947.9 @ 57 VMs. For more information go to http://www.intel.com/performance.

2 Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

How to install Intel® System Studio on Windows* OS


Topic: How to install Intel® System Studio 2014 Beta on Windows* OS

Objective: This article explains, step by step, how to install Intel® System Studio 2014 Beta on Windows* OS.

Installation: After downloading Intel® System Studio 2014 Beta – Windows* Host from the Intel® Registration Center (https://registrationcenter.intel.com), follow the steps below:

Step 1: Double-click the executable file that you downloaded. You should see a “User Account Control” window asking for your permission; press “Yes”.

Step 2: You should now see the installer launch, as shown below:

Step 3: You will see a “Welcome” page describing the products that will be installed; click “Next”.

Step 4: You should now see a licensing page with two options:

  1. “I have a serial number and want to activate and install my product”: Use this option if you have a serial number and an internet connection on the system on which you are installing.
  2. “Choose alternate activation”: Use this option if you want to activate the product using any of these three options:
    1. Remote activation
    2. Using a license file
    3. Using a license manager

Step 5: You will now see an “Options” window that allows you to customize and select the components to install. If you want all the listed components installed, click “Install”; otherwise click “Customize” and continue with Step 6.

Step 6: Customize your installation. In this step you can remove components that you do not want installed; you can add removed components later by “Modifying” the installation.

As shown above, you can right-click a component and select “Do not install” if you do not want to install that component. After customizing the components, click “Next”.

Step 7: Click “Install”; it might take several minutes to install the whole product.

Step 8: Once the installation completes, click the “Finish” button.

Video : Installation of Intel® System Studio on Windows* Host


This video is focused on “How to install Intel® System Studio 2014 Beta on Windows* Host”.

Intel System Studio Installation - Windows Host.wmv

Size: 7.34 MB

How do I get the Intel System Studio 2014 Beta – Windows* Host package?

Upon registering for the program you will receive a serial number and an email with a license file. You will need one of these two to complete the installation process. If you want to use the license file, you can point to it during the install, or you can copy it to C:\Program Files (x86)\Common Files\Intel\Licenses\ for automatic pickup by the installer.

Execute one of the installer executables:

w_cembd_2014.0.xxx.exe or w_cembd_2014.0.xxx_online.exe

The latter is an online installer that reduces the initial package download size.

Prerequisites for Eclipse* IDE Integration

The Intel® C++ Compiler and SVEN SDK can be automatically integrated into a preexisting Eclipse* CDT installation. The Eclipse* CDK, Eclipse* JRE, and the Eclipse* CDT integrated development environment are not shipped with this Intel® System Studio package. The Eclipse* integration is automatically offered as one of the last steps of the installation process. If you decided against integration during an earlier install, simply rerun the Intel® System Studio installer.

When asked, point the installer to the installation directory of your Eclipse* install; usually this is C:\Program Files (x86)\eclipse\.

The prerequisites for successful Eclipse integration are:

1. Eclipse* 3.7 (Indigo) – Eclipse* 4.3 (Kepler)

2. Eclipse* CDT 8.0 – 8.1

3. Java Runtime Environment (JRE) version 6.0 (also called 1.6) update 11 or later.

Intel® Integrated Performance Primitives (Intel® IPP) Functions Optimized for Intel® Advanced Vector Extensions (Intel® AVX)

  • The tables below list the Intel® AVX support provided in the Intel® IPP 7.0.2 library release.
  • The 32-bit and 64-bit versions of the 7.0 library provide optimized Intel® AVX code.
  • The 6.1 library has very limited Intel® AVX support; if you intend to use Intel® IPP on an Intel® AVX platform, you should upgrade to the Intel® IPP 7.0 library.

Intel® AVX (Intel® Advanced Vector Extensions) is a 256-bit extension of the SSE instruction sets that delivers higher performance for floating-point-intensive applications. Intel® AVX adds new functionality to the existing Intel® SIMD (SSE-based) instruction set and uses a more compact SIMD encoding format. A large number (more than 200) of the Intel SSEx instructions have been upgraded in AVX to take advantage of features such as distinct destination operands and flexible memory alignment. About 100 of the original 128-bit Intel SSEx instructions can now operate on 256-bit vector data. In addition, roughly 100 brand-new data-processing and arithmetic operations have been added that are not part of the legacy Intel SSEx SIMD instruction sets.

The main benefits of Intel® AVX are (a brief intrinsics sketch follows this list):

  • Support for wider vector data (up to 256 bits).
  • An efficient instruction encoding scheme that supports three- and four-operand instruction syntax.
  • A flexible programming environment, from branch handling to relaxed memory-alignment requirements.
  • New data-manipulation and arithmetic-computation primitives, including broadcast, shuffle, and fused multiply-add.
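
To make these features concrete, here is a minimal, illustrative C sketch (not part of the original article) that exercises the 256-bit registers and the broadcast primitive through compiler intrinsics. The function and variable names are hypothetical, and the code assumes a compiler invoked with AVX code generation enabled (for example, the Intel compiler's /QxAVX or -xAVX switch):

#include <immintrin.h>

/* Adds a scalar bias to a float array, 8 elements (256 bits) at a time. */
void add_bias_avx(const float* src, float* dst, int len, float bias)
{
    __m256 vbias = _mm256_set1_ps(bias);          /* broadcast one scalar to all 8 lanes */
    int i;
    for (i = 0; i + 8 <= len; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);      /* unaligned 256-bit load */
        _mm256_storeu_ps(dst + i, _mm256_add_ps(v, vbias));
    }
    for (; i < len; ++i)                          /* scalar remainder */
        dst[i] = src[i] + bias;
}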



ippGetCpuFeatures() reports information about the SIMD features applicable to your processor. In addition, ippGetCpuType() detects the processor type in your system; a return value of ippCpuAVX indicates that your processor supports the Intel® AVX instruction set. These functions are declared in ippcore.h.

Mask the value returned by ippGetCpuFeatures() with ippCPUID_AVX (0x0100) to determine whether the Intel® AVX SIMD instruction set is supported by your processor (ippGetCpuFeatures() & ippCPUID_AVX is TRUE). To determine whether your operating system also supports the Intel® AVX instructions (that is, saves the extended SIMD registers), mask the value returned by ippGetCpuFeatures() with ippAVX_ENABLEDBYOS (0x0200). Both conditions (CPU and OS support) must be met before your application uses Intel® AVX SIMD instructions.
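
A minimal sketch of this check follows. It assumes the IPP 7.x prototype IppStatus ippGetCpuFeatures(Ipp64u*, Ipp32u*) from ippcore.h and that NULL is acceptable for the second (CPUID registers) argument; verify the exact signature in your IPP version.

#include <stdio.h>
#include "ippcore.h"

int main(void)
{
    Ipp64u features = 0;
    if (ippGetCpuFeatures(&features, NULL) == ippStsNoErr) {
        int cpuHasAvx = (features & ippCPUID_AVX)       != 0;  /* 0x0100 */
        int osEnabled = (features & ippAVX_ENABLEDBYOS) != 0;  /* 0x0200 */
        printf("CPU supports Intel AVX : %s\n", cpuHasAvx ? "yes" : "no");
        printf("OS enables Intel AVX   : %s\n", osEnabled ? "yes" : "no");
        if (cpuHasAvx && osEnabled)
            printf("Both conditions met: AVX code paths may be used.\n");
    }
    return 0;
}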



The Intel® IPP library is optimized for a variety of SIMD instruction sets. Automatic “dispatching” detects the SIMD instruction sets available on the running processor and selects the optimal SIMD instructions for that processor. For more information about dispatching, see Understanding CPU Dispatching in the Intel® IPP Library.
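
As a small sketch of the dispatcher at work (an illustration, not from the article): ippInit() initializes dispatching for the statically linked libraries, and ippsGetLibVersion() reports which optimized code branch was selected for the signal-processing domain (on AVX-capable processors the reported name carries the g9/e9 suffix discussed later in this article).

#include <stdio.h>
#include "ipp.h"

int main(void)
{
    const IppLibraryVersion* lib;
    ippInit();                    /* selects the best code path for this CPU */
    lib = ippsGetLibVersion();    /* version info for the ipps domain */
    printf("Dispatched library: %s %s\n", lib->Name, lib->Version);
    return 0;
}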

The Intel® AVX optimizations in the Intel® IPP library include both “hand-optimized” and “compiler-tuned” functions, that is, code optimized directly for the Intel® AVX instruction set. Because the Intel® IPP library contains a very large number of primitives, it is not possible to directly optimize every Intel® IPP function for the many new instructions that Intel® AVX introduces within a single product release or update (processor-specific optimization may also need to account for cache sizes and the number of cores/threads). The functions in the tables below therefore represent those that benefit most from the new Intel® AVX instructions or that are the most widely used by Intel® IPP customers.

If there are specific Intel® IPP functions that are not listed in the tables below and that you would like added to the priority list for future AVX optimization, please create a thread on the IPP forum describing which functions you would like added to the AVX optimization priority list.

As functions directly optimized for Intel® AVX are added in each new release or update of the library, they will be added to the tables below.

The following notation is used in the tables so that several similar functions can be indicated on one line (an expansion example follows the list):

  • {x} – braces enclose required elements (of the function name).
  • [x] – square brackets enclose optional elements (of the function name).
  • | – a vertical bar indicates a unique choice within a set of optional or required elements.
  • {x|y|z} – example of three mutually exclusive choices within a required element of a function name.
  • [x|y|z] – example of three mutually exclusive choices within an optional element of a function name.
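
For example, under these rules the entry ippsAddC_{32f|64f}[_I] in the list below stands for four concrete functions: ippsAddC_32f, ippsAddC_64f, ippsAddC_32f_I, and ippsAddC_64f_I.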

Signal Processing

ippsAbs_{16s|32s|32f|64f}[_I] 
ippsAdd_{32f|32fc|64f|64fc}[_I] 
ippsAddC_{32f|64f}[_I] 
ippsAddProductC_32f 
ippsAddProduct_{32fc|64f|64fc} 
ippsAutoCorr_{32f|64f}
ippsConv_32f 
ippsConvert_{8s|8u|16s|16u|32s|64f}32f 
ippsConvert_{32s|32f}64f 
ippsConvert_32f{8s|8u|16s|16u}_Sfs 
ippsConvert_64f32s_Sfs 
ippsCopy_{16s|32s|32f|64f} 
ippsCrossCorr_{32f|64f} 
ippsDFTFwd_CToC_{32f|32fc|64f|64fc} 
ippsDFTFwd_RTo{CCS|Pack|Perm}_{32f|64f} 
ippsDFTInv_CCSToR_{32f|64f} 
ippsDFTInv_CToC_{32f|32fc|64f|64fc} 
ippsDFTInv_{Pack|Perm}ToR_{32f|64f} 
ippsDFTOutOrd{Fwd|Inv}_CToC_{32fc|64fc} 
ippsDiv[C]_32f[_I] 
ippsDotProd_32f64f 
ippsFFTFwd_CToC_{32f|32fc|64f|64fc}[_I] 
ippsFFTFwd_RTo{CCS|Pack|Perm}_{32f|64f}[_I] 
ippsFFTInv_CCSToR_{32f|64f}[_I] 
ippsFFTInv_CToC_{32f|32fc|64f|64fc}[_I] 
ippsFFTInv_{Pack|Perm}ToR_{32f|64f}[_I] 
ippsFIR64f_32f[_I] 
ippsFIR64fc_32fc[_I] 
ippsFIRLMS_32f 
ippsFIR_{32f|32fc|64f|64fc}[_I] 
ippsIIR32fc_16sc_[I]Sfs 
ippsIIR64fc_32fc[_I] 
ippsIIR_32f[_I] 
ippsLShiftC_16s_I 
ippsMagnitude_16sc_Sfs 
ipps{Min|Max}Indx_{32f|64f} 
ippsMul_32fc[_I] 
ippsMul[C]_{32f|32fc|64f|64fc}[_I] 
ippsMulC_64f64s_ISfs 
ipps{Not|Or}_8u 
ippsPhase_{16s|16sc|32sc}_Sfs 
ippsPowerSpectr_{32f|32fc} 
ippsRShiftC_16u_I 
ippsSet_{8u|16s|32s} 
ippsSqr_{8u|16s|16u|16sc}_[I]Sfs 
ippsSqr_{32f|32fc|64f|64fc}[_I] 
ippsSqrt_32f[_I] 
ippsSub_{32f|32fc|64f|64fc}[_I] 
ippsSubC_{32f|32fc|64f|64fc}[_I] 
ippsSubCRev_{32f|32fc|64f|64fc}[_I] 
ippsSum_{32f|64f} 
ippsThreshold_{32f|GT_32f|LT_32f}_[_I] 
ippsThreshold_{GT|LT}Abs_{32f|64f}[_I] 
ippsThreshold_GTVal_32f[_I] 
ippsWinBartlett_{32f|32fc|64f|64fc}[_I] 
ippsWinBlackman_{32f|64f|64fc}[_I] 
ippsWinBlackmanOpt_{32f|64f|64fc}[_I] 
ippsWinBlackmanStd_{32f|64f|64fc}[_I] 
ippsWinKaiser_{32f|64f|64fc}[_I] 
ippsZero_{8u|16s|32f}

 

SPIRAL (GEN) Functions

ippgDFTFwd_CToC_8_64fc ippgDFTFwd_CToC_12_64fc 
ippgDFTFwd_CToC_16_{32fc|64fc}
ippgDFTFwd_CToC_20_64fc
ippgDFTFwd_CToC_24_64fc
ippgDFTFwd_CToC_28_64fc 
ippgDFTFwd_CToC_32_{32fc|64fc}
ippgDFTFwd_CToC_36_64fc
ippgDFTFwd_CToC_40_64fc
ippgDFTFwd_CToC_44_64fc 
ippgDFTFwd_CToC_48_{32fc|64fc}
ippgDFTFwd_CToC_52_64fc 
ippgDFTFwd_CToC_56_64fc 
ippgDFTFwd_CToC_60_64fc 
ippgDFTFwd_CToC_64_{32fc|64fc} 
ippgDFTInv_CToC_8_64fc 
ippgDFTInv_CToC_12_64fc 
ippgDFTInv_CToC_16_{32fc|64fc} 
ippgDFTInv_CToC_20_64fc 
ippgDFTInv_CToC_24_64fc 
ippgDFTInv_CToC_28_64fc 
ippgDFTInv_CToC_32_{32fc|64fc} 
ippgDFTInv_CToC_36_64fc 
ippgDFTInv_CToC_40_64fc 
ippgDFTInv_CToC_44_64fc 
ippgDFTInv_CToC_48_{32fc|64fc} 
ippgDFTInv_CToC_52_64fc 
ippgDFTInv_CToC_56_64fc 
ippgDFTInv_CToC_60_64fc 
ippgDFTInv_CToC_64_{32fc|64fc}

 

Audio Coding

ippsDeinterleave_32f

 

Speech Coding

ippsAdaptiveCodebookSearch_RTA_32f
ippsFixedCodebookSearch_RTA_32f
ippsFixedCodebookSearchRandom_RTA_32f
ippsHighPassFilter_RTA_32f
ippsLSPQuant_RTA_32f
ippsLSPToLPC_RTA_32f
ippsPostFilter_RTA_32f_I
ippsQMFDecode_RTA_32f
ippsSynthesisFilter_G729_32f

 

Color Conversion

ippiRGBToHLS_8u_AC4R
ippiRGBToHLS_8u_C3R

 

Realistic Rendering

ipprCastEye_32f
ipprCastShadowSO_32f
ipprDot_32f_P3C1M
ipprHitPoint3DEpsM0_32f_M
ipprHitPoint3DEpsS0_32f_M
ipprMul_32f_C1P3IM

 

Computer Vision

ippiEigenValsVecs_[8u]32f_C1R 
ippiFilterGaussBorder_32f_C1R 
ippiMinEigenVal_[8u]32f_C1R 
ippiNorm_Inf_{8u|8s|16u|32f}_C{1|3C}MR 
ippiNorm_L1_{8u|8s|16u|32f}_C{1|3C}MR 
ippiNorm_L2_{8u|8s|16u|32f}_C{1|3C}MR 
ippiNormRel_L2_32f_C3CMR 
ippiUpdateMotionHistory_[8u|16u]32f_C1IR

 

Image Processing

ippiAddC_32f_C1[I]R 
ippiConvert_32f* 
ippiCopy_16s* 
ippiCopy_8u* 
ippiConvFull_32f_{AC4|C1|C3}R 
ippiConvValid_32f_{AC4|C1|C3}R 
ippiCrossCorrFull_NormLevel_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_64f_C1R 
ippiCrossCorrFull_NormLevel_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_NormLevel_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrFull_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrFull_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrSame_NormLevel_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_NormLevel_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrSame_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrSame_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrValid_{8u32f|8s32f|16u32f|32f}_C1R 
ippiCrossCorrValid_NormLevel_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_64f_C1R 
ippiCrossCorrValid_NormLevel_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_NormLevel_8u_{AC4|C1|C3|C4}RSfs 
ippiCrossCorrValid_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiCrossCorrValid_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiDCT8x8FwdLS_8u16s_C1R 
ippiDCT8x8Fwd_16s_C1[I|R] 
ippiDCT8x8Fwd_32f_C1[I] 
ippiDCT8x8Fwd_8u16s_C1R 
ippiDCT8x8InvLSClip_16s8u_C1R 
ippiDCT8x8Inv_16s8u_C1R 
ippiDCT8x8Inv_16s_C1[I|R] 
ippiDCT8x8Inv_2x2_16s_C1[I] 
ippiDCT8x8Inv_32f_C1[I] 
ippiDCT8x8Inv_4x4_16s_C1[I] 
ippiDCT8x8Inv_A10_16s_C1[I] 
ippiDCT8x8To2x2Inv_16s_C1[I] 
ippiDCT8x8To4x4Inv_16s_C1[I] 
ippiDFTFwd_CToC_32fc_C1[I]R 
ippiDFTFwd_RToPack_32f_{AC4|C1|C3|C4}[I]R 
ippiDFTFwd_RToPack_8u32s_{AC4|C1|C3|C4}RSfs 
ippiDFTInv_CToC_32fc_C1[I]R 
ippiDFTInv_PackToR_32f_{AC4|C1|C3|C4}[I]R 
ippiDFTInv_PackToR_32s8u_{AC4|C1|C3|C4}RSfs 
ippiDilate3x3_32f_C1[I]R 
ippiDilate3x3_64f_C1R 
ippiDivC_32f_C1[I]R 
ippiDiv_32f_{C1|C3}[I]R 
ippiDotProd_32f64f_{C1|C3}R 
ippiErode3x3_64f_C1R 
ippiFFTFwd_CToC_32fc_C1[I]R 
ippiFFTFwd_RToPack_32f_{AC4|C1|C3|C4}[I]R 
ippiFFTFwd_RToPack_8u32s_{AC4|C1|C3|C4}RSfs 
ippiFFTInv_CToC_32fc_C1[I]R 
ippiFFTInv_PackToR_32f_{AC4|C1|C3|C4}[I]R 
ippiFFTInv_PackToR_32s8u_{AC4|C1|C3|C4}RSfs 
ippiFilter_32f_{C1|C3|C4}R 
ippiFilter_32f_AC4R 
ippiFilter_64f_{C1|C3}R 
ippiFilter32f_{8s|8u|16s|16u|32s}_C{1|3|4}R 
ippiFilter32f_{8u|16s|16u}_AC4R 
ippiFilter32f_{8s|8u}16s_C{1|3|4}R 
ippiFilterBox_8u_{C1|C3}R 
ippiFilterBox_32f_{C1|C4|AC4}R 
ippiFilterColumn32f_{8u|16s|16u}_{C1|C3|C4|AC4}R 
ippiFilterColumn_32f_{C1|C3|C4|AC4}R 
ippiFilterGauss_32f_{C1|C3}R 
ippiFilterHipass_32f_{C1|C3|C4|AC4}R 
ippiFilterLaplace_32f_{C1|C3|C4|AC4}R 
ippiFilterLowpass_32f_{C1|C3|AC4}R 
ippiFilterMax_32f_{C1|C3|C4|AC4}R 
ippiFilterMedian_32f_C1R 
ippiFilterMin_32f_{C1|C3|C4|AC4}R 
ippiFilterRow_32f_{C1|C3|C4|AC4}R 
ippiFilterRow32f_{8u|16s|16u}_{C1|C3|C4|AC4}R 
ippiFilterSobelHoriz_32f_{C1|C3}R 
ippiFilterSobelVert_32f_{C1|C3}R 
ippiMean_32f_{C1|C3}R 
ippiMulC_32f_C1[I]R 
ippiMul_32f_{C1|C3|C4}[I]R 
ippiResizeSqrPixel_{32f|64f}_{C1|C3|C4|AC4}R 
ippiResizeSqrPixel_{32f|64f}_{P3|P4}R 
ippiSqrDistanceFull_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceFull_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiSqrDistanceSame_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceSame_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiSqrDistanceValid_Norm_16u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_8s32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_8u32f_{AC4|C1|C3|C4}R 
ippiSqrDistanceValid_Norm_8u_{AC4|C1|C3|C4}RSfs 
ippiSqrt_32f_C1R 
ippiSqrt_32f_C3IR 
ippiSubC_32f_C1[I]R 
ippiSub_32f_{C1|C3|C4}[I]R 
ippiSum_32f_C{1|3}R 
ippiTranspose_32f_C1R

 

Image Compression

ippiPCTFwd_JPEGXR_32f_C1IR 
ippiPCTFwd16x16_JPEGXR_32f_C1IR 
ippiPCTFwd8x16_JPEGXR_32f_C1IR 
ippiPCTFwd8x8_JPEGXR_32f_C1IR 
ippiPCTInv_JPEGXR_32f_C1IR_128 
ippiPCTInv16x16_JPEGXR_32f_C1IR 
ippiPCTInv8x16_JPEGXR_32f_C1IR 
ippiPCTInv8x8_JPEGXR_32f_C1IR

 

Functions that are not directly optimized for AVX (that is, functions not listed in the tables above) have been compiled with the Intel compiler's “xG” switch, which enables AVX optimizations. Further performance is gained by following the AVX ABI (application binary interface) convention of inserting the AVX-specific “vzeroupper” instruction after any function containing AVX code, to avoid AVX transition penalties.

For functions that are not directly optimized for AVX, the g9/e9 libraries reuse the optimizations from the previous compatible SSE-optimized functions, such as those tuned for the p8/y8 libraries and earlier SIMD optimizations (e.g., SSE4.x, AES-NI, and SSE2/3). Functions not listed in the tables above therefore still contain the highest applicable level of directly optimized code (based on the AES-NI, SSE4.x, SSSE3, SSE3, and SSE2 SIMD instruction sets, as appropriate).

For more information about the g9/e9 optimization layers and Intel® AVX in the Intel® IPP library, see the Intel® Integrated Performance Primitives for Windows* OS on Intel® 64 Architecture User's Guide.

For more information, see “How to Compile for Intel® AVX”, and visit the Intel Parallel Studio website to learn more about the tools available for developing, debugging, and optimizing multithreaded applications.

Optimization Notice

Intel® compilers, associated libraries, and associated development tools may include or use options that optimize for instruction sets (for example, SIMD instruction sets) that are available in both Intel® and non-Intel microprocessors, although the degree of optimization for non-Intel microprocessors may be lower. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to “Compiler Options” in the Intel® Compiler User and Reference Guides. Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. The compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code, and other factors; you are likely to get extra performance on Intel microprocessors.

Intel® compilers, associated libraries, and associated development tools may or may not optimize to the same degree for non-Intel microprocessors as for Intel microprocessors. These optimizations include the Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets, as well as other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices for obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to deliver the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101

Intel® AVX Optimized Code in Intel® MKL


Intel® AVX (Intel® Advanced Vector Extensions) is the next step in the evolution of Intel processors. Intel® MKL has included Intel® AVX optimized code since Intel® MKL 10.2; however, to activate the Intel® AVX code in version 10.2, users need to call mkl_enable_instructions(). Starting with Intel® MKL 10.3, the Intel® AVX code is dispatched automatically and no special activation is required. In Intel® MKL 10.3, the Intel® AVX optimizations have been extended to DGEMM/SGEMM, radix-2 complex-to-complex FFTs, most real VML functions, and the VSL distribution generators.
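
A hedged sketch of the MKL 10.2 activation step follows. The function name comes from the article; the constant name MKL_ENABLE_AVX is assumed from the mkl_service.h header of later MKL versions and may differ in 10.2, so check your installation. From MKL 10.3 onward no call is needed, since AVX dispatch is automatic.

#include "mkl.h"   /* declares mkl_enable_instructions() via mkl_service.h */

/* Call once, before any other MKL routine, on MKL 10.2 to turn on the
   Intel AVX code paths; later versions dispatch AVX automatically. */
static void enable_mkl_avx(void)
{
    mkl_enable_instructions(MKL_ENABLE_AVX);  /* constant name is an assumption */
}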

With Intel® MKL 10.3, a processor that supports Intel® AVX, running on an operating system that supports Intel® AVX, shows the following speedups relative to the Nehalem (NHM) architecture:

Intel® AVX DGEMM (M, N, K = 8K x 4K x 128) runs 1.8x faster than on NHM. Intel® AVX DGEMM/SGEMM reaches 88–90% of machine peak.

A radix-2 1D complex FFT (N = 1024) runs 1.8x faster with Intel® AVX than on NHM.

The Intel® Optimized LINPACK benchmark with Intel® AVX optimized code runs 1.86x faster on 4 cores (N = 20000), for an overall efficiency above 80%.



