| Total: 1000
Modern implementations of homomorphic encryption (HE) rely heavily on polynomial arithmetic over a finite field. This is particularly true of the CKKS, BFV, and BGV HE schemes. Two of the biggest performance bottlenecks in HE primitives and applications are polynomial modular multiplication and the forward and inverse number-theoretic transform (NTT). Here, we introduce Intel Homomorphic Encryption Acceleration Library (Intel HEXL), a C++ library which provides optimized implementations of polynomial arithmetic for Intel processors. Intel HEXL takes advantage of the recent Intel Advanced Vector Extensions 512 (Intel AVX512) instruction set to provide state-of-the-art implementations of the NTT and modular multiplication. On the forward and inverse NTT, Intel HEXL provides up to 7.2x and 6.7x speedup, respectively, over a native C++ implementation. Intel HEXL also provides up to 6.0x speedup on the element-wise vector-vector modular multiplication, and 1.7x speedup on the element-wise vector-scalar modular multiplication. Intel HEXL is available open-source at https://github.com/intel/hexl under the Apache 2.0 license and has been adopted by the Microsoft SEAL and PALISADE homomorphic encryption libraries.
We searched high resolution spectra of 5600 nearby stars for emission lines that are both inconsistent with a natural origin and unresolved spatially, as would be expected from extraterrestrial optical lasers. The spectra were obtained with the Keck 10-meter telescope, including light coming from within 0.5 arcsec of the star, corresponding typically to within a few to tens of au of the star, and covering nearly the entire visible wavelength range from 3640 to 7890 angstroms. We establish detection thresholds by injecting synthetic laser emission lines into our spectra and blindly analyzing them for detections. We compute flux density detection thresholds for all wavelengths and spectral types sampled. Our detection thresholds for the power of the lasers themselves range from 3 kW to 13 MW, independent of distance to the star but dependent on the competing "glare" of the spectral energy distribution of the star and on the wavelength of the laser light, launched from a benchmark, diffraction-limited 10-meter class telescope. We found no such laser emission coming from the planetary region around any of the 5600 stars. As they contain roughly 2000 lukewarm, Earth-size planets, we rule out models of the Milky Way in which over 0.1 percent of warm, Earth-size planets harbor technological civilizations that, intentionally or not, are beaming optical lasers toward us. A next generation spectroscopic laser search will be done by the Breakthrough Listen initiative, targeting more stars, especially stellar types overlooked here including spectral types O, B, A, early F, late M, and brown dwarfs, and astrophysical exotica.
Network Function Virtualization (NFV) promises the benefits of reduced infrastructure, personnel, and management costs by outsourcing network middleboxes to the public or private cloud. Unfortunately, running network functions in the cloud entails security challenges, especially for complex stateful services. In this paper, we describe our experiences with hardening the king of middleboxes - Intrusion Detection Systems (IDS) - using Intel Software Guard Extensions (Intel SGX) technology. Our IDS secured using Intel SGX, called SEC-IDS, is an unmodified Snort 3 with a DPDK network layer that achieves 10Gbps line rate. SEC-IDS guarantees computational integrity by running all Snort code inside an Intel SGX enclave. At the same time, SEC-IDS achieves near-native performance, with throughput close to 100 percent of vanilla Snort 3, by retaining network I/O outside of the enclave. Our experiments indicate that performance is only constrained by the modest Enclave Page Cache size available on current Intel SGX Skylake based E3 Xeon platforms. Finally, we kept the porting effort minimal by using the Graphene-SGX library OS. Only 27 Lines of Code (LoC) were modified in Snort and 178 LoC in Graphene-SGX itself.
Observations of the pair of galaxies VV 330 with the SCORPIO multimode instrument on the 6-m Special Astrophysical Observatory telescope are presented. Large-scale velocity fields of the ionized gas in H-alfa and brightness distributions in continuum and H-alfa have been constructed for both galaxies with the help of a scanning Fabry Perot interferometer. Long-slit spectroscopy is used to study the stellar kinematics. Analysis of the data obtained has revealed a complex structure in each of the pair components. Three kinematic subsystems have been identified in UGC 5600: a stellar disk, an inner gas ring turned with respect to the disk through ~80degrees, and an outer gas disk. The stellar and outer gas disks are noncoplanar. Possible scenarios for the formation of the observed multicomponent kinematic galactic structure are considered, including the case where the large-scale velocity field of the gas is represented by the kinematic model of a disk with a warp. The velocity field in the second galaxy of the pair, UGC 5609, is more regular. A joint analysis of the data on the photometric structure and the velocity field has shown that this is probably a late-type spiral galaxy whose shape is distorted by the gravitational interaction, possibly, with UGC 5600.
We present a comprehensive overview of the stereoscopic Intel RealSense RGBD imaging systems. We discuss these systems' mode-of-operation, functional behavior and include models of their expected performance, shortcomings, and limitations. We provide information about the systems' optical characteristics, their correlation algorithms, and how these properties can affect different applications, including 3D reconstruction and gesture recognition. Our discussion covers the Intel RealSense R200 and the Intel RealSense D400 (formally RS400).
Memory-safety violations are a prevalent cause of both reliability and security vulnerabilities in systems software written in unsafe languages like C/C++. Unfortunately, all the existing software-based solutions to this problem exhibit high performance overheads preventing them from wide adoption in production runs. To address this issue, Intel recently released a new ISA extension - Memory Protection Extensions (Intel MPX), a hardware-assisted full-stack solution to protect against memory safety violations. In this work, we perform an exhaustive study of the Intel MPX architecture to understand its advantages and caveats. We base our study along three dimensions: (a) performance overheads, (b) security guarantees, and (c) usability issues. To put our results in perspective, we compare Intel MPX with three prominent software-based approaches: (1) trip-wire - AddressSanitizer, (2) object-based - SAFECode, and (3) pointer-based - SoftBound. Our main conclusion is that Intel MPX is a promising technique that is not yet practical for widespread adoption. Intel MPX's performance overheads are still high (roughly 50% on average), and the supporting infrastructure has bugs which may cause compilation or runtime errors. Moreover, we showcase the design limitations of Intel MPX: it cannot detect temporal errors, may have false positives and false negatives in multithreaded code, and its restrictions on memory layout require substantial code changes for some programs.
A critical enabler for progress in neuromorphic computing research is the ability to transparently evaluate different neuromorphic solutions on important tasks and to compare them to state-of-the-art conventional solutions. The Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge), inspired by the Microsoft DNS Challenge, tackles a ubiquitous and commercially relevant task: real-time audio denoising. Audio denoising is likely to reap the benefits of neuromorphic computing due to its low-bandwidth, temporal nature and its relevance for low-power devices. The Intel N-DNS Challenge consists of two tracks: a simulation-based algorithmic track to encourage algorithmic innovation, and a neuromorphic hardware (Loihi 2) track to rigorously evaluate solutions. For both tracks, we specify an evaluation methodology based on energy, latency, and resource consumption in addition to output audio quality. We make the Intel N-DNS Challenge dataset scripts and evaluation code freely accessible, encourage community participation with monetary prizes, and release a neuromorphic baseline solution which shows promising audio quality, high power efficiency, and low resource consumption when compared to Microsoft NsNet2 and a proprietary Intel denoising model used in production. We hope the Intel N-DNS Challenge will hasten innovation in neuromorphic algorithms research, especially in the area of training tools and methods for real-time signal processing. We expect the winners of the challenge will demonstrate that for problems like audio denoising, significant gains in power and resources can be realized on neuromorphic devices available today compared to conventional state-of-the-art solutions.
Intel Max GPUs are a new option available to CGYRO fusion simulation users. This paper outlines the changes that were needed to successfully run CGYRO on Intel Max 1550 GPUs on TACC's Stampede3 HPC system and presents benchmark results obtained there. Benchmark results were also run on Stampede3 Intel Max CPUs, as well as NVIDIA A100 and AMD MI250X GPUs at other major HPC systems. The Intel Max GPUs are shown to perform comparably to the other tested GPUs for smaller simulations but are noticeably slower for larger ones. Moreover, Intel Max GPUs are significantly faster than the tested Intel Max CPUs on Stampede3.
With the announcement that the Aurora Supercomputer will be composed of general purpose Intel CPUs complemented by discrete high performance Intel GPUs, and the deployment of the oneAPI ecosystem, Intel has committed to enter the arena of discrete high performance GPUs. A central requirement for the scientific computing community is the availability of production-ready software stacks and a glimpse of the performance they can expect to see on Intel high performance GPUs. In this paper, we present the first platform-portable open source math library supporting Intel GPUs via the DPC++ programming environment. We also benchmark some of the developed sparse linear algebra functionality on different Intel GPUs to assess the efficiency of the DPC++ programming ecosystem to translate raw performance into application performance. Aside from quantifying the efficiency within the hardware-specific roofline model, we also compare against routines providing the same functionality that ship with Intel's oneMKL vendor library.
Genetic information is increasing exponentially, doubling every 18 months. Analyzing this information within a reasonable amount of time requires parallel computing resources. While considerable research has addressed DNA analysis using GPUs, so far not much attention has been paid to the Intel Xeon Phi coprocessor. In this paper we present an algorithm for large-scale DNA analysis that exploits thread-level and the SIMD parallelism of the Intel Xeon Phi. We evaluate our approach for various numbers of cores and thread allocation affinities in the context of real-world DNA sequences of mouse, cat, dog, chicken, human and turkey. The experimental results on Intel Xeon Phi show speed-ups of up to 10x compared to a sequential implementation running on an Intel Xeon processor E5.
With the rapidly growing demand for computing power new accelerator based architectures have entered the world of high performance computing since around 5 years. In particular GPGPUs have recently become very popular, however programming GPGPUs using programming languages like CUDA or OpenCL is cumbersome and error-prone. Trying to overcome these difficulties, Intel developed their own Many Integrated Core (MIC) architecture which can be programmed using standard parallel programming techniques like OpenMP and MPI. In the beginning of 2013, the first production-level cards named Intel Xeon Phi came on the market. LRZ has been considered by Intel as a leading research centre for evaluating coprocessors based on the MIC architecture since 2010 under strict NDA. Since the Intel Xeon Phi is now generally available, we can share our experience on programming Intel's new MIC architecture.
In this paper, we explore how transfer learning, coupled with Intel Xeon, specifically 4th Gen Intel Xeon scalable processor, defies the conventional belief that training is primarily GPU-dependent. We present a case study where we achieved near state-of-the-art accuracy for image classification on a publicly available Image Classification TensorFlow dataset using Intel Advanced Matrix Extensions(AMX) and distributed training with Horovod.
Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures. In this paper, we present our efforts in porting and optimizing the batched iterative solvers on Intel GPUs using the SYCL programming model. These new solvers achieve impressive performance on the Intel GPU Max 1550s (Ponte Vecchio GPUs) which surpass our previous CUDA implementation on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs. The batched solvers are ready for production use in real-world scientific applications through the Ginkgo library, complementing the performance portability of the batched functionality of Ginkgo.
Enforcing integrity and confidentiality of users' application code and data is a challenging mission that any software developer working on an online production grade service is facing. Since cryptology is not a widely understood subject, people on the cutting edge of research and industry are always seeking for new technologies to naturally expand the security of their programs and systems. Intel Software Guard Extension (Intel SGX) is an Intel technology for developers who are looking to protect their software binaries from plausible attacks using hardware instructions. The Intel SGX puts sensitive code and data into CPU-hardened protected regions called enclaves. In this project we leverage the Intel SGX to produce a secure cryptographic library which keeps the generated keys inside an enclave restricting use and dissemination of confidential cryptographic keys. Using enclaves to store the keys we maintain a small Trusted Computing Base (TCB) where we also perform computation on temporary buffers to and from untrusted application code. As a proof of concept, we implemented hashes and symmetric encryption algorithms inside the enclave where we stored hashes, Initialization Vectors (IVs) and random keys and open sourced the code (https://github.com/hmofrad/CryptoEnclave).
Monte Carlo (MC) simulations play a pivotal role in diverse scientific and engineering domains, with applications ranging from nuclear physics to materials science. Harnessing the computational power of high-performance computing (HPC) systems, especially Graphics Processing Units (GPUs), has become essential for accelerating MC simulations. This paper focuses on the adaptation and optimization of the OpenMC neutron and photon transport Monte Carlo code for Intel GPUs, specifically the Intel Data Center Max 1100 GPU (codename Ponte Vecchio, PVC), through distributed OpenMP offloading. Building upon prior work by Tramm J.R., et al. (2022), which laid the groundwork for GPU adaptation, our study meticulously extends the OpenMC code's capabilities to Intel GPUs. We present a comprehensive benchmarking and scaling analysis, comparing performance on Intel MAX GPUs to state-of-the-art CPU execution (Intel Xeon Platinum 8480+ Processor, codename 4th generation Sapphire Rapids). The results demonstrate a remarkable acceleration factor compared to CPU execution, showcasing the GPU-adapted code's superiority over its CPU counterpart as computational load increases.
As modern scientific simulations grow ever more in size and complexity, even their analysis and post-processing becomes increasingly demanding, calling for the use of HPC resources and methods. yt is a parallel, open source post-processing python package for numerical simulations in astrophysics, made popular by its cross-format compatibility, its active community of developers and its integration with several other professional Python instruments. The Intel Distribution for Python enhances yt's performance and parallel scalability, through the optimization of lower-level libraries Numpy and Scipy, which make use of the optimized Intel Math Kernel Library (Intel-MKL) and the Intel MPI library for distributed computing. The library package yt is used for several analysis tasks, including integration of derived quantities, volumetric rendering, 2D phase plots, cosmological halo analysis and production of synthetic X-ray observation. In this paper, we provide a brief tutorial for the installation of yt and the Intel Distribution for Python, and the execution of each analysis task. Compared to the Anaconda python distribution, using the provided solution one can achieve net speedups up to 4.6x on Intel Xeon Scalable processors (codename Skylake).
Intel Trust Domain Extensions (TDX) is a new architectural extension in the 4th Generation Intel Xeon Scalable Processor that supports confidential computing. TDX allows the deployment of virtual machines in the Secure-Arbitration Mode (SEAM) with encrypted CPU state and memory, integrity protection, and remote attestation. TDX aims to enforce hardware-assisted isolation for virtual machines and minimize the attack surface exposed to host platforms, which are considered to be untrustworthy or adversarial in the confidential computing's new threat model. TDX can be leveraged by regulated industries or sensitive data holders to outsource their computations and data with end-to-end protection in public cloud infrastructure. This paper aims to provide a comprehensive understanding of TDX to potential adopters, domain experts, and security researchers looking to leverage the technology for their own purposes. We adopt a top-down approach, starting with high-level security principles and moving to low-level technical details of TDX. Our analysis is based on publicly available documentation and source code, offering insights from security researchers outside of Intel.
Large Language Models (LLMs) have become essential tools in natural language processing, finding large usage in chatbots such as ChatGPT and Gemini, and are a central area of research. A particular area of interest includes designing hardware specialized for these AI applications, with one such example being the neural processing unit (NPU). In 2023, Intel released the Intel Core Ultra processor with codename Meteor Lake, featuring a CPU, GPU, and NPU system-on-chip. However, official software support for the NPU through Intel's OpenVINO framework is limited to static model inference. The dynamic nature of autoregressive token generation in LLMs is therefore not supported out of the box. To address this shortcoming, we present NITRO (NPU Inference for Transformers Optimization), a Python-based framework built on top of OpenVINO to support text and chat generation on NPUs. In this paper, we discuss in detail the key modifications made to the transformer architecture to enable inference, some performance benchmarks, and future steps towards improving the package. The code repository for NITRO can be found here: https://github.com/abdelfattah-lab/nitro.
Homomorphic Encryption (HE) is an emerging encryption scheme that allows computations to be performed directly on encrypted messages. This property provides promising applications such as privacy-preserving deep learning and cloud computing. Prior works have been proposed to enable practical privacy-preserving applications with architectural-aware optimizations on CPUs, GPUs and FPGAs. However, there is no systematic optimization for the whole HE pipeline on Intel GPUs. In this paper, we present the first-ever SYCL-based GPU backend for Microsoft SEAL APIs. We perform optimizations from instruction level, algorithmic level and application level to accelerate our HE library based on the Cheon, Kim, Kimand Song (CKKS) scheme on Intel GPUs. The performance is validated on two latest Intel GPUs. Experimental results show that our staged optimizations together with optimizations including low-level optimizations and kernel fusion accelerate the Number Theoretic Transform (NTT), a key algorithm for HE, by up to 9.93X compared with the naïve GPU baseline. The roofline analysis confirms that our optimized NTT reaches 79.8% and85.7% of the peak performance on two GPU devices. Through the highly optimized NTT and the assembly-level optimization, we obtain 2.32X - 3.05X acceleration for HE evaluation routines. In addition, our all-together systematic optimizations improve the performance of encrypted element-wise polynomial matrix multiplication application by up to 3.10X.
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), which targets and is optimized for the Intel Data Center GPU Max 1550. To increase the performance, our implementation minimizes the slow global memory accesses by maximizing the data reuse within the general register file and the shared local memory by fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in the arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor 19. The code can be found at https://github.com/intel/tiny-dpcpp-nn.
In this paper, we describe a new large dataset for illumination estimation. This dataset, called INTEL-TAU, contains 7022 images in total, which makes it the largest available high-resolution dataset for illumination estimation research. The variety of scenes captured using three different camera models, namely Canon 5DSR, Nikon D810, and Sony IMX135, makes the dataset appropriate for evaluating the camera and scene invariance of the different illumination estimation techniques. Privacy masking is done for sensitive information, e.g., faces. Thus, the dataset is coherent with the new General Data Protection Regulation (GDPR). Furthermore, the effect of color shading for mobile images can be evaluated with INTEL-TAU dataset, as both corrected and uncorrected versions of the raw data are provided. Furthermore, this paper benchmarks several color constancy approaches on the proposed dataset.
In this paper, we present benchmark data for Intel Memory Drive Technology (IMDT), which is a new generation of Software-defined Memory (SDM) based on Intel ScaleMP collaboration and using 3D XPointTM based Intel Solid-State Drives (SSDs) called Optane. We studied IMDT performance for synthetic benchmarks, scientific kernels, and applications. We chose these benchmarks to represent different patterns for computation and accessing data on disks and memory. To put performance of IMDT in comparison, we used two memory configurations: hybrid IMDT DDR4/Optane and DDR4 only systems. The performance was measured as a percentage of used memory and analyzed in detail. We found that for some applications DDR4/Optane hybrid configuration outperforms DDR4 setup by up to 20%.
With the rapid increase in software exploits, the last few decades have seen several hardware-level features to enhance security (e.g., Intel MPX, ARM TrustZone, Intel SGX, Intel CET). Due to security, performance and/or usability issues these features have attracted steady criticism. One such feature is the Intel Memory Protection Extensions (MPX), an instruction set architecture extension promising spatial memory safety at a lower performance cost due to hardware-accelerated bounds checking. However, recent investigations into MPX have found that is neither as performant, accurate, nor precise as cutting-edge software-based spatial memory safety. As a direct consequence, compiler and operating system support for MPX is dying, and Intel has begun to manufacture desktop CPUs without MPX. Nonetheless, given how ubiquitous MPX is, it provides an excellent yet under-utilized hardware resource that can be aptly salvaged for security purposes. In this paper, we propose Simplex, a library framework that re-purposes MPX registers as general purpose registers. Using Simplex, we demonstrate how MPX registers can be used to store sensitive information (e.g., encryption keys) directly on the hardware. We evaluate Simplex for performance and find that its overhead is small enough to permit its deployment in all but the most performance-intensive code. We refactored the string.h buffer manipulation functions and found a geometric mean 0.9% performance overhead. We also modified the deepsjeng and lbm SPEC CPU2017 benchmarks to use Simplex and found a 1% and 0.98% performance overhead respectively. Finally, we investigate the behavior of the MPX context with regards to multi-process and multi-thread programs.
In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on single-threaded Intel Math Kernel Library (MKL) implementation of DFFT, and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFTs of size 2^N for N>=21 faster than the corresponding Intel MKL parallel DFT implementation by up to 1.5x, and faster than FFTW by up to 2.5x. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation, and at the same time demonstrates an educational example of how computer science problems with complex parallel patterns can be optimized for high performance using the Intel Cilk Plus framework.
The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU.