PACXX - Portable heterogeneous programming in C++

What is PACXX?

PACXX is a Clang-based, drop-in replacement for a C++ compiler that lets users program modern heterogeneous systems in familiar C++, including the C++14/17/20 standards.
Programming modern heterogeneous systems comprising multi-core CPUs, GPUs, etc. is complex and error-prone. Existing programming approaches, like CUDA, HIP or OpenCL, are intrinsically low-level or even text-based. Handling multiple compute devices from different vendors is complicated and requires manual data movement between them.
PACXX leverages the functionality of the open-source, actively developed LLVM toolkit to enable transparent programming of various compute devices. PACXX users are free to use the familiar C++ programming language, including the STL.
PACXX is a research project currently developed by the Parallel and Distributed Systems research group at the University of Münster, Germany.

  • Detailed Information

    PACXX bridges the gap between the simplicity, expressiveness, flexibility and convenience of modern C++ and the computational capabilities of modern CPUs, GPUs and other accelerators in a portable fashion.

    Experienced C++ developers can scale their computations up to accelerators with considerably less effort and greater portability than with the widespread CUDA, HIP or OpenCL approaches, while preserving their carefully assembled high-level constructs and algorithms.

    Main features

    • Complete freedom to use any features of the C++ standards up to the most recent one (currently C++20), including but not limited to containers, algorithms and other parts of the STL.
    • Write once - use everywhere: The same program can be executed on any of the supported accelerators without recompiling.
    • Implicit memory management greatly simplifies the transition from the original C++ code.

    Simple showcase

    To appreciate the simplicity of PACXX, one can compare implementations of a simple program (vector addition) across all the approaches mentioned:

    Modern C++
    int main() {
      vector<int> a(1024), b(1024), c(1024);
      auto vadd = [&] {
        for (size_t i = 0; i < a.size(); ++i)
          c[i] = a[i] + b[i];
      };
      vadd();
    }

    Even with a lambda used to highlight the kernel code, the whole program consists of fewer than 10 lines of code (LoC). The drawback is obvious: this program will only run on CPUs and in a single thread.

    OpenCL
    int main() {
      const char* source = R"(
        __kernel void vadd(__global const int* a, __global const int* b, __global int* c, unsigned long size) {
          unsigned long i = get_global_id(0);
          if (i >= size) return;
          c[i] = a[i] + b[i];
        }
        // no templates or kernel overloading in OpenCL C: one copy per data type
        __kernel void vadd_float(__global const float* a, __global const float* b, __global float* c, unsigned long size) {/*...*/}
      )";
      vector<int> a(1024), b(1024), c(1024);
      ... // initialize platforms, devices, contexts, queues... (> 5 LoC)
      size_t s = 1024 * sizeof(int);
      cl_mem dev_a = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR, s, a.data(), NULL);
      clEnqueueWriteBuffer(queue, dev_a, true, 0, s, (void*)a.data(), 0, 0, 0);
      ... // initialization of dev_b and dev_c, compile and create kernels, bind arguments (> 9 LoC)
      size_t global = 1024, local = 1;
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, 0, 0);
      clEnqueueReadBuffer(queue, dev_c, true, 0, s, (void*)c.data(), 0, 0, 0);
    }

    The equivalent program in OpenCL can easily exceed 50 LoC, mostly due to explicit accelerator management, data transfers and the lack of support for templates, which requires a copy of the kernel for each data type used. Moreover, the validity of the kernel is unknown at compile time: the kernel code is written as plain text and is compiled at run time by a vendor-provided compiler with a varying degree of feature completeness and standard compliance.
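
    The per-type duplication can be made concrete with a small host-side sketch. It assumes a hypothetical helper, make_vadd_source (not part of OpenCL or PACXX), that generates one textual kernel copy per element type. Each copy gets its own name, since OpenCL C kernels cannot be overloaded, and errors in the generated text would only surface when the source is compiled at run time.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an illustration, not an OpenCL or PACXX API):
// builds the OpenCL C source text of a vadd kernel for one element type.
// Only single-word type names such as "int" or "float" are handled,
// because the type name is also used as a suffix of the kernel name.
std::string make_vadd_source(const std::string& type) {
  assert(!type.empty());
  return "__kernel void vadd_" + type + "(__global const " + type + "* a, "
         "__global const " + type + "* b, __global " + type + "* c, "
         "unsigned long size) {\n"
         "  unsigned long i = get_global_id(0);\n"
         "  if (i >= size) return;\n"
         "  c[i] = a[i] + b[i];\n"
         "}\n";
}
```

    Calling make_vadd_source("int") and make_vadd_source("float") yields two separate kernel texts; a single C++ template expresses the same thing and is checked at compile time.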

    CUDA
    template <typename T> __global__ void vadd(const T* a, const T* b, T* c, size_t size) {
      auto i = threadIdx.x + blockIdx.x * blockDim.x;
      if (i >= size) return;
      c[i] = a[i] + b[i];
    }
    int main() {
      size_t s = 1024 * sizeof(int);
      vector<int> a(1024), b(1024), c(1024);
      int *dev_a, *dev_b, *dev_c;
      cudaMalloc(&dev_a, s);
      cudaMalloc(&dev_b, s);
      cudaMalloc(&dev_c, s);
      cudaMemcpy(dev_a, a.data(), s, cudaMemcpyHostToDevice);
      cudaMemcpy(dev_b, b.data(), s, cudaMemcpyHostToDevice);
      vadd<<<1, 1024>>>(dev_a, dev_b, dev_c, 1024);
      cudaMemcpy(c.data(), dev_c, s, cudaMemcpyDeviceToHost);
    }

    These approaches, while delivering improved performance on hardware from the same vendor, lose any semblance of portability: CUDA is limited to GPUs from NVIDIA; support can be extended to GPUs from AMD using the HIP approach or to CPUs using the commercial OpenACC approach.
    Being C++-based, they improve on the OpenCL approach and help reduce the code size to 20-30 LoC.

    PACXX
    int main() {
      Executor& exec = Executor::get();
      vector<int> a(1024), b(1024), c(1024);
      auto vadd = [=, &c](auto config) {
        auto i = config.get_global(0);
        if (i >= a.size()) return;
        c[i] = a[i] + b[i];
      };
      KernelConfiguration config({1}, {a.size()});
      auto F = exec.launch(vadd, config);
    }

    As can be seen, PACXX not only requires fewer additional LoC (close to 10 LoC in total) but also provides greater portability for the resulting program.

    Get PACXX

    All instructions required to obtain a local copy of PACXX are available after cloning the meta-repository.

  • Workflow

    To convert an existing application to PACXX one has to follow these simple steps:

    • Allocate a PACXX Executor that will handle the accelerator and the computation:
    Executor& exec = Executor::get();
    • Isolate the computational parts of the program into lambdas, capturing the input data by copy and the output by reference:
    auto vadd = [=, &c](auto config) {...}
    • Replace loops and other parallelisation opportunities with thread ID queries:
    auto i = config.get_global(0);
    • Launch the lambda using Executor::launch() with an appropriate number of threads:
    KernelConfiguration config({1}, {a.size()});
    exec.launch(vadd, config);
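
    The kernel style from the steps above can be tried out without PACXX. The following is a minimal sketch that assumes a CPU-only stand-in: SerialConfig, launch_serial and vadd_example are hypothetical names invented for illustration and are not part of the PACXX API. The stand-in simply runs the lambda once per logical thread, in sequence, so the thread-ID-based kernel can be checked on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-thread configuration object (not the
// PACXX API): it only reports the global ID of the current logical thread.
struct SerialConfig {
  std::size_t global_id;
  std::size_t get_global(int /*dimension*/) const { return global_id; }
};

// Hypothetical stand-in for a launch: invokes the kernel once per logical
// thread, sequentially on the host CPU.
template <typename Kernel>
void launch_serial(Kernel kernel, std::size_t threads) {
  for (std::size_t id = 0; id < threads; ++id)
    kernel(SerialConfig{id});
}

// Vector addition written in the kernel style described in the steps above.
std::vector<int> vadd_example(const std::vector<int>& a,
                              const std::vector<int>& b) {
  assert(a.size() == b.size());
  std::vector<int> c(a.size());
  // Inputs are captured by copy, the output by reference.
  auto vadd = [=, &c](auto config) {
    auto i = config.get_global(0);  // the thread ID replaces the loop index
    if (i >= a.size()) return;      // guard against surplus threads
    c[i] = a[i] + b[i];
  };
  launch_serial(vadd, a.size());
  return c;
}
```

    For example, vadd_example({1, 2, 3}, {10, 20, 30}) returns {11, 22, 33}. The lambda body is the same as in the PACXX showcase; only the executor and the configuration object are replaced by the sequential stand-ins.
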
  • Related Publications

    • Haidl M, Gorlatch S. 2014. ‘PACXX: Towards a Unified Programming Model for Programming Accelerators using C++14.’ Contributed to the LLVM Compiler Infrastructure in HPC Workshop at Supercomputing '14, New Orleans. doi: 10.1109/LLVM-HPC.2014.9.
    • Haidl M, Hagedorn B, Gorlatch S. 2016. ‘Programming GPUs with C++14 and Just-In-Time Compilation.’ Contributed to the Advances in Parallel Computing: On the Road to Exascale, ParCo2015, Edinburgh, Scotland. doi: 10.3233/978-1-61499-621-7-247.
    • Haidl M, Steuwer M, Humernbrum T, Gorlatch S. 2016. ‘Multi-Stage Programming for GPUs in Modern C++ using PACXX.’ Contributed to the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit, GPGPU '16, Barcelona, Spain. doi: 10.1145/2884045.2884049.
    • Haidl M, Gorlatch S. 2017. ‘High-Level Programming for Many-Cores using C++14 and the STL.’ International Journal of Parallel Programming 2017. doi: 10.1007/s10766-017-0497-y.
    • Haidl M, Steuwer M, Dirks H, Humernbrum T, Gorlatch S. 2017. ‘Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views.’ In Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, edited by Chen Q, Huang Z, 58-67. New York, NY: ACM. doi: 10.1145/3026937.3026942.
    • Kucher V, Fey F, Gorlatch S. 2019. ‘Unified Cross-Platform Profiling of Parallel C++ Applications.’ Contributed to the 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), Dallas, TX, USA. doi: 10.1109/PMBS.2018.8641652.