cuLite v0.3.1
A lite CUDA C++ Interface
cuLite

Overview

Table of contents

  1. About cuLite
  2. How to use cuLite
  3. Useful info
  4. Documentation
  5. Third-Party Dependencies
  6. Examples
  7. License
  8. Contact

About cuLite

cuLite (CUDA Lightweight Interface) is a modern, lightweight C++ wrapper library built on the NVIDIA CUDA libraries and CLA3P, distributed as a core component of the SimuliCore suite.

With its minimalist design, cuLite provides an intuitive interface to GPU computing primitives, making it accessible for novice developers while offering advanced functionality required by experienced users to achieve optimal GPU performance. Built as a GPU acceleration layer on top of CLA3P (Compact Linear Algebra Parallel Portable Package), cuLite maintains full compatibility with CLA3P data structures and operations within the SimuliCore ecosystem, enabling seamless hybrid CPU-GPU computing workflows.

cuLite is built on a core of GPU memory management and accelerated computation routines, and it is designed to grow from that base with each release.

Features Currently Implemented

cuLite is an evolving library. Its currently supported capabilities include:

  • GPU memory management (device/host allocation and transfers)
  • Dense vector/matrix management
  • Sparse matrix management
  • Dense vector/matrix algebra
  • Sparse matrix algebra (limited)
  • Virtual operation layer
  • Dense linear system solvers (LU)
  • Sparse linear system solvers (cuDSS)
  • QR decomposition
  • Singular Value Decomposition (SVD)
  • Eigenvalue problems (limited support)
  • CUDA stream management (limited support)
  • cuBLAS integration (Basic Linear Algebra Subprograms on GPU - limited support)
  • cuSOLVER integration (Dense linear system solvers on GPU - limited support)
  • cuSPARSE integration (Sparse matrix operations on GPU - limited support)

All features described above are supported by a comprehensive suite of intuitive handler classes to facilitate rapid development, complemented by an advanced proxy layer for direct CUDA library access. This dual-layer approach enables users to select the interface that best aligns with their specific project requirements.

The library continues to expand, with new features integrated into each release. Technical inquiries or feature requests for future versions may be directed to us via the contact channels.

(back to top)

How to use cuLite

cuLite artifacts are located within the SimuliCore installation directory. To integrate the library into your project, perform the following steps:

  • Include Path: Add <simulicore_install>/include to your project's include directories.
  • Library Link: Link your project with the appropriate library found in <simulicore_install>/lib.
  • Dependencies: Ensure your project is linked with the required CUDA libraries.

cuLite provides both 32-bit and 64-bit integer interfaces to accommodate different computational requirements:

  • 32-bit Integer Interface: Link your project with the standard library (libculite.so for Linux or culite.lib for Windows).
  • 64-bit Integer Interface:
    1. Add the -DCULITE_I64 definition to your compilation flags.
    2. Link with the 64-bit specific library (libculite_i64.so for Linux or culite_i64.lib for Windows).
#
# Sample Linux x86_64 CMake configuration (32-bit integers)
# cuLite needs CLA3P library and NVIDIA CUDA libraries
# CLA3P needs Intel MKL library
#
include(<simulicore_install>/cmake/3rd/mkl.lin.cmake)
include(<simulicore_install>/cmake/3rd/cuda.lin.cmake)
set(SIMULICORE_INC <simulicore_install>/include)
set(SIMULICORE_LIB -L<simulicore_install>/lib -lculite -lcla3p)
set(SIMULICORE_3RD_DEF ${INTEL_MKL_DEF})
set(SIMULICORE_3RD_INC ${INTEL_MKL_INC} ${NVIDIA_CUDA_INC} ${NVIDIA_CUDSS_INC})
set(SIMULICORE_3RD_LIB ${INTEL_MKL_LIB} ${NVIDIA_CUDA_LIB} ${NVIDIA_CUDSS_LIB})
add_executable(<target> main.cpp)
target_compile_definitions(<target> PRIVATE ${SIMULICORE_3RD_DEF})
target_include_directories(<target> PRIVATE ${SIMULICORE_INC} ${SIMULICORE_3RD_INC})
target_link_libraries(<target> ${SIMULICORE_LIB} ${SIMULICORE_3RD_LIB})

See section Third-Party Dependencies for more information.

(back to top)

Useful info

cuLite provides a streamlined interface to NVIDIA CUDA libraries, abstracting away much of the boilerplate code typically required for GPU computing while maintaining high performance and flexibility.

Memory Management

The library simplifies GPU memory operations through intuitive handler classes:

#include <cla3p/dense.hpp>
#include <culite/dense.hpp>
// hostVec is a host-side (CLA3P) vector; devVec1 and devVec2 are device-side
// culite RdVector objects. devVec1 is allocated beforehand, devVec2 is empty.
// Transfer data from host to device
hostVec >> devVec1;
hostVec >> devVec2; // allocates memory for devVec2 internally
// Transfer data from device to host
devVec1 >> hostVec;

GPU Linear Algebra Operations

cuLite exposes matrix-vector multiplication at three levels of abstraction, from low-level BLAS calls up to expressive C++ operators. The examples below demonstrate all three approaches for computing y = alpha * A * x + beta * y using dense GPU objects.

1. Direct cuBLAS handler interface — raw access to the underlying BLAS routine via culite::CuBlasHandler::gemv:

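// A is a device matrix (culite RdMatrix); x and y are device vectors of matching sizes
// Fill A and x with data...
// Compute y = alpha * A * x + beta * y through the global cuBLAS handler
culite::CuBlasHandler& cuBlasHandler = culite::globalCuBlasHandler();
culite::real_t alpha = 1.0, beta = 0.0;
cuBlasHandler.gemv(cla3p::op_t::N,
                   A.nrows(), A.ncols(),
                   &alpha, A.values(), A.ld(),
                   x.values(), 1,
                   &beta, y.values(), 1);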

2. Functional interface with a custom cuBLAS handler — higher-level culite::ops::mult function accepting an explicit handler, useful when multiple independent BLAS contexts are required:

// Fill A and x with data...
// myHandler is a dedicated culite::CuBlasHandler created by the caller
// Compute y = alpha * A * x + beta * y using that handler
culite::real_t alpha = 1.0, beta = 0.0;
culite::ops::mult(alpha, cla3p::op_t::N, A, x, beta, y, myHandler);

3. Operator interface — the most concise form; the global cuBLAS handler is used internally:

// Fill A and x with data...
// Both the plain product y = A * x and the in-place accumulation y += A * x
// use the global cuBLAS handler internally
y += A * x;

Stream Management

For advanced users requiring explicit control over GPU execution, cuLite provides stream management capabilities. The CUDA stream class is currently under development and offers limited functionality:

// Create and manage CUDA streams via the CudaStream class
culite::CudaStream cudaStream;
// Get the underlying CUDA stream handle
cudaStream_t stream = cudaStream.stream();
// Synchronize the stream
cudaStream.sync();

The library is designed to work seamlessly with CLA3P for hybrid CPU-GPU computing workflows, enabling efficient data transfer and computation across different processing units.

(back to top)

Documentation

You can find the latest cuLite version documentation here.

(back to top)

Third-Party Dependencies

To accelerate computations on GPU hardware, cuLite relies on NVIDIA CUDA libraries:

  • Linux/Windows (x86_64 with NVIDIA GPU):
    • NVIDIA CUDA Toolkit: The CUDA Toolkit includes cuBLAS, cuSOLVER, and cuSPARSE libraries required by cuLite.
    • cuDSS (CUDA Direct Sparse Solver): Required for sparse linear system solving. It provides high-performance direct solvers for sparse matrices on the GPU.

Detailed instructions for linking with CUDA libraries are available in the provided linking guide.

Minimum Requirements:

  • CUDA Toolkit 13.0 or later
  • NVIDIA GPU with compute capability 5.0 or higher
  • Compatible NVIDIA GPU driver
  • cuDSS library (for sparse solver functionality)

(back to top)

Examples

The directory <simulicore_install>/examples/culite contains comprehensive templates and source examples demonstrating the configuration and compilation of custom projects utilizing the cuLite library.

Building Examples on Linux

Execute the following commands from a terminal:

cd <simulicore_install>/examples/culite
./example_builder.sh

Example executables are organized within the i32/bin and i64/bin directories, corresponding to their respective 32-bit and 64-bit integer interfaces. To execute a demonstration, select the desired script using the format ex<number>_<description>.sh and run it from your terminal:

./i32/bin/ex01a_dense_vector_create.sh

Building Examples on Windows

To build and execute the provided examples on Windows using Visual Studio, follow the procedure below:

  • Project Initialization: Open the <simulicore_install>/examples/culite directory in Visual Studio to initialize the CMake project.
  • Configuration: By default, examples are configured for the 32-bit integer interface. To use the 64-bit interface, set the CMake variable CULITE_EXAMPLES_I64 to ON (-DCULITE_EXAMPLES_I64=ON).
  • Compilation & Installation: Build the solution to generate the example executables and perform installation.
  • Execution: All binaries are located in the ixx/bin directory (where ixx corresponds to i32 or i64, depending on your selection). To run an example, execute the desired batch file via the Visual Studio terminal:
./i32/bin/ex01a_dense_vector_create.bat

NOTE (Compliance Requirement): Build Configuration Alignment
To ensure stable execution, verify that the build configuration (Debug or Release) of the application matches the configuration used to compile the library. Mixing configurations is strictly discouraged due to fundamental incompatibilities in the C Runtime Library (CRT).

(back to top)

License

cuLite is distributed as part of SimuliCore, which is licensed under the Apache License, Version 2.0.

(back to top)

Contact

As a recently released framework, cuLite is under active development and refinement. We welcome user feedback regarding the software's functionality, documentation, or implementation.

Please submit your insights, feature requests, documentation inquiries, or technical issue reports through the following channels:

(back to top)