Getting Started

Let's get you up and running! This should take about 10 minutes if everything goes smoothly, or maybe 30 minutes if you hit some bumps along the way.

What you'll need

**Required:**

  • CMake 3.24 or newer (check with cmake --version)
  • A C++20 compiler (GCC 11+, Clang 12+, or MSVC 2019+)
  • OpenMP (usually comes with your compiler)

**Optional but recommended:**

  • CUDA toolkit (if you want GPU acceleration – HIGHLY RECOMMENDED!)
  • Python 3.11+ (for running tests and examples)

**Quick compatibility check:**

cmake --version # Should be 3.24+
g++ --version # Should support C++20
nvcc --version # Optional, for CUDA support

Building Lamp++

The basic build

git clone https://github.com/clay-arras/lamp.git
cd lamp
cmake -S . -B build
cmake --build build

That's it for a CPU-only build. Everything should compile in under a minute.
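
If you want to sanity-check the output, the libraries the examples link against should now exist (the paths match the -L flags used in the compile commands later in this guide):

ls build/src/tensor build/src/autograd # expect libtensor_core.* and libautograd_core.*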

With CUDA support

If you have an NVIDIA GPU and want to use it:

cmake -S . -B build -DENABLE_CUDA=ON
cmake --build build

The build system will auto-detect your GPU architecture, so you don't need to worry about compute capabilities.
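
If auto-detection picks the wrong target, or you're configuring on a machine without the GPU you're building for, standard CMake lets you pin the architecture via CMAKE_CUDA_ARCHITECTURES; this sketch assumes the project honors the standard variable rather than hard-coding its own list:

cmake -S . -B build -DENABLE_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 # 86 = Ampere, e.g. RTX 30-series
cmake --build build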

Build types and options

**Release build (fast, no debug info):**

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=ON
cmake --build build

**Debug build (slower, with debug symbols):**

cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DENABLE_CUDA=ON
cmake --build build

**With code coverage (for contributors):**

cmake -S . -B build -DENABLE_COVERAGE=ON
cmake --build build

The Release build includes -march=native and -ffast-math, so it's optimized for your specific CPU; keep in mind that the resulting binary may not run on other machines, and -ffast-math trades strict IEEE floating-point semantics for speed. Debug builds include helpful debug symbols and assertions.
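
To confirm which flags actually end up on the compile line, standard CMake can export a compile database you can grep (works with the Makefile and Ninja generators):

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
grep -m1 march build/compile_commands.json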

Your first program

Create a file called test.cpp:

#include "lamppp/lamppp.hpp"
#include <iostream>
#include <vector>

int main() {
  // Create some data
  std::vector<float> data_a = {1.0f, 2.0f, 3.0f, 4.0f};
  std::vector<float> data_b = {2.0f, 3.0f, 4.0f, 5.0f};

  // Make tensors (use CPU if you don't have CUDA)
  lmp::Tensor tensor_a(data_a, {2, 2}, lmp::DeviceType::CPU, lmp::DataType::Float32);
  lmp::Tensor tensor_b(data_b, {2, 2}, lmp::DeviceType::CPU, lmp::DataType::Float32);

  // Wrap in Variables to track gradients
  lmp::Variable a(tensor_a, true); // requires_grad=true
  lmp::Variable b(tensor_b, true);

  // Do some computation: elementwise multiply, then sum to a scalar loss
  lmp::Variable c = a * b;
  lmp::Variable loss = lmp::sum(c);
  std::cout << "Forward pass result: " << loss.data() << std::endl;

  // Compute gradients via reverse-mode autodiff
  loss.backward();
  std::cout << "Gradient of a:\n" << a.grad() << std::endl;
  std::cout << "Gradient of b:\n" << b.grad() << std::endl;
  return 0;
}

To compile and run it:

g++ -std=c++20 -I./include test.cpp -L./build/src/tensor -L./build/src/autograd -ltensor_core -lautograd_core -fopenmp -o test
./test

Or if you built with CUDA:

g++ -std=c++20 -I./include test.cpp -L./build/src/tensor -L./build/src/autograd -ltensor_core -lautograd_core -fopenmp -lcuda -lcudart -o test
./test
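
Either way, the numbers are easy to check by hand: the loss is the sum of the elementwise products, 1·2 + 2·3 + 3·4 + 4·5 = 40, and because loss = Σ aᵢbᵢ, the gradient of a is simply b and the gradient of b is simply a. The exact formatting depends on how tensors print, but the output should look roughly like:

Forward pass result: 40
Gradient of a:
[[2, 3], [4, 5]]
Gradient of b:
[[1, 2], [3, 4]]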

Running the examples

The MNIST example is a good way to see everything working together:

**First, get the data:**

mkdir -p data && cd data # run from the repository root
curl -L -o mnist-in-csv.zip https://www.kaggle.com/api/v1/datasets/download/oddrationale/mnist-in-csv
unzip mnist-in-csv.zip
cd ..
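
In this Kaggle dataset the archive unpacks to mnist_train.csv and mnist_test.csv (worth double-checking, since dataset layouts occasionally change):

ls data/ # expect mnist_train.csv and mnist_test.csv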

**Run the example:**

./build/examples/mnist

This trains a simple 2-layer neural network on MNIST. You should see training accuracy improving over time. The network gets to about 85-90% accuracy, which isn't state-of-the-art but shows that everything is working.

Running tests

**Basic test suite:**

cd build
ctest

**Individual test suites:**

./tests/tensor_tests # Test tensor operations
./tests/autograd_tests # Test gradient computation
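
ctest can also run a subset by name, which is handy when iterating on one area (this assumes the suites are registered with ctest under matching names):

ctest -R tensor # run only tests whose names match "tensor"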

**With verbose output:**

ctest --verbose

The tests cover all the basic tensor operations, gradient computation, and CUDA kernels (if enabled). They should all pass on a fresh build.

Running benchmarks

./build/benchmarks/reg_bench_short # Quick benchmark
./build/benchmarks/reg_bench_long # Thorough benchmark

These will give you an idea of performance on your system. The benchmarks compare different operation implementations and should help identify any performance issues. You can also check the corresponding PyTorch benchmarks to see how the two libraries compare.

Common issues and solutions

**CUDA not found errors:**

  • Make sure CUDA toolkit is installed and in your PATH
  • Try nvcc --version to verify installation
  • You can always build without CUDA using -DENABLE_CUDA=OFF
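
A quick sanity check that covers both the driver and the toolkit:

nvidia-smi # driver installed and a GPU visible?
which nvcc # toolkit on your PATH?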

**OpenMP linking errors:**

  • Install OpenMP development packages
  • On Ubuntu: sudo apt install libomp-dev
  • On macOS: brew install libomp

**Tests failing:**

  • Make sure you're in the build directory when running ctest
  • Try a clean rebuild: rm -rf build && cmake -S . -B build && cmake --build build