Heterogeneous HPC Simulation System

Real-time simulation of 1M+ swarming particles, controlled by custom AVR hardware.

A real-time GPGPU simulation system bridging AVR-based hardware input with CUDA-accelerated kernels. Focused on maximizing throughput and decoupling I/O latency, successfully managing 1.04M particles at interactive framerates.
  • Engineered a real-time Boids simulation for 1.04M particles (1024x1024) at 50 FPS, achieving 85.23% Compute Throughput on an RTX 3070 via CUDA Uniform Grid partitioning.
  • Developed bare-metal firmware for the ATmega328P by directly manipulating AVR registers (UBRR0 for the UART; ADMUX and ADCSRA for the ADC), bypassing standard library overhead for ADC sampling and high-speed UART output.
  • Implemented a lock-free synchronization pipeline using std::atomic and Win32 Serial API to decouple hardware I/O from the high-frequency GPU rendering loop.

Technical Deep Dive

Massive Parallelization via Uniform Grid

Replaced the O(n²) brute-force neighbor search with a Uniform Grid spatial partitioning algorithm in CUDA. Each particle is hashed into a grid cell, and neighbor queries are limited to the 3×3×3 block of 27 cells centered on that cell (the cell itself plus its 26 neighbors). This cuts the average number of comparisons per particle from ~1M to ~200, enabling a stable 60 FPS with 1.04M entities.
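As a rough CPU-side illustration of the partitioning step, the sketch below groups particle indices by cell using a counting sort. It is not the project's CUDA code: the source does not say how cells are stored, so the `cellStart`/`order` layout and the names `Vec3`, `Grid`, `cellIndex`, and `buildGrid` are all assumptions made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical grid layout: particles of cell c are order[cellStart[c] .. cellStart[c+1]).
struct Grid {
    float cellSize;
    int nx, ny, nz;
    std::vector<int> cellStart;  // size nx*ny*nz + 1
    std::vector<int> order;      // particle indices grouped by cell
};

// Flat cell index for a position; each axis is clamped so the index stays in range.
inline int cellIndex(const Grid& g, const Vec3& p) {
    auto axis = [&](float v, int n) {
        int c = static_cast<int>(std::floor(v / g.cellSize));
        return c < 0 ? 0 : (c >= n ? n - 1 : c);
    };
    return axis(p.x, g.nx) + axis(p.y, g.ny) * g.nx + axis(p.z, g.nz) * g.nx * g.ny;
}

// Counting sort: count particles per cell, prefix-sum the counts, then scatter indices.
inline Grid buildGrid(const std::vector<Vec3>& pos, float cellSize, int nx, int ny, int nz) {
    Grid g{cellSize, nx, ny, nz,
           std::vector<int>(static_cast<std::size_t>(nx) * ny * nz + 1, 0),
           std::vector<int>(pos.size())};
    for (const Vec3& p : pos) ++g.cellStart[cellIndex(g, p) + 1];
    for (std::size_t c = 1; c < g.cellStart.size(); ++c) g.cellStart[c] += g.cellStart[c - 1];
    std::vector<int> cursor(g.cellStart.begin(), g.cellStart.end() - 1);
    for (int i = 0; i < static_cast<int>(pos.size()); ++i)
        g.order[cursor[cellIndex(g, pos[i])]++] = i;
    return g;
}
```

With this layout, visiting the particles of one cell is a contiguous scan of `order`, which is the property a GPU version would also want for coalesced reads.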

Kernel Optimization via Nsight Compute

Used NVIDIA Nsight Compute to identify shared memory bank conflicts in the neighbor-search kernel. Restructured memory access patterns and tuned register usage, pushing Compute Throughput from ~52% to 85.23% on an RTX 3070.

Lock-Free Hardware–GPU Synchronization

Engineered a lock-free pipeline using std::atomic and the Win32 Serial API to decouple the 60Hz GPU render loop from the asynchronous UART I/O (~100Hz). The render loop never waits on serial reads, so input bursts from the ATmega328P cause no render stalls.

Bare-Metal AVR Firmware

Directly manipulated AVR registers (ADMUX, ADCSRA, UBRR0) on the ATmega328P, bypassing Arduino’s standard library overhead. Achieved sub-millisecond ADC sampling and UART transmission for real-time potentiometer → GPU simulation parameter tuning.
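The sub-millisecond figures can be sanity-checked with the timing formulas from the ATmega328P datasheet: the UBRR0 divisor for asynchronous normal mode is f_osc / (16 × baud) − 1, and a normal ADC conversion takes 13 ADC clock cycles. The host-side sketch below evaluates them. The clock, prescaler, and baud values used in the usage note are assumptions (the source does not state them), and the function names are hypothetical.

```cpp
#include <cstdint>

// UBRR0 for asynchronous normal mode (U2X0 = 0), rounded to the nearest integer.
constexpr std::uint16_t ubrrFor(std::uint32_t fCpuHz, std::uint32_t baud) {
    return static_cast<std::uint16_t>((fCpuHz + 8u * baud) / (16u * baud) - 1u);
}

// Duration of one normal ADC conversion in microseconds:
// 13 ADC clock cycles, where the ADC clock is fCpuHz / prescaler.
constexpr double adcConversionUs(std::uint32_t fCpuHz, std::uint32_t prescaler) {
    return 13.0 * prescaler * 1e6 / fCpuHz;
}

// Time to transmit one 8N1 frame (start + 8 data + stop = 10 bits), in microseconds.
constexpr double uartFrameUs(std::uint32_t baud) {
    return 10.0 * 1e6 / baud;
}
```

Assuming a 16 MHz clock with the ADC prescaler set to 128, `adcConversionUs(16000000, 128)` evaluates to 104 µs, and at an assumed 115200 baud one byte takes about 87 µs on the wire, so both steps fit well inside a millisecond.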

Uniform Grid Kernel — Architecture

The core CUDA kernel uses a hash-based Uniform Grid to reduce the neighbor search from O(n) per particle (O(n²) overall) to effectively O(1) per particle, provided the number of particles per cell stays bounded. Below is the conceptual pipeline used in the 1.04M-particle Boids simulation.

Step 1 — Cell Hash

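// Map a 3D position to a flat cell index.
// Assumes pos lies inside the grid: 0 <= pos < dim * cellSize on each axis.
__device__ int cellHash(
    float3 pos, float cellSize, int3 dim) {
  // Built-in float3 has no operator/, so divide per component.
  // floorf (rather than a plain cast) keeps rounding consistent near zero.
  int cx = (int)floorf(pos.x / cellSize);
  int cy = (int)floorf(pos.y / cellSize);
  int cz = (int)floorf(pos.z / cellSize);
  return cx + cy*dim.x + cz*dim.x*dim.y;
}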

Step 2 — 27-Cell Neighbor Query

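// Query the 3x3x3 block around cell (cx, cy, cz): 27 cells, own cell included.
// dim is the grid size in cells, as in cellHash.
for (int dz=-1; dz<=1; dz++)
 for (int dy=-1; dy<=1; dy++)
  for (int dx=-1; dx<=1; dx++) {
    int nx = cx+dx, ny = cy+dy, nz = cz+dz;
    // Skip cells outside the grid (assumes no wrap-around at the borders).
    if (nx < 0 || ny < 0 || nz < 0 ||
        nx >= dim.x || ny >= dim.y || nz >= dim.z) continue;
    int nc = nx + ny*dim.x + nz*dim.x*dim.y;
    // iterate particles in cell nc...
  }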
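One way to gain confidence in a 27-cell query is to compare it against the brute-force answer on the CPU. The sketch below does that under stated assumptions: the cell edge equals the interaction radius, the grid is a cube of `dim` cells per axis, positions lie inside it, and the borders do not wrap. The types and function names are hypothetical, and the per-cell `std::vector` buckets are chosen for brevity rather than speed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct P3 { float x, y, z; };

inline float dist2(const P3& a, const P3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Reference: test every pair, O(n) work per particle.
inline std::vector<int> countNeighborsBrute(const std::vector<P3>& pos, float radius) {
    std::vector<int> counts(pos.size(), 0);
    for (std::size_t i = 0; i < pos.size(); ++i)
        for (std::size_t j = 0; j < pos.size(); ++j)
            if (j != i && dist2(pos[i], pos[j]) <= radius * radius) ++counts[i];
    return counts;
}

// Grid version: bucket particles into cells of edge `radius`, then scan 3x3x3 cells.
inline std::vector<int> countNeighborsGrid(const std::vector<P3>& pos, float radius, int dim) {
    auto coord = [&](float v) {  // cell coordinate, clamped to keep indexing safe
        int c = static_cast<int>(std::floor(v / radius));
        return std::min(std::max(c, 0), dim - 1);
    };
    auto flat = [&](int x, int y, int z) { return x + y * dim + z * dim * dim; };

    std::vector<std::vector<int>> cells(static_cast<std::size_t>(dim) * dim * dim);
    for (int j = 0; j < static_cast<int>(pos.size()); ++j)
        cells[flat(coord(pos[j].x), coord(pos[j].y), coord(pos[j].z))].push_back(j);

    std::vector<int> counts(pos.size(), 0);
    for (int i = 0; i < static_cast<int>(pos.size()); ++i) {
        int cx = coord(pos[i].x), cy = coord(pos[i].y), cz = coord(pos[i].z);
        for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int x = cx + dx, y = cy + dy, z = cz + dz;
                    if (x < 0 || y < 0 || z < 0 || x >= dim || y >= dim || z >= dim) continue;
                    for (int j : cells[flat(x, y, z)])
                        if (j != i && dist2(pos[i], pos[j]) <= radius * radius) ++counts[i];
                }
    }
    return counts;
}
```

Because the cell edge equals the radius, any particle within range must sit in one of the 27 scanned cells, so the two functions should agree on every input that satisfies the assumptions.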

Step 3 — Shared Memory Tile

__shared__ float3 sPos[BLOCK_SIZE];
sPos[threadIdx.x] = positions[globalIdx];
__syncthreads();
// compute Boids forces using sPos[]
// avoids repeated global mem reads

Lock-Free I/O Bridge

// std::atomic decouples UART thread
// from 60Hz GPU render loop
std::atomic<float> gSepWeight;

// UART reader thread (~100Hz)
gSepWeight.store(newVal);

// CUDA launch thread (60Hz)
float w = gSepWeight.load();
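The snippet above can be exercised end to end with two threads. The following is a minimal sketch, not the project's code: the Win32 serial read is replaced by a loop that publishes synthetic values, and `ParamMailbox`, `runHandoffDemo`, and the initial value of 1.0f are assumptions for illustration. Relaxed ordering is used on the assumption that the weight is a single independent value with no other data published alongside it.

```cpp
#include <atomic>
#include <thread>

// Latest-value mailbox: the writer overwrites, the reader samples, and neither blocks.
struct ParamMailbox {
    std::atomic<float> sepWeight{1.0f};  // placeholder initial value
    void publish(float v) { sepWeight.store(v, std::memory_order_relaxed); }
    float sample() const { return sepWeight.load(std::memory_order_relaxed); }
};

// Stand-in for the UART thread publishing `count` readings while a render loop samples.
// Returns the value seen by the first "frame" after the writer has finished.
inline float runHandoffDemo(int count) {
    ParamMailbox box;
    std::atomic<bool> done{false};

    std::thread uart([&] {
        for (int i = 1; i <= count; ++i) box.publish(static_cast<float>(i));
        done.store(true, std::memory_order_release);
    });

    // "Render loop": each iteration takes whatever value is current and moves on.
    float seen = box.sample();
    while (!done.load(std::memory_order_acquire)) seen = box.sample();

    uart.join();
    seen = box.sample();  // one more frame after the writer stopped
    return seen;
}
```

The reader may skip intermediate values, which is acceptable for a tuning knob where only the most recent reading matters. Whether `std::atomic<float>` is actually lock-free depends on the platform; it can be checked with `is_lock_free()`.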