Heterogeneous HPC Simulation System

Real-time simulation of 1M+ swarming particles, controlled by custom AVR hardware.

A real-time GPGPU simulation system bridging AVR-based hardware input with CUDA-accelerated kernels. Focused on maximizing throughput and decoupling I/O latency, successfully managing 1.04M particles at interactive framerates.

Engineered a real-time Boids simulation for 1.04M particles (1024x1024) at 50 FPS, achieving 85.23% Compute Throughput on an RTX 3070 via CUDA Uniform Grid partitioning.
Developed bare-metal firmware for the ATmega328P by directly manipulating AVR registers (ADMUX and ADCSRA for ADC sampling, UBRR0 for high-speed UART), bypassing standard library overhead.
Implemented a lock-free synchronization pipeline using std::atomic and Win32 Serial API to decouple hardware I/O from the high-frequency GPU rendering loop.

Technical Deep Dive

Massive Parallelization via Uniform Grid

Resolved the O(n²) cost of brute-force neighbor search by implementing Uniform Grid spatial partitioning in CUDA. Each particle is hashed into a grid cell, and neighbor queries are limited to the 27-cell neighborhood (the particle's own cell plus its 26 adjacent cells). This reduces the average number of comparisons per particle from ~1M to ~200, enabling a stable 60 FPS with 1.04M entities.
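The cell hash and 27-cell query can be illustrated with a CPU-side reference in C++. This is a minimal sketch, not the project's CUDA kernel: the names (Vec3, Grid, CellTable, buildCells, neighbors), the row-major cell index, and the counting-sort layout are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical grid: dim^3 cubic cells of edge cellSize, domain starts at the origin.
struct Grid { int dim; float cellSize; };

// Map one coordinate to a cell coordinate, clamped to the grid bounds.
inline int cellCoord(float p, const Grid& g) {
    int c = static_cast<int>(std::floor(p / g.cellSize));
    return std::min(std::max(c, 0), g.dim - 1);
}

// Flatten (cx, cy, cz) into a linear cell index (the cell "hash").
inline int cellIndex(int cx, int cy, int cz, const Grid& g) {
    return (cz * g.dim + cy) * g.dim + cx;
}

inline int cellOf(const Vec3& p, const Grid& g) {
    return cellIndex(cellCoord(p.x, g), cellCoord(p.y, g), cellCoord(p.z, g), g);
}

struct CellTable {
    std::vector<int> cellStart;  // range [cellStart[c], cellStart[c+1]) indexes sortedIds
    std::vector<int> sortedIds;  // particle ids grouped by cell
};

// Counting sort of particle ids by cell index.
CellTable buildCells(const std::vector<Vec3>& pos, const Grid& g) {
    const int numCells = g.dim * g.dim * g.dim;
    CellTable t;
    t.cellStart.assign(numCells + 1, 0);
    t.sortedIds.resize(pos.size());
    for (const Vec3& p : pos) ++t.cellStart[cellOf(p, g) + 1];                 // histogram
    for (int c = 0; c < numCells; ++c) t.cellStart[c + 1] += t.cellStart[c];   // prefix sum
    std::vector<int> cursor(t.cellStart.begin(), t.cellStart.end() - 1);
    for (int i = 0; i < static_cast<int>(pos.size()); ++i)
        t.sortedIds[cursor[cellOf(pos[i], g)]++] = i;                          // scatter
    return t;
}

// Ids of all particles within `radius` of particle i.
// Only complete when radius <= cellSize, since just 27 cells are visited.
std::vector<int> neighbors(int i, const std::vector<Vec3>& pos, const Grid& g,
                           const CellTable& t, float radius) {
    std::vector<int> out;
    const Vec3& p = pos[i];
    const int cx = cellCoord(p.x, g), cy = cellCoord(p.y, g), cz = cellCoord(p.z, g);
    for (int dz = -1; dz <= 1; ++dz)
    for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx) {
        const int nx = cx + dx, ny = cy + dy, nz = cz + dz;
        if (nx < 0 || ny < 0 || nz < 0 || nx >= g.dim || ny >= g.dim || nz >= g.dim)
            continue;  // neighbor cell lies outside the grid
        const int c = cellIndex(nx, ny, nz, g);
        for (int k = t.cellStart[c]; k < t.cellStart[c + 1]; ++k) {
            const int j = t.sortedIds[k];
            if (j == i) continue;
            const float ex = pos[j].x - p.x, ey = pos[j].y - p.y, ez = pos[j].z - p.z;
            if (ex * ex + ey * ey + ez * ez <= radius * radius) out.push_back(j);
        }
    }
    return out;
}
```

In a GPU version, the histogram, prefix sum, and scatter would typically each become a parallel pass, with one thread per particle running the 27-cell loop. The per-particle work then depends on local density rather than on the total particle count.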

Kernel Optimization via Nsight Compute

Used NVIDIA Nsight Compute to identify shared memory bank conflicts in the neighbor-search kernel. Restructured memory access patterns and tuned register usage, pushing Compute Throughput from ~52% to 85.23% on an RTX 3070.
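To make the bank-conflict issue concrete, the host-side C++ sketch below counts how many threads of a warp land in the same shared memory bank for a given access stride. It assumes 32 banks of 4-byte words and a 32-thread warp. The padding shown (stride 33 instead of 32) is one common remedy and is an assumption here; the source does not state which restructuring was actually applied. The function name conflictDegree is hypothetical.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 32;  // assumed: 32 shared memory banks, 4-byte words
constexpr int kWarp  = 32;  // threads per warp

// Worst-case number of threads in one warp that hit the same bank when
// thread t reads the 4-byte word at index t * strideWords + col.
// 1 means conflict-free; 32 means the access is fully serialized.
int conflictDegree(int strideWords, int col = 0) {
    std::array<int, kBanks> hits{};
    for (int t = 0; t < kWarp; ++t)
        ++hits[(t * strideWords + col) % kBanks];
    return *std::max_element(hits.begin(), hits.end());
}
```

With a 32-word row stride (a column read of a 32x32 float tile), conflictDegree(32) is 32. Padding each row by one word gives conflictDegree(33) equal to 1, at the cost of a little extra shared memory per tile.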

Lock-Free Hardware–GPU Synchronization

Engineered a lock-free pipeline using std::atomic and the Win32 Serial API to decouple the 60 Hz GPU render loop from UART I/O, which arrives on its own schedule (~100 Hz). The render loop never blocks on serial reads, so input bursts from the ATmega328P do not stall rendering.
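One way such a bridge can be structured is a single-writer "latest value" mailbox. The sketch below is an assumption about the shape of the pipeline, not the project's code: the class name ParamMailbox, the 16-bit sample width, and the packed sequence counter are all hypothetical, and the Win32 serial calls are left out so the example stays portable.

```cpp
#include <atomic>
#include <cstdint>

// Single-writer, single-reader mailbox holding only the most recent sample.
// The upper 16 bits carry a sequence counter, the lower 16 bits the sample,
// so one atomic 32-bit word conveys both the value and whether it is new.
class ParamMailbox {
public:
    // I/O thread: call after each sample is parsed from the serial stream.
    void publish(std::uint16_t sample) {
        const std::uint32_t seq = (packed_.load(std::memory_order_relaxed) >> 16) + 1;
        packed_.store((seq << 16) | sample, std::memory_order_release);
    }

    // Render thread: call once per frame. Never blocks.
    // Writes the latest sample to `out` and returns true if it changed
    // since the previous poll.
    bool poll(std::uint16_t& out) {
        const std::uint32_t v = packed_.load(std::memory_order_acquire);
        const std::uint16_t seq = static_cast<std::uint16_t>(v >> 16);
        out = static_cast<std::uint16_t>(v & 0xFFFFu);
        const bool fresh = (seq != lastSeen_);
        lastSeen_ = seq;
        return fresh;
    }

private:
    std::atomic<std::uint32_t> packed_{0};
    std::uint16_t lastSeen_ = 0;  // touched by the render thread only
};
```

Because the render loop only wants the current knob position, intermediate samples can be overwritten safely. If every sample mattered, a single-producer ring buffer would be the more appropriate structure.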

Bare-Metal AVR Firmware

Directly manipulated AVR registers (ADMUX, ADCSRA, UBRR0) on the ATmega328P, bypassing Arduino’s standard library overhead. Achieved sub-millisecond ADC sampling and UART transmission for real-time potentiometer → GPU simulation parameter tuning.
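The register values behind this setup follow from simple arithmetic, sketched below in host-side C++ so it can be checked without an AVR toolchain. The 16 MHz clock is an assumption (the source does not state the clock frequency), and the helper names ubrrFor and adcConversionMicros are hypothetical. The formulas follow the standard USART and ADC timing relations for this chip family: the baud divisor is 16 in normal mode or 8 in double-speed mode, and a regular ADC conversion takes 13 ADC clock cycles.

```cpp
#include <cstdint>

constexpr std::uint32_t kCpuHz = 16000000UL;  // assumed 16 MHz system clock

// Value to load into UBRR0 for a target baud rate, rounded to nearest.
// doubleSpeed corresponds to setting the U2X0 bit in UCSR0A.
constexpr std::uint16_t ubrrFor(std::uint32_t baud, bool doubleSpeed) {
    return static_cast<std::uint16_t>(
        (kCpuHz + (doubleSpeed ? 4UL : 8UL) * baud) /
        ((doubleSpeed ? 8UL : 16UL) * baud) - 1UL);
}

// Duration of one regular ADC conversion in microseconds for a given
// ADC clock prescaler (selected by the ADPS2:0 bits in ADCSRA).
constexpr std::uint32_t adcConversionMicros(std::uint32_t prescaler) {
    return static_cast<std::uint32_t>(13ULL * prescaler * 1000000ULL / kCpuHz);
}
```

Under the 16 MHz assumption, a prescaler of 128 gives a 125 kHz ADC clock and roughly 104 microseconds per conversion, which is consistent with sub-millisecond sampling.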

[Figure: Uniform Grid Kernel architecture. Step 1: Cell Hash. Step 2: 27-Cell Neighbor Query. Step 3: Shared Memory Tile.]

[Figure: Lock-Free I/O Bridge.]