Heterogeneous HPC Simulation System
Real-time simulation of 1M+ swarming particles, controlled by custom AVR hardware.
- Engineered a real-time Boids simulation for 1.04M particles (1024x1024) at 60 FPS, achieving 85.23% Compute Throughput on an RTX 3070 via CUDA Uniform Grid partitioning.
- Developed bare-metal firmware for ATmega328P by directly manipulating AVR registers (ADMUX and ADCSRA for ADC sampling, UBRR0 for high-speed UART), bypassing Arduino's standard library overhead.
- Implemented a lock-free synchronization pipeline using std::atomic and Win32 Serial API to decouple hardware I/O from the high-frequency GPU rendering loop.
Technical Deep Dive
Massive Parallelization via Uniform Grid
Resolved the O(n²) complexity of brute-force neighbor search by implementing Uniform Grid spatial partitioning in CUDA. Each particle is hashed into a grid cell, and neighbor queries are limited to the 27-cell (3x3x3) block formed by that cell and its 26 adjacent cells. This cuts the average work from roughly 1M comparisons per particle to ~200, enabling a stable 60 FPS with 1.04M entities.
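The cell hashing and 27-cell query described above can be sketched on the host in plain C++. This is an illustrative CPU reference, not the project's CUDA kernel: the `Grid` struct, its `cellSize` and `dim` fields, and the helper names are all hypothetical, and the grid is assumed to be a cube of `dim` cells per axis with positions clamped to its bounds.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical grid parameters; the project's real values are not stated.
struct Grid {
    float cellSize;  // edge length of one cell, typically the perception radius
    int   dim;       // cells per axis (the grid is dim x dim x dim)
};

struct Vec3 { float x, y, z; };

// Map one coordinate to a cell index along its axis, clamped to the grid.
inline int cellCoord(float p, const Grid& g) {
    int c = static_cast<int>(std::floor(p / g.cellSize));
    return c < 0 ? 0 : (c >= g.dim ? g.dim - 1 : c);
}

// Flatten (cx, cy, cz) into a linear cell id.
inline std::uint32_t cellId(int cx, int cy, int cz, const Grid& g) {
    const std::uint32_t d = static_cast<std::uint32_t>(g.dim);
    return (static_cast<std::uint32_t>(cz) * d + static_cast<std::uint32_t>(cy)) * d
           + static_cast<std::uint32_t>(cx);
}

// Collect the ids of the 3x3x3 block around p, skipping cells outside the grid.
std::vector<std::uint32_t> neighborCells(const Vec3& p, const Grid& g) {
    std::vector<std::uint32_t> out;
    out.reserve(27);
    const int cx = cellCoord(p.x, g);
    const int cy = cellCoord(p.y, g);
    const int cz = cellCoord(p.z, g);
    for (int dz = -1; dz <= 1; ++dz) {
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                const int nx = cx + dx, ny = cy + dy, nz = cz + dz;
                if (nx < 0 || ny < 0 || nz < 0 ||
                    nx >= g.dim || ny >= g.dim || nz >= g.dim) {
                    continue;  // neighbor cell lies outside the grid
                }
                out.push_back(cellId(nx, ny, nz, g));
            }
        }
    }
    return out;
}
```

An interior particle visits all 27 cells, while one in a corner cell visits only 8. In a GPU version each thread would typically run the same triple loop and scan the particles stored in each returned cell, so the per-particle cost depends on local density rather than on the total particle count.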
Kernel Optimization via Nsight Compute
Used NVIDIA Nsight Compute to identify shared memory bank conflicts in the neighbor-search kernel. Restructured memory access patterns and tuned register usage, pushing Compute Throughput from ~52% to 85.23% on an RTX 3070.
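To make the bank-conflict diagnosis concrete, the following host-side C++ sketch models how a warp's shared memory accesses map to banks. It assumes the usual NVIDIA layout of 32 banks, each 4 bytes wide, with consecutive 32-bit words in consecutive banks. The function names are hypothetical, and the padding remedy shown in the usage note is a common general technique, not necessarily the restructuring applied in this project.

```cpp
#include <array>
#include <cassert>

constexpr int kBanks = 32;     // shared memory banks (assumed 4-byte wide)
constexpr int kWarpSize = 32;  // threads per warp

// Bank hit by the 32-bit word at the given word index.
constexpr int bankOf(int wordIndex) { return wordIndex % kBanks; }

// Conflict degree when lane i of a warp reads word (i * stride), stride >= 1:
// the largest number of lanes that land in the same bank. A result of 1 means
// the access is conflict-free; 32 means the warp's request is fully serialized.
int conflictDegree(int stride) {
    std::array<int, kBanks> hits{};  // zero-initialized per-bank counters
    int worst = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int b = bankOf(lane * stride);
        if (++hits[b] > worst) {
            worst = hits[b];
        }
    }
    return worst;
}
```

Under this model, a column walk through a 32-word-wide tile (stride 32) sends every lane to the same bank, whereas padding each row to 33 words (stride 33) spreads the lanes across all 32 banks.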
Lock-Free Hardware–GPU Synchronization
Engineered a lock-free pipeline using std::atomic and the Win32 Serial API to decouple the 60 Hz GPU render loop from the asynchronous UART I/O (~100 Hz). The render loop never blocks on serial reads, so hardware input bursts from the ATmega328P cause no render stalls.
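One way such a decoupling can work is a single-slot "latest value" mailbox, sketched below in portable C++. This is an assumption about the general shape of the pipeline, not the project's actual code: `SimParams`, `ParamMailbox`, and the two parameter names are hypothetical, and the Win32 serial read is left out so the example runs on any host.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical parameter block: two potentiometer readings, packed into one
// 32-bit word so that both values are published in a single atomic store.
struct SimParams {
    std::uint16_t cohesion;
    std::uint16_t separation;
};

class ParamMailbox {
public:
    // Called by the serial I/O thread after it has parsed a complete frame.
    void publish(SimParams p) {
        const std::uint32_t packed =
            (static_cast<std::uint32_t>(p.cohesion) << 16) | p.separation;
        latest_.store(packed, std::memory_order_release);
    }

    // Called by the render loop once per frame. It never waits: it returns
    // whatever was published most recently.
    SimParams read() const {
        const std::uint32_t packed = latest_.load(std::memory_order_acquire);
        return SimParams{static_cast<std::uint16_t>(packed >> 16),
                         static_cast<std::uint16_t>(packed & 0xFFFFu)};
    }

private:
    std::atomic<std::uint32_t> latest_{0};
};
```

Because the reader only ever wants the newest setting, intermediate values may be overwritten before they are seen, which is acceptable for knob-style inputs. A design that must not drop samples would need a queue (for example a single-producer, single-consumer ring buffer) instead.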
Bare-Metal AVR Firmware
Directly manipulated AVR registers (ADMUX, ADCSRA, UBRR0) on the ATmega328P, bypassing Arduino’s standard library overhead. Achieved sub-millisecond ADC sampling and UART transmission for real-time potentiometer → GPU simulation parameter tuning.
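The register values and timings behind this can be checked with a little arithmetic, shown here as host-side C++ so it can be compiled and tested off-target. The formulas are the standard ATmega328P ones; the 16 MHz clock, the baud rates, the 8N1 frame format, and all function names are assumptions for illustration, since the text does not state the actual configuration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed system clock: 16 MHz (typical for Arduino Uno boards); not stated in the text.
constexpr std::uint32_t kFcpuHz = 16000000UL;

// UBRR0 for asynchronous normal-speed mode (U2X0 = 0):
// UBRR = F_CPU / (16 * baud) - 1, with the division rounded to nearest.
constexpr std::uint16_t ubrrNormal(std::uint32_t baud) {
    return static_cast<std::uint16_t>((kFcpuHz + 8UL * baud) / (16UL * baud) - 1);
}

// UBRR0 for double-speed mode (U2X0 = 1): UBRR = F_CPU / (8 * baud) - 1.
constexpr std::uint16_t ubrrDouble(std::uint32_t baud) {
    return static_cast<std::uint16_t>((kFcpuHz + 4UL * baud) / (8UL * baud) - 1);
}

// Duration of one normal ADC conversion in microseconds: 13 ADC clock cycles,
// where ADC clock = F_CPU / prescaler (prescaler chosen by ADPS2:0 in ADCSRA).
constexpr double adcConversionUs(std::uint32_t prescaler) {
    return 13.0 * prescaler * 1e6 / kFcpuHz;
}

// Time on the wire for one byte, assuming 8N1 framing (10 bits per byte).
constexpr double uartByteUs(std::uint32_t baud) {
    return 10.0 * 1e6 / baud;
}
```

Under these assumptions, a conversion with the 128 prescaler takes about 104 µs and one byte at 115200 baud takes about 87 µs, so a sample-and-send cycle of a few bytes fits comfortably within one millisecond.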