I started with a naive char* board that updates each cell individually, then moved to 16x16 tiles (16 uint16_t rows each) with an active bounding box. Even with multithreading, I was stuck at around a 29x speedup for a while before pivoting to a different geometric decomposition: a bitboard layout (one uint64_t per 64 cells) driving an Abrash-style bit-parallel kernel. Instead of visiting 8 scalar neighbours per cell, the kernel obtains horizontal neighbours via shifts with wraparound, accumulates the neighbour counts in 4 bit planes (c0 to c3), and then applies the Life rule to all 64 cells of a word at once. The board is updated row-wise in parallel with OpenMP, using static scheduling and disjoint row ranges per thread to avoid false sharing on the testing machine. I also added an early exit for a completely dead board, plus periodic hashing to detect some cycles and equilibria. I couldn't properly integrate AVX2 intrinsics, or a sparse active-cell list for boards that aren't very dense.
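
To make the bit-plane idea concrete, here is a minimal sketch of one generation step. It is not my actual kernel: it assumes a board exactly 64 columns wide (one uint64_t per row) that wraps around in both directions, and the names life_step, add_plane, rotl1 and rotr1 are made up for illustration. The real code has to carry bits across word boundaries within a row, which this sketch sidesteps by rotating a single word.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a row by one bit so each cell lines up with a horizontal neighbour.
   Assumption: the board is exactly 64 columns wide and wraps around. */
static inline uint64_t rotl1(uint64_t x) { return (x << 1) | (x >> 63); }
static inline uint64_t rotr1(uint64_t x) { return (x >> 1) | (x << 63); }

/* Add a 1-bit value per lane into a 4-bit counter held in planes c0..c3
   (c0 is the least significant bit). This is a ripple-carry increment
   performed on all 64 lanes at once. */
static inline void add_plane(uint64_t n, uint64_t *c0, uint64_t *c1,
                             uint64_t *c2, uint64_t *c3)
{
    uint64_t carry0 = *c0 & n;      *c0 ^= n;
    uint64_t carry1 = *c1 & carry0; *c1 ^= carry0;
    uint64_t carry2 = *c2 & carry1; *c2 ^= carry1;
    *c3 ^= carry2;                  /* at most 8 neighbours, so no overflow */
}

/* Compute one generation. src and dst must not overlap. */
void life_step(const uint64_t *src, uint64_t *dst, int rows)
{
    assert(rows > 0);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int r = 0; r < rows; r++) {
        uint64_t up   = src[(r + rows - 1) % rows];
        uint64_t mid  = src[r];
        uint64_t down = src[(r + 1) % rows];

        /* The 8 neighbour bitboards, aligned to the centre cell. */
        const uint64_t nb[8] = {
            rotl1(up),   up,   rotr1(up),
            rotl1(mid),        rotr1(mid),
            rotl1(down), down, rotr1(down)
        };

        uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        for (int k = 0; k < 8; k++)
            add_plane(nb[k], &c0, &c1, &c2, &c3);

        /* Life rule: alive next if count == 3, or count == 2 and alive now.
           Both cases have c3 = 0, c2 = 0, c1 = 1; c0 separates 3 from 2. */
        dst[r] = ~c3 & ~c2 & c1 & (c0 | mid);
    }
}
```

Each thread writes only dst[r] for its own rows, so with static scheduling the threads work on contiguous, non-overlapping row ranges. Compiled without OpenMP, the pragma is skipped and the loop runs serially.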