I figured out that the test cases allocated a disproportionate number of X-byte blocks. I was able to get to the top by hardcoding a specific free list just for X-byte blocks.
Learned a lesson about how easy it is to game a benchmark :)
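For anyone curious what a dedicated free list for one block size can look like, here is a minimal sketch, not the code from the comment above. `HOT_SIZE`, `my_alloc` and `my_free` are hypothetical names, the value 64 is an arbitrary stand-in for X, and the sketch assumes the caller passes the size back on free and that everything runs on one thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the "X" above; 64 is arbitrary. */
#define HOT_SIZE 64

static_assert(HOT_SIZE >= sizeof(void *), "a free block must be able to hold a link");

/* While a hot-size block is free, its own storage holds the link
   to the next free block, so the list costs no extra memory. */
typedef union hot_block {
    union hot_block *next;
    unsigned char payload[HOT_SIZE];
} hot_block;

static hot_block *hot_free_list = NULL;

/* Fast path: pop a recycled block when the request is exactly HOT_SIZE.
   Everything else (and an empty list) falls through to malloc. */
void *my_alloc(size_t size) {
    if (size == HOT_SIZE) {
        if (hot_free_list != NULL) {
            hot_block *b = hot_free_list;
            hot_free_list = b->next;
            return b;
        }
        return malloc(sizeof(hot_block));
    }
    return malloc(size);
}

/* Hot-size blocks are pushed onto the list instead of being released. */
void my_free(void *p, size_t size) {
    if (p == NULL) {
        return;
    }
    if (size == HOT_SIZE) {
        hot_block *b = p;
        b->next = hot_free_list;
        hot_free_list = b;
        return;
    }
    free(p);
}
```

Both paths for the hot size are a couple of pointer moves, which is why a benchmark dominated by that one size rewards it so heavily. The trade-off is that recycled blocks are never handed back to the system.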
Use pre-allocated pools with an array of block indexes, plus a separate alloc idx and free idx for allocating and freeing.
Con: the pool size is fixed, so only a fixed amount of memory can be allocated per pool.
Pro: constant-cost alloc/free via atomic inc/dec of the idx, with no linked-list traversal; blocks can be allocated in kernel space and freed in user space (Linux/QNX), and across multiple user processes when the memory pools are in shmem; runs very well in an SMP environment without any locks, since all memory contention was handled with atomic +/- on the alloc/free idx.
The same source code ran on QNX, VxWorks and Linux (kernel and user space) at that time.
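A rough sketch of how an index-based pool like this could be put together with C11 atomics. This is a guess at the general shape, not the original code: `pool_t`, `pool_init`, `pool_alloc`, `pool_free`, `POOL_BLOCKS` and `BLOCK_SIZE` are all made-up names and values. It also swaps in a different mechanism from the one described: instead of a bare atomic inc/dec, each ring slot carries a sequence number and the indexes advance by compare-and-swap (the well-known bounded-queue pattern), because a plain increment alone does not tell a reader whether a slot has been written yet.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS 8u   /* hypothetical; must be a power of two */
#define BLOCK_SIZE  64u  /* hypothetical block size in bytes */

static_assert((POOL_BLOCKS & (POOL_BLOCKS - 1u)) == 0, "POOL_BLOCKS must be a power of two");

typedef struct {
    atomic_size_t seq;  /* says whether the slot is ready to take from or give to */
    uint32_t block;     /* index of a free block, not a pointer */
} slot_t;

typedef struct {
    slot_t ring[POOL_BLOCKS];  /* array of free block indexes */
    atomic_size_t alloc_idx;   /* next ring position to take a block from */
    atomic_size_t free_idx;    /* next ring position to return a block to */
    _Alignas(max_align_t) unsigned char blocks[POOL_BLOCKS][BLOCK_SIZE];
} pool_t;

/* Start with every block free: the ring is full. */
void pool_init(pool_t *p) {
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        p->ring[i].block = (uint32_t)i;
        atomic_init(&p->ring[i].seq, i + 1);
    }
    atomic_init(&p->alloc_idx, 0);
    atomic_init(&p->free_idx, POOL_BLOCKS);
}

/* Returns a block, or NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p) {
    size_t pos = atomic_load_explicit(&p->alloc_idx, memory_order_relaxed);
    for (;;) {
        slot_t *s = &p->ring[pos & (POOL_BLOCKS - 1u)];
        size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
        if (diff == 0) {
            /* Slot holds a free index; try to claim this position. */
            if (atomic_compare_exchange_weak_explicit(&p->alloc_idx, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                uint32_t b = s->block;
                atomic_store_explicit(&s->seq, pos + POOL_BLOCKS, memory_order_release);
                return p->blocks[b];
            }
            /* A failed CAS reloaded pos; loop and retry. */
        } else if (diff < 0) {
            return NULL;  /* no free blocks left */
        } else {
            pos = atomic_load_explicit(&p->alloc_idx, memory_order_relaxed);
        }
    }
}

/* Returns 0 on success, -1 if the ring is already full (e.g. a double free). */
int pool_free(pool_t *p, void *ptr) {
    size_t offset = (size_t)((unsigned char *)ptr - &p->blocks[0][0]);
    uint32_t b = (uint32_t)(offset / BLOCK_SIZE);
    size_t pos = atomic_load_explicit(&p->free_idx, memory_order_relaxed);
    for (;;) {
        slot_t *s = &p->ring[pos & (POOL_BLOCKS - 1u)];
        size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* Slot is empty; try to claim this position. */
            if (atomic_compare_exchange_weak_explicit(&p->free_idx, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                s->block = b;
                atomic_store_explicit(&s->seq, pos + 1, memory_order_release);
                return 0;
            }
        } else if (diff < 0) {
            return -1;
        } else {
            pos = atomic_load_explicit(&p->free_idx, memory_order_relaxed);
        }
    }
}
```

Because the ring stores block indexes rather than pointers, a `pool_t` placed in shared memory should stay meaningful even when each process maps it at a different address, which fits the shmem use described above. That assumes the platform's `atomic_size_t` is lock-free. The fixed `POOL_BLOCKS` is the "Con" from the comment showing up directly in the code.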
I'm gonna read this article and try making my own allocator next.