As seen, pyrit opencl does about 1200... I've written a kernel that grabs the data as seen here and processes it as uint4 vectors. Theoretically this should give a x4 speedup, and it does!
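To illustrate the uint4 idea, here is a minimal pure-Python sketch, not the actual kernel. It assumes each of the four vector components carries the same SHA-1 word of a different candidate, so one pass over the 80 rounds hashes four messages in lock-step. The names `sha1_compress_x4`, `pad_one_block`, `rotl` and `add` are made up for this example; tuples of four ints stand in for uint4.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)

def rotl(v, n):
    """Rotate every 32-bit lane of v left by n bits."""
    return tuple(((x << n) | (x >> (32 - n))) & MASK for x in v)

def add(*vs):
    """Lane-wise addition modulo 2**32 of any number of 4-lane vectors."""
    return tuple(sum(lane) & MASK for lane in zip(*vs))

def pad_one_block(msg):
    """SHA-1 padding for a message that fits in a single 64-byte block."""
    assert len(msg) <= 55
    return msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))

def sha1_compress_x4(blocks):
    """Run one SHA-1 compression on four 64-byte blocks in lock-step."""
    words = [struct.unpack(">16I", blk) for blk in blocks]
    # w[i] holds word i of all four messages, like one uint4 value.
    w = [tuple(words[lane][i] for lane in range(4)) for i in range(16)]
    for i in range(16, 80):
        mixed = tuple(p ^ q ^ r ^ s for p, q, r, s in
                      zip(w[i - 3], w[i - 8], w[i - 14], w[i - 16]))
        w.append(rotl(mixed, 1))

    state = [(iv,) * 4 for iv in IV]
    a, b, c, d, e = state
    for i in range(80):
        if i < 20:
            f = tuple((x & y) | (~x & MASK & z) for x, y, z in zip(b, c, d))
            k = 0x5A827999
        elif i < 40:
            f = tuple(x ^ y ^ z for x, y, z in zip(b, c, d))
            k = 0x6ED9EBA1
        elif i < 60:
            f = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(b, c, d))
            k = 0x8F1BBCDC
        else:
            f = tuple(x ^ y ^ z for x, y, z in zip(b, c, d))
            k = 0xCA62C1D6
        t = add(rotl(a, 5), f, e, (k,) * 4, w[i])
        a, b, c, d, e = t, a, rotl(b, 30), c, d

    final = [add(s, v) for s, v in zip(state, (a, b, c, d, e))]
    # Unpack the lanes again: one 20-byte digest per input block.
    return [struct.pack(">5I", *(final[j][lane] for j in range(5)))
            for lane in range(4)]
```

Feeding it four padded short messages and comparing each result with hashlib.sha1 is a quick way to check the lanes don't leak into each other. On a GPU the per-lane loops collapse into single vector instructions, which is where the x4 would come from.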
What's the big news? The output is verified against test vectors over 8192 SHA-1 rounds. Even better:
The first 20 of the 32 bytes of the PMK are correct with my implementation. I now have enough C/OpenCL basics to target the different OpenCL address spaces, make use of __local and async_work_group_copy, and maybe redesign the algorithm to do the 20+12 bytes at the same time!
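For reference, a small Python sketch of where the 20+12 split comes from, assuming the standard WPA derivation PMK = PBKDF2-HMAC-SHA1(passphrase, SSID, 4096 iterations, 32 bytes). SHA-1 only yields 20 bytes per PBKDF2 block, so the PMK needs block 1 in full plus the first 12 bytes of block 2. The 8192 figure presumably corresponds to 4096 HMAC iterations at two SHA-1 compressions each once the key pads are precomputed. `pbkdf2_block` and `pmk` are hypothetical helper names, not Pyrit functions.

```python
import hashlib
import hmac
import struct

def pbkdf2_block(passphrase, ssid, index, iterations=4096):
    """One 20-byte PBKDF2-HMAC-SHA1 output block (index counts from 1)."""
    # U_1 = HMAC(passphrase, salt || big-endian block index)
    u = hmac.new(passphrase, ssid + struct.pack(">I", index), hashlib.sha1).digest()
    acc = int.from_bytes(u, "big")
    # U_n = HMAC(passphrase, U_{n-1}); the block is the XOR of all U_n.
    for _ in range(iterations - 1):
        u = hmac.new(passphrase, u, hashlib.sha1).digest()
        acc ^= int.from_bytes(u, "big")
    return acc.to_bytes(20, "big")

def pmk(passphrase, ssid):
    """32-byte PMK assembled from two independent PBKDF2 blocks."""
    first_20 = pbkdf2_block(passphrase, ssid, 1)
    last_12 = pbkdf2_block(passphrase, ssid, 2)[:12]
    return first_20 + last_12
```

The two blocks only differ in the index appended to the SSID and never read each other's state, so in principle they can run side by side. Comparing pmk(b"password", b"IEEE") with hashlib.pbkdf2_hmac("sha1", b"password", b"IEEE", 4096, 32) gives an easy reference check for a kernel's output.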
I've gotta go now, this fucking work...