Let the AIs play games against each other. Would the resulting leaderboard be more precise than benchmarks?
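
To make the idea concrete, here is a minimal sketch of how head-to-head game results could be turned into a leaderboard. It assumes an Elo-style rating update, which the post itself does not specify; the model names, the K-factor of 32, the starting rating of 1500, and the game results are all hypothetical placeholders, not real data.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Elo expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def build_leaderboard(games, k=32.0, initial=1500.0):
    """Turn a sequence of game results into a ranked list.

    games: iterable of (player_a, player_b, score_a), where score_a is
           1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k and initial are assumed values, chosen only for illustration.
    """
    ratings = defaultdict(lambda: initial)
    for player_a, player_b, score_a in games:
        exp_a = expected_score(ratings[player_a], ratings[player_b])
        delta = k * (score_a - exp_a)
        # Zero-sum update: what A gains, B loses.
        ratings[player_a] += delta
        ratings[player_b] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical results, made up for this example.
games = [
    ("model_a", "model_b", 1.0),
    ("model_b", "model_c", 1.0),
    ("model_a", "model_c", 1.0),
    ("model_b", "model_a", 0.5),
]

leaderboard = build_leaderboard(games)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Whether such a ranking is actually more precise than a benchmark would depend on how many games are played, which games are chosen, and how noisy individual results are; the sketch only shows the mechanics of deriving a ranking from pairwise play.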