The 2B beat the 12B by 19 percentage points (50% vs 31% semantic accuracy). Why? The larger model is arguably "too smart": it computes the answer in its head and outputs "42" instead of writing SQL. 81% of the 12B's errors were bare numbers rather than queries.
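This kind of error analysis can be automated by classifying each completion as a SQL query, a bare number, or something else. A minimal sketch (the function name and the keyword list are my own assumptions, not the repo's actual evaluation code):

```python
import re

def classify_output(text: str) -> str:
    """Classify a model completion as 'sql', 'number', or 'other'.

    Hypothetical helper for the error analysis described above:
    a 'number' result means the model answered directly
    instead of emitting a query.
    """
    stripped = text.strip().rstrip(";")
    # Bare numeric answer, e.g. "42" or "3.14"
    if re.fullmatch(r"-?\d+(\.\d+)?", stripped):
        return "number"
    # Crude SQL detection: completion starts with a common SQL keyword
    if re.match(r"(?i)^\s*(SELECT|WITH|INSERT|UPDATE|DELETE)\b", text):
        return "sql"
    return "other"
```

Running it over all failing completions and counting the `"number"` bucket gives the 81% figure directly.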
Everything runs locally with zero cloud compute. The repo includes the scripts, data, and full results needed to reproduce it.