Another major reason was that a lot of tasks would only use 3 or 4 of the 5 ALUs available in VLIW5, so there were a ton of shaders taking up die space while sitting idle a lot of the time. Moving to VLIW4 meant the available resources were generally utilised more efficiently, but at the cost of increased power draw for the same core count and clock.
Edit: The AnandTech article does mention that, which is presumably where I read it originally.
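
To put rough numbers on the utilisation point, here is a quick back-of-the-envelope sketch. It is a toy model only: the `utilisation` helper is made up for illustration, and it assumes a task fills a fixed number of slots per instruction bundle, which ignores how the compiler actually packs instructions on either architecture.

```python
def utilisation(slots_used, slots_available):
    """Fraction of ALU slots doing useful work in one instruction bundle."""
    # A bundle cannot fill more slots than the hardware provides.
    return min(slots_used, slots_available) / slots_available


# The "3 or 4 of the 5 ALUs" case from the comment above.
for used in (3, 4):
    vliw5 = utilisation(used, 5)
    vliw4 = utilisation(used, 4)
    print(f"{used} ops per bundle: VLIW5 {vliw5:.0%} busy, VLIW4 {vliw4:.0%} busy")
```

Under those assumptions, the same work leaves 20-40% of a VLIW5 unit idle but only 0-25% of a VLIW4 unit, which is the die space argument in a nutshell.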