I know. That's what I was talking about in the first place.
Nope, hardware design doesn't particularly care what your language of choice is. Be it C++, Ruby, Lua, etc., the instructions that drive the circuitry, and the trade-offs made for efficiency (notable in the console space), are not the domain of the programming language. It's a single SIMD instruction to fully occupy all the CUs on a GPU, be there 32, 48, or 64+ of them. Your design determines how fast you execute those instructions. Case in point: there is a reason AMD dedicated so much RDNA2 die space to an extra cache level (the Infinity Cache).
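As a rough illustration (a minimal C sketch, assuming a compiler with auto-vectorisation enabled, e.g. -O3), the same trivial loop gets lowered to whatever SIMD width the hardware offers, regardless of which language it was originally written in; the language isn't what decides how the circuitry gets driven:

```c
#include <stddef.h>
#include <stdint.h>

/* A plain scalar-looking loop: add two arrays of 8-bit values.
 * With optimisation enabled (e.g. gcc/clang -O3), the compiler emits
 * wide SIMD adds that process many elements per instruction.
 * Nothing in the source language dictates how wide those lanes are;
 * that's fixed by the hardware design. */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```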
You get a speed-up based on how well you can make your tools, in this case your programming tools, work within the limits of the design.
As a crude example: if I took a single pixel and wanted to test its colour, I'd need at minimum three passes (in an incredibly badly designed system). With each pass one of the R, G, or B values is read into a register, then held in some buffer. It makes no difference to the programmer how this process is invoked or achieved, because (and this is how you'd actually do it) you can just pack the three 8-bit RGB values into a single 32-bit register and read them off in a single pass. That's an easy 3-4x perf increase over the badly designed system when reading pixel data. This is all abstracted away; how you expose the circuitry to the 'programmer' is another thing entirely.
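To make the packing concrete, here's a minimal C sketch (the names pack_rgb/unpack_rgb are just illustrative, not from any particular API): three 8-bit channels go into one 32-bit word, and shifts and masks read them all back with a single access instead of three.

```c
#include <stdint.h>
#include <stdio.h>

/* Pack three 8-bit channels into one 32-bit word: 0x00RRGGBB. */
static uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* One 32-bit read recovers all three channels via shifts and masks,
 * instead of three separate passes over the pixel. */
static void unpack_rgb(uint32_t rgb, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (rgb >> 16) & 0xFF;
    *g = (rgb >> 8)  & 0xFF;
    *b =  rgb        & 0xFF;
}

int main(void)
{
    uint32_t pixel = pack_rgb(200, 120, 40);
    uint8_t r, g, b;
    unpack_rgb(pixel, &r, &g, &b);
    printf("r=%u g=%u b=%u (packed=0x%06X)\n", r, g, b, pixel);
    return 0;
}
```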