Igor's Lab uncovers 'hotspot issue' affecting all RTX 50-series GPUs — says it could compromise graphics card longevity

www.tomshardware.com

18 comments
  • They had to make sure they didn’t get another 1080ti situation and have people hold on to their cards for a decade+

  • Neither card was using any sort of thermal pads to connect the power delivery portion of the PCB, where the hotspot is located, to each GPU's respective backplate.

    lol

    Anybody still remember when Palit were good? Showing my age here.

  • Sounds like Nvidia is pressuring the AIBs with their fake MSRP and this is what we get as a result.

    • No, not really.

      Nvidia is pressuring partner board mfgrs to stay closer to MSRP...

      But the fundamental problem is Nvidia's actual design guidelines, which this article and the actual source this article is based on state directly, in detail.

      The problem is Nvidia is designing shitty cards that physically can't handle the heat from the way they do power management/routing through the board itself.

      Recently, several current graphics card models in the RTX 5000 series, including the RTX 5080, 5070 (Ti) and 5060 Ti in particular, have shown thermal anomalies in the area of local hotspots on the back of the board in my tests.

      These affect cards from major board partners such as Palit, PNY and MSI as well as variants from other manufacturers, which (have to) largely adhere to the reference design specified by NVIDIA.

      The thermal load does not manifest itself as a systemic temperature problem of the GPU cores themselves, but in the form of pronounced heat nests below the power supply – often in areas that are hardly cooled or mechanically connected at all when viewed from the rear.

      https://www.igorslab.de/en/local-hotspots-on-rtx-5000-cards-when-board-layout-and-cooling-design-do-not-work-together/

      Basically, this is the whole... literally melting/fires starting in recent Nvidia GPUs at the power connector... thing, or something quite similar to it, on the 50 series.

      https://youtube.com/watch?v=Y36LMS5y34A

      The partner mfgrs, as I highlighted... are largely following Nvidia's design specs quite closely.

      Meaning this issue is almost certainly present in actual Nvidia reference cards as well, but there are far fewer of those, so it is harder to get a good sample size to do a study.

      ...

      Not sure if you've been following this generation of Nvidia cards very closely... but there have been tons of other hardware defects coming straight out of Nvidia, such as defective/broken/deactivated ROP clusters, which are basically a specialized subcomponent of the GPU processor chip itself, similar to how tensor cores or CUDA cores are specialized subcomponents.

      The partner board mfgrs just get those chips from Nvidia, and while they should probably be quality testing them better and not assembling defective ones into the boards they sell... they are also not the ultimate cause of that problem...

      And having a missing ROP cluster or two will kneecap your GPU's performance, more significantly on lower tier cards.

      ...

      Also, JayzTwoCents just in the last 24 hours put out a video showing that the latest mainline Nvidia driver update on Windows (576.02) basically breaks the way internal GPU temps are reported to most third party software that monitors GPU temps and applies custom fan RPM curves, overclocking and whatnot.

      This results in your GPU temp not getting updated in that software in many scenarios, which then means your fans don't actually ramp up, which then means your GPU overheats.

      https://youtube.com/watch?v=KrCEPX47vtw
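
      To make that failure mode concrete, here is a rough Python sketch of a custom fan curve with a guard against a frozen temperature reading. Everything in it is hypothetical: the curve breakpoints, the `StaleGuard` name and the 30 second timeout are my own assumptions, not how any real fan control tool or the Nvidia driver works, and actually reading the temperature from the driver is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fan curve: (temperature in deg C, fan duty in %) breakpoints.
FAN_CURVE = [(40, 30), (60, 50), (75, 80), (85, 100)]


def duty_from_curve(temp_c: float) -> float:
    """Linearly interpolate the fan duty between curve breakpoints."""
    if temp_c <= FAN_CURVE[0][0]:
        return FAN_CURVE[0][1]
    for (t0, d0), (t1, d1) in zip(FAN_CURVE, FAN_CURVE[1:]):
        if temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
    return FAN_CURVE[-1][1]


@dataclass
class StaleGuard:
    """Fail safe to 100% fan duty when the reported temperature stops changing.

    Assumption: a reading that has not changed for `max_unchanged_s` seconds
    is treated as frozen. This is a heuristic, not a vendor-documented check.
    """
    max_unchanged_s: float = 30.0
    last_value: Optional[float] = None
    last_change: float = 0.0

    def duty(self, temp_c: float, now: float) -> float:
        # Record the time of the most recent change in the reported value.
        if self.last_value is None or temp_c != self.last_value:
            self.last_value = temp_c
            self.last_change = now
        # A naive controller would return duty_from_curve(temp_c) here
        # unconditionally, so a frozen reading pins the fans at a low duty.
        if now - self.last_change > self.max_unchanged_s:
            return 100.0
        return duty_from_curve(temp_c)
```

      The obvious tradeoff is false positives: an idle card can legitimately sit at the same reported temperature for a while, so a real tool would want a longer timeout or a second signal such as power draw before maxing out the fans.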

      However... it does seem that this was fixed... but literally only because Jay fucking put Nvidia on blast. The error never should have been present in the first place, there really isn't a conceivable reason to change the fundamental way that... drivers have been reporting temps to other software for... a decade? two decades?

      And also Nvidia isn't like... widely telling people 'oh shit please do a driver rollback'... they just kind of quietly published an optional hotfix that is fairly hard to find, you'd have to have basically watched Jay's first video to even really know this problem exists...

      ...and if it affects you, because you use custom fan curves... well, it'll cause massive damage to your card.

      https://youtube.com/watch?v=W9ztK2pFe64

      ...

      Finally: AiB means 'Add in Board'.

      Any GPU is an add in board.

      I know everyone uses the term AiB to mean 'non Nvidia/AMD/Intel reference card, produced by a partner mfgr'... but everyone is wrong.

      A reference card ... is an AiB.

      • I do appreciate the pedantry about terminology.

        I do have a few questions about the findings here that I'd like to see the specialized press cover before I start recommending people go drilling holes through their backplates. The obvious one is how this holds up across a wider array of cards, since the sample in the piece is so small, but also whether undervolting would help or if that'd all be downstream from the potentially affected segments of the board.
