Resolving AMDgpu crashes on ThinkPad T14s Gen4 with Ryzen 7 7840U

August 12, 2024


I've recently upgraded from my old ThinkPad X1 Carbon 3rd Gen to a shiny ThinkPad T14s Gen4 with AMD Ryzen 7 as my main laptop. (Yes, I know, it was about time.)

Installing Debian Bookworm was rather straightforward. I've really come to appreciate Debian and KDE; together they make a much better system than the Ubuntu and GNOME combination I used a couple of years back. KDE comes with both much better defaults and better customisability than GNOME 3, and Debian strips away the bullshit from Ubuntu, like snaps. Debian was actually the first Linux I ever used, back in 2000, I think. And it's still one brilliantly smooth and stable distribution.

But let's get to the stability bit. I decided to go for AMD Ryzen for the first time in any of my computers, having heard many good things about it. The AMD Ryzen 7 7840U CPU that's in my laptop comes with an integrated GPU called Radeon 780M. And boy does it run smooth. But there was a problem. When running GPU-intensive applications, such as Godot, my system would sometimes crash irrecoverably. The same would sometimes happen when waking up from sleep or suspend.

The rest of this post is about what I like to think of as the kernel initiation ritual, that is, the debugging that's necessary to get your new laptop running stably under Linux. I'm writing down both the log details from the time of those crashes and the steps that helped me fix the issue and get a stable system again. I'm doing this in the hope that it might help someone out there with similar issues, and to document it for my future self too.

So here's what I saw in the logs right before those crashes when running GPU-heavy stuff. Notice how it's the AMDgpu driver that's bailing here:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=12466, emitted seq=12468
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Godot_v4.2.2-st pid 3090 thread Godot_v4.2.2-st pid 3090
amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue

And then apparently there were issues with waking up from suspend too:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=111004, emitted seq=111006
amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
amdgpu 0000:c3:00.0: amdgpu: Failed to disable gfxoff!
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
ath11k_pci 0000:01:00.0: msdu_done bit in attention is not set
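If you want to check your own kernel logs for the same pattern, a small script can pull out the ring timeout lines. What follows is only a sketch in Python, not something from the original debugging session: the names `RingTimeout` and `find_ring_timeouts` are made up for illustration, and the regular expression is written against the two `amdgpu_job_timedout` lines quoted above, so other kernel versions may word the message differently. The `pending` value is simply emitted minus signaled, which, as the field names suggest, is roughly how many submitted jobs the ring had not completed when the timeout fired.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class RingTimeout(NamedTuple):
    """One amdgpu ring timeout as reported by amdgpu_job_timedout."""
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        # Difference between the last emitted and last signaled sequence number.
        return self.emitted - self.signaled


# Pattern derived from the log lines quoted in this post (an assumption:
# the exact wording may differ between kernel versions).
_TIMEOUT_RE = re.compile(
    r"amdgpu_job_timedout.*ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines: Iterable[str]) -> Iterator[RingTimeout]:
    """Yield a RingTimeout for every matching line; skip everything else."""
    for line in lines:
        match = _TIMEOUT_RE.search(line)
        if match:
            yield RingTimeout(
                ring=match["ring"],
                signaled=int(match["signaled"]),
                emitted=int(match["emitted"]),
            )
```

One way to use it is to save the kernel messages of the crashed boot to a file (for example with `journalctl -k -b -1 > kernel.log`, assuming your journal is persistent) and then loop over `find_ring_timeouts(open("kernel.log"))`, printing `ring` and `pending` for each hit.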

The most relevant bug report that I eventually found online, after some back and forth, was this one, where a gentleman called Mario Limonciello had the most helpful suggestions.

I found a combination of two things to help. First, I set this parameter of the AMDgpu kernel module: amdgpu.vm_update_mode=3, which I did by adjusting the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub to something like this:

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=off amdgpu.mcbp=0 amdgpu.vm_update_mode=3"

(Note that the other parameters were already set in there.)

I then ran sudo update-grub to update the GRUB configuration file.
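After rebooting, it can be worth confirming that the new parameter actually reached the kernel. The following is a minimal Python sketch of such a check, not part of the original fix. The helper names `parse_cmdline`, `missing_params` and `read_cmdline` are hypothetical, and the sketch assumes the running kernel's command line is readable from `/proc/cmdline`, as it normally is on Linux. As far as the amdgpu module parameter documentation describes it, `vm_update_mode` controls whether the CPU is used for VM page table updates, with 3 selecting both graphics and compute; treat that reading as an assumption and check the documentation for your kernel version.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Flags without '=' (like 'quiet') map to None. Quoted values that
    contain spaces are not handled; this is a deliberate simplification.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline: str, expected: Dict[str, str]) -> Dict[str, str]:
    """Return the expected parameters that are absent or set differently."""
    actual = parse_cmdline(cmdline)
    return {
        name: value
        for name, value in expected.items()
        if name not in actual or actual[name] != value
    }


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text().strip()
```

Calling `missing_params(read_cmdline(), {"amdgpu.vm_update_mode": "3"})` should return an empty dict once the change is active. With the amdgpu module loaded, reading `/sys/module/amdgpu/parameters/vm_update_mode` is another way to see the value the driver is using.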

The second change I made was in my system's BIOS: I set the UMA (Unified Memory Architecture) Frame Buffer Size to a fixed 4G rather than leaving it on dynamic allocation (Automatic). I still have to check whether this step is actually necessary, since it's the first of the two things I did before my system was stable again, but that's for another time.

And that's it! After a bit of digging and tweaking, my ThinkPad T14s Gen4 is now running smoothly with Debian Bookworm and the AMDgpu driver. It's always a bit of a ritual to get a new laptop running stably under Linux, but it's a small price to pay for the freedom and customizability that comes with it. I hope this post helps someone else out there who's struggling with similar issues.