Resolving AMDgpu crashes on ThinkPad T14s Gen4 with Ryzen 7 7840U
August 12, 2024
I recently upgraded from my old ThinkPad X1 Carbon 3rd Gen to a shiny ThinkPad T14s Gen4 with AMD Ryzen 7 as my main laptop. (Yes, I know, it was about time.)
Installing Debian Bookworm was rather straightforward. I've really come to appreciate Debian and KDE; together they make a much better system than the Ubuntu and GNOME combination I used a couple of years back. KDE comes with both much better defaults and better customisability than GNOME 3, and Debian strips away the bullshit from Ubuntu like snaps. Debian was actually the first Linux I ever used, back in 2000 I think. And it's still one brilliantly smooth and stable distribution.
But let's get to the stability bit. I decided to go for AMD Ryzen for the first time in any of my computers, having heard many good things about it. The AMD Ryzen 7 7840U CPU in my laptop comes with an integrated GPU called Radeon 780M. And boy does it run smooth. But there was a problem. When running GPU-intensive applications such as Godot, my system would sometimes crash irrecoverably. The same would sometimes happen when waking up from sleep or suspend.
Here's what the logs showed right before the crashes when running GPU-heavy stuff. Notice how it's the AMDgpu driver that's bailing here:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=12466, emitted seq=12468
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Godot_v4.2.2-st pid 3090 thread Godot_v4.2.2-st pid 3090
amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
And then apparently there were issues with waking up from suspend too:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=111004, emitted seq=111006
amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
amdgpu 0000:c3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
amdgpu 0000:c3:00.0: amdgpu: Failed to disable gfxoff!
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
amdgpu 0000:c3:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
ath11k_pci 0000:01:00.0: msdu_done bit in attention is not set
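If you want to check whether your own crashes look like these, a small filter along the following lines may help. It is only a sketch: the function name amdgpu_errors is made up for this example, the patterns are simply the message fragments from the logs above, and reading the previous boot with journalctl assumes your system keeps a persistent journal.

```shell
#!/bin/sh
# Hypothetical helper: reduce kernel log lines on stdin to the amdgpu
# failure signatures quoted in this post.
amdgpu_errors() {
    grep -E 'amdgpu_job_timedout|GPU reset begin|REG_WAIT timeout|MES failed|Failed to disable gfxoff'
}

# Usage after a crash and reboot (assumes a persistent systemd journal,
# so that the kernel messages of the previous boot are still available):
#   journalctl -k -b -1 --no-pager | amdgpu_errors
```

If the filter prints nothing for the boot that crashed, the cause is probably something other than what is described here.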
After I first published this post, which discussed some workarounds that I had applied to make the system more stable (now removed), Joe commented below with this:
I had the same problem. My solution was to use the much newer firmware and drivers from bookworm-backports non-free-firmware. Problem went away. The stock bookworm firmware is ancient.
So here's how to upgrade to newer AMD firmware. First add this to your /etc/apt/sources.list:
deb http://deb.debian.org/debian bookworm-backports main contrib non-free non-free-firmware
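If you would rather script that step than edit the file by hand, something like the sketch below should do. The helper name add_backports and the variable BACKPORTS_LINE are made up for this example. The function only appends the line when it is not already present, so it is safe to run twice.

```shell
#!/bin/sh
# The exact sources entry from this post.
BACKPORTS_LINE='deb http://deb.debian.org/debian bookworm-backports main contrib non-free non-free-firmware'

# Hypothetical helper: append the backports entry to the given file
# unless that exact line is already in it.
add_backports() {
    list_file=$1
    grep -qxF "$BACKPORTS_LINE" "$list_file" 2>/dev/null ||
        printf '%s\n' "$BACKPORTS_LINE" >> "$list_file"
}

# Usage, as root:
#   add_backports /etc/apt/sources.list
```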
Then run as root:
apt update
apt -t bookworm-backports install firmware-amd-graphics
reboot
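To confirm that the firmware package really came from backports, you can look at its version string, since Debian backports versions carry a "~bpo" marker. The check below is a minimal sketch and the function name is_backport_version is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Debian version string carries the
# "~bpo" marker used by backports uploads.
is_backport_version() {
    case "$1" in
        *~bpo*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage on the laptop itself (dpkg-query prints the installed version):
#   ver=$(dpkg-query -W -f='${Version}' firmware-amd-graphics)
#   is_backport_version "$ver" && echo "backports firmware installed: $ver"
```

If the check fails, apt probably kept the stock Bookworm package, and it is worth re-running the install command with the -t bookworm-backports option.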
And that's it! After a bit of digging and tweaking, my ThinkPad T14s Gen4 is now running smoothly with Debian Bookworm and the AMDgpu driver. It's always a bit of a ritual to get a new laptop running stably under Linux, but it's a small price to pay for the freedom and customisability that come with it. I hope this post helps someone else out there who's struggling with similar issues.