Kernel oops on amdgpu load (Polaris / Vega) #44

@madscientist159

On a ppc64el system running this kernel with a WX 7100 (Polaris) card, loading the amdgpu module triggers a kernel oops. The upstream Linux 4.15 amdgpu module works and brings up a full graphical environment, so the oops is specific to the AMD 4.13 kernel. The oops follows:

[   89.848698] checking generic (600c280010000 500000) vs hw (6000000000000 10000000)
[   89.848800] amdgpu 0000:01:00.0: enabling device (0140 -> 0142)
[   89.915446] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67C4 0x1002:0x0B0D 0x00).
[   89.965406] [drm] register mmio base: 0x00000000
[   89.965458] [drm] register mmio size: 262144
[   89.965502] [drm] PCI I/O BAR is not found.
[   89.965540] [drm] probing gen 2 caps for device 1014:4c1 = 300104/180001e
[   89.965584] [drm] probing mlw for device 1014:4c1 = 300104
[   89.965631] [drm] UVD is enabled in VM mode
[   89.965658] [drm] VCE enabled in VM mode
[   90.299090] [drm] PCI I/O BAR is not found. Using MMIO to access ATOM BIOS
[   90.299092] ATOM BIOS: 113-C9540101-100
[   90.299103] [drm] GPU post is not needed
[   90.299130] [drm] vm size is 64 GB, block size is 13-bit, fragment size is 9-bit
[   90.299147] amdgpu: No suitable DMA available
[   92.836890] amdgpu 0000:01:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[   92.836969] amdgpu 0000:01:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[   92.837021] [drm] Detected VRAM RAM=8192M, BAR=256M
[   92.837056] [drm] RAM width 256bits GDDR5
[   92.837183] [TTM] Zone  kernel: Available graphics memory: 7471346 kiB
[   92.837227] [TTM] Initializing pool allocator
[   92.837289] [drm] amdgpu: 8192M of VRAM memory ready
[   92.837325] [drm] amdgpu: 8192M of GTT memory ready.
[   92.837383] [drm] GART: num cpu pages 65536, num gpu pages 65536
[   92.837555] [drm] PCIE GART of 256M enabled (table at 0x000000F400040000).
[   92.837607] amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
[   92.837651] amdgpu 0000:01:00.0: (-12) create WB bo failed
[   92.837829] [drm:amdgpu_device_init [amdgpu]] *ERROR* amdgpu_wb_init failed -12
[   92.837912] amdgpu 0000:01:00.0: amdgpu_init failed
[   92.838002] Unable to handle kernel paging request for data at address 0xc00c000085a80000
[   92.838066] Faulting instruction address: 0xc008000005a2f1cc
[   92.838122] Oops: Kernel access of bad area, sig: 11 [#1]
[   92.838166] SMP NR_CPUS=2048
[   92.838168] NUMA
[   92.838200] PowerNV
[   92.838257] Modules linked in: amdgpu(+) mfd_core ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_algo_bit i2c_dev ghash_generic gf128mul ecb snd_hda_codec_hdmi snd_hda_intel xts snd_hda_codec joydev ofpart ctr evdev ipmi_powernv powernv_flash ipmi_devintf cbc snd_hda_core vmx_crypto mtd snd_hwdep ipmi_msghandler at24 opal_prd binfmt_misc snd_aloop snd_pcm snd_timer snd soundcore parport_pc lp parport ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc fscache hid_generic usbhid hid xhci_pci xhci_hcd usbcore tg3 ptp pps_core libphy
[   92.838719] CPU: 0 PID: 971 Comm: kworker/0:1 Not tainted 4.13.0+ #1
[   92.838778] Workqueue: events work_for_cpu_fn
[   92.838823] task: c0000001d35c4700 task.stack: c0000001d35c8000
[   92.838876] NIP: c008000005a2f1cc LR: c0080000059a036c CTR: c008000005a2f178
[   92.838940] REGS: c0000001d35cb4c0 TRAP: 0300   Not tainted  (4.13.0+)
[   92.838993] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[   92.839003]   CR: 28002288  XER: 20040000
[   92.839079] CFAR: c008000005a2f1ac DAR: c00c000085a80000 DSISR: 42000000 SOFTE: 1
               GPR00: c0080000059a036c c0000001d35cb740 c008000005c5bde0 c000000009be0000
               GPR04: c00c000085a80000 0000000000000000 0000000000080000 0000000000000000
               GPR08: 0000000000000001 c008000005a2f178 0000000000000001 c008000004a1e5d8
               GPR12: c008000005a2f178 c00000000fb80000 c000000000128568 c000000009be2f20
               GPR16: c000000009be2f28 c000000009be2f18 c000000009be2f38 c000000009be2f40
               GPR20: c000000009be2f30 0000000000008000 0000000000000400 c000000009be2f38
               GPR24: c000000009be2f40 c000000009be2f30 c000000009be2f18 0000000000000000
               GPR28: 0000000000000000 0000000000000000 c00c000085a80000 0000000000080000
[   92.839738] NIP [c008000005a2f1cc] gmc_v8_0_gart_set_pte_pde+0x54/0x90 [amdgpu]
[   92.839914] LR [c0080000059a036c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
[   92.839968] Call Trace:
[   92.839992] [c0000001d35cb740] [c000000009be2720] 0xc000000009be2720 (unreliable)
[   92.840138] [c0000001d35cb780] [c0080000059a036c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
[   92.840290] [c0000001d35cb800] [c0080000059a06e8] amdgpu_gart_fini+0x40/0x70 [amdgpu]
[   92.840447] [c0000001d35cb830] [c008000005a30b98] gmc_v8_0_sw_fini+0x50/0x90 [amdgpu]
[   92.840593] [c0000001d35cb860] [c00800000597f1d0] amdgpu_fini+0x208/0x560 [amdgpu]
[   92.840741] [c0000001d35cb910] [c008000005985b5c] amdgpu_device_init+0xcc4/0x1590 [amdgpu]
[   92.840889] [c0000001d35cba30] [c0080000059880fc] amdgpu_driver_load_kms+0xb4/0x2d0 [amdgpu]
[   92.840976] [c0000001d35cbab0] [c0080000044cab7c] drm_dev_register+0x1d4/0x290 [drm]
[   92.841121] [c0000001d35cbb50] [c00800000597d880] amdgpu_pci_probe+0x128/0x1f0 [amdgpu]
[   92.841228] [c0000001d35cbbd0] [c0000000005d851c] local_pci_probe+0x6c/0x140
[   92.841296] [c0000001d35cbc60] [c0000000001199d8] work_for_cpu_fn+0x38/0x60
[   92.843968] [c0000001d35cbc90] [c00000000011ead8] process_one_work+0x248/0x520
[   92.848119] [c0000001d35cbd30] [c00000000011f030] worker_thread+0x280/0x5d0
[   92.851012] [c0000001d35cbdc0] [c00000000012870c] kthread+0x1ac/0x1c0
[   92.851102] [c0000001d35cbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c
[   92.851209] Instruction dump:
[   92.851231] 7cdf3378 7c9e2378 7cbd2b78 7cfc3b78 48000008 e8410018 7be6c6c4 7bbd1828
[   92.852725] 78c64602 7fdeea14 7cc6e378 7c0004ac <f8de0000> 39200001 38600000 992d019c
[   92.852815] ---[ end trace 2915333da62340c0 ]---
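
For anyone triaging the log above: the `-12` values in the "failed to allocate kernel bo" and "amdgpu_wb_init failed" lines are negative errno codes, and 12 is ENOMEM on Linux. The following is only a minimal sketch for pulling those codes out of dmesg lines, assuming Python 3; `decode_kernel_errno` and `errnos_in_log` are hypothetical helper names, not part of any kernel or amdgpu tooling, and the regex only covers the two message forms that appear in this log.

```python
import errno
import os
import re

# Matches the two forms seen in this log:
#   "(-12) failed to allocate kernel bo"  and  "amdgpu_wb_init failed -12"
ERR_RE = re.compile(r"\((-\d+)\)|failed (-\d+)\b")

def decode_kernel_errno(code):
    """Return (symbolic name, message) for a negative kernel return code."""
    n = abs(code)
    return errno.errorcode.get(n, "UNKNOWN"), os.strerror(n)

def errnos_in_log(lines):
    """Yield (code, symbolic name) for each errno-style value in dmesg lines."""
    for line in lines:
        for m in ERR_RE.finditer(line):
            code = int(m.group(1) or m.group(2))
            yield code, decode_kernel_errno(code)[0]

# Two lines copied from the oops report (timestamps dropped).
sample = [
    "amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo",
    "[drm:amdgpu_device_init [amdgpu]] *ERROR* amdgpu_wb_init failed -12",
]
print(list(errnos_in_log(sample)))
```

This only confirms that the init path failed with an out-of-memory style error before the oops in the teardown path (`amdgpu_fini` -> `amdgpu_gart_unbind`); it does not by itself explain why the allocation failed.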

EDIT: lspci output for the AMD card:

        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon Pro WX 7100]
        Flags: fast devsel, IRQ 24, NUMA node 0
        Memory at 6000000000000 (64-bit, prefetchable) [size=256M]
        Memory at 6000010000000 (64-bit, prefetchable) [size=2M]
        I/O ports at <unassigned> [disabled]
        Memory at 600c000000000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 600c000040000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] #15
        Capabilities: [270] #19
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [370] L1 PM Substates
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
