Sorry but the speculation in the chosen answer is misleading, and leaves out the most important aspect, which is address translation via page tables.
It is true that when any PC-compatible machine boots it starts out in "real mode". And modern 32-bit operating systems on x86 do run in "protected mode", which includes segmented addressing as defined by the GDT. However they then also enable page table-based address translation by setting the PG (paging) bit, bit 31, in CR0. This is never turned off for the life of the OS.
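The actual write to CR0 is a privileged `mov` in the kernel's startup assembly, but the bit manipulation itself is trivial. A sketch (the `enable_paging` helper and the simulated register value are purely illustrative, not real kernel code):

```c
#include <stdint.h>

#define CR0_PE (1u << 0)   /* protection enable: real mode -> protected mode */
#define CR0_PG (1u << 31)  /* paging enable: turn on page-table translation */

/* The real thing is `mov %eax, %cr0` in startup assembly; this helper
 * only models the bit manipulation on a CR0 value. */
uint32_t enable_paging(uint32_t cr0)
{
    return cr0 | CR0_PG;
}
```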
Also, in most modern 32-bit operating systems, GDT-based segmented addressing is essentially bypassed: all of the commonly used GDT entries (GDTEs) are set up with base address 0 and a size of 4 GiB. So although the MMU does go through the motions of adding the relevant segment "base address" to the "displacement" that comes from the instruction, this is effectively a no-op. A different set of GDTEs is used for ring 0 vs. ring 3, but they all have the same base address and size; the "descriptor privilege level" fields (0 vs. 3) are about all that differs. This lets the "user/supervisor" bit in the page table entries be effective, allowing memory to be protected for kernel-only or user+kernel access on a page-by-page basis. That is not possible with segment descriptors; they are far too coarse-grained.
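To make the flat model concrete, here is a sketch of how such descriptors are encoded (field layout per the x86 architecture manuals; `make_descriptor` is an illustrative helper, not OS code). Note that the ring-0 and ring-3 code descriptors differ only in the DPL bits:

```c
#include <stdint.h>

/* Encode an 8-byte x86 segment descriptor (layout per the Intel SDM). */
uint64_t make_descriptor(uint32_t base, uint32_t limit,
                         uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);              /* limit 15:0  */
    d |= (uint64_t)(base & 0xFFFFu) << 16;         /* base 15:0   */
    d |= (uint64_t)((base >> 16) & 0xFFu) << 32;   /* base 23:16  */
    d |= (uint64_t)access << 40;                   /* type, DPL, present */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;   /* limit 19:16 */
    d |= (uint64_t)(flags & 0xFu) << 52;           /* G, D/B, L, AVL */
    d |= (uint64_t)((base >> 24) & 0xFFu) << 56;   /* base 31:24  */
    return d;
}

/* Flat model: base 0, limit 0xFFFFF with 4 KiB granularity (flags = 0xC),
 * i.e. each segment spans the whole 4 GiB address space. */
#define KERNEL_CODE make_descriptor(0, 0xFFFFF, 0x9A, 0xC) /* DPL 0 */
#define USER_CODE   make_descriptor(0, 0xFFFFF, 0xFA, 0xC) /* DPL 3 */
```

The two descriptors come out as 0x00CF9A000000FFFF and 0x00CFFA000000FFFF; XORing them shows the only difference is the two DPL bits.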
In x64 CPUs the segmentation mechanism essentially disappears while in long mode. Page table-based address translation of course still happens as long as the PG bit is set, which it is throughout the life of the OS. The MMU is most emphatically not disabled while in kernel mode, nor does the OS (or anything else) use a 1:1 mapping between virtual and physical addresses.
Accesses to known physical addresses, like those assigned to PCI-like peripherals, are done by allocating unused virtual addresses and setting up the corresponding page table entries with the required physical page numbers. Code in device drivers then uses the virtual addresses.
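A sketch of what such a page table entry looks like in the classic 32-bit (non-PAE) format. `make_mmio_pte` is a hypothetical helper; a real driver would call something like Linux's `ioremap()` rather than building PTEs by hand:

```c
#include <stdint.h>

/* Classic 32-bit (non-PAE) page table entry bits. */
#define PTE_P   (1u << 0)   /* present */
#define PTE_RW  (1u << 1)   /* writable */
#define PTE_US  (1u << 2)   /* user/supervisor: clear = kernel-only */
#define PTE_PCD (1u << 4)   /* cache disable -- typical for device memory */

/* Hypothetical helper: build a kernel-only, uncached PTE pointing at a
 * device's physical page. */
uint32_t make_mmio_pte(uint32_t phys_page_addr)
{
    return (phys_page_addr & 0xFFFFF000u) | PTE_P | PTE_RW | PTE_PCD;
}
```

Note the U/S bit is left clear, so user-mode code cannot touch the device registers even though they now have virtual addresses.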
Yes, DMA primarily works on physical addresses. A dumb/cheap DMA controller indeed just transfers to a physically contiguous buffer with a given start address and length. To support such devices either the OS or device driver will allocate physically contiguous "bounce buffers" for the DMA device to access, and copy data between there and the user's buffer.
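A sketch of the write side of that bounce-buffer path (`struct bounce` and `dma_write_via_bounce` are hypothetical names; the actual device programming is hardware-specific and omitted):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bounce-buffer descriptor: `vaddr` is the kernel virtual
 * address of a physically contiguous allocation of `len` bytes. */
struct bounce {
    void  *vaddr;
    size_t len;
};

/* Write path: stage the user's data in the contiguous bounce buffer,
 * then (not shown -- hardware-specific) program the DMA controller with
 * the bounce buffer's physical start address and transfer length. */
int dma_write_via_bounce(struct bounce *b, const void *user_buf, size_t n)
{
    if (n > b->len)
        return -1;              /* caller must split the transfer */
    memcpy(b->vaddr, user_buf, n);
    return 0;
}
```

The read direction is symmetric: the device DMAs into the bounce buffer, then the driver copies the data out to the user's buffer.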
Smart/more expensive DMA controllers can handle buffers that occupy discontiguous ranges of physical addresses (referred to as "scatter-gather mapping"). These are much preferred for high-performance devices.
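A sketch of building such a scatter-gather list from a buffer's per-page physical addresses, coalescing physically adjacent pages into a single run (the `sg_entry` layout here is invented; real formats are device-specific):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical scatter-gather entry. */
struct sg_entry {
    uint64_t phys;  /* physical start address of this run */
    uint32_t len;   /* length in bytes */
};

/* Build an SG list from per-page physical addresses, merging adjacent
 * pages. Returns the number of entries written (at most `npages`). */
size_t build_sg(const uint64_t *page_phys, size_t npages, struct sg_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < npages; i++) {
        if (n > 0 && out[n - 1].phys + out[n - 1].len == page_phys[i]) {
            out[n - 1].len += PAGE_SIZE;   /* adjacent: extend the run */
        } else {
            out[n].phys = page_phys[i];
            out[n].len  = PAGE_SIZE;
            n++;
        }
    }
    return n;
}
```

The driver hands the resulting list to the controller, which walks it entry by entry instead of requiring one contiguous region.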
An IOMMU can allow stupid/cheap DMA controllers to access a physically discontiguous buffer as if it was contiguous. However, platforms with IOMMUs are not yet ubiquitous enough to say "your platform must have an IOMMU for our OS". Therefore, at present, IOMMUs are primarily used by virtual machine monitors.
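To illustrate the IOMMU's job, here is a toy model: a per-device page table translates device-visible addresses (IOVAs) to physical pages, so a contiguous IOVA range can cover scattered physical pages. The single-level table format here is invented for brevity; real IOMMUs (Intel VT-d, AMD-Vi) use multi-level tables:

```c
#include <stdint.h>

#define IOMMU_PAGES 16u

/* Toy IOMMU page table: indexed by device-visible (IOVA) page number;
 * each entry holds a physical page base address. */
static uint64_t iommu_pt[IOMMU_PAGES];

void iommu_map(uint64_t iova, uint64_t phys)
{
    iommu_pt[(iova >> 12) % IOMMU_PAGES] = phys & ~0xFFFull;
}

/* What the IOMMU does on every DMA access: translate the device's
 * address to a physical address, preserving the page offset. */
uint64_t iommu_translate(uint64_t iova)
{
    return iommu_pt[(iova >> 12) % IOMMU_PAGES] | (iova & 0xFFFu);
}
```

After mapping IOVA pages 0 and 1 to physical pages 0x9000 and 0x3000, the device sees one contiguous 8 KiB buffer even though the backing pages are scattered.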