From: Dev Jain <dev.jain@arm•com>
To: Wen Jiang <jiangwenxiaomi@gmail•com>,
linux-mm@kvack•org, linux-arm-kernel@lists•infradead.org,
catalin.marinas@arm•com, will@kernel•org,
akpm@linux-foundation•org, urezki@gmail•com
Cc: baohua@kernel•org, Xueyuan.chen21@gmail•com, rppt@kernel•org,
david@kernel•org, ryan.roberts@arm•com,
anshuman.khandual@arm•com, ajd@linux•ibm.com,
linux-kernel@vger•kernel.org, jiangwen6@xiaomi•com
Subject: Re: [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
Date: Wed, 27 May 2026 11:28:55 +0530
Message-ID: <46b0f2a7-3c0d-4372-a45d-946d8259d410@arm.com>
In-Reply-To: <20260522053146.83209-5-jiangwenxiaomi@gmail.com>
On 22/05/26 11:01 am, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel•org>
>
> vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
> provides a clean interface by taking struct page **pages and mapping them
> via direct PTE iteration. This avoids the page table rewalk seen when
> using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
>
> Extend it to support larger page_shift values, and add PMD- and
> contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
> since it now handles more than just small pages.
>
> For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
> iterate over pages one by one via vmap_range_noflush(), which would
> otherwise lead to page table rewalk. The code is now unified with the
> PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel•org>
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi•com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail•com>
> ---
> mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
> 1 file changed, 40 insertions(+), 31 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 53fd4ee460ea4..deb764abc0571 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
>
> static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> + unsigned long pfn, size;
> + unsigned int steps;
> int err = 0;
> pte_t *pte;
>
> @@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> break;
> }
>
> - set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> - (*nr)++;
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + pfn = page_to_pfn(page);
> + size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
> + steps = PFN_DOWN(size);
> + } while (pte += steps, *nr += steps, addr += size, addr != end);
>
> lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
> @@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>
> static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> pmd_t *pmd;
> unsigned long next;
> @@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> return -ENOMEM;
> do {
> next = pmd_addr_end(addr, end);
> - if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
> +
> + if (shift == PMD_SHIFT) {
> + struct page *page = pages[*nr];
> + phys_addr_t phys_addr;
> +
> + if (WARN_ON(!page))
> + return -ENOMEM;
> + if (WARN_ON(!pfn_valid(page_to_pfn(page))))
> + return -EINVAL;
I know these !page and !pfn_valid() checks were copied from vmap_pages_pte_range(),
but do they mean anything here?
I think the pfn_valid() check makes sense, in that someone may take a random VA/PA,
convert it into a struct page and pass it to the vmap layer. But I don't see how anyone
would pass page == NULL. At the very least, returning -ENOMEM does not make sense,
because the pages are not being allocated by vmap(); they have already been allocated.
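To make the error-code point concrete, here is a minimal userspace sketch, not kernel code.
It assumes -EINVAL would be the replacement for -ENOMEM on a NULL page (the patch does not
say so), and struct page, mock_page_to_pfn(), mock_pfn_valid() and MOCK_MAX_PFN are
hypothetical stand-ins for the real kernel definitions:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	unsigned long pfn;
};

/* Hypothetical limit: PFNs below this count as valid in this sketch. */
#define MOCK_MAX_PFN 1024UL

static unsigned long mock_page_to_pfn(const struct page *page)
{
	return page->pfn;
}

static int mock_pfn_valid(unsigned long pfn)
{
	return pfn < MOCK_MAX_PFN;
}

/*
 * Validate one caller-supplied page before mapping it.
 * Both failures are reported as -EINVAL (an assumption): the caller
 * handed in a bad argument, and nothing was allocated here that
 * could have run out of memory.
 */
static int check_vmap_page(const struct page *page)
{
	if (!page)
		return -EINVAL;
	if (!mock_pfn_valid(mock_page_to_pfn(page)))
		return -EINVAL;
	return 0;
}
```

Whether the NULL check should stay at all is still the open question; the sketch only
shows what the return values could look like if it does.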
> +
> + phys_addr = page_to_phys(page);
> +
> + if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
> + shift)) {
> + *mask |= PGTBL_PMD_MODIFIED;
> + *nr += 1 << (shift - PAGE_SHIFT);
> + continue;
> + }
> + }
> +
> + if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (pmd++, addr = next, addr != end);
> return 0;
> @@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>
> static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> pud_t *pud;
> unsigned long next;
> @@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> return -ENOMEM;
> do {
> next = pud_addr_end(addr, end);
> - if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
> + if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (pud++, addr = next, addr != end);
> return 0;
> @@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>
> static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> p4d_t *p4d;
> unsigned long next;
> @@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> return -ENOMEM;
> do {
> next = p4d_addr_end(addr, end);
> - if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
> + if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (p4d++, addr = next, addr != end);
> return 0;
> }
>
> -static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> - pgprot_t prot, struct page **pages)
> +static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
> + pgprot_t prot, struct page **pages, unsigned int shift)
> {
> unsigned long start = addr;
> pgd_t *pgd;
> @@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> next = pgd_addr_end(addr, end);
> if (pgd_bad(*pgd))
> mask |= PGTBL_PGD_MODIFIED;
> - err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> + err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
> if (err)
> break;
> } while (pgd++, addr = next, addr != end);
> @@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
> - unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> -
> WARN_ON(page_shift < PAGE_SHIFT);
>
> - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> - page_shift == PAGE_SHIFT)
> - return vmap_small_pages_range_noflush(addr, end, prot, pages);
> + if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
> + page_shift = PAGE_SHIFT;
>
> - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> - int err;
> -
> - err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> - page_to_phys(pages[i]), prot,
> - page_shift);
> - if (err)
> - return err;
> -
> - addr += 1UL << page_shift;
> - }
> -
> - return 0;
> + return vmap_pages_range_noflush_walk(addr, end, prot, pages,
> + min(page_shift, PMD_SHIFT));
We can easily extend this to PUD huge mappings, right? I'm not sure whether we
should keep everything symmetric with how vmap_range_noflush() operates
right now, since P4D mappings don't exist, but PUD looks worthwhile.
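For reference, the bookkeeping a PUD case would need can be sketched in userspace C.
The shift values below assume 4K pages with 2M PMD and 1G PUD blocks (e.g. arm64 with a
4K granule); clamp_shift() and pages_per_mapping() are hypothetical helpers, not
functions from the patch:

```c
/* Assumed geometry: 4K base pages, 2M PMD blocks, 1G PUD blocks. */
#define SK_PAGE_SHIFT 12U
#define SK_PMD_SHIFT  21U
#define SK_PUD_SHIFT  30U

/* Cap the requested shift at the largest level the walker handles. */
static unsigned int clamp_shift(unsigned int page_shift, unsigned int max_shift)
{
	return page_shift < max_shift ? page_shift : max_shift;
}

/* Number of pages[] entries consumed by one mapping of the given shift. */
static unsigned long pages_per_mapping(unsigned int shift)
{
	return 1UL << (shift - SK_PAGE_SHIFT);
}
```

Under these assumptions, with the cap at SK_PMD_SHIFT as in the patch, a 1G request is
mapped as 512 PMD entries, each advancing *nr by 512. Raising the cap to SK_PUD_SHIFT
would let a single PUD entry advance it by 262144.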
> }
>
> int vmap_pages_range_noflush(unsigned long addr, unsigned long end,