Implementing Markdown for Agents Without Cloudflare

Markdown is becoming the new API for websites. Not for humans - for machines. AI agents, LLM-powered tools, and crawlers all need to consume web content, and HTML is a terrible format for that. It is full of layout noise, scripts, and styling that waste tokens and confuse models.

Cloudflare introduced Markdown for Agents to address this problem:

Feeding raw HTML to an AI is like paying by the word to read packaging instead of the letter inside. A simple ## About Us on a page in markdown costs roughly 3 tokens; its HTML equivalent – <h2 class="section-title" id="about">About Us</h2> – burns 12-15, and that's before you account for the <div> wrappers, nav bars, and script tags that pad every real web page and have zero semantic value.

Their solution is elegant: same URL, same content, but the server returns markdown instead of HTML when the client sends an Accept: text/markdown header. It is standard HTTP content negotiation - a simple idea with a big impact on how AI tools read your site. Cloudflare's implementation requires their Pro plan, but you don't actually need it.

This post walks through a production implementation I built for a client on Nuxt (prerendered) and Vercel that does the same thing at build time, with zero runtime cost. Every URL on the site returns two representations of the same resource: a browser gets HTML, and an agent sending Accept: text/markdown gets clean markdown with YAML frontmatter. Nothing is converted at request time - during the build, a content.md file is generated alongside each index.html, and the hosting layer decides which file to serve based on the request header. No server-side rendering, no edge functions, no additional compute per request.
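To make the negotiation concrete, here is a minimal sketch of the agent side of the exchange. It assumes a runtime with a global fetch (Node 18+ or a browser). The helper names and the exact Accept value are illustrative, not taken from any particular agent.

```typescript
// Hypothetical helper: the Accept value an agent might send.
// text/markdown is listed first, so a server that matches on the
// start of the header value will recognise the request.
function agentAcceptHeader(): string {
  return 'text/markdown, text/html;q=0.5, */*;q=0.1';
}

// Hypothetical helper: request a page as markdown and fail loudly if the
// server ignored the header and answered with another content type.
async function fetchAsMarkdown(url: string): Promise<string> {
  const response = await fetch(url, { headers: { Accept: agentAcceptHeader() } });
  const contentType = response.headers.get('content-type') ?? '';
  if (!contentType.startsWith('text/markdown')) {
    throw new Error(`Expected text/markdown, got "${contentType}"`);
  }
  return response.text();
}
```

The URL is the same one a browser would open. Only the request header differs.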

If you want the business case and token math, read the companion article on coditive.com. This article is the how.

So how does this actually work?

1. Nuxt prerenders all pages to static HTML (nuxt generate).
2. A custom Nuxt module hooks into the build's prerender:done event.
3. For each HTML file: extract metadata from <head>, strip noise, isolate <main>, convert to markdown via Turndown, prepend YAML frontmatter.
4. Write content.md next to the corresponding index.html.
5. Vercel routes config serves the right file based on the Accept header.
6. Vary: Accept header ensures correct CDN caching.

Where does the markdown conversion happen?

The module plugs into the Nuxt build lifecycle using defineNuxtModule. It registers a hook on Nitro's prerender:done event, which fires after all static HTML files are written to disk. I also register close as a safety net for build modes where prerendering is not used - in that case prerender:done never fires, but close still does.

import { defineNuxtModule } from '@nuxt/kit';

export default defineNuxtModule<{ extractMainContent?: boolean }>({
  meta: { name: 'markdown-for-agents', configKey: 'markdownForAgents' },
  defaults: { extractMainContent: false },
  setup(options, nuxt) {
    let completed = false;

    const generate = async (nitro: any) => {
      if (completed) {
        return;
      }
      // ... conversion logic
      completed = true;
    };

    (nuxt as any).hook('nitro:init', (nitro: any) => {
      nitro.hooks.hook('prerender:done', () => generate(nitro));
      nitro.hooks.hook('close', () => generate(nitro));
    });
  },
});

The completed flag prevents double execution if both hooks fire. Register the module in nuxt.config.ts:

modules: [
  ['./modules/markdown-for-agents', { extractMainContent: true }],
]
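The run-once guard behind the completed flag can also be written as a small wrapper. This is an alternative sketch, not the module's actual code, and runOnce is a hypothetical helper name. It differs from the flag in one way: it remembers the in-flight promise, so a second call that arrives while the first is still running reuses the same work instead of starting again.

```typescript
// Hypothetical helper: wrap an async task so it runs at most once.
// Later calls receive the promise created by the first call.
function runOnce<T>(task: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = task();
    }
    return pending;
  };
}

// Illustration: both "hooks" call the same wrapped function.
let conversions = 0;
const generateOnce = runOnce(async () => {
  conversions += 1; // stands in for the real conversion work
});

void generateOnce(); // e.g. triggered by prerender:done
void generateOnce(); // e.g. triggered by close
```

One trade-off to be aware of: because the promise is cached, a failed first run is not retried by the second hook.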

Cleaning the HTML

Raw HTML has a lot of elements that are noise for agents. I strip them in stages before conversion. The regex approach below works well for typical prerendered output where tags like <main> or <nav> are not nested inside themselves. For more complex HTML structures, a DOM parser like cheerio or linkedom would be more robust.

// Stage 1: Remove script, style, noscript
function stripNoisyTags(html: string) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, '')
    .replace(/<noscript\b[^>]*>[\s\S]*?<\/noscript>/gi, '')
}

// Stage 2: Extract <main> or <article> if available
function extractMainContent(html: string) {
  const main = html.match(/<main\b[^>]*>[\s\S]*?<\/main>/i)
  if (main?.[0]) return main[0]
  const article = html.match(/<article\b[^>]*>[\s\S]*?<\/article>/i)
  if (article?.[0]) return article[0]
  return html
}

// Stage 3: Strip layout chrome
function stripLayoutChrome(html: string) {
  return html
    .replace(/<header\b[^>]*>[\s\S]*?<\/header>/gi, '')
    .replace(/<nav\b[^>]*>[\s\S]*?<\/nav>/gi, '')
    .replace(/<footer\b[^>]*>[\s\S]*?<\/footer>/gi, '')
    .replace(/<aside\b[^>]*>[\s\S]*?<\/aside>/gi, '')
    .replace(/<form\b[^>]*>[\s\S]*?<\/form>/gi, '')
}
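The three stages compose into a single pass. The sketch below re-implements them compactly so it can run on its own. stripTags, extractMain, and cleanHtml are illustrative names, and the sample HTML is made up. The last lines also show the nesting caveat mentioned above.

```typescript
// Remove every occurrence of the given paired tags, including their content.
function stripTags(html: string, tags: string[]): string {
  return tags.reduce(
    (out, tag) => out.replace(new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, 'gi'), ''),
    html,
  );
}

// Prefer <main>, then <article>, then fall back to the whole document.
function extractMain(html: string): string {
  const match =
    html.match(/<main\b[^>]*>[\s\S]*?<\/main>/i) ??
    html.match(/<article\b[^>]*>[\s\S]*?<\/article>/i);
  return match ? match[0] : html;
}

// Stage 1 -> Stage 2 -> Stage 3, in the same order as the functions above.
function cleanHtml(html: string): string {
  const withoutScripts = stripTags(html, ['script', 'style', 'noscript']);
  const main = extractMain(withoutScripts);
  return stripTags(main, ['header', 'nav', 'footer', 'aside', 'form']);
}

const sample =
  '<html><head><style>p{color:red}</style></head><body>' +
  '<nav><a href="/">Home</a></nav>' +
  '<main><h1>Services</h1><aside>Related links</aside>' +
  '<p>We build things.</p><script>track()</script></main>' +
  '<footer>Contact</footer></body></html>';

const cleaned = cleanHtml(sample);
// cleaned === '<main><h1>Services</h1><p>We build things.</p></main>'

// The nesting caveat: the lazy match stops at the first closing tag,
// so a nested element leaves residue behind.
const nestedResidue = stripTags('<aside>a<aside>b</aside>c</aside>', ['aside']);
// nestedResidue === 'c</aside>'
```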

Converting to markdown

I use Turndown with ATX-style headings and fenced code blocks:

import TurndownService from 'turndown'

function toMarkdown(html: string) {
  const td = new TurndownService({
    headingStyle: 'atx',
    codeBlockStyle: 'fenced',
  })
  return `${td.turndown(html).trim()}\n`
}

Metadata and frontmatter

Each markdown file gets a YAML frontmatter block with title, description, and image, extracted from the page's <meta> tags. For title I check og:title first, then twitter:title, then <title>. Description starts with the standard <meta name="description">, then falls back to og:description and twitter:description. Image follows the OG-first pattern: og:image, then twitter:image.

function readPageMetadata(html: string): PageMetadata {
  const head = extractHead(html)
  return {
    title: getMetaContent(head, 'property', 'og:title')
      || getMetaContent(head, 'name', 'twitter:title')
      || extractTitle(head),
    description: getMetaContent(head, 'name', 'description')
      || getMetaContent(head, 'property', 'og:description')
      || getMetaContent(head, 'name', 'twitter:description'),
    image: getMetaContent(head, 'property', 'og:image')
      || getMetaContent(head, 'name', 'twitter:image'),
  }
}
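readPageMetadata relies on a few helpers that are not shown above. The following is one possible regex-based sketch of them, under the same assumption as the cleaning step: well-formed prerendered HTML with quoted attribute values. The bodies, along with readAttribute and decodeEntities, are my illustration rather than the module's actual code.

```typescript
interface PageMetadata {
  title?: string;
  description?: string;
  image?: string;
}

// Return the inner HTML of <head>, or the whole document if none is found.
function extractHead(html: string): string {
  const match = html.match(/<head\b[^>]*>([\s\S]*?)<\/head>/i);
  return match ? match[1] : html;
}

// Read a quoted attribute value from a single tag, e.g. content="...".
function readAttribute(tag: string, name: string): string | undefined {
  const match = tag.match(new RegExp(`\\s${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, 'i'));
  return match ? (match[1] ?? match[2]) : undefined;
}

// Decode the handful of entities that typically appear in meta attributes.
function decodeEntities(text: string): string {
  return text
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

// Find <meta {attr}="{value}" content="..."> regardless of attribute order.
function getMetaContent(
  head: string,
  attr: 'name' | 'property',
  value: string,
): string | undefined {
  const tags = head.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of tags) {
    if (readAttribute(tag, attr)?.toLowerCase() === value.toLowerCase()) {
      const content = readAttribute(tag, 'content')?.trim();
      if (content) return decodeEntities(content);
    }
  }
  return undefined;
}

// Text of the <title> element, or undefined when it is missing or empty.
function extractTitle(head: string): string | undefined {
  const match = head.match(/<title\b[^>]*>([\s\S]*?)<\/title>/i);
  const title = match?.[1].trim();
  return title || undefined;
}

const sampleHead = extractHead(
  '<html><head><title>Fallback</title>' +
    '<meta content="Our services" property="og:title">' +
    '<meta name="description" content="What we do &amp; how">' +
    '</head><body></body></html>',
);
const ogTitle = getMetaContent(sampleHead, 'property', 'og:title'); // "Our services"
const description = getMetaContent(sampleHead, 'name', 'description'); // "What we do & how"
const ogImage = getMetaContent(sampleHead, 'property', 'og:image'); // undefined
```

Returning undefined for missing or empty values is what makes the || fallback chain in readPageMetadata work.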

The markdown is then generated from the final rendered HTML, not from the CMS source. This matters - you get the same content that the browser renders, not whatever the CMS stored before template processing.
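The frontmatter step itself is small. Here is a hedged sketch of how the three metadata fields might be serialized and prepended. buildFrontmatter and withFrontmatter are hypothetical names. Values are quoted with JSON.stringify, which works because a JSON string is also a valid YAML double-quoted scalar, and that avoids hand-escaping colons and quotes in titles.

```typescript
interface PageMetadata {
  title?: string;
  description?: string;
  image?: string;
}

// Serialize the available fields as a YAML frontmatter block.
// Fields that are missing or empty are skipped entirely.
function buildFrontmatter(meta: PageMetadata): string {
  const lines = (['title', 'description', 'image'] as const)
    .filter((key) => Boolean(meta[key]))
    .map((key) => `${key}: ${JSON.stringify(meta[key])}`);
  return lines.length > 0 ? `---\n${lines.join('\n')}\n---\n\n` : '';
}

// Prepend the frontmatter to the converted markdown body.
function withFrontmatter(meta: PageMetadata, markdown: string): string {
  return buildFrontmatter(meta) + markdown;
}

const page = withFrontmatter(
  { title: 'About: us', description: 'Who we are' },
  '# About\n',
);
// page is:
// ---
// title: "About: us"
// description: "Who we are"
// ---
//
// # About
```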

The result: content.md next to every index.html

This is what the whole build step is for. Every index.html that Nuxt prerenders gets a sibling content.md file in the same folder. If /services/example/index.html exists, so does /services/example/content.md. That's the shape I wanted, because it makes the hosting layer's job almost boring: check one header and pick one of two files that are guaranteed to be there.

If you are not using Nuxt, the same logic works as a standalone Node.js script that runs after your SSG build. The only Nuxt-specific part is the hook registration. The rest - file discovery, cleaning, conversion, writing - is plain Node.
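As a rough sketch of that standalone variant, the script below walks an output directory, finds every index.html, and writes a sibling content.md. The converter is passed in as a function, so the sketch stays independent of Turndown. All names here (HtmlToMarkdown, markdownPathFor, findIndexFiles, generateMarkdownFiles) are hypothetical.

```typescript
import { readdir, readFile, writeFile } from 'node:fs/promises';
import { dirname, join } from 'node:path';

// Hypothetical converter signature: full HTML document in, markdown out.
type HtmlToMarkdown = (html: string) => string;

// Map ".../index.html" to its sibling ".../content.md".
function markdownPathFor(htmlPath: string): string {
  return join(dirname(htmlPath), 'content.md');
}

// Recursively collect every index.html under the output directory.
async function findIndexFiles(dir: string): Promise<string[]> {
  const entries = await readdir(dir, { withFileTypes: true });
  const found: string[] = [];
  for (const entry of entries) {
    const fullPath = join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...(await findIndexFiles(fullPath)));
    } else if (entry.name === 'index.html') {
      found.push(fullPath);
    }
  }
  return found;
}

// Convert each page and write the result next to it.
// Returns the number of markdown files written.
async function generateMarkdownFiles(
  outputDir: string,
  convert: HtmlToMarkdown,
): Promise<number> {
  const htmlFiles = await findIndexFiles(outputDir);
  for (const htmlPath of htmlFiles) {
    const html = await readFile(htmlPath, 'utf8');
    await writeFile(markdownPathFor(htmlPath), convert(html), 'utf8');
  }
  return htmlFiles.length;
}
```

You would call generateMarkdownFiles once after the SSG build finishes, pointing it at the build's output directory and passing a converter that chains the cleaning, Turndown, and frontmatter steps.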

How does Vercel know which file to serve?

The routing layer is where the content negotiation lives. On Vercel, I use the routes configuration in vercel.json:

{
  "routes": [
    {
      "src": "/",
      "has": [{ "type": "header", "key": "accept", "value": { "pre": "text/markdown" } }],
      "headers": {
        "Content-Type": "text/markdown; charset=utf-8",
        "Vary": "Accept"
      },
      "dest": "/content.md"
    },
    {
      "src": "/(?<path>(?!.*\\.[^/]+$).+?)/?",
      "has": [{ "type": "header", "key": "accept", "value": { "pre": "text/markdown" } }],
      "headers": {
        "Content-Type": "text/markdown; charset=utf-8",
        "Vary": "Accept"
      },
      "dest": "/$path/content.md"
    },
    {
      "src": "/",
      "headers": { "Vary": "Accept" },
      "continue": true
    },
    {
      "src": "/(?<path>(?!.*\\.[^/]+$).+?)/?",
      "headers": { "Vary": "Accept" },
      "continue": true
    }
  ]
}

A few things to unpack here.

Prefix match, not exact match. The "pre": "text/markdown" operator matches any Accept header whose value starts with text/markdown, which works because agents that specifically want markdown list it first (e.g. Accept: text/markdown, */*;q=0.1). An agent that lists another type first (e.g. text/html, text/markdown) would not match - but in practice, agents requesting markdown put it at the highest priority.

The regex. (?<path>(?!.*\.[^/]+$).+?)/? captures the URL path while excluding requests for static files (any path whose last segment has a file extension). Without the negative lookahead, a request for /image.png would be rewritten to /image.png/content.md, which does not exist.

Vary: Accept on HTML responses too. The last two rules with "continue": true add the Vary header to regular HTML responses. Without this, a CDN that caches the HTML version might serve it to an agent requesting markdown.

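Route order matters. Routes are evaluated top to bottom, and the first match that does not set "continue": true ends the evaluation. That is why the markdown rules come first: a terminal HTML rule placed above them would match before the Accept header check ever runs.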
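The matching behaviour of the two markdown rules can be checked locally with a small simulation. This is not Vercel's implementation. It is a sketch that assumes src patterns are matched against the whole pathname (hence the explicit ^ and $) and uses a numbered capture group in place of the named one. resolveMarkdownFile is a hypothetical name.

```typescript
// Same pattern as in vercel.json, with explicit anchors and a numbered group.
const PAGE_PATH = /^\/((?!.*\.[^\/]+$).+?)\/?$/;

// Mirrors the "pre" operator: the Accept value must start with text/markdown.
function wantsMarkdown(accept: string | undefined): boolean {
  return (accept ?? '').startsWith('text/markdown');
}

// Returns the markdown file a request would be routed to, or null when the
// request falls through to regular static file handling.
function resolveMarkdownFile(pathname: string, accept: string | undefined): string | null {
  if (!wantsMarkdown(accept)) return null;
  if (pathname === '/') return '/content.md';
  const match = PAGE_PATH.exec(pathname);
  return match ? `/${match[1]}/content.md` : null;
}

const pageHit = resolveMarkdownFile('/services/example/', 'text/markdown, */*;q=0.1');
// pageHit === '/services/example/content.md'
const rootHit = resolveMarkdownFile('/', 'text/markdown');
// rootHit === '/content.md'
const staticFile = resolveMarkdownFile('/image.png', 'text/markdown');
// staticFile === null (file extension in the last segment)
const htmlFirst = resolveMarkdownFile('/services/example', 'text/html, text/markdown');
// htmlFirst === null (markdown is not at the start of the header)
const dottedDir = resolveMarkdownFile('/v1.2/docs', 'text/markdown');
// dottedDir === '/v1.2/docs/content.md' (a dot in a non-final segment is fine)
```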

Why not just use middleware?

I evaluated four approaches:

| Approach | Trade-off |
| --- | --- |
| Cloudflare Pro | Automatic, but requires Cloudflare proxy and Pro plan. Vendor lock-in. |
| Vercel middleware | Full control with fallback logic (check if `content.md` exists). Adds runtime compute cost on every request. Not economical at scale. |
| Rewrites + renamed files | Modern config (`page.html` + `content.md` instead of `index.html`). Works, but breaks tooling that assumes `index.html` as output - build warnings, override path errors. |
| Routes (chosen) | Deterministic header-based routing. No runtime cost. Keeps standard `index.html` output. One-time migration from `rewrites`/`headers`/`redirects` to a single `routes` array. |

For a high-traffic prerendered site, routes won on both cost and reliability. The next section walks through what I tried before getting there.

What I tried before settling on `routes`

I started with Vercel rewrites in vercel.json because they are the modern, straightforward option. In practice, when both index.html and markdown lived in the same route folder under predictable names, Vercel’s filesystem resolution sometimes won over rewrites in ways that were hard to reason about, especially across nested prerendered routes.

Middleware came next. It gave the cleanest control flow and made it trivial to fall back to HTML when markdown was missing. It solved the routing logic well, but it runs on every request. For high traffic, that is a long-term cost surface I wanted to avoid.
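For reference, this is the kind of decision logic the middleware variant can express and static routing cannot: full Accept parsing with q-values, plus a fallback to HTML when the markdown file is missing. It is written as pure functions, independent of any middleware API. The names and the tie-breaking policy are my assumptions.

```typescript
interface AcceptEntry {
  type: string;
  q: number;
}

// Parse an Accept header into media types with their quality values.
function parseAccept(header: string): AcceptEntry[] {
  return header
    .split(',')
    .map((part) => {
      const [type, ...params] = part.trim().split(';').map((s) => s.trim());
      const qParam = params.find((p) => p.toLowerCase().startsWith('q='));
      const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
      return { type: type.toLowerCase(), q: Number.isNaN(q) ? 0 : q };
    })
    .filter((entry) => entry.type.length > 0);
}

// Decide which representation to serve.
// Policy (an assumption): markdown wins only if it is explicitly listed and
// rated at least as high as text/html; a missing file always means HTML.
function preferredRepresentation(
  accept: string,
  markdownExists: boolean,
): 'markdown' | 'html' {
  if (!markdownExists) return 'html';
  const entries = parseAccept(accept);
  const quality = (type: string) => entries.find((e) => e.type === type)?.q ?? 0;
  const markdownQ = quality('text/markdown');
  return markdownQ > 0 && markdownQ >= quality('text/html') ? 'markdown' : 'html';
}

const agent = preferredRepresentation('text/markdown, */*;q=0.1', true);
// agent === 'markdown'
const lowPriority = preferredRepresentation('text/html, text/markdown;q=0.9', true);
// lowPriority === 'html'
const missingFile = preferredRepresentation('text/markdown', false);
// missingFile === 'html'
const browser = preferredRepresentation('text/html,application/xhtml+xml,*/*;q=0.8', true);
// browser === 'html'
```

With build-time generation, the markdownExists check becomes unnecessary, because every index.html is guaranteed to have its content.md. That guarantee is much of what makes the static approach viable.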

To drop middleware, I went back to rewrites with renamed build output: page.html plus content.md instead of index.html plus content.md. Rewrites then behaved consistently across many URLs, including deep paths. The trade-off showed up in build logs: warnings on many routes, along the lines of:

Warning: Override path "services/example-service/index.html" was not detected as an output path

That is a red flag. Renaming away from index.html fights Nuxt/Nitro’s expected output contract and can cascade into subtle breakage.

routes was the compromise that felt safest. It costs a one-time migration of vercel.json (and of any code that generates redirects or routing config), and in return it keeps conventional index.html output, header-based routing, and no ongoing middleware compute.

`routes` is labeled “legacy” in Vercel docs

Vercel’s docs describe routes as a legacy mechanism and recommend newer rewrites, redirects, headers, and related properties for most upgrades. The same documentation also says that routes exists for advanced integration scenarios. Taken together, that reads less like “scheduled for removal” and more like “power user surface” - reasonable to depend on for header-based file routing until a newer config primitive matches this behavior cleanly.

Not on Nuxt or Vercel?

The pattern is not framework-specific. Any static site generator that outputs HTML files works. Replace the Nuxt module with a post-build script that walks the output directory.

For hosting, you need header-based routing:

- Nginx: Use map $http_accept $content_suffix to select the file, then try_files to serve it.
- Apache: mod_rewrite with %{HTTP:Accept} condition.
- Netlify: Edge functions to inspect the Accept header and rewrite the response. Netlify's _headers file and redirect rules do not support conditions on arbitrary request headers.

If your site already stores content as markdown (Nuxt Content, Astro, Jekyll, Hugo), you have an even simpler path. You can skip the conversion entirely and serve the raw markdown files through a route that responds to the Accept: text/markdown header.

Is it worth shipping?

For any prerendered or static site, this is an afternoon of work. A build-time conversion step, a few routing rules, and the right cache headers are enough to deliver markdown without runtime cost, vendor dependency, or a paid plan.

If you want help implementing Markdown for Agents or optimizing your site for AI visibility more broadly, I offer this as part of my AI Web Enhancements service at coditive.ai.