Crawling & Indexing

Technical SEO starts here: deliberately controlling what search engines crawl and what they index. The two are different systems with different controls, and confusing them is behind many of the indexing bugs you will meet in the wild.

You want to…	Use	Not
Stop bots fetching a URL	`robots.txt`	`noindex` (bots must fetch the page to read it, so it doesn't stop crawling)
Keep a crawlable page out of results	`noindex` meta/header	`robots.txt` (blocks them from seeing the noindex!)
Merge duplicate URLs' signals	`rel=canonical` / redirects	blocking the duplicates

robots.txt

Lives at the root of each host (every subdomain needs its own) and controls crawler access by path prefix, with `*` and `$` wildcards for pattern matching:

public/robots.txt

User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /*?sort=        # parameter traps
 
User-agent: GPTBot        # AI training crawlers can be managed separately
Allow: /
 
Sitemap: https://example.com/sitemap.xml

In Next.js you can generate it dynamically:

app/robots.ts

import type { MetadataRoute } from "next";
 
export default function robots(): MetadataRoute.Robots {
  return {
    rules: [{ userAgent: "*", disallow: ["/api/", "/admin/"] }],
    sitemap: "https://example.com/sitemap.xml",
  };
}
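
To see why a given URL is or isn't blocked, it helps to model the matching. The sketch below is a simplified take on the rule in RFC 9309: the longest matching pattern wins, and Allow wins a tie. The `Rule` type and the function names are illustrative, not a library API, and a real parser also handles user-agent groups and percent-encoding.

```typescript
// Hypothetical rule shape for illustration; not a real library's type.
type Rule = { type: "allow" | "disallow"; pattern: string };

// Convert a robots.txt path pattern into a RegExp anchored at the path start.
// `*` matches any run of characters; a trailing `$` anchors the end.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.slice(-1) === "$";
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// Longest matching pattern wins; on a tie, allow wins. No match means allowed.
function isAllowed(pathAndQuery: string, rules: Rule[]): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (rule.pattern === "") continue; // an empty pattern matches nothing
    if (!patternToRegExp(rule.pattern).test(pathAndQuery)) continue;
    if (
      best === undefined ||
      rule.pattern.length > best.pattern.length ||
      (rule.pattern.length === best.pattern.length && rule.type === "allow")
    ) {
      best = rule;
    }
  }
  return best === undefined || best.type === "allow";
}

// The `User-agent: *` group from the file above.
const rules: Rule[] = [
  { type: "disallow", pattern: "/api/" },
  { type: "disallow", pattern: "/admin/" },
  { type: "disallow", pattern: "/*?sort=" },
];
```

Under this model `/*?sort=` only matches when `sort` is the first query parameter, so `/products?page=2&sort=price` is still crawlable. If that matters, add a second pattern such as `/*&sort=`.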

XML sitemaps

A sitemap is a machine-readable list of URLs you want crawled - a discovery aid, not a ranking factor. Rules:

Include only canonical, indexable, 200-status URLs. A sitemap full of redirects and noindexed pages erodes trust in the file.
At most 50,000 URLs and 50 MB (uncompressed) per file; use a sitemap index beyond that.
lastmod should be honest - engines use it to prioritize recrawls and learn to ignore it when it lies.

app/sitemap.ts

import type { MetadataRoute } from "next";
import { getAllPosts } from "@/lib/posts";
 
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const posts = await getAllPosts();
  return [
    { url: "https://example.com", lastModified: new Date() },
    ...posts.map((post) => ({
      url: `https://example.com/blog/${post.slug}`,
      lastModified: post.updatedAt,
    })),
  ];
}

Submit it once in Search Console; afterwards engines refetch it on their own schedule.
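
The "canonical, indexable, 200-status" rule is easy to enforce in code before entries reach the sitemap. A minimal sketch, assuming a hypothetical `PageRecord` shape that your CMS or crawl data would need to provide:

```typescript
// Hypothetical page record; adapt the fields to whatever your data source exposes.
type PageRecord = {
  url: string;
  status: number; // last known HTTP status
  noindex: boolean; // true if the page sends a noindex directive
  canonicalUrl: string; // the URL the page declares as canonical
  updatedAt: Date;
};

type SitemapEntry = { url: string; lastModified: Date };

const MAX_URLS_PER_FILE = 50_000;

// Keep only self-canonical, indexable pages that return 200.
function toSitemapEntries(pages: PageRecord[]): SitemapEntry[] {
  const entries = pages
    .filter((p) => p.status === 200 && !p.noindex && p.canonicalUrl === p.url)
    .map((p) => ({ url: p.url, lastModified: p.updatedAt }));
  if (entries.length > MAX_URLS_PER_FILE) {
    throw new Error(`${entries.length} URLs exceed the per-file limit; split into shards`);
  }
  return entries;
}
```

Passing `updatedAt` straight through keeps `lastModified` honest: it only changes when the content record does.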

Static vs dynamic sitemaps

There are two ways to ship a sitemap, and the right one depends on how often your URLs change:

Static - a plain public/sitemap.xml file you write or generate at build time. Perfect for small or rarely-changing sites. It is served as-is, costs nothing at runtime, and is trivial to inspect. The downside: it goes stale the moment you add a page and forget to update it.
Dynamic - an app/sitemap.ts route (the example above) that builds the list from your data on each request or at build time. This is what you want once URLs come from a CMS, database, or any source that changes without a redeploy. As long as the route is regenerated when content changes (rather than cached from an old build), it stays in sync with your content.

A useful rule: if a non-developer can publish a page, you need a dynamic sitemap. If every page ships in a Git commit, a static file is fine.
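
For the static route, the file can be produced by a small build step rather than by hand. The following is a sketch of the serialization only; `toSitemapXml` and `escapeXml` are illustrative names, and writing the result to public/sitemap.xml (for example with Node's `fs.writeFileSync`) is left to your build script.

```typescript
type SitemapEntry = { url: string; lastModified?: Date };

// Escape the five characters that are special in XML text and attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Serialize entries into a sitemap document following the sitemaps.org schema.
function toSitemapXml(entries: SitemapEntry[]): string {
  const urls = entries.map((e) => {
    const lastmod = e.lastModified
      ? `<lastmod>${e.lastModified.toISOString()}</lastmod>`
      : "";
    return `  <url><loc>${escapeXml(e.url)}</loc>${lastmod}</url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```

Escaping matters more than it looks: a single raw `&` in a query string makes the whole file invalid XML.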

Sharding past 50,000 URLs

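A single sitemap caps at 50,000 URLs / 50 MB. Large sites split into multiple files behind a sitemap index. Next.js does the splitting with generateSitemaps, which produces numbered shards (/sitemap/0.xml, /sitemap/1.xml, ...). The Next.js documentation does not describe it as emitting the index file, so confirm that for your version and, if it is missing, serve the index yourself or list each shard in robots.txt. The sharding itself looks like this: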

app/sitemap.ts

import type { MetadataRoute } from "next";
import { getProductCount, getProducts } from "@/lib/products";
 
const PER_SITEMAP = 50_000;
 
export async function generateSitemaps() {
  const count = await getProductCount();
  const pages = Math.ceil(count / PER_SITEMAP);
  return Array.from({ length: pages }, (_, id) => ({ id }));
}
 
export default async function sitemap({
  id,
}: {
  id: number;
}): Promise<MetadataRoute.Sitemap> {
  const start = id * PER_SITEMAP;
  const products = await getProducts(start, start + PER_SITEMAP);
  return products.map((product) => ({
    url: `https://example.com/product/${product.slug}`,
    lastModified: product.updatedAt,
  }));
}
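
If you need to build or check a sitemap index by hand, the format is small. The sketch below assumes shards live at /sitemap/{id}.xml, as in the paths above; `shardCount` and `buildSitemapIndex` are illustrative helpers, not framework APIs.

```typescript
// Number of shard files needed for `total` URLs.
function shardCount(total: number, perSitemap: number): number {
  return Math.ceil(total / perSitemap);
}

// Build a sitemap index pointing at /sitemap/0.xml ... /sitemap/{shards - 1}.xml.
// `baseUrl` is expected without a trailing slash, e.g. "https://example.com".
function buildSitemapIndex(baseUrl: string, shards: number): string {
  const items: string[] = [];
  for (let id = 0; id < shards; id++) {
    items.push(`  <sitemap><loc>${baseUrl}/sitemap/${id}.xml</loc></sitemap>`);
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...items,
    "</sitemapindex>",
  ].join("\n");
}
```

With 120,000 products and 50,000 per file, `shardCount` yields 3, so the index lists shards 0 through 2.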

Index control: noindex

noindex removes a page from results while letting users access it - right for internal search results, thin tag pages, thank-you pages, login screens:

app/search/page.tsx

export const metadata = {
  robots: { index: false, follow: true },
};

For non-HTML resources, send it as an HTTP header instead: X-Robots-Tag: noindex.
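
How you attach that header depends on your server, but the decision itself can be a small pure function. A sketch, assuming you want to keep a few document types out of results; the extension list and the `robotsHeadersFor` name are assumptions for illustration:

```typescript
// File extensions treated as non-HTML documents here; this list is an assumption.
const NOINDEX_EXTENSIONS = [".pdf", ".docx", ".xlsx"];

// Return the extra response headers for a path, or an empty object if none apply.
function robotsHeadersFor(pathname: string): Record<string, string> {
  const lower = pathname.toLowerCase();
  const matches = NOINDEX_EXTENSIONS.some(
    (ext) => lower.slice(-ext.length) === ext
  );
  return matches ? { "X-Robots-Tag": "noindex" } : {};
}
```

The result can be merged into whatever mechanism sets response headers in your stack (server config, middleware, or a route handler).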

Canonicalization

When several URLs serve the same content (parameters, casing, trailing slashes, www/non-www, http/https), engines pick one canonical and fold the rest into it. Influence the choice with, in order of strength:

301 redirects - the strongest statement; use whenever the duplicates have no reason to exist
rel=canonical - when duplicates must stay accessible (tracking params, print views)
Consistent internal linking - always link to the canonical form
Sitemap inclusion - list only canonicals
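
All four signals depend on having one definition of "the canonical form". A minimal sketch of such a normalizer follows; the policy it encodes (https, no www, lowercase paths, no trailing slash, no tracking parameters) is an assumption for illustration, so swap in your site's actual rules.

```typescript
// Query parameters that never change page content; extend for your own campaigns.
const TRACKING_PARAMS = [
  "utm_source",
  "utm_medium",
  "utm_campaign",
  "utm_term",
  "utm_content",
  "gclid",
  "fbclid",
];

// Map any variant of a URL onto the single form you want indexed.
function canonicalizeUrl(raw: string): string {
  const u = new URL(raw); // the parser already lowercases scheme and host
  u.protocol = "https:";
  u.hostname = u.hostname.replace(/^www\./, "");
  // Assumption: the site serves lowercase paths only.
  u.pathname = u.pathname.toLowerCase();
  if (u.pathname.length > 1 && u.pathname.slice(-1) === "/") {
    u.pathname = u.pathname.slice(0, -1); // drop trailing slash, keep root "/"
  }
  for (const param of TRACKING_PARAMS) {
    u.searchParams.delete(param);
  }
  u.hash = ""; // fragments are never sent to the server
  return u.toString();
}
```

Using the same function for redirects, the rel=canonical tag, internal links, and the sitemap keeps the signals from contradicting each other.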

Audit reality vs. intent with Search Console's URL Inspection: "Google-selected canonical" vs "user-declared canonical" disagreements are a top cause of "indexed but not the URL I wanted".

Status codes engines care about

Code	Meaning to a crawler
`200`	OK - eligible for indexing (not a guarantee)
`301`	Moved permanently - transfer signals to target
`302/307`	Temporary - keep the old URL indexed (don't use for permanent moves)
`304`	Unchanged - saves crawl budget on conditional requests
`404`	Not found - will retry occasionally, then drop
`410`	Gone permanently - dropped faster than 404
`5xx`	Server trouble - crawl rate backs off; persistent 5xx deindexes

Soft 404s - pages returning 200 with "not found" content - confuse all of this. Return real status codes; in Next.js, call notFound() rather than rendering an empty state with a 200.
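
When auditing your own crawl data, soft 404s can be flagged with a rough heuristic. The sketch below is an approximation, not how any search engine detects them: the phrases and the 50-character threshold are assumptions to tune against your own templates.

```typescript
// Hypothetical shape of a fetched page in your crawl output.
type FetchedPage = { url: string; status: number; body: string };

// Phrases that suggest an error page; purely illustrative.
const NOT_FOUND_PATTERNS = [/page not found/i, /no results found/i];

// Flag pages that return 200 but look empty or like an error page.
function isSoft404(page: FetchedPage): boolean {
  if (page.status !== 200) return false; // real error codes are not "soft"
  // Strip tags and collapse whitespace to approximate the visible text.
  const text = page.body.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
  const looksEmpty = text.length < 50; // threshold is an assumption
  return looksEmpty || NOT_FOUND_PATTERNS.some((re) => re.test(text));
}
```

Treat hits as candidates for manual review; a long article that quotes "page not found" would be a false positive.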

Debugging indexing issues

Work through Search Console's Indexing → Pages report. The frequent culprits:

"Excluded by noindex"                → intended? if not, remove the tag
"Alternate page with canonical"      → engine merged it; check that's correct
"Crawled - currently not indexed"    → usually a quality/duplication problem, not a technical one
"Discovered - currently not crawled" → crawl budget or weak internal links
"Blocked by robots.txt"              → unblock (or stop sitemapping it)

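Many of these statuses come from signals that contradict each other, and those can be caught before an engine sees them. A sketch of such a check, assuming you can assemble a hypothetical `UrlSignals` record per URL from your own crawl, robots.txt, and sitemap data:

```typescript
// Hypothetical per-URL summary; every field comes from your own tooling.
type UrlSignals = {
  url: string;
  blockedByRobots: boolean;
  noindex: boolean;
  inSitemap: boolean;
  declaredCanonical: string;
};

// Return a human-readable list of contradictory crawl/index signals.
function findConflicts(s: UrlSignals): string[] {
  const problems: string[] = [];
  if (s.blockedByRobots && s.noindex) {
    problems.push("noindex cannot be seen: robots.txt blocks the fetch");
  }
  if (s.blockedByRobots && s.inSitemap) {
    problems.push("listed in sitemap but blocked by robots.txt");
  }
  if (s.noindex && s.inSitemap) {
    problems.push("listed in sitemap but marked noindex");
  }
  if (s.inSitemap && s.declaredCanonical !== s.url) {
    problems.push("listed in sitemap but canonical points elsewhere");
  }
  return problems;
}
```

An empty result does not mean the page will be indexed, only that your own signals agree with each other.
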
Next: Site Architecture - structuring URLs and navigation so crawl and equity flow where you want.