Crawling & Indexing
Technical SEO starts here: deliberately controlling what search engines crawl and what they index. The two are different systems with different controls, and confusing them causes most indexing bugs in the wild.
| You want to… | Use | Not |
|---|---|---|
| Stop bots fetching a URL | robots.txt | noindex (they'll never see it) |
| Keep a crawlable page out of results | noindex meta/header | robots.txt (blocks them from seeing the noindex!) |
| Merge duplicate URLs' signals | rel=canonical / redirects | blocking the duplicates |
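The second row of the table describes a conflict that can be checked mechanically. The sketch below is a hypothetical audit helper, not part of any library: `PageConfig`, `isDisallowed`, and `findConflicts` are names invented for illustration, and the matcher assumes plain path prefixes only (no `*` or `$` wildcards).

```typescript
// Hypothetical shape: one entry per URL you want to audit.
interface PageConfig {
  path: string; // e.g. "/search"
  noindex: boolean; // true if the page carries a noindex meta tag or header
}

// Simplified matcher (assumption): plain prefix comparison, no wildcard support.
function isDisallowed(path: string, disallow: string[]): boolean {
  return disallow.some((prefix) => prefix !== "" && path.startsWith(prefix));
}

// A noindex on a blocked path is never fetched, so it cannot take effect.
function findConflicts(pages: PageConfig[], disallow: string[]): string[] {
  return pages
    .filter((page) => page.noindex && isDisallowed(page.path, disallow))
    .map((page) => page.path);
}

const conflicts = findConflicts(
  [
    { path: "/admin/login", noindex: true },
    { path: "/search", noindex: true },
    { path: "/blog/hello", noindex: false },
  ],
  ["/api/", "/admin/"],
);
console.log(conflicts);
```

Any path this returns needs a decision: either unblock it so the noindex can be read, or drop the noindex and rely on the crawl block alone.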
robots.txt
Lives at the domain root, controls crawler access by path prefix:
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /*?sort= # parameter traps
User-agent: GPTBot # AI training crawlers can be managed separately
Allow: /
Sitemap: https://example.com/sitemap.xml
In Next.js you can generate it dynamically:
import type { MetadataRoute } from "next";

// app/robots.ts: Next.js serves the returned object as /robots.txt
export default function robots(): MetadataRoute.Robots {
  return {
    rules: [{ userAgent: "*", disallow: ["/api/", "/admin/"] }],
    sitemap: "https://example.com/sitemap.xml",
  };
}
XML sitemaps
A sitemap is a machine-readable list of URLs you want crawled - a discovery aid, not a ranking factor. Rules:
- Include only canonical, indexable, 200-status URLs. A sitemap full of redirects and noindexed pages erodes trust in the file.
- ≤ 50,000 URLs / 50MB per file; use a sitemap index beyond that.
- lastmod should be honest - engines use it to prioritize recrawls and learn to ignore it when it lies.
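The first rule can be enforced with a filter before the list is emitted. This is a minimal sketch under assumptions: `CrawledUrl` and its fields are hypothetical, standing in for whatever your crawler or CMS reports about each page.

```typescript
// Hypothetical crawl record; the field names are illustrative only.
interface CrawledUrl {
  url: string;
  status: number; // HTTP status the URL returned
  noindex: boolean; // true if the page is marked noindex
  canonical: string; // canonical URL the page declares
}

// Keep only URLs that return 200, are indexable, and are their own canonical.
function sitemapEligible(pages: CrawledUrl[]): string[] {
  return pages
    .filter((p) => p.status === 200 && !p.noindex && p.canonical === p.url)
    .map((p) => p.url);
}

const eligible = sitemapEligible([
  { url: "https://example.com/a", status: 200, noindex: false, canonical: "https://example.com/a" },
  { url: "https://example.com/old", status: 301, noindex: false, canonical: "https://example.com/old" },
  { url: "https://example.com/thanks", status: 200, noindex: true, canonical: "https://example.com/thanks" },
  { url: "https://example.com/a?ref=x", status: 200, noindex: false, canonical: "https://example.com/a" },
]);
console.log(eligible);
```

Only the first entry survives: the others are a redirect, a noindexed page, and a parameter duplicate that points at a different canonical.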
import type { MetadataRoute } from "next";
import { getAllPosts } from "@/lib/posts";

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const posts = await getAllPosts();
  return [
    // new Date() reports "just modified" on every generation; swap in a real
    // timestamp for the homepage if you track one
    { url: "https://example.com", lastModified: new Date() },
    ...posts.map((post) => ({
      url: `https://example.com/blog/${post.slug}`,
      lastModified: post.updatedAt,
    })),
  ];
}
Submit it once in Search Console; afterwards engines refetch it on their own schedule.
Static vs dynamic sitemaps
There are two ways to ship a sitemap, and the right one depends on how often your URLs change:
- Static - a plain public/sitemap.xml file you write or generate at build time. Perfect for small or rarely-changing sites. It is served as-is, costs nothing at runtime, and is trivial to inspect. The downside: it goes stale the moment you add a page and forget to update it.
- Dynamic - an app/sitemap.ts route (the example above) that builds the list from your data on each request or at build time. This is what you want once URLs come from a CMS, database, or any source that changes without a redeploy. As long as the route is regenerated or revalidated when content changes, it stays in sync with your content.
A useful rule: if a non-developer can publish a page, you need a dynamic sitemap. If every page ships in a Git commit, a static file is fine.
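For the static route, the file can be produced by a small build-time script. The sketch below shows only the rendering step; `SitemapEntry`, `escapeXml`, and `renderSitemap` are illustrative names, and writing the result to public/sitemap.xml (for example with Node's `fs.writeFileSync`) is left to your build setup.

```typescript
// Illustrative entry type, not the Next.js MetadataRoute type.
interface SitemapEntry {
  url: string;
  lastModified?: Date;
}

// URLs inside <loc> must be entity-escaped, e.g. "&" becomes "&amp;".
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Render a sitemaps.org <urlset> document as a string.
function renderSitemap(entries: SitemapEntry[]): string {
  const urls = entries.map((entry) => {
    const lastmod = entry.lastModified
      ? `<lastmod>${entry.lastModified.toISOString()}</lastmod>`
      : "";
    return `  <url><loc>${escapeXml(entry.url)}</loc>${lastmod}</url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}

const xml = renderSitemap([
  { url: "https://example.com/", lastModified: new Date("2024-01-15T00:00:00Z") },
  { url: "https://example.com/?a=1&b=2" },
]);
console.log(xml);
```

Running this in the build step rather than by hand removes the main weakness of a static file, since the list is rebuilt on every deploy.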
Sharding past 50,000 URLs
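A single sitemap caps at 50,000 URLs / 50MB. Large sites split into multiple files behind a sitemap index. Next.js splits the list with generateSitemaps, which produces numbered shards (/sitemap/0.xml, /sitemap/1.xml, ...). At the time of writing it does not appear to generate the sitemap index file itself, so check your Next.js version and, if needed, serve the index from your own route or list every shard in robots.txt. The sharded route: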
import type { MetadataRoute } from "next";
import { getProductCount, getProducts } from "@/lib/products";

const PER_SITEMAP = 50_000;

// One { id } per shard; Next.js calls sitemap() once for each
export async function generateSitemaps() {
  const count = await getProductCount();
  const pages = Math.ceil(count / PER_SITEMAP);
  return Array.from({ length: pages }, (_, id) => ({ id }));
}

export default async function sitemap({
  id,
}: {
  id: number;
}): Promise<MetadataRoute.Sitemap> {
  // Assumes getProducts(start, end) returns the rows in [start, end)
  const start = id * PER_SITEMAP;
  const products = await getProducts(start, start + PER_SITEMAP);
  return products.map((product) => ({
    url: `https://example.com/product/${product.slug}`,
    lastModified: product.updatedAt,
  }));
}
Index control: noindex
noindex removes a page from results while letting users access it - right for internal search results, thin tag pages, thank-you pages, login screens:
export const metadata = {
  robots: { index: false, follow: true },
};
For non-HTML resources, send it as an HTTP header instead: X-Robots-Tag: noindex.
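Where the header gets attached depends on your stack (middleware, a route handler, or server config). The decision itself can be a small pure function. In this sketch `NOINDEX_EXTENSIONS` and `robotsHeadersFor` are hypothetical names, and the extension list is an assumption to adjust for your site.

```typescript
// Assumed list of non-HTML file types to keep out of results; adjust as needed.
const NOINDEX_EXTENSIONS = [".pdf", ".docx", ".csv"];

// Returns the extra response headers for a given request path.
function robotsHeadersFor(pathname: string): Record<string, string> {
  const lower = pathname.toLowerCase();
  const match = NOINDEX_EXTENSIONS.some((ext) => lower.endsWith(ext));
  return match ? { "X-Robots-Tag": "noindex" } : {};
}

console.log(robotsHeadersFor("/files/Report.PDF"));
console.log(robotsHeadersFor("/blog/post"));
```

The same caveat as the meta tag applies: the header is only seen if the file is crawlable, so these paths must not also be disallowed in robots.txt.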
Canonicalization
When several URLs serve the same content (parameters, casing, trailing slashes, www/non-www, http/https), engines pick one canonical and fold the rest into it. Influence the choice with, in order of strength:
- 301 redirects - the strongest statement; use whenever the duplicates have no reason to exist
- rel=canonical - when duplicates must stay accessible (tracking params, print views)
- Consistent internal linking - always link to the canonical form
- Sitemap inclusion - list only canonicals
Audit reality vs. intent with Search Console's URL Inspection: "Google-selected canonical" vs "user-declared canonical" disagreements are a top cause of "indexed but not the URL I wanted".
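Whichever signal you use, it helps to have one function that defines the canonical form, so redirects, canonical tags, internal links, and the sitemap all agree. The sketch below encodes an assumed site policy (https, no www, lowercase paths, no trailing slash, tracking parameters stripped). `canonicalize` and `TRACKING_PARAMS` are illustrative names, and lowercasing paths is only safe if your URLs are case-insensitive.

```typescript
// Assumed tracking parameters to drop; extend for your analytics setup.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"];

// Map any duplicate form of a URL to the single canonical form (assumed policy).
function canonicalize(input: string): string {
  const url = new URL(input); // the URL parser already lowercases the hostname
  url.protocol = "https:";
  url.hostname = url.hostname.replace(/^www\./, "");
  url.pathname = url.pathname.toLowerCase();
  // Strip a trailing slash everywhere except the root path.
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  for (const param of TRACKING_PARAMS) {
    url.searchParams.delete(param);
  }
  url.hash = "";
  return url.toString();
}

console.log(canonicalize("http://www.Example.com/Blog/Post/?utm_source=x&page=2"));
```

A redirect layer can compare the requested URL with `canonicalize(requested)` and issue a 301 when they differ, while templates call the same function to emit rel=canonical.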
Status codes engines care about
| Code | Meaning to a crawler |
|---|---|
| 200 | OK - content is eligible for indexing |
| 301 | Moved permanently - transfer signals to target |
| 302/307 | Temporary - keep the old URL indexed (don't use for permanent moves) |
| 304 | Unchanged - saves crawl budget on conditional requests |
| 404 | Not found - will retry occasionally, then drop |
| 410 | Gone permanently - dropped faster than 404 |
| 5xx | Server trouble - crawl rate backs off; persistent 5xx deindexes |
Soft 404s - pages returning 200 with "not found" content - confuse all of this. Return real status codes; in Next.js, call notFound() rather than rendering an empty state with a 200.
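The table and the soft-404 rule can be combined into a small audit check. This is a heuristic sketch, not how any engine actually classifies pages: `CrawlOutcome`, `NOT_FOUND_HINTS`, and `classify` are hypothetical names, and the phrase list is an assumption you would tune to your own error templates.

```typescript
type CrawlOutcome =
  | "indexable"
  | "permanent-redirect"
  | "temporary-redirect"
  | "not-modified"
  | "not-found"
  | "gone"
  | "server-error"
  | "soft-404"
  | "other";

// Phrases that suggest an error page (assumed heuristic, tune per site).
const NOT_FOUND_HINTS = ["page not found", "nothing was found"];

// Classify a response the way the status table describes it.
function classify(status: number, bodyText: string = ""): CrawlOutcome {
  if (status === 200) {
    const text = bodyText.toLowerCase();
    const looksMissing = NOT_FOUND_HINTS.some((hint) => text.includes(hint));
    return looksMissing ? "soft-404" : "indexable";
  }
  if (status === 301 || status === 308) return "permanent-redirect"; // 308 is also permanent
  if (status === 302 || status === 307) return "temporary-redirect";
  if (status === 304) return "not-modified";
  if (status === 404) return "not-found";
  if (status === 410) return "gone";
  if (status >= 500 && status <= 599) return "server-error";
  return "other";
}

console.log(classify(200, "<h1>Page not found</h1>"));
console.log(classify(410));
```

Running a check like this over a crawl of your own site surfaces pages that should be returning a real 404 or 410.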
Debugging indexing issues
Work through Search Console's Indexing → Pages report. The frequent culprits:
- "Excluded by 'noindex' tag" → intended? If not, remove the tag
- "Alternate page with proper canonical tag" → engine merged it; check that's correct
- "Crawled - currently not indexed" → quality/duplication problem, not technical
- "Discovered - currently not indexed" → crawl budget or weak internal links
- "Blocked by robots.txt" → unblock (or stop sitemapping it)
Next: Site Architecture - structuring URLs and navigation so crawl and equity flow where you want.
