How Search Engines Work

Every search engine, from Google to the crawler feeding an AI chatbot, runs the same three-stage pipeline: crawl → index → rank. Understanding this pipeline is the single highest-leverage piece of knowledge in SEO, because nearly every problem you'll debug lives in one of these stages.

Stage 1: Crawling

Crawling is discovery. A program called a crawler (or spider/bot - Google's is Googlebot) fetches pages, parses them, extracts every link, and queues those links to fetch next. Repeat forever.
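The fetch, parse, extract, queue loop can be sketched in a few lines of Python. This is a minimal illustration, not how Googlebot or any production crawler is built: the three-page site in PAGES is hypothetical and held in memory so the example runs without network access, and the names crawl, LinkExtractor and max_pages are made up for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, so nothing is fetched twice
    crawled = []               # URLs fetched successfully, in crawl order

    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # treated as a failed fetch (e.g. a 404)
            continue
        crawled.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page and drop #fragments
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return crawled


# Hypothetical site: a dict stands in for real HTTP requests
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog/">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog/": '<a href="/about#team">Team</a> <a href="/missing">Gone</a>',
}

print(crawl("https://example.com/", PAGES.get))
```

A page that no known page links to never enters the frontier, which is why internal links and sitemaps matter for discovery.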

Crawlers discover URLs from:

Links on pages they already know about
XML sitemaps you submit
Redirects and canonical hints
Previous crawls (re-checking known pages for changes)

You control crawler behavior primarily through robots.txt:

robots.txt

User-agent: *
Disallow: /admin/
Disallow: /cart
 
Sitemap: https://example.com/sitemap.xml
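To see how a crawler might interpret these rules, here is a small check using Python's standard urllib.robotparser module. It is only a sketch: real crawlers have their own parsers that can differ in edge cases, and the allowed helper and test paths below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from a string instead of fetching


def allowed(path, agent="*"):
    """Hypothetical helper: may `agent` fetch this path on example.com?"""
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("/admin/users"))  # False: under /admin/
print(allowed("/cart"))         # False
print(allowed("/cartoons"))     # False: rules are prefix matches, so /cart also covers this
print(allowed("/blog/"))        # True
print(parser.site_maps())       # ['https://example.com/sitemap.xml'] (Python 3.8+)
```

The /cartoons result is the detail to notice: Disallow values match by prefix, so a rule without a trailing slash can block more than intended.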

Crawl budget

Search engines won't crawl your site infinitely. Crawl budget is the number of URLs a crawler will fetch in a given window. It is a function of:

Crawl capacity - how fast your server responds without degrading
Crawl demand - how popular and frequently-updated your pages are

Small sites (under ~10k pages) rarely need to think about it. Large sites live and die by it - wasting budget on duplicate or junk URLs means real pages go uncrawled.
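One way to estimate how much budget duplicate URLs consume is to normalize a list of crawled URLs and count what remains. The sketch below assumes a hypothetical site where parameter order and the parameters in IGNORED_PARAMS never change page content; both assumptions need checking against a real site before use.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: on this hypothetical site these parameters never change the content
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize(url):
    """Collapse URL variants that are assumed to serve the same page."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # Lowercase scheme and host, sort the remaining parameters, drop the fragment
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


urls = [
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/shoes?size=9&color=red",
    "https://example.com/shoes?color=red&size=9&utm_source=newsletter",
    "https://EXAMPLE.com/shoes?color=red&size=9#reviews",
    "https://example.com/shoes?color=blue&size=9",
]

unique = {normalize(u) for u in urls}
print(len(urls), "crawlable URLs ->", len(unique), "distinct pages")
```

Here five URLs collapse to two pages, so in this toy case more than half the fetches would be spent on duplicates.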

Rendering

Some crawlers, Googlebot among them, render JavaScript, but in a second wave that may lag the initial HTML crawl. Content that exists only after client-side JS runs is discovered later - and sometimes not at all; a crawler that does not render never sees it. This is the entire subject of JavaScript SEO.
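A quick way to approximate what a crawler sees before rendering is to look for key content in the raw HTML without executing any scripts. The sketch below uses Python's standard html.parser; the two sample pages and the in_initial_html helper are hypothetical, and a real check would run against the HTML your server actually returns.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style>; no JavaScript is executed."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def in_initial_html(html, phrase):
    """True if phrase appears in the text of the unrendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    return phrase.lower() in " ".join(parser.chunks).lower()


SERVER_RENDERED = "<html><body><h1>Trail running shoes</h1></body></html>"
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>document.getElementById("root").textContent = "Trail running shoes";</script>'
    "</body></html>"
)

print(in_initial_html(SERVER_RENDERED, "trail running shoes"))  # True
print(in_initial_html(CLIENT_RENDERED, "trail running shoes"))  # False
```

If a phrase that matters for ranking only shows up in the second case, it depends on the rendering wave.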

Stage 2: Indexing

Indexing is comprehension and storage. The engine parses the rendered page and decides:

What is this page about? Text, headings, structured data, image alt text, link anchors.
Is it the canonical version? Duplicate and near-duplicate pages get clustered; one URL is chosen as canonical and the rest are folded into it.
Is it worth storing? Thin, duplicate or low-quality pages may be crawled but never indexed - the dreaded "Crawled - currently not indexed" in Search Console.

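The result goes into the inverted index: a giant map from words and entities to the documents containing them. Looking up a term returns its list of documents directly, with no need to scan every page, which is what makes searching billions of pages in a fraction of a second possible.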
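A toy inverted index makes the idea concrete. This sketch maps each word to the set of document IDs containing it and answers a query by intersecting those sets; the three documents are made up, and real engines add entities, positions, compression and sharding that are left out here.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus: document ID -> text
DOCS = {
    1: "How search engines crawl the web",
    2: "Crawl budget for large sites",
    3: "How search engines rank pages",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


index = build_index(DOCS)
print(sorted(search(index, "search engines")))  # [1, 3]
print(sorted(search(index, "crawl")))           # [1, 2]
print(sorted(search(index, "rank budget")))     # []
```

The query cost depends on the size of the matching sets, not on the size of the whole corpus. Note that this only retrieves candidates; ordering them is the job of the ranking stage.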

Stage 3: Ranking

When a user searches, the engine retrieves candidate documents from the index and ranks them using hundreds of signals. The major families:

Signal family	Examples
Relevance	Query terms and related entities in title, headings, body
Quality & authority	Links from other sites, brand mentions, E-E-A-T signals
User experience	Page speed (Core Web Vitals), mobile usability, HTTPS, absence of intrusive interstitials
Intent & context	Freshness for news queries, location for "near me", language

Two important truths about ranking:

Ranking is per-query, not per-site. You don't "rank" in the abstract - you rank for specific queries, each with its own competitive landscape.
Intent dominates. Google's systems first decide what kind of result satisfies the query (product pages? tutorials? videos?) and rank within that. Matching search intent is a prerequisite, not an optimization.

Debugging with the pipeline

When a page isn't getting traffic, walk the stages in order:

Not crawled?   → check robots.txt, internal links, sitemap, server errors
Not indexed?   → check noindex, canonicals, content quality, duplicates
Not ranking?   → check intent match, content depth, links, competition
No clicks?     → check title/description, SERP features stealing attention
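The same checklist can be written as a small triage function that stops at the first failing stage. This is only a sketch: PageStatus and its fields are hypothetical values you would fill in by hand from your own tools, and using zero impressions as a stand-in for "not ranking" is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class PageStatus:
    """Hypothetical observations about one URL, filled in manually."""
    crawled: bool
    indexed: bool
    impressions: int  # assumption: zero impressions stands in for "not ranking"
    clicks: int


# Ordered checks mirroring the checklist: (label, failure test, what to check)
CHECKS = [
    ("Not crawled", lambda p: not p.crawled,
     "check robots.txt, internal links, sitemap, server errors"),
    ("Not indexed", lambda p: not p.indexed,
     "check noindex, canonicals, content quality, duplicates"),
    ("Not ranking", lambda p: p.impressions == 0,
     "check intent match, content depth, links, competition"),
    ("No clicks", lambda p: p.clicks == 0,
     "check title/description, SERP features stealing attention"),
]


def triage(page):
    """Return advice for the earliest stage that fails."""
    for label, failed, advice in CHECKS:
        if failed(page):
            return f"{label}: {advice}"
    return "No pipeline problem found"


print(triage(PageStatus(crawled=True, indexed=False, impressions=0, clicks=0)))
```

The order matters: a page that was never indexed also has no impressions or clicks, so checking later stages first would send you after the wrong cause.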

Google Search Console's URL Inspection tool reports a URL's crawl and index status (last crawl, whether indexing is allowed, and the canonical Google selected) - make it your first stop for the first two stages. Setting it up is covered in Your SEO Toolkit.

Next: Anatomy of a SERP - what the results page actually contains in the age of AI.