
How Search Engines Work

Every search engine, from Google to the crawler feeding an AI chatbot, runs the same three-stage pipeline: crawl → index → rank. Understanding this pipeline is the single highest-leverage piece of knowledge in SEO, because every problem you'll ever debug lives in one of these stages.

Stage 1: Crawling

Crawling is discovery. A program called a crawler (or spider/bot - Google's is Googlebot) fetches pages, parses them, extracts every link, and queues those links to fetch next. Repeat forever.
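The fetch, parse, extract, queue loop can be sketched in a few lines of Python. This is a minimal illustration, not how Googlebot is implemented: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web so the example runs without network access, and the names `LinkExtractor` and `crawl` are invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog/">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog/": '<a href="post-1">Post</a> <a href="/about#team">Team</a>',
    "https://example.com/blog/post-1": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch=PAGES.get):
    """Breadth-first crawl from a seed URL; returns URLs in fetch order."""
    queue = deque([seed])
    seen = {seed}  # every URL ever queued, so nothing is fetched twice
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown URL, comparable to a 404
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


print(crawl("https://example.com/"))
```

The `seen` set is what keeps "repeat forever" from becoming "fetch the same page forever": a URL is queued only the first time it is discovered.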

Crawlers discover URLs from:

  • Links on pages they already know about
  • XML sitemaps you submit (see the sitemap guide)
  • Redirects and canonical hints
  • Previous crawls (re-checking known pages for changes)
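Of these sources, the XML sitemap is the one you write yourself. As a rough sketch, a sitemap can be generated with Python's standard library; the URLs and dates below are placeholders, and `build_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a sitemap XML string from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs and dates, for illustration only.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/about", None),
])
print(sitemap_xml)
```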

You control crawler behavior primarily through robots.txt:

robots.txt
User-agent: *
Disallow: /admin/
Disallow: /cart
 
Sitemap: https://example.com/sitemap.xml
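To check how rules like these are interpreted, Python's standard-library `urllib.robotparser` can evaluate them locally. This is a sketch using that module's matching logic, which may differ in edge cases from any particular search engine's parser; the `allowed` helper and the test paths are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if the given user agent may fetch this path on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)


for path in ["/products/widget", "/admin/users", "/admin", "/cart/checkout", "/cartoons"]:
    print(path, allowed(path))

print(parser.site_maps())  # sitemap URLs declared in the file (Python 3.8+)
```

Rules are prefix matches, so `Disallow: /cart` also blocks `/cartoons`, while `Disallow: /admin/` leaves the bare `/admin` URL fetchable. Trailing slashes matter.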

Crawl budget

Search engines won't crawl your site without limit. Crawl budget is the number of URLs a crawler will fetch in a given window, and it is a function of:

  • Crawl capacity - how fast your server responds without degrading
  • Crawl demand - how popular and frequently-updated your pages are

Small sites (under ~10k pages) rarely need to think about it. Large sites live and die by it - wasting budget on duplicate or junk URLs means real pages go uncrawled.
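A toy model makes the waste concrete. The sketch below assumes a fixed budget of three fetches and a hand-picked list of tracking parameters (`TRACKING_PARAMS`); both are illustrative assumptions, and real crawlers schedule fetches in far more sophisticated ways.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize(url):
    """Strip tracking parameters and fragments so duplicate URLs collapse."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )


def dedupe(urls):
    """Normalize URLs and keep the first occurrence of each, preserving order."""
    return list(dict.fromkeys(normalize(u) for u in urls))


def spend_budget(queue, budget):
    """Toy crawler: fetch URLs in queue order until the budget runs out."""
    return queue[:budget]


QUEUE = [
    "https://example.com/shoes?utm_source=news",
    "https://example.com/shoes?utm_source=ad&utm_medium=cpc",
    "https://example.com/shoes?sessionid=abc123",
    "https://example.com/shoes",
    "https://example.com/about",
    "https://example.com/contact",
]

print(spend_budget(QUEUE, 3))          # three variants of the same page
print(spend_budget(dedupe(QUEUE), 3))  # three distinct pages
```

With the raw queue, the whole budget goes to one page fetched three times. After collapsing duplicates, the same budget covers every real page.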

Rendering

Googlebot renders JavaScript, and some other crawlers do too, but rendering happens in a second wave that may lag the initial HTML crawl. Content that only exists after client-side JS execution is discovered later - and sometimes not at all, especially by crawlers that never run scripts. This is the entire subject of JavaScript SEO.
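A quick way to see the gap is to check whether a key phrase is present in the HTML as delivered, before any script runs. The sketch below is a rough approximation of what an HTML-only crawl can read; the two sample pages and the helper names `VisibleText` and `in_initial_html` are invented for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def in_initial_html(html, phrase):
    """True if the phrase appears in the page text without executing any JS."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return phrase.lower() in " ".join(parser.chunks).lower()


SERVER_RENDERED = "<html><body><h1>Blue Widget</h1><p>Price: $10</p></body></html>"
CLIENT_RENDERED = """<html><body><div id="app"></div>
<script>document.getElementById("app").textContent = "Blue Widget";</script>
</body></html>"""

print(in_initial_html(SERVER_RENDERED, "Blue Widget"))
print(in_initial_html(CLIENT_RENDERED, "Blue Widget"))
```

In the second page the phrase exists only inside a script, so it becomes visible to a crawler only if and when the page is rendered.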

Stage 2: Indexing

Indexing is comprehension and storage. The engine parses the rendered page and decides:

  1. What is this page about? Text, headings, structured data, image alt text, link anchors.
  2. Is it the canonical version? Duplicate and near-duplicate pages get clustered; one URL is chosen as canonical and the rest are folded into it.
  3. Is it worth storing? Thin, duplicate or low-quality pages may be crawled but never indexed - the dreaded "Crawled - currently not indexed" in Search Console.
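The clustering in step 2 can be illustrated with a simple similarity measure. The sketch below uses Jaccard similarity over three-word shingles and treats the first URL seen in each cluster as canonical; the technique, the 0.5 threshold, and the canonical-selection rule are all assumptions chosen for clarity, not a description of how any search engine actually does it.

```python
def shingles(text, size=3):
    """Set of overlapping word n-grams from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(a, b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def cluster_duplicates(pages, threshold=0.5):
    """Greedy clustering: the first URL in each cluster acts as its canonical."""
    clusters = []  # list of (canonical_url, [member_urls])
    for url, text in pages.items():
        for canonical, members in clusters:
            if jaccard(pages[canonical], text) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((url, [url]))
    return clusters


# Hypothetical pages: two near-identical product URLs and one unrelated page.
PAGES = {
    "/shoes": "red running shoes for trail and road with free shipping",
    "/shoes?ref=home": "red running shoes for trail and road with free shipping today",
    "/hats": "wool hats and caps for cold winter days",
}

for canonical, members in cluster_duplicates(PAGES):
    print(canonical, members)
```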

The result goes into the inverted index: a giant map from words and entities to the documents containing them, which is what makes searching billions of pages in 200ms possible.
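A minimal inverted index fits in a few lines. This sketch indexes plain words only (no entities, stemming, or ranking), and the three sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


DOCS = {
    1: "How to crawl a website",
    2: "Crawl budget for large websites",
    3: "How search engines rank pages",
}

index = build_index(DOCS)
print(search(index, "crawl"))
print(search(index, "how crawl"))
```

A query never scans the documents themselves. It looks up each term's posting set and intersects them, which is why lookup time depends on the size of those sets and not on the size of the whole corpus.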

Stage 3: Ranking

When a user searches, the engine retrieves candidate documents from the index and ranks them using hundreds of signals. The major families:

Signal family        Examples
Relevance            Query terms and related entities in title, headings, body
Quality & authority  Links from other sites, brand mentions, E-E-A-T signals
User experience      Page speed (Core Web Vitals), mobile usability, HTTPS, intrusive interstitials
Intent & context     Freshness for news queries, location for "near me", language

Two important truths about ranking:

  • Ranking is per-query, not per-site. You don't "rank" in the abstract - you rank for specific queries, each with its own competitive landscape.
  • Intent dominates. Google's systems first decide what kind of result satisfies the query (product pages? tutorials? videos?) and rank within that. Matching search intent is a prerequisite, not an optimization.
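The per-query point is easy to demonstrate with a deliberately crude scorer. The sketch below counts query terms, weighting title matches three times as heavily as body matches; that weight, the two sample pages, and the function names are assumptions for illustration, and a real engine combines far more signals than term counts.

```python
def score(doc, query):
    """Toy relevance score: title matches count three times as much as body matches."""
    terms = query.lower().split()
    title = doc["title"].lower().split()
    body = doc["body"].lower().split()
    return sum(3 * title.count(term) + body.count(term) for term in terms)


def rank(docs, query):
    """Return URLs of matching documents, best score first."""
    scored = [(score(doc, query), doc["url"]) for doc in docs]
    return [url for s, url in sorted(scored, key=lambda pair: -pair[0]) if s > 0]


# Hypothetical pages targeting two different intents.
DOCS = [
    {
        "url": "/buy-trail-shoes",
        "title": "buy trail running shoes",
        "body": "shop trail running shoes with free shipping",
    },
    {
        "url": "/how-to-choose",
        "title": "how to choose running shoes",
        "body": "a guide on how to choose running shoes for your stride",
    },
]

print(rank(DOCS, "buy running shoes"))
print(rank(DOCS, "how to choose shoes"))
```

The same two pages swap positions depending on the query: the product page wins the transactional query and the guide wins the informational one.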

Debugging with the pipeline

When a page isn't getting traffic, walk the stages in order:

Not crawled?   → check robots.txt, internal links, sitemap, server errors
Not indexed?   → check noindex, canonicals, content quality, duplicates
Not ranking?   → check intent match, content depth, links, competition
No clicks?     → check title/description, SERP features stealing attention
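The same checklist can be written as a small function that stops at the first failing stage. The status keys (`crawled`, `indexed`, `ranking`, `clicked`) are hypothetical names for facts you would gather yourself, for example from Search Console; nothing here queries a real API.

```python
# Stages in pipeline order: (status key, symptom, things to check).
CHECKS = [
    ("crawled", "Not crawled", ["robots.txt", "internal links", "sitemap", "server errors"]),
    ("indexed", "Not indexed", ["noindex", "canonicals", "content quality", "duplicates"]),
    ("ranking", "Not ranking", ["intent match", "content depth", "links", "competition"]),
    ("clicked", "No clicks", ["title/description", "SERP features stealing attention"]),
]


def diagnose(status):
    """Return the first failing stage and its checklist; missing keys count as failing."""
    for key, symptom, checks in CHECKS:
        if not status.get(key, False):
            return symptom, checks
    return "OK", []


symptom, checks = diagnose({"crawled": True, "indexed": False})
print(symptom, checks)
```

Order matters: a page that is not indexed cannot rank, so there is no point reviewing titles or links until the earlier stages pass.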

Google Search Console's URL Inspection tool shows whether a URL has been crawled and indexed, and which canonical Google selected - make it your first stop for the first two stages. Setting it up is covered in Your SEO Toolkit.

Next: Anatomy of a SERP - what the results page actually contains in the age of AI.