Duplicate Content

Duplicate content isn't a penalty - it's a navigation problem. When multiple URLs serve identical or near-identical content, Google has to pick one to rank. If your signals are unclear or contradictory, Google picks wrong (or picks inconsistently), your PageRank gets split between versions, and none of them rank as well as they could.

The fix isn't panic. It's clarity.

Where duplication silently comes from

Most duplicate content isn't intentional - it's a side effect of how websites are built.

HTTP vs. HTTPS

http://example.com/page and https://example.com/page both returning 200. Fix: server-level 301 redirect of all HTTP to HTTPS, permanently.

www vs. non-www

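www.example.com and example.com both live. Fix: pick one canonical version and 301 redirect the other. Search Console's old Preferred Domain setting was retired in 2019, so the redirect, canonical tags, and internal links have to carry the signal.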

Trailing slash variants

/page and /page/ returning identical content. Fix: pick one, 301 the other universally across the site.
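
The three fixes above (protocol, host, trailing slash) all reduce to one rule: compute the preferred form of a URL and 301 anything that differs. A minimal sketch of that rule, assuming a site policy of HTTPS, bare domain, and no trailing slash; PREFERRED_HOST, TRAILING_SLASH, and both function names are illustrative, not part of any framework:

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumed site policy (hypothetical): HTTPS, bare domain, no trailing slash.
PREFERRED_HOST = "example.com"
TRAILING_SLASH = False


def preferred_url(url: str) -> str:
    """Return the single preferred form of an absolute URL under the policy above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "www." + PREFERRED_HOST:
        host = PREFERRED_HOST  # collapse the www variant onto the bare domain
    path = parts.path or "/"
    if path != "/":
        # Normalize the trailing slash everywhere except the root path.
        path = path.rstrip("/") + ("/" if TRAILING_SLASH else "")
    # Keep the query string; fragments never reach the server, so drop them.
    return urlunsplit(("https", host, path, parts.query, ""))


def redirect_target(url: str) -> Optional[str]:
    """Return the 301 target for a requested URL, or None if it is already preferred."""
    target = preferred_url(url)
    return None if target == url else target
```

In practice this logic usually lives in web server or CDN configuration rather than application code; the point of the sketch is that every variant resolves to its final URL in a single hop.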

URL parameters - the big one for e-commerce

Faceted navigation and sorting generate URLs like:

/shoes?color=red&size=10
/shoes?size=10&color=red   ← same result, different order
/shoes?color=red&size=10&sort=price
/shoes?utm_source=email&color=red&size=10

A single category page with 10 filter options and 3 sort orders can silently generate hundreds of duplicate URLs. Left unmanaged, this eats crawl budget and splits authority across URLs that should all be pointing to one canonical.

Fix options:

  1. <link rel="canonical"> on parameter URLs pointing to the clean base URL
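  2. robots.txt Disallow for parameter combinations that add no indexable value (a disallowed URL is never crawled, so Google can't see a canonical tag on it; pick one mechanism per URL pattern)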
  3. Server-side redirect of known duplicate parameter patterns
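
Whichever option you choose, it helps to have one function that decides what the canonical form of a parameter URL is. The sketch below assumes a whitelist policy: only parameters in INDEXABLE_PARAMS survive, in sorted order. The parameter names come from the example URLs above; the whitelist itself and the function name are assumptions. Set the whitelist to an empty set to canonicalize everything to the clean base URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only these filters produce pages worth indexing on their own.
# Everything else (sort, utm_*, session IDs, ...) is dropped.
INDEXABLE_PARAMS = {"color", "size"}


def canonical_url(url: str) -> str:
    """Drop non-indexable parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in INDEXABLE_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


variants = [
    "/shoes?color=red&size=10",
    "/shoes?size=10&color=red",
    "/shoes?color=red&size=10&sort=price",
    "/shoes?utm_source=email&color=red&size=10",
]
# All four variants collapse to the same canonical URL.
print({canonical_url(u) for u in variants})
```

The same function can feed the href of the canonical tag, or decide when a request should be redirected.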

Syndicated content

You publish something. It gets republished on another site. Now two pages compete for the same query. Google usually picks the original - but "usually" and "always" are different things, especially if the syndicating domain has more authority than yours.

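If your content goes elsewhere: Request the syndicating site add <link rel="canonical" href="https://yoursite.com/original"> or, where they will agree to it, noindex their copy. Request indexing for your original through GSC's URL Inspection tool immediately after publishing so it has the best chance of being indexed first.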

If you republish others' content: Add rel="canonical" on your copy pointing to the original, or noindex the page if it's purely for user convenience.

Thin location and service pages

"We provide [service] in [city]" × 50 cities, all identical except the city name. This is the doorway-page pattern that Google's spam policies call out. Either write genuinely unique content for each location, or consolidate to a single regional service page.

Session IDs and tracking parameters

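URLs with appended session identifiers like ?sessionid=abc123, or tracking parameters that change the URL without changing the content. Fix: keep session state in cookies so the identifier never appears in the URL, and add <link rel="canonical"> as a safety net. Tracking parameters that your analytics reads from the URL can't be stripped before they reach the client, so for those the canonical tag does the work.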

The canonical signal stack

Google treats canonicalization signals as hints, not commands - and weighs them in rough priority order:

  1. 301 redirects - the strongest signal; equity passes, story over
  2. <link rel="canonical"> - strong hint; usually respected, but Google can override if contradicted
  3. Sitemap inclusion - only including canonical URLs in your sitemap signals intent clearly
  4. Internal links - consistently linking to one URL version tells Google which one matters
  5. External backlinks - if the web predominantly links to one version, Google treats it as canonical

The critical rule: all signals must agree. A canonical pointing to /page/ while all internal links use /page is a contradictory signal. Google will make a judgement call - possibly the wrong one.
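
One way to make the "all signals must agree" rule checkable is to collect the signals for a page and flag any disagreement. The sketch below is illustrative only: the PageSignals shape and the function name are assumptions, and the values would come from your own crawl data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageSignals:
    """Canonicalization signals gathered for one piece of content (hypothetical shape)."""
    redirect_target: Optional[str] = None
    canonical_tag: Optional[str] = None
    sitemap_url: Optional[str] = None
    internal_link_targets: list = field(default_factory=list)


def conflicting_signals(s: PageSignals) -> dict:
    """Return {} when every signal names the same URL, else the URLs seen per signal."""
    seen = {
        "redirect": {s.redirect_target} if s.redirect_target else set(),
        "canonical_tag": {s.canonical_tag} if s.canonical_tag else set(),
        "sitemap": {s.sitemap_url} if s.sitemap_url else set(),
        "internal_links": set(s.internal_link_targets),
    }
    all_urls = set().union(*seen.values())
    if len(all_urls) <= 1:
        return {}
    # Report only the signals that are actually present.
    return {name: urls for name, urls in seen.items() if urls}


# The contradictory case described above: canonical says /page/, links say /page.
signals = PageSignals(canonical_tag="/page/", internal_link_targets=["/page", "/page"])
print(conflicting_signals(signals))
```

An empty result means the signals agree; anything else is a list of places to fix.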

Cross-domain canonicals

rel="canonical" works across domains:

<!-- On partner-blog.com, which republished your article -->
<link rel="canonical" href="https://yoursite.com/original-article">

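This is legitimate, and Google generally honours it when the two pages are substantially the same. It remains a hint: if the partner's page differs a lot from yours (different template, added commentary), Google may ignore it. What it won't honour: using cross-domain canonicals as a PageRank-extraction scheme to redirect authority from unrelated content. The relationship has to be genuine.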
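
To verify that a partner actually added the tag, you can parse their page and check where its canonical points. A minimal sketch using only the Python standard library; the class and function names are illustrative, and fetching the HTML is left out:

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlsplit


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_canonical(html: str, page_url: str) -> Optional[str]:
    """Return the absolute canonical URL, or None if it is missing or declared more than once."""
    finder = CanonicalFinder()
    finder.feed(html)
    if len(finder.hrefs) != 1:
        return None
    return urljoin(page_url, finder.hrefs[0])  # resolve relative hrefs


def is_cross_domain(page_url: str, canonical: str) -> bool:
    """True when the canonical points at a different host than the page itself."""
    return urlsplit(page_url).hostname != urlsplit(canonical).hostname


partner_html = '<head><link rel="canonical" href="https://yoursite.com/original-article"></head>'
canonical = find_canonical(partner_html, "https://partner-blog.com/republished")
print(canonical, is_cross_domain("https://partner-blog.com/republished", canonical))
```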

Near-duplicate content: harder but important

Exact duplicates are easy. Near-duplicates - same template, marginally different data - are trickier.

The question to ask: does this variant provide genuinely different value to a user, or is it the same page with one word changed?

  • Product size/colour variants with identical descriptions → canonical to the parent product
  • Location pages where only the city name changes → consolidate or write genuinely local content
  • Date-archive pages → noindex; they're navigation, not content
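
To put a rough number on "the same page with one word changed", you can compare the overlapping word sequences (shingles) of two pages. This is a simple Jaccard-similarity sketch, not the method Google or any crawler is known to use; the function names and the 3-word shingle size are arbitrary assumptions.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences in the text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: 1.0 identical, 0.0 nothing shared."""
    set_a, set_b = shingles(a, k), shingles(b, k)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


leeds = "We provide emergency plumbing in Leeds. Call us today for a free quote."
york = "We provide emergency plumbing in York. Call us today for a free quote."
print(round(similarity(leeds, york), 2))
```

On real pages, where the shared template runs to hundreds of words and only the city name changes, the score gets very close to 1.0. Where to draw the line between "variant" and "duplicate" is a judgement call for your own content.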

For large e-commerce sites: estimate the combinatorial explosion of your facet system. Five filter categories with 4 options each give 4^5 = 1,024 possible parameter combinations when every filter is set, and more once filters can be left unset or combined with sort orders. If your server returns a 200 for all of them, you have a crawl budget and duplication problem that needs architectural attention, not just canonical tags.
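
The arithmetic is worth running for your own facet system. A small sketch, assuming at most one option can be selected per filter; the function name is illustrative:

```python
from math import factorial, prod


def filter_states(options_per_filter, optional=False):
    """Count filter combinations with one option per filter.

    With optional=True each filter may also be left unset, which adds
    one extra state per filter.
    """
    return prod(n + 1 if optional else n for n in options_per_filter)


filters = [4, 4, 4, 4, 4]  # 5 filter categories, 4 options each

print(filter_states(filters))                 # 1024  (4^5, every filter set)
print(filter_states(filters, optional=True))  # 3125  (5^5, filters may be unset)

# If parameter order is not normalized, each fully filtered page is also
# reachable under 5! = 120 different orderings of the same query string.
print(filter_states(filters) * factorial(len(filters)))  # 122880
```

Multi-select filters and sort orders multiply these numbers further.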

Running a duplicate content audit

  1. Crawl with Screaming Frog and run the Near Duplicate Content and Exact Duplicate Content reports
  2. In GSC → Indexing → Pages, check "Duplicate without user-selected canonical" and "Duplicate, Google chose different canonical than user" - these tell you where Google is disagreeing with your signals
  3. Compare your sitemap URL count to GSC indexed count - large discrepancies often point to parameter duplication
  4. For e-commerce, audit whether faceted navigation URLs are crawlable and assess the combinatorial scale
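
If you want a quick check without a crawler, exact duplicates can be found by hashing each page's text and grouping URLs that share a hash. A minimal sketch, assuming you already have a mapping of URL to extracted body text from your own crawl; the function name is illustrative and this is not how any particular tool implements its report:

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(pages: dict) -> list:
    """Group URLs whose body text is identical after whitespace normalization.

    pages maps URL -> extracted body text. Returns only groups with more
    than one URL, each sorted for stable output.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        normalized = " ".join(body.split())  # collapse runs of whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


crawl = {
    "/page": "Red running shoes,  size 10",
    "/page/": "Red running shoes, size 10",
    "/other": "Blue walking boots, size 9",
}
print(exact_duplicate_groups(crawl))
```

Each group should end up with exactly one indexable URL; the rest need a redirect or a canonical tag.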

Related: Crawling & Indexing for robots.txt and sitemap strategy. Site Architecture for faceted navigation patterns that prevent the duplication problem from arising in the first place.