How to prevent unwanted crawling and indexing | LA SEO Service Blog

31 Jul How to prevent unwanted crawling

For a variety of reasons, webmasters may not want search engine spiders to index specific files or directories.

Here’s how to prevent unwanted crawling

Unfortunately, no single “do not crawl” directive covers every case, but a standard robots.txt file (the robots exclusion standard) placed in the root directory of the domain does a pretty good job of keeping compliant crawlers away from the paths it lists. A robots meta tag can also keep a page out of search engine databases, applying robots exclusion directives at a more granular, page-specific level.
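Since crawlers look for robots.txt in the root directory of the domain, the file's address can be derived from any page URL on the same host. Here is a minimal Python sketch of that idea; the function name robots_txt_url and the example.com URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only as an example.
print(robots_txt_url("https://www.example.com/shop/cart?item=42"))
# prints https://www.example.com/robots.txt
```

A robots.txt file placed in a subdirectory would not be found this way, which is why the root location matters.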

The robots meta tag tells search engines whether to index a page and how it may appear in results pages. Place the robots meta tag in the <head> section of a given page:

<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex" />
(…)
</head>
<body>(…)</body>
</html>

Voilà.
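To double-check that a page really carries the tag, a short script can scan the markup for it. The sketch below uses Python's built-in html.parser module; the class name RobotsMetaParser and the helper has_noindex are made up for this example, and a real audit tool would probably do more.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").strip().lower() != "robots":
            return
        content = attr.get("content") or ""
        # A content value may hold several comma-separated directives.
        self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def has_noindex(html: str) -> bool:
    """Return True if the markup contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<!DOCTYPE html><html><head><meta name="robots" content="noindex" /></head><body></body></html>'
print(has_noindex(page))  # prints True
```

Note that the attribute values need plain straight quotes. Typographic quotes pasted from a word processor become part of the value, so the tag no longer reads as robots or noindex.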

Most search engines will not show a page in their results when it includes a robots meta tag with the noindex directive. Sometimes, webmasters want to prevent only one specific crawler (also known as a user agent, because a crawler identifies itself with a user-agent name when it requests pages) from indexing.

This can be done by replacing the robots value of the name attribute with the name of the blocked crawler.

Only the search engine behind that crawler will then leave the page out of its search results. For example, Google’s standard web crawler has the user-agent name Googlebot.

To axe it, try this:

<meta name="googlebot" content="noindex" />

Because many major search engines run multiple crawlers (Google publishes a list of all of its crawlers), use this robots meta tag template to keep a page out of Google News search results:

<meta name="googlebot-news" content="noindex" />

Even specifying multiple search engine crawlers with multiple robots meta tags is okay:

<meta name="googlebot" content="noindex">
<meta name="googlebot-news" content="nosnippet">

If robots meta tags are implemented so that their directives compete with one another, search engines generally act upon the most restrictive directive.
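One way to picture the "most restrictive" rule is to add up every restriction addressed to a crawler, whether it comes from the generic robots tag or from a tag carrying the crawler's own name. The Python sketch below follows that reading. It is an assumption about how a crawler might combine tags, not a description of any search engine's actual code, and the names CrawlerMetaParser and effective_directives are made up here.

```python
from html.parser import HTMLParser


class CrawlerMetaParser(HTMLParser):
    """Map each meta name (e.g. 'robots', 'googlebot') to its directives."""

    def __init__(self):
        super().__init__()
        self.by_name = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").strip().lower()
        content = attr.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",") if d.strip()}
        if name and directives:
            # Other meta tags (description, viewport, ...) are stored too,
            # but only the names looked up below are ever read.
            self.by_name.setdefault(name, set()).update(directives)


def effective_directives(html: str, crawler: str) -> set:
    """Union of the generic 'robots' directives and those naming the crawler."""
    parser = CrawlerMetaParser()
    parser.feed(html)
    generic = parser.by_name.get("robots", set())
    specific = parser.by_name.get(crawler.lower(), set())
    return generic | specific


page = """
<meta name="robots" content="nosnippet">
<meta name="googlebot" content="noindex">
"""
print(sorted(effective_directives(page, "googlebot")))  # ['noindex', 'nosnippet']
print(sorted(effective_directives(page, "bingbot")))    # ['nosnippet']
```

Under this reading, a crawler named in its own tag still inherits the restrictions from the generic robots tag, so a page is treated as noindex if any tag that applies to the crawler says so.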

Search engine spiders visit a website's robots.txt file in the root directory first. Spiders parse the file, which outlines the pages and content to avoid processing or scanning.

However, search engines are imperfect. Spiders might hang onto cached copies of robots.txt files and may inadvertently crawl login-specific pages (shopping carts) or user-specific content (search results from internal searches). It is important to note that URLs disallowed in robots.txt files can still appear in search results if a page that is crawled links to them. Google has instructed webmasters to prevent crawling of undesirable content and indexing of internal search results, as it considers those search spam.
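To see how a well-behaved spider reads these rules, Python's standard urllib.robotparser module can evaluate a robots.txt file against a given URL. The robots.txt content and the example.com URLs below are hypothetical, picked to match the shopping cart and internal search cases above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network request)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)
for url in (
    "https://www.example.com/cart/checkout",
    "https://www.example.com/search?q=shoes",
    "https://www.example.com/blog/post",
):
    # can_fetch answers: may this user agent request this URL?
    print(url, rules.can_fetch("Googlebot", url))
```

The first two URLs come back as not fetchable and the third as fetchable. This only models a crawler that obeys a fresh copy of the file; it says nothing about cached copies or about whether a disallowed URL still shows up in results through links from other pages.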

Need help with your crawling issues?

Talk to our team of seasoned SEO professionals now.

GET STARTED TODAY.
