Sunday 24 March 2013

Custom Robots.txt Setting For your Blog

Every Blogger blog has a custom robots.txt setting, which tells search engines which pages of your blog they may crawl and index and which they should skip. How your blog is crawled depends on how you configure this setting.
Custom robots.txt is most useful when you do not want search engines to crawl a specific part or URL of the blog. For that purpose we have to make some custom changes.

Blogger now allows a custom robots.txt. This is very useful because it lets us control the visibility of our articles on search engines: we can determine whether a page will be crawled and indexed by search engines or not.

For example: there is a page on your blog (author notes) and you do not want search engines to crawl and index that page.
For every Blogger user, the default settings are normally as follows.


User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
Sitemap: http://www.rnhckr.com/feeds/posts/default?orderby=updated

Here www.rnhckr.com is the URL of the site/blog.

You can view the file by entering www.rnhckr.com/robots.txt in your browser, and your robots.txt will open.
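
The same lookup can be done from a script. This is only a minimal sketch using Python's standard library: the helper names robots_txt_url and fetch_robots_txt are made up for this example, and it assumes the blog is reachable over plain http when no scheme is given.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def robots_txt_url(blog_url: str) -> str:
    """Return the robots.txt address for the site that blog_url belongs to."""
    # Assumption: default to http:// when the address has no scheme.
    if "://" not in blog_url:
        blog_url = "http://" + blog_url
    parts = urlsplit(blog_url)
    # robots.txt always lives at the root of the host, whatever the page path is.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def fetch_robots_txt(blog_url: str) -> str:
    """Download the robots.txt file and return its text (needs network access)."""
    with urlopen(robots_txt_url(blog_url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


print(robots_txt_url("www.rnhckr.com"))
```

Calling fetch_robots_txt("www.rnhckr.com") would then return the same text that the browser shows.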


Explanation of the content present in robots.txt

1. User-agent: Mediapartners-Google
This is the robot from Google AdSense. If you are using AdSense ads, the first group will look like this. Do not try to make changes to it.



2. User-agent: *
The second group is marked with an asterisk " * ", which means its rules apply to all robots. In the default configuration, the rules in this group keep your blog's label pages from being indexed.


3. Disallow: /search
This line tells robots not to crawl any URL whose path begins with /search, which covers the blog's search result and label pages. The rest of your blog, including all posts and pages, can still be crawled and indexed by search engines. (This is the default setting from Google for every blog on the Blogger platform.)


4. Allow: /
Here your robots.txt tells robots that they may crawl and index the whole website/blog and the pages inside it, apart from anything blocked by a Disallow line. The " / " represents the root of your URL, for example
www.rnhckr.com = /


5. Sitemap: http://rnhckr.com/feeds/posts/default?orderby=updated
This is the default sitemap address for your blog. It is generated from the feed of your blog posts.
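
To see what these default rules do in practice, here is a small sketch that feeds them to Python's standard urllib.robotparser module. The post and label paths are invented for the example. Note that Python's parser applies the first rule that matches, while Google picks the most specific rule; for this particular file both approaches give the same answers.

```python
from urllib.robotparser import RobotFileParser

# The default Blogger robots.txt shown earlier in this post.
DEFAULT_ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://www.rnhckr.com/feeds/posts/default?orderby=updated
"""

parser = RobotFileParser()
parser.parse(DEFAULT_ROBOTS_TXT.splitlines())

site = "http://www.rnhckr.com"

# Hypothetical URLs, used only to exercise the rules.
checks = [
    ("Googlebot", "/2013/03/example-post.html"),   # an ordinary post
    ("Googlebot", "/search/label/SEO"),            # a label page
    ("Mediapartners-Google", "/search/label/SEO"), # the AdSense robot
]

for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, site + path) else "blocked"
    print(f"{agent:22} {path:30} {verdict}")
```

The label page should come out as blocked for Googlebot but allowed for Mediapartners-Google, because the AdSense robot has its own group with an empty Disallow line.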

By default, every blog that uses the Blogger platform will have a robots.txt as follows:

User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /

Sitemap: http://www.rnhckr.com/feeds/posts/default?orderby=updated

These lines are explained as follows:

Mediapartners-Google is the robot from Google AdSense. Leave it as is, because if you change it by mistake the ads served may no longer fit your content.

The next group applies to all robots and is marked with an asterisk (*). In the default configuration, the line Disallow: /search means the label pages of our blog are not indexed.
Keep in mind that a slash (/) stands for your homepage and everything below it. So if you want the labels to get indexed, do not write a bare slash like Disallow: / because that would block robots from crawling your whole blog. Instead, leave Disallow empty, as in the example below:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow:
Allow: /

Sitemap: http://www.rnhckr.com/feeds/posts/default?orderby=updated

With the configuration above, all of the articles and the labels will be indexed. To block robots from a particular page (I take my FAQ page as the example), you can simply write the following:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /p/faq.html
Allow: /
Sitemap: http://www.rnhckr.com/feeds/posts/default?orderby=updated
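
If you block several pages, typing the file by hand gets error-prone. The sketch below shows one way to generate it. The function build_robots_txt is a made-up helper, not something Blogger provides, and it assumes the sitemap is the default posts feed used throughout this post.

```python
def build_robots_txt(site_url: str, blocked_paths: list[str]) -> str:
    """Build a Blogger-style robots.txt that blocks the given page paths."""
    for path in blocked_paths:
        # Every rule path must start at the site root.
        if not path.startswith("/"):
            raise ValueError(f"path must start with '/': {path!r}")

    lines = [
        "User-agent: Mediapartners-Google",
        "Disallow:",
        "",
        "User-agent: *",
    ]
    if blocked_paths:
        lines += [f"Disallow: {path}" for path in blocked_paths]
    else:
        # An empty Disallow line blocks nothing.
        lines.append("Disallow:")
    lines += [
        "Allow: /",
        "",
        # Assumption: the sitemap is the default posts feed.
        f"Sitemap: {site_url.rstrip('/')}/feeds/posts/default?orderby=updated",
    ]
    return "\n".join(lines) + "\n"


print(build_robots_txt("http://www.rnhckr.com", ["/p/faq.html"]))
```

The printed text can then be pasted into the custom robots.txt box in the Blogger settings.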

Update: Removing Disallow: /search causes pagination problems on blogspot. To resolve them, we can use the following configuration to block the pagination pages:

User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search?updated-min=
Disallow: /search?updated-max=
Disallow: /search/label/*?updated-min=
Disallow: /search/label/*?updated-max=
Allow: /

Sitemap: http://www.rnhckr.com/feeds/posts/default?orderby=updated
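
The rules above use the * wildcard, which Google's crawler understands but Python's built-in urllib.robotparser does not. To check which URLs they catch, here is a rough sketch of Google-style matching, where * matches any run of characters and a trailing $ anchors the end of the URL. The functions rule_matches and is_blocked and the sample URLs are made up for this example. Treating any matching Disallow as a block is a simplification that only holds here because the sole Allow rule is the bare slash, the least specific rule possible.

```python
import re

# The Disallow patterns from the pagination configuration above.
DISALLOW_PATTERNS = [
    "/search?updated-min=",
    "/search?updated-max=",
    "/search/label/*?updated-min=",
    "/search/label/*?updated-max=",
]


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt pattern matches the start of path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each * into "any run of characters".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only matches at the beginning, which gives prefix matching.
    return re.match(regex, path) is not None


def is_blocked(path: str) -> bool:
    """Simplified check: blocked if any Disallow pattern matches."""
    return any(rule_matches(pattern, path) for pattern in DISALLOW_PATTERNS)


# Hypothetical URLs (path plus query string).
for path in [
    "/search/label/SEO",
    "/search/label/SEO?updated-max=2013-03-24T00:00:00",
    "/search?updated-min=2013-01-01T00:00:00",
    "/2013/03/example-post.html",
]:
    print("blocked" if is_blocked(path) else "allowed", path)
```

With this configuration the plain label page stays crawlable, while the paginated versions of the same page are blocked.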

After the changes, make sure everything works the way you want by visiting www.rnhckr.com/robots.txt. Replace rnhckr.com with your own domain name.
Warning! Use with caution. Incorrect use of these features can result in your blog being ignored by search engines.

This is a feature of the new Blogger interface, and it is helpful for SEO. On a self-hosted website we create a separate robots.txt file and put the rules in it. In the old Blogger interface this feature was not available; Google has now added it.

Making Changes In custom Robots.txt for your blog

If you do not want Google's crawlers to crawl a specific page URL on your blog, you have to make some changes in your robots.txt, as follows.

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /p/author.html
Allow: /

Sitemap: http://www.rnhckr.com/feeds/posts/default?orderby=updated

Here in the Disallow line, /p/ is the path of your static pages and author.html is the name of the page which you don't want search engines to crawl. The slash " / " represents your homepage.
1. Now go to your blog Design >> Settings >> Search preferences
2. In Crawlers and indexing >> Custom robots.txt
3. Paste your edited robots.txt data.
4. Click Save and refresh your page.


Let's see how to use it:
  • Go to Blogger and log in to your account.
  • Choose your blog.
  • Click Settings.
  • Here you will find Search preferences. Click on it.
  • Under Crawlers and indexing you will see Custom robots.txt.
  • Now click Edit and select Yes.
  • A new text area will appear. Put the code in it.

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search?q=*
Disallow: /*?updated-max=*
Allow: /

Sitemap: http://www.rnhckr.blogspot.com/feeds/posts/default?orderby=updated

  • Just change the Sitemap URL to your own.
  • After that, click Save changes.
Congrats!! You have done it. :)
