Okay... I concede. Apparently not everyone understands how search engines work when they launch their website or mini-website (like an MSN Space) on the internet. I have begun to get a small number of e-mails and comments asking, "Why am I listed in your list? I didn't ask to be listed." These range from incredibly hostile to confused but excited.
Before I jump into the details of the "Why", I'll make the informal announcement that I have stopped automated processing of additional spaces until I can provide a way for authors to remove themselves from the list and can limit how many times a space's content is indexed in a day. The hows of all this are up to me, but I'll give everyone an update within the next couple of days. For those of you who have been included in the list and have already voiced your desire not to be included, I apologize, but you will have to wait. Read everything below if you are upset by this.
My List Works Just Like a Search Engine
If your "website" does not completely limit access to its content, the content is publicly accessible (people often say it is "in the public domain", although that phrase has a separate legal meaning to do with copyright). However, search engines recognize that not everyone wants their web pages to be included in the results, so they implemented a solution. Well, more of a "good practice". The program that updates my list is known as a "web agent", a kind of "internet bot" that seeks out information on the internet. It works similarly to a search engine spider. My program could also be considered a news aggregator.
How Websites Limit Access to Content
Web pages may include a special HTML tag that tells an internet bot / web spider / web robot whether or not to index the page's content and whether or not to follow any links within the content. A separate file, known as "robots.txt", can specify how a bot accesses your website or a specific web directory within the site. Links in HTML may also include information that tells a bot not to follow them, giving the web page's author control over each link. However, internet bots that automatically read web pages are not required to respect these standard guidelines. This is an unfortunate reality on the internet. So, an author might specify everything to limit access to their content, but ultimately, unless the content is physically protected, it's up to the internet bot to decide.
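To make the "robots.txt" part concrete, here is a minimal sketch in Python of how a well-behaved bot might check the file before reading a page. It is not the code behind my list. The bot name "ExampleListBot", the example.com addresses, and the file's contents are all made up for illustration, and a real bot would download the file from the site rather than use a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
# A real bot would fetch this file from the root of the website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleListBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to read url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleListBot" is shut out of the whole site; other bots are
    # only asked to stay out of /private/.
    print(may_fetch(ROBOTS_TXT, "ExampleListBot", "http://example.com/blog/"))
    print(may_fetch(ROBOTS_TXT, "OtherBot", "http://example.com/blog/"))
    print(may_fetch(ROBOTS_TXT, "OtherBot", "http://example.com/private/diary"))
```

Note that nothing forces a bot to run a check like this. The file only states the author's wishes, and the bot chooses whether to honor them.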
How MSN Spaces Limits Access to Content
Most authors of mini-websites, like MSN Spaces, do not have control over "robots.txt" or the HTML tag for controlling web page indexing. So, MSN Spaces has provided the option to physically restrict access to a person's "Space" as well as providing support for specifying that a link is not to be followed. So, what if you do not want to restrict physical access to your space, but want to restrict access to your content by search engines or internet bots?
How MSN Spaces Should Limit Access to Content
Building on their existing support for limiting content, MSN Spaces should provide these additional options for limiting access to content:
- MSN Spaces should give us all the option to limit how an internet bot accesses our content, possibly providing a checkbox in our "Settings" panel for "noindex" and "nofollow" commands.
- MSN Spaces should also enable us to specify whether or not to include our Space in their own "Updated Spaces" list that appears on most spaces or in their "More Spaces" list.
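If MSN Spaces offered such a checkbox, the result would presumably be a "robots" tag in the page's HTML. The sketch below, in Python, shows how a cooperative bot might read that tag and also skip individual links marked "nofollow". The class name, the function name, and the sample page are hypothetical, and this is one possible approach rather than how any particular search engine does it.

```python
from html.parser import HTMLParser


class RobotsHintParser(HTMLParser):
    """Collects the page-level robots tag and the links a bot may follow."""

    def __init__(self):
        super().__init__()
        self.index_allowed = True
        self.follow_allowed = True
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.index_allowed = False
            if "nofollow" in directives or "none" in directives:
                self.follow_allowed = False
        elif tag == "a" and attrs.get("href"):
            # Skip individual links the author marked rel="nofollow".
            rel = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel:
                self.links.append(attrs["href"])


def read_hints(html: str):
    """Return (may_index, links_to_follow) for one page of HTML."""
    parser = RobotsHintParser()
    parser.feed(html)
    parser.close()
    links = parser.links if parser.follow_allowed else []
    return parser.index_allowed, links


# Hypothetical page: asks not to be indexed and marks one link nofollow.
PAGE = (
    '<html><head><meta name="robots" content="noindex"></head>'
    '<body><a href="/a">A</a> <a rel="nofollow" href="/b">B</a></body></html>'
)

if __name__ == "__main__":
    print(read_hints(PAGE))
```

For the sample page, a bot that honors these hints would not index the page and would follow only the "/a" link.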