WWE-2006 Weblog Datashare format
--------------------------------
- The dataset consists of about 10M weblog posts from 1M weblogs,
dated from July 7, 2005 to July 24, 2005, and collected by Intelliseek,
Inc.
- The data is in XML format; there is one XML file per day.
- Caveats:
* Parse the XML in .... chunks as there is the
occasional XML error
* Date/time are not normalized to a uniform time zone, so interpret
dates with care
* There are association errors in the data; in less than 2% of the
posts, the permalink, title, author and/or date will be
incorrectly associated with the content of the post
- Data is released in conjunction with the 3rd Annual Workshop on the Weblogging Ecosystem
http://www.blogpulse.com/www2006-workshop/
- To obtain a copy of the data, sign and fax the datashare individual
agreement form to Intelliseek:
http://www.blogpulse.com/www2006-workshop/datashare-agreement.pdf
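
The first caveat above (occasional XML errors) suggests parsing the data piece by piece so that one malformed piece does not abort a whole day's file. The following is a minimal Python sketch of that idea, not an official loader. How a file is split into chunks is left to the caller, and the element names in the sample (post, title) are placeholders invented for illustration; they are not the dataset's real tags.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks):
    """Parse each XML chunk on its own; record failures instead of aborting."""
    parsed, failed = [], []
    for index, chunk in enumerate(chunks):
        try:
            parsed.append(ET.fromstring(chunk))
        except ET.ParseError as err:
            # Keep the chunk index and parser message so bad chunks can be inspected later.
            failed.append((index, str(err)))
    return parsed, failed


# Hypothetical chunks; the tag names are placeholders, not the real schema.
sample = [
    "<post><title>first</title></post>",
    "<post><title>broken</post>",  # mismatched tag, rejected by the parser
    "<post><title>third</title></post>",
]
parsed, failed = parse_chunks(sample)
print(len(parsed), "parsed;", len(failed), "skipped")
```

Recording the index of each rejected chunk, rather than silently dropping it, makes it possible to check afterwards how much of a file was lost to XML errors.
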
Format of the feed:
- weblog url
- title of the weblog
(defaults to weblog url if title not found)
- permalink for the post
(defaults to weblog url if permalink not found)
- title of the post
- author of the post (may be empty or missing)
- date of publication of the post
- time of publication of the post in format HHMMSS
(defaults to 000000 if unknown)
- content of the post
- type of outlink: either "weblog" or "press"
- url in href of post content
- site: if type=="weblog", site is the parent weblog for the permalink url;
if type=="press", site is the news portal hosting the news article
- tag/category associated with post
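
Because several fields carry default values (weblog url standing in for a missing title or permalink, 000000 for an unknown time), a consumer may want to flag those placeholders before analysis. The sketch below shows one way to do that in Python. The dictionary keys (weblog_url, weblog_title, permalink, author, time) are hypothetical names chosen for this example, and treating a malformed time string as unknown is an assumption, not something the format description specifies.

```python
from datetime import time


def parse_post_time(raw):
    """Convert an HHMMSS string to datetime.time.

    Falls back to 00:00:00 (the documented default for unknown times)
    when the value is missing or malformed; the malformed case is an
    assumption of this sketch.
    """
    raw = (raw or "").strip()
    try:
        if len(raw) != 6 or not raw.isdigit():
            raise ValueError("expected six digits")
        return time(int(raw[0:2]), int(raw[2:4]), int(raw[4:6]))
    except ValueError:
        # Also reached for out-of-range values such as hour 25.
        return time(0, 0, 0)


def annotate_post(record):
    """Flag fields that appear to hold a documented default value."""
    url = record["weblog_url"]
    raw_time = record.get("time") or "000000"
    return {
        "weblog_title_missing": record.get("weblog_title", url) == url,
        "permalink_missing": record.get("permalink", url) == url,
        "author_missing": not record.get("author"),
        "time": parse_post_time(raw_time),
        # 000000 may mean "unknown" or a post published exactly at midnight.
        "time_maybe_unknown": raw_time == "000000",
    }


# Hypothetical record; keys and values are illustrative only.
example = {
    "weblog_url": "http://example.org/blog",
    "weblog_title": "http://example.org/blog",
    "permalink": "http://example.org/blog/2005/07/08/post",
    "time": "142530",
}
info = annotate_post(example)
print(info)
```

Note that a time of 000000 cannot be distinguished from a genuine midnight post, which is why the sketch reports it as "maybe unknown" rather than discarding it.
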