WWE-2006 Weblog Datashare format
--------------------------------

- The dataset consists of about 10M weblog posts from 1M weblogs, dated
  July 7, 2005 through July 24, 2005, and collected by Intelliseek, Inc.
- The data is in XML format; there is one XML file per day.
- Caveats:
  * Parse the XML in .... chunks, as there is the occasional XML error
  * Dates and times are not normalized to a uniform time zone, so
    interpret dates with care
  * There are association errors in the data; in less than 2% of the
    posts, the permalink, title, author and/or date will be incorrectly
    associated with the content of the post
- The data is released in conjunction with the 3rd Annual Workshop on the
  Weblogging Ecosystem:
  http://www.blogpulse.com/www2006-workshop/
- To obtain a copy of the data, sign and fax the datashare individual
  agreement form to Intelliseek:
  http://www.blogpulse.com/www2006-workshop/datashare-agreement.pdf

Format of the feed:

  weblog url
  title of the weblog (defaults to weblog url if title not found)
  permalink for the post (defaults to weblog url if permalink not found)
  title of the post
  author of the post (may be empty or missing)
  date of publication of the post
  time of publication of the post in format HHMMSS (defaults to 000000 if unknown)
  content of the post
  type of outlink: either "weblog" or "press"
  url in href of post content
  if type=="weblog", site is the parent weblog for the permalink url;
    if type=="press", site is the news portal hosting the news article
  tag/category associated with post
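
The chunked-parsing caveat and the HHMMSS time field can be handled along the following lines. This is only a minimal sketch: the element name `post` is an assumption made for illustration (the actual tag names are not given here), and `iter_posts` and `parse_hhmmss` are hypothetical helper names, not part of any released tooling.

```python
import re
import xml.etree.ElementTree as ET
from datetime import time

# Assumption: each record is wrapped in <post>...</post>.
# The tag name is hypothetical; substitute the one used in the real files.
CHUNK_RE = re.compile(r"<post\b.*?</post>", re.DOTALL)


def iter_posts(xml_text, chunk_re=CHUNK_RE):
    """Yield one parsed element per chunk, skipping chunks that fail to parse."""
    for match in chunk_re.finditer(xml_text):
        try:
            yield ET.fromstring(match.group(0))
        except ET.ParseError:
            # Occasional XML errors are expected; drop the bad chunk and go on.
            continue


def parse_hhmmss(value):
    """Convert an HHMMSS string to datetime.time.

    Missing or invalid values map to 00:00:00, mirroring the feed's own
    default of 000000 for unknown times. No time zone is attached, since
    the data is not normalized to one.
    """
    text = (value or "").strip()
    if not re.fullmatch(r"\d{6}", text):
        return time(0, 0, 0)
    try:
        return time(int(text[0:2]), int(text[2:4]), int(text[4:6]))
    except ValueError:
        # Six digits but out of range, e.g. "256199".
        return time(0, 0, 0)
```

Parsing each chunk on its own means a single malformed record costs one post rather than the whole daily file. The sketch matches against text already held in memory; for large daily files, reading and matching in bounded pieces would be the more careful choice.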