Temboz 0.4.4 released

I have released version 0.4.4 of my web-based aggregator, Temboz. Apart from cosmetic fixes, the new version improves filtering by supplying convenience functions to simplify writing rules. For instance:

content_any_words('foot', 'football', 'tennis', 'rugby')

is equivalent to the older syntax:

('foot' in content_words or 'football' in content_words
 or 'tennis' in content_words or 'rugby' in content_words)
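
The equivalence can be illustrated with a minimal sketch. This is not Temboz's actual implementation: the factory function make_content_any_words and the sample article text are assumptions made for illustration, and only the names content_words and content_any_words come from the rule syntax above.

```python
def make_content_any_words(content_words):
    """Return a helper bound to one article's set of words (hypothetical factory)."""
    def content_any_words(*words):
        # True if at least one of the given words occurs in the article
        return any(word in content_words for word in words)
    return content_any_words

# Hypothetical article text, lower-cased and split into a set of words
content_words = set('the rugby final was played on saturday'.split())
content_any_words = make_content_any_words(content_words)

# New convenience function
new_style = content_any_words('foot', 'football', 'tennis', 'rugby')

# Older, more verbose syntax
old_style = ('foot' in content_words or 'football' in content_words
             or 'tennis' in content_words or 'rugby' in content_words)
```

Both expressions evaluate to the same truth value for any article; the helper only saves typing and makes long word lists easier to maintain.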

This release also includes an automatic garbage collection mechanism to keep the database size manageable. Uninteresting articles (those flagged “thumbs down”) are purged of their content every day between 3AM and 4AM. The article metadata itself (title, timestamps, permalinks) is kept so that the articles do not reappear if they are still present in the feeds (some infrequently updated feeds keep really old articles in their XML). By default, articles older than 7 days are purged. This is configurable in param.py; you will need to update this file to add the corresponding parameter.
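
As a rough sketch of what such a purge could look like, the following uses Python's sqlite3 module against an in-memory database. The table layout (item, rating, created, content), the convention that a rating of -1 means “thumbs down”, and the parameter name GARBAGE_COLLECT_DAYS are all assumptions for illustration; the real schema and the real parameter name in param.py may differ.

```python
import sqlite3
import time

# Hypothetical parameter; the actual name in param.py may differ
GARBAGE_COLLECT_DAYS = 7
SECONDS_PER_DAY = 86400

def purge_uninteresting(db, now=None, max_age_days=GARBAGE_COLLECT_DAYS):
    """Blank the content of old thumbs-down articles, keeping their metadata."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    cursor = db.execute(
        "UPDATE item SET content = '' "
        "WHERE rating = -1 AND created < ? AND content <> ''",
        (cutoff,))
    db.commit()
    return cursor.rowcount  # number of articles purged

# Small demonstration with an assumed schema
db = sqlite3.connect(':memory:')
db.execute("""CREATE TABLE item (
    id INTEGER PRIMARY KEY, title TEXT, link TEXT,
    created REAL, rating INTEGER, content TEXT)""")
now = time.time()
db.executemany(
    "INSERT INTO item (title, link, created, rating, content) "
    "VALUES (?, ?, ?, ?, ?)",
    [('old, thumbs down', 'http://example.com/1',
      now - 10 * SECONDS_PER_DAY, -1, 'body 1'),
     ('recent, thumbs down', 'http://example.com/2',
      now - 2 * SECONDS_PER_DAY, -1, 'body 2'),
     ('old, interesting', 'http://example.com/3',
      now - 10 * SECONDS_PER_DAY, 1, 'body 3')])
purged = purge_uninteresting(db, now=now)
```

Only the first article loses its content. Its row, with title and permalink, stays in the table, which is what prevents it from being re-imported if the feed still lists it.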

Temboz 0.4.2 released

I have released version 0.4.2 of Temboz. This is a bugfix release to address an issue that could cause a feed to stop the server. All users are advised to upgrade.

Temboz 0.4 released

I have released version 0.4 of my web-based aggregator, Temboz. The new version focuses on performance, by adding an index and rewriting some queries to gain almost an order of magnitude performance on the two most common operations, viewing unread articles and the “all feeds” summary page. Upgraders will need to read the UPGRADE file to add the index to their existing database.
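
To illustrate the kind of change involved (the actual statements are in the UPGRADE file), here is a hedged sketch showing how adding an index changes SQLite's query plan. The table item, its columns, the index name item_rating_idx, and the convention that a rating of 0 means “unread” are assumptions for illustration, not the real Temboz schema.

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Assumed, simplified schema
db.execute("""CREATE TABLE item (
    id INTEGER PRIMARY KEY, feed_id INTEGER, rating INTEGER, title TEXT)""")
db.executemany(
    "INSERT INTO item (feed_id, rating, title) VALUES (?, ?, ?)",
    [(i % 10, i % 3 - 1, 'article %d' % i) for i in range(1000)])

# Hypothetical "unread articles" query
unread_query = "SELECT id, title FROM item WHERE rating = 0"

def query_plan(db, sql):
    """Return SQLite's textual plan for a query as a single string."""
    rows = db.execute('EXPLAIN QUERY PLAN ' + sql).fetchall()
    # The human-readable detail is the last column of each row
    return ' '.join(str(row[-1]) for row in rows)

plan_before = query_plan(db, unread_query)   # full table scan
db.execute("CREATE INDEX item_rating_idx ON item (rating)")
plan_after = query_plan(db, unread_query)    # should now use the index
```

Comparing plan_before and plan_after shows whether the planner switched from scanning the whole table to searching the index, which is where a large speedup on a frequently run query would come from.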

RSS/Atom and information overload

I have been running Temboz, my home-made RSS/Atom aggregator, for half a year now, and it is interesting to take stock. I ran a report on the database to count how many items per day I read, how many are filtered out automatically, and how many I flagged as interesting.

Temboz statistics

The most obvious thing is the steady increase in the number of articles per day, while the number of articles I flag as interesting remains mostly constant (perhaps a sign of greater selectivity). The increase is primarily due to an increase in the number of feeds I subscribe to: as the ergonomics of the feed reader improved (at least from my perspective), I became able to read more feeds. The addition of filtering also allows me to read via RSS sources of information I used to check daily, such as the Photo.net forums. As time goes on, I find I seldom visit web sites on a daily basis any more, not even the New York Times (granted, the steadily deteriorating quality of their journalism might have something to do with that).

My filtering scheme is manual and rules-based. I am a bit leery of implementing something like Bayesian filtering, as articles I flag as “uninteresting” are not necessarily articles I would like to be filtered away: some are duplicates, some are worth a chuckle but not much more. The risk with the “Daily Me” is to lock oneself into a routine and self-reinforcing echo chamber, so I try to keep a balanced diet of information. There are some subjects I am completely uninterested in, however. For instance, one of my rules filters out anything sports-related from The Guardian.

As I enrich my library of filtering rules, the proportion of articles filtered is increasing steadily (the recent dip in September was caused by a flurry of feed subscriptions). A 20% time saving is nothing to sneeze at. Fatigue plays a role: I dislike phones, and this week I finally got fed up with the plethora of cell phone reviews and filtered out all articles dealing with phones altogether.

Temboz statistics

One big win would be an algorithm that could reliably detect and group together articles that are on the same topic, the way Google News does (Google News has the potential to be the ultimate RSS/Atom aggregator). I experimented with a scheme to look for duplicated URLs inside the articles, but this didn’t work very well. Some form of statistical natural language processing would be needed, but that is more work than I am prepared to put in right now.
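
For reference, the duplicated-URL idea can be sketched along the following lines. This is an illustrative reconstruction, not the code I actually experimented with: the function name group_by_shared_urls, the URL regular expression, and the sample articles are all assumptions. Articles that link to at least one common URL are merged into the same group using a small union-find structure.

```python
import re
from collections import defaultdict

# Deliberately simple URL matcher, good enough for a sketch
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def group_by_shared_urls(articles):
    """Group article ids whose text links to at least one common URL.

    articles: dict mapping article id -> article text (HTML or plain).
    Returns a sorted list of groups, each a sorted list of ids.
    """
    by_url = defaultdict(set)
    for article_id, text in articles.items():
        for url in URL_RE.findall(text):
            # Drop trailing punctuation picked up from running prose
            by_url[url.rstrip('.,;)')].add(article_id)

    # Union-find: articles sharing a URL end up with the same root
    parent = {article_id: article_id for article_id in articles}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for ids in by_url.values():
        ids = sorted(ids)
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    groups = defaultdict(list)
    for article_id in sorted(articles):
        groups[find(article_id)].append(article_id)
    # Only groups with more than one article are interesting
    return sorted(g for g in groups.values() if len(g) > 1)

# Hypothetical sample data
articles = {
    1: 'See http://example.com/story for details.',
    2: 'Commentary on <a href="http://example.com/story">this piece</a>',
    3: 'Unrelated link: http://example.org/other',
    4: 'No links at all',
}
groups = group_by_shared_urls(articles)
```

The sketch also hints at why the approach is weak: two articles on the same topic often cite different URLs and are missed, while widely linked pages can lump unrelated articles together.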