Temboz

Temboz 0.8 released

I am pleased to announce the release of Temboz 0.8.

The main change in this release is its ability to work with either SQLite 2.x or SQLite 3.x. SQLite 3.x is now the recommended version; see the Temboz Wiki for upgrade instructions. SQLite 3.x improves performance, database file sizes and concurrency, but it also introduced a condition where Temboz could deadlock, hence the long incubation time for this release.

Another enhancement is the ability to sort feeds by Signal to Noise Ratio (SNR). The default view for the all feeds page now lists high-quality feeds with unread articles first. If you are catching up with many articles, it pays to concentrate on the richest lodes of information first, and possibly prune those feeds that no longer provide an adequate level of interesting information.
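To illustrate the ordering described above, here is a minimal sketch. The exact SNR formula is not given in this announcement, so the code assumes SNR is the fraction of rated articles that were flagged as interesting. `FeedStats`, `snr` and `sort_feeds` are hypothetical names, not part of Temboz.

```python
from dataclasses import dataclass


@dataclass
class FeedStats:
    title: str
    interesting: int    # articles flagged "thumbs up"
    uninteresting: int  # articles flagged "thumbs down"
    unread: int         # articles not yet rated


def snr(feed):
    """Assumed definition: share of rated articles that were interesting."""
    rated = feed.interesting + feed.uninteresting
    return feed.interesting / rated if rated else 0.0


def sort_feeds(feeds):
    """Feeds with unread articles first, then by descending SNR, then by title."""
    return sorted(feeds, key=lambda f: (f.unread == 0, -snr(f), f.title.lower()))
```

With this key, a feed with no unread articles sorts after every feed that has some, however high its SNR.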

I have a number of feature requests I received from users or thought up myself. You are welcome to suggest others on the ticket page for Temboz CVStrac.

Temboz 0.7 released

I have released version 0.7 of Temboz. The main improvements in the new version are a better user interface, ad filtering, and garbage collection of articles older than 6 months. Several facilities have also been added to make it easier to write and test filtering rules – you can now add comments to a rule, or purge and reload a feed from the feed details page to see whether changed rules are kicking in or not.

Temboz now also has a publicly accessible CVStrac with a documentation Wiki and a bug-tracking database (where change requests can also be submitted). The Wiki is publicly read-only for now, but if you would like to contribute to it, drop me an email and I will create an account with edit privileges for you.

Temboz 0.5 released

I have released version 0.5 of Temboz. This version makes considerable improvements in its tracking of feed changes. Feeds where the GUID is distinct from the link are now handled correctly. Some feeds have a tendency to modify articles and reissue them with a different GUID or link, causing them to appear as duplicates. This is often the case with Reuters and Sun blogs (Sun is now handled as a special case in the feed normalization code). If the optional title-based duplicate detection flag is set on a feed (go into the feed details page from the all feeds view), articles with duplicate titles will not be recorded twice in the database. This is not on by default, as it could cause false positives on some feeds that have recurring titles.
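The duplicate handling described above could look roughly like the following. This is a sketch under assumptions: the `item` table, its columns and the `record_item` helper are hypothetical and do not reflect the actual Temboz data model.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical item table."""
    conn = sqlite3.connect(':memory:')
    conn.execute("""CREATE TABLE item (
        feed TEXT NOT NULL, guid TEXT NOT NULL, link TEXT, title TEXT,
        UNIQUE (feed, guid))""")
    return conn


def record_item(conn, feed, guid, link, title, dedupe_titles=False):
    """Store an article unless it is already known; return True if stored."""
    known = conn.execute(
        "SELECT 1 FROM item WHERE feed = ? AND guid = ?", (feed, guid)).fetchone()
    if known:
        return False
    if dedupe_titles:
        # Optional per-feed flag: treat a repeated title as a reissued article.
        same_title = conn.execute(
            "SELECT 1 FROM item WHERE feed = ? AND title = ?", (feed, title)).fetchone()
        if same_title:
            return False
    conn.execute(
        "INSERT INTO item (feed, guid, link, title) VALUES (?, ?, ?, ?)",
        (feed, guid, link, title))
    return True
```

The title check is scoped to a single feed, so a feed with recurring titles can leave the flag off without affecting the others.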

The other big feature in this release is that Temboz now automatically backs up its database nightly, and keeps a configurable number of daily backups (7 by default).
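A backup rotation of this kind can be sketched as follows. The file naming scheme and the `backup_database` function are assumptions made for illustration. The plain file copy also assumes nothing is writing to the database while it runs.

```python
import os
import shutil
from datetime import date


def backup_database(db_path, backup_dir, keep=7, today=None):
    """Copy db_path into backup_dir, then prune all but the newest `keep` copies."""
    today = today or date.today()
    os.makedirs(backup_dir, exist_ok=True)
    target = os.path.join(backup_dir, 'backup-%s.db' % today.isoformat())
    shutil.copy2(db_path, target)
    # ISO dates sort lexicographically in chronological order.
    backups = sorted(name for name in os.listdir(backup_dir)
                     if name.startswith('backup-') and name.endswith('.db'))
    for stale in backups[:max(0, len(backups) - keep)]:
        os.remove(os.path.join(backup_dir, stale))
    return target
```

Running it once a day with `keep=7` leaves the seven most recent daily copies on disk.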

These changes require a data model upgrade. A script is provided to perform the upgrade, as well as another one to reconcile items already recorded where the GUID differs from the link. Upgrade instructions are provided in the UPGRADE file. All users are advised to upgrade.

Temboz 0.4.4 released

I have released version 0.4.4 of my web-based aggregator, Temboz. Apart from cosmetic fixes, the new version improves filtering by supplying convenience functions to simplify writing rules. For instance:

content_any_words('foot', 'football', 'tennis', 'rugby')

is equivalent to the older syntax:

('foot' in content_words or 'football' in content_words
 or 'tennis' in content_words or 'rugby' in content_words)
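For readers curious how such a convenience function might be defined, here is a minimal sketch. It assumes `content_words` is a set of lower-cased words taken from the article body. The `words` and `make_content_any_words` helpers are hypothetical; the real function is bound to the article under evaluation in a way not shown here.

```python
import re


def words(text):
    """Split text into a set of lower-cased words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def make_content_any_words(content):
    """Build a content_any_words() function bound to one article's content."""
    content_words = words(content)

    def content_any_words(*candidates):
        # True if at least one candidate word appears in the article.
        return any(w.lower() in content_words for w in candidates)

    return content_any_words
```

For example, `make_content_any_words('Federer wins the tennis final')('foot', 'football', 'tennis', 'rugby')` is true because "tennis" appears in the text.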

This release also includes an automatic garbage collection mechanism to keep the database size manageable. Uninteresting articles (those flagged “thumbs down”) are purged of their content every day between 3AM and 4AM. The article metadata itself (title, timestamps, permalinks) is kept to avoid the articles reappearing if they are still present in the feeds (some infrequently updated feeds keep really old articles in their XML). By default, articles older than 7 days are purged. This is configurable in param.py, and you will need to update this file to add the corresponding parameter.
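The purge step can be pictured with the sketch below. The `item` table, the rating convention (-1 for “thumbs down”) and the `purge_uninteresting` function are assumptions for illustration, not the actual Temboz schema or code.

```python
import sqlite3

DAY = 86400  # seconds


def create_schema(conn):
    # Hypothetical table; rating -1 means "thumbs down", 1 means "thumbs up".
    conn.execute("""CREATE TABLE item (
        id INTEGER PRIMARY KEY, title TEXT, link TEXT,
        created INTEGER, rating INTEGER, content TEXT)""")


def purge_uninteresting(conn, now, max_age_days=7):
    """Blank the content of old thumbs-down articles; return how many were purged."""
    cutoff = now - max_age_days * DAY
    cur = conn.execute(
        "UPDATE item SET content = NULL "
        "WHERE rating = -1 AND created < ? AND content IS NOT NULL",
        (cutoff,))
    conn.commit()
    return cur.rowcount
```

Because the row itself survives with its title, link and timestamp, an old article that is still listed in the feed is recognized as already seen rather than reappearing as new.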

Temboz 0.4.2 released

I have released version 0.4.2 of Temboz. This is a bugfix release to address an issue that could cause a feed to stop the server. All users are advised to upgrade.

Temboz 0.4 released

I have released version 0.4 of my web-based aggregator, Temboz. The new version focuses on performance, by adding an index and rewriting some queries to gain almost an order of magnitude performance on the two most common operations, viewing unread articles and the “all feeds” summary page. Upgraders will need to read the UPGRADE file to add the index to their existing database.
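As a rough illustration of why an index helps the unread articles view, the sketch below builds a hypothetical `item` table, adds an index, and asks SQLite how it would run the query. The table, the index name and the rating convention (0 for unread) are assumptions. The index actually added in 0.4 is described in the UPGRADE file.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE item (
    id INTEGER PRIMARY KEY, feed INTEGER, rating INTEGER,
    created INTEGER, title TEXT)""")

# Hypothetical index covering the filter column and the sort column.
conn.execute("CREATE INDEX item_rating_created ON item (rating, created)")

# Ask SQLite for its query plan; the last column holds the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, title FROM item WHERE rating = 0 ORDER BY created DESC").fetchall()
plan_text = ' '.join(str(row[-1]) for row in plan)
print(plan_text)
```

If the plan mentions the index name, the query no longer needs to scan and sort the whole table.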

RSS/Atom and information overload

I have been running Temboz, my home-made RSS/Atom aggregator, for half a year now, and it is interesting to take stock. I ran a report on the database to count how many items per day I read, how many are filtered out automatically, and how many I flagged as interesting.

Temboz statistics

The most obvious thing is the steady increase in the number of articles per day, while the number of articles I flag as interesting remains mostly constant (perhaps a sign of greater selectivity). The increase is primarily due to a growing number of feed subscriptions: as the ergonomics of the feed reader improved (at least from my perspective), I could read more feeds. The addition of filtering also allows me to read via RSS sources of information I used to check daily, such as the Photo.net forums. As time goes by, I find I seldom visit web sites on a daily basis any more, not even the New York Times (granted, the steadily deteriorating quality of their journalism might have something to do with that).

My filtering scheme is manual and rules-based. I am a bit leery of implementing something like Bayesian filtering, as articles I flag as “uninteresting” are not necessarily articles I would like to be filtered away – some are duplicates, some are worth a chuckle but not much more. The risk with the “Daily Me” is to lock oneself into a routine and self-reinforcing echo chamber, so I try to keep a balanced diet of information. There are some subjects I am completely uninterested in, however. For instance, one of my rules filters out anything sports-related from The Guardian.

As I enrich my library of filtering rules, the proportion of articles filtered is increasing steadily (the recent dip in September was caused by a flurry of feed subscriptions). A 20% time savings is nothing to sniff at. Fatigue plays a role — I dislike phones and finally got fed up with the plethora of cell phone reviews this week and filtered out all articles dealing with phones altogether.

Temboz statistics

One big win would be an algorithm that could reliably detect and group together articles that are on the same topic, the way Google News does (Google News has the potential to be the ultimate RSS/Atom aggregator). I experimented with a scheme to look for duplicated URLs inside the articles, but this didn’t work very well. Some form of statistical natural language processing would be needed, but that is more work than I am prepared to put in right now.

The Temboz RSS aggregator

2013-03-14: Google’s announcement that their Reader service will be discontinued has spurred interest in Temboz. This software is not dead; in fact I use it daily, but I have not made an official release in a long time. You should use the version from Github instead. There are currently a number of bugs which can lead to Temboz locking up and requiring a restart. I am planning on completing my long overdue overhaul before Google’s July deadline.

Introduction

Temboz is an RSS aggregator. It is inspired by FeedOnFeeds (web-based personal aggregator), Google News (two column layout) and TiVo (thumbs up and down). I have been using FeedOnFeeds for some time now, but that software seems to have stopped evolving, and I had a number of optimizations to the user experience I wanted to make.

Features

Already implemented:

  • Multithreaded: downloads feeds in parallel.
  • Built-in web server.
  • Two-column user interface for better readability and information density. Automatic reflow using CSS.
  • Ratings system for articles
  • Real-time hunter-gatherer user interface: items flagged with a “Thumbs down” disappear immediately off the screen (using Dynamic HTML), making room for new articles. No laborious flagging of items as in FeedOnFeeds.
  • Filtering entries (using Python syntax, e.g. ‘Salon’ in feed_title and title == “King Kaufman’s Sports Daily”, or simply by selecting keywords/phrases and hitting “Thumbs down”).
  • Ability to generate an RSS feed from “Thumbs Up” articles, which is what makes Temboz a true aggregator, not just a reader.
  • Ad filtering
  • Automatic garbage collection: every day between 3AM and 4AM, uninteresting articles (by default those older than 7 days) are purged of their contents (but not metadata such as titles, permalinks or timestamps) to keep the database size manageable. After 6 months (by default), they are deleted altogether.
  • Automatic database backups daily (immediately after garbage collection)

On the to do list:

  • Write better documentation
  • Handle permanent HTTP redirects for feed XML URLs
  • Automatic pacing of feed polling intervals using the average and standard deviation of observed feed item inter-arrival times, to reduce bandwidth usage and load for both client and server. Most feeds should be polled on a daily rather than hourly interval (e.g. my own, since I update once a week on average), but the mechanisms for a feed to indicate its polling rate preferences are quite inconsistent from one flavor of RSS/Atom to another.
  • “Survivor mode” – vote feeds that no longer perform off the aggregator based on relevance statistics.
  • Ability to cluster together articles (I tried a heuristic of looking for common URLs they are all pointing to, but this didn’t work well in practice).
  • Portability to Windows, distribution as a standalone package.
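The polling pacing idea in the to do list above could be sketched like this. It is only one possible interpretation: the formula (mean gap minus one standard deviation, clamped between an hourly floor and a daily ceiling) and the `polling_interval` function are assumptions, not an implemented Temboz feature.

```python
from statistics import mean, pstdev

HOUR = 3600
DAY = 24 * HOUR


def polling_interval(arrival_times, floor=HOUR, ceiling=DAY):
    """Suggest a polling interval in seconds from observed item timestamps."""
    times = sorted(arrival_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        # Too little history to estimate anything: fall back to daily polling.
        return ceiling
    # Irregular feeds (large deviation) get polled more often than the mean gap.
    estimate = mean(gaps) - pstdev(gaps)
    return int(min(ceiling, max(floor, estimate)))
```

A feed updated once a week ends up at the daily ceiling, while a feed that posts every few minutes is still polled no more than once an hour.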

History

I have been using it successfully for well over a year. It still has rough edges, with some administration functions only doable using the SQLite command-line utility. Here is a screen shot showing the reader user interface. The article highlighted in yellow was given a “Thumbs Up”. You can also see the user interface at work in a view of the last 50 articles I flagged as “thumbs up” among the feeds I read.

Screen shots

Click on a screen shot thumbnail for a full-sized version

The first screen shot shows the article reading interface, using a two-column layout. Clicking on the “Thumbs down” icon makes the article disappear, bringing a new one in its place (if available). Clicking on the “Thumbs up” icon highlights it in yellow and flags it as interesting in the database.

view items

The feed summary page shows statistics on feeds, starting with feeds with unread articles, then by alphabetical order. Feeds can be sorted based on other metrics. You have the option of “catching up” with a feed (marking all the articles as read). Feeds with errors are highlighted in red (not shown).

view feeds

Clicking on the “details” link for a feed brings up this page, which allows you to change the title or feed URL, and shows the RSS or Atom fields accessible for filtering.

feed details

Feeds can be filtered using Python expressions.

filtering rules
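To give an idea of how a filtering rule written as a Python expression can be applied to an article, here is a minimal sketch. The `rule_matches` function and the dictionary layout are hypothetical. The variable names `feed_title`, `title` and `content_words` are the ones used in the rule examples on this page.

```python
def rule_matches(rule, article):
    """Evaluate a rule expression against an article's fields.

    `article` maps variable names (feed_title, title, content_words, ...)
    to values. A rule that fails to evaluate is treated as not matching,
    so a broken rule never filters anything out.
    """
    namespace = dict(article)
    try:
        # Emptying __builtins__ is NOT a sandbox; only evaluate trusted rules.
        return bool(eval(rule, {'__builtins__': {}}, namespace))
    except Exception:
        return False


article = {
    'feed_title': 'Salon',
    'title': "King Kaufman's Sports Daily",
    'content_words': {'baseball', 'season'},
}
rule = "'Salon' in feed_title and title == \"King Kaufman's Sports Daily\""
print(rule_matches(rule, article))
```

Since the rules are ordinary Python, this approach only suits a personal aggregator where the person writing the rules is also the person running the server.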

Known bugs

You can check outstanding bug reports, change requests and more at the public CVStrac site.

Credits

Temboz is written in Python, and leverages Mark Pilgrim’s Ultra-liberal feed parser, SQLite 2.x and Cheetah.

Download

You can download the current version: temboz-0.8.tar.gz. I welcome any feedback you may have, especially on how to improve installation.

The CVS version is far ahead of 0.8 in features. I have not yet had the time to test and document the migration procedure from 0.8 to 1.0, but if you are a new Temboz user I strongly advise you to get a nightly CVS snapshot instead (they are what I run on my own server): temboz-CVS.tar.gz or temboz-CVS.zip.

Updates

For news on Temboz, please subscribe to the RSS feed.

Temboz has a CVStrac where you can submit bug reports or change requests, and a Wiki, where all future documentation will ultimately reside.

Post scriptum

The name “Temboz” is a reference to Malima Temboz, “The mountain that walks”, an elephant whose tormented spirit is the object of Mike Resnick’s excellent SF novel, Ivory.