Fazal Majid's low-intensity blog

Sporadic pontification

Fazal Fazal

Just enough Weave

Note: I am keeping this code around for historical purposes, but it has not worked since Weave 1.0 RC2. I created this because Mozilla’s public sync servers were initially quite unreliable, but they have remedied the situation and performance problems are a thing of the past. I also learned the inner workings of Weave/Firefox Sync in the process, and am satisfied as to the security of the system. Since I no longer use Firefox myself, I do not expect to ever revive this project. Feel free to take it over, otherwise you are best served by using Mozilla’s cloud.

Like most of my readers, I use multiple computers: my Mac Pro at home, my MacBook Air when on the road, 3 desktop PCs at work, a number of virtual machines, and so on. I have Firefox installed on all of them. The Mozilla Weave extension allows me to sync bookmarks, passwords et al between them. Weave encrypts this data before uploading it to the server, but I do not like to rely on third-party web services for mission-critical functions (my Mozilla server was down last Monday, for instance, due to the surge of traffic from people returning to work and performing a full sync against 0.5). Through Weave 0.5, I ran my own instance of the Mozilla public Weave server version 0.3. Unfortunately, Weave 0.6 requires server version 0.5 and I had to upgrade.

The open-source Weave server is implemented in PHP. It doesn’t require Apache compiled with mod_dav as early versions did (I prefer to run nginx), but it is still a fairly gnarly piece of code that is anything but plug-and-play. Somehow I had managed to get version 0.3 running on my home server, but no amount of blundering around got me to a usable state with 0.5. I ended up deciding to implement a minimalist Weave server in Python, as it seemed less painful than continuing to struggle with the Mozilla spaghetti code, which confusingly features multiple pieces of code that appear to do exactly the same thing in three different places. Famous last words…

Three days of hacking later, I managed to get it working. 200 or so lines of Python code replaced approximately 12,000 lines of PHP. Of course, I am not trying to reproduce an entire public cloud infrastructure like Mozilla’s, just enough for my own needs, using the “simplest thing that works” principle. Interestingly, the Mozilla code includes a vestigial Python reference implementation of a Weave server for testing purposes. It does not seem to have been working for a while, though. I used it as a starting point but ended up rewriting almost everything. Here are the simplifying hypotheses:

  • My weave server is meant for a single user (my wife prefers Safari)
  • It does not implement authentication, logging or SSL encryption — it is meant to be used behind a nginx (or Apache) reverse proxy that will perform these functions.
  • It has no configuration file. There are just three variables to set at the top of the source file.
  • It does not implement the full server protocol, just the parts that are actually used by the extension today.
  • More controversially, it does not even implement persistence, keeping all data in RAM instead. Python running on Solaris is very reliable, and the expected uptime of the server is likely months on end. If the server fails, the Firefoxes will just have to perform a full sync and reconciliation. Fortunately, that has been much improved in Weave 0.6, so the cost is minimal. This could even be construed as a security feature, since there is no data on disk to be misplaced. It would take catastrophically losing all my browsers simultaneously to risk data loss. Short of California falling into the ocean, that’s not going to happen, and if it does, I probably have more pressing concerns…

The code could be extended fairly easily to lift these hypotheses, e.g. adding persistence or multiple user support using SQLite, PostgreSQL or MySQL.

Here is the server itself, weave_server.py:

#!/usr/local/bin/python
"""
  Based on tools/scripts/weave_server.py from
  http://hg.mozilla.org/labs/weave/

  do the Simplest Thing That Can Work: just enough to get by with Weave 0.6
  - SSL, authentication and loggin are done by nginx or other reverse proxy
  - no persistence, in case of process failure do a full resync
  - only one user. If you need more, create multiple instances on different
    ports and use rewrite rules to route traffic to the right one
"""

import sys, time, logging, socket, urlparse, httplib, pprint
try:
  import simplejson as json
except ImportError:
  import json
import wsgiref.simple_server

URL_BASE = 'https://your.server.name/'
#BIND_IP = ''
BIND_IP = '127.0.0.1'
DEFAULT_PORT = 8000

class HttpResponse:
  def __init__(self, code, content='', content_type='text/plain'):
    self.status = '%s %s' % (code, httplib.responses.get(code, ''))
    self.headers = [('Content-type', content_type),
                    ('X-Weave-Timestamp', str(timestamp()))]
    self.content = content or self.status

def JsonResponse(value):
  return HttpResponse(httplib.OK, value, content_type='application/json')

class HttpRequest:
  def __init__(self, environ):
    self.environ = environ
    content_length = environ.get('CONTENT_LENGTH')
    if content_length:
      stream = environ['wsgi.input']
      self.contents = stream.read(int(content_length))
    else:
      self.contents = ''

def timestamp():
  # Weave rounds to 2 digits and so must we, otherwise rounding errors will
  # influence the "newer" and "older" modifiers
  return round(time.time(), 2)

class WeaveApp():
  """WSGI app for the Weave server"""
  def __init__(self):
    self.collections = {}

  def url_base(self):
    """XXX should derive this automagically from self.request.environ"""
    return URL_BASE

  def ts_col(self, col):
    self.collections.setdefault('timestamps', {})[col] = str(timestamp())

  def parse_url(self, path):
    if not path.startswith('/0.5/') and not path.startswith('/1.0/'):
      return
    command, args = path.split('/', 4)[3:]
    return command, args

  def opts_test(self, opts):
    if 'older' in opts:
      return float(opts['older'][0]).__ge__
    elif 'newer' in opts:
      return float(opts['newer'][0]).__le__
    else:
      return lambda x: True

  # HTTP method handlers

  def _handle_PUT(self, path, environ):
    command, args = self.parse_url(path)
    col, key = args.split('/', 1)
    assert command == 'storage'
    val = self.request.contents
    if val[0] == '{':
      val = json.loads(val)
      val['modified'] = timestamp()
      val = json.dumps(val, sort_keys=True)
    self.collections.setdefault(col, {})[key] = val
    self.ts_col(col)
    return HttpResponse(httplib.OK)

  def _handle_POST(self, path, environ):
    try:
      status = httplib.NOT_FOUND
      if path.startswith('/0.5/') or path.startswith('/1.0/'):
        command, args = self.parse_url(path)
        col = args.split('/')[0]
        vals = json.loads(self.request.contents)
        for val in vals:
          val['modified'] = timestamp()
          self.collections.setdefault(col, {})[val['id']] = json.dumps(val)
        self.ts_col(col)
        status = httplib.OK
    finally:
      return HttpResponse(status)

  def _handle_DELETE(self, path, environ):
    assert path.startswith('/0.5/') or path.startswith('/1.0/')
    response = HttpResponse(httplib.OK)
    if path.endswith('/storage/0'):
      self.collections.clear()
    elif path.startswith('/0.5/') or path.startswith('/1.0/'):
      command, args = self.parse_url(path)
      col, key = args.split('/', 1)
      if not key:
        opts = urlparse.parse_qs(environ['QUERY_STRING'])
        test = self.opts_test(opts)
        col = self.collections.setdefault(col, {})
        for key in col.keys():
          if test(json.loads(col[key]).get('modified', 0)):
            logging.info('DELETE %s key %s' % (path, key))
            del col[key]
      else:
        try:
          del self.collections[col][key]
        except KeyError:
          return HttpResponse(httplib.NOT_FOUND)
    return response

  def _handle_GET(self, path, environ):
    if path.startswith('/0.5/') or path.startswith('/1.0/'):
      command, args = self.parse_url(path)
      return self.handle_storage(command, args, path, environ)
    elif path.startswith('/1/'):
      return HttpResponse(httplib.OK, self.url_base())
    elif path.startswith('/state'):
      return HttpResponse(httplib.OK, pprint.pformat(self.collections))
    else:
      return HttpResponse(httplib.NOT_FOUND)

  def handle_storage(self, command, args, path, environ):
    if command == 'info':
      if args == 'collections':
        return JsonResponse(json.dumps(self.collections.get('timestamps', {})))
    if command == 'storage':
      if '/' in args:
        col, key = args.split('/')
      else:
        col, key = args, None
      try:
        if not key: # list output requested
          opts = urlparse.parse_qs(environ['QUERY_STRING'])
          test = self.opts_test(opts)
          result = []
          for val in self.collections.setdefault(col, {}).itervalues():
            val = json.loads(val)
            if test(val.get('modified', 0)):
              result.append(val)
          result = sorted(result,
                          key=lambda val: (val.get('sortindex'),
                                           val.get('modified')),
                          reverse=True)
          if 'limit' in opts:
            result = result[:int(opts['limit'][0])]
          logging.info('result set len = %d' % len(result))
          if 'application/newlines' in environ.get('HTTP_ACCEPT', ''):
            value = '\n'.join(json.dumps(val) for val in result)
            return HttpResponse(httplib.OK, value,
                                content_type='application/text')
          else:
            return JsonResponse(json.dumps(result))
        else:
          return JsonResponse(self.collections.setdefault(col, {})[key])
      except KeyError:
        if not key: raise
        return HttpResponse(httplib.NOT_FOUND, '"record not found"',
                            content_type='application/json')

  def __process_handler(self, handler):
    path = self.request.environ['PATH_INFO']
    response = handler(path, self.request.environ)
    return response

  def __call__(self, environ, start_response):
    """Main WSGI application method"""

    self.request = HttpRequest(environ)
    method = '_handle_%s' % environ['REQUEST_METHOD']

    # See if we have a method called 'handle_METHOD', where
    # METHOD is the name of the HTTP method to call.  If we do,
    # then call it.
    if hasattr(self, method):
      handler = getattr(self, method)
      response = self.__process_handler(handler)
    else:
      response = HttpResponse(httplib.METHOD_NOT_ALLOWED,
                              'Method %s is not yet implemented.' % method)

    start_response(response.status, response.headers)
    return [response.content]

class NoLogging(wsgiref.simple_server.WSGIRequestHandler):
  def log_request(self, *args):
    pass

if __name__ == '__main__':
  socket.setdefaulttimeout(300)
  if '-v' in sys.argv:
    logging.basicConfig(level=logging.DEBUG)
    handler_class = wsgiref.simple_server.WSGIRequestHandler
  else:
    logging.basicConfig(level=logging.ERROR)
    handler_class = NoLogging
  logging.info('Serving on port %d.' % DEFAULT_PORT)
  app = WeaveApp()
  httpd = wsgiref.simple_server.make_server(BIND_IP, DEFAULT_PORT, app,
                                            handler_class=handler_class)
  httpd.serve_forever()

Here is the relevant fragment from my nginx configuration file:

# Mozilla Weave
location /0.5 {
  auth_basic            "Weave";
  auth_basic_user_file  /home/majid/web/conf/htpasswd.weave;
  proxy_pass            http://localhost:8000;
  proxy_set_header      Host $http_host;
}
location /1.0 {
  auth_basic            "Weave";
  auth_basic_user_file  /home/majid/web/conf/htpasswd.weave;
  proxy_pass            http://localhost:8000;
  proxy_set_header      Host $http_host;
}
location /1/ {
  auth_basic            "Weave";
  auth_basic_user_file  /home/majid/web/conf/htpasswd.weave;
  proxy_pass            http://localhost:8000;
  proxy_set_header      Host $http_host;
}

This code is hereby released into the public domain. You are welcome to use it as you wish. Just keep in mind that since it is reverse-engineered, it may well break with future releases of the Weave extension, or if Mozilla changes the server protocol.

Update (2009-10-03):

I implemented some minor changes for compatibility with Weave 0.7. The diff with the previous version is as follows:

--- weave_server.py~	Thu Sep  3 17:46:44 2009
+++ weave_server.py	Sat Oct  3 02:59:19 2009
@@ -65,8 +65,7 @@
     command, args = path.split('/', 4)[3:]
     return command, args

-  def opts_test(self, environ):
-    opts = urlparse.parse_qs(environ['QUERY_STRING'])
+  def opts_test(self, opts):
     if 'older' in opts:
       return float(opts['older'][0]).__ge__
     elif 'newer' in opts:
@@ -92,7 +91,7 @@
   def _handle_POST(self, path, environ):
     try:
       status = httplib.NOT_FOUND
-      if path.startswith('/0.5/') and path.endswith('/'):
+      if path.startswith('/0.5/'):
         command, args = self.parse_url(path)
         col = args.split('/')[0]
         vals = json.loads(self.request.contents)
@@ -113,7 +112,8 @@
       command, args = self.parse_url(path)
       col, key = args.split('/', 1)
       if not key:
-        test = self.opts_test(environ)
+        opts = urlparse.parse_qs(environ['QUERY_STRING'])
+        test = self.opts_test(opts)
         col = self.collections.setdefault(col, {})
         for key in col.keys():
           if test(json.loads(col[key]).get('modified', 0)):
@@ -142,10 +142,14 @@
       if args == 'collections':
         return JsonResponse(json.dumps(self.collections.get('timestamps', {})))
     if command == 'storage':
-      col, key = args.split('/')
+      if '/' in args:
+        col, key = args.split('/')
+      else:
+        col, key = args, None
       try:
         if not key: # list output requested
-          test = self.opts_test(environ)
+          opts = urlparse.parse_qs(environ['QUERY_STRING'])
+          test = self.opts_test(opts)
           result = []
           for val in self.collections.setdefault(col, {}).itervalues():
             val = json.loads(val)
@@ -155,6 +159,8 @@
                           key=lambda val: (val.get('sortindex'),
                                            val.get('modified')),
                           reverse=True)
+          if 'limit' in opts:
+            result = result[:int(opts['limit'][0])]
           logging.info('result set len = %d' % len(result))
           if 'application/newlines' in environ.get('HTTP_ACCEPT', ''):
             value = '\n'.join(json.dumps(val) for val in result)

Update (2009-11-17):

Weave 1.0b1 uses 1.0 as the protocol version string instead of 0.5 but is otherwise unchanged. I updated the script and nginx configuration accordingly.

Snow Leopard enhancement: Image Capture and scanners

The Image Capture utility in Snow Leopard works with my Epson 3170 flatbed scanner, and is far superior to the clunky and bloated Epson Scan software. What’s more, it can automatically deskew documents that are not perfectly level. Just a small touch, but a nice one. Thanks, Apple!

The new Preview PDF editing behavior is a major step back in usability, however.

Image Capture

Securing WordPress

WordPress has been getting a lot of bad press the last few days, as a worm is out in the wild exploiting a security vulnerability. This is leading to somewhat unfair comparisons with Windows, and thoughtful articles from John Gruber and Maciej Ceglowski.

To be sure, the ease of programming in PHP leads a great many people to contribute to projects, who may not have the experience or security awareness they should. This is not helped by poorly designed features in PHP that were enabled by default in previous versions, and cannot always be disabled outright due to legacy compatibility concerns, reminiscent of the persistent security woes due to the C standard library’s insecure old string processing facilities.

For many users, migrating away from WordPress may not be a practical option. My recommendations would be:

  • Reduce your exposure by exporting a static HTML version of your site, as suggested by Maciej. This is really only simple if you use a non-default permalink structure that does not use question mark characters in URLs, like that used by the SEO plugins. Otherwise you would need quite a bit of mod_rewrite jiggery-pokery to get it to work. In any case, this will also disable quite a bit of functionality on your site, such as comments.
  • If you are an Apache user, install modsecurity, a truly outstanding Apache module that acts as a firewall of sorts and will inspect requests for suspicious behavior like SQL injection attempts and malformed requests. Configuring modsecurity is not for the faint of heart, but there are some papers online like this one by Daniel Cuthbert (PDF) that walk you through this. This is probably the single most significant thing you can do to make your WordPress blog safer.
  • Practice security in depth — keep regular backups of both your wordpress directory and database, so you can recover in case of attack, and if possible run WordPress in an isolated account with minimal privileges.

At long last real broadband in San Francisco

I upgraded my broadband connection yesterday from a puny 3-6Mbps/384-768K DSL connection to 20Mbps symmetrical Metro Ethernet service from an outfit called WebPass. My current ISP, Raw Bandwidth, has excellent service with no restrictions on hosting servers or traffic shaping shenanigans unlike the likes of Comcast, but they are still hobbled by the AT&T last-mile connection.

WebPass finesses around the incumbent monopoly by using newer buildings’ data-grade wiring plant to bring 100MBps Ethernet connections right into your home (all they had to do was change a wall plate and patch some cables in the closet) and use microwave links to backhaul traffic to their data center. They claim to use a mesh network for backhaul, but I think this just means a standard network of microwave links where some sites have to hop multiple microwave links to get to the transit connection, rather than a purely centralized hub and spoke model. In my case their offices are a mere two blocks away. This would allow me the pleasure of ditching the scumbags at AT&T altogether (were it not for the fact my building requires an entirely unnecessary landline for its security system).

AT&T is probably the worst telco in the US now, and is notorious for starving its infrastructure of investment to maximize short-term profits, unlike Verizon, who is investing heavily in its FiOS optical network. Unfortunately San Francisco is in AT&T territory and will not get true optical networks anytime soon. Municipalities can usually reassign the cable franchise every so many years, but there is no such provision for involuntary transfer of telcos that I know of.

The new service is $45 a month with no installation fee, vs. $70 a month for Raw Bandwidth, but it does not include a static IP address (they do offer it as part of their prohibitively expensive metered business service). Configuring my home router (a Cisco 877) to use both connections was incredibly painful, but I will run the two ISPs side by side for the next few months. If WebPass proves as reliable as Raw bandiwdth, I may just have to find a work-around for the static IP issue, or just rely on DHCP lease pinning.

If you live in San Francisco, or are moving there, definitely have a look at the buildings they have covered. The service is a glimpse of what people not in broadband backwater USA get.

Diminishing returns

I have an eight-core Nehalem Mac Pro. Most of these cores sit idle most of the time due to poorly written software that is not optimized for the post-Moore multicore world.

I am beginning to wonder if Intel’s transistor budget wouldn’t be better allocated to more SRAM cache instead of more cores. One SRAM bit uses up 4 transistors, the Xeon 5500 have 751 million transistors, of which 8Mx8x4 or 256 million are for the 8MB L3 cache. If the chip were brought down from quad-core to dual-core, that would allow doubling the cache. Many programs could run entirely from cache, including interpreters.