Data mining Outlook for fun and profit
For a few years now, I have owned the domain name majid.fm. Dot-fm stands for the Federated States of Micronesia, a micro-state in the Pacific Ocean, and they market their domain names to FM radio stations. Those are also my initials. Unfortunately, the registration fees are quite expensive ($200 every two years), and the domain is redundant now that I have acquired majid.info and majid.org (majid.com is reserved by a Malaysian cybersquatter who is demanding a couple thousand dollars for it – I may be vain, but not that vain). I have decided to let the domain lapse when it expires on April 1st.
I used the majid-dot-FM domain for my emails, and set it up so emails sent to anything @majid.fm would be sent to my primary mailbox fazal@majid.fm. For instance, if I registered with Dell, I would give them the email address dell@majid.fm. This was helpful in tracing where I got my email from, and blacklisting companies that started spamming me (they shall remain nameless to protect the guilty yet litigious).
Unfortunately, spammers and some worms attempt dictionary attacks by trying all possible combinations like jim@majid.fm, smith@majid.fm, and so on. My spam filter would catch some, but not all of them, and it would be a terrible hassle. I do not want to have an auto-responder send emails back to people who email me at the old address, as this would at best flood innocent people whose addresses spammers are impersonating, and at worst actually give my new address to the spammers.
My solution to this dilemma is a Python script that scans through all the emails in my Outlook personal folder (PST) files of archived emails and flags everyone who sent me an email at the old domain. I can then manually send them a change of address notification (or, in the case of websites and online stores, update my contact info online).
Simply using Outlook’s advanced search function will not work, as in many cases the To: header is set to something other than the address the email is delivered to, such as undisclosed-recipients, or the sender’s address when they send the email to multiple Bcc: recipients (the proper way to proceed when you want to send an email to multiple recipients without giving everyone in the list the email addresses of the other recipients). I actually have to sift through the raw message headers to see the envelope destination address.
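As a rough illustration of the problem (this is not part of olmine.py), here is a minimal sketch using Python's standard email module. The sample headers, host names and addresses are made up, and it assumes the receiving mail server records the envelope recipient in the "for" clause of a Received: header, which many servers do but not all.

```python
import email

# Hypothetical raw headers for a message sent to Bcc: recipients.
# The To: header says nothing about majid.fm; only the Received:
# header's "for" clause shows the address the mail was delivered to.
RAW_HEADERS = """\
Received: from mail.example.com (mail.example.com [192.0.2.1])
\tby mx.example.net with ESMTP id ABC123
\tfor <dell@majid.fm>; Mon, 1 Mar 2004 10:00:00 -0800
In-Reply-To: <20040228.1234@majid.fm>
From: Newsletter <news@example.com>
To: undisclosed-recipients:;
Subject: Monthly update

"""

msg = email.message_from_string(RAW_HEADERS)
to_header = msg['To']                # what a search on To: would see
received = msg.get_all('Received')   # the raw trace headers

found_in_to = 'majid.fm' in to_header
found_in_received = 'dell@majid.fm' in received[0]
```

Note that the In-Reply-To: header in this sample also contains something that looks like an @majid.fm address but is really a message ID, which is why headers of that kind have to be discarded before scanning for addresses.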
Here is a simplified version of olmine.py, the script I used. It requires Python 2.x with the win32all extensions, and Outlook 2000 with the Collaboration Data Objects (CDO) option installed (this is not the default). CDO is required to access the full headers. Of course, this script can be useful for all sorts of social network analysis fun on your own Outlook files, or more prosaically to generate a whitelist of email addresses for your spam filter.
import re, win32com.client

srcs = {}
dsts = {}
pairs = {}

# regular expression that scans for valid email addresses in the headers
m_re = re.compile(r'[-A-Za-z0-9.,_]*@majid\.fm')
# regular expression that strips out headers that can cause false positives
strip_re = re.compile(r'(Message-Id:.*$|In-Reply-To:.*$|References:.*$)',
                      re.IGNORECASE | re.MULTILINE)

def dump_folder(folder):
    """Iterate recursively over the given folder and its subfolders"""
    print '-' * 72
    print folder.Name
    print '-' * 72
    for i in range(1, folder.Messages.Count + 1):
        try:
            # PR_SENDER_EMAIL_ADDRESS
            _from = folder.Messages[i].Fields[0x0C1F001F].Value
            # PR_TRANSPORT_MESSAGE_HEADERS
            headers = folder.Messages[i].Fields[0x7d001e].Value
        except:
            # ignore non-email objects like contacts or calendar entries
            continue
        stripped_headers = strip_re.sub('', headers)
        for _to in m_re.findall(stripped_headers):
            srcs[_from] = srcs.get(_from, 0) + 1
            dsts[_to] = dsts.get(_to, 0) + 1
            if (_from, _to) not in pairs:
                print _from, '->', _to
            pairs[_from, _to] = pairs.get((_from, _to), 0) + 1
    # recurse
    for i in range(1, folder.Folders.Count + 1):
        dump_folder(folder.Folders[i])

# connect to Outlook via CDO
cdo = win32com.client.Dispatch('MAPI.Session')
cdo.Logon()

# iterate over all the open PST files
for i in range(1, cdo.InfoStores.Count + 1):
    store = cdo.InfoStores[i]
    root = store.RootFolder
    m = root.Messages
    store.ID
    print '#' * 72
    print store.Name
    print '#' * 72
    dump_folder(root)

cdo.Logoff()
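
For the more prosaic whitelist use, the sender counts collected in srcs could be post-processed along the following lines. This is only a sketch: build_whitelist and its min_count parameter are hypothetical names that are not in olmine.py, the sample data is invented, and I am assuming that senders on an Exchange server may show up as directory-style names without an @ sign, which are of no use to a spam filter.

```python
def build_whitelist(srcs, min_count=1):
    """Return a sorted list of sender addresses seen at least min_count times.

    srcs maps sender address -> number of messages, like the srcs
    dictionary filled in by the script. Addresses are lower-cased so
    that differences in capitalization do not create duplicates.
    """
    totals = {}
    for addr, count in srcs.items():
        key = addr.strip().lower()
        totals[key] = totals.get(key, 0) + count
    # keep only things that look like Internet email addresses
    return sorted(a for a, n in totals.items()
                  if n >= min_count and '@' in a)

# Invented sample data standing in for the real srcs dictionary
sample_srcs = {
    'Orders@Example.com': 3,
    'orders@example.com': 1,
    'news@example.org': 1,
    '/O=EXCHANGE/CN=JDOE': 2,
}
whitelist = build_whitelist(sample_srcs, min_count=2)
```

Raising min_count is a crude way to leave out one-off senders; whether that is desirable depends on how strict the spam filter is meant to be.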