Logging my life
Inspired by MyLifeBits
Author: Anand Kishore (anand@semanticvoid.com)
Screenshots:
- Main interface of the LifeLogger UI.
- Search results from the user's data. The results are shown in
different colors, with each color representing one of the disparate
data sources.
- Various graphs help the user visualize his browsing, reading, and
search patterns.
- Browse results for a given date range.
- Interesting picks of the day.
Gordon Bell has been recording every bit of his life for the past
seven years. His custom-designed software, "MyLifeBits", saves
everything it can: every email he sends and receives, every document
he types, every chat session he engages in, and every web page he
visits. The advantage of such software is obvious: total recall. It
gives one the ability to search one's life for any reference to a
person or thing. Inspired by it, I have decided to start logging my
life as well. For now it is restricted to my online life, as I do not
have resources like the SenseCam.
The data collected in this process could be used in numerous ways:
total recall, recommendations, predictions, and so on. As Peter Norvig
says, "It's about the data and not the algorithm". In fact, I have
been doing this for the past two years or so, although I didn't
realize its advantages back then (all thanks to Google's copious
amounts of storage). The following are the different aspects of my
online life that I have been logging:
- Email: I primarily use Gmail for all my email correspondence. In
fact, I have set up filters to forward emails from all my other
mailboxes to the primary Gmail account, where they are archived.
- Chat Sessions (IM): Here again I rely on another Google service,
GTalk. Most of my IM conversations are on GTalk with the record
feature turned on, which archives all my chat sessions in my Gmail
Chats folder. Meebo also lets users save their chat sessions, but it
does not provide any API to retrieve them.
- Search: I opted in to Google Search History two years ago. My
search history has more than 4,500 searches logged as of today. This
forms my database of intentions.
- Browsing History: I have been keeping track of all the sites I
visit. There are two ways to log such data:
  - Slogger: This Firefox extension saves every page you visit into a
  designated folder, organized by day, and is highly configurable. It
  supports several log formats, including XML, text, and HTML. Its
  ability to save both text and HTML versions of a page locally makes
  recall very fast. The only problem one has to deal with is storage
  (once you reach the point where you have terabytes of browsing
  history).
  - Google Web History: This feature launched just a few days ago.
  Coupled with the Google Toolbar, Google logs every page you visit
  (if you have the PageRank feature turned on in the toolbar). I have
  just begun exploring this feature, but it certainly relieves me of
  the burden of storage. (Note: Don't try this if you are paranoid
  about your privacy.)
- Online Reading: Among all the feed
readers, Google Reader provides an interesting view of the trends in
your reading history. This data recorded by the reader is not trapped
inside Google but is very much accessible. Thus all my feed reading
history and patterns are logged without much hassle.
- Bookmarks (things I find interesting): I usually bookmark pages
that I find interesting or expect to refer to in the future. I use
del.icio.us and Simpy for this. Both services provide easy APIs and
feeds for retrieving the data from their servers. This forms my
database of interest.
# How do I aggregate all this data?
Although most of the logged data resides on remote servers (privacy
not being an issue for me), it can all be aggregated into a unified
database. The data can be retrieved from each source as follows:
- Email: To retrieve all the emails from Gmail, one can use g4j (a
Java library) or Gmail-APIc (.NET). APIs are available for other
languages as well.
- Chat Sessions (IM): Chat sessions can be retrieved from the Chats
folder in Gmail using the APIs mentioned above.
- Search: All logged searches can be retrieved from the following
URL. Items in the resulting feed can be identified by their category:
'web result' for clickthroughs and 'web query' for search queries.
  - RSS: https://www.google.com/searchhistory/?output=rss&num=some large number
- Browsing History: If you use Slogger for logging your browsing
history, you can access all your logs in the configured folder. If
you have dared to let Google save your browsing history, it can be
accessed using the following URL. Items in the resulting feed can be
identified by the category 'browser result'.
  - RSS: https://www.google.com/searchhistory/?output=rss&num=some large number
- Online Reading: Every user on Google Reader has a unique id, which
is visible in every Google Reader URL, e.g.
http://www.google.com/reader/view/user/unique id/state/com.google/reading-list.
Reading history can be retrieved from Google Reader using the
following URL:
  - Atom: http://www.google.com/reader/atom/user/unique id/state/com.google/read?n=some large number
- Bookmarks (things I find interesting):
Both del.icio.us and Simpy provide numerous ways to access the data off
their servers.
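
As a rough illustration of the search and browsing-history items above, the sketch below splits an RSS feed into queries, clickthroughs, and browsed pages using the category values mentioned ('web query', 'web result', 'browser result'). The sample feed is made up, and its element layout is an assumption (plain RSS 2.0 items with one category each); only the category values come from the description above. The function name `group_items_by_category` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample feed for illustration; the real feed layout may differ.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>History</title>
    <item>
      <title>time decay ranking</title>
      <link>http://example.com/search?q=time+decay+ranking</link>
      <category>web query</category>
    </item>
    <item>
      <title>Example result page</title>
      <link>http://example.com/result</link>
      <category>web result</category>
    </item>
    <item>
      <title>Example visited page</title>
      <link>http://example.com/visited</link>
      <category>browser result</category>
    </item>
  </channel>
</rss>
"""


def group_items_by_category(rss_text):
    """Return {category: [(title, link), ...]} for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    groups = defaultdict(list)
    for item in root.iter("item"):
        # Items without a <category> element are filed under 'unknown'.
        category = item.findtext("category", default="unknown").strip()
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        groups[category].append((title, link))
    return dict(groups)


if __name__ == "__main__":
    for category, entries in group_items_by_category(SAMPLE_RSS).items():
        print(category, len(entries))
```

The same grouping step would apply to a feed saved to disk: read the file into a string and pass it to the function.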
# Code for aggregating the data & the unified database schema
The code for importing such data can be found at the code repository.
The code is GPL licensed, so feel free to modify and redistribute it.
Alternatively, you can browse the source code online. Currently, the
code supports importing data from Google Web History, Google Reader,
and del.icio.us XML files. Support for Gmail and GTalk is in progress.
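
The actual schema lives in the repository and is not reproduced on this page. Purely as an illustration of what a unified store might look like, the sketch below puts records from every source into a single SQLite table and browses them by date range. The table name, column names, source labels, and helper functions are all hypothetical, not the project's real schema.

```python
import sqlite3

# Hypothetical schema: one row per logged item, whatever its source.
SCHEMA = """
CREATE TABLE IF NOT EXISTS life_log (
    id        INTEGER PRIMARY KEY,
    source    TEXT NOT NULL,   -- e.g. 'web_history', 'reader', 'delicious'
    title     TEXT,
    url       TEXT,
    body      TEXT,
    logged_at TEXT NOT NULL    -- ISO 8601 timestamp, sorts correctly as text
);
CREATE INDEX IF NOT EXISTS idx_life_log_time ON life_log (logged_at);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_record(conn, source, title, url, body, logged_at):
    """Insert one logged item from any data source."""
    conn.execute(
        "INSERT INTO life_log (source, title, url, body, logged_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (source, title, url, body, logged_at),
    )
    conn.commit()


def browse(conn, start, end):
    """Return (source, title, logged_at) rows in [start, end], oldest first."""
    cur = conn.execute(
        "SELECT source, title, logged_at FROM life_log "
        "WHERE logged_at BETWEEN ? AND ? ORDER BY logged_at",
        (start, end),
    )
    return cur.fetchall()
```

Keeping a `source` column on every row is one simple way to let a UI color results by data source; other layouts (one table per source, for example) would work as well.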
# Algorithms for analyzing the data
Time Damping Of Textual Relevance [PDF]
Abstract: When a user performs a search in the Life Logger
application, he is interested in results that are both relevant and
recent. If the results are ordered by textual relevance alone,
relevant but older records may dominate the top results. If they are
ordered by time alone, recent but less relevant records may dominate
instead. This algorithm provides a way to rank results by both
textual relevance and time.
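
The paper's exact formula is not reproduced on this page, so the following is only a sketch of the general idea, assuming a simple exponential decay: each result's textual relevance is multiplied by a factor that halves every `half_life_days`. The function names, the half-life parameter, and the sample scores are all assumptions made for illustration.

```python
def damped_score(relevance, age_days, half_life_days=30.0):
    """Scale a textual relevance score by an exponential time decay.

    Assumed form: score = relevance * 0.5 ** (age / half_life).
    """
    decay = 0.5 ** (age_days / half_life_days)
    return relevance * decay


def rank(results, half_life_days=30.0):
    """Sort (label, relevance, age_days) tuples by damped score, best first."""
    return sorted(
        results,
        key=lambda r: damped_score(r[1], r[2], half_life_days),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up sample results: (label, textual relevance, age in days).
    sample = [
        ("old but relevant", 0.9, 365.0),
        ("recent, fairly relevant", 0.6, 2.0),
        ("recent, weak", 0.1, 1.0),
    ]
    for label, relevance, age in rank(sample):
        print(label, round(damped_score(relevance, age), 4))
```

A shorter half-life favors recency more strongly, while a very long one approaches plain relevance ordering, so the parameter is the knob that trades one off against the other.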