Friday, January 21, 2011

Relevancy Mechanics

If you're a software developer and you're interested in developing search applications, there are tons of resources on the web and in open source.  Lucene/Solr is a great example of industrial-strength software with a vibrant community of developers and legions of implementers building quality applications on top of it.

I love Lucene because it pretty much solves the deep plumbing of search - creating dictionaries and postings lists and providing support for accessing those data structures at search time.  And it does it really fast.  (A great general resource on the low-level plumbing of search is Managing Gigabytes, though the title now sounds a little quaint.)  Using software like Lucene allows you to leave the critical bit-shifting, skip-listing, byte-squishing inner loops to experts like Michael McCandless who work on Lucene full time - and you get to build cool stuff on top of that.
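
For anyone who hasn't run into the terms before, here is a toy sketch in Python of what a dictionary and its postings lists look like.  This is only an illustration and not how Lucene implements them: there is no compression, no skip lists, and no on-disk format.  The names build_index and search_and are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term (the dictionary) to a postings list of document ids."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # A set ensures each document id is added at most once per term.
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    # Document ids are appended in increasing order, so each list is sorted.
    return dict(postings)


def search_and(index, query):
    """Return the ids of documents that contain every term in the query."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)
    return sorted(result)


# Hypothetical three-document collection, just for the sketch.
docs = [
    "Lucene builds postings lists",
    "Solr runs on Lucene",
    "search is fun",
]
index = build_index(docs)
```

Looking up index["lucene"] gives the ids of the documents that mention the term, and search_and(index, "lucene solr") intersects two postings lists.  That intersection is the kind of inner loop a real engine spends its effort making fast.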

As vibrant and prolific as the Lucene and Solr community is, there is surprisingly little out there on the nuts and bolts of relevancy in a commercial setting.  There is lots on the plumbing - piping data into the appropriate place, scaling, replication, performance issues - but much less on field weighting, global IDF, document length normalization, intelligent stop words, objective measures of relevancy, static rank, and putting the Porter Stemmer out of our misery!  These are the things I plan to write about and hopefully discuss with others here.