Relevancy Mechanics
Software techniques and musings on how to achieve the most relevant search results.
Wednesday, February 15, 2012
Static Rank Framework for Solr / Lucene
I gave a talk at code4lib here in Seattle last week about static rank. You can listen to it here: code4lib 2012 video. My talk is at about 29:30. Despite my um-fest I think the basic idea comes across. The slides themselves are here: code4lib 2012 ppt. Comments welcome.
Friday, January 21, 2011
Relevancy Mechanics
If you're a software developer and you're interested in developing search applications, there are tons of resources on the web and in open source. Lucene/Solr is a great example of industrial-strength software with a vibrant community of developers and legions of implementers building quality applications on top of it.
I love Lucene because it pretty much solves the deep plumbing of search - creating dictionaries and postings lists and providing support for accessing those data structures at search time. And it does it really fast. (A great general resource on the low-level plumbing of search is: Managing Gigabytes, though the title now sounds a little quaint.) Using software like Lucene allows you to leave the critical bit-shifting, skip-listing, byte-squishing inner loops to experts like Michael McCandless who work on Lucene full time - and you get to build cool stuff on top of that.
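For readers new to the terminology, here is a toy sketch of what "dictionaries and postings lists" means. It is not how Lucene implements them (Lucene's on-disk structures are compressed and far more sophisticated); the function names `build_index` and `search_and` are made up for illustration, and tokenization is assumed to be plain whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: lowercase + whitespace split stands in for real analysis.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The keys form the dictionary; each value is that term's postings list.
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        # Intersect postings lists; a real engine walks them with skip lists.
        result &= set(index.get(term, []))
    return sorted(result)


docs = [
    "static rank for solr",
    "lucene postings lists",
    "static rank and lucene",
]
index = build_index(docs)
print(search_and(index, "static rank"))
```

Everything this sketch does in a few lines of Python is the part Lucene handles at scale, which is exactly why it is nice to leave it to the experts.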
As vibrant and prolific as the Lucene and Solr community is, there is surprisingly little out there on the nuts and bolts of relevancy in a commercial setting. There is plenty on the plumbing - piping data into the appropriate place, scaling, replication, performance issues - but much less on field weighting, global IDF, document length normalization, intelligent stop words, objective measures of relevancy, static rank, putting the Porter Stemmer out of our misery! These are the things I plan to write about, and hopefully discuss with others, here.