Collection Construction Methodologies for Learning-to-Rank

Award:   NSF IIS-1017903
PI:   Javed A. Aslam
Institution:   Northeastern University


Modern search engines, especially those designed for the World Wide Web, commonly analyze and combine hundreds of features extracted from the submitted query and the underlying documents (e.g., web pages) to assess the relevance of each document to the query and thus rank the underlying collection. The sheer size of this problem has led to the development of learning-to-rank algorithms that automate the construction of such ranking functions: given a training set of (feature vector, relevance) pairs, a machine learning procedure learns how to combine the query and document features so as to assess the relevance of any document to any query and thus rank a collection in response to user input. Much research has been devoted to feature extraction and to the development of sophisticated learning-to-rank algorithms. However, relatively little research has been conducted on the choice of documents and queries for learning-to-rank data sets, or on the effect of these choices on the ability of a learning-to-rank algorithm to learn effectively and efficiently.
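As an illustration of the training setup described above, the following is a minimal sketch of a pointwise learner: it fits a linear scoring function to (feature vector, relevance) pairs and then ranks unseen documents by predicted score. The function names, the toy features, and the choice of least-squares gradient descent are all assumptions made for this example; they are not the algorithms studied in the project, and real systems use hundreds of features and more sophisticated learners.

```python
def train_pointwise(pairs, lr=0.1, epochs=5000):
    """Fit score(x) = w . x + b to (feature vector, relevance) pairs
    by full-batch gradient descent on squared error (illustrative only)."""
    dim = len(pairs[0][0])
    w = [0.0] * dim
    b = 0.0
    n = len(pairs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in pairs:
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank(docs, w, b):
    """Return document ids sorted by predicted relevance, highest first."""
    scores = {
        doc_id: sum(wi * xi for wi, xi in zip(w, x)) + b
        for doc_id, x in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy training set: two features per query-document pair,
# where only the first feature actually predicts relevance.
TRAIN = [
    ((1.0, 0.0), 1.0),
    ((0.0, 1.0), 0.0),
    ((1.0, 1.0), 1.0),
    ((0.0, 0.0), 0.0),
]

w, b = train_pointwise(TRAIN)

# Hypothetical unseen documents for a new query.
CANDIDATES = {"d1": (0.2, 0.9), "d2": (0.9, 0.1), "d3": (0.5, 0.5)}
ranking = rank(CANDIDATES, w, b)
print(ranking)
```

The learned weights put nearly all of the mass on the first feature, so the candidates are ordered by that feature alone.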

The proposed work investigates the effect of query, document, and feature selection on the ability of learning-to-rank algorithms to learn ranking functions efficiently and effectively. In preliminary work on document selection, a pilot study has already determined that training sets as small as 2 to 5% of the size of those typically used are just as effective for learning-to-rank purposes. Thus, one can train more efficiently over a much smaller (though effectively equivalent) data set or, at equal cost, train over an effectively far larger and more representative data set. In addition to formally characterizing this phenomenon for document selection, the proposed work investigates this phenomenon for query and feature selection as well, with the end goals of (1) understanding the effect of document, query, and feature selection on learning-to-rank algorithms and (2) developing collection construction methodologies that are efficient and effective for learning-to-rank purposes.
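To make the idea of a reduced training set concrete, the sketch below keeps a small fraction of the judged documents for each query. The text above does not say how the pilot study selected its documents, so the uniform random sampling used here is purely an assumption, as are the function name, the data layout, and the 5% default.

```python
import random


def subsample_per_query(judged, fraction=0.05, seed=0):
    """Keep roughly `fraction` of the judged documents for each query.

    `judged` maps a query id to a list of (doc_id, relevance) tuples.
    Uniform sampling is a placeholder, not the project's selection method.
    """
    rng = random.Random(seed)
    sample = {}
    for query, docs in judged.items():
        k = max(1, round(fraction * len(docs)))  # keep at least one document
        sample[query] = rng.sample(docs, k)
    return sample


# Hypothetical judged pool: 100 documents for q1 and 40 for q2.
judged = {
    "q1": [(f"q1-doc{i}", i % 3) for i in range(100)],
    "q2": [(f"q2-doc{i}", i % 2) for i in range(40)],
}

small = subsample_per_query(judged, fraction=0.05)
for query, docs in small.items():
    print(query, len(docs), "of", len(judged[query]))
```

A learner would then be trained on `small` rather than `judged`; whether the reduced set is as effective depends on the selection strategy, which is the question the project investigates.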


Former Personnel


Acknowledgment and Disclaimer

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1017903. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).