To provide real-time context-based retrieval, we have developed a new algorithm, WordSieve, which generates vector representations of documents as the user accesses them, without requiring comprehensive statistics about word distributions in the accessed documents. Instead, working with a relatively small memory (at most 650 unique words in current tests), it identifies task-specific keywords from document access sequences. WordSieve exploits information the user provides implicitly by accessing similar documents together: it builds access profiles that identify terms occurring frequently in sequences of document accesses, terms expected to be useful for distinguishing sets of documents related to the same task. In this way, WordSieve exploits extra knowledge about the document access context to generate indices that reflect the task context.
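The idea of scoring terms by their recurrence across consecutive document accesses, under a bounded vocabulary, can be sketched as follows. This is an illustrative sketch only, not WordSieve itself; the function name, the sliding-window scoring, and the `window` parameter are our own assumptions, with only the 650-word memory bound taken from the text.

```python
from collections import Counter

def sequence_keywords(access_sequence, max_vocab=650, window=3):
    """Illustrative sketch (not the WordSieve algorithm): score terms by
    how often they recur across consecutive document accesses, keeping
    at most max_vocab candidate terms in memory."""
    scores = Counter()
    for start in range(len(access_sequence) - window + 1):
        window_docs = access_sequence[start:start + window]
        # Terms appearing in several consecutively accessed documents
        # are treated as candidate task keywords.
        for term in set().union(*(set(d) for d in window_docs)):
            hits = sum(term in d for d in window_docs)
            if hits > 1:
                scores[term] += hits
        # Bound memory: retain only the top-scoring terms.
        if len(scores) > max_vocab:
            scores = Counter(dict(scores.most_common(max_vocab)))
    return [t for t, _ in scores.most_common()]
```

A term like "retrieval" that appears in three documents accessed back to back would accumulate a high score, while a term seen in only one access would be dropped once the vocabulary bound is reached.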
In our experiments, WordSieve outperformed term frequency/inverse document frequency (TFIDF) [9], a popular indexing algorithm, at matching documents to hand-coded vector representations of the task contexts in which they were originally consulted; each task context representation is a term vector describing a specific search task given to the user. This paper presents WordSieve's architecture and performance.
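The TFIDF baseline and the vector-matching step can be sketched as below. The corpus, the task vector, and the use of cosine similarity for matching are our own illustrative assumptions; the TFIDF weighting itself is the standard tf(t, d) * log(N / df(t)) formulation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Standard TFIDF weighting: tf(t, d) * log(N / df(t)), where df(t)
    is the number of documents containing term t."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Matching then amounts to computing the similarity between each document vector and the hand-coded task vector and selecting the best-scoring task.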