Intelligent information retrieval agents are an emerging class of software designed to aid computer users in various aspects of their work. By observing the user's interaction with the computer to determine contextual cues, these agents can suggest information that is likely to be useful in the current context. Much research has studied the formalization of context (e.g., [2,5,4]), and rich context representations have been proposed (e.g., [18,19]). However, the circumstances in which information retrieval agents function strongly constrain the context extraction methods and representations that are practical for them to use. To provide robust support, information retrieval agents must automatically generate context descriptions in a wide, unpredictable range of subject areas; to provide information when it is most useful, they must make context-related decisions in real time, as rapidly as possible.
Because of their need to handle unpredictable subject areas, intelligent information retrieval agents generally forgo carefully crafted, pre-specified context representation schemes in favor of representations gathered implicitly by observing the user working in some task context. In systems that seek to aid a user by retrieving documents relevant to the user's current context, a common approach is to analyze the content of the document currently being accessed and retrieve similar documents, under the assumption that documents with similar content will be useful in the current context. These systems often index documents based on Term Frequency/Inverse Document Frequency (TFIDF) [17], a term-based approach which focuses on the relationship of the document to the corpus. Because it considers only term statistics, TFIDF does not take into account another contextual cue: the order in which the documents were accessed. Our goal is to develop a practical method for using information about document access patterns to help improve on standard techniques for selecting context-relevant documents.
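To make this baseline concrete, the following is a minimal sketch of content-based retrieval with TFIDF weights and cosine similarity. It is an illustration only, not the indexing code of any particular system: the function names, the whitespace tokenizer, and the specific weighting variant (term frequency normalized by document length, multiplied by log(N/df)) are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / document length) * log(N / df).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, docs):
    """Index of the document most similar to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Hypothetical three-document corpus, for illustration only.
docs = [
    "genetic algorithms evolve candidate solutions",
    "genetic algorithms use mutation and crossover",
    "travel guide to hotels in florida",
]
best = most_similar(0, docs)
```

Note that nothing in this computation depends on the sequence in which the documents were accessed: reordering `docs` yields the same weight vector for each document. This is the limitation that motivates using access patterns as an additional cue.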
We are developing a context extraction algorithm, called WordSieve, to take advantage of implicit contextual information provided by considering the user's document access patterns over time. By considering information developed from the order in which the documents are accessed, the algorithm is able to make suggestions that reflect not only the content of documents, but also when documents with that content tend to be consulted. WordSieve has been implemented in an intelligent information retrieval agent, CALVIN, which observes users as they browse the World Wide Web, indexes the documents they access in a given context, and suggests those documents in similar contexts in the future [13]. The success of this approach can be measured by whether it selects documents that are appropriate to the user and task context, even when no explicit description of the task is available to the system. In initial experiments, with CALVIN monitoring users who were given two browsing tasks, WordSieve outperformed TFIDF at matching documents to the tasks in which they were accessed.
This paper is organized as follows. We first discuss issues involved in selecting a context model for intelligent agents. Next we describe the WordSieve algorithm and its performance in our initial experiments. Finally, we discuss some issues raised by this work and its relationship to previous research.