Both the end user GUI and the Task Generator can automatically collect data streams, which in the context of information retrieval are sequences of document accesses. The system stores the time stamp and full document text of each access, providing the information needed for programs to ``replay'' the web search to the data analysis component. In Java, this replay is implemented as an Iterator. After specifying the path to replay, stepping through a past instance of user behavior is as simple as writing a for loop.
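The sketch below illustrates this pattern under stated assumptions: the names Replay, DocumentAccess, and the example session path are illustrative stand-ins, not Calvin's actual classes.

\begin{verbatim}
import java.util.Iterator;
import java.util.NoSuchElementException;

// One recorded document access: a time stamp plus the full document text.
class DocumentAccess {
    final long timestamp;
    final String text;
    DocumentAccess(long timestamp, String text) {
        this.timestamp = timestamp;
        this.text = text;
    }
}

// Hypothetical replay class: iterates over a stored session in order.
class Replay implements Iterator<DocumentAccess> {
    private final DocumentAccess[] session;  // accesses loaded from the stored path
    private int next = 0;

    Replay(String path) {
        // The real system would load the recorded session from the given path;
        // an empty array keeps this sketch self-contained.
        this.session = new DocumentAccess[0];
    }

    public boolean hasNext() { return next < session.length; }

    public DocumentAccess next() {
        if (!hasNext()) throw new NoSuchElementException();
        return session[next++];
    }

    public void remove() { throw new UnsupportedOperationException(); }
}

class ReplayExample {
    public static void main(String[] args) {
        // Replaying a past session is just a for loop over the Iterator.
        for (Replay replay = new Replay("sessions/example-browse-log");
             replay.hasNext(); ) {
            DocumentAccess access = replay.next();
            // hand each stored document to the data analysis component here
        }
    }
}
\end{verbatim}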
The ability to replay user behavior provides an important advantage for research in personal information agents: it allows the creation of a test bed of data accumulated from user interactions with the system, to use as a standard for comparing different information retrieval approaches. Information retrieval agents are sometimes tested on manually collected test sets of documents, which are used to simulate document access. The Reuters collection, a set of pre-categorized news articles, is an example of such a test set. One problem with such test sets is that the documents are already sifted and organized and thus do not represent the same kind of information stream that occurs in real user behavior. Passing a set of documents from the Reuters collection through an agent is very different from providing the actual text that gets passed to a web browser during information search. Consequently, results from testing an agent against the Reuters collection may not be indicative of expected performance during actual use. However, always relying on tests with real users is problematic as well: it is unrealistic to ask users to spend hours of web searching to evaluate small improvements to the algorithm. Even if subjects were available, variation in the documents accessed would reduce certainty that any change in performance was due to changes in the algorithm rather than variation in user behavior.
The ability to capture test beds of past document access solves this problem. The stored documents are what users really accessed (even the advertisements can be stored), and the same test bed can be sent to multiple versions of IR algorithms. This ensures that when new versions of an algorithm are tested, any change in performance is due to the algorithm itself and not merely to variation in user behavior.
For example, we are currently developing textual analysis techniques to automatically identify keywords that serve as topic identifiers for specific user interests. The textual analysis code has undergone multiple revisions as we explore various approaches. By wrapping our analysis code in a class descended from Calvin's data analysis abstract class, we do not have to change Calvin itself to explore different analysis methodologies. By replaying past data streams (in this case, sessions of web browsing), we can see how the analysis techniques are improving against a standard set of data. Because our current analysis techniques are stochastic, we also run the algorithm against the same data set multiple times to measure the variance in performance.
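Reusing the Replay sketch above, the following outline shows how such an evaluation might be organized; DataAnalysis, its analyze method, and KeywordExtractor are hypothetical stand-ins for Calvin's data analysis abstract class and our experimental subclass, not the actual implementation.

\begin{verbatim}
// Hypothetical abstract class standing in for Calvin's data analysis base class.
abstract class DataAnalysis {
    abstract void analyze(DocumentAccess access);
}

// An experimental keyword extractor: a new analysis technique can be tried
// simply by writing another subclass, without modifying Calvin itself.
class KeywordExtractor extends DataAnalysis {
    void analyze(DocumentAccess access) {
        // stochastic keyword identification over access.text would go here
    }
}

class EvaluationHarness {
    // Replay the same stored session through the analyzer several times to
    // measure the variance introduced by the stochastic analysis step.
    static void evaluate(String sessionPath, int runs) {
        for (int run = 0; run < runs; run++) {
            DataAnalysis analyzer = new KeywordExtractor();
            for (Replay replay = new Replay(sessionPath); replay.hasNext(); ) {
                analyzer.analyze(replay.next());
            }
            // record this run's performance score here
        }
    }
}
\end{verbatim}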