To evaluate the performance of WordSieve, we performed an experiment to test its ability to match a document to a hand-coded vector representation of the web search task during which it was consulted. In particular, we wanted to see how closely WordSieve could correlate a document to the original search task given to the user. To test this, seven users (computer science graduate students) were asked to browse the Internet for twenty minutes each while being monitored by Calvin. For the first ten minutes, they were asked to find documents on the WWW that were about ``The use of genetic algorithms in artificial life.'' For the second ten minutes, they were asked to search for information about ``Tropical butterflies in Southeast Asia.'' The users were given no restrictions on how to find the pages. They were only instructed that the documents must be loaded into the web browser provided to them.
User access profiles were developed from this data by passing each set of data through WordSieve three times (in the order originally browsed by the user) to simulate one hour of browsing.
Two term vectors were generated for each document, one using WordSieve, and one using TFIDF. To provide a search task characterization, vectors were created to represent the task description given to the users. The WordSieve and TFIDF vectors from each document were compared to their associated task description.
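The comparison step can be illustrated with a minimal sketch. It assumes that similarity is measured as the cosine between sparse term vectors, which the text above does not state explicitly, and it uses a textbook TFIDF weighting (term frequency times log inverse document frequency). The function names, the toy corpus, and the hand-coded task vector are hypothetical and serve only as illustration; the WordSieve side of the comparison is not reproduced here.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """Return a sparse TFIDF vector (dict term -> weight) for one document.

    Assumes a textbook weighting: (count / doc length) * log(N / df).
    Terms that appear in no corpus document are skipped.
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    vector = {}
    for term, count in counts.items():
        df = sum(1 for other in corpus if term in other)
        if df == 0:
            continue
        vector[term] = (count / len(doc_tokens)) * math.log(n_docs / df)
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy data: three tokenized documents and a hand-coded task vector.
corpus = [
    ["genetic", "algorithms", "evolve", "artificial", "life"],
    ["tropical", "butterflies", "asia"],
    ["genetic", "life", "news"],
]
task_vector = {"genetic": 1.0, "algorithms": 1.0, "artificial": 1.0, "life": 1.0}

similarities = [cosine(tfidf_vector(doc, corpus), task_vector) for doc in corpus]
```

In this toy setting, the first document shares several terms with the task vector and scores highest, while the second shares none and scores zero.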
In our experiments, WordSieve generated indices (i.e., term vectors) for documents that were reliably more strongly correlated with the original task description than those produced by TFIDF (F(1,82)=91.1, repeated-measures ANOVA). This generally held across various subsets of the data. The average TFIDF similarity was 0.145 and the average WordSieve similarity was 0.224, a relative increase of about 54%. This suggests that WordSieve is better than TFIDF at generating profiles that reflect a user's task.
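For readers unfamiliar with the test statistic, the following sketch shows how an F value of this kind can be computed when each document contributes one similarity score per method. It relies on the standard fact that, for two within-subject conditions, the repeated-measures F(1, n-1) equals the square of the paired t statistic. The function name and the scores below are hypothetical illustration values, not data from this experiment, and the exact ANOVA design used above is not specified in the text.

```python
import math

def paired_f_statistic(scores_a, scores_b):
    """Return (F, (df1, df2)) for a two-condition repeated-measures comparison.

    With only two conditions, F(1, n-1) is the squared paired t statistic
    computed on the per-item differences.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0.0:
        raise ValueError("differences have zero variance")
    t = mean_diff / math.sqrt(var_diff / n)
    return t * t, (1, n - 1)

# Hypothetical per-document similarities for two methods (illustration only).
method_a = [0.30, 0.20, 0.25, 0.22]
method_b = [0.15, 0.12, 0.20, 0.10]
f_value, dof = paired_f_statistic(method_a, method_b)
```

Under this reading, the reported degrees of freedom (1, 82) would correspond to 83 paired observations, although that is an inference from the statistic rather than something the text states.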
In the experiments presented, the tasks were quite distinct. We have not yet conducted experiments on browsing in which the tasks overlap. However, if the tasks overlapped, both TFIDF and WordSieve would treat the overlapping keywords as non-discriminators and assign them relatively low values in the generated term vectors, while the non-overlapping keywords would receive relatively high values. Because these terms would have similar effects on both algorithms, we expect that the performance of both would be affected equally and that the conclusions of the comparison would be similar.