personalizing search


Although search engine algorithms are proprietary, it's reasonable to guess that they do a keyword analysis of all pages, perhaps using a thesaurus to find similar words, then relate one page to another based on keyword overlap and (presumably) the presence of direct links. Were they not doing something like that, they couldn't provide a "find similar pages" button. This is a step in the right direction, but there is more that could be done to improve searches.

Search engines do not exploit a lot of information about the user---namely, the knowledge of what pages the user has already found interesting enough to bookmark or save. We can try to exploit such contextual information to find more pages the user is likely to find interesting.

Currently, pages have a high profile only if they are very popular across a broad spectrum of the web's population. That's fine for the lowest common denominator, but it is hardly suited to varied tastes. There's no reason pages couldn't be ranked in popularity within small groups of users (say, all computer science professors, all race car drivers, or all manga otaku) instead of the single domain of all users.

Here are some first steps toward personalizing searching and mapping by estimating a particular user's interest in various pages. These stages are in order of increasing computational complexity, and should be done in that order. If the system has little time to estimate which pages to look at next, it should use only Stage I techniques; if it has a little more time, it should use Stages I and II; and so on.

In what follows, call the non-bookmarked pages that any bookmarked page points to, or is pointed to by, the nearest neighbors. Call the pages one link away from any page that page's neighbors. The object is to find and evaluate promising non-bookmarked pages. A page with a high score under the following measures is assumed to be interesting to the user since it is related in some way to some of the pages the user has already valued enough to bookmark.
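
As a concrete illustration, here is one minimal way these relations might be represented, assuming the system keeps a simple in-memory map from each page to the pages it links to; the names used (LINKS, BOOKMARKS, neighbors, nearest_neighbors) are only placeholders for this sketch and the ones that follow.

    # Hypothetical in-memory link data: page -> set of pages it links to.
    LINKS: dict[str, set[str]] = {
        "a.html": {"b.html", "c.html"},
        "b.html": {"a.html"},
        "c.html": {"d.html"},
        "d.html": set(),
    }
    BOOKMARKS: set[str] = {"a.html"}

    def neighbors(page: str) -> set[str]:
        """Pages one link away from `page`, in either direction."""
        outgoing = LINKS.get(page, set())
        incoming = {p for p, targets in LINKS.items() if page in targets}
        return outgoing | incoming

    def nearest_neighbors() -> set[str]:
        """Non-bookmarked pages that point to, or are pointed to by, a bookmark."""
        found: set[str] = set()
        for b in BOOKMARKS:
            found |= neighbors(b)
        return found - BOOKMARKS

    print(nearest_neighbors())  # with the toy data above: {'b.html', 'c.html'}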

Stage I

Every nearest neighbor gets ten points for each link it contains to a bookmarked page. Repeated links to the same bookmarked page from the same nearest neighbor count as one hit, but repeated hits to different bookmarked pages from the same nearest neighbor count as multiple hits. Any page that neighbors one of those nearest neighbors gets one point for each link to a distinct nearest neighbor. Any neighbors of the neighbors get a tenth of a point, and so on in decreasing factors of ten.
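
One plausible reading of this measure, sketched against the toy data above; the exact treatment of link direction and of pages that appear in more than one ring is left open by the description, so this is only one interpretation.

    # Spread points outward from the bookmarks in rings: 10 for nearest
    # neighbors, 1 for their neighbors, 0.1 for the next ring, and so on.
    # Each page is scored in the first (nearest) ring it appears in, once
    # per distinct page of the previous ring it is linked with.
    def stage_one_scores(max_depth: int = 4) -> dict[str, float]:
        scores: dict[str, float] = {}
        current_ring = set(BOOKMARKS)
        seen = set(BOOKMARKS)
        points = 10.0
        for _ in range(max_depth):
            next_ring: set[str] = set()
            for page in current_ring:
                next_ring |= neighbors(page) - seen
            for candidate in next_ring:
                hits = len(neighbors(candidate) & current_ring)
                scores[candidate] = scores.get(candidate, 0.0) + points * hits
            seen |= next_ring
            current_ring = next_ring
            points /= 10.0
        return scores

    # With the toy data, b.html and c.html each score 10.0 and d.html scores 1.0.
    print(stage_one_scores())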

More generally, the user could be polled by the system to estimate levels of interest within the bookmarked pages. Some pages, or page clusters, may be much more important to the user than others. Further, that evaluation should also be biased dynamically depending on where (that is, in which page clusters) the user spends the most time. This evaluation should then bias the attachment of weight to non-bookmarked pages.

This neighbor measure should split into a tuple whose second half is a measure based on blacklisted pages as signs of negative interest. Neighbors of blacklisted pages would go down in value just as neighbors of bookmarked pages go up in value.
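
Those two refinements (user-supplied interest weights and a blacklist) might be folded in like this; WEIGHTS and BLACKLIST are hypothetical inputs, polled from the user or inferred from where the user spends time.

    WEIGHTS: dict[str, float] = {"a.html": 1.0}   # per-bookmark interest levels
    BLACKLIST: set[str] = {"spam.html"}           # pages the user has rejected

    def signed_score(page: str) -> tuple[float, float]:
        """Return (interest, disinterest) accumulated from adjacent pages."""
        plus = sum(10.0 * WEIGHTS.get(b, 1.0) for b in neighbors(page) & BOOKMARKS)
        minus = sum(10.0 for _ in neighbors(page) & BLACKLIST)
        return plus, minus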

* Should the neighbor measure distinguish between incoming and outgoing links? That is, should a non-bookmarked page be evaluated higher if it pointed to a bookmarked page rather than the other way around?
* If two pages point to each other, should they get more points than otherwise?

Stage II

Estimates of semantic relatedness based solely on direct linkage can't be very accurate. To improve accuracy the system must pay attention to the structure of the pages themselves. For example, the system should collect all names referenced in the bookmarked pages (Microsoft, Sun Microsystems, Harry Turtledove, or whatever) and save those names, weighting them by the present weights of the pages containing them plus their frequency of occurrence. Then it can search through the non-bookmarked neighbors of the bookmarked pages for mentions of those names. As before, pages that mention those names get ten points, pages that neighbor pages mentioning those names get one point, and so on.
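
A rough sketch of the name measure, assuming a hypothetical page_text() helper that returns a page's plain text; the capitalized-phrase pattern is only a crude stand-in for real name extraction, and the one-point spread to further neighbors would follow the same ring logic as Stage I.

    import re
    from collections import Counter

    def page_text(url: str) -> str:
        return ""  # placeholder: a real system would fetch and strip the page

    NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

    def collect_names(page_weights: dict[str, float]) -> Counter:
        """Names in bookmarked pages, weighted by page weight and frequency."""
        names: Counter = Counter()
        for url, weight in page_weights.items():
            for name, count in Counter(NAME_RE.findall(page_text(url))).items():
                names[name] += weight * count
        return names

    def name_score(url: str, names: Counter) -> float:
        """Ten points per weighted mention of a collected name."""
        text = page_text(url)
        return sum(10.0 * w for name, w in names.items() if name in text)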

Stage III

The system could also analyze the bookmarked pages in more detail, creating histograms of word frequencies and paying particular attention to uncommon words (it uses a dictionary coupled with a thesaurus to determine word groupings). After tossing out extremely common words (and, but, so, you, and so on) it boils down those histograms into general classes of pages (again weighting those classes based on the average weight of the corresponding pages). It simplifies these classes further (thus discarding even more information) until it has a fairly small fingerprint function (experiment will have to determine how small this fingerprint needs to be for fast testing and reasonable discrimination). Then it searches the neighbors of the bookmarked pages for pages that fit the fingerprints of the most interesting (most highly weighted so far) bookmarked pages.
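
A minimal sketch of such a fingerprint: drop the stopwords, keep the most frequent remaining words as a small bag, and compare bags by cosine similarity. The dictionary/thesaurus word-grouping step is omitted here, and the stopword list is truncated for illustration.

    import math
    import re
    from collections import Counter

    STOPWORDS = {"and", "but", "so", "you", "the", "a", "of", "to", "in", "is"}
    WORD_RE = re.compile(r"[a-z]+")

    def fingerprint(text: str, size: int = 50) -> Counter:
        """A small histogram of the page's most distinctive words."""
        words = [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]
        return Counter(dict(Counter(words).most_common(size)))

    def similarity(fp_a: Counter, fp_b: Counter) -> float:
        """Cosine similarity: 0 means unrelated, 1 means identical profiles."""
        dot = sum(fp_a[w] * fp_b[w] for w in fp_a.keys() & fp_b.keys())
        norm_a = math.sqrt(sum(c * c for c in fp_a.values()))
        norm_b = math.sqrt(sum(c * c for c in fp_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0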

The sophistication of the fingerprinting can be arbitrarily extended depending on how much processing power the system has at its disposal. For example, many pages are mainly lists of links; such pages should show up in a sophisticated fingerprinter as link-heavy, while others might show up as image-heavy, prose-heavy, and so on. Pages can also be more or less current---currency is easy to determine from the page's date-last-modified field.
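
For instance, a slightly richer fingerprint might also record a page's texture and age; the helper names here are arbitrary, and the raw HTML and Last-Modified header value are assumed to come from the fetch layer.

    import re
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def texture(html: str) -> dict[str, float]:
        """How link-heavy or image-heavy a page is, relative to its word count."""
        words = max(len(re.findall(r"\w+", html)), 1)
        links = len(re.findall(r"<a\s", html, re.IGNORECASE))
        images = len(re.findall(r"<img\s", html, re.IGNORECASE))
        return {"link_density": links / words, "image_density": images / words}

    def staleness_days(last_modified: str) -> float:
        """Age in days, parsed from an HTTP Last-Modified header value."""
        then = parsedate_to_datetime(last_modified)
        return (datetime.now(timezone.utc) - then).total_seconds() / 86400.0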

Stage IV

The system should also allow query-by-example, where the user specifies some pages and anti-pages as a way of implicitly defining the current interest. Further, if the user chooses to divulge the information, the system could also know something about the user's predilections: society memberships, hobbies, research interests, and so on. Finally, the system should use its previously created fingerprints of bookmarked pages to derive information about the user's interests.
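
Query-by-example might then reduce to comparing a candidate page against the example and anti-example fingerprints, reusing fingerprint() and similarity() from the Stage III sketch.

    def interest_score(candidate_text: str,
                       examples: list[str],
                       anti_examples: list[str]) -> float:
        """Similarity to the example pages minus similarity to the anti-pages."""
        fp = fingerprint(candidate_text)
        plus = sum(similarity(fp, fingerprint(t)) for t in examples)
        minus = sum(similarity(fp, fingerprint(t)) for t in anti_examples)
        return plus - minus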

Whenever the user stops surfing and frees up the computer and the communications line, the system could autonomously initiate searches at one of the commercial search engines using any and all of the above derived information: sublists of the user's declared keywords and interests, pages similar to popular bookmarked pages, pages containing names popular in the bookmarked pages, pages with word-histograms similar to those of popular bookmarked pages, neighbors of any of those pages, and so on, in increasing order of detail. Once these pages arrive, the system would process them in the usual way as sketched above.
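
One way the idle-time queue might be assembled, in the stated order of increasing detail; the queries are kept as abstract (kind, value) pairs since the submission step depends entirely on what interface a given search engine offers, and max_queries caps how much is asked for in one idle period.

    from collections import Counter

    def build_query_queue(declared_keywords: list[str],
                          popular_bookmarks: list[str],
                          popular_names: Counter,
                          max_queries: int = 20) -> list[tuple[str, str]]:
        """Queries to run when the machine and line are otherwise idle."""
        queue: list[tuple[str, str]] = []
        queue += [("keyword", k) for k in declared_keywords]
        queue += [("similar-to", url) for url in popular_bookmarks]
        queue += [("name", n) for n, _ in popular_names.most_common()]
        return queue[:max_queries]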

This autonomous behavior would have to be fairly moderate; otherwise, search engine companies might disallow future searches from that user (the technology to identify users may soon be in use on most browsers). The system could also create webworms to do its searches if denied access to commercial search engines.

Now web searches can return pages ordered by their likely relevance to the user's bookmarked pages. Also, two kinds of searches become possible: searches within the user's home collection of pages and searches within the web at large. Automating the webmapping process would make it possible for your computer to tell you exactly what you're interested in on an ongoing basis. Not only is that useful information for your computer to have to better serve you, but it's useful information for you to have too.


