ferret package

Package Contract

This package includes the Ferret, FerretArray, and FerretFingerprintPool classes.

A Ferret fetches a page from the web given its URL, tests the page, and, if the page is approved, passes it on to the filters.

The FerretArray creates 10 new Ferrets to start up the process. Each Ferret then creates a new Ferret before it dies, and the process goes on.

The FerretFingerprintPool is a store for the FerretFingerprintFunctions.

The FerretFingerprintFunction contains information related to the page to be fetched and also gives an idea of the kind of tests that must be applied to this page before it is passed on to the next stage. A seedsite is the term given to the basic representative page in the space of pages chosen to be picked up by one of the Ferrets. It is of type PageID and acts as a reference to a DocumentNode.

The DocumentNode has a list of PageAttributes for each page in the space of pages, such as the URL of the page and the number of links on the page, which are used by the Ferret.

The Cache contains all pages fetched from the web by the Ferrets. It is used as a lookup for future Ferrets. Ferrets first examine it to check whether a page needs to be fetched from the web or has already been fetched by a previous Ferret.
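
The cache-first lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's actual code: PageSketch, CachingFetcherSketch, and the fetcher function are hypothetical stand-ins for PageAttribute, the Ferret's getPage logic, and the web fetch, and URLs are plain strings for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for PageAttribute: URL, originating seed page, and content. */
final class PageSketch {
    final String url;
    final String seedUrl;
    final String content;

    PageSketch(String url, String seedUrl, String content) {
        this.url = url;
        this.seedUrl = seedUrl;
        this.content = content;
    }
}

/** Hypothetical cache-first lookup; the fetcher function stands in for the web fetch. */
final class CachingFetcherSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> fetcher;
    int fetchCount = 0; // counts real fetches, kept only to make the behaviour visible

    CachingFetcherSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    PageSketch getPage(String pageUrl, String seedUrl) {
        String content = cache.get(pageUrl);
        if (content == null) {
            // Not fetched before: fetch it and remember it for later lookups.
            content = fetcher.apply(pageUrl);
            fetchCount++;
            cache.put(pageUrl, content);
        }
        return new PageSketch(pageUrl, seedUrl, content);
    }
}
```

The sketch is not thread-safe. If several Ferrets share one cache concurrently, some form of synchronization (for example a concurrent map) would presumably be needed.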

The FerretApprovedPagesCache contains all the pages approved by the Ferrets. This is looked up by the Filter, the next link in the chain of page analyzers.

This package allows the FerretAdvisor to insert FerretFingerprintFunctions into the FerretFingerprintPool. The Ferret then picks a FerretFingerprintFunction at random and gets hold of the seedsite contained in it (which is of type PageID). Using this PageID, it requests a DocumentNode. The DocumentNode provides a list of PageAttributes (which include the links on the page). The Ferret checks the Cache to see whether all these URLs (the links on the seedsite page) have already been fetched by previous Ferrets. If not, it fetches the pages and modifies them for base tags. (Some websites use relative addressing for images and links. It is therefore essential for the Ferret to insert a base tag that indicates the home URL, that is, the base URL of the current page.) The Ferret then stores these pages in the Cache. The Ferret applies the corresponding FerretFingerprintFunction to each page to decide whether the page should be accepted. If so, it stores the page in the FerretApprovedPagesCache; otherwise it rejects the page.
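
To make the base tag step concrete, here is a minimal sketch of how such an insertion might look. It assumes the page is already available as a String (the package itself works with a DataInputStream), and BaseTagSketch with its regular expressions is a hypothetical helper, not the Ferret's actual insertBaseTag method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; works on a String instead of a DataInputStream for brevity. */
final class BaseTagSketch {
    // Matches an existing <base ...> tag in any letter case (but not <basefont>).
    private static final Pattern BASE_TAG =
            Pattern.compile("<base\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches the opening <html ...> tag in any letter case.
    private static final Pattern HTML_TAG =
            Pattern.compile("<html\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String insertBaseTag(String baseUrl, String html) {
        if (BASE_TAG.matcher(html).find()) {
            return html; // the page already declares its base URL
        }
        String baseTag = "<base href=\"" + baseUrl + "\">";
        Matcher htmlTag = HTML_TAG.matcher(html);
        if (htmlTag.find()) {
            // Insert the base tag just after the opening HTML tag.
            int end = htmlTag.end();
            return html.substring(0, end) + baseTag + html.substring(end);
        }
        // Assumption: with no HTML tag at all, prepend the base tag.
        return baseTag + html;
    }
}
```

With the base tag in place, a relative link such as images/logo.gif resolves against the page's original URL even when the page is later served from the Cache.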

Package-Level CRCs


fingerprint package: FerretFingerprintFunction
advisor package: FerretAdvisor
cache package: Cache, FerretApprovedPagesCache
webpageDB package: DocumentServerImpl, DocumentNode, PageID
parser package: PageAttribute, AddPageContent, AddPageURL, AddSeedSite


Class-Level CRCs

The ferret package contains the following classes:

* Ferret
* FerretArray
* FerretFingerprintPool

Class Ferret

* Responsibilities:
To fetch web pages that are above a certain threshold for a given fingerprint function.
What the Ferrets do:
- grab a FerretFingerprintFunction
- access the PageID of the seedsite stored within the FerretFingerprintFunction
- use the PageID to obtain a DocumentNode
- use the DocumentNode to obtain PageAttributes such as the URL of the seedsite and the number of links on the seedsite
- fetch the pages corresponding to the links on the seedsite and modify them for base tags
- store the fetched pages in the Cache for lookup by future Ferrets
- invoke the method of the appropriate fingerprint function for approval
- if approved, store the page details in the FerretApprovedPagesCache for the Filter
- provide a method that takes a PageID and returns the text of the corresponding page as a string
* Collaborators:
package ferrets: FerretArray, FerretFingerprintPool
package fingerprint: FerretFingerprintFunction
package cache: Cache, FerretApprovedPagesCache
package filter: Filter
package webpageDB: DocumentServerImpl, DocumentNode, PageID
package parser: PageAttribute, AddPageContent, AddPageURL, AddSeedSite
* Variables and Methods:
public Ferret(FerretFingerprintPool ferretPool, Filter filter)
- Called by some Driver program or FerretAdvisor
- Accepts an instance of a FerretFingerprintPool and a Filter
- This method is invoked as the first step. It initializes the FerretFingerprintPool, the Filter, the Cache, FerretApprovedPagesCache and the DocumentServerImpl which are used later on by the ferrets.
public Ferret()
-Called by FerretArray to create the initial 10 Ferrets. A Ferret that is about to die also uses this constructor to create its successor.
public void run()
-Called by the Ferret constructor
-This method fetches a FerretFingerprintFunction from the pool. It then grabs the seedsite (of type PageID) stored within the FerretFingerprintFunction. Using this PageID, it requests a DocumentNode. From the DocumentNode it obtains the list of links on the page and the number of URLs on that page. For each link (which is a string) on that page, it constructs a URL object and calls getPage, which returns a PageAttribute. It then calls evaluatePage with the URL and the PageAttribute to check whether this page is to be passed on to the filter. As noted above, each Ferret creates a new Ferret before it dies.
public PageAttribute getPage(URL pageUrl, URL ferretSeed)
-Called by the run method
-Accepts the URL object of a page and the URL object of the page where it originated from (that is, the page on which this was a link)
-Returns a PageAttribute
-For each URL it receives, this method checks whether the page is already stored in the Cache. If not, it calls fetchPage(URL) to get it from the web and then stores the fetched page in the Cache for future Ferrets. It also creates an instance of the PageAttribute class and fills in the URL, the seedsite (the page it originated from), and the actual content of the page.
public DataInputStream fetchPage(URL pageUrl)
-Called by getPage
-Accepts the URL object of the page to be fetched
-Returns the DataInputStream of the page fetched
-This method establishes a URL connection from the URL object received, gets the InputStream of the page, and converts it into a DataInputStream. To handle automatic forwarding (redirects), it then overwrites the page URL with the URL of the fetched page. It also calls insertBaseTag to insert a base tag into the HTML page if one does not already exist.
public DataInputStream insertBaseTag(URL pageUrl, DataInputStream pageDataStream)
-Called by fetchPage()
-Accepts the URL object of a page and the DataInputStream of that page
-Returns the DataInputStream of the modified page
-This method first converts the page to an HTML string representation. Then it scans the HTML document for a base tag. If it does not find one, it prepares a base tag using the base URL of this page and inserts it into the HTML document just after the HTML tag. This is done to handle relative addressing of the links and images in the page.
public void evaluatePage(URL pageUrl, PageAttribute pageAttribute)
-Called by the run method once the page has been fetched and modified
-Accepts the URL object of a page and the PageAttribute with its three fields (URL, seedsite, and pagecontent) set.
-For each PageAttribute it receives, it invokes the isPageGood method of the corresponding FerretFingerprintFunction.
-If the isPageGood method returns a positive value it puts the pageAttribute into the FerretApprovedPagesCache, otherwise it rejects the page.
public String getPageText(PageID pageId)
-Takes a PageID as input and returns the text of the corresponding page as a string.
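
The overall flow of run, getPage, and evaluatePage described above can be sketched as a single pass over the links of a seedsite. This is only an illustration under stated assumptions: FerretRunSketch and runOnce are hypothetical names, the getPage function stands in for the cache lookup or web fetch, the isPageGood predicate stands in for the FerretFingerprintFunction, and the returned list stands in for the FerretApprovedPagesCache.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of one Ferret pass over the links found on a seedsite. */
final class FerretRunSketch {
    static List<String> runOnce(List<String> seedLinks,
                                Function<String, String> getPage,
                                Predicate<String> isPageGood) {
        List<String> approved = new ArrayList<>();
        for (String link : seedLinks) {
            // Obtain the page content (from the cache or from the web).
            String content = getPage.apply(link);
            // Keep the page only if the fingerprint function approves it.
            if (isPageGood.test(content)) {
                approved.add(content);
            }
        }
        return approved;
    }
}
```

For example, calling runOnce with links "a", "b", "c", a getPage function that returns "page-" plus the link, and a predicate that rejects content ending in "b" yields the list ["page-a", "page-c"].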

Class FerretFingerprintPool

* Responsibilities:
The purpose of this class is to store all the FerretFingerprintFunctions for the Ferrets.
What this class does:
- store all FerretFingerprintFunctions
- return a fingerprint function to the Ferret
- allow the Ferret Advisor to insert/revoke a FerretFingerprintFunction
* Collaborators:
package ferrets: Ferret
package fingerprint: FerretFingerprintFunction
package advisor: FerretAdvisor
* Variables and Methods:
public FerretFingerprintPool()
-Called by FerretAdvisor at the beginning
-Constructs the FerretFingerprintPool
public void insert(FerretFingerprintFunction ferretFunction)
-Called by the FerretAdvisor
-Provides facility to insert a FerretFingerprintFunction into the pool to be picked up by the Ferrets.
public FerretFingerprintFunction getFingerprint()
-called by a Ferret
-Returns a FerretFingerprintFunction
-This method first obtains a random number. It divides this random number by the number of FerretFingerprintFunctions currently in the pool and uses the remainder as an index into the pool. It returns the FerretFingerprintFunction located at this index.
public boolean remove(FerretFingerprintFunction ferretFunction)
-Called by FerretAdvisor
-Facilitates removal of a FerretFingerprintFunction from the pool (if required)
public Enumeration listFingerprints()
-Returns an Enumeration of the FerretFingerprintFunctions currently active in the pool
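
A minimal sketch of such a pool is shown below, assuming a list-backed store. FingerprintPoolSketch is a hypothetical name, the type parameter F stands in for FerretFingerprintFunction, and both the synchronized methods and the null result for an empty pool are assumptions that the description above does not cover.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Random;

/** Hypothetical pool sketch; F stands in for FerretFingerprintFunction. */
final class FingerprintPoolSketch<F> {
    private final List<F> functions = new ArrayList<>();
    private final Random random = new Random();

    synchronized void insert(F function) {
        functions.add(function);
    }

    synchronized boolean remove(F function) {
        return functions.remove(function); // true if the function was in the pool
    }

    synchronized F getFingerprint() {
        if (functions.isEmpty()) {
            return null; // assumption: the source does not say what an empty pool returns
        }
        // Remainder of a random number divided by the pool size, used as the index.
        // Math.floorMod keeps the index non-negative when nextInt() is negative.
        int index = Math.floorMod(random.nextInt(), functions.size());
        return functions.get(index);
    }

    synchronized Enumeration<F> listFingerprints() {
        // Enumerate a snapshot so later inserts or removals do not disturb the caller.
        return Collections.enumeration(new ArrayList<>(functions));
    }
}
```

In current Java, random.nextInt(functions.size()) would give the same kind of index more directly; the remainder form is kept here because it mirrors the description of getFingerprint.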

Class FerretArray

* Responsibilities:
To create the initial Ferrets. Once created, each Ferret will create a new one before it dies.
What this class does:
- creates 10 initial ferrets
- provides a method to change the initial number of ferrets active
* Collaborators:
package ferrets: Ferret
* Variables and Methods:
public FerretArray()
-creates the initial Ferrets
public void changeFerretCapacity(int ferretCapacity)
-changes the initial number of Ferrets activated
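
The replacement scheme, in which every Ferret creates a successor before it dies, might be sketched with plain threads as follows. FerretArraySketch is a hypothetical name, and the generations limit and the returned latch are assumptions added only so that the demonstration terminates and can be waited on; the description above lets the process continue indefinitely.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the FerretArray start-up and successor scheme. */
final class FerretArraySketch {
    private int ferretCapacity = 10; // initial number of Ferrets, as in the text
    final AtomicInteger ferretsRun = new AtomicInteger();

    void changeFerretCapacity(int ferretCapacity) {
        if (ferretCapacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.ferretCapacity = ferretCapacity;
    }

    /** Starts the initial Ferrets; each chain stops after the given number of generations. */
    CountDownLatch start(int generations) {
        CountDownLatch done = new CountDownLatch(ferretCapacity * generations);
        for (int i = 0; i < ferretCapacity; i++) {
            spawn(generations, done);
        }
        return done;
    }

    private void spawn(int remaining, CountDownLatch done) {
        if (remaining == 0) {
            return; // demo limit reached: this chain creates no further Ferrets
        }
        new Thread(() -> {
            ferretsRun.incrementAndGet(); // placeholder for the fetch-and-test work
            spawn(remaining - 1, done);   // create the successor before this Ferret dies
            done.countDown();
        }).start();
    }
}
```

With a capacity of 3 and 4 generations, 12 Ferrets run in total, and at most 3 chains are active at any time.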

Message Interactions

Message Interactions with other packages
* with fingerprint
The Ferret grabs a FerretFingerprintFunction from the FerretFingerprintPool and fetches the page specified. It then invokes a method of the FerretFingerprintFunction to check whether the page is to be passed on to the filter.
* with advisor
The FerretAdvisor inserts/revokes FerretFingerprintFunctions into/from the FerretFingerprintPool.
* with cache
The Ferret looks up the Cache to see if the page to be fetched has already been fetched by another Ferret. It also updates the Cache with pages fetched from the web. The Ferret dumps all pages approved into the FerretApprovedPagesCache and also passes an instance of the Filter to the FerretApprovedPagesCache for the Notifier.
* with webpageDB
The Ferret uses the webpageDB to obtain information about the PageAttributes for a particular seedsite. The flow is as follows:
-The Ferret creates an instance of the DocumentServerImpl
-The seedsite is of type PageID
-The Ferret invokes a method of the DocumentServerImpl instance and using the PageID obtains a DocumentNode.
-Using this DocumentNode it obtains the PageAttributes like URL of the page, and the number of links on the page.
* with parser
The Ferret stores the page details in the Cache and FerretApprovedPagesCache as PageAttributes. The Ferret creates an instance of PageAttribute and fills in the URL, seedsite, and pagecontent fields.
Message Interactions between classes in the package:
* FerretArray --- Ferret
The FerretArray creates the initial Ferrets according to a specified capacity. Once created, each Ferret before it dies creates a new one and the process continues.
* Ferret --- FerretFingerprintPool
The Ferret grabs a FerretFingerprintFunction at random from the FerretFingerprintPool. The Ferret creates an instance of the DocumentServerImpl. The seedsite is of type PageID.