Classification of protein-protein interaction full-text documents using text and citation network features


Artemy Kolchinsky1,2, Alaa Abi-Haidar1,2, Jasleen Kaur1, Ahmed Abdeen Hamed, and Luis M. Rocha1,2,*

1School of Informatics and Computing, Indiana University, 1900 East Tenth Street, Bloomington IN 47408, USA
2FLAD Computational Biology Collaboratorium, Instituto Gulbenkian de Ciencia, Portugal
*To whom correspondence should be addressed: rocha@indiana.edu

Citation: A. Kolchinsky, A. Abi-Haidar, J. Kaur, A.A. Hamed, and L.M. Rocha [2010]. "Classification of protein-protein interaction full-text documents using text and citation network features". IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):400-411. DOI: doi.ieeecomputersociety.org/10.1109/TCBB.2010.55.

The full text and PDF reprint are available from the TCBB site. Because the paper relies on mathematical notation and graphics, only the abstract is presented here. Our PDF preprint is also available.

Abstract.

We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: (1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier, which we successfully introduced in BioCreative 2 for binary classification of abstracts, and (2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task according to the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient performance measures. The citation network classifier, novel to the biomedical text mining domain, was not a top-performing classifier in the challenge, but it performed above the central tendency of all submissions and therefore indicates a promising new avenue to investigate further in bibliome informatics.
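
As an illustration of the performance measures named above, the following is a minimal sketch, not the challenge's official evaluation code. It computes Accuracy, the Balanced F-Score, and the Matthews Correlation Coefficient from a binary confusion matrix, and then combines per-measure ranks into a rank product. The function names `binary_metrics` and `rank_products` are hypothetical. The sketch assumes that higher values are better for every measure, that ties are broken arbitrarily, and that the rank product is the plain product of ranks (some definitions take a geometric mean, which gives the same ordering). The Area Under the interpolated precision and recall Curve is omitted because it requires the full ranked output of a classifier.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, balanced F-score (F1) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def rank_products(scores):
    """Product of per-measure ranks (1 = best) for each submission.

    scores maps a submission name to a list of measure values, one per
    measure, in the same order for every submission (higher is better).
    A lower rank product means a better overall standing.
    """
    teams = list(scores)
    n_measures = len(next(iter(scores.values())))
    products = {team: 1 for team in teams}
    for m in range(n_measures):
        ordered = sorted(teams, key=lambda t: scores[t][m], reverse=True)
        for rank, team in enumerate(ordered, start=1):
            products[team] *= rank
    return products


if __name__ == "__main__":
    # Made-up counts and scores, for illustration only.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
    print(rank_products({"A": [0.9, 0.8], "B": [0.7, 0.9], "C": [0.5, 0.5]}))
```

The rank product rewards submissions that rank consistently well across all measures, rather than those that excel on a single one.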

Keywords: protein-protein interaction, text mining, bibliome informatics, support vector machines, citation network, complex networks, literature mining, binary classification.


For more information contact Luis Rocha at rocha@indiana.edu.
Last Modified: July 29, 2010