An HMM text classification model with learning capacity



This paper proposes and evaluates a method for classifying biomedical text documents based on a Hidden Markov Model (HMM). The method is integrated into a framework named BioClass, which is composed of intelligent text classification tools and offers several views of the results to facilitate comparison between them. The main goal is to propose a content-based classifier that is more effective than current methods in this environment. To test the effectiveness of the proposed classifier, a set of experiments performed on the OHSUMED corpus is presented. The model is tested both with and without learning capacity, and it is compared with other classification techniques. The results suggest that the adaptive HMM model is indeed more suitable for document classification.


Hidden Markov Model; Text classification; BioInformatics; Adaptive models





DOI: http://dx.doi.org/10.14201/ADCAIJ2014332134

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.
