Synthetic protein sequence oversampling method for classification and remote homology detection in imbalanced protein data
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN: 1611-3349, Vol: 4414 LNBI, Page: 263-277
2007
- 3Citations
- 13Captures
Metric Options: CountsSelecting the 1-year or 3-year option will change the metrics count to percentiles, illustrating how an article or review compares to other articles or reviews within the selected time period in the same journal. Selecting the 1-year option compares the metrics against other articles/reviews that were also published in the same calendar year. Selecting the 3-year option compares the metrics against other articles/reviews that were also published in the same calendar year plus the two years prior.
Example: if you select the 1-year option for an article published in 2019 and a metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019. If you select the 3-year option for the same article published in 2019 and the metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019, 2018 and 2017.
Citation Benchmarking is provided by Scopus and SciVal and is different from the metrics context provided by PlumX Metrics.
Example: if you select the 1-year option for an article published in 2019 and a metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019. If you select the 3-year option for the same article published in 2019 and the metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019, 2018 and 2017.
Citation Benchmarking is provided by Scopus and SciVal and is different from the metrics context provided by PlumX Metrics.
Conference Paper Description
Many classifiers are designed with the assumption of well-balanced datasets. But in real problems, like protein classification and remote homology detection, when using binary classifiers like support vector machine (SVM) and kernel methods, we are facing imbalanced data in which we have a low number of protein sequences as positive data (minor class) compared with negative data (major class). A widely used solution to that issue in protein classification is using a different error cost or decision threshold for positive and negative data to control the sensitivity of the classifiers. Our experiments show that when the datasets are highly imbalanced, and especially with overlapped datasets, the efficiency and stability of that method decreases. This paper shows that a combination of the above method and our suggested oversampling method for protein sequences can increase the sensitivity and also stability of the classifier. Synthetic Protein Sequence Oversampling (SPSO) method involves creating synthetic protein sequences of the minor class, considering the distribution of that class and also of the major class, and it operates in data space instead of feature space. We used G-protein-coupled receptors families as real data to classify them at subfamily and sub-subfamily levels (having low number of sequences) and could get better accuracy and Matthew's correlation coefficient than other previously published method. We also made artificial data with different distributions and overlappings of minor and major classes to measure the efficiency of our method. The method was evaluated by the area under the Receiver Operating Curve (ROC). © Springer-Verlag Berlin Heidelberg 2007.
Bibliographic Details
http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=34548074502&origin=inward; http://dx.doi.org/10.1007/978-3-540-71233-6_21; http://link.springer.com/10.1007/978-3-540-71233-6_21; https://dx.doi.org/10.1007/978-3-540-71233-6_21; https://link.springer.com/chapter/10.1007/978-3-540-71233-6_21; http://www.springerlink.com/index/10.1007/978-3-540-71233-6_21; http://www.springerlink.com/index/pdf/10.1007/978-3-540-71233-6_21
Springer Science and Business Media LLC
Provide Feedback
Have ideas for a new metric? Would you like to see something else here?Let us know