DIMENSIONALITY REDUCTION FOR PROTEIN SECONDARY STRUCTURE PREDICTION
Abstract
Proteins carry out essential metabolic processes in living organisms, and their functions can be understood by examining their three-dimensional structures. Because experimental determination of tertiary structure is costly, computational systems that predict the structure provide a convenient alternative. One important step in protein structure prediction is the identification of secondary structure labels. As new feature extraction methods are developed, the data sets used for this prediction can become high-dimensional, and some of the attributes can contain noisy data. For this reason, choosing the right number of features and the right attributes is an important step toward achieving good prediction accuracy. In this study, dimensionality reduction is applied to two different data sets using a deep autoencoder and various dimensionality reduction and feature selection techniques, including principal component analysis, chi-square, information gain, gain ratio, correlation-based feature selection (CFS), and the minimum redundancy maximum relevance algorithm, together with search strategies such as best-first search, genetic search, and greedy search. A support vector machine classifier is employed to evaluate the prediction accuracy.
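
As an illustration of one of the filter criteria named above, the following minimal sketch ranks discrete features by information gain with respect to secondary structure labels. It is not the implementation used in this study: the function names (entropy, information_gain, rank_features), the toy data, and the assumption that features are already discretized are all hypothetical, and the classification step with a support vector machine is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Information gain H(labels) - H(labels | feature) for one discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Conditional entropy: entropy of each group weighted by its relative size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels):
    """Return (feature_index, gain) pairs sorted by decreasing information gain."""
    n_features = len(rows[0])
    gains = [
        (j, information_gain([row[j] for row in rows], labels))
        for j in range(n_features)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: two discretized features per residue, with
# labels H (helix), E (strand), and C (coil). Feature 0 determines the
# label exactly; feature 1 is independent of it.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
labels = ["H", "H", "E", "E", "C", "C"]

ranking = rank_features(rows, labels)
selected = [index for index, _ in ranking[:1]]  # keep the single best feature
```

In this toy example the first feature receives the highest score and would be retained, while the second contributes no information about the label. Continuous features would need to be discretized before this criterion could be applied.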