Protein classifier

5 min read · Jul 11, 2021

Proteins play important roles in living organisms, and their function is directly linked to their structure. Because of the growing gap between the number of proteins being discovered and the number that have been functionally characterized, in particular as a result of experimental limitations, reliable prediction of protein function through computational means has become crucial.

First, import all the necessary data.

After that, we need to preprocess the data.

Before preprocessing, the raw data contains a lot of information that we do not need. We keep only what is necessary to feed our algorithm, because the extra fields may lead to bad performance. Here, the necessary information is the protein ID, the protein sequence, and the function ID. We collect these fields (for the entries that match our regular expression) into a new file and feed that file to our algorithm, so that the algorithm can learn from it more easily. Also, this project has a lot of data (around 500,000 entries), but we feed our algorithm only a limited number of them. That restriction can also be applied during preprocessing.
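The preprocessing step described above could be sketched roughly as follows. This is only an illustration: the tab-separated line layout, the two regular expressions, the function name preprocess, and the sample protein IDs are all assumptions made for the example, not the format or code used in the project.

```python
import re

# Assumed raw format (hypothetical): "protein_id<TAB>sequence<TAB>annotation text"
GO_ID = re.compile(r"GO:\d{7}")           # function IDs such as GO:0005524
VALID_SEQUENCE = re.compile(r"^[A-Z]+$")  # sequences written in one-letter codes


def preprocess(lines, max_records):
    """Keep only protein ID, sequence and function IDs, up to max_records entries."""
    records = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip malformed lines
        protein_id, sequence, annotations = parts[0], parts[1], parts[2]
        function_ids = GO_ID.findall(annotations)
        if not VALID_SEQUENCE.match(sequence) or not function_ids:
            continue  # skip entries that do not match the regular expressions
        records.append(
            {"id": protein_id, "sequence": sequence, "functions": function_ids}
        )
        if len(records) >= max_records:
            break  # feed the algorithm only a limited number of entries
    return records


# Made-up sample lines, only to exercise the function.
raw_lines = [
    "P00001\tMKTAYIAK\tATP binding [GO:0005524]; kinase activity [GO:0016301]\n",
    "P00002\tmkt??\tunknown\n",
    "P00003\tGAVLIMFW\tDNA binding [GO:0003677]\n",
    "P00004\tMSTNPKPQ\tATP binding [GO:0005524]\n",
]
records = preprocess(raw_lines, max_records=2)
for record in records:
    print(record["id"], record["sequence"], record["functions"])
```

Putting the limit inside the same loop means the filtering and the size restriction happen in one pass over the raw file.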

Once preprocessing is done, we need to work with the actual list.

Import necessary libraries

We have data for a group of proteins. Each protein is identified by a unique ID code and has a sequence of amino acid codes, where the order of the amino acids is what makes it a particular protein. In this project, we are going to identify the proteins that perform the ATP binding function, which has the code GO:0005524. We find the proteins annotated with this ID and build two lists. The first is a list of the proteins along with their amino acid sequences. The second collects only the proteins themselves (without their sequences) in an array.
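As a small sketch of the two lists described above, assuming the preprocessed data is available as a list of dictionaries. The record layout, the variable names, and the sample protein IDs are hypothetical; only the code GO:0005524 comes from the text.

```python
ATP_BINDING = "GO:0005524"  # function ID for ATP binding

# Hypothetical preprocessed records, made up for this example.
proteins = [
    {"id": "P00001", "sequence": "MKTAYIAK", "functions": ["GO:0005524", "GO:0016301"]},
    {"id": "P00003", "sequence": "GAVLIMFW", "functions": ["GO:0003677"]},
    {"id": "P00004", "sequence": "MSTNPKPQ", "functions": ["GO:0005524"]},
]

# First list: every protein together with its amino acid sequence.
sequences = [(p["id"], p["sequence"]) for p in proteins]

# Second list: only the IDs of the proteins annotated with ATP binding.
atp_binders = [p["id"] for p in proteins if ATP_BINDING in p["functions"]]

# One label per protein: 1 if it binds ATP, 0 otherwise.
atp_binder_set = set(atp_binders)
labels = [1 if protein_id in atp_binder_set else 0 for protein_id, _ in sequences]

print(atp_binders)  # ['P00001', 'P00004']
print(labels)       # [1, 0, 1]
```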

LOADING THE DATA AND GETTING THE SHAPE OF THE DATA: In this step we change the data into another format. We need the numpy, os, Keras sequence, and json packages. First, the function annotations are loaded with json, which gives us the list of all proteins that exhibit ATP binding behaviour. We then set the sequence size, initialise some values, and read the protein sequences from the lines of the file, converting each one into a sequence of indices. The lines that are read can also be changed. Finally, we print each label with its sequence.
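Converting a protein sequence into a fixed-length sequence of indices might look like the sketch below. It uses plain Python in place of the Keras sequence utilities so that it runs without extra packages. The 23-symbol alphabet (a padding symbol, the 20 standard amino acids, plus "U" and "X") and the helper name sequence_to_indices are assumptions; the project may use a different alphabet or ordering.

```python
# Assumed alphabet (hypothetical): "_" for padding, the 20 standard amino
# acids, plus "U" and "X", which gives 23 symbols in total.
ALPHABET = "_ACDEFGHIKLMNPQRSTVWYUX"
CHAR_TO_INDEX = {char: index for index, char in enumerate(ALPHABET)}
SEQUENCE_SIZE = 500  # every sequence is cut or padded to this length


def sequence_to_indices(sequence, size=SEQUENCE_SIZE):
    """Map each letter to an integer, then truncate or zero-pad to a fixed length."""
    unknown = CHAR_TO_INDEX["X"]  # letters outside the alphabet are treated as "X"
    indices = [CHAR_TO_INDEX.get(char, unknown) for char in sequence[:size]]
    return indices + [0] * (size - len(indices))


encoded = sequence_to_indices("MKTAYIAK")
print(len(encoded))  # 500
print(encoded[:10])  # [11, 9, 17, 1, 20, 8, 1, 9, 0, 0]
```

Fixing the length this way is what lets all data points be stacked into a single array with one shape.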

Training and testing
===> We have all data points in the X_all and Y_all variables.
===> Next, we want to split the data into training and testing sets. Here, we use 66% as training data and 33% as testing data.
===> Before splitting, we shuffle the data, because we don't want to train the machine only on the beginning of the data and test it only on the end. Training and testing on randomly chosen data points gives the machine a wider view.
===> After shuffling, we retrieve the training set and the testing set.
===> Once the randomizing and retrieving are finished, we print the shapes for confirmation before feeding the data into the machine. Before splitting, the shapes are 7 and 7. After splitting, we have 5 and 5 for the training set and 2 and 2 for the testing set.
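The shuffle-then-split steps above can be sketched in plain Python as follows. The contents of X_all and Y_all are made-up stand-ins, and rounding 66% of 7 up to 5 training points is an assumption chosen to match the shapes quoted in the text.

```python
import random

# Hypothetical stand-ins for X_all and Y_all: 7 data points of 500 values each.
X_all = [[i] * 500 for i in range(7)]
Y_all = [i % 2 for i in range(7)]

# Shuffle one list of positions so inputs and labels stay aligned.
rng = random.Random(0)  # fixed seed, only to make the example repeatable
order = list(range(len(X_all)))
rng.shuffle(order)
X_all = [X_all[i] for i in order]
Y_all = [Y_all[i] for i in order]

# 66% for training, the rest for testing (round(7 * 0.66) gives 5).
n_train = round(len(X_all) * 0.66)
X_train, Y_train = X_all[:n_train], Y_all[:n_train]
X_test, Y_test = X_all[n_train:], Y_all[n_train:]

print(len(X_train), len(Y_train))  # 5 5
print(len(X_test), len(Y_test))    # 2 2
```

Shuffling the positions rather than the two lists separately is what keeps every input paired with its own label.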

Shapes: Initially, we have 5 data points, and each data point has 500 elements, so the shape is (5, 500). We then write each element as a one-hot representation, which gives (5, 500, 23), where 23 is the number of amino acid codes: ([[……..23]…..500], [[……..23]…..500], [[……..23]…..500], [[……..23]…..500], [[……..23]…..500]). Hence the final shape is (5, 500, 23).
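To make the change of shape concrete, here is a minimal sketch using nested lists. The helpers one_hot and shape are written only for this illustration, and the integer codes are made-up values; with numpy the same check would normally be done through an array's shape attribute.

```python
NUM_SYMBOLS = 23     # length of each one-hot vector
SEQUENCE_SIZE = 500  # elements per data point


def one_hot(indices, num_symbols=NUM_SYMBOLS):
    """Turn a list of integer codes into a list of one-hot vectors."""
    return [
        [1 if position == index else 0 for position in range(num_symbols)]
        for index in indices
    ]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Five hypothetical data points, each holding 500 integer codes.
X = [[(i + j) % NUM_SYMBOLS for j in range(SEQUENCE_SIZE)] for i in range(5)]
X_one_hot = [one_hot(row) for row in X]

print(shape(X))          # (5, 500)
print(shape(X_one_hot))  # (5, 500, 23)
```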

TESTING AND TRAINING THE DATASET: The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
TRAIN DATASET: Used to fit the machine learning model.
TEST DATASET: Used to evaluate the fitted machine learning model.
The objective is to estimate the performance of the machine learning model on new data: data not used to train the model. In our protein function classifier, splitting into training and testing sets is only needed for large predictions, and it is done based on the size of the dataset. Here each data point has size 500, and we split the data points into two parts, 66% and 33%. Before splitting, x_shape is (7, 500) and y_shape is (7,). We randomise the dataset by reordering the data points, for example (6, 2, 5, …), and print the shapes again: we get (5, 500) as the training input shape and (5,) as the training label shape, and likewise for the test set.
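Comparing predictions with the expected values on the test dataset can be as simple as counting matches. The sketch below assumes accuracy as the comparison measure, which the text does not specify, and the labels and predictions are made-up values, not results from the project.

```python
def accuracy(expected, predicted):
    """Fraction of test points whose predicted label equals the expected label."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and equally long")
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)


# Hypothetical test labels and model predictions for the 2 test points.
Y_test = [1, 0]
predictions = [1, 1]

test_accuracy = accuracy(Y_test, predictions)
print(test_accuracy)  # 0.5
```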

GitHub source code: https://github.com/PRIYADHARSHINI1911/Protein-Classifier

Thank you!!!

Written by PRIYADHARSHINI K