Main Types Of Classification In Statistics
Classification is the process of assigning the members of a sample population to categories based on commonalities among them, for example age or sex. Many methods are available for grouping people (or any other items) by their traits and characteristics. This article explains, in plain language, what terms like “classification” and “classifier” mean in statistical studies. Its main aim is to help you understand three kinds of classification in detail: unsupervised, semi-supervised, and supervised classification. We will then look at three algorithms commonly met in everyday practice: Linear Regression (LR), Logistic Regression (LOG), and the Random Forest classifier (RF). (Strictly speaking, linear regression predicts continuous values rather than classes, but it is the foundation the other two build on.)
Types of Classification in Statistics
In almost every aspect of our lives we need to classify people or things based on their traits and characteristics. To address this problem, this article covers three different types of classification: unsupervised, semi-supervised, and supervised. Unsupervised classification is one of the simplest and fastest ways to group items, and it has only three basic steps (a minimal code sketch follows the list):
1st step: sort out information from various sources and look for similarities.
2nd step: cluster this information into subsets.
3rd step: build models from the clusters to classify new objects.
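As a minimal sketch of these three steps, the example below uses scikit-learn’s KMeans (one common clustering algorithm; the age/weight values are invented purely for illustration):

```python
# Unsupervised-classification sketch (assumed library: scikit-learn).
# Step 1: gather the information (here, made-up [age, weight] pairs).
# Step 2: cluster the samples into subsets.
# Step 3: use the fitted model to classify new objects.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[15, 50], [16, 55], [40, 80], [42, 85], [70, 65], [72, 60]])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # step 2
print(model.labels_)              # cluster assigned to each sample
print(model.predict([[41, 82]]))  # step 3: classify a new object
```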
Classification can be performed on both numerical and categorical variables (such as gender or age) and on any type of data, including text, audio, images, product reviews, and news articles.
As an example, let us say that there are 3 types of students in your school. The classification of the data is as follows:
1- Female student,
2- Male student,
3- Other type.
Once the students are divided into these subgroups, you can, for example, hand your teachers a list containing only the male students.
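A small sketch of that subgrouping, using pandas (the student records below are invented for illustration):

```python
# Hypothetical student records, split into the three subgroups above.
import pandas as pd

students = pd.DataFrame({
    "name": ["Asha", "Ben", "Chris", "Dana"],
    "category": ["Female student", "Male student", "Other type", "Female student"],
})

# List of only the male students, e.g. to hand to the teachers.
male_students = students[students["category"] == "Male student"]
print(male_students["name"].tolist())
```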
But there is another kind of classification called semi-supervised classification. Here only part of the data carries a class label (for instance, a binary male/female label), while the remaining records have features such as age and weight but no label at all.
We first find the best parameters of a linear regression describing the relationship between age and weight. We can then train a machine learning model that assigns labels to the unlabeled records and identify the right classifier for our task: the model can make predictions directly from the labeled training data, or we can use an error metric to select the classifier with the best (lowest-error) score.
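A minimal sketch of this idea, assuming scikit-learn’s SelfTrainingClassifier (one standard semi-supervised approach; the data is invented, and unlabeled samples are marked with -1):

```python
# Semi-supervised sketch: the model assigns labels to the unlabeled
# samples during training, as described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[15, 50], [16, 55], [40, 80], [42, 85], [41, 82], [17, 52]])
y = np.array([0, 0, 1, 1, -1, -1])  # last two samples have missing labels

clf = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(clf.predict([[39, 78]]))
```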
This method is often used after supervised and unsupervised models have already been tried. In our example the two variables are correlated: the age variable depends on weight, and the weight variable depends on age, so neither is truly independent of the other. If we step back to the beginning, age and weight could each be treated as an independent group, but since they carry information about one another there is no need to label them as separate, unrelated groups. A dataset can contain many such groups and still show a clear pattern, and groups can be divided further into smaller subgroups. These groups should not become too numerous, however, or the feature matrix built from them grows large and the output becomes needlessly complex. So while training our model, we should choose parameters that keep this matrix manageable.
Now, from a scatter plot of the variables, we can select the most informative feature, apply a decision tree algorithm, and find good parameter values for it (or use k-fold cross-validation to do so).
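A short sketch of that tuning step, assuming scikit-learn’s DecisionTreeClassifier and cross_val_score (features and labels are invented):

```python
# Tune a decision tree's depth with k-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.RandomState(0).rand(60, 2)  # invented features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # invented labels

for depth in [1, 2, 3]:  # candidate parameter values
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=5)
    print(depth, scores.mean())  # pick the depth with the best mean score
```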
Linear Regression
Linear regression is a simple approach that models linear relationships between variables; it does not capture non-linearities on its own. Most of the time it is used for predicting continuous values that vary linearly, so it works well in cases where we want the dependent (predicted) variable to follow a linear trend. The fitted equation is Y = W0 + W1·X, where X is the independent variable, W1 is the slope, and W0 is the constant intercept.
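A minimal sketch of fitting Y = W0 + W1·X, assuming scikit-learn and invented age/weight data:

```python
# Fit a straight line to invented data and read off W0 and W1.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[15], [16], [40], [42], [70]])  # age (invented)
Y = np.array([50, 55, 80, 85, 65])            # weight (invented)

reg = LinearRegression().fit(X, Y)
print(reg.intercept_, reg.coef_[0])  # W0 and W1
print(reg.predict([[30]]))           # predicted continuous value
```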
Logistic Regression
The logistic regression algorithm models the probability of class membership; it is the extension of linear regression to categorical (typically binary) outcomes. The logic behind logistic regression differs a little from linear regression: the dependent variable is a class label encoded as a dummy (0/1) variable, and the linear combination of the inputs is passed through the logistic (sigmoid) function to yield a probability.
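A minimal binary logistic regression sketch, assuming scikit-learn and invented data:

```python
# The linear combination W0 + W1*X is passed through the sigmoid,
# giving a probability of class 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[15], [16], [40], [42], [70], [72]])  # single feature (invented)
y = np.array([0, 0, 1, 1, 1, 1])                    # 0/1 dummy outcome (invented)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[30]]))  # probabilities for classes 0 and 1
print(clf.predict([[30]]))        # predicted class label
```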
Random Forest Algorithm
Random forest is an ensemble of decision trees. Each tree is trained on a random bootstrap sample drawn from the entire dataset (and considers random subsets of the features at each split), and the forest combines the individual trees by majority vote for classification (or by averaging for regression), constructing a single multi-class classification system.
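A minimal random forest sketch, assuming scikit-learn and invented data:

```python
# Each tree is grown on a bootstrap sample; the forest combines
# the trees by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)             # invented features
y = (X[:, 0] > 0.5).astype(int)  # invented labels

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X[:3]))        # majority vote across the 50 trees
```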