What is relative frequency distribution

This article, "What is relative frequency distribution?", explains how relative frequency is used in statistics, what the difference is between a frequency and a relative frequency distribution, and why this distribution is worth learning.

What is Relative Frequency in Statistics?

A relative frequency distribution is a distribution that shows the proportion of observations falling into each value or class, i.e., each frequency divided by the total number of observations. These distributions can be applied in a variety of fields and are often used by data scientists and statisticians to summarize huge datasets. When we tabulate each distinct value individually, the result is a discrete distribution; when we group a continuous variable into class intervals, we get a grouped (continuous) distribution. Let’s understand relative frequency distribution with an example.

Assume a small dataset containing the values 10, 15, and 20. We want to find out which value is more frequent and which is rarer in our dataset. To do this, we count how many times each value occurs and divide each count by the total number of observations:

relative frequency of a value = (frequency of the value) / (total number of observations)

If the data take many distinct values, we can first divide the range into groups (for example 10–14, 15–19, 20–24) and compute the relative frequency of each group instead.

Our goal is to create a relative frequency distribution, which will tell us what proportion of the data each value (or group) accounts for. Because these proportions sum to 1, they can be read as empirical probabilities, which is one step towards understanding probability distributions in general. To know more about probability theory please visit probability theory.
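The computation above can be sketched in a few lines of Python (the helper name `relative_frequency` is our own, not a standard library function):

```python
from collections import Counter

def relative_frequency(data):
    """Return each distinct value's count divided by the total number of observations."""
    counts = Counter(data)
    total = len(data)
    return {value: count / total for value, count in counts.items()}

# A small dataset containing the values 10, 15, and 20.
data = [10, 10, 10, 15, 15, 20]
dist = relative_frequency(data)
print(dist)  # 10 appears in half the data, 15 in a third, 20 in a sixth
```

The proportions always sum to 1, which is what lets us read them as empirical probabilities.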

That covers the basics of relative frequency distribution; let’s go further and discuss its applications in various areas of data science.

A Simple Example

Suppose I have two datasets — a training set and a test set — and two input variables — gender and age. The first part of our problem is to find the relative frequency distribution of these input variables in each dataset, i.e., how often each gender and age category occurs in the given data.

To solve this problem, let’s take each input variable and treat each of its categories as a class. The total count for a variable is simply the sum of the counts of its individual categories, so dividing each category’s count by that total gives us the category’s relative frequency.

Now we apply this calculation to both datasets. Since these inputs are categorical and discrete, their relative frequencies give us an empirical probability for each category. Comparing the training-set and test-set distributions then tells us whether the two datasets have a similar composition; if the proportions differ sharply, predictions from a model trained on one set may not transfer well to the other.
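A minimal sketch of that comparison in Python — the `category_proportions` helper and the gender labels for the two sets are illustrative assumptions, not real data:

```python
from collections import Counter

def category_proportions(values):
    """Relative frequency of each category in a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: count / total for cat, count in counts.items()}

# Hypothetical gender labels for a training set and a test set.
train_gender = ["male", "male", "female", "male", "female", "female", "male", "male"]
test_gender = ["female", "male", "female", "female"]

train_dist = category_proportions(train_gender)
test_dist = category_proportions(test_gender)

# Compare the proportion of each category across the two sets.
for cat in sorted(set(train_dist) | set(test_dist)):
    print(cat, train_dist.get(cat, 0.0), test_dist.get(cat, 0.0))
```

Here the training set is mostly male while the test set is mostly female, a mismatch that this kind of side-by-side table makes easy to spot.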

Binomial Random Variable

Building on the example above, suppose we record a single yes/no outcome for each person — say, whether gender = male. Each observation is then a Bernoulli trial with some success probability p, and the number of “successes” in n independent observations is a binomial random variable. The relative frequency of successes — the count divided by n — is the natural estimate of p, and it settles closer to p as n grows.
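As a rough illustration, assuming we simulate the Bernoulli trials with Python’s `random` module (the function name and the choice p = 0.4 are our own):

```python
import random

random.seed(0)

def binomial_relative_frequency(n, p):
    """Simulate n Bernoulli trials and return the relative frequency of successes."""
    successes = sum(1 for _ in range(n) if random.random() < p)
    return successes / n

# With more trials, the relative frequency settles near the true p.
for n in (10, 100, 10_000):
    print(n, binomial_relative_frequency(n, p=0.4))
```

This convergence of relative frequency to the underlying probability is exactly why relative frequency distributions are useful as probability estimates.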

Categorical Data

With categorical data, we can also compute relative frequencies jointly. For example, we can tabulate how often each combination of gender and age group occurs, and from that joint table read off conditional proportions — say, the fraction of people under 18 years who are male. Our model can then use these proportions as empirical probabilities when predicting gender. But when we apply several factors together — like age, gender, and location — the number of combinations grows quickly, the table becomes sparse, and the results become harder to interpret: a cell such as gender = male, age = 14 years, location = North may contain only a handful of observations. There are many factors to keep in mind while predicting, so we should treat relative frequencies from small cells with caution and make predictions using the simplest approach that captures the structure of the data.
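A small sketch of joint and conditional relative frequencies, using invented (gender, age group) records:

```python
from collections import Counter

# Hypothetical records of (gender, age_group) pairs.
records = [
    ("male", "under_18"), ("male", "under_18"), ("female", "under_18"),
    ("male", "18_plus"), ("female", "18_plus"), ("female", "18_plus"),
    ("female", "18_plus"), ("male", "under_18"),
]

# Joint relative frequency of each (gender, age_group) combination.
total = len(records)
joint = {combo: count / total for combo, count in Counter(records).items()}

# Conditional proportion P(gender = male | age_group = under_18),
# estimated from the under-18 records alone.
under_18 = [g for g, a in records if a == "under_18"]
p_male_given_under_18 = under_18.count("male") / len(under_18)

print(joint)
print(p_male_given_under_18)  # 3 of the 4 under-18 records are male -> 0.75
```

Adding a third factor such as location would multiply the number of cells in `joint`, which is exactly how the sparsity problem described above arises.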

Conclusion

From the above discussion it is quite clear that relative frequency distribution is useful for turning raw counts into empirical probabilities that support prediction. It can also be used to explore associations among various inputs and help us build better predictive algorithms.