What is data in statistics

This article is about "What is data in statistics". We will discuss what data is, its types in statistics, and examples of data in statistics.

What is data in statistics


Data is generally defined as a collection of observations, which can be numerical, categorical or a combination of both. It can therefore be regarded as a source of information for analysis and prediction. Data can be stored in tables and graphs, in databases or in computer files. In many situations it is stored as text records on computers, on paper or in digital libraries. Sensitive data should be kept confidential and secured by encryption, authorization and protection from unauthorized access.
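
As a small illustration of observations that mix categorical and numerical values, the sketch below summarizes each kind differently: categories are counted, numbers are averaged. The records, the field names `fuel_type` and `engine_size_l`, and the helper `summarize` are all hypothetical, invented only for this example.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations: each record holds one categorical value
# (fuel_type) and one numerical value (engine_size_l, in litres).
observations = [
    {"fuel_type": "petrol", "engine_size_l": 1.6},
    {"fuel_type": "hybrid", "engine_size_l": 2.0},
    {"fuel_type": "petrol", "engine_size_l": 1.2},
    {"fuel_type": "diesel", "engine_size_l": 2.0},
]


def summarize(records):
    """Count the categories and average the numerical column."""
    counts = Counter(r["fuel_type"] for r in records)
    average = mean(r["engine_size_l"] for r in records)
    return counts, average


counts, average = summarize(observations)
print(counts)
print(round(average, 2))
```

Counting suits categorical values because they have no meaningful arithmetic, while a mean only makes sense for the numerical column.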


Data sources used in statistics mainly fall into two kinds of tables: one contains raw data from sensors and processes, and the other contains detailed information about processes, events and people, which is essential for understanding the nature of those processes. For example, a widely told story holds that Ford ran a program called ‘Ford Motor Company X’ that helped to reduce car accidents by almost 99%.


A dataset is a collection of data that researchers analyse to make discoveries. A dataset is usually organised as rows and columns (the columns are also called fields or variables), and it contains the information required to perform analyses and build visualizations. There are three types of datasets: structured, unstructured and semi-structured. Structured data can be easily accessed using SQL or Excel, or may be stored in files that are accessible through a web server.

The most common form of structured data is a relational database table. Consider a table with three columns, ‘user’, ‘email’ and ‘phone_number’. These columns are called attributes, and each record in the table holds one value for each attribute. Each person has their own set of values, with ‘user’ uniquely identifying the record. For example, at Ford Motor Company, when a customer purchases a car, the record carries attributes such as vehicle weight, engine size and fuel efficiency, as well as whether the customer bought the hybrid or the standard version. Thus, every field contains the value of one attribute.
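
A minimal sketch of such a table, using Python's built-in `sqlite3` module, might look like the following. The table name `customers`, the helper `build_customer_table` and the sample rows are assumptions made for illustration; only the three attribute names come from the text.

```python
import sqlite3


def build_customer_table():
    """Create an in-memory table with the three attributes from the text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE customers ('
        '"user" TEXT PRIMARY KEY, email TEXT, phone_number TEXT)'
    )
    # Hypothetical sample rows, invented for illustration only.
    rows = [
        ("alice", "alice@example.com", "555-0100"),
        ("bob", "bob@example.com", "555-0101"),
    ]
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = build_customer_table()
# Each record holds exactly one value per attribute, so SQL can select by column.
emails = [row[0] for row in conn.execute(
    'SELECT email FROM customers ORDER BY "user"')]
print(emails)
```

Declaring `user` as the primary key is one way to enforce that each record is uniquely identified.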


Structured and Unstructured Data


Both structured and unstructured data can be processed with data mining methods to extract meaningful information. They are important because they provide rich information about companies, people, organisations and events. Unstructured data takes many forms: social media content (blogs, video clips, pictures and documents), web pages, images, text and audio recordings. Structured data usually consists of facts and figures, numbers, graphs, tables and charts. Unstructured data may share some attributes and formats with structured data, but it often contains more complex relationships among fields and more detail about what is happening. The techniques used to analyse such data can be grouped into six types:


1. Graphical models
2. Non-parametric estimations (NPO)
3. Clustering techniques
4. Regression techniques
5. Cluster Analysis
6. Anomaly detection and so on


1. Graphical Models


These kinds of data can be represented by tree diagrams, flowcharts, biclustering diagrams and other graphical representations used in data mining. Graphical modelling lets us represent the relationships between entities in unstructured data, such as customers, employees and managers, and helps us to identify patterns, trends, similarities, differences and anomalies. It also makes large amounts of data easier to analyse and interpret, and gives better insight into the way people behave.


2. Non-parametric Estimation (NPO)


Non-parametric estimation does not assume that the data follow a particular distribution described by a fixed set of parameters. Instead, the shape of the estimate is driven by the sample itself, so characteristics of a population can be estimated from a sample without access to the entire data set. Sensitivity to outliers depends on the method: rank-based estimates such as the median are robust, while others are affected more strongly. Non-parametric models are used in applications such as facial recognition, voice recognition and intelligent agents.
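
One common non-parametric technique, chosen here purely for illustration, is kernel density estimation: the density of a population is estimated directly from a sample, with no assumed distribution family. The sketch below uses a Gaussian kernel; the function name `gaussian_kde`, the sample values and the bandwidth of 0.8 are assumptions for this example.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a density function estimated from the sample alone."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes one Gaussian bump centred on itself.
        total = sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )
        return norm * total

    return density


# Hypothetical sample with one value (9.5) far from the rest.
sample = [4.1, 4.8, 5.0, 5.3, 6.2, 9.5]
density = gaussian_kde(sample, bandwidth=0.8)
print(density(5.0) > density(8.0))
```

The bandwidth controls how smooth the estimate is: a small value follows the sample closely, a large value blurs it.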

3. Clustering Techniques


In clustering, data points are grouped so that the members of a group are more similar to each other than to the members of other groups. In centroid-based clustering, each cluster is represented by a centroid, the central point of its members, and every data point (for example a user, an email or a call) is assigned to the nearest centroid. Hierarchical clustering instead organises the data into levels: it partitions the data into small subgroups and then forms higher-level clusters based on the similarity of those subgroups.
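
The centroid idea can be sketched with a small k-means routine on one-dimensional data. This is a minimal illustration rather than a production implementation; the function name `k_means_1d`, the data points and the starting centroids are assumptions for this example.

```python
def k_means_1d(points, centroids, iterations=10):
    """Centroid-based clustering (k-means) on one-dimensional data."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(p - centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is why practical implementations usually try several initialisations.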


4. Regression Techniques


In regression, we try to predict the dependent variable y given some independent variables x. The model assigns a coefficient to each independent variable; these coefficients are also called weights, and they describe the relationship between the independent and dependent variables. We use regression techniques to find the best-fit line for the data and estimate the values of the coefficients. We can then interpret the coefficients to see how strongly each independent variable is associated with the dependent variable, keeping in mind that an association alone does not establish a cause.
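
For a single independent variable, the best-fit line can be found with ordinary least squares. The sketch below is a minimal version under that assumption; the function name `fit_line` and the data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements lying close to the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Here the slope is the weight of x: it estimates how much y changes when x increases by one unit.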


5. Cluster Analysis


In cluster analysis, for example, if you want to know what effect the introduction of new rules of engagement for your students had on their performance, you can look at all the cases in the school system, group similar cases together, and analyse the factors contributing to the dropout rates in each group of schools.