440.00 a

Description: Learn what Exploratory Data Analysis (EDA) is, understand its importance, and discover how to conduct EDA effectively.

#400#ML_Engineer_Basic#440#ML_Methodology#440.00#EDA(Exploratory_Data_Analysis)#440.00 a#Initiating_Exploratory_Data_Analysis

Getting started with EDA

#EDA

Audience

Anyone who wants to know why EDA matters, how it is done, and what it takes to get started

Overview

  1. EDA Essentials: We often don't know the data we use
  2. EDA Defined: EDA is the act of exploring the actual data being produced and used.
  3. EDA Implementing: The process of understanding the current data and understanding situations through hypotheses

Content

EDA Essentials: We often don't know the data we use

In medical data, we might end up modeling on patients whose blood pressure is recorded as zero. Even when aiming for generalization, we might prioritize the rare diseases that account for 10% of cases instead of the common diseases that make up the other 90%. Or we might use numerical data without realizing that it should never contain negative values.
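A check of this kind can be sketched in a few lines. The following is only a minimal illustration, assuming hypothetical patient records and hypothetical plausibility rules (the field names `systolic_bp` and `age`, the values, and the rules are all made up here, not taken from any real dataset):

```python
# Hypothetical patient records; field names and values are invented for illustration.
records = [
    {"patient_id": 1, "systolic_bp": 120, "age": 54},
    {"patient_id": 2, "systolic_bp": 0, "age": 61},    # zero blood pressure: probably a placeholder
    {"patient_id": 3, "systolic_bp": 135, "age": -3},  # negative age: impossible value
    {"patient_id": 4, "systolic_bp": 118, "age": 47},
]

# Assumed plausibility rules: each field maps to a predicate that a valid value must satisfy.
rules = {
    "systolic_bp": lambda v: v > 0,
    "age": lambda v: v >= 0,
}

def find_violations(rows, rules):
    """Return (patient_id, field, value) for every value that breaks a rule."""
    violations = []
    for row in rows:
        for field, is_valid in rules.items():
            if not is_valid(row[field]):
                violations.append((row["patient_id"], field, row[field]))
    return violations

violations = find_violations(records, rules)
for patient_id, field, value in violations:
    print(f"patient {patient_id}: suspicious {field} = {value}")
```

The rules themselves are the important part: they can only be written down after asking what each field means and which values are physically possible.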

The problem itself can also be poorly defined. In modeling product sales, for example, it may be unclear whether overseas sales are included. If the problem was initially assumed to be domestic only, realizing later that overseas sales must be considered can reveal significant omissions and force re-modeling.

Unless the data has already been preprocessed, we usually don't know what we're working with. We might not understand why certain units were used, why product codes were written in a certain way, or why measurement intervals were set as they were. Preprocessed data can reduce these issues, but most of the time, we receive raw, unprocessed data.

The conclusions and models derived from data we do not understand can only be disappointing. We need a process to understand the data we're using and to clearly define the problem objective. This process is known as EDA (Exploratory Data Analysis).

EDA Defined: EDA is the act of exploring the actual data being produced and used.

EDA is more of a methodology than a fixed procedure, so definitions may vary from person to person. In my view, EDA is about exploring the actual data being produced and used.

Understanding the data being produced means identifying how it was collected: through human measurement, through an already established form, or by special equipment. This also leads to understanding the features of the collected data: why they are as they are, and whether additional features or further data collection are possible. From there, understanding the data involves determining whether certain units are mandatory, whether the format of product codes must be strictly followed, or whether the first three digits signify a major category. All of this helps in understanding the data we receive.
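As a minimal sketch of the product-code question, the snippet below checks whether codes follow a strict format and counts them by their first three digits. The code layout (three digits, a hyphen, four digits) and the sample codes are assumptions made up for illustration; a real code scheme has to be confirmed with whoever produces the data.

```python
import re
from collections import Counter

# Hypothetical product codes; the "3 digits - 4 digits" layout is an assumption.
codes = ["101-0001", "101-0002", "205-0010", "205-0011", "205-0012", "31-0007"]
CODE_PATTERN = re.compile(r"\d{3}-\d{4}")

# Separate codes that follow the assumed format from those that do not.
well_formed = [c for c in codes if CODE_PATTERN.fullmatch(c)]
malformed = [c for c in codes if not CODE_PATTERN.fullmatch(c)]

# If the first three digits really are a major category, this shows how items spread across categories.
category_counts = Counter(c[:3] for c in well_formed)

print("malformed codes:", malformed)
print("items per assumed category:", dict(category_counts))
```

A malformed code is not necessarily an error. It may point to an older code scheme or a different data source, which is exactly the kind of question EDA should raise.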

Understanding the data being used is much the same as understanding its intended purpose. It lets us set aside data deemed unnecessary and model for the actual intended use. For example, if the purpose is overseas product sales, distance and transportation costs become significant factors, whereas for domestic sales their importance might decrease.

Therefore, when EDA is well conducted, I believe one should be able to answer questions like:

"Why is this feature missing from the data? What does this mean? What data were used to achieve the objective?"

EDA Implementing: The process of understanding the current data and understanding situations through hypotheses

More specific methods for carrying out EDA include:

Exploring the visible aspects of the data.
Identifying the distribution of variation in the data, such as determining whether it follows a Poisson distribution or a normal distribution. Additionally, transforming suitable data into categories based on distribution, measurement cycles, or measurement stages can also be a useful method.
Finding correlations between variables is also a good EDA method. This can be done by visualizing heat maps to find significant correlations between specific variables.
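The two ideas above can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions: the two series are invented, the variance-to-mean ratio is only a quick heuristic for Poisson-like count data (not a formal test), and `pearson` is a hand-written helper rather than a library function. The resulting correlation matrix is the table that a heat map would color.

```python
import math
import statistics

# Hypothetical daily measurements, invented for illustration.
columns = {
    "visits": [2, 3, 1, 4, 2, 3, 5, 2, 1, 3],              # count data
    "temperature": [10, 12, 9, 15, 11, 13, 18, 10, 8, 12],
}

def dispersion_index(values):
    """Variance-to-mean ratio; a value near 1 is consistent with a Poisson distribution."""
    return statistics.variance(values) / statistics.mean(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

dispersion = dispersion_index(columns["visits"])
print(f"variance / mean for visits: {dispersion:.2f}")

# Pairwise correlation matrix; a heat map is a colored rendering of this table.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
for name, row in corr.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

With real data, a plotting library would normally render `corr` as a heat map. The numbers here only describe the invented series.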

Exploring the non-visible aspects of the data.
Another method of EDA is hypothesis testing, which involves exploring the unseen parts of the data.
For example, in the Titanic data, groups of 2-3 people had a higher survival rate than individuals, and while groups of 4-5 also had a higher survival rate than individuals, their rate was lower than that of groups of 2-3. This could suggest that having too many family members increased the likelihood of death as they tried to take care of each other. Additionally, a simple analysis of the distribution of first-, second-, and third-class tickets could help us understand the business practices and standards of the time, including ticket prices and the limitations of accommodations.
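A hypothesis like the group-size one starts with a simple grouped comparison. The sketch below shows the shape of that computation on toy rows that are invented for illustration and are NOT taken from the Titanic dataset, so the printed rates say nothing about the real data. The bucket boundaries follow the groups named above, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy rows of (group_size, survived); invented for illustration, not real Titanic records.
passengers = [
    (1, 0), (1, 0), (1, 1), (1, 0),
    (2, 1), (2, 1), (3, 0), (3, 1),
    (4, 1), (4, 0), (5, 0), (5, 1),
]

def size_bucket(group_size):
    """Map a group size to the buckets discussed in the text."""
    if group_size == 1:
        return "1"
    if group_size <= 3:
        return "2-3"
    if group_size <= 5:
        return "4-5"
    return "6+"

def survival_rate_by_bucket(rows):
    """Return the share of survivors in each group-size bucket."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for group_size, survived in rows:
        bucket = size_bucket(group_size)
        totals[bucket] += 1
        survivors[bucket] += survived
    return {bucket: survivors[bucket] / totals[bucket] for bucket in totals}

rates = survival_rate_by_bucket(passengers)
for bucket, rate in rates.items():
    print(f"group size {bucket}: survival rate {rate:.2f}")
```

A difference in rates is only a starting point. Whether it supports the hypothesis still depends on sample sizes and on other factors, such as ticket class, that may differ between the groups.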

Conclusion

EDA is the process of understanding what is visible and what is invisible in the data, and clarifying the purpose, which is necessary for accurate modeling and decision-making.