Data preprocessing is a step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning models.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it contain errors and inconsistencies, but it is often incomplete, and doesn’t have a regular, uniform design.
There are a lot of preprocessing methods, but we will mainly focus on the following methodologies:
(1) Encoding the Data
(2) Normalization
(3) Standardization
(4) Imputing the Missing Values
(5) Discretization
I have used the ‘California Housing Prices’ dataset. This dataset contains information such as longitude, latitude, ocean proximity, population, number of bedrooms, number of rooms, house price, etc.
We will perform all of the preprocessing methods using the scikit-learn library, which is popular for ML/DL workloads. I have attached a Colab file for your reference. We will try to get insights into each of the methods listed above. So, let’s dive in.
- Encoding: Encoding means converting information from one format to another. It is needed whenever we have categorical values: encoding assigns a unique number to each category. Most of the time categorical values come as labels (e.g. yes, no, true, false), and a computer will not treat them as features directly because it works with numbers. So we have to assign a numerical value to each category, and that process is called ‘Encoding’. Label Encoding, Ordinal Encoding, and One-Hot Encoding are the main encoding techniques.
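As a quick sketch, here are label encoding and one-hot encoding applied to a few `ocean_proximity`-style values with scikit-learn (the sample values are illustrative; the real column in the dataset has a couple more categories):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A tiny sample of categorical values (illustrative, not the full column)
X = np.array([["INLAND"], ["NEAR OCEAN"], ["INLAND"], ["NEAR BAY"]])

# Label encoding: one integer per category, assigned in sorted order
labels = LabelEncoder().fit_transform(X.ravel())
print(labels)  # [0 2 0 1]

# One-hot encoding: one binary column per category, a single 1 per row
onehot = OneHotEncoder().fit_transform(X).toarray()
print(onehot)
```

Note that label encoding implies an ordering (0 < 1 < 2) that the categories may not actually have, which is why one-hot encoding is usually preferred for nominal features.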
- Normalization: Normalization is used to scale the data of an attribute so that it falls within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When multiple attributes have values on different scales, this can lead to poor data models while performing data mining operations, so the attributes are normalized to bring them all onto the same scale. There are three common methods of data normalization: 1. Decimal Scaling, 2. Min-Max Normalization, and 3. Z-Score Normalization (zero-mean normalization).
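Min-Max normalization is the variant scikit-learn exposes as `MinMaxScaler`; a minimal sketch on toy values (not the real housing data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 500.0]])

# Rescale each column independently to the range [0, 1]
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(X_scaled)  # each column now has min 0.0 and max 1.0
```

Each value x becomes (x - min) / (max - min), computed per column, so both features end up on the same [0, 1] scale.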
- Standardization: Standardization is also a type of normalization; it ensures that the transformed data have mean = 0 and standard deviation = 1. It is a useful technique for transforming attributes with a Gaussian distribution but differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
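In scikit-learn this is `StandardScaler`; a minimal sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature column

# Subtract the column mean and divide by the column standard deviation
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0 and ~1.0
```

Unlike Min-Max scaling, standardization does not bound the values to a fixed range, which makes it less sensitive to outliers.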
- Imputing Missing Values: there might be cases where your dataset contains null values, yet those cells belong to important features, and leaving them in can distort your calculations. pandas provides the built-in dropna() function, which drops every row containing a null value in any column, but dropping rows discards data; scikit-learn's SimpleImputer can instead fill the nulls with a statistic such as the column mean or median.
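A minimal sketch of mean imputation with scikit-learn's `SimpleImputer`, on toy data with one missing value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])

# Replace each NaN with the mean of its column
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # the NaN becomes (1.0 + 7.0) / 2 = 4.0
```

Other strategies such as "median" or "most_frequent" may be more appropriate when the column is skewed or categorical.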
- Discretization: Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.
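scikit-learn implements this bucketing for numeric columns as `KBinsDiscretizer`; a minimal sketch on toy values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [5.0], [9.0], [10.0]])  # toy values

# Three equal-width ("uniform") buckets over [0, 10]; encode="ordinal"
# labels each value with its bucket index 0, 1, or 2
binned = KBinsDiscretizer(n_bins=3, encode="ordinal",
                          strategy="uniform").fit_transform(X)
print(binned.ravel())  # [0. 0. 1. 2. 2.]
```

The strategy="quantile" option instead puts roughly equal numbers of samples into each bucket, which is often a better choice for skewed data.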
We learned about Encoding, Normalization, Standardization, Imputing the Missing Values, and Discretization.
For Code: Click here