Data Science: 2. Data Preprocessing Using scikit-learn

Data preprocessing is the step in data mining and data analysis that transforms raw data into a format that computers and machine learning algorithms can understand and analyze.

Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it contain errors and inconsistencies, but it is often incomplete and lacks a regular, uniform structure.

There are many preprocessing methods, but we will mainly focus on the following:

(1) Encoding the Data

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Dataset Description:

I have used the ‘California Housing Prices’ dataset. It contains features such as longitude, latitude, proximity to the ocean, population, number of bedrooms, number of rooms, and house price, among others.

We will perform all of these preprocessing methods using scikit-learn, a library widely used for machine learning workloads. I have attached a Colab file for your reference. We will try to get insights into each of the methods listed above, so let's dive in.

  1. Normalization: Normalization scales the values of an attribute so that they fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When attributes have values on different scales, data mining operations may produce poor models, so the attributes are normalized to bring them all onto the same scale. There are three methods of data normalization: decimal scaling, min-max normalization, and z-score normalization (zero-mean normalization).
  2. Standardization: Standardization is a type of normalization (it corresponds to z-score normalization) that ensures the transformed data have mean = 0 and standard deviation = 1. It is a useful technique for transforming attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
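  3. Imputing Missing Values: Your dataset may contain null values in a feature that is still important, and these nulls can distort your calculations. One option is to drop the affected rows: pandas provides the built-in dropna() method, which by default drops every row that has a null in any of its columns. To fill in the missing values instead, scikit-learn provides SimpleImputer, which replaces them with a statistic such as the column mean or median.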
  4. Discretization: Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.
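
To make the four methods above a bit more concrete, here is a minimal sketch of how they might look in code. The rows are made-up toy values (not taken from the actual dataset), the column names are borrowed from the California Housing Prices data, and the choices of median imputation and three equal-width bins are assumptions for illustration only, not necessarily what the attached Colab file does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

# Hypothetical toy rows; only the column names follow the housing dataset.
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0],
    "population": [322.0, 2401.0, 496.0, 1206.0, 1551.0],
})

# Option A for missing values: drop every row that contains a null (pandas).
dropped = df.dropna()

# Option B: impute instead of dropping. Here the missing total_bedrooms
# value is replaced by the median of the observed values in that column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Min-max normalization: rescale each column to the range [0.0, 1.0].
normalized = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(imputed)

# Standardization (z-score): each column ends up with mean 0 and
# standard deviation 1.
standardized = StandardScaler().fit_transform(imputed)

# Discretization: put population into 3 equal-width, ordered buckets
# encoded as 0, 1, 2.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
population_bins = discretizer.fit_transform(imputed[["population"]])

print(imputed)
print(normalized.round(3))
print(standardized.round(3))
print(population_bins.ravel())
```

Note that the imputer runs before the scalers here, so the scaling statistics are computed on the filled-in column. On a real project you would normally fit these transformers on the training split only and then reuse them on the test split.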

We learned about encoding, normalization, standardization, imputing missing values, and discretization.

For Code: Click here


