# Data Science: 2. Data Preprocessing Using scikit-learn

1. Encoding: Encoding means converting information from one format to another. It is needed whenever the data contains categorical values, which usually appear as labels (e.g. yes, no, true, false). Most machine-learning algorithms work only with numbers and cannot use such labels as features directly, so each category has to be assigned a numerical representation; that process is called 'encoding'. Three common techniques are Label Encoding, Ordinal Encoding, and One-Hot Encoding.
2. Normalization: Normalization scales the values of an attribute so that they fall in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When attributes have values on very different scales, those with larger ranges can dominate distance-based calculations, which may lead to poor models in data mining operations, so the attributes are normalized to bring them onto the same scale. There are three methods of data normalization: 1. Decimal Scaling, 2. Min-Max Normalization, and 3. z-Score Normalization (zero-mean normalization).
3. Standardization: Standardization is a form of normalization (it corresponds to z-score normalization) that transforms the data to have mean = 0 and standard deviation = 1. It is useful for attributes with a Gaussian distribution and differing means and standard deviations, which it transforms to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
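4. Imputing Missing Values: A data set may contain null values in a feature that is still important, and those nulls can distort your calculations. The simplest fix is to drop the affected rows: pandas (not scikit-learn) provides dropna(), which by default drops every row that has a null in any column. To fill in the missing values instead, scikit-learn provides SimpleImputer, which replaces nulls with a statistic such as the column mean, median, or most frequent value.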
5. Discretization: Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.
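
As a minimal sketch of the encoding and scaling steps described above, the snippet below uses scikit-learn's `LabelEncoder`, `OrdinalEncoder`, `OneHotEncoder`, `MinMaxScaler`, and `StandardScaler`. The toy arrays (`answers`, `sizes`, `heights`) are made up for illustration and do not come from any real data set.

```python
import numpy as np
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical toy data, invented for this example.
answers = np.array(["yes", "no", "yes", "yes"])
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
heights = np.array([[150.0], [160.0], [170.0], [180.0], [190.0]])

# Label encoding: meant for a 1-D target; classes are sorted, so no -> 0, yes -> 1.
label_codes = LabelEncoder().fit_transform(answers)

# Ordinal encoding: pass the category order explicitly so the codes are meaningful.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(sizes)

# One-hot encoding: one binary column per category (the result is sparse by default).
onehot_matrix = OneHotEncoder().fit_transform(sizes).toarray()

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1].
minmax = MinMaxScaler().fit_transform(heights)

# Standardization (z-score): (x - mean) / std, giving mean 0 and std 1.
zscore = StandardScaler().fit_transform(heights)

print(label_codes, ordinal_codes.ravel(), onehot_matrix, sep="\n")
print(minmax.ravel(), zscore.ravel(), sep="\n")
```

Note that `OneHotEncoder` orders its columns alphabetically by category (large, medium, small here), whereas `OrdinalEncoder` follows the order passed in `categories`.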

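Handling missing values and discretization can be sketched in the same way. The example assumes a small hypothetical pandas `DataFrame` (`df`, with invented `age` and `income` columns); it contrasts dropping rows with pandas' `dropna()` against filling them with scikit-learn's `SimpleImputer`, then buckets the ages with `KBinsDiscretizer`. The choice of mean imputation and three equal-width bins is arbitrary here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy data with one missing age, invented for this example.
df = pd.DataFrame(
    {"age": [22.0, np.nan, 35.0, 58.0], "income": [20.0, 30.0, 40.0, 50.0]}
)

# Option 1 (pandas): drop every row that contains a null value.
dropped = df.dropna()

# Option 2 (scikit-learn): replace each null with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(df)  # returns a NumPy array by default

# Discretization: put the ages into 3 equal-width buckets, coded 0, 1, 2.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = binner.fit_transform(filled[:, [0]])

print(dropped)
print(filled)
print(age_bins.ravel())
```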