Dealing with Categorical Data in Python

Introduction

This article discusses how to deal with categorical data in Python. The features in a dataset are generally either numerical or categorical. Categorical features describe an observation with non-numerical values, so they need to be converted before the computer can process them. In particular, many machine learning algorithms require numerical input, so categorical features must be transformed into numerical features before any of these algorithms can be used.

Identifying Categorical Data

  1. Nominal
  2. Ordinal
  3. Continuous

Categorical features can only take on a limited, and usually fixed, number of possible values.

For example, if a dataset contains information about users, you will typically find features like country, sex, fruit_name, etc. These are all categorical features in your dataset, and they are stored as text values. For example, sex is described as Male (M) or Female (F), and product type could be described as electronics, apparel, food, etc.

Features like these, where the categories are only labels without any order of precedence, are called nominal features.

Features that have some order associated with them are called ordinal features. An example is a credit score feature with three categories (low, medium, and high) that have a natural order.
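Because the order is what distinguishes an ordinal feature, an encoding should preserve it. The following is a minimal sketch, assuming a made-up credit score column (the data and variable names are illustrative and do not come from the dataset used later). It uses an ordered pandas Categorical so that the integer codes follow low < medium < high.

```python
import pandas as pd

# Hypothetical credit-score values, made up for illustration.
scores = pd.Series(["low", "high", "medium", "low"])

# Declare the order explicitly so pandas does not fall back to alphabetical order.
levels = ["low", "medium", "high"]
ordered = pd.Categorical(scores, categories=levels, ordered=True)

# Integer codes follow the declared order: low=0, medium=1, high=2.
codes = pd.Series(ordered.codes, index=scores.index)
print(codes.tolist())  # prints [0, 2, 1, 0]
```

Without the explicit categories list, pandas would sort the labels alphabetically (high, low, medium), and the codes would no longer reflect the intended ranking.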

There are also continuous features. These are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or also a date/time.

Machine learning models such as regression or SVM (support vector machine) are algebraic, which means that their input must be numerical. Categories must therefore be transformed into numbers before you can apply the learning algorithm to them.

For the machine, categorical data doesn’t carry the context that humans can easily understand. For example, when looking at a feature called City with three cities (New York, New Jersey, and New Delhi), humans can easily see that New York is closely related to New Jersey, as they are in the same country, while New York and New Delhi are very different. But for the model, New York, New Jersey, and New Delhi are just three different levels (possible values) of the same feature City. If you don’t specify the additional contextual information, it will be impossible for the model to differentiate between highly different levels.

One of the most common ways to perform this numeric transformation is to one-hot encode the categorical features, especially when there is no natural ordering between the categories (e.g. a feature ‘City’ with names of cities such as ‘Mumbai’, ‘Nagpur’, ‘Delhi’, etc.). For each unique value of a feature (say, ‘Mumbai’), one column is created (say, ‘City_Mumbai’) where the value is 1 if the original feature takes that value for that instance and 0 otherwise.
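To make the mechanics concrete, here is a small hand-rolled sketch of the idea on toy data. The DataFrame is an assumption built from the city names in the paragraph above, and the loop is only meant to show what one-hot encoding does, not how a library implements it.

```python
import pandas as pd

# Toy data; the city names follow the example in the text.
df = pd.DataFrame({"City": ["Mumbai", "Nagpur", "Delhi", "Mumbai"]})

# Build one indicator column per unique value by hand:
# 1 where the row has that city, 0 otherwise.
for city in sorted(df["City"].unique()):
    df[f"City_{city}"] = (df["City"] == city).astype(int)

print(df.columns.tolist())
```

Each row ends up with exactly one 1 across the new City_ columns, which is where the name "one-hot" comes from.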

import pandas as pd
import numpy as np

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

df.dtypes

Output:

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

Since this article focuses only on encoding the categorical variables, we keep just the object columns. Pandas has a helpful select_dtypes function for this:
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

Output:

          make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
1  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
2  alfa-romero        gas         std        two    hatchback           rwd            front         ohcv            six         mpfi
3         audi        gas         std       four        sedan           fwd            front          ohc           four         mpfi
4         audi        gas         std       four        sedan           4wd            front          ohc           five         mpfi

This article covers four ways of dealing with categorical data in Python.

Method 1 – Find and Replace

There are two columns where the values are words that represent numbers: the number of cylinders in the engine and the number of doors on the car. Pandas makes it easy to replace the text values directly with their numeric equivalents by using replace.

The num_cylinders column only includes 7 distinct values, and the num_doors column only includes two or four doors.

obj_df["num_cylinders"].value_counts()

four      159
six        24
five       11
eight       5
two         4
twelve      1
three       1
Name: num_cylinders, dtype: int64

cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three": 3}}

To convert the columns to numbers using replace:

obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()


The benefit of this approach is that pandas “knows” the types of values in the columns, so the two columns that were object are now int64.

obj_df.dtypes

make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object
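One caveat worth keeping in mind: a plain int64 column cannot hold missing values, so the result above presumes the converted columns have none (recall that "?" entries become NaN on load). The sketch below is an illustration on toy data, not the article's code. It uses Series.map, a different call from replace, with the same kind of dictionary, and then converts to the nullable Int64 dtype. The toy series and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the num_doors column, including one missing value (assumed data).
doors_text = pd.Series(["two", "four", np.nan], name="num_doors")

# Series.map applies the dictionary element-wise; missing or unmapped entries become NaN.
doors = doors_text.map({"four": 4, "two": 2})

# The NaN forces a float result, so convert to the nullable integer dtype explicitly.
doors = doors.astype("Int64")
print(doors.dtype)  # prints Int64
```

An alternative is to fill the missing entries with a sensible default before converting, which keeps the column as a regular int64.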

Method 2 – Label Encoding

Another approach to encoding categorical values is a technique called label encoding. Label encoding simply converts each value in a column to a number.

One trick you can use in pandas is to convert a column to a category, then use those category values for your label encoding:

obj_df["body_style"] = obj_df["body_style"].astype("category")
obj_df.dtypes

Output:

make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

Then you can assign the encoded variable to a new column using the cat.codes accessor.

obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
obj_df.head()

The nice aspect of this approach is that you get the benefits of pandas categories (compact data size, ability to order, plotting support), while the values can easily be converted to numbers for further analysis.
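As a hedged illustration on toy data (the series below is made up, though its values are body styles mentioned in this article), the category dtype also keeps the mapping between codes and labels, so the encoding can be reversed:

```python
import pandas as pd

# Toy body-style values, assumed for illustration.
body = pd.Series(["sedan", "wagon", "convertible", "sedan"]).astype("category")

# By default the categories are sorted, so convertible=0, sedan=1, wagon=2.
codes = body.cat.codes
print(codes.tolist())  # prints [1, 2, 0, 1]

# Build a code -> label lookup from the stored categories and decode again.
lookup = dict(enumerate(body.cat.categories))
decoded = codes.map(lookup)
print(decoded.tolist())
```

Keeping a lookup like this around is useful when the numeric codes need to be translated back into readable labels after modelling.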

Method 3 – One Hot Encoding

Label encoding has the disadvantage that the numeric values can be “misinterpreted” by the algorithms. For example, the value of 0 is obviously less than the value of 4, but does that really correspond to the data set in real life? Does a wagon have “4X” more weight in our calculation than the convertible?

A common alternative approach is called one-hot encoding. The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column.

Pandas supports this feature using get_dummies.

Hopefully, a simple example will make this clearer. Consider the column drive_wheels, which has the values 4wd, fwd, or rwd. With get_dummies, we can convert it into three columns, each holding a 1 or 0.

pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

The new data set contains three new columns:

  • drive_wheels_4wd
  • drive_wheels_rwd
  • drive_wheels_fwd
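The same call can be tried on a tiny frame without downloading anything. This is a sketch on assumed toy data (the three rows are made up). Passing dtype=int asks for 1/0 values, since recent pandas versions return booleans from get_dummies by default.

```python
import pandas as pd

# Toy frame with one row per drive type (assumed data).
cars = pd.DataFrame({
    "make": ["audi", "bmw", "subaru"],
    "drive_wheels": ["fwd", "rwd", "4wd"],
})

# One indicator column is created per distinct value, prefixed with the column name.
encoded = pd.get_dummies(cars, columns=["drive_wheels"], dtype=int)
print(encoded.columns.tolist())
```

The original drive_wheels column is dropped from the result, and untouched columns such as make are carried over unchanged.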

One-hot encoding is very useful, but it can cause the number of columns to expand greatly if a column has many unique values.

Method 4 – Custom Binary Encoding

Depending on the data set, you can sometimes combine label encoding and one-hot encoding to create a binary column that meets your needs for further analysis.

obj_df["engine_type"].value_counts()

ohc      148
ohcf      15
ohcv      13
l         12
dohc      12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

Suppose that, for this analysis, all that matters is whether or not the engine is an overhead cam (OHC) design. In other words, the various versions of OHC are all the same for this analysis. If this is the case, then we could use the str accessor plus np.where to create a new column that indicates whether or not the car has an OHC engine.
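A minimal sketch of that idea on toy data follows. The column name OHC_Code and the sample values are assumptions chosen for illustration, and the match is a plain substring test for "ohc".

```python
import numpy as np
import pandas as pd

# Toy engine types drawn from the values listed above (assumed sample).
engines = pd.DataFrame({"engine_type": ["ohc", "dohc", "rotor", "l", "ohcv"]})

# str.contains flags every variant that includes "ohc"; np.where turns the flags into 1/0.
is_ohc = engines["engine_type"].str.contains("ohc", regex=False)
engines["OHC_Code"] = np.where(is_ohc, 1, 0)
print(engines["OHC_Code"].tolist())  # prints [1, 1, 0, 0, 1]
```

If the column can contain missing values, str.contains returns NaN for them, so its na argument (for example na=False) would be worth setting before passing the result to np.where.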

Conclusion

This article discussed how to deal with categorical data. Encoding categorical variables is an important step in the data science and data analysis process. Because there are a number of methods for encoding variables, it is important to understand the various options and how to implement them on your own data sets.
