Introduction To PyCaret
PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for seasoned data scientists who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for citizen data scientists and those new to data science with little or no background in coding. PyCaret allows you to go from preparing your data to deploying your model within seconds using your choice of notebook environment.
Installation of PyCaret
Installing PyCaret is the first step towards building your first machine learning model in PyCaret. Installation is easy and takes only a few minutes. All dependencies are also installed with PyCaret.
Run the below command to install PyCaret.
#installing for the first time
pip install pycaret
#if you have installed a beta version in the past, run the below command to upgrade
pip install --upgrade pycaret
#run the below code in your notebook to check the installed version
from pycaret.utils import version
version()
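If you would rather not import PyCaret just to confirm that it is installed, the standard library can report the installed version of any package. The sketch below is a minimal illustration, assuming Python 3.8 or newer (where importlib.metadata is available); the helper name installed_version is made up for this example and is not part of PyCaret.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pycaret")
    print(found if found is not None else "pycaret is not installed in this environment")
```

Returning None instead of raising makes the helper easy to use at the top of a notebook, before any heavier import is attempted.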
Note: To avoid potential conflicts with other packages, it is strongly recommended to use a virtual environment, e.g. a python3 virtualenv or a conda environment.
Create an Environment
#create a conda environment
conda create --name yourenvname python=3.6
#activate environment
conda activate yourenvname
#install pycaret
pip install pycaret
#create notebook kernel connected with the conda environment
python -m ipykernel install --user --name yourenvname --display-name "display-name"
PyCaret is a deployment-ready Python library, which means that as you perform an experiment, all steps are automatically saved in a pipeline that can be deployed into production with ease. PyCaret automatically orchestrates all dependencies in a pipeline. Once a pipeline is developed, it can be transferred to another environment to run at scale. The preprocessing steps it handles fall into the following groups:
- Sample and Splits
- Data Preparation
- Scale and Transform
- Feature Engineering
- Feature Selection
For more details: https://pycaret.org/preprocessing/
Welcome to the Binary Classification tutorial. This tutorial assumes that you are new to PyCaret and looking to get started with binary classification using the pycaret.classification module.
In this tutorial we will learn:
- Getting Data: How to import data from PyCaret repository
- Setting up Environment: How to set up an experiment in PyCaret and get started with building classification models
- Create Model: How to create a model, perform stratified cross-validation and evaluate classification metrics
- Tune Model: How to automatically tune the hyper-parameters of a classification model
- Plot Model: How to analyze model performance using various plots
- Finalize Model: How to finalize the best model at the end of the experiment
- Predict Model: How to make predictions on new/unseen data
- Save / Load Model: How to save/load a model for future use
Read Time: Approx. 30 Minutes
The first step to getting started with PyCaret is to install pycaret. Installation is easy and will only take a few minutes. Follow the instructions below:
Installing PyCaret in Local Jupyter Notebook
!pip install pycaret
Installing PyCaret on Google Colab or Azure Notebooks
!pip install pycaret
- Python 3.x
- The latest version of pycaret
- Internet connection to load data from pycaret’s repository
- Basic Knowledge of Binary Classification
For Google Colab users
If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.
from pycaret.utils import enable_colab
enable_colab()
Overview of the Classification Module in PyCaret
PyCaret’s classification module (pycaret.classification) is a supervised machine learning module used for classifying elements into groups based on various techniques and algorithms. Common use cases include predicting whether a customer is fraudulent (yes or no) and predicting a person’s sex (M or F).
The PyCaret classification module can be used for Binary or Multi-class classification problems. It has over 18 algorithms and 14 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling, or advanced techniques like stacking, PyCaret’s classification module has it all.
Dataset for the Tutorial
For this tutorial, we will use the Titanic dataset from the Kaggle competition “Titanic: Machine Learning from Disaster”.
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
Short descriptions of each column are as follows:
- survival: 0 = No and 1 = Yes
- pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- sex: Sex (male or female)
- age: Age in years
- sibsp: Number of siblings/spouses aboard the Titanic
- parch: Number of parents/children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children traveled only with a nanny, therefore parch=0 for them.
6.1 Import Required Library
from pycaret.datasets import get_data
import pandas as pd
import numpy as np
6.2 Read Dataset
titanic_df = pd.read_csv("train.csv")
titanic_df.head()
Setting up Environment in PyCaret
The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: a pandas data frame and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).
When setup() is executed, PyCaret’s inference algorithm automatically infers the data types of all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, press Enter to continue; otherwise, type quit to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret because it automatically performs a few pre-processing tasks that are imperative to any machine learning experiment. These tasks are performed differently for each data type, so it is very important that the types are correctly configured.
In later tutorials, we will learn how to overwrite PyCaret’s inferred data type using the numeric_features and categorical_features parameters in setup().
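To make the idea of type inference more concrete, here is a toy heuristic written in plain Python. It is only a sketch and not PyCaret's actual inference algorithm: the rule (numeric-looking columns with few distinct values are treated as categorical), the max_categories threshold, and the sample values are all assumptions made up for this illustration.

```python
def infer_column_type(values, max_categories=10):
    """Toy heuristic (not PyCaret's algorithm): classify a column as
    'Numeric' or 'Categorical' from its raw string values."""
    present = [v for v in values if v not in ("", None)]  # ignore missing entries
    distinct = set(present)
    try:
        for v in present:
            float(v)  # raises ValueError for non-numeric text
    except ValueError:
        return "Categorical"
    # Numeric-looking columns with few distinct values behave like categories.
    return "Numeric" if len(distinct) > max_categories else "Categorical"


# Hypothetical values, made up for illustration only.
columns = {
    "fare": ["7.25", "71.28", "8.05", "53.1", "8.46", "51.86",
             "21.07", "11.13", "30.07", "16.7", "26.55"],
    "pclass": ["3", "1", "3", "1", "3", "1", "3", "3", "2", "3", "1"],
    "embarked": ["S", "C", "S", "S", "Q", "S", "S", "S", "C", "S", ""],
}
inferred = {name: infer_column_type(vals) for name, vals in columns.items()}
print(inferred)
```

Any rule of this kind can guess wrong for a particular column, which is why the inferred types are shown for review and why the numeric_features and categorical_features parameters exist to override them.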
from pycaret.classification import *
test1 = setup(titanic_df, target = 'Survived')
Once setup() has run successfully, it prints an information grid. The majority of the entries in this grid are out of scope for this tutorial; however, a few important things to note at this stage include:
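0 session_id: 7238. A pseudo-random number used as a seed in all functions for later reproducibility. If no session_id is passed, a random number is generated automatically.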
1 Target Type: Binary. The Target type is automatically detected and shown. There is no difference in how the experiment is performed for Binary or Multiclass problems. All functionalities are identical.
2 Label Encoded: None. When the Target variable is of type string (i.e. ‘Yes’ or ‘No’) instead of 1 or 0, it automatically encodes the label into 1 and 0 and displays the mapping (0 : No, 1 : Yes) for reference. In this experiment, no label encoding is required since the target variable is of type numeric.
3 Original Data: (891, 12). Displays the original shape of the dataset. In this experiment (891, 12) means 891 samples and 12 features including the target column.
4 Missing Values: True. When there are missing values in the original data this will show as True. For this experiment, the value is True because the dataset contains missing values.
5 Numeric Features: 3. The number of features inferred as numeric. In this dataset, 3 out of 12 features are inferred as numeric.
6 Categorical Features: 8. The number of features inferred as categorical. In this dataset, 8 out of 12 features are inferred as categorical.
7 Transformed Train Set: (623, 1051). Displays the shape of the transformed training set.
8 Transformed Test Set: (268, 1051) Displays the shape of the transformed test/hold-out set. This split is based on the default value of 70/30 that can be changed using the train_size parameter in setup.
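The row counts in items 7 and 8 follow from the 70/30 default. The short sketch below reproduces them under the assumption that the training row count is rounded down and the remainder goes to the hold-out set, which is consistent with the shapes shown above. The helper name split_sizes is illustrative and not part of PyCaret.

```python
import math


def split_sizes(n_samples, train_size=0.7):
    """Row counts for a train/hold-out split, assuming the training
    count is rounded down and the remainder goes to the hold-out set."""
    n_train = math.floor(n_samples * train_size)
    return n_train, n_samples - n_train


train_rows, test_rows = split_sizes(891, train_size=0.7)
print(train_rows, test_rows)  # 623 268
```

Passing a different train_size to the helper shows how the two row counts would move if the train_size parameter in setup were changed.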
Comparing All Models:
Now compare all the machine learning models with a single line of code.
compare_models() #compare the classification models on performance measures
Two simple words of code (not even a line) have created over 15 models using 10-fold stratified cross-validation and evaluated the 6 most commonly used classification metrics (Accuracy, AUC, Recall, Precision, F1, Kappa). The score grid printed above highlights the highest performing metric for comparison purposes only. The grid by default is sorted using 'Accuracy' (highest to lowest), which can be changed by passing the sort parameter. For example, compare_models(sort = 'Recall') will sort the grid by Recall instead of Accuracy. If you want to change the number of folds from the default value of 10, you can use the fold parameter. For example, compare_models(fold = 5) will compare all models on 5-fold cross-validation. Reducing the number of folds will improve the training time.
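For readers unfamiliar with stratified cross-validation, the sketch below shows the core idea in plain Python: samples are assigned to folds so that each fold keeps roughly the same class balance as the whole dataset. It is a simplified illustration (no shuffling, made-up labels) and not the implementation PyCaret uses internally; the function name stratified_folds is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=10):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class proportions as the full label list."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    next_fold = 0
    for label in sorted(by_class):
        # Deal this class's samples out round-robin, continuing where the
        # previous class stopped so fold sizes stay balanced.
        for index in by_class[label]:
            folds[next_fold % n_folds].append(index)
            next_fold += 1
    return folds


# Hypothetical labels: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40
folds = stratified_folds(labels, n_folds=10)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

In k-fold cross-validation each fold serves once as the validation set while the remaining folds are used for training, and the reported score is the average across folds. With fewer folds there are fewer models to fit, which is why lowering the fold parameter shortens training time.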
Create a Model
While compare_models() is a powerful function and often a starting point in any experiment, it does not return any trained models. PyCaret’s recommended experiment workflow is to use compare_models() right after setup to evaluate top-performing models and finalize a few candidates for continued experimentation. As such, the function that actually allows you to create a model is unimaginatively called create_model(). This function creates a model and scores it using stratified cross-validation. Similar to compare_models(), the output prints a scoring grid that shows Accuracy, AUC, Recall, Precision, F1, and Kappa by fold.
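As a reminder of what the columns in the scoring grid measure, the following sketch computes five of the six metrics from scratch for binary labels. AUC is left out because it needs predicted probabilities rather than hard labels. The labels and predictions are made up for illustration, and the function classification_metrics is a hypothetical helper, not a PyCaret function.

```python
def classification_metrics(y_true, y_pred):
    """Compute Accuracy, Recall, Precision, F1 and Cohen's Kappa for binary
    labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    # Kappa compares observed agreement with the agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision,
            "F1": f1, "Kappa": kappa}


# Hypothetical labels and predictions for ten passengers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_metrics(y_true, y_pred)
print(scores)
```

Predicting 0 for every passenger in this toy example would still reach an Accuracy of 0.6 while Recall and Kappa both fall to 0, which is one reason to look at more than one column of the grid.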
For the remaining part of this tutorial, we will work with the below models as our candidate models.
- Extra Tree Classifier (‘et’)
- Decision Tree Classifier (‘dt’)
- K Neighbors Classifier (‘knn’)
- Random Forest Classifier (‘rf’)
8.1 XGBoost
xgboost_model = create_model('xgboost')
Extra Tree Classifier
extra_tree_model = create_model('et') #extra tree classifier
Tune a Model
When a model is created using the create_model() function it uses the default hyperparameters. In order to tune hyperparameters, the tune_model() function is used. This function automatically tunes the hyperparameters of a model on a pre-defined search space and scores it using stratified cross-validation. The output prints a scoring grid that shows Accuracy, AUC, Recall, Precision, F1, and Kappa by fold.
Note: tune_model() does not take a trained model object as an input. It instead requires a model name to be passed as an abbreviated string similar to how it is passed in the create_model(). All other functions in pycaret.classification require a trained model object as an argument.
tune_adaboost_model = tune_model('ada')
#tuned model object is stored in the variable 'tune_adaboost_model'.
tune_xgboost_model = tune_model('xgboost')
The tune_model() function performs a random grid search of hyperparameters over a pre-defined search space. By default, it is set to optimize Accuracy, but this can be changed using the optimize parameter. For example, tune_model('dt', optimize = 'AUC') will search for the hyperparameters of a Decision Tree Classifier that result in the highest AUC. In this example, we have used the default metric, Accuracy, for the sake of simplicity only. Generally, when the dataset is imbalanced, Accuracy is not a good metric to rely on.
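The idea of a random search over a pre-defined search space can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not PyCaret's tuning code: the search space, the toy_score function (a stand-in for a cross-validated metric), and the name random_search are all assumptions made up for this example.

```python
import random


def random_search(score_fn, search_space, n_iter=10, seed=0):
    """Sample n_iter random combinations from search_space and return the
    best-scoring combination together with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in search_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a decision tree.
search_space = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}


def toy_score(params):
    """Stand-in for a cross-validated metric; peaks at max_depth=6, min_samples_leaf=2."""
    return (1.0
            - 0.01 * abs(params["max_depth"] - 6)
            - 0.005 * abs(params["min_samples_leaf"] - 2))


best_params, best_score = random_search(toy_score, search_space, n_iter=10, seed=42)
print(best_params, best_score)
```

Swapping in a different scoring function is the analogue of changing the metric being optimized: the search procedure stays the same and only the quantity being maximized changes. Because only a sample of combinations is tried, the result is not guaranteed to be the best point in the space.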
Plot a Model
Before model finalization, the plot_model() function can be used to analyze the performance across different aspects such as AUC, confusion_matrix, decision boundary etc. This function takes a trained model object and returns a plot based on the test / hold-out set.
There are 15 different plots available, please see the plot_model() docstring for the list of available plots.
10.3 Feature Importance Plot
plot_model(xgboost_model, plot='feature')
10.4 Confusion Matrix
plot_model(xgboost_model, plot = 'confusion_matrix')
Another way to analyze the performance of models is to use the evaluate_model() function, which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.
evaluate_model(xgboost_model)
11.0 Interpret our Model
Interpreting complex models is very important in most machine learning projects. It helps in debugging the model by analyzing what the model thinks is important. In PyCaret, this step is as simple as writing interpret_model to get the Shapley values.
#create model
creat_xgboost_model = create_model('xgboost')
#summary plot
interpret_model(creat_xgboost_model)
12.0 Reason Plot
interpret_model(creat_xgboost_model, plot='reason', observation=0)
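To give some intuition for what a Shapley value is, the sketch below computes exact Shapley values for a single prediction of a tiny, made-up scoring rule. A feature's Shapley value is its average marginal contribution to the prediction over every order in which the features could be revealed. The toy_model, its feature names, and the baseline are hypothetical, and this brute-force approach is only feasible for a handful of features; it is not how interpret_model computes its plots.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction. Features missing from a
    coalition are replaced by their baseline value."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)  # start with every feature at its baseline
        previous = predict(current)
        for name in ordering:
            current[name] = instance[name]  # reveal this feature's real value
            updated = predict(current)
            totals[name] += updated - previous  # marginal contribution
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(features):
    """Hypothetical scoring rule, made up for illustration only."""
    score = 0.2 + 0.5 * features["is_female"] + 0.1 * features["first_class"]
    # An interaction term, so contributions are not simply additive.
    score += 0.1 * features["is_female"] * features["first_class"]
    return score


instance = {"is_female": 1, "first_class": 1, "is_child": 0}
baseline = {"is_female": 0, "first_class": 0, "is_child": 0}
contributions = shapley_values(toy_model, instance, baseline)
print(contributions)
```

The contributions always add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved. A feature the model ignores (is_child here) receives a contribution of zero.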
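The predict_model() function also works on unseen datasets passed through the data parameter. Here titanic_df is reused only to show the call.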
prediction = predict_model(xgboost_model, data=titanic_df)
Save All Experiment
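The outline at the top of this tutorial lists saving and loading a model as the final step, and PyCaret's classification module provides save_model() and load_model() for that purpose. As a rough illustration of the underlying idea (serialize a fitted object to disk, then restore it in a later session), here is a sketch using Python's pickle module. It is not PyCaret's own mechanism; the class TinyPipeline, the helper names, and the file name are all made up for this example.

```python
import os
import pickle
import tempfile


class TinyPipeline:
    """Hypothetical stand-in for a fitted preprocessing + model pipeline."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, fares):
        # Label a passenger 1 when the fare exceeds the stored threshold.
        return [1 if fare > self.threshold else 0 for fare in fares]


def save_object(obj, path):
    """Serialize obj to path."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Restore an object previously written by save_object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "titanic_pipeline.pkl")
    save_object(TinyPipeline(threshold=30.0), path)
    restored = load_object(path)

predictions = restored.predict([7.25, 71.28])
print(predictions)  # [0, 1]
```

Pickle files can execute arbitrary code when loaded, so only load files that come from a source you trust.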
In this article, we explored PyCaret, a new AutoML library, and saw how it trains and compares many algorithms with a single line of code. The library is well suited to senior data scientists: everything is automated, and the results are reported together with all of their parameters.