Let’s start with the big-picture steps:
- Defining Project Objectives
- Gathering Data
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Choosing a Model
- Training
- Evaluation
- Hyperparameter Tuning
- Interpret and Communicate
- Deployment and Documentation
Defining Project Objectives
Ask yourself: what problem are you trying to solve? Pin down the central objectives of your project by identifying what needs to be predicted and what your end goal is.
Gathering Data
The quality and quantity of the data you gather in this step will determine how well your model can perform.
Data can be collected in any format. Pre-processing steps later ensure data integrity before the data is exposed to a model.
- The more training examples you have, the better your final model is likely to be. Even a top-notch algorithm usually does better with more data. In the real world, we sometimes decide not to train on very large samples, purely to avoid the long training times and high compute costs they entail.
- Make sure the number of samples for each class or topic is not overly imbalanced. Sometimes this is not possible: in a fraud detection model, for example, fraud cases are rare, so the data is imbalanced by nature.
- Make sure that your samples adequately cover the space of possible inputs, not only the common cases.
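A quick way to check the balance point above is to count how often each label appears. The following is a minimal sketch using only the Python standard library; the `class_distribution` helper and the label values are hypothetical, made up for illustration.

```python
from collections import Counter

def class_distribution(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels for a fraud-detection dataset: 2 fraud cases out of 100.
labels = ["legit"] * 98 + ["fraud"] * 2
distribution = class_distribution(labels)

for label, share in distribution.items():
    print(f"{label}: {share:.1%}")
```

If one class turns out to be only a small share of the data, as "fraud" does here, that is worth knowing before you pick a model and an evaluation metric.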
Exploratory Data Analysis (EDA)
EDA is all about analyzing datasets to summarize their key characteristics.
- Data exploration can help uncover hidden patterns and anomalies in the training data.
- It plays a major role in checking assumptions and hypotheses with the help of summary statistics such as the mean, median, and standard deviation.
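As a small illustration of those summary statistics, here is a sketch using Python's built-in `statistics` module. The sample values are made up, and the "more than two standard deviations from the mean" rule is only an assumed rule of thumb for flagging anomalies, not a universal threshold.

```python
import statistics

# Hypothetical sample: transaction amounts with one suspiciously large value.
amounts = [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 250.0]

mean = statistics.mean(amounts)
median = statistics.median(amounts)
stdev = statistics.stdev(amounts)  # sample standard deviation

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f}")

# Assumed rule of thumb: flag values more than 2 standard deviations from the mean.
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
print("possible outliers:", outliers)
```

A mean that sits far above the median, as it does here, is a hint that the data is skewed or contains outliers worth a closer look.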
Data Cleaning
Real-world data is often messy. If you thought the data from common open-source repositories was messy, you are in for an even bigger mess. Don’t worry: you will get used to it, and there are plenty of established patterns and techniques for cleaning and formatting data.
Some possible challenges:
- Missing values
- Duplicate data
- Invalid data
- Inconsistent data formats
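The four challenges above can be handled in a single pass over the data. The sketch below is one possible approach, not a standard recipe: the `clean_records` helper, the `name` and `age` fields, and the plausible age range of 0 to 120 are all assumptions chosen for illustration.

```python
def clean_records(records):
    """Drop rows with missing or invalid values, normalize formats, remove duplicates."""
    cleaned = []
    seen = set()
    for record in records:
        name = record.get("name")
        age = record.get("age")

        # Missing values: skip rows where a required field is absent or empty.
        if name is None or age is None or str(name).strip() == "":
            continue

        # Inconsistent formats: normalize whitespace and capitalization,
        # and accept ages given as either numbers or numeric strings.
        name = str(name).strip().title()
        try:
            age = int(age)
        except (TypeError, ValueError):
            continue  # Invalid data: age is not a number.

        # Invalid data: ages outside an assumed plausible range of 0-120.
        if not 0 <= age <= 120:
            continue

        # Duplicate data: keep only the first occurrence of each (name, age) pair.
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned

raw = [
    {"name": "alice", "age": "34"},
    {"name": " Alice ", "age": 34},   # duplicate after normalization
    {"name": "bob", "age": None},     # missing value
    {"name": "carol", "age": -5},     # invalid value
    {"name": "dave", "age": "n/a"},   # invalid value
    {"name": "Erin", "age": 41},
]
cleaned = clean_records(raw)
print(cleaned)
```

Dropping bad rows is the simplest choice. Depending on your objectives, you might instead fill in missing values or correct invalid ones.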
Choosing a Model
Choosing the right model for your problem often requires trying several models to see what works best.
Model selection can be narrowed down by the type of machine learning that fits your data.
Each ML type has numerous algorithms to choose from. Your choice of model also depends on your objectives: for example, is it a classification problem or a regression problem?
- Supervised Learning if the source data is labeled, i.e. we know the target value for the training data
- Unsupervised Learning if the source data is unlabeled
- Reinforcement Learning if the model learns actions through reward and punishment
Training
- The input data is split into Training Data and Testing Data.
- Models are trained on the Training Data using different ML algorithms, each adjusting its parameters over multiple iterations.
- The Testing Data is set aside as unseen data to evaluate your model’s accuracy and precision.
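The split described above can be sketched in a few lines. This is a minimal standard-library version; the `train_test_split` helper name, the 80/20 ratio, and the fixed seed are assumptions for illustration, and ML libraries typically ship their own equivalents.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]

samples = list(range(100))  # stand-in for 100 real training examples
train, test = train_test_split(samples, test_fraction=0.2)
print(len(train), len(test))
```

Shuffling before splitting matters when the data arrives in some order, for example sorted by date or by class. Otherwise the Testing Data may not look like the Training Data.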
Evaluation
Once training is complete, it’s time to see whether the model is any good, using evaluation.
This is where the dataset we set aside earlier, the Testing Data, comes into play.
- Evaluation allows us to test our model against data that has never been used for training.
- The objective is to get an idea of how the model might perform against data that it has not yet seen.
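To make this concrete, here is a minimal sketch that computes accuracy and precision by comparing a model's predictions with the true labels of the Testing Data. The `evaluate` helper and the label lists are hypothetical. In practice `y_pred` would come from your trained model.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy and precision for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for truth, pred in pairs if truth == pred)
    true_pos = sum(1 for truth, pred in pairs if truth == positive and pred == positive)
    false_pos = sum(1 for truth, pred in pairs if truth != positive and pred == positive)

    # Accuracy: share of all predictions that were right.
    accuracy = correct / len(pairs)
    # Precision: share of positive predictions that were actually positive.
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return {"accuracy": accuracy, "precision": precision}

# Hypothetical true labels and model predictions for 8 unseen test examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Here 6 of the 8 predictions are correct, and 3 of the 4 positive predictions are true positives, so both metrics come out at 0.75.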
Hyperparameter Tuning
Every ML model comes with default values for its configuration. A hyperparameter is a configuration value that is set before training and controls the learning process, as opposed to the parameters the model learns from the data.
The time required to train and test a model can depend on the choice of its hyperparameters.
- The evaluation step was done with the model’s default hyperparameters. We can improve the model further by tuning the settings that were implicitly assumed during training. This process is called Hyperparameter Tuning.
- The tuned model is evaluated again, and this cycle continues until the best-performing model is chosen.
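The tune-and-evaluate cycle can be sketched as a simple search over candidate values. The example below uses a tiny k-nearest-neighbors classifier, chosen only because its hyperparameter `k` is easy to see. The model, the one-dimensional dataset, and the candidate values are all assumptions for illustration. It also assumes a separate validation split is used for tuning, so that the Testing Data stays unseen for the final evaluation.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train, validation, k):
    """Share of validation points that the k-NN model labels correctly."""
    correct = sum(1 for x, label in validation if knn_predict(train, x, k) == label)
    return correct / len(validation)

# Hypothetical (feature, label) pairs; the "b" at 2.5 is a noisy training point.
train = [(1.0, "a"), (2.0, "a"), (2.5, "b"), (3.0, "a"),
         (6.0, "b"), (7.0, "b"), (8.0, "b")]
validation = [(1.2, "a"), (2.4, "a"), (2.6, "a"), (6.5, "b"), (7.5, "b")]

# Try each candidate value of the hyperparameter k and record its score.
candidates = [1, 3, 5, 7]
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])  # first best value wins ties

for k in candidates:
    print(f"k={k}: accuracy={scores[k]:.2f}")
print("best k:", best_k)
```

In this toy data a very small `k` is thrown off by the noisy point, while a very large `k` just predicts the majority class. The search picks a value in between.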
Interpret and Communicate
One of the most challenging tasks in an ML project is explaining the model’s output.
The more interpretable your model is, the easier it is to communicate its value to stakeholders.
Deployment and Documentation
The trained model has to be deployed in a real-world system to be of any use. There are many open-source and vendor-supported options available.