Machine Learning From 10000 Feet

Let’s start with the big-picture steps:

Defining Project Objectives
Gathering Data
Exploratory Data Analysis (EDA)
Data Cleaning
Choosing a Model
Training
Evaluation
Hyperparameter Tuning
Interpret and Communicate
Deployment and Documentation

Defining Project Objectives

Ask yourself: what problem are you trying to solve? Pin down the central objectives of your project by identifying what needs to be predicted. What is your end goal?

Gathering Data

The quality and quantity of the data you gather in this step will determine how well your model can perform.
Data can be collected in any format. We will apply some pre-processing steps to ensure data integrity before exposing it to a model.

The more training examples you have, the better your final model is likely to be. Even a top-notch algorithm does better with more data. In the real world, we sometimes decide not to train on large samples, purely to avoid the long training times and heavy compute resources they require.
Make sure the number of samples for each class or topic is not overly imbalanced. Sometimes this is not possible: in a fraud detection model, for example, fraud cases are rare, so the data is imbalanced by nature.
Make sure that your samples adequately cover the space of possible inputs, not only the common cases.
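
A quick way to check class balance is to count the labels before training. The snippet below is a minimal sketch using only the Python standard library; the label names and the 98/2 split are made-up illustration data, not figures from a real dataset.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Return how many times larger the biggest class is than the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical fraud-detection labels: 2 fraud cases out of 100 transactions.
labels = ["legit"] * 98 + ["fraud"] * 2

shares = class_shares(labels)
ratio = imbalance_ratio(labels)
print(shares)  # {'legit': 0.98, 'fraud': 0.02}
print(ratio)   # 49.0
```

A ratio close to 1 means the classes are roughly balanced. A large ratio like the one above is a hint that plain accuracy will be a misleading metric later on.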

Exploratory Data Analysis (EDA)

EDA is all about analyzing datasets to summarize their key characteristics.

Data exploration can help to discover hidden patterns and anomalies in the training data.
It also plays a major role in checking assumptions and hypotheses with the help of summary statistics such as the mean, median, and standard deviation.
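
As a small illustration, the summary statistics mentioned above can be computed with Python's built-in statistics module. The numbers are invented sample values, and the "more than 2 standard deviations from the mean" rule is just one simple convention for flagging anomalies, chosen here as an assumption.

```python
import statistics

# Hypothetical feature column: mostly small values plus one suspicious entry.
amounts = [12.0, 15.0, 14.0, 13.0, 16.0, 14.0, 15.0, 13.0, 14.0, 250.0]

mean = statistics.mean(amounts)
median = statistics.median(amounts)
stdev = statistics.stdev(amounts)  # sample standard deviation

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f}")

# A large gap between mean and median already hints at skew or outliers.
# Flag values that lie more than 2 standard deviations from the mean.
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
print(outliers)  # [250.0]
```

Here the median stays near the typical values while the mean is pulled upward by a single entry, which is exactly the kind of anomaly EDA is meant to surface.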

Data Cleaning

Real-world data is often messy. If you thought the data you get from common open-source repositories is messy, you are in for a messy mess. There is nothing to worry about: you will get used to it, and there are a lot of existing patterns and techniques for cleaning and formatting data.

Some possible challenges:

Missing values
Duplicate data
Invalid data
Inconsistent data formats
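
To make these challenges concrete, here is a minimal cleaning sketch in plain Python. Everything in it is an assumption made for illustration: the field names, the valid age range, the two date formats, and the decision to drop problem rows (filling in missing values is a common alternative).

```python
from datetime import datetime

# Hypothetical raw records showing each of the four problems listed above.
raw_rows = [
    {"id": 1, "age": "34", "signup": "2023-01-15"},
    {"id": 1, "age": "34", "signup": "2023-01-15"},  # duplicate record
    {"id": 2, "age": None, "signup": "15/02/2023"},  # missing age
    {"id": 3, "age": "-5", "signup": "2023-03-01"},  # invalid age
    {"id": 4, "age": "41", "signup": "01/04/2023"},  # different date format
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # formats assumed to occur in the data

def parse_date(text):
    """Try each known format and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(rows):
    """Drop duplicate, incomplete, and invalid rows; normalize types and dates."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["id"] in seen_ids:   # duplicate (assumes id is a unique key)
            continue
        seen_ids.add(row["id"])
        if row["age"] is None:      # missing value
            continue
        age = int(row["age"])
        if not 0 <= age <= 120:     # invalid value (assumed valid range)
            continue
        signup = parse_date(row["signup"])
        if signup is None:          # unrecognized date format
            continue
        cleaned.append({"id": row["id"], "age": age, "signup": signup})
    return cleaned

cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Only the records with id 1 and id 4 survive, and both end up with an integer age and a date in the same ISO format.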

Choosing a Model

Choosing the right model for your problem will often require trials with various models to understand what works best.

Model selection can be further categorized by the type of machine learning that you want to try:

Supervised Learning if the source data is labeled, i.e., we know the target value for the training data.
Unsupervised Learning if the source data is unlabeled.
Reinforcement Learning if the model learns from actions that are rewarded or punished.

Each ML type has numerous algorithms to choose from. Your choice of model will also depend on your objectives: for example, is it a classification problem or a regression problem?

Training

The input data is split into Training Data and Testing Data.
Models are trained on the Training Data using different ML algorithms, each adjusting its parameters over multiple iterations.
The Testing Data is set aside as unseen data to evaluate your model’s accuracy and precision.
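
The split itself can be sketched in a few lines of plain Python. The helper name split_train_test, the 80/20 ratio, and the fixed seed are illustrative assumptions; ML libraries ship their own, more featureful versions of this step.

```python
import random

def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and hold out the first part as test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test

# Hypothetical dataset of 10 samples, identified here by their index.
data = list(range(10))
train, test = split_train_test(data)

print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters when the data arrives in some order (for example, sorted by date or by class), because otherwise the test set may not resemble the training set.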

Evaluation

Once training is complete, it’s time to evaluate whether the model is any good.
This is where the dataset that we set aside earlier, the Testing Data, comes into play.

Evaluation allows us to test our model against data that has never been used for training.
The objective is to get an idea of how the model might perform against data that it has not yet seen.
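
Accuracy and precision, the two metrics mentioned in the training step, can be computed by hand as shown below. The label lists are invented for illustration, with 1 standing for the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Fraction of positive predictions that are truly positive."""
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    predicted_positives = sum(p == positive for p in y_pred)
    if predicted_positives == 0:
        return 0.0  # convention used here when nothing was predicted positive
    return true_positives / predicted_positives

# Hypothetical test labels and the model's predictions for the same samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

print(accuracy(y_true, y_pred))   # 0.625
print(precision(y_true, y_pred))  # 0.6
```

The two numbers answer different questions: accuracy counts every correct prediction, while precision only asks how trustworthy the positive predictions are.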

Hyperparameter Tuning

Every ML model comes with default values for its configuration. A hyperparameter is a configuration parameter whose value controls the learning process; it is set before training rather than learned from the data.

The time required to train and test a model can depend upon the choice of its hyperparameters.

The evaluation step was done with the model’s default hyperparameters. We can improve the model further by adjusting the values that were implicitly assumed during training; this process is called Hyperparameter Tuning.
The tuned model is evaluated again, and the cycle continues until the best-performing model is chosen.
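
The tune-and-evaluate cycle can be sketched as a simple loop over candidate values. The example below uses a tiny hand-written k-nearest-neighbours classifier whose one hyperparameter is k. The data points, the candidate values, and the use of a separate validation split (so that the Testing Data stays unseen) are all assumptions made for this sketch.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def score(train, held_out, k):
    """Accuracy of the k-NN classifier on a held-out split."""
    hits = sum(knn_predict(train, x, k) == label for x, label in held_out)
    return hits / len(held_out)

# Hypothetical 1-D training data: class "a" sits low, class "b" sits high,
# plus two deliberately mislabeled (noisy) points at 1.8 and 4.2.
train = [
    (1.0, "a"), (1.5, "a"), (2.0, "a"), (2.5, "a"), (4.2, "a"),
    (3.5, "b"), (4.0, "b"), (4.5, "b"), (5.0, "b"), (1.8, "b"),
]
validation = [
    (1.7, "a"), (2.2, "a"), (1.2, "a"),
    (4.3, "b"), (3.8, "b"), (4.8, "b"),
]

# Try each candidate value, score it, and keep the best one.
candidates = [1, 3, 5]
scores = {k: score(train, validation, k) for k in candidates}
best_k = max(scores, key=scores.get)  # on a tie, the first candidate wins

print(scores)
print("best k:", best_k)
```

With k=1 the model copies the noisy points and gets some validation samples wrong; larger values of k smooth that noise out. The same loop shape applies to any model: pick candidate values, train, score, keep the best.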

Interpret and Communicate

One of the most challenging tasks in an ML project is explaining the model’s output.
The more interpretable your model is, the easier it is to communicate its importance to stakeholders.

Deployment and Documentation

The trained model has to be deployed in a real-world system for it to be of any use. There are many open-source and vendor-supported options available.
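
Whatever option you choose, deployment usually starts with saving the trained model so that another program can load it and make predictions. The snippet below is only a toy sketch of that idea: the "model" is a single made-up threshold stored as JSON, and the file name and field names are hypothetical. Real deployment tools have their own formats and workflows.

```python
import json
import os
import tempfile

# Hypothetical "trained model": one learned threshold on one feature.
model = {"feature": "amount", "threshold": 100.0}

def predict(model, row):
    """Flag a row when its feature value exceeds the learned threshold."""
    return row[model["feature"]] > model["threshold"]

# Training side: persist the model parameters to disk.
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump(model, f)

# Serving side: load the parameters and answer a prediction request.
with open(path) as f:
    served_model = json.load(f)

result = predict(served_model, {"amount": 250.0})
print(result)  # True
```

Documenting what the saved file contains (features, units, training date) is what lets someone else use the model safely later.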

Nijil Chandran