What is machine learning?

Machine learning is the study of algorithms, implemented as models, that learn from training data and make predictions on unseen data without being given explicit instructions.

Machine learning relies on data mining, mathematics, statistics and programming. Data mining is the process of finding patterns, relationships and trends in large datasets.

Machine learning is used for, among other things, spam detection, image classification, skin cancer detection, object detection in images, text categorization, language detection, text translation and the creation of art.


Alan Turing published “Computing Machinery and Intelligence” in 1950 and is often considered to be the father of artificial intelligence. Turing wanted to know whether machines can think, and he introduced what is now known as the Turing test. A machine is said to have passed the Turing test if a human who is chatting with it believes that it is a human.

The term “machine learning” was first introduced by Arthur Lee Samuel in 1959. Samuel's self-learning program for checkers was one of the first programs in the world that demonstrated artificial intelligence (AI).

Tom Michael Mitchell published the book Machine Learning in 1997, in which he offered a formal definition of machine learning: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E”.

Supervised and unsupervised learning

Machine learning is said to be supervised if the training dataset includes both input and output data. In supervised learning the training data includes the correct answers, which means that the dataset might need to be prepared by people with expertise in the relevant area. For example, if we want to detect skin cancer in images, then someone who knows how to detect skin cancer must annotate the dataset.

Unsupervised learning takes a dataset that consists only of input data and applies an algorithm to produce the output. Unsupervised learning algorithms try to find structure in the input data in order to produce groups or clusters. We can, for example, use unsupervised learning to sort news articles into groups (clusters) with similar articles in each group.
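The difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the numbers, the labels "small" and "large", and the function names fit_centroids, predict and two_means are made up for this example. The supervised part uses a simple nearest-centroid rule and the unsupervised part uses a tiny one-dimensional k-means with two clusters; both are stand-ins chosen for brevity, not methods prescribed by this article.

```python
# Toy input values (invented for this example).
inputs = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]

# Supervised learning: every input comes with a correct answer (label).
labels = ["small", "small", "small", "large", "large", "large"]


def fit_centroids(xs, ys):
    """Compute the mean input value for each label."""
    sums, counts = {}, {}
    for x, label in zip(xs, ys):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Unsupervised learning: only inputs, the algorithm finds the groups itself.
def two_means(xs, iterations=10):
    """Tiny 1-D k-means with k=2, started from the smallest and largest value."""
    centers = [min(xs), max(xs)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


centroids = fit_centroids(inputs, labels)
print(predict(centroids, 3.0))   # label predicted for an unseen input
print(two_means(inputs))         # clusters found without any labels
```

Note that the clusters returned by two_means have no names. Unsupervised learning finds the groups, but a person still has to decide what each group means.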

Classification and regression

Classification problems deal with categorical output. Categorical data can be nominal or ordinal. Categorical data can be represented by numbers, but arithmetic on these numbers is not meaningful. Each element in a dataset must be classified into exactly one category (the categories are mutually exclusive and exhaustive).

Nominal data is the lowest level of data: observations can only be sorted into categories and counted. Examples of nominal variables are color, gender and brand.

Ordinal data can be classified and ranked, so one can say that one category is better than another. We cannot, however, say how much better one category is than another. School grades are an example of ordinal data.

Classification problems with only two categories are called binary problems, and problems with more than two categories are called multi-class problems. Examples of classification problems are categorizing articles, detecting spam and classifying images.

Regression problems deal with numerical output. Numerical data can be interval data or ratio data. Numerical variables can be discrete or continuous: discrete variables can only assume certain clearly separated values (such as integers), while continuous variables can assume any value within a range.

Interval data is the second highest data level. We can rank interval data and it is possible to measure the difference between two values. An example of interval data is temperature measured in degrees Celsius or Fahrenheit. The zero point of an interval scale is arbitrary, so a ratio between two interval data values does not provide any useful information.

Ratio data is the highest level of data. Ratio data can be ranked, the difference between two values can be measured, and it is meaningful to calculate the ratio between two values because the scale has a true zero. Examples of ratio data are salaries, turnover, weight and height.
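A short calculation shows why ratios are meaningful for ratio data but not for interval data. The sketch below uses made-up values (10 and 20 degrees Celsius, 2 and 4 kilograms) and a helper named celsius_to_kelvin that exists only for this example. The Kelvin scale has a true zero and is therefore a ratio scale, while Celsius is an interval scale.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    return celsius + 273.15


# Interval data: the ratio depends on where the scale puts its zero.
ratio_celsius = 20.0 / 10.0                                        # 2.0
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)   # roughly 1.035

# The difference, however, is the same on both scales (10 degrees).
difference_celsius = 20.0 - 10.0
difference_kelvin = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

# Ratio data: the ratio survives a change of unit.
ratio_kg = 4.0 / 2.0            # 4 kg compared to 2 kg
ratio_g = 4000.0 / 2000.0       # the same weights expressed in grams

print(ratio_celsius, ratio_kelvin)
print(difference_celsius, difference_kelvin)
print(ratio_kg, ratio_g)
```

20 degrees Celsius is therefore not "twice as warm" as 10 degrees Celsius, but 4 kg is twice as heavy as 2 kg regardless of the unit.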

The term regression comes from regression toward the mean: very tall parents tend to have children who are shorter than they are, and very short parents tend to have children who are taller than they are. Regression analysis means that we want to find the function that best fits the data, so that an output can be predicted from the inputs. Examples of regression problems are predicting stock prices, predicting product prices based on attributes and predicting the temperature.
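Finding "the function that best fits the data" can be illustrated with the simplest case, a straight line fitted with ordinary least squares. This is a minimal sketch: the data points and the function name fit_line are invented for the example, and least squares is only one of many ways to define the best fit.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept   # predict the output for an unseen input

print(slope, intercept, prediction)    # 2.0 1.0 11.0
```

Real data never lies exactly on a line, so the fitted line will usually miss every point a little while keeping the total squared error as small as possible.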

Some machine learning problems require both classification and regression. In object detection, a model must predict what an object is (for example a cat), which is classification, and where it is (the coordinates of a bounding box), which is regression.

Overfitting and underfitting

The goal of a machine learning algorithm is to create a model that makes predictions on unseen data that are as accurate as possible. Generalization refers to a model's ability to adapt to new data.

A model is trained on a training dataset that it has seen, and its ability to generalize suffers if the model is underfitted or overfitted. Underfitting means that a model is too simple and therefore unable to learn well from the training data. Overfitting means that the model has an almost perfect fit on the training data but performs badly on unseen data: the model is too complex and not good at generalization.

Underfitting is best prevented by making the model more complex (updating the input data and changing parameters) and evaluating the performance on the training and validation sets. Overfitting is best prevented by using validation, which reveals when a model fits the training data much better than unseen data. Validation can be implemented by splitting the training set into a training set and a validation set, or by using cross-validation. Cross-validation is powerful and means that you don’t have to set aside any training data purely for validation. Cross-validation can be time-consuming, however, and a train/validation split can be better if you have a lot of training data.
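The following sketch shows how a validation set exposes both problems. Everything in it is an assumption made for illustration: the toy data follows y = 2x with artificial noise of +1 or -1 on the training points, a constant mean predictor stands in for a model that is too simple, and a one-nearest-neighbour "memorizer" stands in for a model that is too complex.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """Too simple: always predicts the mean of the training outputs."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line_model(xs, ys):
    """Reasonable: a straight line fitted with least squares."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Too complex: returns the output of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Training data: y = 2x with alternating noise of +1 and -1 (invented).
xs_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys_train = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
# Validation data: unseen inputs with noise-free outputs y = 2x.
xs_val = [0.4, 1.4, 2.4, 3.4, 4.4]
ys_val = [0.8, 2.8, 4.8, 6.8, 8.8]

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line_model),
                  ("memorizer", fit_memorizer)]:
    model = fit(xs_train, ys_train)
    results[name] = (mse(model, xs_train, ys_train), mse(model, xs_val, ys_val))
    print(name, "train error:", results[name][0], "validation error:", results[name][1])
```

The mean predictor has a high error on both sets (underfitting). The memorizer has zero error on the training set but a clearly higher error on the validation set than the line (overfitting). Only the comparison with the validation error reveals that the line is the better model.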

Dataset splits

A dataset is usually divided into a training dataset and a test dataset. A test dataset must never be used in training; it is used to compare the performance of different models. In competitions, a test dataset is used to declare the winner.

A common split ratio is 80/20: 80 % of the dataset is used for training and 20 % of the dataset is used for testing. The same ratio can be used if you want to split the training dataset into a training set and a validation set. A train/validation split is not necessary if you are able and willing to use cross-validation.
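An 80/20 split can be sketched with the Python standard library alone. The function name split_dataset, the seed value and the dataset of 100 integers are assumptions made for this example. The data is shuffled before splitting so that the order of the original dataset does not influence which elements end up in which part.

```python
import random


def split_dataset(items, holdout_ratio=0.2, seed=0):
    """Shuffle a copy of items and split it into (remaining, holdout)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    holdout_size = round(len(shuffled) * holdout_ratio)
    return shuffled[holdout_size:], shuffled[:holdout_size]


dataset = list(range(100))   # stand-in for 100 samples

# First split: 80 % for training, 20 % for testing.
train_full, test = split_dataset(dataset)

# Optional second split: the same ratio applied to the training data.
train, validation = split_dataset(train_full)

print(len(train_full), len(test))    # 80 20
print(len(train), len(validation))   # 64 16
```

Applying the ratio twice leaves 64 % of the original dataset for training, 16 % for validation and 20 % for testing.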

Cross-validation means that the training dataset is divided into k equally sized groups. Validation is performed on one group and training is performed on the remaining k-1 groups. It is an iterative process that is repeated k times, so that every group is used as the validation set exactly once.
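The rotation of groups described above can be sketched as a small generator that yields the sample indices used for training and validation in each round. The name k_fold_indices is hypothetical, and the handling of datasets that cannot be divided evenly (the first groups get one extra sample) is an assumption of this sketch.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k rounds."""
    # Group sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(10, 5))
for round_number, (training, validation) in enumerate(folds, start=1):
    print(round_number, "train on", training, "validate on", validation)
```

In practice a model is trained and evaluated once per round, and the k validation scores are averaged into one estimate of how well the model generalizes. Shuffling the data before it is divided into groups is usually a good idea when the dataset is ordered.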

