Decision Tree Learning

Mohammad Mahdi Hosseini
5 min read · Oct 27, 2020

Decision tree learning is a family of machine learning algorithms that is especially useful in data mining. It is a supervised learning method that is commonly built around a quantity called entropy (entropy is not the only splitting criterion used in decision tree learning; other functions exist as well). Decision tree learning can handle both continuous and discrete targets: the trees built for continuous and discrete data are called regression trees and classification trees, respectively.

Entropy:

In our case, entropy is a measure of randomness in a set. Let's see what entropy means in practice. Imagine three sets of black and white balls: the first set consists of four black balls, the second has three black balls and one white ball, and the third has two black and two white balls.

Now let's play a game. From each set we randomly pick a ball, note its color, and put it back, repeating this four times (because we replace the ball, the draws are independent). We win with a set if the four draws reproduce that set's own composition in a fixed order: four black balls for the first set, three black and then one white for the second, and two black followed by two white for the third.

In our first set, all four balls are black, so the probability of drawing a black ball on every draw is 1.

In the second set, with three black balls and one white ball, the probability of drawing a black ball is 0.75 and the probability of drawing a white one is 0.25, so the probability of getting this result is 0.75 * 0.75 * 0.75 * 0.25 ≈ 0.105.

In the last set, with two black and two white balls, the probability of drawing a black ball on any draw is 0.5, and the same holds for a white ball, so the probability of getting this result is 0.5 * 0.5 * 0.5 * 0.5 = 0.0625.

Note that the probability of drawing each color stays the same on every draw because we put the ball back into the set after seeing its color (the draws are independent; two events a and b are independent if and only if P(a|b) = P(a) and P(b|a) = P(b)).

We calculated the probability of winning for each set using multiplication, but products are awkward to work with: a small change in one factor can alter the product dramatically, and with many more balls (say 1,000) the numbers become hard to manage. How can we turn the product into a sum? The natural tool is the logarithm, since log(ab) = log(a) + log(b) (we use base 2, written log2(x) or lg(x)). Moreover, since all of our probabilities lie between 0 and 1, their logarithms are negative (0 < p(x) < 1 implies log2(p(x)) < 0), so we use -log2(x) to get a positive value.
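As a quick sanity check, here is a small Python sketch (the names and structure are just for illustration) that computes each set's winning probability as a product and then expresses the same quantity as a sum of -log2 terms:

```python
import math

# Each winning draw, with replacement, listed as the probability of every pick.
winning_draws = {
    "4 black":          [1.0, 1.0, 1.0, 1.0],
    "3 black, 1 white": [0.75, 0.75, 0.75, 0.25],
    "2 black, 2 white": [0.5, 0.5, 0.5, 0.5],
}

for name, probs in winning_draws.items():
    p_win = math.prod(probs)                         # probability of winning (product)
    neg_log_sum = sum(-math.log2(p) for p in probs)  # the same information, as a sum
    print(f"{name}: P(win) = {p_win:.4f}, sum of -log2 = {neg_log_sum:.3f}")
```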

To calculate the entropy, we just take the average of these -log2 values. Doing so for the three sets gives entropies of 0, about 0.811, and 1, respectively.

So, as we can see, as the probability of winning decreases, the entropy increases. Also, as the entropy of a set increases, the number of possible arrangements of balls in that set increases as well.

Now let's derive a formula for entropy. In our black-and-white example, with n black balls and m white balls, we get:

Entropy = -(n / (n + m)) * log2(n / (n + m)) - (m / (n + m)) * log2(m / (n + m))

And in general, if we have k outcomes with probabilities p1, p2, p3, ..., pk, we have:

Entropy = -(p1 * log2(p1) + p2 * log2(p2) + ... + pk * log2(pk))
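Here is a minimal sketch of this formula in Python, assuming the set is described by how many balls of each color it contains (the function name and the counts are illustrative):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a set described by how many items fall in each class."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:                      # empty classes contribute nothing
            p = c / total
            result -= p * math.log2(p)
    return result

print(entropy([4, 0]))   # 4 black, 0 white -> 0.0
print(entropy([3, 1]))   # 3 black, 1 white -> ~0.811
print(entropy([2, 2]))   # 2 black, 2 white -> 1.0
```

These values match what we obtained above by averaging the -log2 terms of the four draws.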

Now let's find a relation between entropy and the amount of information obtained from a set. In the figure, the data points of one set are split in three different ways. Which split do you think gives us the most information?

The first split gives us the least information and the third gives us the most. If we calculate the entropy of the child nodes, we see that as their entropy increases, the amount of information obtained decreases.

The general formula for the information gained from a split is:

Information Gain = Entropy of Parent - Weighted Sum of Children's Entropies
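A small sketch of this formula, again with illustrative counts (each pair of numbers is a black/white count):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a set described by per-class counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted sum of the children's entropies."""
    total = sum(parent_counts)
    weighted = sum((sum(child) / total) * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Splitting 8 points (4 black, 4 white) into two pure halves gains a full bit:
print(information_gain([4, 4], [[4, 0], [0, 4]]))   # 1.0
# A split whose children are as mixed as the parent gains nothing:
print(information_gain([4, 4], [[2, 2], [2, 2]]))   # 0.0
```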

Now that we know what entropy is, let’s classify data based on this concept.

In the diagram below, we need to divide the data into regions where as many points as possible share the same color.

First, we split the data on their x values. To do so, we need to find a threshold on the x-axis that gives us the minimum entropy.

If we calculate the entropy for different x thresholds, x = 5 gives the minimum. Now we repeat the same process on the y-axis, but this time we need two y thresholds: one for the points with x less than 5 and one for the points with x greater than 5. Doing so gives y = 8 and y = 4 as the thresholds with minimum entropy for x < 5 and x > 5, respectively.
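One simple way to find such thresholds is a brute-force search over candidate split points, keeping the one with the lowest weighted entropy. The sketch below uses a small hypothetical dataset, since the exact coordinates from the figure are not reproduced here; the points are laid out so that the search recovers the same thresholds mentioned above:

```python
import math

def label_entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (labels.count(c) / total) * math.log2(labels.count(c) / total)
        for c in set(labels)
    )

def best_split(points, axis):
    """Brute-force search for the threshold on the given axis (0 = x, 1 = y)
    that minimizes the weighted entropy of the two resulting groups."""
    values = sorted({p[axis] for p in points})
    best_t, best_h = None, float("inf")
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2                  # candidate threshold between neighboring values
        left = [p[2] for p in points if p[axis] < t]
        right = [p[2] for p in points if p[axis] >= t]
        h = (len(left) * label_entropy(left)
             + len(right) * label_entropy(right)) / len(points)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

# Hypothetical (x, y, color) points, arranged to match the splits described above.
points = [
    (1, 2, "red"), (2, 5, "red"), (3, 3, "red"), (3.5, 7, "red"),
    (4, 4, "red"), (2.5, 9, "blue"),                    # left half: mostly red
    (6, 7, "blue"), (7, 5, "blue"), (8, 9, "blue"),
    (8.5, 6, "blue"), (9, 8, "blue"), (7.5, 3, "red"),  # right half: mostly blue
]

x_split, _ = best_split(points, axis=0)
left = [p for p in points if p[0] < x_split]
right = [p for p in points if p[0] >= x_split]
print(x_split)                        # 5.0
print(best_split(left, axis=1)[0])    # 8.0  (split for x < 5)
print(best_split(right, axis=1)[0])   # 4.0  (split for x > 5)
```

In a real implementation we would compare candidate splits across both axes at every node rather than fixing the order x-then-y, but the fixed order keeps the example close to the walkthrough above.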

Now we have a decision tree which works based on entropy as shown below:
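Written out as nested rules, the tree is just two levels of threshold tests. The leaf colors below follow the hypothetical dataset from the previous sketch:

```python
def predict(x, y):
    # First split on x = 5, then on y = 8 or y = 4, exactly as derived above.
    if x < 5:
        return "blue" if y >= 8 else "red"   # leaf labels assume the sample data
    return "red" if y < 4 else "blue"

print(predict(2, 9))   # blue
print(predict(7, 6))   # blue
print(predict(8, 2))   # red
```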

Key takeaways:

· What decision tree learning is

· Different types of decision tree learning

· Entropy

· Entropy calculation with an example
