The decision tree is one of the most widely used machine learning algorithms due to its ease of interpretation. However, do you know how it works? In this post I am going to explain everything that you need to know about decision trees. To do this, we are going to create our own decision tree in Python from scratch. Does it sound interesting? Let's get to it!

Understanding how a decision tree works

A decision tree consists of creating different rules by which we make the prediction. For example, let's say we train an algorithm that predicts whether or not a person is obese based on their height and weight. To do this, we will use the following dataset. Imagine that we want to predict whether or not the person is obese. Based on the description of the dataset (available on Kaggle), people with an Index of 4 or 5 are obese, so we could create a variable that reflects this:

data['obese'] = (data.Index >= 4).astype('int')
data.drop('Index', axis = 1, inplace = True)

In that case, a decision tree would tell us different rules, such as that if the person's weight is greater than 100 kg, it is most likely that the person is obese. However, that cut will not be precise: there will be people who weigh 100 kg or more who are not obese. Thus, the decision tree continues to create more branches that generate new conditions to "refine" our predictions. Let's see a graphic example:

As you can see, decision trees usually have sub-trees that serve to fine-tune the prediction of the previous node. This is so until we get to a node that does not split. This last node is known as a leaf node.

Besides, decision trees can work both for regression problems and for classification problems. In fact, we will code a decision tree from scratch that can do both.

Now you know the basics of this algorithm, but surely you have doubts. How does the algorithm decide which variable to use as the first cut? How does it choose the values? Let's see it little by little, programming our own decision tree from scratch in Python.

Impurity and cost functions of a decision tree

As in all algorithms, the cost function is the basis of the algorithm. In the case of decision trees, there are two main cost functions: the Gini index and entropy. Both of them are based on measuring impurity. Impurity refers to how likely it is that the target variable will be classified incorrectly when we make a cut. In the example above, impurity includes the percentage of people who weigh 100 kg or more that are not obese, and the percentage of people who weigh less than 100 kg that are obese:

print("Misclassified when cutting at 100kg:",
      data.loc[(data['Weight'] >= 100) & (data['obese'] == 0), :].shape[0], "\n",
      "Misclassified when cutting at 80kg:",
      data.loc[(data['Weight'] >= 80) & (data['obese'] == 0), :].shape[0])

Misclassified when cutting at 100kg: 18

In short, the cost function of a decision tree seeks to find those cuts that minimize impurity. Now, let's see what ways exist to calculate impurity:

Calculate impurity using the Gini index

The Gini index is the most widely used cost function in decision trees. This index calculates the probability that a specific characteristic will be classified incorrectly when it is randomly selected. It is an index that ranges from 0 (a pure cut) to 0.5 (a completely impure cut that divides the data equally), and it is defined by the following formula:

Gini = 1 - Σ (P_i)^2

where P_i is the probability of having that class or value. Let's program the function, considering that the input will be a Pandas Series:

import numpy as np

def gini_impurity(y):
    '''
    Given a Pandas Series, it calculates the Gini Impurity.
    y: variable with which to calculate the Gini Impurity.
    '''
    p = y.value_counts() / y.shape[0]
    gini = 1 - np.sum(p ** 2)
    return gini

gini_impurity(data.Gender)

As we can see, the Gini index for the Gender variable is very close to 0.5. This indicates that the Gender variable is very impure, that is, both groups produced by the cut have roughly the same proportion of incorrectly classified data. Now that you know how the Gini index works, let's see how entropy works.

Calculate impurity with entropy

Entropy is a way of measuring impurity or randomness in data points. It is defined by the following formula:

Entropy = -Σ P_i · log2(P_i)

Unlike the Gini index, whose range goes from 0 to 0.5, the entropy range is different, since it goes from 0 to 1. In this way, values close to zero are less impure than those that approach 1. Let's see how entropy works by calculating it for the same example that we have done with the Gini index:

def entropy(y):
    '''
    Given a Pandas Series, it calculates the entropy.
    y: variable with which to calculate entropy.
    '''
    p = y.value_counts() / y.shape[0]
    return np.sum(-p * np.log2(p))

entropy(data.Gender)

As we see, it gives us a value very close to 1, which denotes an impurity similar to the one indicated by the Gini impurity, whose value was close to 0.5. With this, you already know the two main methods that can be used in a decision tree to calculate impurity.
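To see how the two impurity measures behave, here is a minimal, self-contained sketch that computes both on invented example data (the small Series below are for illustration only; they are not the Kaggle dataset used in this post):

```python
import numpy as np
import pandas as pd

def gini_impurity(y):
    # Probability of each class in the Series.
    p = y.value_counts() / y.shape[0]
    return 1 - np.sum(p ** 2)

def entropy(y):
    # Probability of each class in the Series.
    p = y.value_counts() / y.shape[0]
    return np.sum(-p * np.log2(p))

# A perfectly balanced binary variable: maximum impurity for two classes.
balanced = pd.Series(["Male", "Female"] * 50)
gini_balanced = gini_impurity(balanced)      # 0.5, the Gini maximum
entropy_balanced = entropy(balanced)         # 1.0, the entropy maximum

# A skewed variable (90% one class): much lower impurity under both measures.
skewed = pd.Series([1] * 90 + [0] * 10)
gini_skewed = gini_impurity(skewed)          # 1 - (0.9^2 + 0.1^2) = 0.18
entropy_skewed = entropy(skewed)             # about 0.469

print(gini_balanced, entropy_balanced)
print(gini_skewed, entropy_skewed)
```

Note how both measures agree on the ranking (the balanced variable is maximally impure), even though their scales differ: the Gini maximum is 0.5 while the entropy maximum is 1.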