Parallel Algorithms for Decision Trees Induction
Nuno Reis Amado
Mestrado em Inteligência Artificial e Computação
Universidade do Porto
January 2002
Abstract
In the field of Data Mining and Machine Learning the amount of data
available for building classification procedures is growing very
fast. For the construction of classification procedures, the use of
larger amounts of data increases the probability of achieving
procedures with higher classification accuracy. However, it also
implies higher costs in time and complexity for building those
procedures. Therefore, there is a great need for algorithms capable of
building classifiers from larger datasets and, simultaneously, being
computationally efficient and scalable. One possible solution for
reducing the amount of time used in the construction of classification
procedures from very large datasets, and, by this, keeping their
classification accuracy, is the use of parallelism.
In this work we only considered those classification procedures based
upon decision trees since they have been extensively used in many
fields and considered as the most suitable models for data mining and
machine learning. We present a few strategies used in the
implementation of parallel decision tree induction algorithms based on
techniques such as task parallelism, data parallelism and hybrid
parallelism.
Based upon the C4.5 decision tree induction algorithm and on some of
the introduced strategies and techniques, we developed a new parallel
decision tree induction algorithm, implemented it using MPI and
analysed its performance on a distributed memory cluster.