The choice of algorithm must be geared towards the type of application under consideration. It is no longer enough to show that one algorithm is more accurate than another; an algorithm's overall performance must be shown across several criteria, such as accuracy, memory efficiency, and retrieval time. Many system designers and researchers are concerned only with obtaining highly precise algorithms via sophisticated modeling and analysis. This can become a hindrance and a bottleneck in many deployable applications, where such precision comes at the cost of computationally expensive retrieval.

 

One algorithm outperforming another on a given dataset may be largely irrelevant to its use in a different application, where the data and the user demands are not comparable.

 

For simplicity, and to give a general framework for when to choose a given algorithm, the algorithms are divided into four main groups:

1)      Symbolic/logic learning methods involve the divide-and-conquer paradigm that generates decision trees (e.g., ID3, C4.5), or the covering paradigm that induces a set of decision rules.

 

2)      Case-based methods extract representative examples from the dataset to approximate the knowledge hidden in an information repository. These techniques include popular algorithms like k-Nearest Neighbor methods and case-based reasoning.

3)      Statistical methods encompass various techniques, including discriminant-function (parametric) methods, non-parametric methods (which make no assumptions about the underlying distribution), linear and nonlinear regression, Bayesian methods, cluster methods, etc.

4)      Biological methods encompass neural networks and genetic algorithms. Neural networks are applied to numerical data and include various topologies and a wide variety of learning techniques. Genetic algorithms emulate biological evolution and are used for optimization, much as simulated annealing is.

 

It may seem obvious that each group has its own forte and appropriate use. For example, statistical and case-based algorithms can attain very high prediction accuracy but suffer from long retrieval times. Rule-based systems and decision trees are very efficient at retrieval but weaker in prediction accuracy. Biological algorithms are time-efficient but only marginal in prediction.

 

Strengths and Weaknesses of Symbolic Learning Methods

 

The strengths of decision tree methods are:

· Decision trees are able to generate understandable rules.

· Decision trees perform classification without requiring much computation.

· Decision trees are able to handle both continuous and categorical variables.

· Decision trees provide a clear indication of which fields are most important for prediction or classification.

 

Ability to Generate Understandable Rules. The ability of decision trees to generate rules that can be translated into comprehensible English or SQL is the greatest strength of this technique. Even when a complex domain, or a domain that does not decompose easily into categories, causes the decision tree to be large and complex, it is generally fairly easy to follow any one path through the tree, so the explanation for any particular classification or prediction is relatively straightforward.
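
As a small illustration, the sketch below (assuming scikit-learn and its bundled Iris sample, neither of which appears in this text) dumps a fitted tree as nested, human-readable rules; each root-to-leaf path reads as a chain of simple tests.

```python
# Dump a fitted decision tree as nested, human-readable rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```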

 

Ability to Perform in Rule-Oriented Domains. It may sound obvious, but rule induction in general, and decision trees in particular, are an excellent choice in domains where rules are to be found. Many domains, ranging from genetics to text classification, have underlying rules, though these may be quite complex and obscured by noisy data. Decision trees are a natural choice when you suspect the existence of underlying rules.

 

Ease of Calculation at Classification Time. Algorithms used to produce decision trees generally yield trees with a low branching factor and simple tests at each node. Typical tests include numeric comparisons, set membership, and simple conjunctions. When implemented on a computer, these tests translate into simple Boolean and integer operations that are fast and inexpensive. This is an important point because, in a commercial environment, a predictive model is likely to be used to classify many millions or even billions of records.
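
For instance, a small learned tree amounts to a handful of comparisons per record, as in this hand-coded sketch (the feature names and thresholds here are hypothetical):

```python
# A small learned tree hand-coded as nested tests: a few Boolean and
# numeric operations per record. Feature names and thresholds are made up.
def classify(record):
    if record["petal_length"] <= 2.45:           # numeric comparison
        return "setosa"
    if record["color"] in {"blue", "violet"}:    # set-membership test
        return "versicolor"
    return "virginica"

print(classify({"petal_length": 5.0, "color": "red"}))  # virginica
```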

 

Ability to Handle Both Continuous and Categorical Variables. Decision-tree methods are equally adept at handling continuous and categorical variables. Categorical variables, which pose problems for neural networks, come ready-made with their own splitting criteria: one branch for each category. Continuous variables are equally easy to split by picking a number somewhere in their range of values.

 

Ability to Clearly Indicate Best Fields. Decision-tree building algorithms put the field that does the best job of splitting the training records at the root node of the tree.
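
A sketch of the idea, with made-up data: each candidate field is scored by how much its best split reduces Gini impurity, numeric fields splitting on a threshold and categorical fields getting one branch per category; the top-scoring field becomes the root.

```python
# Score each candidate split by its reduction in Gini impurity.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(values, labels, threshold=None):
    groups = {}
    for v, y in zip(values, labels):
        key = (v <= threshold) if threshold is not None else v  # numeric vs. categorical
        groups.setdefault(key, []).append(y)
    n = len(labels)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted            # bigger gain -> better field

labels = ["yes", "yes", "no", "no"]
print(split_gain([1.0, 2.0, 7.0, 9.0], labels, threshold=3.0))  # numeric: 0.5
print(split_gain(["a", "a", "b", "b"], labels))                  # categorical: 0.5
```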

 

Less Appropriate for Estimation and Time-Series Tasks. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous variable such as income, blood pressure, or interest rate. Decision trees are also problematic for time-series data unless a lot of effort is put into presenting the data in such a way that trends and sequential patterns are made visible.

 

Error-Prone with Too Many Classes. Some decision-tree algorithms can only deal with binary-valued target classes (yes/no, accept/reject). Others are able to assign records to an arbitrary number of classes, but are error-prone when the number of training examples per class gets small. This can happen rather quickly in a tree with many levels and/or many branches per node.

 

Computationally Expensive to Train. The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive since many candidate subtrees must be formed and compared.

 

 

Strengths and Weaknesses of Case-Based Methods

 

The main strengths of case-based methods are:

·        Automatic cluster detection works well with categorical, numeric, and textual data

·        Easy to apply

 

 

Clustering Can Be Performed on Diverse Data Types. By choosing different distance measures, automatic clustering can be applied to almost any kind of data. It is as easy to find clusters in collections of news stories or insurance claims as in astronomical or financial data.

 

Automatic Cluster Detection Is Easy to Apply. Most cluster detection techniques require very little massaging of the input data and there is no need to identify particular fields as inputs and others as outputs.
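
A minimal sketch (assuming scikit-learn, which this text does not mention): the records go in as-is, with no designated input or output fields.

```python
# Raw records in, cluster labels out -- no target column, no special fields.
import numpy as np
from sklearn.cluster import KMeans

records = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.8, 8.3]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(records)
print(labels)  # e.g. [0 0 1 1]
```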

 

The main weaknesses of case-based methods are:

·        It can be difficult to choose the right distance measures and weights

·        Sensitivity to initial parameters

·        It can be hard to interpret the resulting clusters

·        Retrieval time is expensive

 

Difficulty with Weights and Measures. The performance of automatic cluster detection algorithms is highly dependent on the choice of a distance metric or other similarity measure. It is sometimes quite difficult to devise distance metrics for data that contains a mixture of variable types. It can also be difficult to determine a proper weighting scheme for disparate variable types.
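
A sketch of the difficulty, with hypothetical fields: any hand-rolled distance over mixed-type records forces exactly these choices of scaling and weighting.

```python
# A hand-rolled distance over mixed-type records. The field names, the 100.0
# rescaling, and the weights are all arbitrary choices -- which is the problem.
def mixed_distance(a, b, weights):
    d = weights["age"] * abs(a["age"] - b["age"]) / 100.0   # numeric, rescaled
    d += weights["city"] * (a["city"] != b["city"])         # categorical, 0/1 mismatch
    return d

a = {"age": 30, "city": "Paris"}
b = {"age": 45, "city": "Rome"}
print(mixed_distance(a, b, weights={"age": 1.0, "city": 0.5}))  # 0.65
```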

 

Sensitivity to Initial Parameters. In the K-means method, the original choice of a value for K determines the number of clusters that will be found. If this number does not match the natural structure of the data, the technique will not obtain good results.
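
A sketch of probing this sensitivity (assuming scikit-learn; the data is synthetic): try several values of K and compare a cluster-quality score such as the silhouette.

```python
# Three well-separated blobs; only K = 3 matches the natural structure.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(k, round(silhouette_score(data, labels), 3))  # K = 3 should score highest
```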

 

Difficulty Interpreting Results. A strength of case-based reasoning is that it is an unsupervised knowledge discovery technique. The flip side is that when you don’t know what you are looking for, you may not recognize it when you find it! The clusters you discover are not guaranteed to have any practical value.

 

Computationally Expensive to Retrieve. Retrieving a classification from a k-Nearest Neighbor algorithm is computationally expensive, because the stored cases must be searched at query time.
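
A brute-force sketch of why: each k-NN query must scan and rank every stored case, roughly O(n·d) work per query, rather than follow a few tree branches.

```python
# Brute-force k-NN: distance to every stored case, then a sort, per query.
import numpy as np

def knn_predict(query, train_X, train_y, k=3):
    dists = np.linalg.norm(train_X - query, axis=1)   # scan all n cases
    nearest = np.argsort(dists)[:k]                   # rank and keep k
    votes, counts = np.unique(train_y[nearest], return_counts=True)
    return votes[np.argmax(counts)]                   # majority vote

train_X = np.random.rand(100_000, 10)                 # 100k stored cases
train_y = np.random.randint(0, 2, size=100_000)
print(knn_predict(np.random.rand(10), train_X, train_y))
```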

 

 

Strengths and Weaknesses of Biological Methods

 

The main strengths of Neural Networks are:

·        Versatility

·        Can Produce Good Results in Complicated Domains

·        Handles Continuous Data Types

 

Neural Networks Are Versatile. Neural networks provide a very general way of approaching problems. When the output of the network is continuous, such as the appraised value of a home, the network is performing prediction. When the output has discrete values, it is doing classification. With a simple rearrangement of the neurons, the network becomes adept at detecting clusters. This versatility largely accounts for the popularity of neural networks: the effort spent learning how to use them and how to massage data is not wasted, since the knowledge can be applied wherever neural networks are appropriate.
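
A sketch of this versatility (assuming scikit-learn's MLP estimators, which this text does not mention): the same architecture performs prediction or classification depending only on the target it is trained against.

```python
# The same network machinery, pointed at a continuous target (prediction)
# and then at a discrete target (classification). Data is synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

X = np.random.rand(200, 4)
y_cont = X.sum(axis=1)                   # continuous output -> prediction
y_disc = (y_cont > 2.0).astype(int)      # discrete output   -> classification

reg = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y_cont)
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y_disc)
print(reg.predict(X[:1]), clf.predict(X[:1]))
```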

 

Neural Networks Can Produce Good Results in Complicated Domains. Neural networks produce good results. Across a large number of industries and a large number of applications, neural networks have proven themselves over and over again. These results come in complicated domains, such as analyzing time series and detecting fraud, that are not easily amenable to other techniques. The largest neural network in production use is probably the system that AT&T uses for reading numbers on checks. This neural network has hundreds of thousands of units organized into seven layers.

 

As compared to standard statistics or to decision-tree approaches, neural networks are much more powerful. They incorporate non-linear combinations of features into their results, not limiting themselves to rectangular regions of the solution space. They are able to take advantage of all the possible combinations of features to arrive at the best solution.

 

Neural Networks Can Handle Continuous Data Types. Although the data has to be massaged, neural networks have proven most useful on continuous data, for both inputs and outputs. Categorical data can be handled in two different ways: by using a single unit, with each category given a subset of the range from 0 to 1, or by using a separate unit for each category. Continuous data is easily mapped into the necessary range. It should be noted that neural networks perform much better in continuous domains, such as modeling water or air flows.
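
A sketch of the two encodings just described, for a hypothetical three-valued category:

```python
# Two ways to feed the category {"red", "green", "blue"} into a network.
categories = ["red", "green", "blue"]

def single_unit(value):
    # One input unit: each category mapped to a sub-range of [0, 1].
    return categories.index(value) / (len(categories) - 1)

def one_unit_per_category(value):
    # One input unit per category (one-hot).
    return [1.0 if c == value else 0.0 for c in categories]

print(single_unit("green"))            # 0.5
print(one_unit_per_category("green"))  # [0.0, 1.0, 0.0]
```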

 

The main weaknesses of Neural Networks are:

·        Inputs and Outputs Must Be Massaged

·        Cannot Explain Results

·        May Converge on an Inferior Solution

 

All Inputs and Outputs Must Be Massaged to [0, 1]. The inputs to a neural network must be massaged to be in a particular range, usually between 0 and 1. This requires additional transforms and manipulations of the input data that take additional time, CPU power, and disk space. The requirement to massage the data is actually a mixed blessing: it forces an analysis of the training set to verify the data values and their ranges. Since data quality is the number one issue in data mining and machine learning, this additional perusal of the data can actually forestall problems later in the analysis.
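
A minimal sketch of the massaging step: min-max scaling onto [0, 1], computed from the training set and reused for new records.

```python
# Min-max scaling onto [0, 1]; the min/max come from the training set only,
# and the same transform is reused for new records.
import numpy as np

train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
lo, hi = train.min(axis=0), train.max(axis=0)

def scale(x):
    return (x - lo) / (hi - lo)

print(scale(train))                    # every column now spans [0, 1]
print(scale(np.array([15.0, 500.0])))  # [0.25, 0.75]
```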

 

Neural Networks Cannot Explain Results. This is the biggest criticism directed at neural networks. In domains where explaining rules may be critical, neural networks are not the tool of choice. They are the tool of choice when acting on the results is more important than understanding them. Even though neural networks cannot produce explicit rules, sensitivity analysis does enable them to explain which inputs are more important than others. This analysis can be performed inside the network, by using the errors generated, or it can be performed externally by poking the network with specific inputs.
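
A sketch of the external variant (assuming a scikit-learn-style fitted model, which this text does not specify): perturb one input at a time and measure how far the output moves.

```python
# Rank input importance by poking a fitted network one input at a time.
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(200, 3)
y = 5 * X[:, 0] + X[:, 1]              # input 0 matters most, input 2 not at all
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,
                     random_state=0).fit(X, y)

def sensitivity(model, x, eps=0.05):
    base = model.predict(x.reshape(1, -1))[0]
    return [abs(model.predict((x + eps * np.eye(x.size)[i]).reshape(1, -1))[0] - base)
            for i in range(x.size)]

print(sensitivity(model, X[0]))        # the first score should dominate
```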

 

Neural Networks May Converge on an Inferior Solution. Neural networks usually converge on some solution for any given training set. Unfortunately, there is no guarantee that this solution provides the best model of the data.

 

 

Strengths and Weaknesses of Statistical Methods

 

The main strengths of statistical methods are:

·        Statistical methods work well with categorical, numeric, and textual data

·        Easy to apply

 

 

Statistics Can Be Performed on Diverse Data Types. By choosing appropriate distributions and measures, statistical methods can be applied to almost any kind of data. It is possible to fit probability distributions to collections of news stories or insurance claims just as to financial data.

 

Statistical Methods Are Easy to Apply. Most statistical techniques require very little massaging of the input data, and there is no need to identify particular fields as inputs and others as outputs.

 

The main weaknesses of statistical methods are:

·        Wrong assumptions will result in bad performance

·        It can be hard to interpret the resulting clusters

·        Retrieval time is expensive

 

Difficulty with Assumptions. Domain assumptions play a major part in the use of statistical methods, and even in deciding which one is appropriate. For example, Naïve Bayes assumes that the attributes of a dataset are largely independent of one another given the class.
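
A sketch of the assumption itself: Naïve Bayes scores a class as the prior times a product of per-attribute likelihoods, which is justified only when the attributes are (nearly) independent given the class. The likelihood tables below are made up.

```python
# Naive Bayes in one function: score(c) = P(c) * product of P(x_i | c).
# The multiplication step is where attribute independence is assumed.
def naive_bayes_score(prior, likelihoods, record):
    score = prior
    for attr, value in record.items():
        score *= likelihoods[attr].get(value, 1e-6)   # per-attribute factor
    return score

# Hypothetical likelihood tables for one class:
likelihoods = {"color": {"red": 0.7, "blue": 0.3},
               "size": {"big": 0.4, "small": 0.6}}
print(naive_bayes_score(0.5, likelihoods, {"color": "red", "size": "small"}))  # 0.21
```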

 

Difficulty Interpreting Results. A strength of a statistical method is that it is an unsupervised knowledge discovery technique. The flip side is that when you don’t know what you are looking for, you may not recognize it when you find it.

 

Computationally Expensive to Retrieve. The process of retrieving query results is computationally expensive. However, when the probabilistic assumptions hold, simpler and cheaper methods may be appropriate.