© 2020 The original authors.

Welcome to Nessus Weka. This is a collection of examples and tutorials related to the Weka data mining toolset.

Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API. It is widely used for teaching, research, and industrial applications, contains a plethora of built-in tools for standard machine learning tasks, and additionally gives transparent access to well-known toolboxes such as scikit-learn, R, and Deeplearning4j.

In this intro, we’ll show you how to get started with simple data mining tasks and how to incorporate these into larger Apache Camel workflows.

1. Getting Weka

Weka download and install instructions are here.

When you open Weka, you’ll initially see two windows.

The Weka Chooser

Weka Chooser

and the Weka Explorer

Weka Explorer

Following on from here, we mostly explore how to do data mining tasks in Java APIs and in Apache Camel workflows.

However, it is Weka that is the source of incredibly comprehensive and powerful data mining functionality. You may find, once you are more familiar with it, that Java APIs and Camel workflows become an afterthought of what you have decided on using the Weka tools.

This isn’t so much a limitation of Nessus/Camel APIs but quite intentional. In Nessus, we aim to provide entry points that are general enough to access the full scope of Weka functionality.

Without further ado, let’s do it …

2. Data Input/Output

When you get started, it is quite likely that your data isn’t available (yet) in a format that data mining libraries can use effectively. Let’s assume you have some CSV data that you first want to convert into Weka’s native .arff format.

Here is the list of file formats that Weka supports for reading and writing …​

File Formats

The data that we’ll be using here is borrowed from R2D3’s excellent visual introduction to machine learning. We have just short of 500 instances of homes, each with various attributes assigned to it.

Reading data is easy …​

Dataset dataset = Dataset.create("data/sfny.csv");

Converting this into an .arff file is easy too …

dataset.write("data/sfny.arff");

In fact, because Dataset has a functional flow API, we could have done

Dataset.create("data/sfny.csv").write("data/sfny.arff");
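To give a feel for what the .arff format looks like, here is a minimal plain-Java sketch that turns CSV text into ARFF text. It is not the Weka converter and not part of the Nessus API: the class name CsvToArffSketch is made up, every attribute is simply declared numeric, and the two data rows are invented for illustration.

```java
import java.util.List;

// Minimal sketch: turn CSV text (header + numeric rows) into ARFF text.
// Hypothetical helper, not the Weka converter; all attributes are declared numeric.
class CsvToArffSketch {

    static String toArff(String relation, List<String> csvLines) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        // The first CSV line carries the attribute names
        for (String name : csvLines.get(0).split(",")) {
            sb.append("@attribute ").append(name.trim()).append(" numeric\n");
        }
        sb.append("\n@data\n");
        // The remaining lines are copied over as data rows
        for (String row : csvLines.subList(1, csvLines.size())) {
            sb.append(row.trim()).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three of the attribute names and two made-up rows, for illustration only
        List<String> csv = List.of(
                "in_sf,beds,bath",
                "0,2,1",
                "1,3,2");
        System.out.print(toArff("sfny", csv));
    }
}
```

An ARFF file is a small header (relation name plus one declaration per attribute) followed by the data rows, which is why the conversion is cheap.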

Let’s open the converted file in the Weka Explorer.

sfny numeric

If this is the first time you see a dataset in Weka, I’ll quickly talk you through what we can learn from this view.

  • We have 492 data instances with 8 attributes each

  • All attributes are of type numeric

  • Attribute in_sf has two distinct values: 0, 1

We can assume that this attribute is the so-called "instance class". An attribute value of '1' means a home is in San Francisco, '0' means a home is in New York.

2.1. Preparing for Classification

What we have here is a very simple form of binary classification (i.e. the home is either in San Francisco or it is not). The data does not say that yet, however. Instead, it only knows about 8 attributes, each of which is numeric.

What we want to do next is to convert the in_sf attribute into a nominal value. The respective nominal values could be "ny" and "sf", but we leave them as "0" and "1" for now.

Weka can apply filters to transform a dataset into another more suitable form.

What we want here is to apply the NumericToNominal attribute filter like this …

// Convert the 'in_sf' attribute to nominal
dataset.apply("NumericToNominal -R first");

and to reorder the attributes such that in_sf becomes the last attribute in the list, like this …

// Move the 'in_sf' attribute to the end
dataset.apply("Reorder -R 2-last,1");

We could have left the class attribute in the first position and set this explicitly to be the class attribute. Instead, we moved it to the last position where Weka expects to see the class attribute by default.

Putting it all together in the Nessus API, it looks like this …

Dataset.create("data/sfny.csv")

    // Convert the 'in_sf' attribute to nominal
    .apply("NumericToNominal -R first")

    // Move the 'in_sf' attribute to the end
    .apply("Reorder -R 2-last,1")

    // Reset the relation name
    .apply("RenameRelation -modify sfny")

    // Write out the resulting dataset
    .write("data/sfny.arff");

If you haven’t done so already, let’s now open the resulting dataset in the Weka Explorer.

sfny nominal

If you click on the "Visualize All" button, you’ll see …​

sfny visualize

Which one of those attributes is a good candidate for initial class discrimination? Have a guess …​

3. Classification

With classification, we would ultimately like to tell which class an as yet unseen data instance belongs to. In our case, we would like to predict whether a home is located in San Francisco or New York.

Let’s now build a prediction model, which we train with the knowledge we have from our data.

3.1. ZeroR

But first, let’s establish a few baselines, which will later help us to assess the quality of our prediction model.

The most basic prediction algorithm is ZeroR. It stands for "no rule at all". It simply counts the number of instances per class and always predicts the class with the highest number of instances.

Evaluation eval = Dataset.create("data/sfny.arff")

    .buildClassifier("ZeroR")

    .evaluateModel(dataset)

    .getEvaluation();

Here is the output for this simple ZeroR analysis

ZeroR predicts class value: 1

=== Summary ===

Correctly Classified Instances         268               54.4715 %
Incorrectly Classified Instances       224               45.5285 %
Kappa statistic                          0
Mean absolute error                      0.496
Root mean squared error                  0.498
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances              492

According to ZeroR, every home ever presented is located in San Francisco and it would be correct about that in 54.47% of the cases. This would be just a little better than tossing a coin (i.e. not very reliable at all).

Why exactly that number? Well, if you remember our class distribution, we have 268 instances in SF and 224 in NY. ZeroR always predicts SF, so its accuracy is 268/492 = 0.5447.
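The same arithmetic can be written down in a few lines of plain Java. This is only a sketch of the ZeroR idea, not Weka's implementation; the class name ZeroRSketch and its methods are made up for illustration.

```java
// Minimal sketch of the ZeroR idea in plain Java (not Weka's implementation).
class ZeroRSketch {

    // Count how often each class index (0..numClasses-1) occurs
    static int[] classCounts(int[] labels, int numClasses) {
        int[] counts = new int[numClasses];
        for (int label : labels) counts[label]++;
        return counts;
    }

    // ZeroR ignores all attributes and predicts the most frequent class
    static int majorityClass(int[] counts) {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) best = c;
        }
        return best;
    }

    // Accuracy when that single prediction is applied to every instance
    static double accuracy(int[] counts) {
        int total = 0;
        for (int n : counts) total += n;
        return (double) counts[majorityClass(counts)] / total;
    }

    public static void main(String[] args) {
        int[] counts = {224, 268}; // NY (0) and SF (1), as quoted above
        System.out.printf("predicts %d, accuracy %.4f%n",
                majorityClass(counts), accuracy(counts));
    }
}
```

With the class counts from our dataset, the majority class is 1 (SF) and the accuracy is 268/492, i.e. the 54.47% reported by Weka.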

Let’s try to build a better model …

3.2. OneR

OneR is still a very simple classification algorithm. It stands for "one rule". Whereas ZeroR did not look at any of the attributes, OneR finds the one attribute that leads to the highest accuracy.

Evaluation eval = Dataset.create("data/sfny.arff")

    .buildClassifier("OneR")

    .evaluateModel(dataset)

    .getEvaluation();

Here is the output for the OneR analysis

elevation:
    < 4.5   -> 1
    < 12.5  -> 0
    < 13.5  -> 1
    < 30.5  -> 0
    >= 30.5 -> 1
(407/492 instances correct)

=== Summary ===

Correctly Classified Instances         407               82.7236 %
Incorrectly Classified Instances        85               17.2764 %
Kappa statistic                          0.6553
Mean absolute error                      0.1728
Root mean squared error                  0.4156
Relative absolute error                 34.8303 %
Root relative squared error             83.4643 %
Total Number of Instances              492

A prediction accuracy of 82.72% is quite good. Actually, much better than tossing a coin.

Do you remember when I asked you to guess the best discriminator attribute? Here you have it … it is "elevation".
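To make the "one rule" idea more tangible, here is a simplified plain-Java sketch. It assumes attributes that are already nominal and integer-coded, whereas Weka's OneR additionally splits numeric attributes such as elevation into ranges. The class OneRSketch and the toy data are made up for illustration.

```java
// Simplified sketch of the OneR idea (not Weka's implementation).
// Assumes nominal attributes coded as integers 0..numValues-1.
class OneRSketch {

    // Errors made when each value of one attribute predicts its majority class.
    // data[i][a] is the value of attribute a for instance i.
    static int errorsFor(int[][] data, int[] labels, int attr, int numValues, int numClasses) {
        int[][] counts = new int[numValues][numClasses];
        for (int i = 0; i < data.length; i++) {
            counts[data[i][attr]][labels[i]]++;
        }
        int errors = 0;
        for (int[] perClass : counts) {
            int total = 0, max = 0;
            for (int n : perClass) {
                total += n;
                max = Math.max(max, n);
            }
            errors += total - max; // everything outside the majority class is wrong
        }
        return errors;
    }

    // OneR keeps the single attribute whose rule makes the fewest errors
    static int bestAttribute(int[][] data, int[] labels, int numValues, int numClasses) {
        int best = 0;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < data[0].length; a++) {
            int e = errorsFor(data, labels, a, numValues, numClasses);
            if (e < bestErrors) {
                bestErrors = e;
                best = a;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up toy data: attribute 0 is noise, attribute 1 tracks the class
        int[][] data = {{0, 0}, {1, 0}, {0, 1}, {1, 1}};
        int[] labels = {0, 0, 1, 1};
        System.out.println("best attribute: " + bestAttribute(data, labels, 2, 2));
    }
}
```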

But, hang on. Didn’t we just cheat ourselves? We trained the model with the full dataset and then asked for predictions on the same dataset. The model already knew every data instance and could therefore find the ideal boundaries for every attribute.

This is called "overfitting". We built a model that works very well for the given dataset, but might be useless for data it hasn’t seen yet: it learned the data by heart.

Let’s fix that …

3.3. Splitting the Data

Assuming that we don’t have another source of data, we need to split up the data that we do have. A good rule of thumb is to use 80% for training the model and 20% for evaluating its performance.

When you look at sfny.arff you will notice that all instances for NY come first, followed by the instances for SF. Simply using the first 80% of instances for training would not work so well. We therefore randomly reorder the data instances before we do the split.

Many of the algorithms in Weka involve a fair bit of randomization. It would however be a nightmare, if we saw different results on every run. It would also be pointless to show actual figures as we have done so far. The solution is to use an explicit randomization seed. Given the same seed the randomizer produces the same sequence of numbers - below we use -S 0

Dataset rndset = Dataset.create("data/sfny.arff")

        .apply("Randomize -S 0")

        .apply("RenameRelation -modify sfny-random")

        .write("data/sfny-random.arff");

int numTotal = rndset.getInstances().numInstances();
int firstTrainIdx = (int) Math.round(numTotal * 0.20);
int lastTestIdx = firstTrainIdx - 1;

Dataset trainset = new Dataset(rndset.getInstances())

        .apply("RemoveRange -R 1-" + lastTestIdx)

        .apply("RenameRelation -modify sfny-train")

        .write("data/sfny-80pct.arff");

Dataset testset = new Dataset(rndset.getInstances())

        .apply("RemoveRange -R " + firstTrainIdx + "-" + numTotal)

        .apply("RenameRelation -modify sfny-test")

        .write("data/sfny-20pct.arff");

Assert.assertEquals(492, rndset.getInstances().numInstances());
Assert.assertEquals(395, trainset.getInstances().numInstances());
Assert.assertEquals(97, testset.getInstances().numInstances());
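If you want to convince yourself of the index arithmetic and of the effect of a fixed seed without Weka, the following plain-Java sketch may help. It shuffles bare indices with java.util.Random, so the resulting order will not match what Weka's Randomize filter produces. The class SeededSplitSketch is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of a seeded shuffle followed by a 20/80 split (indices only, no Weka).
class SeededSplitSketch {

    // Shuffle the indices 0..n-1 with a fixed seed so every run gives the same order
    static List<Integer> shuffledIndices(int n, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        return idx;
    }

    // Number of instances reserved for testing, mirroring the arithmetic above
    static int testSize(int numTotal, double testFraction) {
        int firstTrainIdx = (int) Math.round(numTotal * testFraction); // 1-based
        return firstTrainIdx - 1;
    }

    public static void main(String[] args) {
        List<Integer> idx = shuffledIndices(492, 0L);
        int nTest = testSize(492, 0.20);
        List<Integer> test = idx.subList(0, nTest);
        List<Integer> train = idx.subList(nTest, idx.size());
        System.out.println(train.size() + " training / " + test.size() + " test");
    }
}
```

For 492 instances this yields 97 test and 395 training indices, the same sizes that the assertions above check.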

Let’s run OneR again …

3.4. OneR Training/Test

Now that we have split our data into two sets, let’s run OneR again …

Dataset training = Dataset.create("data/sfny-80pct.arff");
Dataset testing = Dataset.create("data/sfny-20pct.arff");

Evaluation eval = training

    .buildClassifier("OneR")

    .evaluateModel(testing)

    .getEvaluation();

The result is different, but still much better than ZeroR

elevation:
    < 1.5   -> 1
    < 3.5   -> 0
    < 5.5   -> 1
    < 30.5  -> 0
    >= 30.5 -> 1
(325/395 instances correct)

=== Summary ===

Correctly Classified Instances          75               77.3196 %
Incorrectly Classified Instances        22               22.6804 %
Kappa statistic                          0.5473
Mean absolute error                      0.2268
Root mean squared error                  0.4762
Relative absolute error                 45.9273 %
Root relative squared error             96.2495 %
Total Number of Instances               97

Now we have a model that is still very simple, but would likely work in more than 3/4 of all cases.

3.5. Stratification

Because we used a random process to split our data, there is a chance that we introduced some skew. How would our model be affected if the training/test data did not have the same class distribution as the full dataset? Let’s say the training set had a significantly higher percentage of SF homes than the test set. In this case, the model would likely be biased towards SF homes.

There is a method that can split our data in a "supervised" way, such that the class value distribution is taken into account.

Let’s try that as well …

Dataset dataset = Dataset.create("data/sfny.arff")

        // Push the full dataset to the stack
        .push()

        .apply("StratifiedRemoveFolds -N 5")

        .apply("RenameRelation -modify sfny-test")

        .write("data/sfny-20pct-strat.arff")

        .pushTestSet()

        // Pop the full dataset from the stack
        .pop()

        .apply("StratifiedRemoveFolds -N 5 -V")

        .apply("RenameRelation -modify sfny-train")

        .write("data/sfny-80pct-strat.arff")

        .pushTrainingSet();

Above, we use the concept of named dataset slots from the Dataset API. It simply means that a Dataset can maintain a theoretically unlimited number of named Weka Instances. And because the split into "training/testing" is so common, we have explicit methods to push/pop those.
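As for the stratification itself, the idea can be sketched in plain Java by dealing the members of each class round-robin into folds. This is an assumption about how one could do it, not a description of how Weka's StratifiedRemoveFolds works internally, and Weka may well pick different members for each fold. The class StratifiedFoldsSketch is made up for illustration.

```java
// Sketch of a stratified fold assignment (not Weka's StratifiedRemoveFolds).
class StratifiedFoldsSketch {

    // Assign each instance to a fold, dealing the members of every class
    // round-robin so that each fold keeps roughly the overall class ratio.
    static int[] assignFolds(int[] labels, int numClasses, int numFolds) {
        int[] fold = new int[labels.length];
        int[] next = new int[numClasses]; // next fold to use, per class
        for (int i = 0; i < labels.length; i++) {
            int c = labels[i];
            fold[i] = next[c];
            next[c] = (next[c] + 1) % numFolds;
        }
        return fold;
    }

    // Count how many instances of each class landed in the given fold
    static int[] classCountsInFold(int[] labels, int[] fold, int numClasses, int which) {
        int[] counts = new int[numClasses];
        for (int i = 0; i < labels.length; i++) {
            if (fold[i] == which) counts[labels[i]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 224 NY (0) followed by 268 SF (1), the ordering described for sfny.arff
        int[] labels = new int[492];
        for (int i = 224; i < 492; i++) labels[i] = 1;
        int[] fold = assignFolds(labels, 2, 5);
        int[] test = classCountsInFold(labels, fold, 2, 0);
        System.out.println("fold 0: " + test[0] + " NY, " + test[1] + " SF");
    }
}
```

With five folds, the first fold of this sketch holds 45 NY and 54 SF homes, i.e. 99 instances with nearly the same class ratio as the full dataset.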

Running OneR using a stratified data split, gives us …​

elevation:
    < 4.5   -> 1
    < 12.5  -> 0
    < 14.5  -> 1
    < 25.5  -> 0
    >= 25.5 -> 1
(323/393 instances correct)

=== Summary ===

Correctly Classified Instances          81               81.8182 %
Incorrectly Classified Instances        18               18.1818 %
Kappa statistic                          0.6333
Mean absolute error                      0.1818
Root mean squared error                  0.4264
Relative absolute error                 36.6589 %
Root relative squared error             85.6347 %
Total Number of Instances               99

I guess an improvement of about 4.5 percentage points (from 77.32% to 81.82%) is significant. Do we already trust this model?

3.6. Cross-Validation

You might think OneR is quite boring and there is only so much improvement you can do using this algorithm. Well yes, you might be right about this, but we are not quite there yet …​

The stratified split above divides the data into five "folds". It reserves one fold (i.e. 20%) for testing and uses the other four folds for training the model. We could also have used 10 folds, and we could have used a different fold (i.e. not just the first one) as our test set. We could also have done the whole process over and over again, using a different random seed every time. At the end, we could have aggregated the results and produced a model that works best for all of those iterations. Only then would we go to the pub with high confidence in our model.

Let’s finally do that and see what it gives us …

Evaluation eval = Dataset.create("data/sfny.arff")

    .buildClassifier("OneR")

    .crossValidateModel(10, 1)

    .getEvaluation();

As you can see, this is really not a lot of code. All the data splitting, stratification and rebuilding of the model several times is done under the hood. This is also the default method that Weka uses when you open a dataset and run any classifier with default options.

Finally, this is what we get …​

elevation:
    < 4.5   -> 1
    < 12.5  -> 0
    < 13.5  -> 1
    < 30.5  -> 0
    >= 30.5 -> 1
(407/492 instances correct)

=== Summary ===

Correctly Classified Instances         379               77.0325 %
Incorrectly Classified Instances       113               22.9675 %
Kappa statistic                          0.5401
Mean absolute error                      0.2297
Root mean squared error                  0.4792
Relative absolute error                 46.3022 %
Root relative squared error             96.2313 %
Total Number of Instances              492

Interestingly enough, the model configuration is quite similar to our own stratified split and the result quite similar to our own percentage split. I guess, we’ve just been lucky in the way we split the data. Anyhow, I’d say this is OneR with a good level of confidence.
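For the curious, the core loop of k-fold cross-validation can be sketched in plain Java as follows. To keep it short, the "model" is just a majority-class learner for binary labels (the ZeroR idea) and the folds are assigned by index without shuffling or stratification. This is a sketch of the general technique under those assumptions, not what Weka executes, and the class CrossValidationSketch is made up.

```java
// Sketch of k-fold cross-validation with a majority-class "model" (binary labels).
class CrossValidationSketch {

    // Majority class among the instances that are NOT in the held-out fold
    static int trainMajority(int[] labels, int[] fold, int heldOut) {
        int zeros = 0, ones = 0;
        for (int i = 0; i < labels.length; i++) {
            if (fold[i] == heldOut) continue;
            if (labels[i] == 0) zeros++; else ones++;
        }
        return ones >= zeros ? 1 : 0;
    }

    // k-fold cross-validation: every fold serves exactly once as the test set
    static double crossValidate(int[] labels, int numFolds) {
        int[] fold = new int[labels.length];
        for (int i = 0; i < labels.length; i++) fold[i] = i % numFolds;
        int correct = 0;
        for (int k = 0; k < numFolds; k++) {
            int predicted = trainMajority(labels, fold, k); // "train" on the other folds
            for (int i = 0; i < labels.length; i++) {
                if (fold[i] == k && labels[i] == predicted) correct++; // test on fold k
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] labels = new int[492];
        for (int i = 224; i < 492; i++) labels[i] = 1; // 224 NY, 268 SF
        System.out.println(crossValidate(labels, 10));
    }
}
```

Every instance is used for testing exactly once, which is why the summary of a cross-validated run reports all 492 instances.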

How about building a model that works on multiple attributes …

3.7. Decision Tree

Let’s meet J48, "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date" (Ian H. Witten, et al.)

Evaluation eval = Dataset.create("data/sfny.arff")

    .buildClassifier("J48")

    .crossValidateModel(10, 1)

    .getEvaluation();

With the default 10-fold cross-validation method it produces a model significantly more complex than that from OneR. It also performs significantly better than OneR.

J48 pruned tree

elevation <= 32
|   price_per_sqft <= 1072
|   |   year_built <= 1972
|   |   |   beds <= 1
|   |   |   |   sqft <= 756: 0 (28.0)
|   |   |   |   sqft > 756
|   |   |   |   |   sqft <= 784: 1 (2.0)
|   |   |   |   |   sqft > 784
|   |   |   |   |   |   sqft <= 1063: 0 (5.0)
|   |   |   |   |   |   sqft > 1063
|   |   |   |   |   |   |   price_per_sqft <= 750: 1 (2.0)
|   |   |   |   |   |   |   price_per_sqft > 750: 0 (2.0)
|   |   |   beds > 1
|   |   |   |   price_per_sqft <= 829
|   |   |   |   |   elevation <= 10
|   |   |   |   |   |   year_built <= 1924: 0 (4.0)
|   |   |   |   |   |   year_built > 1924: 1 (2.0)
|   |   |   |   |   elevation > 10: 1 (13.0)
|   |   |   |   price_per_sqft > 829
|   |   |   |   |   price_per_sqft <= 1002: 0 (12.0)
|   |   |   |   |   price_per_sqft > 1002: 1 (3.0/1.0)
|   |   year_built > 1972: 1 (46.0/3.0)
|   price_per_sqft > 1072
|   |   elevation <= 4
|   |   |   bath <= 2.5
|   |   |   |   year_built <= 2005: 0 (6.0/1.0)
|   |   |   |   year_built > 2005: 1 (7.0/1.0)
|   |   |   bath > 2.5: 0 (10.0/2.0)
|   |   elevation > 4
|   |   |   price_per_sqft <= 1379
|   |   |   |   year_built <= 2008
|   |   |   |   |   beds <= 3: 0 (42.0/4.0)
|   |   |   |   |   beds > 3: 1 (3.0/1.0)
|   |   |   |   year_built > 2008: 1 (6.0)
|   |   |   price_per_sqft > 1379: 0 (110.0/2.0)
elevation > 32
|   price <= 569000
|   |   year_built <= 1916: 1 (5.0)
|   |   year_built > 1916
|   |   |   year_built <= 1948: 0 (4.0)
|   |   |   year_built > 1948: 1 (5.0/1.0)
|   price > 569000: 1 (175.0/3.0)

Number of Leaves  :     22

Size of the tree :  43

=== Summary ===

Correctly Classified Instances         420               85.3659 %
Incorrectly Classified Instances        72               14.6341 %
Kappa statistic                          0.7069
Mean absolute error                      0.1727
Root mean squared error                  0.3601
Relative absolute error                 34.8079 %
Root relative squared error             72.3008 %
Total Number of Instances              492

An accuracy of 85.37% with a high level of confidence in the model is quite good, I’d say.

When you right-click on the classification result, you can see the tree model visualized. Please note that J48 also chooses "elevation" as the initial discriminator. Each split is then performed such that it yields the maximum information gain.

J48 Tree
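If "information gain" is new to you, here is a small plain-Java sketch of the calculation: the entropy of the class distribution before a split, minus the weighted entropy after it. The parent counts are our class distribution. The child counts are a hypothetical split made up for illustration, not taken from the data. Note also that C4.5-style trees such as J48 normalise this value into a gain ratio, which the sketch leaves out.

```java
// Sketch of entropy and information gain for a candidate split.
class InfoGainSketch {

    // Shannon entropy (in bits) of a class distribution given as counts
    static double entropy(int... counts) {
        int total = 0;
        for (int n : counts) total += n;
        double h = 0.0;
        for (int n : counts) {
            if (n == 0) continue; // 0 * log(0) is taken as 0
            double p = (double) n / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of splitting 'parent' into the given child distributions
    static double infoGain(int[] parent, int[][] children) {
        int total = 0;
        for (int n : parent) total += n;
        double remainder = 0.0;
        for (int[] child : children) {
            int size = 0;
            for (int n : child) size += n;
            remainder += (double) size / total * entropy(child);
        }
        return entropy(parent) - remainder;
    }

    public static void main(String[] args) {
        int[] parent = {224, 268};               // class counts quoted earlier
        int[][] split = {{200, 100}, {24, 168}}; // hypothetical split, not from the data
        System.out.println(infoGain(parent, split));
    }
}
```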

4. Prediction

Ultimately, we would like to predict some unseen data. Quite likely, this data will only become available in the future.

We have seen how to build a classifier using cross-validation. We found that J48 works well for the data that we have. To simulate unseen data, we have used a 20% (i.e. 1/5 folds) stratified split.

4.1. Persisting a Model

Let’s now rebuild that model and persist it for later use with unseen data.

ModelPersister persister = new ModelPersister("data/sfny-j48.model");

Dataset.create("data/sfny-train.arff")

        .buildClassifier("J48")

        .crossValidateModel(10, 1)

        .consumeClassifier(persister);

We can also do this in memory, with the ModelEncoder.

ModelEncoder encoder = new ModelEncoder();

Dataset.create("data/sfny-train.arff")

        .buildClassifier("J48")

        .crossValidateModel(10, 1)

        .consumeClassifier(encoder);

String encoded = encoder.getEncodedModel();

4.2. Restoring a Model

Restoring the model for use with some unseen data is equally simple …​

Dataset.create("data/sfny-test.arff")

        .loadClassifier(new ModelLoader("data/sfny-j48.model"))

        .consumeClassifier(cl -> logInfo("{}", cl));

or, alternatively …​

Dataset.create("data/sfny-test.arff")

        .loadClassifier(new ModelDecoder(encoded))

        .consumeClassifier(cl -> logInfo("{}", cl));

4.3. Predicting a Class

You might have noticed, that above we loaded sfny-test (i.e. 20% of the stratified split) that we’ve set aside for testing. Worth noting also, that we trained the model with sfny-train (i.e. the 80% of the stratified split). The model has therefore not yet seen sfny-test instances.

Our test data still contains the in_sf attribute that holds the expected class values for SF or NY.

Let’s get rid of that attribute and replace it with another attribute that we call predicted. Then we pass this dataset on to the NominalPredictor, which will use the loaded classifier to fill in the predicted values. Finally, let’s save the results in a new data file.

Dataset.create("data/sfny-test.arff")

        .loadClassifier(new ModelDecoder(encoded))

        .apply("Remove -R last")

        .apply("Add -N predicted -T NOM -L 0,1")

        .apply("RenameRelation -modify sfny-predicted")

        .applyToInstances(new NominalPredictor())

        .write("data/sfny-predicted.arff");

When you look at the diff between sfny-test and sfny-predicted, you will find that most instances have identical values and only a few do not. In fact, it is those 11 instances that Weka tells us it would not correctly classify when we train J48 with sfny-train and validate against the supplied test set sfny-test.

J48 Diff
=== Summary ===

Correctly Classified Instances          88               88.8889 %
Incorrectly Classified Instances        11               11.1111 %
Kappa statistic                          0.7747
Mean absolute error                      0.1445
Root mean squared error                  0.3134
Relative absolute error                 29.1259 %
Root relative squared error             62.9411 %
Total Number of Instances               99

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.844    0.074    0.905      0.844    0.874      0.776    0.942     0.923     0
                 0.926    0.156    0.877      0.926    0.901      0.776    0.942     0.924     1
Weighted Avg.    0.889    0.119    0.890      0.889    0.888      0.776    0.942     0.924

=== Confusion Matrix ===

  a  b   <-- classified as
 38  7 |  a = 0
  4 50 |  b = 1
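Most figures in this report can be re-derived from the confusion matrix alone. The following plain-Java sketch uses the standard textbook definitions of accuracy, recall, precision and Cohen's kappa. The class ConfusionMatrixSketch is made up for illustration and is not part of Weka or Nessus.

```java
// Sketch: derive summary metrics from a square confusion matrix m[actual][predicted].
class ConfusionMatrixSketch {

    static double accuracy(int[][] m) {
        int correct = 0, total = 0;
        for (int a = 0; a < m.length; a++) {
            for (int p = 0; p < m.length; p++) {
                total += m[a][p];
                if (a == p) correct += m[a][p];
            }
        }
        return (double) correct / total;
    }

    // Recall (TP rate) for class c: correct predictions / actual members of c
    static double recall(int[][] m, int c) {
        int rowSum = 0;
        for (int p = 0; p < m.length; p++) rowSum += m[c][p];
        return (double) m[c][c] / rowSum;
    }

    // Precision for class c: correct predictions / everything predicted as c
    static double precision(int[][] m, int c) {
        int colSum = 0;
        for (int a = 0; a < m.length; a++) colSum += m[a][c];
        return (double) m[c][c] / colSum;
    }

    // Cohen's kappa: agreement beyond what the row/column totals give by chance
    static double kappa(int[][] m) {
        double total = 0;
        double[] rows = new double[m.length];
        double[] cols = new double[m.length];
        for (int a = 0; a < m.length; a++) {
            for (int p = 0; p < m.length; p++) {
                total += m[a][p];
                rows[a] += m[a][p];
                cols[p] += m[a][p];
            }
        }
        double expected = 0;
        for (int c = 0; c < m.length; c++) expected += rows[c] * cols[c] / (total * total);
        return (accuracy(m) - expected) / (1 - expected);
    }

    public static void main(String[] args) {
        int[][] m = {{38, 7}, {4, 50}}; // the matrix printed above
        System.out.println(accuracy(m) + " " + recall(m, 0) + " " + precision(m, 0) + " " + kappa(m));
    }
}
```

For the matrix above, the accuracy is 88/99, the recall for class 0 is 38/45 and its precision is 38/42, which agrees with the rounded values in the Weka report.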

We have now seen how to prepare new data for classification with an already existing model.

I’d say an accuracy of 88.89% is not bad at all. Congratulations J48!

5. Nessus API Concepts

The Nessus-Weka API is here to complement rather than duplicate or reinvent functionality that is already present in Weka APIs.

The Dataset, which is the main entry point for everything, is a contextual, mutable, non-thread-safe entity. It is the holder of Instances, Classifier and Evaluation.

A Dataset holds multiple Instances, one of which is current; the others can be associated with a name and stashed for later use. Pushing the current Instances will put them on the stack. Popping Instances will remove them from the stack and make them current.
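The push/pop behaviour can be pictured with a small generic holder in plain Java. This is a hypothetical illustration of the concept only: the class SlotHolderSketch and its methods are made up and do not mirror the actual Dataset implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder illustrating "current value + stack + named slots".
// It is not the Nessus Dataset class.
class SlotHolderSketch<T> {
    private T current;
    private final Deque<T> stack = new ArrayDeque<>();
    private final Map<String, T> named = new HashMap<>();

    SlotHolderSketch(T initial) { this.current = initial; }

    T current() { return current; }

    // Replace the current value (stands in for applying a filter)
    SlotHolderSketch<T> set(T value) { current = value; return this; }

    // Put the current value on the stack; it also stays current
    SlotHolderSketch<T> push() { stack.push(current); return this; }

    // Stash the current value under a name
    SlotHolderSketch<T> push(String name) { named.put(name, current); return this; }

    // Remove the top of the stack and make it current
    SlotHolderSketch<T> pop() { current = stack.pop(); return this; }

    // Remove a named value and make it current
    SlotHolderSketch<T> pop(String name) { current = named.remove(name); return this; }

    public static void main(String[] args) {
        SlotHolderSketch<String> h = new SlotHolderSketch<>("full")
                .push()            // remember the full dataset
                .set("test-fold")  // pretend a filter produced the test fold
                .push("test")      // stash it under a name
                .pop();            // back to the full dataset
        System.out.println(h.current());
    }
}
```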

A Classifier will exist once loaded or built. An Evaluation will exist once the Classifier has been evaluated.

Apart from getters for Instances, Classifier or Evaluation, you’ll find methods of this type …

// Do something with Foo and return a Foo
Dataset applyToFoo(UnaryOperator<Foo> operator);

// Do something with the Dataset and return a Foo
Dataset applyToFoo(Function<Dataset, Foo> function);

// Do something with Foo and return nothing
Dataset consumeFoo(Consumer<Foo> consumer);

These are very general extension points to the API, which allow you to get at the Weka primitives, do something with them (using the Weka APIs) and then return the primitive back to the Dataset such that the next step can see the modified result from the previous step.
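Here is how such a trio of methods could look on a toy context object, with a String standing in for the Weka primitive. Everything in this sketch (FluentContextSketch, applyToValue, consumeValue) is made up for illustration and is not the Nessus implementation. Because the two apply methods are overloaded, the lambdas declare their parameter types explicitly so that the compiler can pick the intended variant.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical fluent context, not the Nessus Dataset class.
class FluentContextSketch {
    private String value; // stands in for a Weka primitive such as Instances

    FluentContextSketch(String value) { this.value = value; }

    String getValue() { return value; }

    // Do something with the value and store the result
    FluentContextSketch applyToValue(UnaryOperator<String> operator) {
        value = operator.apply(value);
        return this;
    }

    // Do something with the whole context and store the result
    FluentContextSketch applyToValue(Function<FluentContextSketch, String> function) {
        value = function.apply(this);
        return this;
    }

    // Do something with the value and return nothing
    FluentContextSketch consumeValue(Consumer<String> consumer) {
        consumer.accept(value);
        return this;
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        FluentContextSketch ctx = new FluentContextSketch("sfny")
                // Explicit parameter types select the intended overload
                .applyToValue((String v) -> v + "-train")
                .applyToValue((FluentContextSketch c) -> c.getValue().toUpperCase())
                .consumeValue(log::append);
        System.out.println(ctx.getValue() + " / " + log);
    }
}
```

Each step returns the context itself, so the next step in the chain sees whatever the previous step left behind.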

So for example, if you like to log the state of the current Classifier, you could do …​

Dataset dataset = Dataset.create("data/sfny.arff")

        .buildClassifier("J48")

        .consumeClassifier(cl -> logInfo("{}", cl));

The ModelPersister is another example of a Consumer<Classifier>. This time, defined externally rather than inline.

Dataset dataset = Dataset.create("data/sfny-train.arff")

        .buildClassifier("J48")

        .crossValidateModel(10, 1)

        .consumeClassifier(new ModelPersister("data/sfny-j48.model"));

The NominalPredictor is a good example of a Function<Dataset, Instances>. It accepts the Dataset so that it can get access to the Classifier and the Instances, then does some number crunching and returns new Instances that can be passed on to the next step.

Dataset dataset = Dataset.create("data/sfny-test.arff")

        .applyToInstances(new NominalPredictor());

You get the idea: it is all about passing in functionality that is defined elsewhere. The Dataset has only a few "convenience" methods, for operations so common that we prefer a simplified syntax over the functional approach.

These would for example be …​

Dataset dataset = Dataset.create("data/sfny-test.arff")

        .apply("StratifiedRemoveFolds -N 5")

        .buildClassifier("J48");

Under the hood, these invoke functions that return Instances and a Classifier respectively.

One last example perhaps. Instead of invoking the rename filter like this …​

Dataset dataset = Dataset.create("data/sfny-test.arff")

    .apply("RenameRelation -modify sfny-test");

we could equally well have done this in Java like this …​

Dataset dataset = Dataset.create("data/sfny-test.arff")

        .applyToInstances((Instances ins) -> {
            ins.setRelationName("sfny-test");
            return ins;
         });

6. Camel-Weka

Camel-Weka moves this functionality one level up and provides Weka data mining as part of a Camel workflow. There are hundreds of components available in Camel. Data can be obtained from a wide array of data sources, then be passed on to the Camel-Weka component for analysis and decision making, with the results yet again being passed on to other components for further processing.

6.1. Consuming from file

This first example shows how to read a CSV file with the file component and then pass it on to Weka. In Weka we apply a few filters to the data set and then pass it on to the file component for writing.

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Use the file component to read the CSV file
        from("file:src/test/resources/data?fileName=sfny.csv&noop=true")

        // Convert the 'in_sf' attribute to nominal
        .to("weka:filter?apply=NumericToNominal -R first")

        // Move the 'in_sf' attribute to the end
        .to("weka:filter?apply=Reorder -R 2-last,1")

        // Rename the relation
        .to("weka:filter?apply=RenameRelation -modify sfny")

        // Use the file component to write the Arff file
        .to("file:target/data?fileName=sfny.arff");
    }
});
camelctx.start();

6.2. Read + Filter + Write

Here we do the same as above without use of the file component.

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Initiate the route from somewhere
        from("...")

        // Use weka to read the CSV file
        .to("weka:read?path=src/test/resources/data/sfny.csv")

        // Convert the 'in_sf' attribute to nominal
        .to("weka:filter?apply=NumericToNominal -R first")

        // Move the 'in_sf' attribute to the end
        .to("weka:filter?apply=Reorder -R 2-last,1")

        // Rename the relation
        .to("weka:filter?apply=RenameRelation -modify sfny")

        // Use Weka to write the Arff file
        .to("weka:write?path=target/data/sfny.arff");
    }
});
camelctx.start();

In this example, the client would provide the input path or some other supported type. Have a look at the WekaTypeConverters for the set of supported input types.

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Initiate the route from somewhere
        from("...")

        // Convert the 'in_sf' attribute to nominal
        .to("weka:filter?apply=NumericToNominal -R first")

        // Move the 'in_sf' attribute to the end
        .to("weka:filter?apply=Reorder -R 2-last,1")

        // Rename the relation
        .to("weka:filter?apply=RenameRelation -modify sfny")

        // Use Weka to write the Arff file
        .to("weka:write?path=target/data/sfny.arff");
    }
});
camelctx.start();

6.3. Building a Model

When building a model, we first choose the classification algorithm to use and then train it with some data. The result is the trained model that we can later use to classify unseen data.

Here we train J48 with 10 fold cross-validation …​

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Use the file component to read the training data
        from("file:src/test/resources/data?fileName=sfny-train.arff")

        // Build a J48 classifier using cross-validation with 10 folds
        .to("weka:model?build=J48&xval=true&folds=10&seed=1");
    }
});
camelctx.start();

Instead of doing cross-validation, we can also train the model with a set of named instances …​

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Use the file component to read the training data
        from("file:src/test/resources/data?fileName=sfny-train.arff")

        // Push the current instances to the stack
        .to("weka:push?dsname=sfny-train")

        // Build a J48 classifier with a set of named instances
        .to("weka:model?build=J48&dsname=sfny-train");
    }
});
camelctx.start();

Or perhaps even with the current set of instances …​

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Use the file component to read the training data
        from("file:src/test/resources/data?fileName=sfny-train.arff")

        // Build a J48 classifier with the current instances
        .to("weka:model?build=J48");
    }
});
camelctx.start();

6.4. Persisting a Model

When we build a model, it becomes available in the Dataset context. Building a good model with lots of training data may become a lengthy process that we don’t wish to repeat every time we have some data to analyse.

Persisting a trained Classifier is easy …​

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Initiate the route from somewhere
        from("...")

        // Build a J48 classifier with the current instances
        .to("weka:model?build=J48")

        // Persist the J48 model
        .to("weka:model?saveTo=src/test/resources/data/sfny-j48.model");
    }
});
camelctx.start();

6.5. Restoring a Model

Instead of building a model, we can also load an existing model that we have built before …​

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Initiate the route from somewhere
        from("...")

        // Load an already existing model
        .to("weka:model?loadFrom=src/test/resources/data/sfny-j48.model");
    }
});
camelctx.start();

6.6. Predicting a Class

Similar to what has been shown above we can now start to predict unseen data …​

Please note how we use a Camel Processor to access functionality that is not directly available from endpoint URIs.

In case you come here directly and this syntax looks a bit overwhelming, you might want to have a brief look at the section about Nessus API Concepts.

CamelContext camelctx = new DefaultCamelContext();
camelctx.addRoutes(new RouteBuilder() {

    @Override
    public void configure() throws Exception {

        // Use the file component to read the test data
        from("file:src/test/resources/data?fileName=sfny-test.arff")

        // Remove the class attribute
        .to("weka:filter?apply=Remove -R last")

        // Add the 'predicted' placeholder attribute
        .to("weka:filter?apply=Add -N predicted -T NOM -L 0,1")

        // Rename the relation
        .to("weka:filter?apply=RenameRelation -modify sfny-predicted")

        // Load an already existing model
        .to("weka:model?loadFrom=src/test/resources/data/sfny-j48.model")

        // Use a processor to do the prediction
        .process(new Processor() {
            public void process(Exchange exchange) throws Exception {
                Dataset dataset = exchange.getMessage().getBody(Dataset.class);
                dataset.applyToInstances(new NominalPredictor());
            }
        })

        // Write the data file
        .to("weka:write?path=src/test/resources/data/sfny-predicted.arff");
    }
});
camelctx.start();

7. Feedback

For feedback, suggestions, requests, etc. please go to …

8. Resources