weka.clusterers
Class EM

java.lang.Object
  |
  +--weka.clusterers.Clusterer
        |
        +--weka.clusterers.DistributionClusterer
              |
              +--weka.clusterers.EM
All Implemented Interfaces:
java.lang.Cloneable, OptionHandler, java.io.Serializable

public class EM
extends DistributionClusterer
implements OptionHandler

Simple EM (expectation maximisation) class.

EM assigns each instance a probability distribution that gives the probability of the instance belonging to each of the clusters. EM can decide how many clusters to create by cross validation, or you may specify a priori how many clusters to generate.
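As a rough illustration of the idea, and not Weka's source code, the following standalone sketch runs EM on a one-dimensional mixture of two normal distributions. The class name TinyEm, the sample data, the starting parameters, and the fixed iteration count are all assumptions made for the example.

```java
import java.util.Arrays;

/** Illustrative 1-D Gaussian-mixture EM; a hypothetical sketch, not Weka's implementation. */
final class TinyEm {
    /** Normal probability density with mean mu and standard deviation sigma. */
    static double normalDensity(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    /** E-step for one value: membership probability in each cluster (sums to 1). */
    static double[] memberships(double x, double[] prior, double[] mu, double[] sigma) {
        double[] p = new double[prior.length];
        double total = 0.0;
        for (int c = 0; c < p.length; c++) {
            p[c] = prior[c] * normalDensity(x, mu[c], sigma[c]);
            total += p[c];
        }
        for (int c = 0; c < p.length; c++) {
            p[c] /= total;
        }
        return p;
    }

    /** Runs a fixed number of EM iterations, updating prior, mu and sigma in place. */
    static void fit(double[] data, double[] prior, double[] mu, double[] sigma,
                    int iterations, double minStdDev) {
        int k = prior.length;
        for (int it = 0; it < iterations; it++) {
            double[] weightSum = new double[k];
            double[] weightedX = new double[k];
            double[] weightedXX = new double[k];
            for (double x : data) {
                double[] p = memberships(x, prior, mu, sigma); // E-step
                for (int c = 0; c < k; c++) {
                    weightSum[c] += p[c];
                    weightedX[c] += p[c] * x;
                    weightedXX[c] += p[c] * x * x;
                }
            }
            for (int c = 0; c < k; c++) { // M-step
                prior[c] = weightSum[c] / data.length;
                mu[c] = weightedX[c] / weightSum[c];
                double variance = weightedXX[c] / weightSum[c] - mu[c] * mu[c];
                // Clamp the standard deviation from below, in the spirit of the -M option.
                sigma[c] = Math.max(Math.sqrt(Math.max(variance, 0.0)), minStdDev);
            }
        }
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1};
        double[] prior = {0.5, 0.5};
        double[] mu = {0.0, 6.0};
        double[] sigma = {1.0, 1.0};
        fit(data, prior, mu, sigma, 20, 1e-6);
        System.out.println("means: " + Arrays.toString(mu));
        System.out.println("memberships of 1.0: "
            + Arrays.toString(memberships(1.0, prior, mu, sigma)));
    }
}
```

On this toy data the two means settle near the centres of the two groups of values, and each value receives a membership probability close to 1 for its own group.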

Valid options are:

-V
Verbose.

-N <number of clusters>
Specify the number of clusters to generate. If omitted, EM will use cross validation to select the number of clusters automatically.

-I <max iterations>
Terminate after this many iterations if EM has not converged.

-S <seed>
Specify random number seed.

-M <minimum standard deviation>
Set the minimum allowable standard deviation for normal density calculation.
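
The option list above can be read as flag/value pairs plus one bare flag. The following standalone sketch shows one way such an array of strings might be interpreted. It is not Weka's own option handling: the class EmOptionsSketch, its parse method, and the assumption that -N, -I, -S and -M each take their value as the next array element are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative flag/value parser; a hypothetical sketch, not Weka's option handling. */
final class EmOptionsSketch {
    /** Reads -N, -I, -S and -M as flag/value pairs and -V as a bare flag. */
    static Map<String, String> parse(String[] options) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (int i = 0; i < options.length; i++) {
            String flag = options[i];
            switch (flag) {
                case "-V":
                    settings.put(flag, "true");
                    break;
                case "-N":
                case "-I":
                case "-S":
                case "-M":
                    if (i + 1 >= options.length) {
                        throw new IllegalArgumentException("Missing value for " + flag);
                    }
                    settings.put(flag, options[++i]); // consume the value that follows the flag
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + flag);
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        // Example values only; none of them is a documented default.
        String[] options = {"-N", "3", "-I", "50", "-S", "42", "-M", "0.001", "-V"};
        System.out.println(parse(options));
    }
}
```

An array of this shape is the kind of argument that setOptions(java.lang.String[]) is documented to accept.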

Author:
Mark Hall (mhall@cs.waikato.ac.nz)
See Also:
Serialized Form

Constructor Summary
EM()
          Constructor.
 
Method Summary
 void buildClusterer(Instances data)
          Generates a clusterer.
 double densityForInstance(Instance inst)
          Computes the density for a given instance.
 double[] distributionForInstance(Instance inst)
          Predicts the cluster memberships for a given instance.
 boolean getDebug()
          Get debug mode
 int getMaxIterations()
          Get the maximum number of iterations
 double getMinStdDev()
          Get the minimum allowable standard deviation.
 int getNumClusters()
          Get the number of clusters
 java.lang.String[] getOptions()
          Gets the current settings of EM.
 int getSeed()
          Get the random number seed
 java.lang.String globalInfo()
          Returns a string describing this clusterer
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String maxIterationsTipText()
          Returns the tip text for this property
 java.lang.String minStdDevTipText()
          Returns the tip text for this property
 int numberOfClusters()
          Returns the number of clusters.
 java.lang.String numClustersTipText()
          Returns the tip text for this property
protected  void resetOptions()
          Reset to default options
 java.lang.String seedTipText()
          Returns the tip text for this property
 void setDebug(boolean v)
          Set debug mode - verbose output
 void setMaxIterations(int i)
          Set the maximum number of iterations to perform
 void setMinStdDev(double m)
          Set the minimum value for standard deviation when calculating normal density.
 void setNumClusters(int n)
          Set the number of clusters (-1 to select by cross validation).
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setSeed(int s)
          Set the random number seed
 java.lang.String toString()
          Outputs the generated clusters into a string.
protected  double[] weightsForInstance(Instance inst)
          Returns the weights (indicating cluster membership) for a given instance
 
Methods inherited from class weka.clusterers.DistributionClusterer
clusterInstance
 
Methods inherited from class weka.clusterers.Clusterer
forName, makeCopies
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

EM

public EM()
Constructor.
Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this clusterer
Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter gui

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Valid options are:

-V
Verbose.

-N <number of clusters>
Specify the number of clusters to generate. If omitted, EM will use cross validation to select the number of clusters automatically.

-I <max iterations>
Terminate after this many iterations if EM has not converged.

-S <seed>
Specify random number seed.

-M <minimum standard deviation>
Set the minimum allowable standard deviation for normal density calculation.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.
Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

minStdDevTipText

public java.lang.String minStdDevTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMinStdDev

public void setMinStdDev(double m)
Set the minimum value for standard deviation when calculating normal density. Increasing this value can help prevent arithmetic overflow resulting from multiplying large densities (arising from small standard deviations) when there are many singleton or near singleton values.
Parameters:
m - minimum value for standard deviation
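
The following standalone sketch illustrates how a floor on the standard deviation bounds each normal density, and why a product of many such densities can otherwise overflow a double. It is an assumption-laden illustration rather than Weka's code: the class MinStdDevSketch, its method names, and the choice of 40 attributes are hypothetical.

```java
/** Illustrative standard-deviation floor; a hypothetical sketch, not Weka's implementation. */
final class MinStdDevSketch {
    /** Normal density in which the standard deviation is clamped to at least minStdDev. */
    static double normalDensity(double x, double mean, double stdDev, double minStdDev) {
        double s = Math.max(stdDev, minStdDev);
        double z = (x - mean) / s;
        return Math.exp(-0.5 * z * z) / (s * Math.sqrt(2.0 * Math.PI));
    }

    /** Product of numAttributes identical densities, each evaluated at its own mean. */
    static double jointDensityAtMean(int numAttributes, double stdDev, double minStdDev) {
        double product = 1.0;
        for (int a = 0; a < numAttributes; a++) {
            product *= normalDensity(0.0, 0.0, stdDev, minStdDev);
        }
        return product;
    }

    public static void main(String[] args) {
        // Near-singleton values give a tiny standard deviation and a huge density.
        System.out.println("no floor:    " + jointDensityAtMean(40, 1e-9, 0.0));
        System.out.println("floor 0.001: " + jointDensityAtMean(40, 1e-9, 1e-3));
    }
}
```

Without a floor the product exceeds the range of a double and becomes Infinity, while the clamped version stays finite.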

getMinStdDev

public double getMinStdDev()
Get the minimum allowable standard deviation.
Returns:
the minimum allowable standard deviation

seedTipText

public java.lang.String seedTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setSeed

public void setSeed(int s)
Set the random number seed
Parameters:
s - the seed

getSeed

public int getSeed()
Get the random number seed
Returns:
the seed

numClustersTipText

public java.lang.String numClustersTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setNumClusters

public void setNumClusters(int n)
                    throws java.lang.Exception
Set the number of clusters (-1 to select by cross validation).
Parameters:
n - the number of clusters
Throws:
java.lang.Exception - if n is 0

getNumClusters

public int getNumClusters()
Get the number of clusters
Returns:
the number of clusters.

maxIterationsTipText

public java.lang.String maxIterationsTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMaxIterations

public void setMaxIterations(int i)
                      throws java.lang.Exception
Set the maximum number of iterations to perform
Parameters:
i - the number of iterations
Throws:
java.lang.Exception - if i is less than 1

getMaxIterations

public int getMaxIterations()
Get the maximum number of iterations
Returns:
the number of iterations

setDebug

public void setDebug(boolean v)
Set debug mode - verbose output
Parameters:
v - true for verbose output

getDebug

public boolean getDebug()
Get debug mode
Returns:
true if debug mode is set

getOptions

public java.lang.String[] getOptions()
Gets the current settings of EM.
Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

resetOptions

protected void resetOptions()
Reset to default options

toString

public java.lang.String toString()
Outputs the generated clusters into a string.
Overrides:
toString in class java.lang.Object

numberOfClusters

public int numberOfClusters()
                     throws java.lang.Exception
Returns the number of clusters.
Overrides:
numberOfClusters in class Clusterer
Returns:
the number of clusters generated for a training dataset.
Throws:
java.lang.Exception - if number of clusters could not be returned successfully

buildClusterer

public void buildClusterer(Instances data)
                    throws java.lang.Exception
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
Overrides:
buildClusterer in class Clusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully

densityForInstance

public double densityForInstance(Instance inst)
                          throws java.lang.Exception
Computes the density for a given instance.
Overrides:
densityForInstance in class DistributionClusterer
Parameters:
inst - the instance to compute the density for
Returns:
the density.
Throws:
java.lang.Exception - if the density could not be computed successfully

distributionForInstance

public double[] distributionForInstance(Instance inst)
                                 throws java.lang.Exception
Predicts the cluster memberships for a given instance.
Overrides:
distributionForInstance in class DistributionClusterer
Parameters:
inst - the instance to be assigned a cluster.
Returns:
an array containing the estimated membership probabilities of the test instance in each cluster (this should sum to at most 1)
Throws:
java.lang.Exception - if distribution could not be computed successfully
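
A membership array of this kind is typically turned into a single cluster assignment by taking the most probable entry. The following standalone sketch shows that step under stated assumptions: the class MembershipSketch and its methods are hypothetical, and the sketch does not claim to reproduce how the inherited clusterInstance method is implemented.

```java
/** Illustrative handling of a membership distribution; a hypothetical sketch. */
final class MembershipSketch {
    /** Scales non-negative weights so that they sum to 1. */
    static double[] normalize(double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("Weights must have a positive sum");
        }
        double[] distribution = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            distribution[i] = weights[i] / total;
        }
        return distribution;
    }

    /** Index of the largest membership probability (first one wins on ties). */
    static int mostLikelyCluster(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] distribution = normalize(new double[] {1.0, 3.0, 6.0});
        System.out.println("most likely cluster: " + mostLikelyCluster(distribution));
    }
}
```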

weightsForInstance

protected double[] weightsForInstance(Instance inst)
                               throws java.lang.Exception
Returns the weights (indicating cluster membership) for a given instance
Parameters:
inst - the instance to be assigned a cluster
Returns:
an array of weights
Throws:
java.lang.Exception - if weights could not be computed

main

public static void main(java.lang.String[] argv)
Main method for testing this class.
Parameters:
argv - should contain the following arguments:

-t <training file> [-T <test file>] [-N <number of clusters>] [-S <random seed>]