Coding Hints
These are not step-by-step instructions. This document is intended to give those with some background in machine learning an idea of where to find the important classes and an introduction to their use. It should be considered a very high-level map pointing to which parts of the code to browse.
If you're using MinorThird you're probably doing one (or a mix of several) of:
- Creating new learning algorithms
- Searching for better features for learning on text
- Using existing learners & feature extractors to classify documents or extract information
If you're doing something else let us know! In all of the above possibilities you'll be going through the following set of choices:
- Pick a task - annotation, classification, etc.
- Choose or develop a learner
- Choose or develop a feature extractor if working with text
- Pick some data
- Decide on a testing method - cross-validation or a fixed test set
- Evaluate or use the results
If your task doesn't involve text (or if you've already converted the input to features) then you'll only need the classify package. Learners are in classify.algorithms or classify.sequence. Data needs to be in a Dataset object, implemented by BasicDataset. You can add one Example at a time, or load from a file using DatasetLoader. To run an experiment, call one of the static methods in classify.experiments.Tester. If you want to use an already trained classifier as part of a larger system (i.e., not experimenting), you'll want to have that classifier serialized and call its classification method directly; more on that later.
If your task involves text then data will need to be in a TextBase/TextLabels pair. You'll also use code from text.learn.experiments to run any experiments; classification of a text document uses the text.learn.experiments.ClassifyExperiment code and learners from the classify package; annotation / extraction uses the TextLabelsExperiment code.
Teacher - The ClassifierTeacher and AnnotatorTeacher provide a protocol for training learners, with mechanisms for batch learning, online learning, and active learning.
ClassifierLearner - The interface which all classification learners must implement. It accepts Example instances and generates a Classifier, plus other methods for active learning and constraints on Example. There are a few abstract implementation classes of note: BatchClassifierLearner for standard batch training, OnlineClassifierLearner for online learning algorithms, and binary extensions of each.
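To make the batch/online distinction concrete, here is a toy online learner in plain Java. This is not one of the MinorThird interfaces and the names are hypothetical; it just shows the shape of online learning: absorb one labeled example at a time, and be able to produce a prediction at any point.

```java
public class MajorityLearner {
    private int pos = 0, neg = 0;

    // Online learning: absorb one labeled example at a time.
    public void addExample(boolean label) {
        if (label) pos++; else neg++;
    }

    // The "classifier" produced so far: predict the majority class seen.
    public boolean classify() {
        return pos >= neg;
    }
}
```

A batch learner, by contrast, would receive the whole dataset up front and produce its classifier once.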
AnnotatorLearner - The AnnotatorLearner interface and abstract implementations are analogous, but work with spans of documents; they return an Annotator and can be queried for the SpanFeatureExtractor used.
SpanFeatureExtractor - This is an interface for extracting an Instance from a Span of text.
SpanFE - An abstract class implementing SpanFeatureExtractor which provides a number of methods for a declarative style of extracting features / instances from text. Subclasses call methods such as from (get tokens from a span), lc (lower-case those tokens), etc. There are examples all over the code.
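To illustrate the style in stand-alone form (this is not the SpanFE API; the helper name below is made up), an extractor that takes the tokens of a span, lower-cases them, and emits one feature per token might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class TinyFE {
    // Emit lowercase unigram features from a whitespace-tokenized span,
    // in the spirit of SpanFE's from and lc operations.
    public static List<String> lcTokenFeatures(String span) {
        List<String> features = new ArrayList<>();
        for (String tok : span.split("\\s+")) {
            features.add("lc.token=" + tok.toLowerCase());
        }
        return features;
    }
}
```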
SampleFE - SampleFE implements a few typical feature extractors as singleton instances. If you're starting out, use an extractor from here, or use it as sample code for producing your own.
Mixup - Mixup is a regular expression-like language for labeling text. Mixup can be used as a non-learning annotator in MinorThird. It can also be used to label text with tags which will be used as features (such as part-of-speech). Roughly speaking, regular expressions work at the character level while Mixup works at the token level. See Mixup Language for further details.
DatasetLoader - Loads and saves datasets. Three formats are supported:
- MinorThird style:
type subpopulationid label feature1=weight1 feature2=weight2 ...
- SVM-Light style:
label feature1:weight1 feature2:weight2 ...
- Sequence datasets: each Example in a sequence is saved on a separate line in the first style, with an asterisk (*) alone on a line separating the sequences.
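To make the second format concrete, here is a small stand-alone parser for an SVM-Light style line. This is illustrative code, not MinorThird's DatasetLoader:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SvmLightLine {
    // Parse a line like "1 color:0.5 size:2.0" into (label, feature map).
    public static Map.Entry<String, Map<String, Double>> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        Map<String, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int colon = parts[i].lastIndexOf(':');
            features.put(parts[i].substring(0, colon),
                         Double.parseDouble(parts[i].substring(colon + 1)));
        }
        return Map.entry(parts[0], features);
    }
}
```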
TextBaseLoader - The TextBaseLoader is configured via the constructor based on your data format. Calling load(File) will return a TextBase object. There are also two static methods for loading our most common data formats:
- Load from a single file counting each line as one document
- Load from a directory where each file is a document and there are XML labels embedded in the text
The code has extensive Javadoc on all the format options supported and how to configure for them.
TextLabelsLoader - Loads a labeling from a file. There is one parameter to set the closure policy on types. Labels can be loaded from a serialized object or from a list of type operations. The class also allows for saving the types in an XML format.
SimpleTextLoader - Provides two methods for retrieving TextLabels (which also points to a TextBase) for standard data formats.
Splitter - The Splitter interface provides an Iterator over training and test examples. It is used by all the experimentation infrastructure to control the data available.
CrossValidationSplitter - Handles a cross validation experiment. It can be set by the number of folds or by the percentage of data to use for training.
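The splitting idea behind cross-validation can be sketched in a few lines of plain Java. This is a simplified illustration that assigns examples to folds round-robin; the real CrossValidationSplitter is more sophisticated:

```java
import java.util.ArrayList;
import java.util.List;

public class KFold {
    // The test portion of fold k (0-based) out of numFolds.
    public static <T> List<T> testFold(List<T> data, int k, int numFolds) {
        List<T> test = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            if (i % numFolds == k) test.add(data.get(i));
        }
        return test;
    }

    // Training data is everything outside the test fold.
    public static <T> List<T> trainFold(List<T> data, int k, int numFolds) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            if (i % numFolds != k) train.add(data.get(i));
        }
        return train;
    }
}
```

Running the experiment once per fold and averaging the resulting evaluations gives the cross-validated estimate.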
FixedTestSplitter - All the experiment code requires a Splitter implementation. For classification, a fixed test set can be used by wrapping it in a FixedTestSplitter, which requires an iterator over examples.
classify.*.Tester - Tester provides a few static methods for running a classification or sequence classification experiment. The evaluate methods all return an Evaluation object.
text.*.ClassifyExperiment - The main method of ClassifyExperiment takes several parameters (see usage()) and runs an experiment, displaying the results in a GUI. There are several constructors for configuring an experiment, which can then be run with the evaluate() method.
text.*.TextLabelsExperiment - Similar to classification, TextLabelsExperiment has a main method which takes parameters to do an annotation experiment. The guts are in doExperiment().
classify.*.Evaluation - The Evaluation class stores and calculates several metrics for an experiment. Through the toGUI() method it can produce a dialog with several panels to display the results of that experiment.
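The core metrics such an evaluation reports (precision, recall, F1) are easy to compute by hand. A minimal sketch in plain Java, not the Evaluation API:

```java
public class BinaryMetrics {
    public final double precision, recall, f1;

    // Compare predicted and gold boolean labels of equal length.
    public BinaryMetrics(boolean[] predicted, boolean[] gold) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] && gold[i]) tp++;
            else if (predicted[i]) fp++;
            else if (gold[i]) fn++;
        }
        precision = tp == 0 ? 0 : (double) tp / (tp + fp);
        recall = tp == 0 ? 0 : (double) tp / (tp + fn);
        f1 = (precision + recall) == 0 ? 0
                : 2 * precision * recall / (precision + recall);
    }
}
```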
Here I attempt to take an aerial photo of the code to give you a map of where things are.
First, MinorThird has three main modules: Classify, Text, and Util. The separation of modules is based on "allowed to use" principles. Util is used by any of the others, but cannot use any other module. Classify uses Util, but not Text. Text uses both Classify and Util.
The util module contains a small number of classes providing basic functions (math, string helpers, and enhanced iterators).
The classify package provides all the classes needed to do classification experiments. Representation of datasets and methods for loading/saving them are provided, along with representations of instances, examples, and features. Interfaces for teachers and learners are placed here.
Sub-packages include a set of algorithms implementing the learning interface. Sequence learner interfaces and infrastructure are under the sequential package. Finally, the experiments package contains infrastructure for evaluation of results and splitter implementations.
The text package provides code needed to handle machine learning on text. The TextBase (containing the underlying text of the data) and TextLabels (the labeling or annotations on that text) interfaces are here, along with a base implementation of both. There are classes for text tokens and spans (a set of contiguous tokens, up to a full document). The TextBaseLoader can be configured to load data into a TextBase. The Annotator interface defines the methods provided by any class doing extraction from text.
The mixup package contains the parser for the Mixup language. The Mixup and MixupProgram classes contain documentation on creating Mixup code, including an example, BNF, and a brief discussion of semantics.
The gui package holds a number of classes used for displaying spans, text bases, editing labels, experiment results, etc.
The learn package holds several algorithms for learned annotators. The SampleLearners class contains a few singleton learner instances. The abstract SpanFE class provides facilities for a declarative style of extracting important features. The learn package has a sub-package, experiments, which is analogous to classify.experiments. It holds classes for running a labeling experiment, a sequential labeling experiment, and for running classification on a full document.
To make a class serializable, I recommend using the following template:
```java
import java.io.Serializable;

public class MyClass implements Serializable {
    private static final long serialVersionUID = 1;
    private int serialVersion = 1;
    // all your code here
}
```
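To check that a class following this template actually serializes, a round trip through the standard java.io streams is enough. The Point class below is a hypothetical example following the template:

```java
import java.io.*;

public class SerializationDemo {
    // A hypothetical serializable class following the template above.
    public static class Point implements Serializable {
        private static final long serialVersionUID = 1;
        private int serialVersion = 1;
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Serialize an object to bytes and read it back.
    public static Point roundTrip(Point p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Point) in.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Point copy = roundTrip(new Point(3, 4));
        System.out.println(copy.x + "," + copy.y); // prints 3,4
    }
}
```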
MinorThird uses log4j. To use it, I recommend doing something like this:
```java
import org.apache.log4j.*;

public class MyClass {
    private static Logger log = Logger.getLogger(MyClass.class);

    public void myMethod(MyOtherClass fooBar) {
        if (somethingLooksOdd()) log.warn("this is a warning");
        log.info("myMethod called with " + fooBar);
        for (int i = 0; i < 1000; i++) {
            // note: the 'if' statement here is important for efficiency,
            // otherwise the string-concatenation argument to log.debug
            // will be evaluated each iteration, which can be expensive!
            if (log.isDebugEnabled()) log.debug("on iteration " + i + " state is: " + fooBar);
            // do something here
        }
    }

    public String toString() { return "[MyClass: " + aShortStringForDebugging + "]"; }
}
```
You can set the output level by modifying your copy of log4j.properties (somewhere on your classpath, usually in minorthird/config); for example:
log4j.logger.edu.cmu.minorthird.somepackage.MyClass=DEBUG
You can also use INFO, WARN, or ERROR as values. Or you can read all about log4j on the web and learn how to redirect the logs to files, reformat them, and all sorts of other stuff.
If you want to use the MinorThird progress counters, do it like this (the numExamples argument is optional):
```java
ProgressCounter pc = new ProgressCounter("training ultimateLearner1", "example", numExamples);
for (int i = 0; i < numExamples; i++) {
    // do something
    pc.progress();
}
pc.finished();
```
The implementation of pc.progress() checks a clock and returns unless more than one second has passed since the last user output was printed, so it's okay to put it in a tight inner loop: most of the time, it just amounts to a call on System.currentTimeMillis().
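If you need the same throttling trick outside MinorThird, it is easy to replicate. A minimal sketch (a hypothetical class, not ProgressCounter itself):

```java
public class ThrottledCounter {
    private final long intervalMs;
    private long lastPrint = 0;
    private long count = 0;

    public ThrottledCounter(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    // Cheap in the common case: one clock read, no printing.
    public void progress() {
        count++;
        long now = System.currentTimeMillis();
        if (now - lastPrint >= intervalMs) {
            System.out.println("processed " + count + " items");
            lastPrint = now;
        }
    }

    public long count() { return count; }
}
```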