
Helpful Hints for Coding with MinorThird

These are not step-by-step instructions. This document is intended to give those with some background in machine learning an idea of where to find the important classes and an introduction to their use. It should be considered a very high-level map pointing to which parts of the code to browse.

Introduction

If you're using MinorThird you're probably doing one (or a mix of several) of:

  • Creating new learning algorithms
  • Searching for better features for learning on text
  • Using existing learners & feature extractors to classify documents or extract information

If you're doing something else let us know! In all of the above possibilities you'll be going through the following set of choices:

  • Pick a task - annotation, classification, etc.
  • Choose or develop a learner
  • Choose or develop a feature extractor if working with text
  • Pick some data
  • Decide on a testing method - cross-validation or a fixed test set
  • Evaluate or use the results

If your task doesn't involve text (or if you've already converted the input to features) then you'll only need the classify package. Learners are in classify.algorithms or classify.sequence. Data needs to be in a Dataset object, implemented by BasicDataset. You can add one Example at a time, or load from a file using DatasetLoader. To run an experiment, call one of the static methods in classify.experiments.Tester. If you want to use an already trained classifier as part of a larger system (i.e., not experimenting), you'll want to have that classifier serialized and call its classification method directly; more on that later.
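
For example, building a small Dataset by hand might look like the sketch below. The class names are the ones from the classify package described above, but the exact constructors and method signatures are from memory, so check the Javadoc before relying on them.

import edu.cmu.minorthird.classify.*;

// build a dataset one Example at a time (or load one with DatasetLoader, described below)
Dataset dataset = new BasicDataset();
MutableInstance instance = new MutableInstance();
instance.addBinary(new Feature("containsWidget"));       // a binary feature
instance.addNumeric(new Feature("widgetCount"), 3.0);    // a numeric (weighted) feature
dataset.add(new Example(instance, ClassLabel.binaryLabel(+1)));  // a positive example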

If your task involves text then data will need to be in a TextBase/TextLabels pair. You'll also use code from text.learn.experiments to run any experiments; classification of a text document uses the text.learn.experiments.ClassifyExperiment code and learners from the classify package; annotation / extraction uses the TextLabelsExperiment code.

Learning

Teacher - The ClassifierTeacher and AnnotatorTeacher classes define the protocol for training learners, with support for batch learning, online learning, and active learning.

ClassifierLearner - The interface which all classification learners must implement. It accepts Example instances and generates a Classifier, and also declares methods for active learning and for constraining the Examples a learner will accept. There are a few abstract implementation classes of note: BatchClassifierLearner for standard batch training, OnlineClassifierLearner for online learning algorithms, and binary extensions of each.

AnnotatorLearner - The AnnotatorLearner interface and its abstract implementations are analogous, but they work with Spans from documents, return an Annotator, and can be queried for the SpanFeatureExtractor used.

Feature Extraction

SpanFeatureExtractor - This is an interface for extracting an Instance from a Span of text.

SpanFE - An abstract class implementing SpanFeatureExtractor which provides a number of methods for a declarative style of extracting features / instances from text. Subclasses will call a number of methods such as from (get tokens from a span), lc (lower case those tokens), etc. There are examples all over the code.
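
As a sketch of what that looks like in practice (the pipeline methods shown here, tokens and emit in particular, are assumptions based on the pattern just described; the examples in the code and the Javadoc have the authoritative signatures):

import edu.cmu.minorthird.text.*;
import edu.cmu.minorthird.text.learn.*;

// an anonymous SpanFE subclass that emits each token, and its lower-cased form, as features
SpanFeatureExtractor fe = new SpanFE() {
  public void extractFeatures(TextLabels labels, Span s) {
    from(s).tokens().emit();       // one feature per token in the span
    from(s).tokens().lc().emit();  // plus one per lower-cased token
  }
};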

SampleFE - SampleFE implements a few typical feature extractors as singleton instances. If you're starting out, use an extractor from here, or use it as sample code for producing your own.

Mixup - Mixup is a regular expression-like language for labeling text. Mixup can be used as a non-learning annotator in MinorThird. It can also be used to label text with tags that will then be used as features (such as part-of-speech tags). Roughly speaking, regular expressions work at the character level while Mixup works at the token level. See Mixup Language for further details.
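
For flavor, a tiny illustrative Mixup fragment (the dictionary name and span type here are made up, and the syntax should be double-checked against the Mixup Language page) that marks days of the week might read:

defDict days = monday,tuesday,wednesday,thursday,friday,saturday,sunday;
defSpanType dayOfWeek =: ... [ai(days)] ... ;

Here ai(days) matches a single token found in the dictionary, ignoring case, and the brackets mark the matched token as a dayOfWeek span.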

Loading Data

DatasetLoader - Loads and saves datasets. Three formats are supported:

  1. MinorThird style:
type subpopulationid label feature1=weight1 feature2=weight2 ...
  2. SVM-Light style:
label feature1:weight1 feature2:weight2 ...
  3. Sequence datasets: each Example in a sequence is saved on a separate line in the first style, with an asterisk (*) alone on a line separating the sequences.
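
Loading a file in one of these formats is then a one-liner; a sketch, assuming a static loadFile method (check DatasetLoader's Javadoc for the exact name and the exceptions it throws):

import java.io.File;
import edu.cmu.minorthird.classify.*;

Dataset dataset = DatasetLoader.loadFile(new File("train.data"));  // assumed static loader; may throw IOException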

TextBaseLoader - The TextBaseLoader is configured via the constructor based on your data format. Calling load(File) will return a TextBase object. There are also two static methods for loading our most common data formats:

  1. Load from a single file counting each line as one document
  2. Load from a directory where each file is a document and there are XML labels embedded in the text

The code has extensive Javadoc on all the format options supported and how to configure for them.
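
A sketch of the second case, a directory of files with embedded XML labels (the constructor flags and the getLabels call are assumptions on my part; the Javadoc lists the real configuration options):

import java.io.File;
import edu.cmu.minorthird.text.*;

// one document per file, with labels embedded as XML tags in the text
TextBaseLoader loader = new TextBaseLoader(TextBaseLoader.DOC_PER_FILE, TextBaseLoader.USE_XML);
TextBase base = loader.load(new File("myLabeledDocs"));  // load(File) returns the TextBase
TextLabels labels = loader.getLabels();                  // assumed accessor for the labels parsed from the XML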

TextLabelsLoader - Loads a labeling from a file. There is one parameter to set the closure policy on types. Labels can be loaded from a serialized object or from a list of type operations. The class also allows for saving the types in an XML format.

SimpleTextLoader - Provides two methods for retrieving TextLabels (which also points to a TextBase) for standard data formats.

Testing

Splitter - The Splitter interface provides an Iterator over training and test examples. It is used by all the experimentation infrastructure to control the data available.

CrossValidationSplitter - Handles a cross-validation experiment. It can be configured either by the number of folds or by the percentage of data to use for training.

FixedTestSplitter - All the experiment code requires a Splitter implementation, so for classification a fixed test set is used by wrapping it in a FixedTestSplitter, which is constructed from an iterator over the test examples.
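
For example (class names as used on this page; check classify.experiments for the exact class names and constructor arguments):

import edu.cmu.minorthird.classify.*;
import edu.cmu.minorthird.classify.experiments.*;

// 10-fold cross-validation
Splitter splitter = new CrossValidationSplitter(10);

// ...or a fixed, pre-built test set (testDataset is a hypothetical Dataset of held-out Examples)
Splitter fixedSplitter = new FixedTestSplitter(testDataset.iterator());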

Running Experiments

classify.*.Tester - Tester provides a few static methods for running a classification or sequence classification experiment. The evaluate methods all return an Evaluation object.

text.*.ClassifyExperiment - The main method of ClassifyExperiment takes several parameters (see usage()) and runs an experiment, displaying the results in a GUI. There are several constructors for configuring an experiment, which can then be run with the evaluate() method.

text.*.TextLabelsExperiment - Similar to classification, TextLabelsExperiment has a main method which takes parameters to do an annotation experiment. The guts are in doExperiment().

classify.*.Evaluation - The Evaluation class stores and calculates several metrics for an experiment. Through the toGUI() method it can produce a dialog with several panels to display the results of that experiment.
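
Putting the pieces together, a classification experiment run from code looks roughly like the sketch below, using the dataset and splitter from the earlier sketches. The exact signature of Tester.evaluate and the location of NaiveBayes are assumptions; the pattern of learner plus dataset plus splitter in, Evaluation out, is the one described above.

import edu.cmu.minorthird.classify.*;
import edu.cmu.minorthird.classify.algorithms.linear.*;
import edu.cmu.minorthird.classify.experiments.*;

ClassifierLearner learner = new NaiveBayes();                   // any ClassifierLearner will do
Evaluation eval = Tester.evaluate(learner, dataset, splitter);  // assumed form of Tester's static evaluate
eval.toGUI();                                                   // the multi-panel results display described above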

Code Structure Overview

Here I attempt to take an aerial photo of the code to give you a map of where things are.

First, MinorThird has three main modules: Classify, Text, and Util. The separation of modules is based on "allowed to use" principles. Util is used by any of the others, but cannot use any other module. Classify uses Util, but not Text. Text uses both Classify and Util.

Util

A small number of classes providing basic functions (math, string helpers, and enhanced iterators).

Classify

The classify package provides all the classes needed to do classification experiments. Representations of datasets and methods for loading/saving them are provided, along with representations of instances, examples, and features. The interfaces for teachers and learners also live here.

Sub-packages include a set of algorithms implementing the learning interface. Sequence learner interfaces and infrastructure are under the sequential package. Finally, the experiments package contains infrastructure for evaluation of results and splitter implementations.

Text

The text package provides code needed to handle machine learning on text. TextBase (containing the underlying text of the data) and TextLabels (the labeling or annotations on that text) interfaces are here, along with a base implementation for both. There are classes for text tokens and spans (a set of contiguous tokens, up to a full document). The TextBaseLoader can be configured to load data into a TextBase. The Annotator interface defines the methods provided by any class doing extraction from text.

The mixup package contains the parser for the Mixup language. The Mixup and MixupProgram classes contain documentation on creating Mixup code, including an example, BNF, and brief discussion of semantics.

The gui package holds a number of classes used for displaying spans, text bases, editing labels, experiment results, etc.

The learn package holds several algorithms for learned annotators. The SampleLearners class contains a few singleton learner instances. The abstract SpanFE class provides facilities for a declarative style of extracting important features. The learn package has a sub-package, experiments, analogous to classify.experiments; it holds classes for running a labeling experiment, a sequential labeling experiment, and classification on a full document.

General Coding Suggestions

Serializability

To make a class serializable, I recommend using the following template:

import java.io.Serializable;
public class MyClass implements Serializable {
  // fixes the stream version ID used by Java serialization; keep it constant
  // unless you deliberately want to break compatibility with old serialized data
  private static final long serialVersionUID = 1;
  // an ordinary instance field, so this version number is written out with every serialized object
  private int serialVersion = 1;
  // all your code here
}
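
To actually save and load a trained object, such as the classifier mentioned in the introduction, plain Java serialization is enough. In this sketch, classifier stands for whatever trained Classifier you want to persist:

import java.io.*;
import edu.cmu.minorthird.classify.*;

// save the trained classifier (any Serializable object) to disk...
ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("myClassifier.ser"));
out.writeObject(classifier);
out.close();

// ...and load it back later, e.g. inside a larger system (throws IOException / ClassNotFoundException)
ObjectInputStream in = new ObjectInputStream(new FileInputStream("myClassifier.ser"));
Classifier restored = (Classifier) in.readObject();
in.close();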

Logging

MinorThird uses log4j. To use it, I recommend doing something like this:

import org.apache.log4j.*;
public class MyClass {
  private static Logger log = Logger.getLogger(MyClass.class);

  public void myMethod(MyOtherClass fooBar) {
    if (somethingLooksOdd()) log.warn("this is a warning");
    log.info("myMethod called with "+fooBar);
    for (int i=0; i<1000; i++) {
      // note: the 'if' statement here is important for efficiency,
      // otherwise the string-concatenation argument to log.debug
      // will be evaluated on each iteration, which can be expensive!
      if (log.isDebugEnabled()) log.debug("on iteration "+i+" state is: "+fooBar);
      // do something here
    }
  }
  public String toString() { return "[MyClass: "+aShortStringForDebugging+"]"; }
}

You can set the output level by modifying your copy of log4j.properties (somewhere on your classpath, usually in minorthird/config); for example:

log4j.logger.edu.cmu.minorthird.somepackage.MyClass=DEBUG

You can also use INFO, WARN, or ERROR as values. Or you can read all about log4j on the web and learn how to redirect the logs to files, reformat them, and all sorts of other stuff.

Progress Counters

If you want to use the MinorThird progress counters, do it like this (the numExamples argument is optional):

ProgressCounter pc = new ProgressCounter("training ultimateLearner1","example",numExamples);
for (int i=0; i<numExamples; i++) {
  // do something 
  pc.progress();
}
pc.finished();

The implementation of pc.progress() checks a clock and returns unless more than one second has passed since the last user output was printed, so it's okay to put it in a tight inner loop: most of the time, it just amounts to a call on System.currentTimeMillis().