Labeling and Loading Data Tutorial

linfrank edited this page Aug 16, 2012 · 1 revision

Labeling Data

UI programs load a collection of text documents as data. The documents may be a set of files in a directory, or a single file with one document per line. The collection of documents is stored in a TextBase. Annotations of these documents are stored in a corresponding TextLabels object. Each annotation asserts a category or property for a word, a document, or a subsequence of words (a Span). A TextLabels object can hold information from many sources: annotations produced by human labelers (perhaps using a GUI tool like the TextBaseEditor), annotations produced by a hand-written program, or annotations produced by a learned program.

TextLabels can be loaded in two ways:

  1. A labels file
  2. Embedded XML tags

Example of text labels in a labels file:

addToType doc1 184 3 name
addToType doc1 189 11 name
addToType doc2 205 3 name

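Each line of a labels file consists of five words. The first word is the operation (addToType in the example above), the second word is the document name, the third word is the starting token of the span, the fourth word is the length of the span, and the last word is the label.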
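To make the field order concrete, here is a minimal sketch that splits one labels-file line into its parts. The class `LabelsLineParser` and its `SpanLabel` holder are hypothetical names invented for this illustration; they are not part of MinorThird, and MinorThird's own loader may handle additional line types that this sketch simply skips.

```java
/** Hypothetical helper (not part of MinorThird) that reads "addToType" lines. */
public class LabelsLineParser {

    /** Holds the four values carried by one addToType line. */
    public static final class SpanLabel {
        public final String docName; // second word: document name
        public final int start;      // third word: start of the span
        public final int length;     // fourth word: length of the span
        public final String label;   // fifth word: the label

        SpanLabel(String docName, int start, int length, String label) {
            this.docName = docName;
            this.start = start;
            this.length = length;
            this.label = label;
        }
    }

    /** Parses a line such as "addToType doc1 184 3 name"; returns null for any other line. */
    public static SpanLabel parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 5 || !fields[0].equals("addToType")) {
            return null; // not a five-word addToType line
        }
        return new SpanLabel(fields[1],
                Integer.parseInt(fields[2]),
                Integer.parseInt(fields[3]),
                fields[4]);
    }

    public static void main(String[] args) {
        String[] lines = {
            "addToType doc1 184 3 name",
            "addToType doc1 189 11 name",
            "addToType doc2 205 3 name"
        };
        for (String line : lines) {
            SpanLabel s = parseLine(line);
            System.out.println(s.docName + " start=" + s.start
                    + " length=" + s.length + " label=" + s.label);
        }
    }
}
```

Returning null for unrecognized lines keeps the sketch short; a real reader would probably report malformed lines instead of silently skipping them.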

Labels files are not meant to be created by hand. There are a few GUI tools in MinorThird that allow users to graphically see documents, add and/or edit labels, and save their work in a labels file.

To label whole documents, for example when labeling emails as spam or real, try using the TextBaseLabeler:

$ java -Xmx500M edu.cmu.minorthird.text.gui.TextBaseLabeler DATA_DIRECTORY DATA.labels

where DATA_DIRECTORY is the directory in which your documents are stored and DATA.labels is the file in which you would like to save your labels. If you would like MinorThird to load your hand-edited labels automatically, give the labels file the same name as the directory. For example, if you have a directory named foo, name your labels file foo.labels.

A window with a top panel and a bottom panel will appear.

To select a document, click on it in the top panel; the text of that document will appear in the bottom panel. To label the currently selected document, pick a label from the pull-down menu (if the label you want is on that menu) or type a new label next to New class:. Once you have picked your label, press the Accept class: button and move on to the next document you would like to label.

If you look at the documentation index, you will see two other classes for editing a TextBase, DebugMixup and EditLabels. These are useful when you already have some results and want to correct them by hand, but they will not help you label unlabeled documents.

If you would like to hand-label unlabeled extraction data (such as names or places), inserting embedded XML tags is probably your best option.

Example of a document with embedded labels:

The <location>Pittsburgh</location> Steelers headed by coach
<name>Bill Cowher</name> are going to <location>Cleveland</location>
to play the Browns.

In this type of document, a labeled span lies between an opening tag such as <name> and the matching closing tag such as </name>. The label is the word inside the tag.
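As an illustration of how such tags delimit spans, the sketch below pulls each label and its text out of the example document with a regular expression. `EmbeddedLabelExtractor` is a hypothetical class written for this page, not MinorThird's loader, and it assumes flat (non-nested) tags without attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of MinorThird) that lists embedded labels. */
public class EmbeddedLabelExtractor {

    // Matches <label>text</label>. The backreference \1 forces the closing
    // tag to carry the same name as the opening tag. Nested tags are not handled.
    private static final Pattern TAGGED_SPAN =
            Pattern.compile("<(\\w+)>(.*?)</\\1>", Pattern.DOTALL);

    /** Returns one "label=text" string per tagged span, in document order. */
    public static List<String> extract(String document) {
        List<String> spans = new ArrayList<>();
        Matcher m = TAGGED_SPAN.matcher(document);
        while (m.find()) {
            spans.add(m.group(1) + "=" + m.group(2));
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "The <location>Pittsburgh</location> Steelers headed by coach\n"
                + "<name>Bill Cowher</name> are going to <location>Cleveland</location>\n"
                + "to play the Browns.";
        for (String span : extract(doc)) {
            System.out.println(span);
        }
        // Prints:
        // location=Pittsburgh
        // name=Bill Cowher
        // location=Cleveland
    }
}
```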

How to Load Your Labeled Data

  • Simple loading: specify the location of all data in the GUI or by typing -labels PATH/DATA_DIRECTORY on the command line. Note: if you give a .labels file the same name as your directory, the labels in the file will be loaded automatically. XML-style tags will also be loaded automatically.
  • More advanced users (who may have a lot of data) may want to use a repository structure. To do this, you must create a data.properties file in the config directory that points to a directory called repository. The repository directory must then contain three folders: data, labels, and loaders.
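The folder layout for the repository option can be set up by hand, or with a few lines of code such as the sketch below. `RepositoryLayout` is a hypothetical helper, not a MinorThird class. It only creates the three folders named above; it does not write data.properties, because the property names that file needs are not given here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of MinorThird) that creates the repository folders. */
public class RepositoryLayout {

    // The three folders the repository directory must contain.
    private static final String[] SUBFOLDERS = {"data", "labels", "loaders"};

    /** Creates repositoryDir (if missing) with data, labels, and loaders inside it. */
    public static void create(Path repositoryDir) {
        try {
            for (String name : SUBFOLDERS) {
                // createDirectories also creates missing parents and
                // does nothing if the folder already exists.
                Files.createDirectories(repositoryDir.resolve(name));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The default location "repository" is an assumption for this example.
        Path repo = Paths.get(args.length > 0 ? args[0] : "repository");
        create(repo);
        System.out.println("Created data, labels, and loaders under "
                + repo.toAbsolutePath());
    }
}
```

Running it a second time is harmless, since existing folders are left as they are.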