SUIM (Spark for Unstructured Information) provides a thin abstraction layer for UIMA
on top of Spark.
SUIM leverages Spark's resilient distributed datasets (RDDs) to run UIMA pipelines built with uimaFIT.
SUIM pipelines are distributed across the nodes of a cluster and can be operated on in parallel [1].
SUIM allows you to run analytical pipelines on the resulting (or intermediate) CASes
to execute further text analytics or machine learning algorithms.
Using the RoomNumberAnnotator
from the UIMA tutorial:
val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
// Read the documents in the "data" directory into an RDD of SCASes
val params = Seq(FileSystemCollectionReader.PARAM_INPUTDIR, "data")
val rdd = makeRDD(createCollectionReader(classOf[FileSystemCollectionReader], params: _*), sc)
// Run the annotator over each SCAS and collect the RoomNumber annotations
val rnum = createEngineDescription(classOf[RoomNumberAnnotator])
val rooms = rdd.map(process(_, rnum)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[RoomNumber]))
// Count room numbers per building
val counts = rooms.map(room => room.getBuilding()).map((_,1)).reduceByKey(_ + _)
counts.foreach(println(_))
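For reference, RoomNumberAnnotator is the regex-based example annotator that ships with the
UIMA tutorial. A simplified uimaFIT-style sketch of such an annotator (the shipped version
recognizes several room-number patterns and derives the building from the pattern that matched):

import java.util.regex.Pattern
import org.apache.uima.fit.component.JCasAnnotator_ImplBase
import org.apache.uima.jcas.JCas
import org.apache.uima.tutorial.RoomNumber

class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  // Simplified pattern for Yorktown-style room numbers, e.g. "01-144"
  private val roomPattern = Pattern.compile("\\b\\d\\d-\\d\\d\\d\\b")

  override def process(jcas: JCas): Unit = {
    val matcher = roomPattern.matcher(jcas.getDocumentText)
    while (matcher.find()) {
      val room = new RoomNumber(jcas, matcher.start, matcher.end)
      room.setBuilding("Yorktown") // assumption: a single building, for brevity
      room.addToIndexes()
    }
  }
}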
If the collection is too large to fit in memory, or you already have a collection of SCASes,
use an HDFS RDD:
val rdd = sequenceFile(createCollectionReader(classOf[FileSystemCollectionReader], params: _*),
    "hdfs://localhost:9000/documents", sc)
Use DKPro Core [2] to tokenize and Spark to do token-level analytics.
val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
// Read plain-text files from "data" with the DKPro Core TextReader
val rdd = makeRDD(createCollectionReader(classOf[TextReader],
    ResourceCollectionReaderBase.PARAM_SOURCE_LOCATION, "data",
    ResourceCollectionReaderBase.PARAM_LANGUAGE, "en",
    ResourceCollectionReaderBase.PARAM_PATTERNS, Array("[+]*.txt")), sc)
// Tokenize each SCAS and pull out the Token annotations
val seg = createPrimitiveDescription(classOf[BreakIteratorSegmenter])
val tokens = rdd.map(process(_, seg)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[Token]))
// Count token frequencies and sort ascending by count
val counts = tokens.map(token => token.getCoveredText())
    .filter(filter(_))
    .map((_,1)).reduceByKey(_ + _)
    .map(pair => (pair._2, pair._1)).sortByKey(true)
counts.foreach(println(_))
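The filter helper above is left undefined in the snippet; a minimal hypothetical implementation
that keeps only alphabetic tokens longer than one character could be:

// Hypothetical token filter: drop punctuation and single characters
def filter(token: String): Boolean =
  token.length > 1 && token.forall(_.isLetter)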
To build:
mvn compile
To run:
mvn scala:run
To test:
mvn test
To create a standalone jar with dependencies:
mvn package
java -jar target/spark-uima-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar
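To run the assembled jar on a cluster, you would typically go through spark-submit rather than
java -jar; a sketch, where the main class and master URL are placeholders for your own setup:
spark-submit --class suim.Main --master spark://master:7077 \
  target/spark-uima-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar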