How to handle duplicate data entries #9

jennyzhang-petuum · 2021-04-06T21:51:43Z

Is your feature request related to a problem? Please describe.
When multiple processors in the same pipeline each perform tokenization, they all generate entries of the same type, ft.onto.base_ontology.Token, which results in duplicate data. For example, a processor later in the pipeline would see these mixed entries when counting or predicting and produce confusing results.

Describe the solution you'd like
Several solutions could help with this problem.
The solution currently under consideration is to create ontology subclasses to differentiate between these entries. For example, create a separate Token subclass for each processor that performs tokenization.
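
A minimal sketch of the subclass idea, in plain Python rather than the actual ontology classes. `Token` here is only a stand-in for ft.onto.base_ontology.Token, and `WhitespaceToken`, `SubwordToken`, and `ToyPack` are hypothetical names made up for illustration. It assumes entries are retrieved by type and that a query for a base type also matches its subclasses.

```python
from dataclasses import dataclass
from typing import List, Type, TypeVar


@dataclass
class Token:
    """Stand-in for ft.onto.base_ontology.Token: a span of text."""
    begin: int
    end: int


class WhitespaceToken(Token):
    """Hypothetical subclass for tokens from a whitespace tokenizer."""


class SubwordToken(Token):
    """Hypothetical subclass for tokens from a subword tokenizer."""


T = TypeVar("T", bound=Token)


class ToyPack:
    """Toy container that returns entries by type, subclasses included."""

    def __init__(self) -> None:
        self.entries: List[Token] = []

    def add(self, entry: Token) -> None:
        self.entries.append(entry)

    def get(self, entry_type: Type[T]) -> List[T]:
        return [e for e in self.entries if isinstance(e, entry_type)]


pack = ToyPack()
text = "unhappy cats"

# First tokenizer: split on whitespace -> "unhappy", "cats"
pack.add(WhitespaceToken(0, 7))
pack.add(WhitespaceToken(8, 12))

# Second tokenizer: subword pieces -> "un", "happy", "cats"
pack.add(SubwordToken(0, 2))
pack.add(SubwordToken(2, 7))
pack.add(SubwordToken(8, 12))

all_tokens = pack.get(Token)             # mixed output of both tokenizers
word_tokens = pack.get(WhitespaceToken)  # output of one tokenizer only
print(len(all_tokens), len(word_tokens))  # prints "5 2"
```

A downstream processor that asks for the base type still sees the mixed entries, so it would need to request the specific subclass it expects.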

Describe alternatives you've considered
Add another field to the entry records in the datapack file to separate them, but that is a bit bulky for the datapack file.
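
One possible reading of this alternative, sketched in plain Python. The `creator` field, the processor names, and the `tokens_from` helper are all hypothetical and only illustrate the idea; this is not the datapack format itself.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Token:
    """Stand-in token entry with an extra field naming its creator."""
    begin: int
    end: int
    creator: str  # hypothetical field: which processor produced this entry


def tokens_from(entries: List[Token], creator: str) -> List[Token]:
    """Return only the entries produced by the given processor."""
    return [e for e in entries if e.creator == creator]


entries = [
    Token(0, 7, "whitespace_tokenizer"),
    Token(8, 12, "whitespace_tokenizer"),
    Token(0, 2, "subword_tokenizer"),
    Token(2, 7, "subword_tokenizer"),
    Token(8, 12, "subword_tokenizer"),
]

selected = tokens_from(entries, "whitespace_tokenizer")

# Every serialized record repeats the creator string, which is the
# bulkiness concern mentioned above.
records = [asdict(e) for e in entries]
```

Filtering works without any new types, at the cost of one extra value stored on every record.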

hunterhector · 2021-04-06T21:54:22Z

The same problem happens for all entry types, for example, duplicated Sentence entries. So we can use the same idea to handle all of them.

hunterhector assigned jennyzhang-petuum Apr 28, 2021
