Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring Data Store Structure #882

Merged
merged 27 commits into from
Aug 22, 2022
Merged

Conversation

Pushkar-Bhuse
Copy link
Collaborator

@Pushkar-Bhuse Pushkar-Bhuse commented Jul 18, 2022

This PR fixes #874

Description of changes

This PR attempts to transform how data is stored in the DataStore entry. The main idea behind this new format is that all attributes of an entry other than its tid and type_name should be considered as data class attributes. As a result of this, the DataStore format of any entry can be visualized as : [tid, type_name, ....(dataclass attributes)...].

For example, the type_attrubutes of Sentence can be seen as

{
            "attributes": {
                "begin": 2,
                "end": 3,
                "payload_idx": 4,
                "speaker": 5,
                "part_id": 6,
                "sentiment": 7,
                "classification": 8,
                "classifications": 9,
            },
            "parent_class": set(),
        }

The way this is implemented is by creating a class variable for Entry called cached_attribute_data. This is a dict that stores the initial values of dataclass attributes. The implementation makes sure that before initializing a data store entry for a given entry object, the cached_attribute_data dict store all data class attributes and their initial values. There are two ways in which a dataclass attribute can be added to cached_attribute_data

  1. Attribute has a entry_setter property: In this case, the entry will automatically be added to cached_attribute_data.
  2. Attribute does not have a entry_setter property: In this case, attributes values are store in the entry object. These are fetched before the creation of the data store entry to populate cached_attribute_data.

Possible influences of this PR.

  1. Since all attributes are now dataclass attributes, we do not need to rely on constants. Instead, we use the function get_datastore_attr_idx to fetch the position in the datastore where a given attributes is stored.
  2. The new format makes DataStore more scalable now since any new attribute can be added to the entry as well as its datastore but declaring it as a dataclass attribute

Test Conducted

The main aim of this PR was to keep the outermost interface unchanged and still be able to pass the data_store_test, data_pack_test and multi_pack_test

@Pushkar-Bhuse
Copy link
Collaborator Author

This PR does apply changes to CV Ontologies since it is currently getting updated itself.

Copy link
Member

@hunterhector hunterhector left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a large PR, maybe too large so I am not too confident in reviewing this. There are currently many changes in interfaces that won't be merged too. Maybe we should actually break down this work into multiple PRs instead of doing it at once.

  1. We need to understand the DataPack interface layer and DataStore interface layer, we don't want to make any changes to public PRs of DataPack
  2. For safety, let's do not add any type: ignore in this PR since a lot of them are caused by real errors
  3. Please double-check whether serialization/deserialization data packs work after this PR.

forte/common/constants.py Outdated Show resolved Hide resolved
forte/data/ontology/core.py Outdated Show resolved Hide resolved
forte/data/base_pack.py Outdated Show resolved Hide resolved
forte/data/container.py Outdated Show resolved Hide resolved
forte/data/data_pack.py Outdated Show resolved Hide resolved
forte/data/data_store.py Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
@Pushkar-Bhuse Pushkar-Bhuse force-pushed the structure branch 5 times, most recently from c29928d to db414ad Compare July 30, 2022 19:54
@codecov
Copy link

codecov bot commented Aug 1, 2022

Codecov Report

Merging #882 (c5b3af8) into master (d6b137d) will increase coverage by 0.04%.
The diff coverage is 93.02%.

@@            Coverage Diff             @@
##           master     #882      +/-   ##
==========================================
+ Coverage   80.91%   80.95%   +0.04%     
==========================================
  Files         254      254              
  Lines       19551    19569      +18     
==========================================
+ Hits        15819    15843      +24     
+ Misses       3732     3726       -6     
Impacted Files Coverage Δ
forte/data/extractors/relation_extractor.py 24.32% <0.00%> (ø)
tests/forte/data/data_pack_test.py 98.85% <ø> (ø)
forte/data/ontology/top.py 76.43% <90.78%> (-1.74%) ⬇️
forte/data/base_pack.py 76.17% <93.33%> (+0.51%) ⬆️
forte/data/data_store.py 93.24% <93.50%> (-0.07%) ⬇️
forte/common/constants.py 100.00% <100.00%> (ø)
forte/data/base_store.py 75.34% <100.00%> (-0.34%) ⬇️
forte/data/data_pack.py 85.90% <100.00%> (+0.02%) ⬆️
forte/data/entry_converter.py 88.31% <100.00%> (+5.41%) ⬆️
forte/data/ontology/core.py 76.95% <100.00%> (+0.08%) ⬆️
... and 6 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@mylibrar
Copy link
Collaborator

mylibrar commented Aug 3, 2022

The way this is implemented is by creating a class variable for Entry called cached_attribute_data. This is a dict that stores the initial values of dataclass attributes. The implementation makes sure that before initializing a data store entry for a given entry object, the cached_attribute_data dict store all data class attributes and their initial values. There are two ways in which a dataclass attribute can be added to cached_attribute_data

  1. Attribute has a entry_setter property: In this case, the entry will automatically be added to cached_attribute_data.
  2. Attribute does not have a entry_setter property: In this case, attributes values are store in the entry object. These are fetched before the creation of the data store entry to populate cached_attribute_data.

I'm thinking we might want to make the behavior consistent. Right now we are maintaining two distinctive approaches to set dataclass attributes before and after registering property function. But it's out of the scope of this PR and it's not of high priority.

forte/common/constants.py Outdated Show resolved Hide resolved
forte/common/constants.py Outdated Show resolved Hide resolved
forte/data/base_pack.py Show resolved Hide resolved
forte/data/base_pack.py Outdated Show resolved Hide resolved
forte/data/base_pack.py Outdated Show resolved Hide resolved
forte/data/entry_converter.py Outdated Show resolved Hide resolved
forte/data/ontology/core.py Outdated Show resolved Hide resolved
forte/data/ontology/top.py Outdated Show resolved Hide resolved
forte/data/ontology/top.py Outdated Show resolved Hide resolved
tests/forte/data/data_store_serialization_test.py Outdated Show resolved Hide resolved
forte/data/base_pack.py Outdated Show resolved Hide resolved
forte/data/base_pack.py Outdated Show resolved Hide resolved
forte/data/ontology/core.py Outdated Show resolved Hide resolved
forte/data/base_pack.py Show resolved Hide resolved
forte/data/base_pack.py Outdated Show resolved Hide resolved
forte/data/data_store.py Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
forte/data/data_store.py Show resolved Hide resolved
forte/data/data_store.py Show resolved Hide resolved
tests/forte/data/data_store_test.py Outdated Show resolved Hide resolved
@mylibrar mylibrar marked this pull request as ready for review August 17, 2022 22:16
forte/data/data_pack.py Outdated Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
forte/data/data_store.py Outdated Show resolved Hide resolved
tests/forte/remote_processor_test.py Show resolved Hide resolved
forte/data/ontology/top.py Outdated Show resolved Hide resolved
forte/data/ontology/top.py Show resolved Hide resolved
forte/data/ontology/top.py Outdated Show resolved Hide resolved
forte/data/data_pack.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change Data Store Structure for Annotation Type Entries
3 participants