Faster indexing for learned sparse retrieval #2080

Open
wants to merge 5 commits into
base: master

@thongnt99 thongnt99 commented Mar 23, 2023

Related to #1890
Ongoing work: using FeatureField to index terms and weights directly

The indexing works and returns the same metrics as the token repeating method, but three tests (for the repeating method) are currently failing. Please let me know how to fix the tests or create new tests.
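To make the equivalence concrete, here is a minimal toy sketch of why the two approaches can return the same scores. It is not Anserini's or Lucene's actual code: the function names, the example terms, and the integer weights are all hypothetical, and it assumes weights are already quantized to integers.

```python
from collections import Counter


def repeat_tokens(weights):
    """Token-repeating method: emit each term `weight` times so that
    the term frequency in the index encodes the weight."""
    tokens = []
    for term, weight in weights.items():
        tokens.extend([term] * weight)
    return tokens


def impact_score(query_weights, doc_weights):
    """Sum of query weight times document weight over shared terms."""
    return sum(qw * doc_weights.get(term, 0) for term, qw in query_weights.items())


# Hypothetical quantized weights for one document and one query.
doc = {"retrieval": 3, "sparse": 5, "index": 1}
query = {"sparse": 2, "index": 4}

# Old approach: materialize repeated tokens, then count them back.
repeated = repeat_tokens(doc)
term_freqs = Counter(repeated)

# New approach (in spirit): store each (term, weight) pair once.
direct = dict(doc)

score_repeated = impact_score(query, term_freqs)
score_direct = impact_score(query, direct)

print(len(repeated), len(direct))    # tokens processed vs. pairs stored
print(score_repeated, score_direct)  # both scores are identical
```

In this toy model the repeating method processes as many tokens as the sum of the weights, while the direct method stores one entry per distinct term, which is why larger weights make the repeating method slower.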

Indexing:

./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonTermWeightCollection \
-input collections/msmarco-passage/lsr_collection_jsonl \
-index indexes/msmarco-passage/lsr-index-msmarco \
-generator TermWeightDocumentGenerator \
-threads 60 -impact -pretokenized

Retrieval:

./anserini-lsr/target/appassembler/bin/SearchCollection \
-index path_to_index \
-topics path_to_topic \
-topicreader TsvString \
-output path_to_output_file \
-impact -pretokenized -hits 1000 -parallelism 60 

@lintool
Member

lintool commented Mar 24, 2023

Hi @thongnt99 very interesting and thanks for the PR!

Can you provide a sense of the performance improvement?

@thongnt99
Author

thongnt99 commented Mar 25, 2023

Hi @lintool,

These are some comparison points I collected from our recent attempt to reproduce LSR methods.
The speedup depends on the magnitude of the term weights, but the new method is at least twice as fast as the term-repeating method. We saw, for example, a huge improvement when indexing EPIC: EPIC does not use sparse regularizers during training, so it generally produces larger weights, and the term-repeating method has to repeat each term more times.

| LSR method | Old (h:mm:ss) | New (h:mm:ss) |
| --- | --- | --- |
| QMLP_DMLM | 0:10:25 | 0:04:09 |
| EPIC (top_k=400) | 1:23:53 | 0:04:02 |
| Splade (0.01, 0.08) | 0:17:41 | 0:03:52 |
| uniCOIL | 0:05:11 | 0:02:18 |
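For readers who want the speedup factors rather than raw times, the following short sketch converts the reported h:mm:ss values to seconds and divides old by new. Only the timings come from the comment above; the helper name `to_seconds` is illustrative.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (old, new) times as reported in the comparison above.
timings = {
    "QMLP_DMLM": ("0:10:25", "0:04:09"),
    "EPIC (top_k=400)": ("1:23:53", "0:04:02"),
    "Splade (0.01, 0.08)": ("0:17:41", "0:03:52"),
    "uniCOIL": ("0:05:11", "0:02:18"),
}

speedups = {
    name: to_seconds(old) / to_seconds(new)
    for name, (old, new) in timings.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios range from roughly 2.3x (uniCOIL) to roughly 20.8x (EPIC), consistent with the "at least twice" claim.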

@MXueguang
Member

MXueguang commented Mar 25, 2023

@thongnt99 this is cool!

Member

@lintool lintool left a comment

Initial comments.

@lintool
Member

lintool commented Mar 25, 2023

Instead of TermWeightDocument... why not just call it VectorDocument? Vector as Map<String,Float> seems pretty intuitive?

@thongnt99
Author

thongnt99 commented Mar 25, 2023

Instead of TermWeightDocument... why not just call it VectorDocument? Vector as Map<String,Float> seems pretty intuitive?

Yes, I also think that TermWeightDocument isn't an ideal name. Perhaps SparseVectorDocument is more suitable than VectorDocument? The former makes it clear that we store indices/terms and values (similar to the SparseMatrix vs. DenseMatrix formats).
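As a rough illustration of the sparse-versus-dense distinction behind the naming, the sketch below stores the same document both ways. The vocabulary, terms, and weights are hypothetical, and the helper functions are illustrative only; they are not part of the PR.

```python
# Hypothetical five-term vocabulary; real LSR vocabularies have tens of thousands of terms.
vocab = ["index", "retrieval", "sparse", "vector", "weight"]

# Sparse form: keep only the non-zero (term, value) pairs.
sparse_doc = {"sparse": 2.5, "index": 0.7}


def to_dense(sparse, vocabulary):
    """Dense form: one slot per vocabulary term, zero where the term is absent."""
    return [sparse.get(term, 0.0) for term in vocabulary]


def to_sparse(dense, vocabulary):
    """Recover the sparse form by dropping zero entries."""
    return {term: value for term, value in zip(vocabulary, dense) if value != 0.0}


dense_doc = to_dense(sparse_doc, vocab)
print(dense_doc)                    # one value per vocabulary term
print(to_sparse(dense_doc, vocab))  # only the non-zero terms
```

The sparse form grows with the number of non-zero terms rather than with the vocabulary size, which is the property the name SparseVectorDocument is meant to signal.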

@lintool
Member

lintool commented Mar 25, 2023

I like SparseVectorDocument!

@thongnt99
Author

@lintool
I changed the class names and fixed the issues raised in your previous comments.

./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonSparseVectorCollection \
-input collections/msmarco-passage/lsr_collection_jsonl \
-index indexes/msmarco-passage/lsr-index-msmarco \
-generator SparseVectorDocumentGenerator \
-threads 60 -impact -pretokenized

Member

@lintool lintool left a comment

How about some tests?

src/main/java/io/anserini/search/SearchCollection.java (outdated, resolved)
@thongnt99
Author

How about some tests?

@lintool I am gonna add the tests after ECIR.

3 participants