HDTManager::generateHDT is not deterministic. #88

Open
jmkeil opened this issue Jan 18, 2019 · 3 comments

jmkeil commented Jan 18, 2019

I noticed that HDTManager::generateHDT does not generate the same output for the same input.
This is not just caused by different timestamps in the header. For the same local file, five runs produced these three different sets of header data:

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121121" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150874" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1976817" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121096" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150849" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1975553" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121071" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150824" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1974289" .
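
To make the pattern in these numbers easier to see, the following small Python sketch computes the differences between consecutive results. It only uses the values quoted above; the variable and function names are made up for illustration.

```python
# Header values quoted in this report, one dict per distinct result.
runs = [
    {"dictionarysizeStrings": 121121, "hdtSize": 150874, "originalSize": 1976817},
    {"dictionarysizeStrings": 121096, "hdtSize": 150849, "originalSize": 1975553},
    {"dictionarysizeStrings": 121071, "hdtSize": 150824, "originalSize": 1974289},
]

def deltas(key):
    """Differences between consecutive results for one header property."""
    values = [run[key] for run in runs]
    return [a - b for a, b in zip(values, values[1:])]

for key in runs[0]:
    print(key, deltas(key))
# dictionarysizeStrings [25, 25]
# hdtSize [25, 25]
# originalSize [1264, 1264]
```

The steps are constant, and hdtSize changes by exactly the same amount as dictionarysizeStrings. This suggests, without proving it, that the size variation comes from the dictionary strings rather than from the triples.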

Besides the different headers, the serialization of the dictionary also differs between runs.

What is the reason for this?
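
One way to narrow down where two generated files diverge is to look for the first differing byte. The sketch below is a generic file comparison in Python, not part of the HDT library; the function name and the file names in the usage comment are hypothetical.

```python
from itertools import zip_longest
from pathlib import Path

def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical."""
    data_a = Path(path_a).read_bytes()
    data_b = Path(path_b).read_bytes()
    # zip_longest pads the shorter file with None, so a length
    # mismatch is reported at the end of the shorter file.
    for offset, (a, b) in enumerate(zip_longest(data_a, data_b)):
        if a != b:
            return offset
    return None

# Usage, assuming two outputs of the same input were saved under these names:
# first_difference("run1.hdt", "run2.hdt")
```

Since the header contains a timestamp, the first difference is expected to show up there; differences at later offsets would point to the dictionary or other sections.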

@mielvds mielvds added the bug label Apr 16, 2021
@D063520 D063520 added this to the 3.1.0 milestone Mar 21, 2022

ate47 commented Apr 8, 2022

Do you remember if you were using an RDF file/stream with blank nodes? @jmkeil


jmkeil commented Apr 8, 2022

I do not recall exactly, but I most likely tested it with https://github.com/HajoRijgersberg/OM/blob/d5a3326e2f0f15f69272f3ce147b469fd90a1dc2/om-2.0.rdf. Does that match the hdt:originalSize value?


ate47 commented Apr 8, 2022

My guess was that it is due to the random naming of the BNodes, which would explain the difference in size. However, the size is computed from the random bnode names, and those are usually of the form _:anUuid, so they have a fixed size. So I don't think that is the only cause.
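
The fixed-size argument can be illustrated with a short Python sketch. It assumes, purely for illustration, that blank node labels are "_:" followed by a random UUID in its standard string form; the actual label scheme used by the parser may differ, and the function name is hypothetical.

```python
import uuid

def random_bnode_label() -> str:
    # Assumed label scheme: "_:" plus a random UUID (36 characters).
    return "_:" + str(uuid.uuid4())

# Simulate the labels generated in two independent runs.
labels_run1 = [random_bnode_label() for _ in range(1000)]
labels_run2 = [random_bnode_label() for _ in range(1000)]

# Every label has the same length, so the total size is identical
# across runs even though the labels themselves differ.
lengths = {len(label) for label in labels_run1 + labels_run2}
total1 = sum(len(label) for label in labels_run1)
total2 = sum(len(label) for label in labels_run2)
print(lengths, total1 == total2, labels_run1 == labels_run2)
```

Under this assumption, random labels would change the content and ordering of dictionary entries but not the summed string length, which is consistent with the doubt expressed above that bnode naming alone explains the size differences.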
