search wikidata images by their checksums (content hashes) #14

Open
jhpoelen opened this issue May 8, 2023 · 5 comments

Comments

@jhpoelen
Contributor

jhpoelen commented May 8, 2023

Internally, Wikimedia Commons uses sha1 hashes to alert users when an upload duplicates digital data that is already available via Commons.

However, as far as I can tell, these sha1 hashes are not yet exposed via structured data by default.

Methods do already exist for annotating digital content with its checksums.

For example, see https://www.wikidata.org/wiki/Q34852 , where https://www.wikidata.org/wiki/Property:P4092 is used to document the sha-2 hash 8de979cbb1db728ef99debac8a516405a2088e4fa2816fda2769856a54029bcd49913a45494ce1cae4096413c49ae7da36f7bc2d20899fb216195b9eb365e55c associated with the digital content.

image
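On the point about Commons' internal sha1 hashes: as far as I know, the MediaWiki Action API (not the structured data layer) can report a file's sha1 via `prop=imageinfo` with `iiprop=sha1`, and can list files by sha1 via `list=allimages` with `aisha1`. The sketch below only builds the request URLs. The function names are made up for illustration, and this route covers sha1 only, not sha-256 or md5.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def sha1_lookup_url(sha1_hex: str) -> str:
    """Build a URL asking Commons which files have the given sha1 (hex)."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex.lower(),
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def sha1_of_file_url(file_title: str) -> str:
    """Build a URL asking Commons for the sha1 of a file page title."""
    params = {
        "action": "query",
        "titles": file_title,
        "prop": "imageinfo",
        "iiprop": "sha1",
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"
```

Fetching either URL should return JSON; the exact response layout is best checked against the API documentation rather than assumed here.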

@jhpoelen
Contributor Author

jhpoelen commented May 8, 2023

Accordingly, I've manually annotated a wikimedia commons entry

https://commons.wikimedia.org/wiki/File:Agapostemon_texanus_killed_by_Peucetia_viridans_-_iNaturalist_56389401.jpg

with its associated checksums in sha1, sha-256 and md5.

Screenshot from 2023-05-08 09-23-58
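For reference, the three checksums used in the annotation can be computed locally with Python's standard `hashlib`. This is a minimal sketch, not necessarily how the values above were produced; `file_checksums` is a made-up helper name and the file path is whatever local copy of the image you have.

```python
import hashlib


def file_checksums(path, algorithms=("sha1", "sha256", "md5"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for the file at `path`.

    The file is read once, in chunks, and every hasher is fed the same bytes,
    so large images do not need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling `file_checksums("local-copy.jpg")` returns a dict with one hex string per algorithm, ready to paste into the corresponding checksum statements.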

@jhpoelen
Contributor Author

jhpoelen commented May 8, 2023

a sample query

SELECT ?item ?image WHERE {
  ?item wdt:P4092 "85379b346e61c06033a12720155f3bf13d2c6f5946625600f34edace55cb159d693a15aefab9e15691ff2402887985d559951327974206ccf85495e27b9ee56d";
        wdt:P18|wdt:P117 ?image .
}
LIMIT 10

with results obtained via https://query.wikidata.org/#SELECT%20%3Fitem%20%3Fimage%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP4092%20%2285379b346e61c06033a12720155f3bf13d2c6f5946625600f34edace55cb159d693a15aefab9e15691ff2402887985d559951327974206ccf85495e27b9ee56d%22%3B%0A%20%20%20%20%20%20%20%20wdt%3AP18%7Cwdt%3AP117%20%3Fimage%20.%0A%7D%0ALIMIT%2010

Screenshot from 2023-05-08 09-44-59
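The same query can be issued programmatically. Below is a rough sketch using only the standard library against the public endpoint at https://query.wikidata.org/sparql . The function names and the User-Agent string are placeholders of mine, and the result handling assumes the standard SPARQL JSON results layout.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_checksum_query(checksum: str, limit: int = 10) -> str:
    """Build the sample query above for an arbitrary hex checksum."""
    # Accept hex digits only, so the value is safe to embed in the query string.
    if not re.fullmatch(r"[0-9a-fA-F]+", checksum):
        raise ValueError("checksum must be a hex string")
    return (
        "SELECT ?item ?image WHERE {\n"
        f'  ?item wdt:P4092 "{checksum}";\n'
        "        wdt:P18|wdt:P117 ?image .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_checksum_query(checksum: str, limit: int = 10):
    """Send the query to the Wikidata Query Service and return the result rows."""
    params = urlencode({"query": build_checksum_query(checksum, limit), "format": "json"})
    request = Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={
            "User-Agent": "checksum-lookup-sketch/0.1 (placeholder)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Passing the hash from the sample query to `run_checksum_query` should give the same rows as the query service link above, assuming the annotation is still in place.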

@jhpoelen
Contributor Author

jhpoelen commented May 8, 2023

Note that structured queries against objects in wikimedia commons are still under development. See, for instance, https://diff.wikimedia.org/2020/10/29/sparql-in-the-shadow-of-structured-data-on-commons/ and the page it references, https://commons.wikimedia.org/wiki/Commons:Structured_data .

Also, note that annotating checksum properties (see https://www.wikidata.org/wiki/Property:P4092 ) on image properties in wikidata objects doesn't seem to come naturally, because qualifiers on qualifiers appear to be too much nesting for the wikidata model.

For instance, adding a checksum (or content hash) for an image that supports a physical interaction ( https://www.wikidata.org/wiki/Q2747101#P129 ) for a specific taxon ( https://www.wikidata.org/wiki/Q2747101 ) appears to be tricky with existing UI editing tools. E.g., it is currently hard to add a "determined by" qualifier (SHA-1 algorithm) to the checksum qualifier for the image related to the physical interaction property.

image

image

@jhpoelen
Contributor Author

jhpoelen commented May 8, 2023

It appears that the wikimedia commons entities are a more natural fit . . . and some patience is needed before this structured commons data can be accessed, for the reasons stated earlier.

image

@jhpoelen
Contributor Author

jhpoelen commented May 8, 2023

So, as far as I can tell, querying wikimedia commons images by their checksums is possible, but a dedicated service / data product would have to be created to help answer questions like:

What are the checksums (or content hashes) associated with this wikimedia commons entity?

and

Please provide content associated with this content id (or checksum) if you have it. Otherwise, say "mweh, don't have it."
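To make the two questions concrete, here is a toy in-memory sketch of such a service. Everything in it (the `ChecksumIndex` class, its method names, the placeholder entity and digests in the usage note) is hypothetical; a real data product would be populated from Commons or Wikidata rather than by hand.

```python
from dataclasses import dataclass, field

NOT_FOUND = "mweh, don't have it."


@dataclass
class ChecksumIndex:
    """Toy two-way index between entities and their checksums."""

    by_entity: dict = field(default_factory=dict)    # entity -> {algorithm: hex digest}
    by_checksum: dict = field(default_factory=dict)  # hex digest -> entity

    def register(self, entity: str, checksums: dict) -> None:
        """Record checksums (algorithm -> hex digest) for an entity."""
        normalized = {name: digest.lower() for name, digest in checksums.items()}
        self.by_entity.setdefault(entity, {}).update(normalized)
        for digest in normalized.values():
            self.by_checksum[digest] = entity

    def checksums_for(self, entity: str) -> dict:
        """Question 1: which checksums are associated with this entity?"""
        return dict(self.by_entity.get(entity, {}))

    def content_for(self, checksum: str) -> str:
        """Question 2: which entity has this checksum, if any?"""
        return self.by_checksum.get(checksum.lower(), NOT_FOUND)
```

For example, after `index.register("File:Example.jpg", {"sha1": "aa11", "md5": "bb22"})` (placeholder digests, not real hashes), `index.content_for("aa11")` gives back the file title and an unknown checksum gives back the "mweh" answer.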
