Pre-populate degree dictionary #34

Open
cthoyt opened this issue Oct 10, 2022 · 5 comments · Fixed by #36

Comments

@cthoyt
Collaborator

cthoyt commented Oct 10, 2022

Using a SPARQL query to get all subclasses of academic title (Q3529618) would be a nice way to pre-populate degrees.json. The following SPARQL query (run at https://w.wiki/5o9H) gets the job done:

SELECT ?itemLabel ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Caveats:

  • This should be extended to multiple languages
  • Some labels are empty; those should be filtered out either in SPARQL or in post-processing (I realize this is likely because those items have no English label)
  • There might be other terms besides academic title that are relevant, but this seems like a pretty good start
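The post-processing route for the empty-label caveat could look roughly like the sketch below. It assumes the query results were downloaded in the standard SPARQL JSON results format and that the `bindings` list was already extracted from it. The function name `extract_labeled_items` is hypothetical, and treating a bare Q-identifier as "no label" is an assumption based on how the label service appears to fall back when no label exists in the requested languages.

```python
import re

# A label that is just a Q-identifier is treated as "no label available".
# This is an assumption about the label service's fallback behaviour.
QID_PATTERN = re.compile(r"^Q\d+$")


def extract_labeled_items(bindings):
    """Return (label, qid) pairs, skipping rows without a usable label."""
    rows = []
    for row in bindings:
        # Entity URIs end with the Q-identifier, e.g. .../entity/Q2091008
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        label = row.get("itemLabel", {}).get("value", "").strip()
        if not label or QID_PATTERN.match(label):
            continue
        rows.append((label, qid))
    return rows


# Illustrative input only; Q999999999 is a made-up placeholder item.
sample_bindings = [
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2091008"},
        "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Master of Arts"},
    },
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q999999999"},
        "itemLabel": {"type": "literal", "value": "Q999999999"},
    },
]

print(extract_labeled_items(sample_bindings))
```

Returning a list of pairs rather than a dict keeps rows that share a label, so nothing is silently dropped at this stage.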

Alternate Multi-lingual SPARQL

SELECT DISTINCT ?label ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  ?item rdfs:label ?label .
}

Note that DISTINCT doesn't collapse entries that have the same text but are tagged with different languages.
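One way to collapse those rows after the fact is to group on the label text and item, and keep the language tags as a side list. This is only a sketch: it assumes the multilingual query's results are available as SPARQL JSON `bindings`, and `collapse_labels` is a hypothetical helper name.

```python
from collections import defaultdict


def collapse_labels(bindings):
    """Map (label text, qid) to the sorted language tags that carry that text."""
    languages = defaultdict(set)
    for row in bindings:
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        text = row["label"]["value"]
        # Language-tagged literals carry their tag under "xml:lang".
        lang = row["label"].get("xml:lang", "")
        languages[(text, qid)].add(lang)
    return {key: sorted(langs) for key, langs in languages.items()}


# Illustrative input: the same text under two language tags.
sample_bindings = [
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2091008"},
        "label": {"type": "literal", "xml:lang": "en", "value": "Master of Arts"},
    },
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2091008"},
        "label": {"type": "literal", "xml:lang": "en-gb", "value": "Master of Arts"},
    },
]

print(collapse_labels(sample_bindings))
```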

@lubianat
Owner

We need a way to deal with duplicates, e.g.:
"Master of Arts": "Q6785149",
"Master of Arts": "Q2091008",

Both are valid, by the way
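For context on why this matters: Python's `json.loads` keeps only the last value when an object repeats a key, so one of the two entries disappears silently on load. A minimal sketch for surfacing such cases, assuming the file is a flat label-to-QID object (the helper name `find_duplicate_keys` and the placeholder entry are hypothetical):

```python
import json
from collections import defaultdict


def find_duplicate_keys(json_text):
    """Return {key: [values...]} for keys that occur more than once."""
    values_by_key = defaultdict(list)

    def collect(pairs):
        # object_pairs_hook receives every (key, value) pair in order,
        # including repeated keys that a plain dict would overwrite.
        for key, value in pairs:
            values_by_key[key].append(value)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=collect)
    return {key: values for key, values in values_by_key.items() if len(values) > 1}


# The two "Master of Arts" entries come from the example above;
# the third entry is a made-up placeholder.
sample_text = """
{
  "Master of Arts": "Q6785149",
  "Master of Arts": "Q2091008",
  "Placeholder degree": "Q0"
}
"""

print(find_duplicate_keys(sample_text))
```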

@cthoyt
Collaborator Author

cthoyt commented Oct 12, 2022

Are there meaningful differences between these?

  1. If not, they can be merged in Wikidata.
  2. If there are, how do we decide which one is right? A way of pruning country-specific duplicates (as https://www.wikidata.org/wiki/Q6785149 appears to be) could help make this list smaller.

@lubianat
Owner

@cthoyt this is only an example; there are many cases like that.
They are different, yes, as one is specific to Scotland.

They are not duplicates but specializations, and we can always use a more general term. That is what I do when manually curating these keys.

Pruning them automatically may prove to be an endless task due to the variety of possible items.

While we don't have a workflow for curating these duplicates, I'd rather roll back to the manually curated version of the file.

@cthoyt
Collaborator Author

cthoyt commented Oct 12, 2022

It's fine with me if you want to roll back, but I am optimistic that creating rules for processing the data would be possible. Maybe you can start by assessing how big the overlap really is, by changing the returned data structure from a dict to something more like TSV rows.
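A rough sketch of what that could look like, assuming the query results have already been reduced to `(label, qid)` pairs. The function names `rows_to_tsv` and `ambiguous_labels` and the placeholder row are hypothetical, not part of the existing code.

```python
import csv
import io
from collections import defaultdict


def rows_to_tsv(rows):
    """Serialize (label, qid) pairs as TSV text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["label", "qid"])
    # Deduplicate exact repeats and sort for a stable, diff-friendly file.
    writer.writerows(sorted(set(rows)))
    return buffer.getvalue()


def ambiguous_labels(rows):
    """Return {label: [qids...]} for labels that point at more than one item."""
    qids_by_label = defaultdict(set)
    for label, qid in rows:
        qids_by_label[label].add(qid)
    return {
        label: sorted(qids)
        for label, qids in qids_by_label.items()
        if len(qids) > 1
    }


# The "Master of Arts" rows come from this thread; the last row is a placeholder.
sample_rows = [
    ("Master of Arts", "Q6785149"),
    ("Master of Arts", "Q2091008"),
    ("Placeholder degree", "Q0"),
]

print(rows_to_tsv(sample_rows))
print(len(ambiguous_labels(sample_rows)), "label(s) map to more than one item")
```

Unlike a dict, the row form keeps every label/item pair, so the size of the overlap can be counted before deciding on any pruning rules.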

@lubianat
Owner

@cthoyt actually, I think the duplicates appeared when I merged my curations with the automatic dict.
The current code overwrites the "Master of Arts" entry and keeps only the Scottish version. It should be kept in a development branch, as it is dangerous as-is.
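One possible guard against that overwrite is a merge in which curated entries always win and every disagreement is reported rather than silently resolved. This is a sketch, not the repository's current code: `merge_degrees` is a hypothetical name, and which QID sits on the curated side in the sample is an assumption.

```python
def merge_degrees(curated, automatic):
    """Merge two label-to-QID dicts, letting curated entries take precedence.

    Returns the merged dict plus the conflicts as
    {label: (curated_qid, automatic_qid)} for manual review.
    """
    merged = dict(automatic)
    conflicts = {}
    for label, qid in curated.items():
        if label in automatic and automatic[label] != qid:
            conflicts[label] = (qid, automatic[label])
        # The curated value always replaces the automatic one.
        merged[label] = qid
    return merged, conflicts


# Assumed for illustration: the curated file holds the general item and the
# automatic dict holds the Scotland-specific one.
curated = {"Master of Arts": "Q2091008"}
automatic = {"Master of Arts": "Q6785149", "Placeholder degree": "Q0"}

merged, conflicts = merge_degrees(curated, automatic)
print(merged)
print(conflicts)
```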
