Pre-populate degree dictionary #34

cthoyt · 2022-10-10T12:31:03Z

Using a SPARQL query to get all subclasses of academic title (Q3529618) would be a nice way to pre-populate degrees.json. The following SPARQL query (run at https://w.wiki/5o9H) gets the job done:

SELECT ?itemLabel ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
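For anyone who wants to script this instead of using the web UI, here is a minimal Python sketch. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results format. The names `parse_bindings` and `fetch_degrees` and the User-Agent string are made up for illustration and are not part of this repository.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed; not named in the issue).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?itemLabel ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""


def parse_bindings(payload):
    """Turn a SPARQL JSON result into a list of (label, qid) pairs."""
    rows = []
    for binding in payload["results"]["bindings"]:
        # Entity URIs look like http://www.wikidata.org/entity/Q3529618
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        label = binding.get("itemLabel", {}).get("value", "")
        rows.append((label, qid))
    return rows


def fetch_degrees(query=QUERY):
    """Run the query against the endpoint (requires network access)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        # Hypothetical User-Agent; replace with something identifying your tool.
        headers={"User-Agent": "degree-dictionary-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Keeping the parsing separate from the network call means the parsing can be checked offline against a saved response.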

Caveats:

This should be extended to multiple languages
Some labels are empty. Those should be filtered out, either in SPARQL or in post-processing (I realize this is likely because those items have no English label)
There might be other terms besides academic title that are relevant, but this seems like a pretty good start
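The post-processing option for the empty labels could look roughly like the sketch below. It assumes the results have already been loaded as (label, QID) pairs. It also assumes that an item without a label in the requested language may come back with its QID as the label rather than an empty string, which is how the label service behaves as far as I know, so both cases are dropped. The function names are hypothetical.

```python
import re

# A bare Wikidata identifier such as "Q3529618".
QID_PATTERN = re.compile(r"Q\d+")


def has_usable_label(label, qid):
    """Return True if the label is real text rather than empty or a QID fallback."""
    text = label.strip()
    if not text:
        return False
    if text == qid or QID_PATTERN.fullmatch(text):
        return False
    return True


def filter_rows(rows):
    """Keep only (label, qid) pairs with a usable label, trimming whitespace."""
    return [(label.strip(), qid) for label, qid in rows if has_usable_label(label, qid)]
```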

Alternate Multi-lingual SPARQL

SELECT DISTINCT ?label ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  ?item rdfs:label ?label .
}

Note that DISTINCT doesn't collapse entries that have the same text but are tagged with different languages.
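One way to collapse those in post-processing is to drop the language tag before de-duplicating. This is only a sketch: it assumes each row arrives as a (label, language, QID) triple, for example by also selecting `LANG(?label)` in the query, and `collapse_languages` is a made-up name.

```python
def collapse_languages(rows):
    """Collapse (label, lang, qid) triples into unique (label, qid) pairs.

    Rows that differ only in their language tag end up as a single pair.
    """
    seen = set()
    for label, _lang, qid in rows:
        seen.add((label, qid))
    # Sort so the output is stable between runs.
    return sorted(seen)
```

Two different items that share a label are still kept apart, since the QID is part of the key.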

lubianat · 2022-10-12T11:26:49Z

We need a way to deal with duplicates, e.g.:
"Master of Arts": "Q6785149",
"Master of Arts": "Q2091008",

Both are valid, by the way

cthoyt · 2022-10-12T11:42:59Z

Are there meaningful differences between these?

If not, they can be merged in Wikidata
If they do have differences, then how do we decide which is right? Maybe coming up with a way of pruning country-specific duplicates (as https://www.wikidata.org/wiki/Q6785149 appears to be) would be helpful in making this list smaller

lubianat · 2022-10-12T11:48:19Z

@cthoyt this is only an example, there are many cases like that.
They are different, yes, as one is specific to Scotland.

They are not duplicates but specializations, and we can always use the more general term. That is what I do when manually curating these keys.

Pruning them automatically may prove to be an endless task due to the variety of possible items.

While we don't have a workflow for curating these duplicates, I'd rather roll back to the manually-curated-only version of the file.

cthoyt · 2022-10-12T11:50:13Z

It's fine with me if you want to roll back, but I am optimistic that creating rules for processing the data would be possible. Maybe you can start by assessing how big the overlap really is, by changing the returned data structure from a dict to something more TSV-like (one row per label/item pair)
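To make that suggestion concrete, here is a rough sketch of measuring the overlap once the data is kept as rows instead of a dict. It assumes (label, QID) pairs, and `find_ambiguous_labels` is a hypothetical helper, not existing code in the repository.

```python
from collections import defaultdict


def find_ambiguous_labels(rows):
    """Map each label that points at more than one item to its sorted QIDs."""
    by_label = defaultdict(set)
    for label, qid in rows:
        by_label[label].add(qid)
    return {
        label: sorted(qids)
        for label, qids in by_label.items()
        if len(qids) > 1
    }


# The two rows from the example earlier in this thread.
rows = [
    ("Master of Arts", "Q6785149"),
    ("Master of Arts", "Q2091008"),
]
ambiguous = find_ambiguous_labels(rows)
print(len(ambiguous), "label(s) map to more than one item")
```

Because rows are never keyed by label, nothing is silently overwritten, and the size of the returned dict gives the overlap directly.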

lubianat · 2022-10-12T12:22:48Z

@cthoyt actually I think the duplicates appeared when I merged my curations with the automatic dict.
The current code overwrites the "Master of Arts" entry and keeps only the Scottish version. It should be kept in a development branch, as it is dangerous as-is
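A safer merge could let the curated entries win and report every label where the two sources disagree, rather than letting the later source silently replace the earlier one. The sketch below is not the repository's actual merge code. The function name is hypothetical, and treating Q2091008 as the general item is inferred from this thread, where Q6785149 is described as the Scotland-specific one.

```python
def merge_preferring_curated(curated, automatic):
    """Merge two label -> QID dicts, keeping curated values on conflict.

    Returns the merged dict plus a dict of conflicts, mapping each
    disputed label to a (curated_qid, automatic_qid) tuple for review.
    """
    merged = dict(automatic)
    conflicts = {}
    for label, qid in curated.items():
        if label in automatic and automatic[label] != qid:
            conflicts[label] = (qid, automatic[label])
        # The curated value always wins.
        merged[label] = qid
    return merged, conflicts


curated = {"Master of Arts": "Q2091008"}
automatic = {"Master of Arts": "Q6785149"}
merged, conflicts = merge_preferring_curated(curated, automatic)
```

The conflicts dict can then be written out for manual review before anything lands in degrees.json.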

cthoyt mentioned this issue Oct 11, 2022

Automatically add additional roles #36

Merged

lubianat closed this as completed in 2a957e0 Oct 11, 2022

lubianat reopened this Oct 12, 2022
