Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MARC Author name importing #9797

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Improve MARC Author name importing #9797

wants to merge 11 commits into from

Conversation

hornc
Copy link
Collaborator

@hornc hornc commented Aug 25, 2024

Adds more tests and improves string handling on MARC imports.

It seems 7xx has no direct relation to a contributor 2nd-class author. Most of the existing examples represent equal co-authors.

Also, it looks like contributor: str is a deprecated field given internetarchive/openlibrary-client@7e45d51

It's not yet clear to me how to identify a statement of responsibility as either an Author or Contributor. The current Author schema supports "roles", which matches source data better, e.g. 700$e " Relator term".

Initial Questions, and this PR's answers:

  • Who should be in authors? : 1xx (non-repeatable) and 7xx (repeatable) values (without duplicates)
  • Who should be in contributons? ... no one, deprecate this field. contributions: str field isn't used by the UI(?), and should be replaced by the more informative contributor -- see internetarchive/openlibrary-client@7e45d51
    • I question whether this new field should be pluralised because it is a list.
    • also, what makes a name a contributor? Authors already have roles, and source data doesn't appear to make author/contributor distinctions either.
  • In what script should the main entries be? : Main name in original script, transliterations/variations in alternate_names
  • Are these requirements consistent, and implemented consistently? I think this PR makes things more consistent than before.

Technical

Testing

Screenshot

Stakeholders

@tfmorris

@hornc hornc force-pushed the MARC_tests branch 3 times, most recently from 77b4d4c to 4f18edd Compare August 25, 2024 22:47
@hornc hornc force-pushed the MARC_tests branch 4 times, most recently from 3171df3 to 3d44aab Compare August 26, 2024 02:48
@hornc hornc force-pushed the MARC_tests branch 4 times, most recently from 9fc7c52 to f5dc6c5 Compare September 16, 2024 02:09
@hornc hornc changed the title Improve MARC string importing Improve MARC Author name importing Sep 16, 2024
@hornc hornc force-pushed the MARC_tests branch 13 times, most recently from ba029fb to 08c92bd Compare September 19, 2024 01:47
@hornc
Copy link
Collaborator Author

hornc commented Sep 19, 2024

mypy is wasting my time on this.

it's ridiculous that these tests can't run locally in Docker, which is how everything else in the project runs, mypy is available in the container via requirements_test.txt , but it's not functional. It's presence in the container seems to be an accident.

This project uses Docker to take advantage of a controlled dev and test environment, I do not use venv for this project, because there is no need to. I'm not even a great fan of mypy type checking, it's a bit of a disaster on this project which still has tons of legacy code which has barely been upgraded from Python 2. It's forcing to put bandaids on unrelated code that functions, when much of it needs a deep refactor. It's frustrating to have to type hint everything when you've added a few hints to code you understand and are actively changin. I have backed out type hints on some methods before because it opened a can of worms of errors I just couldn't untangle all the way back to source.

There seems to be no integration between the Docker env, the git pre-commit env, and the local test env (which apparently is something that is required to be outside of Docker), it's not documented how to run the full test suite locally, and I think having a separate test/dev environment to run half the tests is a mistake, even if it were documented.

@hornc
Copy link
Collaborator Author

hornc commented Sep 19, 2024

mypy's error messages aren't even very good.

The last error is:

openlibrary/catalog/marc/parse.py:501: error: Incompatible types in assignment
(expression has type "list[str]", target has type "str")  [assignment]
                author['alternate_names'] = [alt_name]
                                            ^~~~~~~~~~

It seems to be incorrectly inferring target has type "str" because I can't find anything that is explicitly claiming that.

author['alternate_names'] is supposed to be list[str] Python can handle that, but mypy has a problem. We have a JSON schema to set this all out, and Python dicts used to be good enough to represent structures like this, are we supposed to refactor everything to use statically typed objects now to represent these json-like objects we've inherited from the codebase?

@hornc hornc force-pushed the MARC_tests branch 2 times, most recently from 6c1813b to 55c8002 Compare September 19, 2024 03:02
@hornc
Copy link
Collaborator Author

hornc commented Sep 19, 2024

idk, now these tests are failing non-deterministically because of 500s from some external Vercel web app I've never heard of because ... javascript bundle size. What kind of testing hell is this? How many more gates?

@hornc hornc marked this pull request as ready for review September 19, 2024 22:54
@hornc
Copy link
Collaborator Author

hornc commented Sep 19, 2024

I have rebased against the main branch and that has fixed the javascript tests. I have also squashed the pre-commit CI test noise changes, and now all tests are passing.

@hornc hornc requested review from cclauss and removed request for cclauss September 19, 2024 23:08
Copy link
Contributor

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick pass through and it's definitely an improvement. The two issues that I see are:

  • figuring out which 7xx's are authors and which aren't
  • normalizing contributor roles / relators

I don't claim to have magic solutions for either, so just highlighting the issues.

"name": "Homer",
"entity_type": "person"
},
{
"name": "Buckley, Theodore William Aldis",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The by_statement indicates that this is a translator, who probably shouldn't be promoted to author, but I'm not sure how this case can be detected with at natural language processing of the by statement.

"birth_date": "1849",
"name": "Cowles, Calvin D.",
"entity_type": "person",
"role": "comp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standard MARC relator is "com". Is this an OpenLibrary specific code or are these not being normalized?

"birth_date": "1787",
"death_date": "1855",
"entity_type": "person",
"role": "tr. [and] ed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these to be translatable, they're going to need to be standardized/normalized. MARC relators are "trl" and "edt"

"title": "gaon",
"personal_name": "Yehudai ben Na\u1e25man",
"personal_name": "Yehudai ben Naḥman",
"role": "supposed author",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the standard relator for this might be "att" for "Attributed name"

"name": "Schlosberg, Leon",
"date": "d. 1899",
"entity_type": "person",
"role": "ed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edt

"birth_date": "1849",
"name": "Cowles, Calvin D.",
"entity_type": "person",
"role": "comp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

com

"birth_date": "1787",
"death_date": "1855",
"entity_type": "person",
"role": "tr. [and] ed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trl
edt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants