Improve MARC string importing, part A #9806

Merged: 8 commits from MARC_testsA into master on Sep 5, 2024

Conversation

@hornc (Collaborator) commented Aug 26, 2024

Adds more tests and improves string handling on MARC imports.

Splits off work I started in #9797, which is getting a bit large.
This closes 2 issues completely, and sets up remaining work on #9789 and #7723.

Technical

This should be merged before #9797, which then will be rebased.

It is more convenient, less typing, and less disruptive to my workflow
to make this tiny change manually rather than to re-integrate trivial
bot changes on a WIP PR.
Closes #8204 by confirming the expectation that
"the series string should contain both the series name and the current volume/entry number",
and adding the provided example as a test case.
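The expected behaviour can be sketched roughly as follows. This is a hypothetical helper, not the actual import code: the `$a`/`$v` subfield names come from the MARC 490 field definition, and the ` -- ` separator mirrors the separator seen in the existing series test expectations.

```python
from typing import Optional

def build_series(subfield_a: str, subfield_v: Optional[str]) -> str:
    """Join a MARC 490 $a (series statement) with its $v (volume/number).

    Hypothetical sketch of the #8204 expectation; the real import code's
    field access and separator handling may differ.
    """
    return f"{subfield_a} -- {subfield_v}" if subfield_v else subfield_a
```

For example, `build_series("Economic series", "no. 46")` keeps both the series name and the entry number in one string.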
@hornc hornc requested a review from scottbarnes August 27, 2024 21:41
@scottbarnes (Collaborator) left a comment:

Looks good to me.

@tfmorris (Contributor) left a comment:

Love seeing these improvements and the richness of the data being captured!

I don't know if your tooling restricts you to ASCII for some reason, but the JSON Unicode strings would be a lot easier to read encoded as UTF-8 rather than escaped Unicode code points.

It wasn't clear to me whether the author's native name is one of the parts that's intentionally left out (or even how it would get incorporated), but flagged it just in case.

```
"authors": [
  {
    "entity_type": "org",
    "name": "Shou du shi fan da xue (Beijing, China). Zhongguo shi ge yan jiu zhong xin"
```

@tfmorris (Contributor) commented on this snippet:
This is missing the alternative (i.e. native) name: 首都师范大学 (Beijing, China). 中国诗歌硏究中心.

@hornc (Collaborator, Author) replied:

@tfmorris, thanks for your review and feedback. I started working on all the 7XX / 1XX issues in one PR #9797, but it started getting a bit tangled.

I'm trying to close out some of the basic string improvement issues with this PR for a stable base, and then I'll deal with making original / alternate script handling consistent in the various name fields.

Once this is merged, I'll update the tests to be consistent and make sure the correct script form of author names is where we want it. I think the current behaviour is inconsistent, and sorting it out properly is going to delay some of these more direct fixes. In #9797 I have changed the comma rearranging test expectation to use the native name, but there is still an inconsistency between 1XX and 7XX fields. This PR fixes the comma rearrangement (in the current chosen script). A follow up PR will make that script choice consistent across all name fields.
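For illustration, the comma rearrangement being discussed can be sketched like this. It is a hypothetical helper, not Open Library's actual implementation, which handles many more MARC subfield cases:

```python
def flip_name(name: str) -> str:
    """Rearrange a MARC-style inverted heading ("Surname, Forename")
    into natural order ("Forename Surname").

    Hypothetical illustration only; the real import code deals with
    additional cases (dates, titles, multiple commas).
    """
    surname, _, forename = name.partition(", ")
    return f"{forename} {surname}" if forename else name
```

So `flip_name("Morris, Tom")` gives `"Tom Morris"`, while a single-part name passes through unchanged.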

@hornc (Collaborator, Author) replied:

I agree about having the readable UTF-8 in the test expectations too. I'll endeavour to convert to UTF-8 as I touch the various files. I think some of the original test data is generated by Python output, which can display it in either format. I think the current mix is accidental, and UTF-8 is the way to go.
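For reference, Python's `json` module controls exactly this with the `ensure_ascii` flag; a minimal sketch (the record contents below are illustrative, not real fixture data):

```python
import json

record = {
    "series": ["Economic series (Nihon Gakujutsu Kaigi. Dai 3-bu) -- no. 46"],
    "name": "首都师范大学. 中国诗歌硏究中心",  # illustrative value
}

# Default: non-ASCII characters are escaped to \uXXXX code points.
escaped = json.dumps(record, indent=2)

# ensure_ascii=False emits readable UTF-8 instead.
readable = json.dumps(record, ensure_ascii=False, indent=2)

# Both forms decode to identical data.
assert json.loads(escaped) == json.loads(readable)
```

Either form round-trips to the same data, so converting the test fixtures is purely a readability change.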

Re: tooling, I do struggle a bit with mixing RTL Hebrew and Arabic string values into otherwise LTR JSON; most tools seem to auto-align or rearrange things in a way that doesn't help. It's the text direction, rather than the character sets, that causes the problem there.

@tfmorris (Contributor) replied:

Thanks for the feedback. I wasn't really sure where the dividing lines were, so I erred on the side of over-commenting, figuring that you could just ignore anything that wasn't relevant.

Comment on lines +22 to +25:

```
"series": [
    "The Science Council of Japan. Division of Economics, Commerce & Business Administration. Economic series no. 46",
    "Economic series (Nihon Gakujutsu Kaigi. Dai 3-bu) -- no. 46"
],
```
@tfmorris (Contributor) commented:

I don't think it's important given OpenLibrary's current liberal cataloging practices, but there's an extensive discussion of MARC 490 vs 830 at Yale: https://web.library.yale.edu/cataloging/CIP/editing-490-830. It may make sense at some point to drop the 490 1 and use only the corresponding 8xx for series which are "traced". Of course, 490 0 records would always be imported as-is.

That said, the nice thing about the 490 is that it's what appears on the volume, so I can see a place for both.

hornc and others added 2 commits August 30, 2024 15:18
Co-authored-by: Tom Morris <[email protected]>
Co-authored-by: Tom Morris <[email protected]>
@cdrini cdrini assigned scottbarnes and unassigned cdrini Aug 30, 2024
@hornc hornc merged commit d8f3a8e into master Sep 5, 2024
5 checks passed
@hornc hornc deleted the MARC_testsA branch September 5, 2024 21:58