Improve MARC string importing, part A #9806

Merged: 8 commits from MARC_testsA into master on Sep 5, 2024

Conversation

@hornc (Collaborator) commented Aug 26, 2024

Adds more tests and improves string handling on MARC imports.

Splits off work I started in #9797, which is getting a bit large.
This closes 2 issues completely, and sets up remaining work on #9789 and #7723.

Technical

This should be merged before #9797, which then will be rebased.

It is more convenient, less typing, and less disruptive to my workflow
to make this tiny change manually rather than to re-integrate trivial
bot changes on a WIP PR.
Closes #8204 by confirming the expectation that
"the series string should contain both the series name and the current volume/entry number",
and adding the provided example as a test case.
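The expected behaviour can be sketched roughly as follows. This is a hypothetical helper, not the actual import code: the `$a`/`$v` subfield names come from the MARC 490 field definition, and the ` -- ` separator mirrors the separator seen in the existing series test expectations.

```python
from typing import Optional

def build_series(subfield_a: str, subfield_v: Optional[str]) -> str:
    """Join a MARC 490 $a (series statement) with its $v (volume/number).

    Hypothetical sketch of the #8204 expectation; the real import code's
    field access and separator handling may differ.
    """
    return f"{subfield_a} -- {subfield_v}" if subfield_v else subfield_a
```

For example, `build_series("Economic series", "no. 46")` keeps both the series name and the entry number in one string.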
@hornc hornc requested a review from scottbarnes August 27, 2024 21:41
@scottbarnes (Collaborator) left a comment:

Looks good to me.

@tfmorris (Contributor) left a comment:

Love seeing these improvements and the richness of the data being captured!

I don't know if your tooling restricts you to ASCII for some reason, but the JSON Unicode strings would be a lot easier to read encoded as UTF-8 rather than escaped Unicode code points.

It wasn't clear to me whether the author's native name is one of the parts that's intentionally left out (or even how it would get incorporated), but flagged it just in case.

```
"authors": [
  {
    "entity_type": "org",
    "name": "Shou du shi fan da xue (Beijing, China). Zhongguo shi ge yan jiu zhong xin"
```

@tfmorris (Contributor) commented on this snippet:
This is missing the alternative (i.e. native) name: 首都师范大学 (Beijing, China). 中国诗歌硏究中心.

@hornc (Collaborator, Author) replied:

@tfmorris, thanks for your review and feedback. I started working on all the 7XX / 1XX issues in one PR #9797, but it started getting a bit tangled.

I'm trying to close out some of the basic string improvement issues with this PR for a stable base, and then I'll deal with making original / alternate script handling consistent in the various name fields.

Once this is merged, I'll update the tests to be consistent and make sure the correct script form of author names is where we want it. I think the current behaviour is inconsistent, and sorting it out properly is going to delay some of these more direct fixes. In #9797 I have changed the comma rearranging test expectation to use the native name, but there is still an inconsistency between 1XX and 7XX fields. This PR fixes the comma rearrangement (in the current chosen script). A follow up PR will make that script choice consistent across all name fields.
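For illustration, the comma rearrangement being discussed can be sketched like this. It is a hypothetical helper, not Open Library's actual implementation, which handles many more MARC subfield cases:

```python
def flip_name(name: str) -> str:
    """Rearrange a MARC-style inverted heading ("Surname, Forename")
    into natural order ("Forename Surname").

    Hypothetical illustration only; the real import code deals with
    additional cases (dates, titles, multiple commas).
    """
    surname, _, forename = name.partition(", ")
    return f"{forename} {surname}" if forename else name
```

So `flip_name("Morris, Tom")` gives `"Tom Morris"`, while a single-part name passes through unchanged.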

@hornc (Collaborator, Author) replied:

I agree about having the readable UTF-8 in the test expectations too. I'll endeavour to convert to UTF-8 as I touch the various files. I think some of the original test data is generated by Python output, which can display it in either format. I think the current mix is accidental, and UTF-8 is the way to go.
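For reference, Python's `json` module controls exactly this with the `ensure_ascii` flag; a minimal sketch (the record contents below are illustrative, not real fixture data):

```python
import json

record = {
    "series": ["Economic series (Nihon Gakujutsu Kaigi. Dai 3-bu) -- no. 46"],
    "name": "首都师范大学. 中国诗歌硏究中心",  # illustrative value
}

# Default: non-ASCII characters are escaped to \uXXXX code points.
escaped = json.dumps(record, indent=2)

# ensure_ascii=False emits readable UTF-8 instead.
readable = json.dumps(record, ensure_ascii=False, indent=2)

# Both forms decode to identical data.
assert json.loads(escaped) == json.loads(readable)
```

Either form round-trips to the same data, so converting the test fixtures is purely a readability change.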

Re: tooling, I do struggle a bit with mixing RTL Hebrew and Arabic string values into otherwise LTR JSON; most tools seem to auto-align or rearrange things in a way that doesn't help. It's the text direction, rather than the character sets, that causes the problem there.

@tfmorris (Contributor) replied:

Thanks for the feedback. I wasn't really sure where the dividing lines were, so I erred on the side of over-commenting, figuring that you could just ignore anything that wasn't relevant.

Comment on lines +22 to +25:

```
"series": [
    "The Science Council of Japan. Division of Economics, Commerce & Business Administration. Economic series no. 46",
    "Economic series (Nihon Gakujutsu Kaigi. Dai 3-bu) -- no. 46"
],
```
@tfmorris (Contributor) commented:

I don't think it's important given OpenLibrary's current liberal cataloging practices, but there's an extensive discussion of MARC 490 vs 830 at Yale: https://web.library.yale.edu/cataloging/CIP/editing-490-830. It may make sense at some point to drop the 490 1 and use only the corresponding 8xx for series which are "traced". Of course, 490 0 records would always be imported as-is.

That said, the nice thing about the 490 is that it's what appears on the volume, so I can see a place for both.

hornc and others added 2 commits August 30, 2024 15:18
Co-authored-by: Tom Morris <[email protected]>
Co-authored-by: Tom Morris <[email protected]>
@cdrini cdrini assigned scottbarnes and unassigned cdrini Aug 30, 2024
@hornc hornc merged commit d8f3a8e into master Sep 5, 2024
5 checks passed
@hornc hornc deleted the MARC_testsA branch September 5, 2024 21:58