ES-PV Feedback #609

matyaskopp · 2023-02-28T20:51:08Z

Common taxonomies are modified

common taxonomies

The common taxonomies should be used without modification; only translations may be added.
E.g., your parla.legislature taxonomy lacks the category parla.bi.

You can use this procedure to factor the taxonomies, listPerson, and listOrg out into separate files and include them with XInclude:

# factorize taxonomies and list(Person|Org)
make factorize-teiHeader-INPLACE-ES-PV
# add new files into the repository (taxonomies and list of persons and organizations)
git add Data/ParlaMint-ES-PV/ParlaMint-ES-PV-taxonomy-*.xml
git add Data/ParlaMint-ES-PV/ParlaMint-taxonomy-*.xml
git add Data/ParlaMint-ES-PV/ParlaMint-ES-PV-list*.xml

Then commit and push the updated files.

meeting corresp

meeting/@corresp

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L12-L14

<meeting ana="#parla.lower #parla.term" n="10">10. agintaldia</meeting>
<meeting ana="#parla.lower #parla.term" n="11">11. agintaldia</meeting>
<meeting ana="#parla.lower #parla.term" n="12">12. agintaldia</meeting>

should be (DZ is the id of the corresponding parliament)

<meeting corresp="#DZ" ana="#parla.uni #parla.term" n="10">10. agintaldia</meeting>
<meeting corresp="#DZ" ana="#parla.uni #parla.term" n="11">11. agintaldia</meeting>
<meeting corresp="#DZ" ana="#parla.uni #parla.term" n="12">12. agintaldia</meeting>

BTW, DZ is the id used in the ParlaMint-SI corpus; it does not make sense for the Basque parliament.

idno type

idno type and subtype

The idno type is usually URI, and there can be some subtypes: https://clarin-eric.github.io/ParlaMint/#TEI.idno

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L399

<idno type="wikimedia" xml:lang="eu">https://eu.wikipedia.org/wiki/Eusko_Legebiltzarra</idno>

should be

<idno type="URI" subtype="wikimedia" xml:lang="eu">https://eu.wikipedia.org/wiki/Eusko_Legebiltzarra</idno>

There are other strange types:
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L1539

<idno type="eaj-pnv-web" xml:lang="eu">https://euskolegebiltzarra.eaj-pnv.eus/eu/legebiltzar-taldea/9801/kerman-orbegozo</idno>
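
The rewrite shown above can be sketched in Python with the standard library's xml.etree.ElementTree. The helper name normalize_idno is hypothetical, and copying the old @type into @subtype is only a starting point: the allowed subtype values are those in the idno definition linked above, so a value such as eaj-pnv-web still needs a manual decision.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def normalize_idno(root):
    """Set type="URI" on URL-valued <idno> elements, keeping the old type as @subtype.

    Hypothetical helper: the old @type is only copied, not checked
    against the subtype values the schema allows.
    """
    changed = 0
    for idno in root.iter(f"{{{TEI_NS}}}idno"):
        old_type = idno.get("type")
        text = (idno.text or "").strip()
        if old_type and old_type != "URI" and text.startswith(("http://", "https://")):
            idno.set("type", "URI")
            if idno.get("subtype") is None:
                idno.set("subtype", old_type)
            changed += 1
    return changed

# Sample taken from the root file line quoted above; the <org> wrapper is illustrative.
sample = (
    f'<org xmlns="{TEI_NS}">'
    '<idno type="wikimedia" xml:lang="eu">https://eu.wikipedia.org/wiki/Eusko_Legebiltzarra</idno>'
    '</org>'
)
root = ET.fromstring(sample)
n_changed = normalize_idno(root)
fixed = root.find(f"{{{TEI_NS}}}idno")
print(n_changed, fixed.get("type"), fixed.get("subtype"))
```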

wrong term date

the current term has not ended

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L416-L420

<event xml:id="DZ.12" from="2020-08-03" to="2022-07-01"> <!-- remove to="2022-07-01" -->
<!-- ... -->
  <label xml:lang="en">Term 12</label>
</event>

country code

country code in key - root files
country code in key - component files

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L376

<name key="EU" type="country">Basque Country, Spain</name>

should be

<name key="ES-PV" type="country">Basque Country, Spain</name>

different setting in component and root files

different component and root files setting

            <setting>
               <name type="city">Gasteiz</name>
               <name type="country" key="ES">Spain</name>
               <date ana="#parla.sitting" when="2015-02-05">2015-02-05</date> <!-- file date -->
            </setting>

vs.

            <setting>
              <name type="org">Eusko Legebiltzarra</name>
               <name type="address">Becerro de Bengoa kalea</name>
               <name type="city">Vitoria-Gasteiz</name>
               <name key="EU" type="country">Basque Country, Spain</name>
               <date from="2015-02-01" to="2022-07-01"> 2015eko otsailetik 2022eko uztailera</date> <!-- corpus timespan -->
            </setting>

unicameral parliament

only parla.uni

Don't use #parla.lower in a unicameral parliamentary system.
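
To locate the remaining references, a small Python sketch like the following may help. The function name find_ana_refs and the `<meetings>` wrapper element are hypothetical; only the #parla.lower value comes from this thread.

```python
import xml.etree.ElementTree as ET

def find_ana_refs(root, unwanted=("#parla.lower",)):
    """Return (element name, @ana) pairs whose @ana contains an unwanted reference."""
    hits = []
    for el in root.iter():
        refs = el.get("ana", "").split()
        if any(ref in unwanted for ref in refs):
            # Strip the namespace part of the tag, if any.
            hits.append((el.tag.split("}")[-1], el.get("ana")))
    return hits

# <meetings> is an illustrative wrapper, not a ParlaMint element.
sample = """<meetings>
<meeting ana="#parla.lower #parla.term" n="10">10. agintaldia</meeting>
<meeting ana="#parla.uni #parla.term" n="11">11. agintaldia</meeting>
</meetings>"""
hits = find_ana_refs(ET.fromstring(sample))
print(hits)
```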

typo aupport vs support

support
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L21

<resp xml:lang="en">Technical aupport and conversion to XML</resp>
<!-- ... -->
<resp xml:lang="en">Technical aupport and support for bilingual analysis</resp>

no virtual persons

remove anonymous person

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L550

              <person xml:id="Anonymous">
                 <persName>
                    <forename>An</forename>
                    <surname>Onymus</surname>
                 </persName>
              </person>

remove annotation with parliamentary organization

remove ana="#DZ"

In @ana references we use events or categories, not organizations. If someone's membership corresponds to a particular term, you can use the term id (e.g. #DZ.10).

langUsage usage

percentage of usage of language in corpus (natural number)

https://clarin-eric.github.io/ParlaMint/#TEI.language

There is no definition of how to calculate this number; in ParlaMint-UA I use the percentage of words in the given language.

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L3973-L3983
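
One possible way to compute such a number is sketched below in Python. It assumes that every seg carries xml:lang (as in the component files quoted in this issue) and that whitespace-separated tokens are an acceptable stand-in for words; the function name language_percentages is hypothetical. Because each value is rounded to a natural number, the values need not sum to exactly 100.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def language_percentages(roots):
    """Percentage of whitespace-separated tokens per seg/@xml:lang, rounded."""
    counts = Counter()
    for root in roots:
        for seg in root.iter(f"{{{TEI_NS}}}seg"):
            lang = seg.get(XML_LANG)
            if lang:
                counts[lang] += len("".join(seg.itertext()).split())
    total = sum(counts.values())
    if not total:
        return {}
    return {lang: round(100 * n / total) for lang, n in counts.items()}

# Illustrative input: the Spanish segment is invented for the example.
sample = (
    f'<body xmlns="{TEI_NS}">'
    '<seg xml:lang="eu">Egun on guztioi.</seg>'
    '<seg xml:lang="es">Gracias.</seg>'
    '</body>'
)
usage = language_percentages([ET.fromstring(sample)])
print(usage)
```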

Remove Czech named entity taxonomy

remove cnec

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV-taxonomy-NER.cnec2.0.ana.xml
It is used only in the Czech data.

file inclusion:
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.ana.xml#L370-L371

prefix definition:
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.ana.xml#L377-L379

fix application info

application info

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.ana.xml#L381-L390

         <appInfo>
            <application ident="UDPipe" version="2">
               <label>UDPipe 2 (czech-pdt-ud-2.10-220711 model)</label>
               <desc>POS tagging, lemmatization and dependency parsing done with UDPipe 2 (<ref target="http://ufal.mff.cuni.cz/udpipe/2">http://ufal.mff.cuni.cz/udpipe/2</ref>) with czech-pdt-ud-2.10-220711 model</desc>
            </application>
            <application ident="NameTag" version="2">
               <label>NameTag 2 (czech-cnec2.0-200831 model)</label>
               <desc>Name entity recognition done with NameTag 2 (<ref target="http://ufal.mff.cuni.cz/nametag/2">http://ufal.mff.cuni.cz/nametag/2</ref>) with czech-cnec2.0-200831 model (<ref target="http://hdl.handle.net/11234/1-3443">http://hdl.handle.net/11234/1-3443</ref>). Posprocessing: nested named entities has been merged to four categories (PER, LOC, ORG, MISC).</desc>
            </application>
         </appInfo>

You used different UDPipe models,
and as far as I know you did not use NameTag but some other tool for NER.

add parla.sitting

TEI/@ana

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L2

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-ES-PV_2015-02-05" xml:lang="eu" ana="#reference">

should be

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-ES-PV_2015-02-05" xml:lang="eu" ana="#reference #parla.sitting">

component file meeting

add sitting (and corresp)

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L8

<meeting ana="#parla.session" n="1049-1054">Osoko Bilkura 2015-02-05 1049-1054</meeting>

can be

<meeting ana="#parla.sitting" n="2015-02-05">2015-02-05</meeting>
<meeting ana="#parla.session" n="1049-1054">Osoko Bilkura 1049-1054</meeting>

word count

word count

I don't understand this: the language is wrong and the description is strange (aprox).
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L26

<measure unit="words" quantity="42758" xml:lang="eu">42758 words (aprox)</measure>

notes

speaker notes

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L83

<u who="#TejeriaOtermin" xml:id="ParlaMint-ES-PV_2015-02-05.u0" ana="#chair">
  <note>LEHENDAKARIAK (Tejeria Otermin):</note>
  <seg xml:id="ParlaMint-ES-PV_2015-02-05.seg0" xml:lang="eu">Egun on guztioi. Osoko bilkurari hasiera emango diogu.</seg>

should be (outside utterance + type):

<note type="speaker">LEHENDAKARIAK (Tejeria Otermin):</note>
<u who="#TejeriaOtermin" xml:id="ParlaMint-ES-PV_2015-02-05.u0" ana="#chair">
  <seg xml:id="ParlaMint-ES-PV_2015-02-05.seg0" xml:lang="eu">Egun on guztioi. Osoko bilkurari hasiera emango diogu.</seg>
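
A rough Python sketch of this move, assuming that an untyped note which is the first child of an utterance is always the speaker label (this holds for the sample above but should be checked on the full data). The function name hoist_speaker_notes is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
T = f"{{{TEI_NS}}}"

def hoist_speaker_notes(root):
    """Move a leading untyped <note> out of each <u> and mark it type="speaker"."""
    moved = 0
    # Snapshot the elements first, because the tree is modified in the loop.
    for parent in list(root.iter()):
        for u in parent.findall(f"{T}u"):
            has_leading_text = bool((u.text or "").strip())
            if len(u) and u[0].tag == f"{T}note" and not has_leading_text:
                note = u[0]
                if note.get("type") is not None:
                    continue  # already typed, leave it alone
                u.remove(note)
                note.set("type", "speaker")
                note.tail = "\n"
                # Insert the note immediately before its former utterance.
                parent.insert(list(parent).index(u), note)
                moved += 1
    return moved

# The <div> wrapper is illustrative; the content is from the sample above.
sample = (
    f'<div xmlns="{TEI_NS}">'
    '<u who="#TejeriaOtermin" ana="#chair">'
    '<note>LEHENDAKARIAK (Tejeria Otermin):</note>'
    '<seg xml:lang="eu">Egun on guztioi. Osoko bilkurari hasiera emango diogu.</seg>'
    '</u></div>'
)
root = ET.fromstring(sample)
moved = hoist_speaker_notes(root)
tags = [child.tag.split("}")[-1] for child in root]
print(moved, tags)
```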

strange spacing in notes

https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L1415-L1416

          <note>: Ondorioz, onartuta geratu da erdibideko zuzenketa.
			:</note>

notes inside text

There is a lot of text inside parentheses; I am not sure what is a comment and what is an incident...

... de cualquier propiedad (material político o de cualquier otra índole), para atacar ...
... garantistak ez diren beste estatuetan (Alemaniako edo Estatu Batuetako sistemetan, adibidez), ...
... 4. puntuak.  (Geldiunea)  Bozketa eginda, ...
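
As a possible aid for this triage, the following Python sketch counts every parenthesised span so that frequent ones can be reviewed first. It does not decide what is a comment or an incident, and the function name parenthesised_spans is hypothetical.

```python
import re
from collections import Counter

# Matches the innermost "( ... )" spans; nested parentheses are not handled.
PAREN = re.compile(r"\(([^()]*)\)")

def parenthesised_spans(texts):
    """Count the parenthesised spans found in an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(span.strip() for span in PAREN.findall(text))
    return counts

# The three fragments quoted above.
texts = [
    "... de cualquier propiedad (material político o de cualquier otra índole), para atacar ...",
    "... garantistak ez diren beste estatuetan (Alemaniako edo Estatu Batuetako sistemetan, adibidez), ...",
    "... 4. puntuak.  (Geldiunea)  Bozketa eginda, ...",
]
counts = parenthesised_spans(texts)
for span, n in counts.most_common():
    print(n, span)
```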

no named entities

named entities

miruskieta · 2023-03-16T00:32:42Z

Hi Matyas,

I think each "idno" now has its own type, but I still get the following error:
Error: /home/runner/work/ParlaMint-ES-PV/ParlaMint-ES-PV/ParlaMint/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml:3951:51: error: element "idno" missing required attribute "type"

Do you know how I can find the error?

Best,
m

TomazErjavec · 2023-03-16T07:54:51Z

Error: /home/runner/work/ParlaMint-ES-PV/ParlaMint-ES-PV/ParlaMint/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml:3951:51: error: element "idno" missing required attribute "type"

If I identify the line correctly, you have:
<idno type="wikimedia" xml:lang="eu">https://es.wikipedia.org/wiki/Alfredo_Retortillo</idno>

The error message is indeed strange, but the line is incorrect regardless; cf. the definition of idno. It should be

<idno type="URI" subtype="wikimedia" xml:lang="eu">https://es.wikipedia.org/wiki/Alfredo_Retortillo</idno>

miruskieta · 2023-03-16T09:14:29Z

Yes, I found that error yesterday and carefully checked for any others, but the error is still reported. Here is the line, and it looks good.
<idno type="URI" subtype="wikimedia" xml:lang="eu">https://es.wikipedia.org/wiki/Alfredo_Retortillo</idno>

TomazErjavec · 2023-03-16T19:50:22Z

Indeed, it does look good. I have no idea what is wrong here. Maybe just correct everything else and make another pull request; it might then be easier to debug.

matyaskopp · 2023-03-16T19:55:45Z

There is a <p> that is opened but not closed; this can cause an unexpected mess...
https://github.com/miruskieta/ParlaMint-ES-PV/blob/d33ed6a05ca58b772a205d8088fb940972bb7778/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L65

miruskieta · 2023-03-17T13:05:29Z

Once I corrected this, a lot of errors appeared. I validated the file externally (https://codebeautify.org/xmlvalidator) and it seems to be OK.
Error: /home/runner/work/ParlaMint-ES-PV/ParlaMint-ES-PV/ParlaMint/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml:66:33: error: element "p" not allowed here; expected the element end-tag, text or element "ref"
Error: /home/runner/work/ParlaMint-ES-PV/ParlaMint-ES-PV/ParlaMint/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml:67:33: error: element "p" not allowed here; expected the element end-tag, text or element "ref"

matyaskopp · 2023-03-29T12:41:37Z

@miruskieta

broken token order/nesting after NER annotation

broken token order/nesting after NER annotation

<?xml version="1.0" encoding="UTF-8"?>
<seg xml:id="ParlaMint-ES-PV_2015-02-05.seg5" xml:lang="es">
   <s xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1">
      <w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w1" lemma="el" pos="DET" msd="UPosTag=DET|Definite=Def|Gender=Fem|Number=Sing|PronType=Art">
         La
<!-- INCORRECT TOKEN PLACEMENT: -->
         <w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w6" lemma="Ararteko" pos="PROPN" msd="UPosTag=PROPN">Ararteko</w>
      </w>
      <w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w2" lemma="institución" pos="NOUN" msd="UPosTag=NOUN|Gender=Fem|Number=Sing">institución</w>
      <w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w3">
         del
         <w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w4" lemma="de" pos="ADP" msd="UPosTag=ADP" norm="de" />
         <w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w5" lemma="el" pos="DET" msd="UPosTag=DET|Definite=Def|Gender=Masc|Number=Sing|PronType=Art" norm="el" />
      </w>
<!-- SHOULD BE HERE -->
      <w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w7" lemma="ser" pos="AUX" msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin">es</w>
      <!-- ... -->
   </s>
</seg>
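
A check for this kind of nesting can be sketched in Python. It assumes, based on the sample above, that a w may contain other w elements only when the outer token has no @lemma and the inner ones are empty normalised tokens; the function name misnested_tokens and the shortened xml:id values are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
T = f"{{{TEI_NS}}}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def misnested_tokens(root):
    """Return xml:id values of <w> elements that look wrongly nested in another <w>."""
    bad = []
    for outer in root.iter(f"{T}w"):
        for inner in outer.findall(f"{T}w"):
            outer_is_plain_token = outer.get("lemma") is not None
            inner_has_text = bool((inner.text or "").strip())
            if outer_is_plain_token or inner_has_text:
                bad.append(inner.get(XML_ID))
    return bad

# Reduced version of the sample above, with shortened ids.
sample = f"""<s xmlns="{TEI_NS}" xml:id="s1">
  <w xml:id="s1.w1" lemma="el">La<w xml:id="s1.w6" lemma="Ararteko">Ararteko</w></w>
  <w xml:id="s1.w3">del<w xml:id="s1.w4" lemma="de" norm="de"/><w xml:id="s1.w5" lemma="el" norm="el"/></w>
</s>"""
bad = misnested_tokens(ET.fromstring(sample))
print(bad)
```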

I have annotated your TEI sample with UDPipe on my side and it works, so the incorrect tokens are produced by the NER tool.
I used this:

# copy my script into the current directory
svn checkout https://github.com/ufal/ParCzech/trunk/src/udpipe2
svn checkout https://github.com/ufal/ParCzech/trunk/src/lib
# copy your sample data into the current directory
svn checkout https://github.com/miruskieta/ParlaMint/trunk/Data/ParlaMint-ES-PV
# create an output directory 
mkdir ParlaMint-ES-PV.ana
# create a list of files to annotate
ls ParlaMint-ES-PV | grep 'ParlaMint-ES-PV_' | grep -v -F '.ana.xml' > files2annotate.fl
# run script
perl -I lib udpipe2/udpipe2.pl --colon2underscore \
                             --model "eu:basque-bdt-ud-2.10-220711" \
                             --model "es:spanish-ancora-ud-2.10-220711" \
                             --elements "seg" \
                             --debug \
                             --try2continue-on-error \
                             --filelist files2annotate.fl \
                             --input-dir ParlaMint-ES-PV \
                             --output-dir ParlaMint-ES-PV.ana

If you want to remove pc/@lemma quickly you can use:

# inplace remove lemma from <pc> elements in file
xmlstarlet edit --inplace --delete "//_:pc/@lemma"  FILE.xml

matyaskopp · 2023-05-24T08:42:14Z

main corpus title

fix main title

https://github.com/miruskieta/ParlaMint-ES-PV/blob/303caf4d2a491782657d4906a1807674c259cce3/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L6-L8

<title type="main" xml:lang="eu">Eusko Legebiltzarreko corpusa ParlaMint-ES-PV</title>
<title type="main" xml:lang="es">Corpus del Parlamento Vasco ParlaMint-ES-PV</title>
<title type="main" xml:lang="en">Basque parliamentary corpus ParlaMint-ES-PV</title>

should be

<title type="main" xml:lang="eu">Eusko Legebiltzarreko corpusa ParlaMint-ES-PV [ParlaMint SAMPLE]</title>
<title type="main" xml:lang="es">Corpus del Parlamento Vasco ParlaMint-ES-PV [ParlaMint SAMPLE]</title>
<title type="main" xml:lang="en">Basque parliamentary corpus ParlaMint-ES-PV [ParlaMint SAMPLE]</title>

meeting of

add a reference to term event

Error: ERROR: cannot determine of which body the component ParlaMint-ES-PV_2015-02-27.xml is a meeting of!

https://github.com/miruskieta/ParlaMint-ES-PV/blob/303caf4d2a491782657d4906a1807674c259cce3/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-27.xml#L8-L10

<meeting ana="#parla.sitting" n="2015-02-27">2015-02-27</meeting>
<meeting ana="#parla.session" n="1086-1108">Osoko Bilkura 1086-1108</meeting>
<meeting ana="#parla.term" n="X">Legegintzaldia X</meeting>

should be

<meeting corresp="#ES-PV" ana="#parla.sitting" n="2015-02-27">2015-02-27</meeting>
<meeting corresp="#ES-PV" ana="#parla.session" n="1086-1108">Osoko Bilkura 1086-1108</meeting>
<meeting corresp="#ES-PV" ana="#parla.term #ES-PV.10" n="X">Legegintzaldia X</meeting>

The script uses the #ES-PV.10 value to determine the body where the meeting takes place.
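
Both conditions can be checked before running the validation with a short Python sketch. It assumes that any @ana value not starting with #parla. is a reference to a term event (as #ES-PV.10 is above); the function name check_meetings and the `<titleStmt>` wrapper in the sample are illustrative.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
T = f"{{{TEI_NS}}}"

def check_meetings(root):
    """Return (meeting/@n, problem) pairs for <meeting> elements."""
    problems = []
    for meeting in root.iter(f"{T}meeting"):
        refs = meeting.get("ana", "").split()
        if not meeting.get("corresp"):
            problems.append((meeting.get("n"), "missing @corresp"))
        # Assumption: a term event reference is any @ana value outside #parla.*
        if "#parla.term" in refs and all(r.startswith("#parla.") for r in refs):
            problems.append((meeting.get("n"), "no term event reference"))
    return problems

before = f"""<titleStmt xmlns="{TEI_NS}">
<meeting ana="#parla.session" n="1086-1108">Osoko Bilkura 1086-1108</meeting>
<meeting ana="#parla.term" n="X">Legegintzaldia X</meeting>
</titleStmt>"""
after = f"""<titleStmt xmlns="{TEI_NS}">
<meeting corresp="#ES-PV" ana="#parla.session" n="1086-1108">Osoko Bilkura 1086-1108</meeting>
<meeting corresp="#ES-PV" ana="#parla.term #ES-PV.10" n="X">Legegintzaldia X</meeting>
</titleStmt>"""
problems_before = check_meetings(ET.fromstring(before))
problems_after = check_meetings(ET.fromstring(after))
print(problems_before, problems_after)
```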

miruskieta · 2023-05-24T10:45:05Z

I think it is correct now, but there is an error in the validation process.
I checked ES-PG and they have "#parla.uni"; is that also necessary?

matyaskopp · 2023-05-24T10:55:04Z

I think it is correct now, but there is an error in the validation process. I checked ES-PG and they have "#parla.uni"; is that also necessary?

Not necessary, but better to have. It is a constant value for the whole corpus in your case, so no big deal :-)

matyaskopp · 2023-05-24T16:34:49Z

@miruskieta
I have updated the ticks, and there are still some left.

Would you like me to help you with validation errors? Would you like more explanations?

matyaskopp assigned miruskieta Feb 28, 2023

matyaskopp linked a pull request Mar 14, 2023 that will close this issue

Data ES-PV #510

Open

TomazErjavec added this to the Future milestone Mar 2, 2024
