ES-PV Feedback #609
Hi Matyas, I think each "idno" has its own type, but I still get the following error: Do you know how I can find the error? Best, |
If I identify the line correctly, you have: The error message is indeed strange, but regardless, the line is not correct; cf. the definition of idno. It should be
|
Yes, I found that error yesterday and carefully checked whether there were any others, but the error is still there. Here is the line, and it looks good. |
Indeed it does look good. No idea what is wrong here. Maybe just correct everything else and open another pull request; it might then be easier to debug. |
There is opened |
Once I correct this, there are a lot of errors. I validated externally (https://codebeautify.org/xmlvalidator) and it seems it is ok. |
broken token order/nesting after NER annotation
<?xml version="1.0" encoding="UTF-8"?>
<seg xml:id="ParlaMint-ES-PV_2015-02-05.seg5" xml:lang="es">
<s xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1">
<w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w1" lemma="el" pos="DET" msd="UPosTag=DET|Definite=Def|Gender=Fem|Number=Sing|PronType=Art">
La
<!-- INCORRECT TOKEN PLACEMENT: -->
<w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w6" lemma="Ararteko" pos="PROPN" msd="UPosTag=PROPN">Ararteko</w>
</w>
<w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w2" lemma="institución" pos="NOUN" msd="UPosTag=NOUN|Gender=Fem|Number=Sing">institución</w>
<w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w3">
del
<w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w4" lemma="de" pos="ADP" msd="UPosTag=ADP" norm="de" />
<w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w5" lemma="el" pos="DET" msd="UPosTag=DET|Definite=Def|Gender=Masc|Number=Sing|PronType=Art" norm="el" />
</w>
<!-- SHOULD BE HERE -->
<w xml:id="ParlaMint-ES-PV_2015-02-05.seg5.s1.w7" lemma="ser" pos="AUX" msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin">es</w>
<!-- ... -->
</s>
</seg>
I have annotated your TEI sample with UDPipe on my side, and it works, so the incorrect tokens are produced by the NER tool.
# copy my script into the current directory
svn checkout https://github.com/ufal/ParCzech/trunk/src/udpipe2
svn checkout https://github.com/ufal/ParCzech/trunk/src/lib
# copy your sample data into the current directory
svn checkout https://github.com/miruskieta/ParlaMint/trunk/Data/ParlaMint-ES-PV
# create an output directory
mkdir ParlaMint-ES-PV.ana
# create a list of files to annotate
ls ParlaMint-ES-PV|grep 'ParlaMint-ES-PV_' | grep -v '.ana.xml' > files2annotate.fl
# run script
perl -I lib udpipe2/udpipe2.pl --colon2underscore \
--model "eu:basque-bdt-ud-2.10-220711" \
--model "es:spanish-ancora-ud-2.10-220711" \
--elements "seg" \
--debug \
--try2continue-on-error \
--filelist files2annotate.fl \
--input-dir ParlaMint-ES-PV \
--output-dir ParlaMint-ES-PV.ana
If you want to remove the lemma from <pc> elements:
# inplace remove lemma from <pc> elements in file
xmlstarlet edit --inplace --delete "//_:pc/@lemma" FILE.xml
|
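The misplaced token in the sample above (w6 nested inside w1) can be found mechanically before validation. The following is only a minimal sketch, not part of the ParlaMint tooling: it assumes token ids end in `.w<number>` as in the sample, and the helper name `misplaced_tokens` and the shortened `demo` ids are hypothetical. It reports every `<w>` that is immediately followed, in document order, by a lower-numbered `<w>`.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, hypothetical version of the broken sentence from the thread
SAMPLE = """<s xml:id="demo.s1">
  <w xml:id="demo.s1.w1">La<w xml:id="demo.s1.w6">Ararteko</w></w>
  <w xml:id="demo.s1.w2">institución</w>
  <w xml:id="demo.s1.w3">del<w xml:id="demo.s1.w4" norm="de"/><w xml:id="demo.s1.w5" norm="el"/></w>
  <w xml:id="demo.s1.w7">es</w>
</s>"""


def misplaced_tokens(sentence):
    """Return ids of <w> elements directly followed by a lower-numbered <w>."""
    numbered = []
    for el in sentence.iter():  # iter() walks the tree in document order
        if el.tag.rsplit("}", 1)[-1] != "w":
            continue
        xml_id = el.get(XML_ID, "")
        match = re.search(r"\.w(\d+)$", xml_id)
        if match:
            numbered.append((int(match.group(1)), xml_id))
    return [
        current_id
        for (current, current_id), (following, _) in zip(numbered, numbered[1:])
        if current > following
    ]


for token_id in misplaced_tokens(ET.fromstring(SAMPLE)):
    print("out of order:", token_id)
```

Legitimate nesting, such as the contraction "del" split into w4 and w5, is not reported because the numbering still increases in document order.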
main corpus title
<title type="main" xml:lang="eu">Eusko Legebiltzarreko corpusa ParlaMint-ES-PV</title>
<title type="main" xml:lang="es">Corpus del Parlamento Vasco ParlaMint-ES-PV</title>
<title type="main" xml:lang="en">Basque parliamentary corpus ParlaMint-ES-PV</title>
should be
<title type="main" xml:lang="eu">Eusko Legebiltzarreko corpusa ParlaMint-ES-PV [ParlaMint SAMPLE]</title>
<title type="main" xml:lang="es">Corpus del Parlamento Vasco ParlaMint-ES-PV [ParlaMint SAMPLE]</title>
<title type="main" xml:lang="en">Basque parliamentary corpus ParlaMint-ES-PV [ParlaMint SAMPLE]</title>
meeting of
<meeting ana="#parla.sitting" n="2015-02-27">2015-02-27</meeting>
<meeting ana="#parla.session" n="1086-1108">Osoko Bilkura 1086-1108</meeting>
<meeting ana="#parla.term" n="X">Legegintzaldia X</meeting>
should be
<meeting corresp="#ES-PV" ana="#parla.sitting" n="2015-02-27">2015-02-27</meeting>
<meeting corresp="#ES-PV" ana="#parla.session" n="1086-1108">Osoko Bilkura 1086-1108</meeting>
<meeting corresp="#ES-PV" ana="#parla.term #ES-PV.10" n="X">Legegintzaldia X</meeting>
The script uses the |
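The two header corrections above (main titles of a sample ending in `[ParlaMint SAMPLE]`, and every `meeting` carrying `@corresp`) could be checked with a few lines of Python before opening a pull request. This is a rough sketch under assumptions: the function name `header_problems` and the wrapper element in the demo data are hypothetical, and it is not a substitute for the project's own validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment: one correct and one incorrect element of each kind
SAMPLE = """<teiHeader>
  <title type="main" xml:lang="en">Basque parliamentary corpus ParlaMint-ES-PV [ParlaMint SAMPLE]</title>
  <title type="main" xml:lang="es">Corpus del Parlamento Vasco ParlaMint-ES-PV</title>
  <meeting corresp="#ES-PV" ana="#parla.sitting" n="2015-02-27">2015-02-27</meeting>
  <meeting ana="#parla.term" n="X">Legegintzaldia X</meeting>
</teiHeader>"""


def header_problems(root, sample_marker="[ParlaMint SAMPLE]"):
    """List main titles without the sample marker and meetings without @corresp."""
    problems = []
    for el in root.iter():
        name = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix if present
        text = (el.text or "").strip()
        if name == "title" and el.get("type") == "main":
            if not text.endswith(sample_marker):
                problems.append(f"main title lacks {sample_marker}: {text}")
        elif name == "meeting" and el.get("corresp") is None:
            problems.append(f"meeting lacks @corresp: {text}")
    return problems


for problem in header_problems(ET.fromstring(SAMPLE)):
    print(problem)
```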
I think it is correct now, but there is an error in the validation process. |
Not necessary, but better to have. It is a constant value for the whole corpus in your case, so no big deal :-) |
@miruskieta Would you like me to help you with validation errors? Would you like more explanations? |
Common taxonomies are modified
The common taxonomies should be used without modifications - just translations.
E.g. in your parla.legislature taxonomy, you don't have the category parla.bi
You can use this procedure to extract the taxonomies, listPerson, and listOrg into separate files and include them with include:
and then commit and push the updated files
meeting corresp
meeting/@corresp
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L12-L14
should be (DZ is an id of the corresponding parliament)
BTW, DZ is used in the ParlaMint-SI corpus. This id does not make sense for the Basque parliament.
idno type
idno type is usually URI, and there can be some subtypes: https://clarin-eric.github.io/ParlaMint/#TEI.idno
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L399
should be
there are other strange types:
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L1539
wrong term date
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L416-L420
country code
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L376
should be
different setting in component and root files
setting
vs.
unicameral parliament
parla.uni
don't use #parla.lower in a unicameral parliamentary system
typo: aupport vs support
support
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L21
no virtual persons
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L550
remove annotation with parliamentary organization
ana="#DZ"
In @ana references, we use events or categories, not organizations. If someone's membership corresponds to some term, you can use a term id (e.g. #DZ.10)
langUsage usage
https://clarin-eric.github.io/ParlaMint/#TEI.language
There is no definition of how to calculate this number; in ParlaMint-UA I use the percentage of words in the given language
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.xml#L3973-L3983
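One way to compute such a per-language word percentage is sketched below. It is an illustration only, not the ParlaMint-UA script: the function names are hypothetical, each `<w>` is attributed to the nearest enclosing `xml:lang`, a contraction such as "del" is counted once (its nested sub-tokens are skipped), and rounding to one decimal is an arbitrary choice.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-segment fragment: three Spanish words and one Basque word
SAMPLE = """<body>
  <seg xml:lang="es"><s><w>La</w><w>institución</w><w>del<w norm="de"/><w norm="el"/></w></s></seg>
  <seg xml:lang="eu"><s><w>Legegintzaldia</w></s></seg>
</body>"""


def count_words(element, inherited_lang, counts):
    """Count top-level <w> elements under the nearest enclosing xml:lang."""
    lang = element.get(XML_LANG, inherited_lang)
    if element.tag.rsplit("}", 1)[-1] == "w":
        counts[lang] += 1
        return  # assumption: sub-tokens of a contraction are not counted again
    for child in element:
        count_words(child, lang, counts)


def language_percentages(root):
    """Return {language: percentage of words}, rounded to one decimal."""
    counts = Counter()
    count_words(root, None, counts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}


print(language_percentages(ET.fromstring(SAMPLE)))
```

Counting the syntactic sub-tokens instead of the surface word would shift the numbers slightly; whichever choice is made should be applied consistently across the corpus.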
Remove Czech named entity taxonomy
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV-taxonomy-NER.cnec2.0.ana.xml
it is used only in Czech data
file including:
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.ana.xml#L370-L371
prefix definition:
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.ana.xml#L377-L379
fix application info
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV.ana.xml#L381-L390
You have used different models in UDPipe
and as far as I know you haven't used NameTag but some other tool for NE
add parla.sitting
TEI/@ana
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L2
should be
component file meeting
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L8
can be
word count
I don't understand this, wrong language, strange description
(aprox)
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L26
notes
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L83
should be (outside utterance + type):
https://github.com/miruskieta/ParlaMint-ES-PV/blob/b47c93f7b7f475bab247bd5b15ce0904b01c43af/Data/ParlaMint-ES-PV/ParlaMint-ES-PV_2015-02-05.xml#L1415-L1416
there is a lot of text inside (); I am not sure what is a comment and what is an incident...
no named entities