parsenexus doesn't accept TaxaLabels names that begin with integers #39

Open
isaacovercast opened this issue Jan 11, 2021 · 2 comments
@isaacovercast

Hello folks, very cool module, thanks for your efforts. While parsing a bunch of heterogeneous nexus files I discovered a couple of potential issues, one of which you can guess from the title. The parsenexus function appears to split taxon names in the TaxLabels block when they begin with an integer. It also splits on periods (.) in taxon names. Within the function parsetaxa(token, state, tokens, taxa) this results in a taxa dictionary with more entries than are declared by ntax in the nexus file, raising the warning at around line 436 of newick.jl.

Steps to reproduce:

using Phylo
nex = """#NEXUS

Begin taxa;
    Dimensions ntax=3;
    TaxLabels
        _2109_Nesoenas_picturata_Reunion
        2108_Nesoenas_picturata_Reunion
        AY529948.1
        ;
End;
Begin trees;
    Translate
        1 _2109_Nesoenas_picturata_Reunion,
        2 2108_Nesoenas_picturata_Reunion,
        3 AY529948.1
        ;
    tree TREE1 = [&R] (2:0.2311195,(1:0.18127275,3:0.18127275)3:0.049846749999999995);
End;
"""
open("/tmp/tmp.tre","w") do io
   println(io, nex)
end

ts = open(parsenexus, Phylo.path("/tmp/tmp.tre"))

I updated newick.jl (~line 436) to show the taxa dictionary, so the output of the above call is more informative:

    if length(taxa) != ntax
        @warn "$taxa"
        @warn "Taxa list length ($(length(taxa))) and ntax ($ntax) do not match"
    end

And the result is:

┌ Warning: Dict("AY529948" => "AY529948","_Nesoenas_picturata_Reunion" => "_Nesoenas_picturata_Reunion",".1" => ".1","2108" => "2108","_2109_Nesoenas_picturata_Reunion" => "_2109_Nesoenas_picturata_Reunion")
└ @ Phylo /home/isaac/tmp/julia/Phylo.jl/src/newick.jl:443
┌ Warning: Taxa list length (5) and ntax (3) do not match
└ @ Phylo /home/isaac/tmp/julia/Phylo.jl/src/newick.jl:444
...
TreeSet with 0 trees, each with 0 tips.
Tree names are 

Expects 3 and is getting 5 because 2108_Nesoenas_picturata_Reunion and AY529948.1 are getting split.

I think ape doesn't allow integers to lead taxon names either, so maybe this is a feature and not a bug, but I don't think it's a constraint of the nexus format, as other packages will handle this fine (e.g. toytree and dendropy).

I can work around it, so not a big deal but thought I'd report it.
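A minimal sketch of one such workaround, assuming the labels can be rewritten before the file is parsed. `sanitize_label` and `restore` are hypothetical names, not part of Phylo.jl. The idea relies on the output above, where the underscore-led name `_2109_Nesoenas_picturata_Reunion` survived intact:

```julia
# Hypothetical helper (not part of Phylo.jl): rewrite a label so that it
# contains no period and does not start with a digit.
function sanitize_label(label::AbstractString)
    safe = replace(label, "." => "_")   # periods were splitting names
    # A leading digit was splitting names too, so prefix an underscore.
    return (!isempty(safe) && isdigit(first(safe))) ? "_" * safe : safe
end

labels = ["_2109_Nesoenas_picturata_Reunion",
          "2108_Nesoenas_picturata_Reunion",
          "AY529948.1"]

# Map sanitized name => original name so labels can be restored after parsing.
restore = Dict(sanitize_label(l) => l for l in labels)

# Two different labels could sanitize to the same string; refuse to continue if so.
length(restore) == length(labels) || error("sanitized labels collide")
```

The same substitution would have to be applied to the TaxLabels and Translate entries in the file itself, and `restore` used to recover the original names afterwards.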

Thanks again for all your work.
-isaac

@richardreeve
Member

Hmm, interesting. Thanks for pointing this out. This is a parser issue because I'm using the built-in Julia parser, which doesn't allow character tokens to begin with a number or contain a period. I'll have a think about whether there's some easy way around it, but nothing immediately springs to mind unfortunately.
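As a rough illustration of those rules, the sketch below checks the three labels from the report against Julia's own identifier rules using `Base.isidentifier`. This is only an analogy and an assumption on my part: it does not call Phylo.jl's tokeniser, it just shows which of the names Julia itself would accept as a single identifier.

```julia
labels = ["_2109_Nesoenas_picturata_Reunion",
          "2108_Nesoenas_picturata_Reunion",
          "AY529948.1"]

# Base.isidentifier reports whether a string is a valid ordinary Julia identifier.
valid = [Base.isidentifier(l) for l in labels]

for (label, ok) in zip(labels, valid)
    println(rpad(label, 34), ok ? "single identifier" : "not a single identifier")
end
```

Only the underscore-led name passes, which lines up with the two names that were split in the warning output.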

@richardreeve
Member

richardreeve commented Apr 7, 2021

Okay, I've finally had time to look at this, and I think I have a solution (for this issue and #42). When I wrote the original parser I couldn't use Automa.jl because it couldn't handle recursive finite state machines (so nested trees within trees within trees, like we have for phylogenies). I actually can't tell whether it can now - it doesn't mention it and the examples are non-recursive. However, it now exposes its tokeniser directly, so I think at the very least I can rewrite the parser so that it uses Automa's tokeniser. It'll be a pain, but a lot easier than writing my own tokeniser! I'm afraid it may take a bit of time before it bubbles to the top of my list though...

[edit: It definitely can't handle recursive grammars still, but they are thinking about it...]
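To make the tokeniser idea concrete, here is a minimal regex-based sketch of a flat token stream in which a taxon label is any run of characters without whitespace or punctuation. It is an assumption-laden illustration, not Automa.jl's API and not the planned Phylo.jl rewrite: `TOKEN` and `tokenise` are hypothetical names, and the punctuation set is a reduced one chosen for this example.

```julia
# Hypothetical tokeniser sketch. A token is one of:
#   * a single-quoted string (with '' as an escaped quote),
#   * one punctuation character from a reduced, assumed set,
#   * a run of characters containing no whitespace, punctuation or quote.
TOKEN = r"'(?:[^']|'')*'|[()\[\]{},;:=]|[^\s()\[\]{},;:=']+"

tokenise(text::AbstractString) = [String(m.match) for m in eachmatch(TOKEN, text)]

tokens = tokenise("""
    TaxLabels
        _2109_Nesoenas_picturata_Reunion
        2108_Nesoenas_picturata_Reunion
        AY529948.1
        ;
""")

foreach(println, tokens)
```

With tokens defined this way, names that start with digits or contain periods stay whole. The recursion needed for nested trees would still have to live in the parser that consumes this flat stream.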
