rename uniprot to swissprot #4

Open
cmungall opened this issue Nov 23, 2022 · 5 comments · May be fixed by #7

@cmungall

the uniprot obo file is actually just swissprot

grep -c '^id: uniprot:' ../obo-db-ingest/export/uniprot/2022_02/uniprot.obo
567483
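The same count can be reproduced without grep. A minimal Python sketch, assuming the OBO file has one `id:` line per stanza; the function name `count_ids` and the sample records are made-up placeholders, not part of the ingest:

```python
import io

def count_ids(lines, prefix="uniprot:"):
    """Count lines starting with 'id: <prefix>', like grep -c '^id: uniprot:'."""
    needle = f"id: {prefix}"
    return sum(1 for line in lines if line.startswith(needle))

# Tiny in-memory stand-in for uniprot.obo (placeholder records, not real data)
sample = io.StringIO(
    "[Term]\nid: uniprot:P12345\nname: example one\n\n"
    "[Term]\nid: uniprot:Q67890\nname: example two\n\n"
    "[Typedef]\nid: part_of\n"
)
print(count_ids(sample))
```

For the real file, pass an open file handle instead of the sample, e.g. `with open(path) as fh: count_ids(fh)`.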

which is useful in its own right, but it should be called swissprot

uniprot has another 229m entries from trembl, which might be hard to fit within github size limits

another useful slice is all the reference proteomes. For human this more or less equates to swissprot but for other organisms it gives a representative entry for each gene

@cthoyt
Member

cthoyt commented Feb 3, 2023

Not sure what to do about this. I want files in this repo to correspond to semantic spaces. UniProt is definitely an issue, given it's so big and I don't want to include trembl

@cthoyt
Member

cthoyt commented Feb 3, 2023

Is there a downstream use case that merits me spending brain power on this?

@cthoyt
Member

cthoyt commented Feb 3, 2023

potential solution: create subspace relationship in bioregistry

@cmungall
Author

cmungall commented Feb 3, 2023

but the subspace idea makes sense. E.g. when I run the ingest, I would get something like:

uniprot/
    uniprot-swissprot.obo
    uniprot-swissprot.owl

this makes it clear you are only ingesting a subset

this means that if people do want to do a run ingesting all of trembl they can do this in a compatible way
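A rough sketch of how that naming scheme could be generated, assuming a `<prefix>/<prefix>-<subset>.<ext>` convention as in the listing above. The helper `subset_paths` is hypothetical and not part of the ingest code:

```python
from pathlib import PurePosixPath

def subset_paths(prefix, subset, extensions=("obo", "owl")):
    """Build export paths like uniprot/uniprot-swissprot.obo (assumed convention)."""
    directory = PurePosixPath(prefix)
    return [directory / f"{prefix}-{subset}.{ext}" for ext in extensions]

for path in subset_paths("uniprot", "swissprot"):
    print(path)
```

A full trembl run would then sit alongside it under the same directory, e.g. `subset_paths("uniprot", "trembl")`, without clashing with the swissprot files.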

I am not sure if the subsets need to be registered in bioregistry. there are a lot of ways to subdivide a large resource.

are you looking for use cases that require more than swissprot? For many non-human organisms, swissprot coverage is not complete (in fact it's not even 100% complete for all human genes). The most useful subset of uniprot for an organism is often the gene-centric reference proteome subset, which will be a mix of swissprot and trembl (but not all of trembl - just one representative per gene)

cthoyt added a commit that referenced this issue Aug 7, 2023
@cthoyt cthoyt linked a pull request Aug 7, 2023 that will close this issue
@cmungall
Author

> Is there a downstream use case that merits me spending brain power on this?

This ingest is currently causing a lot of confusion: people read it and think it's all of uniprot, but in fact it's just swissprot (i.e. the reviewed subset). I think the immediate action is just to rename this from uniprot to swissprot.
