-
Notifications
You must be signed in to change notification settings - Fork 133
PubmedOrcidGraph
##Motivation
Creates a graph from Pubmed and Authors' Orcid identifiers
##Compilation
- java 1.8 http://www.oracle.com/technetwork/java/index.html (NOT the old java 1.7 or 1.6) . Please check that this java is in the
${PATH}
. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 ) - GNU Make > 3.81
- curl/wget
- git
- apache ant is only required to compile htsjdk
- xsltproc http://xmlsoft.org/XSLT/xsltproc2.html
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make pubmedorcidgraph
by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes
on the command line:
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make pubmedorcidgraph standalone=yes
The required libraries will be downloaded and installed in the dist
directory.
The a file local.mk can be created edited to override/add some paths.
For example it can be used to set the HTTP proxy:
http.proxy.host=your.host.com
http.proxy.port=124567
##Synopsis
$ java -jar dist/pubmedorcidgraph.jar [options] (stdin|file)
- -o|--output (OUTPUT-FILE) Output file. Default:stdout.
- -f|--format (VALUE) Optional: 'txt', 'xml' , 'json' Default value : "txt".
- -h|--help print help
- -version|--version show version and exit
##Source Code
Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/pubmed/PubmedOrcidGraph.java
ORCIDORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized.""
You can download the papers containing some orcid Identifiers using the entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID]. I've used one of my tools pubmeddump to download the articles asXML and I wrote PubmedOrcidGraph to extract the author's orcid.
java -jar dist/pubmeddump.jar "orcid[AUID]" |\
java -jar dist/pubmedorcidgraph.jar
#PubmedArticle
PMID 24549058
doi 10.1038/ejhg.2014.1
Year 2014
Journal Eur. J. Hum. Genet.
Title Using ancestry-informative markers to identify fine structure across 15 populations of European origin.
Author 0000-0002-6619-873X
#Author
ORCID 0000-0002-6619-873X
ForeName Patrick F
LastName Sullivan
Initials PF
Affiliation University of North Carolina, Chapel Hill, NC, USA.
#PubmedArticle
PMID 23872634
doi 10.1038/ng.2712
Year 2013
Journal Nat. Genet.
Title Common variants at SCN5A-SCN10A and HEY2 are associated with Brugada syndrome, a rare disease with high risk of sudden cardiac death.
Author 0000-0001-7751-2280
Author 0000-0002-7722-7348
#Author
ORCID 0000-0001-7751-2280
ForeName Richard
LastName Redon
Initials R
#Author
ORCID 0000-0002-7722-7348
ForeName Christian
LastName Dina
Initials C
java -jar dist/pubmeddump.jar "orcid[AUID]" |\
java -jar dist/pubmedorcidgraph.jar -f xml | xmllint --format -
<PubmedArticle pmid="26301497" doi="10.1038/ng.3383">
<year>2015</year>
<journal>Nat. Genet.</journal>
<title>Genetic association analyses highlight biological pathways underlying mitral valve prolapse.</title>
<Author orcid="0000-0001-5925-381X">
<foreName>Xavier</foreName>
<lastName>Jeunemaitre</lastName>
<initials>X</initials>
<affiliation>AP-HP, Department of Genetics, Hôpital Européen Georges Pompidou, Paris, France.</affiliation>
</Author>
<Author orcid="0000-0002-2067-0533">
<foreName>Patrick T</foreName>
<lastName>Ellinor</lastName>
<initials>PT</initials>
<affiliation>Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA.</affiliation>
</Author>
<Author orcid="0000-0002-4227-7062">
<foreName>Jorge</foreName>
<lastName>Solis</lastName>
<initials>J</initials>
<affiliation>Centro Nacional de Investigaciones Cardiovasculares, Carlos III (CNIC), Madrid, Spain.</affiliation>
</Author>
<Author orcid="0000-0002-7722-7348">
<foreName>Christian</foreName>
<lastName>Dina</lastName>
<initials>C</initials>
<affiliation>Centre Hospitalier Universitaire (CHU) Nantes, Université de Nantes, Nantes, France.</affiliation>
</Author>
<Author orcid="0000-0003-4076-2336">
<foreName>Emelia J</foreName>
<lastName>Benjamin</lastName>
<initials>EJ</initials>
<affiliation>Department of Medicine (Cardiovascular Division), Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA.</affiliation>
</Author>
</PubmedArticle>
<PubmedArticle pmid="23872634" doi="10.1038/ng.2712">
<year>2013</year>
<journal>Nat. Genet.</journal>
<title>Common variants at SCN5A-SCN10A and HEY2 are associated with Brugada syndrome, a rare disease with high risk of sudden cardiac death.</title>
<Author orcid="0000-0001-7751-2280">
<foreName>Richard</foreName>
<lastName>Redon</lastName>
<initials>R</initials>
</Author>
<Author orcid="0000-0002-7722-7348">
<foreName>Christian</foreName>
<lastName>Dina</lastName>
<initials>C</initials>
</Author>
</PubmedArticle>
java -jar dist/pubmeddump.jar "orcid[AUID]" |\
java -jar dist/pubmedorcidgraph.jar -f json | python -m json.tool
[
(...)
{
"authors": [
{
"affiliation": "AP-HP, Department of Genetics, H\u00f4pital Europ\u00e9en Georges Pompidou, Paris, France.",
"foreName": "Xavier",
"initials": "X",
"lastName": "Jeunemaitre",
"orcid": "0000-0001-5925-381X"
},
{
"affiliation": "Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA.",
"foreName": "Patrick T",
"initials": "PT",
"lastName": "Ellinor",
"orcid": "0000-0002-2067-0533"
},
{
"affiliation": "Centro Nacional de Investigaciones Cardiovasculares, Carlos III (CNIC), Madrid, Spain.",
"foreName": "Jorge",
"initials": "J",
"lastName": "Solis",
"orcid": "0000-0002-4227-7062"
},
{
"affiliation": "Centre Hospitalier Universitaire (CHU) Nantes, Universit\u00e9 de Nantes, Nantes, France.",
"foreName": "Christian",
"initials": "C",
"lastName": "Dina",
"orcid": "0000-0002-7722-7348"
},
{
"affiliation": "Department of Medicine (Cardiovascular Division), Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA.",
"foreName": "Emelia J",
"initials": "EJ",
"lastName": "Benjamin",
"orcid": "0000-0003-4076-2336"
}
],
"doi": "10.1038/ng.3383",
"journal": "Nat. Genet.",
"pmid": "26301497",
"title": "Genetic association analyses highlight biological pathways underlying mitral valve prolapse.",
"year": "2015"
},
{
"authors": [
{
"foreName": "Richard",
"initials": "R",
"lastName": "Redon",
"orcid": "0000-0001-7751-2280"
},
{
"foreName": "Christian",
"initials": "C",
"lastName": "Dina",
"orcid": "0000-0002-7722-7348"
}
],
"doi": "10.1038/ng.2712",
"journal": "Nat. Genet.",
"pmid": "23872634",
"title": "Common variants at SCN5A-SCN10A and HEY2 are associated with Brugada syndrome, a rare disease with high risk of sudden cardiac death.",
"year": "2013"
}
]
Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into some SQL statements:
<?xml version="1.0"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0"
xmlns:xalan="http://xml.apache.org/xalan"
xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
exclude-result-prefixes="xalan str"
>
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>
<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>
<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values('<xsl:value-of select='$o1'/>','<xsl:value-of select='@orcid'/>');
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)
and those sql statetements are loaded into sqlite3:
xsltproc orcid2sqlite.xsl orcid.xml |\
sqlite3 orcid.sqlite
The next step is to produce a gexf+xml file to play with the orcid graph in http://gephi.org . I use the following bash script to convert the sqlite3 database to gexf+xml.
DB=orcid.sqlite
cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">
<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF
sqlite3 -separator ' ' -noheader ${DB} 'select orcid,name,affiliation from author' |\
sed -e 's/&/&/g' -e "s/</\</g" -e "s/>/\>/g" -e "s/'/\'/g" -e 's/"/\"/g' |\
awk -F ' ' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'
echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader ${DB} 'select orcid1,orcid2 from collab' |\
awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"
- Issue Tracker: http://github.com/lindenb/jvarkit/issues
- Source Code: http://github.com/lindenb/jvarkit
The project is licensed under the MIT license.
Should you cite pubmedorcidgraph ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030