-
Notifications
You must be signed in to change notification settings - Fork 133
PubmedOrcidGraph
##Motivation
Creates a graph from Pubmed and Authors' Orcid identifiers
##Compilation
- java 1.8 http://www.oracle.com/technetwork/java/index.html (NOT the old java 1.7 or 1.6) . Please check that this java is in the
${PATH}
. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 ) - GNU Make > 3.81
- curl/wget
- git
- apache ant is only required to compile htsjdk
- xsltproc http://xmlsoft.org/XSLT/xsltproc2.html
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make pubmedorcidgraph
by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes
on the command line:
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make pubmedorcidgraph standalone=yes
The required libraries will be downloaded and installed in the dist
directory.
The a file local.mk can be created edited to override/add some paths.
For example it can be used to set the HTTP proxy:
http.proxy.host=your.host.com
http.proxy.port=124567
##Synopsis
$ java -jar dist/pubmedorcidgraph.jar [options] (stdin|file)
- -o|--output (OUTPUT-FILE) Output file. Default:stdout.
- -f|--format (VALUE) Optional: 'txt', 'xml' , 'json' Default value : "txt".
- -h|--help print help
- -version|--version show version and exit
##Source Code
Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/pubmed/PubmedOrcidGraph.java
ORCIDORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized.""
You can download the papers containing some orcid Identifiers using the entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID]. I've used one of my tools pubmeddump to download the articles asXML and I wrote PubmedOrcidGraph to extract the author's orcid.
java -jar dist/pubmeddump.jar "orcid[AUID] Dina[AU]" |\
java -jar dist/pubmedorcidgraph.jar -f xml | xmllint --format -
ava -jar dist/pubmeddump.jar "orcid[AUID] Dina[AU]" |\
java -jar dist/pubmedorcidgraph.jar -f json | python -m json.tool
Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into some SQL statements:
<?xml version="1.0"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0"
xmlns:xalan="http://xml.apache.org/xalan"
xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
exclude-result-prefixes="xalan str"
>
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>
<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>
<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values('<xsl:value-of select='$o1'/>','<xsl:value-of select='@orcid'/>');
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)
and those sql statetements are loaded into sqlite3:
xsltproc orcid2sqlite.xsl orcid.xml |\
sqlite3 orcid.sqlite
The next step is to produce a gexf+xml file to play with the orcid graph in http://gephi.org . I use the following bash script to convert the sqlite3 database to gexf+xml.
DB=orcid.sqlite
cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">
<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF
sqlite3 -separator ' ' -noheader ${DB} 'select orcid,name,affiliation from author' |\
sed -e 's/&/&/g' -e "s/</\</g" -e "s/>/\>/g" -e "s/'/\'/g" -e 's/"/\"/g' |\
awk -F ' ' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'
echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader ${DB} 'select orcid1,orcid2 from collab' |\
awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"
- Issue Tracker: http://github.com/lindenb/jvarkit/issues
- Source Code: http://github.com/lindenb/jvarkit
The project is licensed under the MIT license.
Should you cite pubmedorcidgraph ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030