PubmedOrcidGraph

##Motivation

Creates a graph from Pubmed and Authors' Orcid identifiers

##Compilation

Requirements / Dependencies

java 1.8 http://www.oracle.com/technetwork/java/index.html (NOT the old java 1.7 or 1.6) . Please check that this java is in the ${PATH}. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 )
GNU Make > 3.81
curl/wget
git
apache ant is only required to compile htsjdk
xsltproc http://xmlsoft.org/XSLT/xsltproc2.html

Download and Compile

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make pubmedorcidgraph

by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes on the command line:

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make pubmedorcidgraph standalone=yes

The required libraries will be downloaded and installed in the dist directory.

edit 'local.mk' (optional)

The a file local.mk can be created edited to override/add some paths.

For example it can be used to set the HTTP proxy:

http.proxy.host=your.host.com
http.proxy.port=124567

##Synopsis

$ java -jar dist/pubmedorcidgraph.jar  [options] (stdin|file)

Options

-o|--output (OUTPUT-FILE) Output file. Default:stdout.
-f|--format (VALUE) Optional: 'txt', 'xml' , 'json' Default value : "txt".
-h|--help print help
-version|--version show version and exit

##Source Code

Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/pubmed/PubmedOrcidGraph.java

About Orcid

ORCIDORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized.""

You can download the papers containing some orcid Identifiers using the entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID]. I've used one of my tools pubmeddump to download the articles asXML and I wrote PubmedOrcidGraph to extract the author's orcid.

Example

java -jar dist/pubmeddump.jar "orcid[AUID]" |\
java -jar dist/pubmedorcidgraph.jar 
#PubmedArticle
PMID	24549058
doi	10.1038/ejhg.2014.1
Year	2014
Journal	Eur. J. Hum. Genet.
Title	Using ancestry-informative markers to identify fine structure across 15 populations of European origin.
Author	0000-0002-6619-873X

#Author
ORCID	0000-0002-6619-873X
ForeName	Patrick F
LastName	Sullivan
Initials	PF
Affiliation	University of North Carolina, Chapel Hill, NC, USA.

#PubmedArticle
PMID	23872634
doi	10.1038/ng.2712
Year	2013
Journal	Nat. Genet.
Title	Common variants at SCN5A-SCN10A and HEY2 are associated with Brugada syndrome, a rare disease with high risk of sudden cardiac death.
Author	0000-0001-7751-2280
Author	0000-0002-7722-7348

#Author
ORCID	0000-0001-7751-2280
ForeName	Richard
LastName	Redon
Initials	R

#Author
ORCID	0000-0002-7722-7348
ForeName	Christian
LastName	Dina
Initials	C

java -jar dist/pubmeddump.jar "orcid[AUID]" |\
java -jar dist/pubmedorcidgraph.jar -f xml | xmllint --format -

  <PubmedArticle pmid="26301497" doi="10.1038/ng.3383">
    <year>2015</year>
    <journal>Nat. Genet.</journal>
    <title>Genetic association analyses highlight biological pathways underlying mitral valve prolapse.</title>
    <Author orcid="0000-0001-5925-381X">
      <foreName>Xavier</foreName>
      <lastName>Jeunemaitre</lastName>
      <initials>X</initials>
      <affiliation>AP-HP, Department of Genetics, Hôpital Européen Georges Pompidou, Paris, France.</affiliation>
    </Author>
    <Author orcid="0000-0002-2067-0533">
      <foreName>Patrick T</foreName>
      <lastName>Ellinor</lastName>
      <initials>PT</initials>
      <affiliation>Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA.</affiliation>
    </Author>
    <Author orcid="0000-0002-4227-7062">
      <foreName>Jorge</foreName>
      <lastName>Solis</lastName>
      <initials>J</initials>
      <affiliation>Centro Nacional de Investigaciones Cardiovasculares, Carlos III (CNIC), Madrid, Spain.</affiliation>
    </Author>
    <Author orcid="0000-0002-7722-7348">
      <foreName>Christian</foreName>
      <lastName>Dina</lastName>
      <initials>C</initials>
      <affiliation>Centre Hospitalier Universitaire (CHU) Nantes, Université de Nantes, Nantes, France.</affiliation>
    </Author>
    <Author orcid="0000-0003-4076-2336">
      <foreName>Emelia J</foreName>
      <lastName>Benjamin</lastName>
      <initials>EJ</initials>
      <affiliation>Department of Medicine (Cardiovascular Division), Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA.</affiliation>
    </Author>
  </PubmedArticle>
  <PubmedArticle pmid="23872634" doi="10.1038/ng.2712">
    <year>2013</year>
    <journal>Nat. Genet.</journal>
    <title>Common variants at SCN5A-SCN10A and HEY2 are associated with Brugada syndrome, a rare disease with high risk of sudden cardiac death.</title>
    <Author orcid="0000-0001-7751-2280">
      <foreName>Richard</foreName>
      <lastName>Redon</lastName>
      <initials>R</initials>
    </Author>
    <Author orcid="0000-0002-7722-7348">
      <foreName>Christian</foreName>
      <lastName>Dina</lastName>
      <initials>C</initials>
    </Author>
  </PubmedArticle>

java -jar dist/pubmeddump.jar "orcid[AUID]"  |\
java -jar dist/pubmedorcidgraph.jar -f json | python -m json.tool



[
(...)
    {
        "authors": [
            {
                "affiliation": "AP-HP, Department of Genetics, H\u00f4pital Europ\u00e9en Georges Pompidou, Paris, France.", 
                "foreName": "Xavier", 
                "initials": "X", 
                "lastName": "Jeunemaitre", 
                "orcid": "0000-0001-5925-381X"
            }, 
            {
                "affiliation": "Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA.", 
                "foreName": "Patrick T", 
                "initials": "PT", 
                "lastName": "Ellinor", 
                "orcid": "0000-0002-2067-0533"
            }, 
            {
                "affiliation": "Centro Nacional de Investigaciones Cardiovasculares, Carlos III (CNIC), Madrid, Spain.", 
                "foreName": "Jorge", 
                "initials": "J", 
                "lastName": "Solis", 
                "orcid": "0000-0002-4227-7062"
            }, 
            {
                "affiliation": "Centre Hospitalier Universitaire (CHU) Nantes, Universit\u00e9 de Nantes, Nantes, France.", 
                "foreName": "Christian", 
                "initials": "C", 
                "lastName": "Dina", 
                "orcid": "0000-0002-7722-7348"
            }, 
            {
                "affiliation": "Department of Medicine (Cardiovascular Division), Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA.", 
                "foreName": "Emelia J", 
                "initials": "EJ", 
                "lastName": "Benjamin", 
                "orcid": "0000-0003-4076-2336"
            }
        ], 
        "doi": "10.1038/ng.3383", 
        "journal": "Nat. Genet.", 
        "pmid": "26301497", 
        "title": "Genetic association analyses highlight biological pathways underlying mitral valve prolapse.", 
        "year": "2015"
    }, 
    {
        "authors": [
            {
                "foreName": "Richard", 
                "initials": "R", 
                "lastName": "Redon", 
                "orcid": "0000-0001-7751-2280"
            }, 
            {
                "foreName": "Christian", 
                "initials": "C", 
                "lastName": "Dina", 
                "orcid": "0000-0002-7722-7348"
            }
        ], 
        "doi": "10.1038/ng.2712", 
        "journal": "Nat. Genet.", 
        "pmid": "23872634", 
        "title": "Common variants at SCN5A-SCN10A and HEY2 are associated with Brugada syndrome, a rare disease with high risk of sudden cardiac death.", 
        "year": "2013"
    }
]

Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into some SQL statements:

<?xml version="1.0"?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
    exclude-result-prefixes="xalan str"
 >
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values('<xsl:value-of select='$o1'/>','<xsl:value-of select='@orcid'/>');
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)

and those sql statetements are loaded into sqlite3:

xsltproc orcid2sqlite.xsl orcid.xml |\
 sqlite3 orcid.sqlite

The next step is to produce a gexf+xml file to play with the orcid graph in http://gephi.org . I use the following bash script to convert the sqlite3 database to gexf+xml.


DB=orcid.sqlite

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">

<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF

sqlite3 -separator ' ' -noheader  ${DB} 'select orcid,name,affiliation from author' |\
 sed  -e 's/&/&/g' -e "s/</\</g" -e "s/>/\>/g" -e "s/'/\'/g"  -e 's/"/\"/g' |\
 awk -F ' ' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'

echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader  ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"