diff --git a/_datalad_buildsupport/setup.py b/_datalad_buildsupport/setup.py index 27e0821..b853b20 100644 --- a/_datalad_buildsupport/setup.py +++ b/_datalad_buildsupport/setup.py @@ -123,7 +123,7 @@ def run(self): dist = self.distribution #homepage = dist.get_url() #appname = self._parser.prog - appname = 'datalad' + appname = 'datalad-dataverse' cfg = read_configuration( opj(dirname(dirname(__file__)), 'setup.cfg'))['metadata'] diff --git a/docs/source/_static/tutorial/dv_add_dataset.png b/docs/source/_static/tutorial/dv_add_dataset.png new file mode 100644 index 0000000..3047ef3 Binary files /dev/null and b/docs/source/_static/tutorial/dv_add_dataset.png differ diff --git a/docs/source/_static/tutorial/dv_add_dataset_2.png b/docs/source/_static/tutorial/dv_add_dataset_2.png new file mode 100644 index 0000000..d00cb00 Binary files /dev/null and b/docs/source/_static/tutorial/dv_add_dataset_2.png differ diff --git a/docs/source/_static/tutorial/dv_dataset_annex.png b/docs/source/_static/tutorial/dv_dataset_annex.png new file mode 100644 index 0000000..3feec43 Binary files /dev/null and b/docs/source/_static/tutorial/dv_dataset_annex.png differ diff --git a/docs/source/_static/tutorial/dv_dataset_filetree.png b/docs/source/_static/tutorial/dv_dataset_filetree.png new file mode 100644 index 0000000..c923aaa Binary files /dev/null and b/docs/source/_static/tutorial/dv_dataset_filetree.png differ diff --git a/docs/source/_static/tutorial/dv_obtain_doi.png b/docs/source/_static/tutorial/dv_obtain_doi.png new file mode 100644 index 0000000..8b5a5b7 Binary files /dev/null and b/docs/source/_static/tutorial/dv_obtain_doi.png differ diff --git a/docs/source/_static/tutorial/dv_publish_ds.png b/docs/source/_static/tutorial/dv_publish_ds.png new file mode 100644 index 0000000..7e711a1 Binary files /dev/null and b/docs/source/_static/tutorial/dv_publish_ds.png differ diff --git a/docs/source/_static/tutorial/dv_token.png b/docs/source/_static/tutorial/dv_token.png new file 
mode 100644 index 0000000..dfdea1d Binary files /dev/null and b/docs/source/_static/tutorial/dv_token.png differ diff --git a/docs/source/index.rst b/docs/source/index.rst index 5928f40..603d3c9 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -21,11 +21,6 @@ from `OHBM BrainHack 2022 `__, and is the result of a wonderful collaboration between `many awesome people `__. If you want to get in touch or on board as well, please see our :ref:`contributing guidelines `. -.. attention:: **This extension is undergoing continous development and is in alpha stage!** - - Nevertheless, thanks for your interest in this piece of software! If you want to work with it - productively, we recommend that you come back in a few weeks, when we had some post-hackathon - time to package it up properly and complete documentation and tutorials. Documentation overview ====================== @@ -35,6 +30,7 @@ Documentation overview intro settingup + tutorial contributing glossary diff --git a/docs/source/intro.rst b/docs/source/intro.rst index f6c5a6d..7b4a8e3 100644 --- a/docs/source/intro.rst +++ b/docs/source/intro.rst @@ -48,15 +48,18 @@ The primary use case for dataverse siblings is dataset deposition, where only on Compared to workflows which use repository hosting services, this solution will be less flexible for collaboration (because it's not able to utilise features for controlling dataset history offered by repository hosting services, such as pull requests and conflict resolution), and might be slower (when it comes to file transfer). What it offers, however, is the ability to make the published dataset browsable like regular directories and amendable with metadata on the Dataverse instance while being cloneable through DataLad. +.. _usecases: + What can I use this extension for? ---------------------------------- You can use this extension to publish and share your dataset via Dataverse_, and you can use it to clone published DataLad datasets from Dataverse_. 
Here is some inspiration on what you could do: -- Publish your study (including its version history, data, code, results, and provenance) as a DataLad dataset to Dataverse to share it with collaborators or get a DOI for it. -- Share a published datasets' URL with colleagues and collaborators to give them easy access to your work with a single ``datalad clone``. -- Clone a friend's DataLad dataset -- from Dataverse! +- **Publish your study** (including its version history, data, code, results, and provenance) as a DataLad dataset to Dataverse to share it with collaborators. +- **DOIify** your work by getting a DOI for it from Dataverse. +- **Share a published dataset's URL** with colleagues and collaborators to give them easy access to your work with a single ``datalad clone``. +- **Clone a friend's DataLad dataset** -- from Dataverse! ``datalad-dataverse`` comes with a range of hidden convenience functions for Dataverse interactions. @@ -70,4 +73,5 @@ What can I **not** use this extension for? Please refer to the list of `special remotes`_ as hosted by the `git-annex`_ website for other storage services and how to use them with DataLad. - Dataverse installations may have upload or storage limits - exceeding those limits is not possible with this tool. However, you will be able to at least publish the revision history of your dataset even if annexed files are too large. - The starting point for working with this extension is a (published) DataLad dataset, not a regular Dataverse dataset. - This extension will not transform normal Dataverse datasets projects into DataLad datasets, but expose DataLad datasets as Dataverse datasets. \ No newline at end of file + This extension will not transform normal Dataverse datasets into DataLad datasets, but expose DataLad datasets as Dataverse datasets. +- Please see the :ref:`feature support ` section for particulars of what is and is not supported by this extension package. 
\ No newline at end of file diff --git a/docs/source/links.inc b/docs/source/links.inc index 1f1f23a..ade0149 100644 --- a/docs/source/links.inc +++ b/docs/source/links.inc @@ -19,3 +19,4 @@ .. _Python: https://www.python.org/ .. _Special Remote: https://git-annex.branchable.com/special_remotes/ .. _Special Remotes: https://git-annex.branchable.com/special_remotes/ +.. _version 5.13: https://guides.dataverse.org/en/5.13/ diff --git a/docs/source/settingup.rst b/docs/source/settingup.rst index 534e110..7f55d01 100644 --- a/docs/source/settingup.rst +++ b/docs/source/settingup.rst @@ -18,17 +18,20 @@ The relevant requirements are listed below. If you don't have DataLad_ and its underlying tools (`git`_, `git-annex`_) installed yet, please follow the instructions from `the datalad handbook `_. -Installation -^^^^^^^^^^^^ +.. _feature_support: + +Feature support +^^^^^^^^^^^^^^^^ +``datalad-dataverse`` is developed to be compatible with Dataverse (`version 5.13`_), which +has certain limitations when integrated with DataLad. In particular: -.. attention:: **This extension is undergoing continous development and is in alpha stage!** +- This extension does not support Dataverse versions prior to v5.13 +- This extension does not support unicode in filenames +- Support for handling previously published Dataverse datasets is experimental - Nevertheless, thanks for your interest in this piece of software! If you want to work with it - productively, we recommend that you come back in a few weeks, when we had some post-hackathon - time to package it up properly and complete documentation and tutorials. We didn't quite make it - to the release during the Hackathon, so regard the instructions below as how it will work in the - future. +Installation +^^^^^^^^^^^^ ``datalad-dataverse`` is a Python package available on `pypi `_ and installable via pip_. @@ -43,12 +46,42 @@ Installation Getting started ^^^^^^^^^^^^^^^ -Here's the gist of some of this extension's functionality. 
-Checkout the Tutorial for more detailed demonstrations. +.. admonition:: Tutorial + + For detailed instructions, please refer to the :ref:`tutorial`. + + +The ``datalad-dataverse`` software allows publishing a DataLad dataset to a Dataverse +instance. First you have to create an empty Dataverse dataset with a dedicated DOI, which +will be used in the code below (see how to do this in the :ref:`tutorial`). + +Next, ensure that your dataset is packaged as a DataLad dataset: + +.. code-block:: bash + + datalad create -d [dataset_location] --force + +Then create a dataverse `sibling` to the DataLad dataset: + +.. code-block:: bash + + datalad add-sibling-dataverse -s dataverse -d [dataset_location] https://demo.dataverse.org doi:10.70122/MYT/ESTDOI + +This command will report both the URL of the Dataverse instance and the DOI of the Dataverse dataset, as well as a long URL starting with ``datalad-annex::``. +This long URL is what you will need for cloning the dataset from Dataverse. -.. attention:: **This extension is undergoing continous development and is in alpha stage!** +Finally, push the DataLad dataset to Dataverse: + +.. code-block:: bash + + datalad push --to dataverse + +Once the dataset is available on Dataverse, it can also be cloned using the ``datalad-annex::`` URL provided by ``add-sibling-dataverse``: + +.. code-block:: bash + + datalad clone 'datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//demo.dataverse.org&doi=doi:10.70122/MYT/ESTDOI' - Sadly, there is no gist and no tutorial yet - come back a bit later, or help us create one :) .. admonition:: HELP! I'm new to this! diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst new file mode 100644 index 0000000..d22c661 --- /dev/null +++ b/docs/source/tutorial.rst @@ -0,0 +1,233 @@ +.. include:: ./links.inc + +.. _tutorial: + +Tutorial +======== + +The ``datalad-dataverse`` extension provides a single command, ``add-sibling-dataverse``. 
+This tutorial shows you how it can be used for interactions and publications to Dataverse. +For a high-level overview of what's possible with this extension, see also :ref:`usecases`. + +A full Dataverse interaction requires 5 steps: + +* :ref:`1` +* :ref:`2` +* :ref:`3` +* :ref:`4` +* :ref:`5` + +.. _1: + +1. Create a Dataverse dataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If you want to **publish** a dataset to Dataverse, you will need a dedicated location on Dataverse that we will publish our dataset to. +For this, we will use a :term:`Dataverse dataset`. + +Go to your favourite Dataverse instance, log in or create an account, and create a new draft :term:`Dataverse dataset` via the ``Add Data`` header: + +.. image:: ./_static/tutorial/dv_add_dataset.png + +The ``Add Dataset`` button takes you to a configurator for your :term:`Dataverse dataset`. +Provide all relevant details and metadata entries in the form. +Importantly, **don't** upload any of your data files - this will be done by DataLad once we ``datalad push`` later. + +.. image:: ./_static/tutorial/dv_add_dataset_2.png + +Once you have clicked ``Save Dataset``, you'll have a draft :term:`Dataverse dataset`. +It already has a DOI, and you can find it under the ``Metadata`` tab as "Persistent identifier": + +.. image:: ./_static/tutorial/dv_obtain_doi.png + +Finally, make a note of the **URL** of your dataverse instance (e.g., ``https://demo.dataverse.org``), and the **DOI** of your draft dataset. +You will need this information for :ref:`step 3 <3>`. + +.. _2: + +2. Create a DataLad dataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Next, you'll need a :term:`DataLad dataset` to push to your :term:`Dataverse dataset`. +If you already have one, skip this step. +If not, use ``datalad create `` to create a new dataset to populate, or transform an existing directory into a DataLad dataset using + +.. 
code-block:: bash + + $ datalad create -d --force + +In both cases, any files you add into the dataset can be saved using ``datalad save``. +If you have never done this before, it's a good idea to give the first pages of the `DataLad handbook `__ a quick read. + +Here's a toy example dataset with a single saved file: + +.. code-block:: bash + + $ datalad create my-test-dataset + create(ok): /tmp/my-test-dataset (dataset) + $ cd my-test-dataset + $ echo 12345 > my-file + $ datalad save -m "Saving my first file" + add(ok): my-file (file) + save(ok): . (dataset) + action summary: + add (ok: 1) + save (ok: 1) + +.. _3: + +3. Add a Dataverse sibling to your dataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Now that you have a draft :term:`Dataverse dataset` on Dataverse and a local :term:`DataLad dataset`, let them get to know each other using the ``datalad add-sibling-dataverse`` command. +This command registers the Dataverse dataset as a known remote location for your DataLad dataset and will allow you to publish the entire dataset (Git history and annexed data) or parts of it to Dataverse. + +When you run this command for the first time, an interactive prompt will ask you for an API token to authenticate against the chosen Dataverse instance. +This is how this would look: + +.. code-block:: bash + + $ datalad add-sibling-dataverse https://demo.dataverse.org doi:10.70122/FK2/NQPP6A + A dataverse API token is required for access. + Find it at https://demo.dataverse.org by clicking on your name at the top right corner and then clicking on API Token + token: + +Following the instructions in the prompt, you'll find this token under your user account on your Dataverse instance, and you can copy-paste it into the command line: + +.. image:: ./_static/tutorial/dv_token.png + +If authentication with the token was successful, it will be saved into your system's keyring. 
+If you have accounts on several different dataverse instances or multiple users with different tokens, you can use and store several tokens with the ``--credential`` parameter of the command. +For example, ``datalad add-sibling-dataverse https://demo.dataverse.org doi:10.70122/FK2/NQPP6A --credential demo-dataverse`` will search for a previously used credential ``demo-dataverse``, or prompt for a token if it can't find one and save it after success. + +The ``datalad add-sibling-dataverse`` command needs at least two pieces of information: The **URL** of your Dataverse instance, and a **persistent identifier** of the draft :term:`Dataverse dataset` created in :ref:`step 1 <1>`. +Depending on what you want to transfer to Dataverse, you also need to configure the command with the correct ``--mode``. +Two popular choices are ``annex`` and ``filetree``. +The former, which is also the default, will prepare the Dataverse dataset to contain both the Git revision history of your dataset as well as its annexed contents (if your Dataverse instance supports this, and your data doesn't exceed file size limits). +The latter will publish a single snapshot of your dataset ("as it currently is", without version history). +Let's illustrate the differences in detail: + +annex mode +********** + +``--mode annex`` is the command's default, and will be used when you don't explicitly provide the ``--mode`` parameter. +It will create a non-human readable representation of your :term:`DataLad dataset` on Dataverse that includes Git history and annexed data: + +.. code-block:: bash + + $ datalad add-sibling-dataverse \ + https://demo.dataverse.org \ + doi:10.70122/FK2/NQPP6A \ + --mode annex + add_sibling_dataverse.storage(ok): . [dataverse-storage: https://demo.dataverse.org (DOI: doi:10.70122/FK2/NQPP6A)] + [INFO ] Configure additional publication dependency on "dataverse-storage" + add_sibling_dataverse(ok): . 
[dataverse: datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/NQPP6A (DOI: doi:10.70122/FK2/NQPP6A)] + +As soon as you've created the sibling, you can push: + +.. code-block:: bash + + $ datalad push --to dataverse + copy(ok): my-file (file) [to dataverse-storage...] + publish(ok): . (dataset) [refs/heads/master->dataverse:refs/heads/master [new branch]] + publish(ok): . (dataset) [refs/heads/git-annex->dataverse:refs/heads/git-annex [new branch]] + + action summary: + copy (ok: 1) + publish (ok: 2) + + +And this is the result on Dataverse: + +.. image:: ./_static/tutorial/dv_dataset_annex.png + +filetree mode +************* + +``--mode filetree`` is an export mode, i.e., it will mirror a snapshot of the current state of your :term:`DataLad dataset` to Dataverse. +This is more human readable on Dataverse, but wouldn't include historical versions of your annexed files. +The Git history of your dataset is included in this mode, too. + +.. code-block:: bash + + $ datalad add-sibling-dataverse \ + https://demo.dataverse.org \ + doi:10.70122/FK2/ZS0YL3 \ + --mode filetree + add_sibling_dataverse.storage(ok): . [dataverse-storage: https://demo.dataverse.org (DOI: doi:10.70122/FK2/ZS0YL3)] + [INFO ] Configure additional publication dependency on "dataverse-storage" + add_sibling_dataverse(ok): . [dataverse: datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=yes&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/ZS0YL3 (DOI: doi:10.70122/FK2/ZS0YL3)] + +Now, you can push: + +.. code-block:: bash + + $ datalad push --to dataverse + copy(ok): .datalad/.gitattributes (dataset) + copy(ok): .datalad/config (dataset) + copy(ok): .gitattributes (dataset) + copy(ok): my-file (dataset) + publish(ok): . (dataset) [refs/heads/master->dataverse:refs/heads/master [new branch]] + publish(ok): . 
(dataset) [refs/heads/git-annex->dataverse:refs/heads/git-annex [new branch]] + action summary: + copy (ok: 4) + publish (ok: 2) + + +And this is the result on Dataverse: + +.. image:: ./_static/tutorial/dv_dataset_filetree.png + +Note that Dataverse has a number of file name requirements that restrict which characters can be used in file or directory names. +DataLad works around this by encoding file or directory names to comply with the allowed character set. +Therefore, your :term:`Dataverse dataset` might display files with slightly different names from what your local :term:`DataLad dataset` displays. +These names will be restored to their original form when the dataset is cloned, though. + +.. _4: + +4. Make your dataset public +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Your dataset on Dataverse will be in draft mode after you've pushed content into it. +Use the web interface to make it public and share it. + +.. image:: ./_static/tutorial/dv_publish_ds.png + +.. _5: + +5. Clone a dataset from Dataverse +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Finally, you or others can clone your datasets from Dataverse. +They'll need a special type of URL and the ``datalad clone`` command for this. + +The URL required for cloning starts with ``datalad-annex::?`` and is provided to you by the ``datalad add-sibling-dataverse`` command. +Alternatively, you can also copy-paste it from the configuration of your remotes: + +.. code-block:: bash + + $ git remote -v + dataverse datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=yes&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/ZS0YL3 (fetch) + dataverse datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=yes&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/ZS0YL3 (push) + +Once you have this URL, anyone with an account on the Dataverse instance and the correct permissions for the dataset can clone it: + +.. 
code-block:: bash + + $ datalad clone \ + 'datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/NQPP6A' \ + my-clone + [INFO ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore + [INFO ] access to 1 dataset sibling dataverse-storage not auto-enabled, enable with: + | datalad siblings -d "/tmp/my-clone" enable -s dataverse-storage + install(ok): /tmp/tmp/my-clone-of-annex-mode (dataset) + +Afterwards, enable the special remote in the clone with the provided command, and retrieve file content using ``datalad get``: + +.. code-block:: bash + + $ cd my-clone + $ datalad siblings -d "/tmp/my-clone" enable -s dataverse-storage + .: dataverse-storage(?) [git] + $ datalad get my-file + get(ok): my-file (file) [from dataverse-storage...]
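The ``datalad-annex::`` clone URLs shown in this tutorial appear to follow a fixed pattern: the same query parameters each time, an ``exporttree`` value of ``no`` (annex mode) or ``yes`` (filetree mode), the percent-encoded URL of the Dataverse instance, and the DOI. The following is a minimal Python sketch of that pattern, inferred only from the example output in this tutorial. The function ``dataverse_clone_url`` and its parameters are hypothetical and not part of ``datalad-dataverse``; the URL reported by ``add-sibling-dataverse`` or ``git remote -v`` remains the authoritative source.

```python
from urllib.parse import quote


def dataverse_clone_url(instance_url: str, doi: str, filetree: bool = False) -> str:
    """Assemble a clone URL in the shape shown in the tutorial output.

    Hypothetical helper for illustration only; not a datalad-dataverse API.
    """
    exporttree = "yes" if filetree else "no"
    # quote() keeps "/" but percent-encodes ":", which matches the
    # "https%3A//" form seen in the example URLs; the DOI is passed as-is.
    return (
        "datalad-annex::?type=external&externaltype=dataverse"
        f"&encryption=none&exporttree={exporttree}"
        f"&url={quote(instance_url)}&doi={doi}"
    )


# Example values taken from the annex-mode sibling in this tutorial.
print(dataverse_clone_url("https://demo.dataverse.org", "doi:10.70122/FK2/NQPP6A"))
```

The resulting string can be passed, in single quotes, to ``datalad clone`` in the same way as the URL in step 5.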