This repository is under construction.
Clone the repository:
$ git clone https://github.com/petuum/pirlib
$ cd pirlib
Install dependencies:
$ conda create -n pirlib python=3.8
$ conda activate pirlib
$ pip install -e .
A toy example is provided in examples/multi_backends/example.py. To run it, first install its dependencies:
$ pip install -r examples/multi_backends/requirements.txt
The example can be run in four different ways:
$ python examples/multi_backends/example.py
It should output the YAML representation of the example pipeline, followed by the outputs of the pipeline itself.
Open up examples/multi_backends/example.py and see what's inside.
$ bash examples/multi_backends/run_inproc.sh
This script will:
- Run the pircli command to serialize the pipeline into examples/multi_backends/package_inproc.yml.
- Run the pircli command to execute the pipeline locally, feeding in inputs from examples/multi_backends/inputs and saving its outputs to examples/multi_backends/outputs.
Open up examples/multi_backends/run_inproc.sh and examples/multi_backends/package_inproc.yml and see what's inside.
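If you prefer to inspect the serialized pipeline programmatically rather than in an editor, a minimal Python sketch follows; it assumes PyYAML is available in the environment (any YAML reader works) and simply prints the top-level structure of the package file:

import yaml  # assumption: PyYAML is installed in the pirlib environment

with open("examples/multi_backends/package_inproc.yml") as f:
    package = yaml.safe_load(f)

# The exact schema is defined by pirlib; this just lists the top-level keys.
print(list(package))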
The following steps require an existing Docker installation.
$ bash examples/multi_backends/run_docker.sh
This script will:
- Automatically dockerize the local environment and serialize the pipeline into examples/multi_backends/package_docker.yml.
- Generate a docker-compose workflow from the serialized pipeline and save it to examples/multi_backends/docker-compose.yml.
- Execute the generated docker-compose workflow.
Open up examples/multi_backends/run_docker.sh, examples/multi_backends/package_docker.yml, and examples/multi_backends/docker-compose.yml and see what's inside.
The following steps require existing installations of Docker, Kubernetes, and Argo.
In order for Argo to have access to the Docker images, a Docker registry needs to be configured. Currently the dockerize module uses Docker Hub as the registry and only supports public repositories. Follow these steps to configure Docker Hub:
$ docker login
$ export DOCKER_USER=<username>
$ export PIRLIB_REPO=<reponame>
Please ensure that the repository already exists under the username on Docker Hub.
Follow the instructions for accessing the Argo UI and navigate your browser to https://127.0.0.1:2746.
Finally, execute the example.
$ bash examples/multi_backends/run_argo.sh
You should be able to see the live execution of the different steps of the pipeline in the browser.
Open up examples/multi_backends/package_argo.yml and examples/multi_backends/argo-train.yml and see what's inside.
This example covers the usage of Forte to parse Wikipedia dumps as an Argo workflow.
- PIRlib's dockerize module is used to generate a computation graph representation of the various steps of the process, along with creating a Docker image that has all the necessary dependencies to run the example.
- PIRlib's argo backend converts the computation graph to an Argo Workflow YAML file.
- Finally, the workflow is executed by Argo.
$ conda create -n pirlib-wiki-parser python=3.8 && conda activate pirlib-wiki-parser
$ pip install "forte[wikipedia]"
To test the pipeline, sample data is provided in inputs/dbpedia_sample/ under this directory. To execute the example with this data, just invoke:
$ mkdir examples/wiki_parser/outputs
$ bash examples/wiki_parser/run_sample_pipeline.sh
You should be able to see the live execution of the different steps of the pipeline in the browser. The resultant files will be generated in the outputs/ directory.
If the previous step runs without any issues, you may now proceed to run the pipeline on the entirety of the data available. Follow the given steps:
$ bash data_dowload.sh
This script will take a while to execute as it downloads around 13GB of Wikipedia dumps and stores them under inputs/dbpedia_full. Proceed to the next steps once the downloads are complete.
If you have already run the example with the sample data, you can now directly execute
$ rm -rf examples/wiki_parser/outputs/*
$ bash examples/wiki_parser/run_full_pipeline.sh
If you are executing the workflow on the full data without first executing on the sample data, do the following:
$ mkdir examples/wiki_parser/outputs
$ bash examples/wiki_parser/run_full_pipeline.sh
The outputs will appear in the outputs/ directory, and the workflow execution can be viewed from the browser.
This example covers the usage of diskcache to implement caching for the outputs of Argo tasks (Python functions). Building on the previous example, the caching functionality is explained below.
- pirlib/cache.py: This file implements the cache logic using three functions (a sketch of this logic follows the list below):
  - cache_directory: Caches a given directory under the given key
  - fetch_directory: Retrieves a cached directory for a given key, if it exists
  - generate_cache_key: Creates a cache key from a given input file
- examples/caching/ml_pipeline.py: In the decorator for each function, the user needs to specify whether caching is enabled and the input file from which cache keys are to be generated.
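For illustration, here is a rough sketch of how such cache logic could be written with diskcache; the names match the functions listed above, but the bodies are assumptions rather than the actual contents of pirlib/cache.py:

import hashlib
import shutil
from pathlib import Path

import diskcache

cache = diskcache.Cache("cache_dir")  # assumption: "cache_dir" as the on-disk cache location

def generate_cache_key(input_file):
    # Create a cache key by hashing the contents of the given input file.
    return hashlib.sha256(Path(input_file).read_bytes()).hexdigest()

def cache_directory(directory, key):
    # Cache a given directory under the given key by storing it as an archive.
    archive = shutil.make_archive(str(Path("/tmp") / key), "gztar", root_dir=directory)
    cache[key] = archive

def fetch_directory(key, destination):
    # Retrieve a cached directory for the given key, if it exists.
    if key not in cache:
        return False
    shutil.unpack_archive(cache[key], destination)
    return True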
Update the same pirlib conda environment used in earlier examples:
$ conda activate pirlib
$ pip install diskcache
In order to test the pipeline, dummy data has been provided in the examples/caching/dataset directory.
$ bash examples/caching/run_argo.sh
You should be able to see the live execution of the different steps of the pipeline in the browser. The resultant files will be generated in the outputs/ directory. The output files will be cached in cache_dir.
The same command can be invoked again to see the difference in the duration of each step.
This example shows how to use the timer feature to record how long each task (Python function) takes. If the timer feature is turned on, the wall-clock time and process time will be printed to your console. This feature is off by default. Please find the details in the example file below.
- examples/caching/ml_pipeline.py: In the decorator for each function, the user needs to specify whether the timer is enabled. To turn it on, add a decorator like @task(timer=True) (see the sketch below).
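As a purely illustrative sketch of what timer=True records (not pirlib's actual implementation), a decorator that measures both wall-clock time and process time might look like this:

import functools
import time

def timed(func):
    # Wrap a function and report both wall-clock and process (CPU) time.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        result = func(*args, **kwargs)
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        print(f"{func.__name__}: wall-clock {wall:.3f}s, process {cpu:.3f}s")
        return result
    return wrapper

@timed
def train_model():
    time.sleep(0.1)  # placeholder for real work

train_model()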
Update the same pirlib conda environment used in earlier examples:
$ conda activate pirlib
$ pip install diskcache
Note: Because the timer feature uses the same example file as the cache feature, you also need to install the diskcache module in your pirlib environment.
In order to test the pipeline, dummy data has been provided in the examples/caching/dataset directory.
$ bash examples/caching/run_argo.sh
You should be able to see the live execution of the different steps of the pipeline in the browser. The resultant files will be generated in the outputs/ directory, and you will see the wall-clock time and process time in your log file or console.
- More comprehensive error checking and reporting.
- More pluggable system for input readers and output writers.
- Better thought out config file handling.
- Docker serve backend.
- Supporting factory functions that produce handlers dynamically.
- More comments and any unit tests at all.
- Packaging as a pip-installable package and registering to PyPI.