At the moment, I am trying out datalad-crawler for a single subject.
At first I used "HCP" as the prefix (for HCP_500), ran `datalad crawl`, and saved.
After that I changed the prefix value in crawl.cfg to HCP_900, ran `datalad crawl` again, and it worked.
But when I now change the prefix to HCP_1200, `datalad crawl` fails with an error.
(Also, when I switch back and forth between 900 and 1200 and run `datalad crawl` again, the error follows the switch.)

crawl.cfg:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master
❱ cat .datalad/crawl/crawl.cfg
[crawl:pipeline]
template = simple_s3
_prefix = HCP_1200/123420/
_bucket = hcp-openaccess
_to_http = False
_skip_problematic = False
```

`datalad --dbg crawl` for HCP_900:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master
❱ datalad --dbg crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline for the hcp-openaccess bucket
[INFO ] Running pipeline [<datalad_crawler.nodes.s3.crawl_s3 object at 0x7f97265ee8d0>, switch(default=None, key='datalad_action', mapping=<<{'commit': >, re=False)]
[INFO ] S3 session: Connecting to the bucket hcp-openaccess with authentication
[INFO ] Finished running pipeline: skipped: 16446
[INFO ] Total stats: skipped: 16446, Datasets crawled: 1
Exception ignored in: <function AnnexRepo.__del__ at 0x7f9728424510>
```

`datalad --dbg crawl` for HCP_1200:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master
❱ datalad --dbg crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline for the hcp-openaccess bucket
[INFO ] Running pipeline [<datalad_crawler.nodes.s3.crawl_s3 object at 0x7f03297a1a58>, switch(default=None, key='datalad_action', mapping=<<{'commit': >, re=False)]
[INFO ] S3 session: Connecting to the bucket hcp-openaccess with authentication
Traceback (most recent call last):
  File "/home/tkadelka/env/datalad/bin/datalad", line 8, in <module>
    main()
  File "/home/tkadelka/env/datalad/datalad/datalad/cmdline/main.py", line 500, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/tkadelka/env/datalad/datalad/datalad/interface/base.py", line 643, in call_from_parser
    ret = cls.__call__(**kwargs)
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/crawl.py", line 130, in __call__
    output = run_pipeline(pipeline, stats=stats)
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 114, in run_pipeline
    output = list(xrun_pipeline(*args, **kwargs))
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 194, in xrun_pipeline
    for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)):
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 270, in xrun_pipeline_steps
    for data_ in data_in_to_loop:
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/nodes/s3.py", line 187, in __call__
    versions_sorted = versions_sorted[start:]
UnboundLocalError: local variable 'start' referenced before assignment
```
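For what it's worth, an `UnboundLocalError` like this usually means the variable is assigned only inside a branch that never fired. Below is a minimal sketch of the suspected failure mode, not the actual s3.py code; only `versions_sorted`, `start`, and the exception come from the traceback, while the function and field names are hypothetical:

```python
# Hypothetical reconstruction of the pattern behind the traceback above;
# the real datalad_crawler/nodes/s3.py code may differ in detail.

def drop_already_crawled(versions_sorted, last_version_id):
    """Skip every version up to and including the one crawled last time."""
    for i, version in enumerate(versions_sorted):
        if version["version_id"] == last_version_id:
            start = i + 1  # `start` is bound only when a match is found
            break
    # If `last_version_id` was recorded while crawling a *different* prefix
    # (HCP_900 vs HCP_1200), no entry matches, the loop falls through, and
    # the slice below raises:
    #   UnboundLocalError: local variable 'start' referenced before assignment
    return versions_sorted[start:]


def drop_already_crawled_guarded(versions_sorted, last_version_id):
    """Same logic with a default, so an unmatched version-id cannot crash."""
    start = 0  # fall back to re-crawling from the beginning of the listing
    for i, version in enumerate(versions_sorted):
        if version["version_id"] == last_version_id:
            start = i + 1
            break
    return versions_sorted[start:]


versions = [{"version_id": "v1"}, {"version_id": "v2"}]
print(drop_already_crawled_guarded(versions, "v1"))  # [{'version_id': 'v2'}]
# drop_already_crawled(versions, "id-from-other-prefix")  # UnboundLocalError
```

If that reading is right, it also explains why the error follows the prefix switch: the versioning state saved under .datalad/crawl for one prefix points at a version-id that does not appear in the other prefix's listing.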
TobiasKadelka changed the title from "datalad crawl: Changing behaviour between HCP900/1200?" to "datalad crawl: Changing behaviour between HCP900/1200" on Jul 16, 2019.
Script it and try again, while also running `git rm -rf .datalad/crawl/versions && git commit -m "killing the version history"` between switches. That would be the right thing to do, but it might lead to some other issues. Otherwise you might miss some files: e.g., if there were changes to HCP/ AFTER the initial change to HCP_900 for that subject, then your crawl of HCP_900 would pick up only from the date when the changes to HCP/ happened, and thus might completely miss files added/changed under HCP_900 before that date (that is why I was thinking about doing it all via branches).
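A rough sketch of what "script it" could look like, assuming the crawl.cfg layout shown above. Only `datalad crawl` and the `git rm -rf .datalad/crawl/versions && git commit` step come from this thread; the prefix list, commit message, and the config-rewriting helper are illustrative:

```python
# Hypothetical sketch: for each HCP release, point crawl.cfg at the new
# prefix, kill the crawler's version history (so the next crawl does not try
# to resume from a version-id recorded for the old prefix), commit, re-crawl.
import re
import subprocess

CFG = ".datalad/crawl/crawl.cfg"
PREFIXES = ["HCP/123420/", "HCP_900/123420/", "HCP_1200/123420/"]


def run(*cmd):
    subprocess.run(cmd, check=True)


for prefix in PREFIXES:
    # Rewrite the _prefix line of the simple_s3 pipeline config.
    with open(CFG) as f:
        cfg = f.read()
    cfg = re.sub(r"(?m)^_prefix = .*$", "_prefix = " + prefix, cfg)
    with open(CFG, "w") as f:
        f.write(cfg)
    run("git", "add", CFG)
    # "killing the version history" between switches, per the comment above;
    # --ignore-unmatch keeps git rm from failing when the file is absent.
    run("git", "rm", "-rf", "--ignore-unmatch", ".datalad/crawl/versions")
    # Assumes the prefix actually changed, so there is something to commit.
    run("git", "commit", "-m", "Switch crawl prefix to " + prefix)
    run("datalad", "crawl")
```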