At the moment, I am trying out datalad-crawler for a single subject.
At first I used "HCP" as the prefix (for HCP_500), ran `datalad crawl`, and saved.
After that I changed the prefix value in crawl.cfg to HCP_900, ran `datalad crawl` again, and it worked.
But when I now change the prefix to HCP_1200, `datalad crawl` fails with an error.
(Also, when I switch back and forth between 900 and 1200 and run `datalad crawl` again, the error follows the switch.)

crawl.cfg:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master
❱ cat .datalad/crawl/crawl.cfg
[crawl:pipeline]
template = simple_s3
_prefix = HCP_1200/123420/
_bucket = hcp-openaccess
_to_http = False
_skip_problematic = False
```

`datalad --dbg crawl` for HCP_900:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master
❱ datalad --dbg crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline for the hcp-openaccess bucket
[INFO ] Running pipeline [<datalad_crawler.nodes.s3.crawl_s3 object at 0x7f97265ee8d0>, switch(default=None, key='datalad_action', mapping=<<{'commit': >, re=False)]
[INFO ] S3 session: Connecting to the bucket hcp-openaccess with authentication
[INFO ] Finished running pipeline: skipped: 16446
[INFO ] Total stats: skipped: 16446, Datasets crawled: 1
Exception ignored in: <function AnnexRepo.__del__ at 0x7f9728424510>
```

`datalad --dbg crawl` for HCP_1200:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master
❱ datalad --dbg crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline for the hcp-openaccess bucket
[INFO ] Running pipeline [<datalad_crawler.nodes.s3.crawl_s3 object at 0x7f03297a1a58>, switch(default=None, key='datalad_action', mapping=<<{'commit': >, re=False)]
[INFO ] S3 session: Connecting to the bucket hcp-openaccess with authentication
Traceback (most recent call last):
  File "/home/tkadelka/env/datalad/bin/datalad", line 8, in <module>
    main()
  File "/home/tkadelka/env/datalad/datalad/datalad/cmdline/main.py", line 500, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/tkadelka/env/datalad/datalad/datalad/interface/base.py", line 643, in call_from_parser
    ret = cls.__call__(**kwargs)
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/crawl.py", line 130, in __call__
    output = run_pipeline(pipeline, stats=stats)
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 114, in run_pipeline
    output = list(xrun_pipeline(*args, **kwargs))
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 194, in xrun_pipeline
    for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)):
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 270, in xrun_pipeline_steps
    for data_ in data_in_to_loop:
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/nodes/s3.py", line 187, in __call__
    versions_sorted = versions_sorted[start:]
UnboundLocalError: local variable 'start' referenced before assignment
```
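For what it's worth, an `UnboundLocalError` like this usually means the variable is assigned only inside a branch that never fired. Below is a minimal sketch of the suspected failure mode, not the actual s3.py code; only `versions_sorted`, `start`, and the exception come from the traceback, while the function and field names are hypothetical:

```python
# Hypothetical reconstruction of the pattern behind the traceback above;
# the real datalad_crawler/nodes/s3.py code may differ in detail.

def drop_already_crawled(versions_sorted, last_version_id):
    """Skip every version up to and including the one crawled last time."""
    for i, version in enumerate(versions_sorted):
        if version["version_id"] == last_version_id:
            start = i + 1  # `start` is bound only when a match is found
            break
    # If `last_version_id` was recorded while crawling a *different* prefix
    # (HCP_900 vs HCP_1200), no entry matches, the loop falls through, and
    # the slice below raises:
    #   UnboundLocalError: local variable 'start' referenced before assignment
    return versions_sorted[start:]


def drop_already_crawled_guarded(versions_sorted, last_version_id):
    """Same logic with a default, so an unmatched version-id cannot crash."""
    start = 0  # fall back to re-crawling from the beginning of the listing
    for i, version in enumerate(versions_sorted):
        if version["version_id"] == last_version_id:
            start = i + 1
            break
    return versions_sorted[start:]


versions = [{"version_id": "v1"}, {"version_id": "v2"}]
print(drop_already_crawled_guarded(versions, "v1"))  # [{'version_id': 'v2'}]
# drop_already_crawled(versions, "id-from-other-prefix")  # UnboundLocalError
```

If that reading is right, it also explains why the error follows the prefix switch: the versioning state saved under .datalad/crawl for one prefix points at a version-id that does not appear in the other prefix's listing.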
TobiasKadelka changed the title from "datalad crawl: Changing behaviour between HCP900/1200?" to "datalad crawl: Changing behaviour between HCP900/1200" on Jul 16, 2019.
Script it and try again, while also running `git rm -rf .datalad/crawl/versions && git commit -m "killing the version history"` between switches. That would be the right thing to do, but it might lead to some other issues. Otherwise you might miss some files: e.g., if there were changes to HCP/ AFTER the initial change to HCP_900 for that subject, then your crawl of HCP_900 would pick up only from the date when the changes to HCP/ happened, and thus might completely miss files added/changed under HCP_900 before that date (that is why I was thinking about doing it all via branches).
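A rough sketch of what "script it" could look like, assuming the crawl.cfg layout shown above. Only `datalad crawl` and the `git rm -rf .datalad/crawl/versions && git commit` step come from this thread; the prefix list, commit message, and the config-rewriting helper are illustrative:

```python
# Hypothetical sketch: for each HCP release, point crawl.cfg at the new
# prefix, kill the crawler's version history (so the next crawl does not try
# to resume from a version-id recorded for the old prefix), commit, re-crawl.
import re
import subprocess

CFG = ".datalad/crawl/crawl.cfg"
PREFIXES = ["HCP/123420/", "HCP_900/123420/", "HCP_1200/123420/"]


def run(*cmd):
    subprocess.run(cmd, check=True)


for prefix in PREFIXES:
    # Rewrite the _prefix line of the simple_s3 pipeline config.
    with open(CFG) as f:
        cfg = f.read()
    cfg = re.sub(r"(?m)^_prefix = .*$", "_prefix = " + prefix, cfg)
    with open(CFG, "w") as f:
        f.write(cfg)
    run("git", "add", CFG)
    # "killing the version history" between switches, per the comment above;
    # --ignore-unmatch keeps git rm from failing when the file is absent.
    run("git", "rm", "-rf", "--ignore-unmatch", ".datalad/crawl/versions")
    # Assumes the prefix actually changed, so there is something to commit.
    run("git", "commit", "-m", "Switch crawl prefix to " + prefix)
    run("datalad", "crawl")
```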