Crawling of stanford dataspace, and simple indexes #11
Conversation
Some failures are due to a bug somewhere in twisted or scrapy, leading to …
Codecov Report

@@            Coverage Diff             @@
##           master      #11       +/-   ##
===========================================
- Coverage   86.44%   70.71%   -15.73%
===========================================
  Files          51       51
  Lines        4130     4180      +50
===========================================
- Hits         3570     2956     -614
- Misses        560     1224     +664

Continue to review full report at Codecov.
* origin/master:
  TST: Mark test_simple1 as a known V6 failure
  TST: travis: Add V6 run
  TST: Drop stale known_failure_v6's
  RF: rename simple_with_stanford_lib.py to stanford_lib.py
  BF: crcns - use new datacite interface
  BF(workaround): adjust for absent pruning commits due to --incremental
  BF: need to use "incremental=True" now for aggregate_metadata
  BF: use legacy.openfmri.org
An elderly effort. IIRC it was working, but the datasets of interest were broken (broken tarballs, IIRC) anyway. With no immediate need it was abandoned, so let's let it RIP.
Changes from NF: crawl Stanford digital repository datalad#2241 to support crawling of the stanford dataspace (to close stanford digital repository datasets.datalad.org#16) were done separately in ENH: crawl stanford lib initial crawler #17.

The target file path could be determined (see the sketch below):
- from the target file url relative to the initial url (e.g. all the files are on the same website),
- from the path to the page (relative to the initial url?) which contains the link to the target file (e.g. when we have a website which points to external components or to some generic "keystore")
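A minimal sketch of those two resolution modes using Python's urllib.parse.urljoin; the URLs and variable names are hypothetical and this is not the actual datalad-crawler code:

```python
from urllib.parse import urljoin

# Hypothetical example URLs -- not taken from the actual crawler or datasets.
initial_url = "https://example.org/dataset/"             # url the crawl started from
page_url = "https://example.org/dataset/sub/page.html"   # index page that contains the link
href = "files/data.tar.gz"                               # relative link found on that page

# 1) resolve relative to the initial url: all files are assumed to live on the
#    same website, laid out relative to the starting point
print(urljoin(initial_url, href))   # https://example.org/dataset/files/data.tar.gz

# 2) resolve relative to the page that contains the link: useful when pages
#    point to external components or to some generic "keystore"
print(urljoin(page_url, href))      # https://example.org/dataset/sub/files/data.tar.gz
```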