This repository has been archived by the owner on Jan 25, 2024. It is now read-only.

Weekly library data scraping #9

Open
mik3caprio opened this issue Apr 5, 2017 · 8 comments

Comments

@mik3caprio
Collaborator

Set up cron in dev for scraping content - Scripts should fire off WEEKLY on weekends

@mik3caprio
Collaborator Author

mik3caprio commented May 14, 2017

Now that the dev deployment is complete, we should make this the next thing to put in place. @pdelong42 would you like to take a crack at this? The requirements are basically to run 'python scrape.py' from crontab for each of the Elasticsearch indexes. I think the only modification required for each of the scraper scripts would be to check for an existing index first, remove it if it exists, and then run the rest of the script as normal (I can add that code to the existing scripts; you just need to set up the crontab).
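A minimal, stdlib-only sketch of that check-and-delete step, assuming the scripts can reach Elasticsearch's REST API over HTTP (the host, port, and index name below are illustrative, not the project's actual config):

```python
import urllib.error
import urllib.request


def delete_index_if_exists(base_url, index):
    """Delete an Elasticsearch index via its REST API.

    Returns True if the index existed and was deleted, False if it was
    already absent (Elasticsearch answers a DELETE on a missing index
    with HTTP 404).
    """
    req = urllib.request.Request("%s/%s" % (base_url, index), method="DELETE")
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


# Illustrative call; host and port are assumptions:
# delete_index_if_exists("http://localhost:9200", "dspace")
```

Dropping a call like this at the top of each scrape.py would make the scripts idempotent, so the crontab itself stays trivial.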

@pdelong42
Collaborator

@mik3caprio Sure, just give me the path to the scraping script, as well as the way it ought to be called, and I'll drop it into a crontab.

@mik3caprio
Collaborator Author

So there are four sets of two scripts, one set for each Library system. The path is /home/apiproject/API-Portal/scrape/, and the directories containing the Python scripts are dspace, omeka, sierra, and xeac. Each directory contains a scrape.py and a search.py. You would just need to run python scrape.py and python search.py for each Library system, and have them run weekly.

The only other thing in question is how we would delete the indexes from Elasticsearch before scraping and re-indexing. I'm assuming the cron would run another CLI command first to remove the Elasticsearch indices relating to the system. In other words:

[ES CLIs to delete dspace* indices]
python dspace/scrape.py
python dspace/search.py

And so on.
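Put together, a weekly crontab entry for one system might look like the fragment below. This is only a sketch: the Sunday 02:00 schedule, the Elasticsearch host and port, and the use of curl for the index deletion are all assumptions to be confirmed, not the team's settled choices.

```shell
# Hypothetical crontab for the apiproject user (edit with: crontab -e).
# Field order: minute hour day-of-month month day-of-week command.
# Runs Sundays at 02:00; host, port, and index pattern are assumptions.
0 2 * * 0 curl -s -X DELETE 'http://localhost:9200/dspace*' && cd /home/apiproject/API-Portal/scrape && python dspace/scrape.py && python dspace/search.py
```

Repeating the line with omeka, sierra, and xeac substituted would cover the other three Library systems, and leaving all four lines commented out at first would match the suggestion to set the cron up without turning it on yet.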

I think we could/should set this cron up but not turn it on just yet.

@mik3caprio
Collaborator Author

Hey @pdelong42 just confirm with me that you've got this set up and I'll close out this ticket.

@pdelong42
Collaborator

@mik3caprio, I tried running those scripts while logged in as the "apiproject" user, but they threw some errors about missing Python modules. Try it in dev to see what I mean.

Are these the same scripts that were used to populate the initial data set into Elasticsearch in the first place?

@pdelong42 pdelong42 reopened this Jul 5, 2017
@pdelong42
Collaborator

Sorry, I closed it by mistake. Wrong button, oops...

@mik3caprio
Collaborator Author

mik3caprio commented Jul 5, 2017 via email

@pdelong42
Collaborator

Okay, but let's install as many of these Python modules as possible from RPMs, whenever they're available, and only grab from pip as needed.

Let me know the names of the modules that are missing, and I'll make my best effort to find and install RPM packages of them from reputable sources.
