This repository contains scripts to acquire, clean and process the spending information released by the UK central government.
The scripts have several stages that need to be run in order:
- build_index - will find all related metadata (tagged: spend-transactions) on data.gov.uk
- retrieve - will then try to fetch all the files
- extract - will attempt to parse CSV/XLS/... and load it into a DB
- scan_columns - will do some initial processing for later stages
- map_columns - will outsource column name comprehension to the user
- condense - will try to establish a common column schema
- format - will try to munge numbers and dates
- suppliers - will query opencorporates.org for supplier name resolution
- export - will write a csv
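
As an illustration of the kind of munging the format stage has to do, here is a minimal sketch, not the code in this repository. The function names (`parse_amount`, `parse_date`) and the list of date layouts are assumptions made for the example; the real spending files may need other rules.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical set of date layouts; the published files may use others.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y")


def parse_amount(text):
    """Turn a string like '£1,234.50' or '(200.00)' into a Decimal, or None."""
    cleaned = text.strip().replace("£", "").replace(",", "").replace(" ", "")
    # Accounting style: a value in parentheses is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    return -value if negative else value


def parse_date(text):
    """Try each known layout in turn; return a date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None rather than raising lets a caller record unparseable cells and carry on with the rest of the file, which seems to suit data of this quality; a stricter stage could raise instead.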
To run some of the scripts, use nosetests (the scripts are tests).
Adding -v will give you the names of the individual stages, -x will
stop on the first error and --with-xunit will generate an XML log file.
These scripts are: build_index, retrieve, extract, condense and format.
The other scripts can simply be run directly.
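
To show why running a stage under nosetests gives one named result per unit of work, here is a small sketch of a stage written as a nose test generator. It is an assumption about how such a script could be structured, not a copy of the scripts here; `SOURCES`, `check_source` and `test_extract_all` are hypothetical names.

```python
# Hypothetical stage module; file names and checks are illustrative only.
SOURCES = ["dept-a-2011-01.csv", "dept-b-2011-01.xls"]

# Records which sources were handled, so the effect is visible.
processed = []


def check_source(name):
    # One unit of work. An exception or failed assertion here is
    # reported by nose as a failing test for this particular source.
    extension = name.rsplit(".", 1)[-1]
    assert extension in ("csv", "xls"), "unsupported format: %s" % name
    processed.append(name)


def test_extract_all():
    # nose treats a generator test as one test case per yielded tuple,
    # calling the first element with the remaining elements as arguments.
    for name in SOURCES:
        yield check_source, name
```

With a layout like this, -v lists each yielded call separately and -x stops at the first source that fails.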
?
- PDFs
- Zip files containing a bunch of CSVs