[Question] Where should i set the content obtained from http request ? #519

naveen17797 · 2021-09-23T10:14:59Z

I am extending the Heritrix module org.archive.modules.fetcher.FetchHTTP and overriding its innerProcess method so that a headless browser fetches the content instead of Heritrix's built-in HTTP request.


    @Override
    protected void innerProcess(CrawlURI curi) throws InterruptedException { }

I read through the source of the FetchHTTP module but could not figure out where this method actually stores the content obtained from the request.

    protected void addResponseContent(HttpResponse response, CrawlURI curi) {
        curi.setFetchStatus(response.getStatusLine().getStatusCode());
        Header ct = response.getLastHeader("content-type");
        curi.setContentType(ct == null ? null : ct.getValue());
        
        for (Header h: response.getAllHeaders()) {
            curi.putHttpResponseHeader(h.getName(), h.getValue());
        }
    }

The method above is called when the HTTP request succeeds, but I couldn't find any setter in it for the content fetched from a URL (for example, an HTML page).

How can I set the HTML content so that Heritrix can go on to extract the links from it?

Answered by ato

ato (Maintainer) · 2021-09-23T10:48:23Z

Assuming your content is supplied by an InputStream called stream, something like this will probably work:

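    // Use the Recorder already attached to this CrawlURI.
    Recorder recorder = curi.getRecorder();
    // Mark where the content (message body) begins.
    recorder.markContentBegin();
    // Wrap the stream so that bytes read through it are recorded.
    recorder.inputWrap(stream);
    // Read the stream to the end so the whole body is captured.
    recorder.getRecordedInput().readFully();
    recorder.closeRecorders();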
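
The snippet above assumes an InputStream called stream, while a headless browser will often hand back the rendered page as a String. A minimal sketch of bridging the two with the standard library is below. RenderedHtml and toStream are hypothetical names chosen for illustration, not part of Heritrix, and the choice of UTF-8 is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a Heritrix class.
final class RenderedHtml {
    private RenderedHtml() {}

    // Encodes the rendered page as UTF-8 (an assumption) and exposes the
    // resulting bytes as an InputStream.
    static InputStream toStream(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }
}
```

If you encode the page this way, the content type you set on the CrawlURI should probably declare the same charset so that later processing decodes the bytes consistently.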

handleCapturedRequest() in ExtractorChrome may be a relevant example of integrating Heritrix with a headless browser. Keep in mind, though, that it records subrequests on a background thread and so has to jump through a lot more hoops. Since you're writing a Fetch processor, you don't have to set up your own recorder and can use the one already supplied by the ToeThread; similarly, you don't need to call the extractors yourself.
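
To illustrate why the wrap step in the earlier snippet is followed by readFully(), here is a stand-alone sketch of the general "record while reading" pattern. CapturingInputStream is a hypothetical stand-in written only with the standard library; it is not Heritrix's Recorder or its recording stream, and it keeps everything in memory, which a real recorder need not do. The point it shows is that bytes are captured only as they are read through the wrapper, so the stream has to be drained for the whole body to be available afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative stand-in only; NOT a Heritrix class.
final class CapturingInputStream extends FilterInputStream {
    // In-memory copy of every byte that has been read through this wrapper.
    private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();

    CapturingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            recorded.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            recorded.write(buf, off, n);
        }
        return n;
    }

    // Drains the wrapped stream so that every remaining byte is recorded.
    void readFully() throws IOException {
        byte[] buf = new byte[4096];
        while (read(buf, 0, buf.length) != -1) {
            // Nothing to do here; reading is what triggers the recording.
        }
    }

    // Returns a copy of the bytes recorded so far.
    byte[] recordedBytes() {
        return recorded.toByteArray();
    }
}
```

Before readFully() is called, recordedBytes() in this sketch is empty; afterwards it holds the full body.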
