[Question] Where should I set the content obtained from an HTTP request? #519
I am extending this module of Heritrix.
I read through the source of the FetchHTTP module, but I was unable to figure out where this method actually sets the content obtained from the request.
The above method is called when the HTTP request status is success, but I couldn't find any setters there for the content obtained from a URL (for example, an HTML page). How can I set the HTML content so that Heritrix can proceed to extract the links from it?
Replies: 1 comment
Assuming your content is supplied by an InputStream called stream, then something like this will probably work:

    Recorder recorder = curi.getRecorder();
    recorder.markContentBegin();
    recorder.inputWrap(stream);
    recorder.getRecordedInput().readFully();
    recorder.closeRecorders();

handleCapturedRequest() in ExtractorChrome may be a relevant example of integrating Heritrix with a headless browser. Keep in mind, though, that it records subrequests on a background thread and so has to jump through a lot more hoops. Since you're writing a Fetch processor, you don't have to set up your own recorder and can use the one already supplied by the ToeThread; similarly, you don't need to call the extractors yourself.
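In case it helps to see why there is no explicit "set content" call: as far as I understand it, the recorder captures whatever is read through the wrapped stream, so draining that stream is the step that stores the content. Below is a minimal, library-free sketch of that tee-style pattern. `TeeRecorderSketch` and all of its methods are hypothetical stand-ins written for illustration; this is not Heritrix's `Recorder`, which does considerably more (for example, tracking where the content begins).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for illustration only; NOT Heritrix's Recorder. */
final class TeeRecorderSketch {
    // Everything read through a wrapped stream is copied here.
    private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();

    /** Wraps the source so that each byte read from it is also copied into the buffer. */
    public InputStream inputWrap(InputStream source) {
        return new FilterInputStream(source) {
            @Override
            public int read() throws IOException {
                int b = super.read();
                if (b >= 0) {
                    recorded.write(b);
                }
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                int n = super.read(buf, off, len);
                if (n > 0) {
                    recorded.write(buf, off, n);
                }
                return n;
            }
        };
    }

    /** Drains the wrapped stream; the recording happens as a side effect of reading. */
    public static void readFully(InputStream wrapped) {
        byte[] buf = new byte[4096];
        try {
            while (wrapped.read(buf, 0, buf.length) != -1) {
                // Nothing to do here: the wrapper has already copied the bytes.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a copy of everything recorded so far. */
    public byte[] recordedBytes() {
        return recorded.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = "<html><a href=\"/next\">next</a></html>".getBytes(StandardCharsets.UTF_8);
        TeeRecorderSketch recorder = new TeeRecorderSketch();
        InputStream wrapped = recorder.inputWrap(new ByteArrayInputStream(page));
        readFully(wrapped);
        System.out.println("recorded " + recorder.recordedBytes().length + " bytes");
    }
}
```

Nothing is recorded until the wrapped stream is actually read, which is why the snippet in the reply calls `readFully()` after `inputWrap(stream)`. Later processors can then work from the recorded copy instead of the original stream.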