-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor existing caching mechanism into pluggable system #766
base: master
Are you sure you want to change the base?
Refactor existing caching mechanism into pluggable system #766
Conversation
Create an interface `Cache`, similiar to `storage.Storage` that will allow for pluggable caching extensions. Caching subsystem is created with full access to Request and Response objects, to potentialy make more complex caching decisions based on both request and response headers. Because of the need to access those objects, `Cache` is not created in its own package like `storage`, but in root `colly`. Closes gocolly#103
@asciimoo would you happen to have a second to review this? |
// Init initializes the caching backend | ||
Init() error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need Init()
as a part of the interface? This is not typical in Go. Can't we just assume that whateve Cache
implementation is passed to colly, it's already initialized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I've decided to go with Init
is consistency - it may not be idiomatic for Go, but it is used by colly
all over the place... Collector
, httpBackend
, etc. My other reason was the way I was thinking about backwards compatibility - i.e. passing SetCollector
via NewCollector
, but after some thinking, I may call NewFileSystemCache
just as well. If you'd like me to remove Init
just let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may call NewFileSystemCache just as well. If you'd like me to remove Init just let me know.
Yeah, that sounds much better. Please do.
// Close finalizes caching backend | ||
Close() error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected lifecycle of the Cache
implementation? Is it tightly coupled to the Collector
's one, or it can be shared between several collectors, and something else must close it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good question. The backwards-compatible FileSystemCache
is tightly coupled with Collector
, but the way to use other caches would be via Collector.SetCache
only. Unfortunately in current form, where it's .SetCache
who's calling cache.Init
it kind of defeats the purpose. I will remove .Init
from the .SetCache
and document that this method should pass an initialised object, and a caller is responsible for closing. For the backwards-compatibility scenario, it should not matter too much. Collector
does not have a .Close
method, and FileSystemCache
did not do any closing either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this a bit.
I believe if cache is set through the environment variable, collector's Close
should call the cache's Close
as well. If cache is initialized through SetCache
, then it should not be closed. Moreover, if there's a cache previously initialized through the environment variable when SetCache
is called, it should be closed before replacing it with the new one.
Do you agree?
if request.Method != "GET" || request.Header.Get("Cache-Control") == "no-cache" { | ||
return nil, ErrRequestNoCache | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this check has to be inside Cache
implementation rather than somewhere in its caller? I think cacheability of a HTTP request doesn't really depend on where you store its response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I put that inside a Cache is for each caching to make their own decisions - i.e. override Cache-Control
and cache everything if they want. It did not feel right to have a decision made in two places. OTOH, maybe it'd be good to refactor this to a method like Cache.Filter
or maybe to make a CacheFilter
class with a default implementation, such that other Cache
implementations do not have to reinvent the wheel if they just want to take a default behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, at least let's move this check into unexported function for the time being.
I certainly approve the efforts do decouple caching from the guts of colly. |
Create an interface
Cache
, similiar tostorage.Storage
that will allow for pluggable caching extensions. Caching subsystem is created with full access to Request and Response objects, to potentialy make more complex caching decisions based on both request and response headers. Because of the need to access those objects,Cache
is not created in its own package likestorage
, but in rootcolly
.Closes #103