Failed to use TunedModel with precomputed-SVM #1141

Closed
KeishiS opened this issue Sep 13, 2024 · 2 comments

KeishiS commented Sep 13, 2024

First of all, thank you for the great work you're doing in maintaining this project. I encountered what seems to be a bug when attempting to use a support vector classifier with a precomputed Gram matrix while performing hyperparameter tuning with TunedModel. I would like to submit a pull request to address the issue, but I'm unsure which part of the codebase needs modification. Any advice would be greatly appreciated.

Describe the bug
When performing parameter search with TunedModel on an SVM with a precomputed kernel, the data splitting is not carried out properly.

To Reproduce

#%%
using MLJ, MLJBase
using MLJScikitLearnInterface
using LinearAlgebra
SVMClassifier = @load SVMClassifier pkg = MLJScikitLearnInterface

#%% Create toy data
using Random, Distributions
θ₀ = rand(Uniform(0, 2π), 100)
X₀ = 0.5 .* [cos.(θ₀) sin.(θ₀)] .+ (randn(100, 2) .* 0.12)
y₀ = zeros(Int, 100)

θ₁ = rand(Uniform(0, 2π), 100)
X₁ = [cos.(θ₁) sin.(θ₁)] .+ (randn(100, 2) .* 0.12)
y₁ = ones(Int, 100)

n = 200
X = vcat(X₀, X₁)
y = MLJBase.categorical(vcat(y₀, y₁))
gmat = [
    exp(-norm(X[i, :] - X[j, :]) * 0.1)
    for i in 1:n, j in 1:n
]

#%%
model = SVMClassifier(kernel="precomputed")
tuning_model = TunedModel(
    model=model,
    range=range(model, :C; lower=0.01, upper=1000, scale=:log),
    measure=accuracy
)
mach = machine(tuning_model, gmat, y)
fit!(mach)

Expected behavior

During the search for the best parameters, the Gram matrix gmat is split into training and test data. We expect gmat[train_idx, train_idx] and gmat[test_idx, train_idx] to be created. However, the current code splits it into gmat[train_idx, :] and gmat[test_idx, :]. This splitting is carried out in the fit_and_extract_on_fold function in MLJBase.jl/src/resampling.jl.
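To illustrate with hypothetical fold indices (train_idx and test_idx stand for whatever indices the resampling strategy produces):

#%% illustrative slicing for a single fold
train_idx, test_idx = 1:150, 151:200

gmat[train_idx, train_idx]  # expected: square train-vs-train block for fitting
gmat[test_idx, train_idx]   # expected: test-vs-train block for evaluation

gmat[train_idx, :]          # what the current code actually passes to fit
gmat[test_idx, :]           # what the current code actually passes to predict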

Versions

  • julia 1.10.5
  • MLJ v0.20.0
  • MLJBase v1.7.0
  • MLJScikitLearnInterface v0.7.0

I would be grateful for any advice on how to approach solving this issue. Thank you for taking the time to read and consider this matter!

ablaom commented Sep 16, 2024

Thanks @KeishiS for the positive feedback and for posting.

I'm afraid that when MLJTuning (or evaluate!) resamples, it has no way of knowing it is supposed to also apply the resampling to some hyperparameter.

It looks like you may have better luck with the LIBSVM version of the model (for which an MLJ interface is also provided). In that case you can pass a kernel function rather than an explicit matrix, which won't suffer from this issue, right? Would this suit your purpose?
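Something like the following untested sketch, reusing X and y from your script (I'm assuming here that SVC's kernel hyperparameter accepts any callable and that its regularization parameter is named cost):

#%% sketch: LIBSVM-based alternative with a kernel function
SVC = @load SVC pkg = LIBSVM

# callable kernel mirroring the Gram matrix computation above
k(x1, x2) = exp(-norm(x1 - x2) * 0.1)

model = SVC(kernel = k)
tuning_model = TunedModel(
    model = model,
    range = range(model, :cost; lower = 0.01, upper = 1000, scale = :log),
    measure = accuracy
)
mach = machine(tuning_model, MLJ.table(X), y)  # SVC expects a table, not a matrix
fit!(mach)

Because the kernel is evaluated internally from the raw observations, the usual row-resampling is then correct and no Gram matrix needs to be split.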


For the record, it is theoretically possible to fix this in the sk-learn interface. The proper interface point for "metadata" that needs to be resampled is to pass it along with the data. So, a corrected workflow would look something like:

mach = machine(SVC(), X, y, kernel)  
evaluate!(mach, resampling=...)

Implementing this would also require adding a "data front end" to the MLJ interface, to articulate exactly how the resampling is to be done, because the default resampling of arrays (just resampling the rows) doesn't work in this case.
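The relevant hooks are MLJModelInterface.reformat and MLJModelInterface.selectrows. A rough, hypothetical sketch of their shape, matching the four-argument machine above (not a working fix; the dispatch and arguments are illustrative only):

import MLJModelInterface as MMI

# define the machine-level representation of (X, y, kernel)
MMI.reformat(::SVMClassifier, X, y, kernel) = (MMI.matrix(X), y, kernel)

# say how a fold restricts that representation; for training, the kernel
# matrix must be restricted on both axes
MMI.selectrows(::SVMClassifier, I, Xmatrix, y, kernel) =
    (view(Xmatrix, I, :), y[I], view(kernel, I, I))

Note that selectrows alone cannot produce the kernel[test_idx, train_idx] block needed at predict time, which adds to the complication.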

Unfortunately, the MLJ sk-learn interfaces are created with a lot of metaprogramming and are therefore difficult to customise. So a fix here would be complicated.

cc @tylerjthomas9

KeishiS commented Sep 23, 2024

Thank you for your reply! 😄

I wasn't familiar with the concept of a "data front end", so I'll take some time to study the information at the link you provided.

While the example code creates a Gram matrix from simple toy data, I'm currently considering a graph kernel, where processing multiple graphs in parallel would be more efficient; that's why I was hoping to use a precomputed kernel if possible. I appreciate your suggestion of LIBSVM; I'll try it.

Based on the information you've provided, I'll think about whether there might be a good alternative approach. For now, I'll close this issue. Thank you very much for taking the time to address my concerns.

KeishiS closed this as completed Sep 23, 2024