Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ASK] Confusion between top_k and by_threshold relevancy_method in python_evaluation.py #2140

Open
mnhqut opened this issue Aug 10, 2024 · 2 comments
Labels
help wanted Need help from developers

Comments

@mnhqut
Copy link

mnhqut commented Aug 10, 2024

Description

What is the different between k and threshold value here if they all get assigned to top_k?
Isn't threshold supposed to be a rating value that the predictions should exceed instead of being the number of items in the top_k list ?
Thanks.

def merge_ranking_true_pred(
    rating_true,
    rating_pred,
    col_user,
    col_item,
    col_prediction,
    relevancy_method,
    k=DEFAULT_K,
    threshold=DEFAULT_THRESHOLD,
    **_,
):
    if relevancy_method == "top_k":
        top_k = k
    elif relevancy_method == "by_threshold":
        top_k = threshold

Other Comments

@mnhqut mnhqut added the help wanted Need help from developers label Aug 10, 2024
@daviddavo
Copy link
Collaborator

I don't really understand it neither. It calls get_top_k_items, which does this:

# Sort dataframe by col_user and (top k) col_rating
if k is None:
top_k_items = dataframe
else:
top_k_items = (
dataframe.sort_values([col_user, col_rating], ascending=[True, False])
.groupby(col_user, as_index=False)
.head(k)
.reset_index(drop=True)
)
# Add ranks
top_k_items["rank"] = top_k_items.groupby(col_user, sort=False).cumcount() + 1
return top_k_items

With k being top_k (threshold). It then calls .head(k), but according to pandas docs, k should be an integer.

The DEFAULT_THRESHOLD variable is just 10, an integer. Have you tried using a float? What happens then? Does it work? Isn't this exactly the same as using k?

I think a user expects by_threshold to just return just items that have a rating above that threshold.

@mnhqut
Copy link
Author

mnhqut commented Aug 15, 2024

Yes it is indeed very confusing how 'by_threshold' works.

Doing a quick test, I passed "relevancy_method = "top_k", k =20 ", and got the exact same result as passing "relevancy_method = "by_threshold", threshold = 20, k = whatever ". So the threshold here does not seem to represent a (float) rating value.

As for the type of 'top_k', if I passed 'threshold' or 'k' as a float, it raised no error and worked as if the number got rounded up, for example 2.1 as 3 or 7.6 as 8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Need help from developers
Projects
None yet
Development

No branches or pull requests

2 participants