feature: new tests added for tsne to expand test coverage #2229
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Flags with carried forward coverage won't be shown.
/intelci: run
/intelci: run
…nt results, merge previous deleted gpu test to complex test
/intelci: run
It looks like we don't have any test here, nor in daal4py, that would be checking that the results from TSNE make sense beyond having the right shape and non-missingness. Since there's a very particular dataset here for the last test, it'd be helpful to add other assertions there along the lines of checking that the embeddings end up making some points closer than others, as would be expected given the input data.
…or parametrization names, removed extra tests
Hi David,
I think given the characteristics of the data you are passing, it could be done by selecting some hard-coded set of points by index from "Complex Dataset1" that should end up being similar, and some selected set of points that should end up being dissimilar to the earlier ones; the test would then check that the Euclidean distances in the embedding space among the points in the first set are smaller than the distances between each point in the first set and each point in the second set. Also, maybe "Complex Dataset2" is not needed.
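The assertion suggested above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual test: the data here is a hypothetical stand-in for "Complex Dataset1" (two tight, well-separated clusters), and the index sets are chosen to match that synthetic data.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for "Complex Dataset1": two tight, well-separated clusters
rng = np.random.default_rng(42)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(10, 4))
cluster_b = rng.normal(loc=10.0, scale=0.1, size=(10, 4))
X = np.vstack([cluster_a, cluster_b])

emb = TSNE(n_components=2, perplexity=5.0, random_state=42).fit_transform(X)

# Hard-coded index sets: points expected to be mutually similar vs. dissimilar
similar_idx = np.arange(10)
dissimilar_idx = np.arange(10, 20)

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

within = pairwise_dist(emb[similar_idx], emb[similar_idx])
between = pairwise_dist(emb[similar_idx], emb[dissimilar_idx])

# Points from the same cluster should end up closer together in the
# embedding space than points from different clusters
assert within.mean() < between.mean()
```

A stricter variant could compare the maximum within-cluster distance against the minimum between-cluster distance, at the cost of being more sensitive to the t-SNE hyperparameters.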
Hi David!
/intelci: run
Odd that the CI fails for sklearn 1.0 and 1.1. I see that there are many places throughout the code with conditions for sklearn<1.2 though:
Hi David!
I actually tested using sklearn and see the same behavior; here is my simple script, and I also see a non-zero embedding in the end. I removed the check for a zero embedding because we see zero and non-zero embeddings for different versions. Hope this helps!
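The script itself isn't shown in the comment; a plausible minimal version of the check being described (stock sklearn TSNE on constant input) might look like this. Note that whether the resulting values are exactly zero varies across sklearn versions, per the discussion above, so that is deliberately not asserted:

```python
import numpy as np
from sklearn.manifold import TSNE

# Constant input: every row is identical
X = np.full((20, 4), 1.0)

emb = TSNE(n_components=2, perplexity=5.0, random_state=0).fit_transform(X)

# Shape and finiteness can be checked unconditionally; exact zeros cannot,
# since zero vs. non-zero embeddings are observed on different versions
print(emb.shape)  # (20, 2)
assert np.all(np.isfinite(emb))
```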
Then please move it into a separate test. Also, I don't think the test should look for exact zeros everywhere. For constant data, it should test for a constant first dimension and a zero second dimension.
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_tsne_reproducibility(dataframe, queue, dtype):
    """
    Test reproducibility
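The excerpt above is cut off in the diff view. A self-contained sketch of what such a reproducibility test could look like follows; the `dataframe` and `queue` fixtures come from sklearnex's test utilities and are dropped here, with stock sklearn TSNE standing in for the sklearnex estimator:

```python
import numpy as np
import pytest
from sklearn.manifold import TSNE

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_tsne_reproducibility(dtype):
    """Two fits with the same random_state should give the same embedding."""
    X = np.random.default_rng(0).standard_normal((30, 5)).astype(dtype)
    emb1 = TSNE(n_components=2, perplexity=5.0, random_state=42).fit_transform(X)
    emb2 = TSNE(n_components=2, perplexity=5.0, random_state=42).fit_transform(X)
    np.testing.assert_allclose(emb1, emb2, rtol=1e-5, atol=1e-5)
```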
Let's better remove obvious or redundant comments. The same goes, for example, for error messages in assertions: you will see the line that failed in the log when that happens, and many of those cases will be pretty obvious from the code line, like "isfinite failed".
Description
Added additional tests in sklearnex/manifold/tests/test_tsne.py to expand test coverage for the t-SNE algorithm.
PR completeness and readability
Testing