Failing to test "IntelTensorFlow_for_LLMs" sample in CI #2445

Open
Ankur-singh opened this issue Aug 12, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Ankur-singh
Contributor

Summary


The "IntelTensorFlow_for_LLMs" sample takes ~5 hours to run in CI, so it times out.

Environment

OS: Linux

Observed behavior

The sample shows how to fine-tune a 6B-parameter model, which takes ~5 hours on CPU. This makes it hard to test the sample in CI.

Expected behavior

Ideally, the sample should not take more than a few minutes to run. We can use an environment variable to detect whether the sample is running in CI and, if so, run it for only a few batches. That would be more than enough to verify the correctness of the code sample.
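The env-var gating described above could look something like the sketch below. The `training_steps_limit` helper is hypothetical (not part of the sample today); it relies on the fact that GitHub Actions and most CI systems export `CI=true`:

```python
import os

def training_steps_limit(default_steps=None):
    """Hypothetical helper: return a small step cap when running in CI,
    otherwise the caller's default (None = no cap).

    GitHub Actions sets CI=true, so we key off that variable.
    """
    if os.environ.get("CI", "").lower() == "true":
        return 10  # a few batches is enough to smoke-test the code path
    return default_steps

# Example: simulate a CI run
os.environ["CI"] = "true"
print(training_steps_limit())  # prints 10 when CI=true
```

The returned cap could then be passed as a `max_steps`-style argument to the training loop, so local runs remain unchanged while CI runs finish in minutes.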

@Ankur-singh Ankur-singh added the bug Something isn't working label Aug 12, 2024
@fongjiantan
Contributor

Hi @Ankur-singh, I'm the team lead for oneAPI_CS_Team4. My team would like to look into this issue for the hackathon.

Per my understanding, increasing the train/eval batch size will result in faster training but lower accuracy, which is a reasonable tradeoff for running it in a CI environment.

I tested by increasing the batch size from 64 to 256 in the TrainingArgs class in GPTJ_finetuning.py:

self.per_device_train_batch_size=256
self.per_device_eval_batch_size=256

Here's a sample output:

2025-03-05 09:42:55.939159: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
 5/33 [===>..........................] - ETA: 42:18 - loss: 1.2633 - accuracy: 0.600
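As a rough sanity check on the log above, the step count per epoch follows from steps = ceil(dataset_size / batch_size), so the 33 steps at batch size 256 bound the training-set size and let us estimate the step count at the original batch size of 64 (an estimate only, since the last batch may be partial):

```python
import math

# From the log above: 33 steps per epoch at batch size 256,
# so the training split holds at most 33 * 256 examples.
examples_upper_bound = 33 * 256  # 8448

# At the original batch size of 64 the same data would take roughly
# four times as many steps per epoch.
steps_at_64 = math.ceil(examples_upper_bound / 64)
print(steps_at_64)  # prints 132
```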

Maybe you could test this in your CI environment and see how long it takes?

Also, may I know which environment variable to check to tell whether the sample is running in CI?

Thanks!

@Ankur-singh
Contributor Author

@fongjiantan all TensorFlow samples have been moved to a separate repo. IMO this is not a priority; you should check with @jimmytwei.
