Failing to test "IntelTensorFlow_for_LLMs" sample in CI #2445

Open
Ankur-singh opened this issue Aug 12, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Ankur-singh
Contributor

Summary


The "IntelTensorFlow_for_LLMs" sample takes ~5 hours to run in CI, so it times out.

Environment

OS: Linux

Observed behavior

The sample shows how to fine-tune a 6B-parameter model, which takes ~5 hours on CPU. This makes it hard to test the sample in CI.

Expected behavior

Ideally, the sample should not take more than a few minutes to run. We can use an environment variable to detect whether the sample is running in CI and, if so, run it for only a few batches. That would be more than enough to verify the correctness of the code sample.
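The env-var gating described above could look something like the sketch below. The `training_steps_limit` helper is hypothetical (not part of the sample today); it relies on the fact that GitHub Actions and most CI systems export `CI=true`:

```python
import os

def training_steps_limit(default_steps=None):
    """Hypothetical helper: return a small step cap when running in CI,
    otherwise the caller's default (None = no cap).

    GitHub Actions sets CI=true, so we key off that variable.
    """
    if os.environ.get("CI", "").lower() == "true":
        return 10  # a few batches is enough to smoke-test the code path
    return default_steps

# Example: simulate a CI run
os.environ["CI"] = "true"
print(training_steps_limit())  # prints 10 when CI=true
```

The returned cap could then be passed as a `max_steps`-style argument to the training loop, so local runs remain unchanged while CI runs finish in minutes.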

@Ankur-singh Ankur-singh added the bug Something isn't working label Aug 12, 2024
@fongjiantan
Contributor

Hi @Ankur-singh, I'm the team lead for oneAPI_CS_Team4. My team would like to look into this issue for the hackathon.

Per my understanding, increasing the train/eval batch size will result in faster training but lower accuracy, which is a reasonable tradeoff for running it in a CI environment.

I tested by increasing the batch size from 64 to 256 in the TrainingArgs class in GPTJ_finetuning.py:

self.per_device_train_batch_size=256
self.per_device_eval_batch_size=256

Here's a sample output:

2025-03-05 09:42:55.939159: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
 5/33 [===>..........................] - ETA: 42:18 - loss: 1.2633 - accuracy: 0.600
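As a rough sanity check on the log above, the step count per epoch follows from steps = ceil(dataset_size / batch_size), so the 33 steps at batch size 256 bound the training-set size and let us estimate the step count at the original batch size of 64 (an estimate only, since the last batch may be partial):

```python
import math

# From the log above: 33 steps per epoch at batch size 256,
# so the training split holds at most 33 * 256 examples.
examples_upper_bound = 33 * 256  # 8448

# At the original batch size of 64 the same data would take roughly
# four times as many steps per epoch.
steps_at_64 = math.ceil(examples_upper_bound / 64)
print(steps_at_64)  # prints 132
```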

Maybe you could test this in your CI environment and see how long it takes?

Also, may I know which environment variable to check to tell whether the sample is running in CI?

Thanks!

@Ankur-singh
Contributor Author

@fongjiantan all TensorFlow samples have been moved to a separate repo. IMO this is not a priority; you should check with @jimmytwei.
