Summary
The "IntelTensorFlow_for_LLMs" sample takes ~5 hours to run, which causes it to time out in CI.
Environment
OS: Linux
Observed behavior
The sample shows how to fine-tune a 6B-parameter model, which takes ~5 hours on CPU. This makes the sample hard to test in CI.
Expected behavior
Ideally, the sample should take no more than a few minutes to run. We can use an environment variable to detect whether the sample is running in CI and, if so, train for only a few batches. That would be more than enough to verify the correctness of the code sample; a sketch follows below.
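For illustration, a minimal sketch of the proposed check. The `CI` variable name is an assumption (most providers, e.g. GitHub Actions, export `CI=true`), and the batch cap, `dataset`, and `train_step` are illustrative placeholders, not the sample's actual code:

```python
import os

# Most CI providers (e.g., GitHub Actions) export CI=true by default; the
# variable name checked here is an assumption about the runner's environment.
IS_CI = os.environ.get("CI", "").lower() == "true"

# Cap the run at a few batches in CI; None means "use the full dataset".
MAX_CI_BATCHES = 4 if IS_CI else None

def train(dataset, train_step):
    """Run the training loop, stopping early when a CI batch cap is set."""
    for step, batch in enumerate(dataset):
        if MAX_CI_BATCHES is not None and step >= MAX_CI_BATCHES:
            break  # a few batches are enough to verify the code runs
        train_step(batch)
```

A few forward/backward passes exercise the same code paths as a full run, so this catches import, shape, and API errors without the five-hour cost.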
Hi @Ankur-singh, I'm the team lead for oneAPI_CS_Team4. My team would like to look into this issue for the hackathon.
Per my understanding, increasing the train/eval batch size will result in faster training but lower accuracy, which is a reasonable tradeoff for running it in a CI environment.
I tested this by increasing the batch size from 64 to 256 in the TrainingArgs class in GPTJ_finetuning.py:
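A sketch of the change; only the 64 → 256 values come from the comment above, and the field names inside TrainingArgs are assumptions, not the sample's exact definition:

```python
# GPTJ_finetuning.py -- sketch of the tested change. The field names are
# illustrative assumptions; only the batch-size values (64 -> 256) are the
# actual edit described above.
class TrainingArgs:
    train_batch_size = 256  # was 64
    eval_batch_size = 256   # was 64
```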