Embedding dimension mismatch with BGE-m3 model #14525

ivanipenburg · 2025-02-19T13:07:04Z

Is there an existing issue for this?

I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I'm using BGE-m3 as a sentence embedding model on Databricks to embed a text column. I followed the instructions on the model page:
https://sparknlp.org/2024/02/11/bge_m3_xx.html

Current Behavior

Loading the pretrained model fails with an IllegalArgumentException.

The full error is as follows:

bge_m3 download started this may take some time.
Approximate size to download 391.8 MB
[OK!]
IllegalArgumentException: requirement failed: Embedding dimension mismatch: expected 15, but found 1
File <command-4667109844782750>, line 6
      2 from sparknlp.annotator.embeddings.sentence_embeddings import SentenceEmbeddings
      3 from pyspark.ml import Pipeline
      5 embeddings = (
----> 6     XlmRoBertaSentenceEmbeddings.pretrained("bge_m3", "xx")
      7     .setInputCols(["embedding_text"])
      8     .setOutputCol("embedding")
      9 )
     11 pipeline = Pipeline().setStages([embeddings])
     12 pipeline_model = pipeline.fit(df)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/annotator/embeddings/xlm_roberta_sentence_embeddings.py:194, in XlmRoBertaSentenceEmbeddings.pretrained(name, lang, remote_loc)
    176 """Downloads and loads a pretrained model.
    177 
    178 Parameters
   (...)
    191     The restored model
    192 """
    193 from sparknlp.pretrained import ResourceDownloader
--> 194 return ResourceDownloader.downloadModel(XlmRoBertaSentenceEmbeddings, name, lang, remote_loc)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/pretrained/resource_downloader.py:95, in ResourceDownloader.downloadModel(reader, name, language, remote_loc, j_dwn)
     93 t1.start()
     94 try:
---> 95     j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
     96 except Py4JJavaError as e:
     97     sys.stdout.write("\n" + str(e))
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/internal/__init__.py:606, in _DownloadModel.__init__(self, reader, name, language, remote_loc, validator)
    605 def __init__(self, reader, name, language, remote_loc, validator):
--> 606     super(_DownloadModel, self).__init__(
    607         "com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel",
    608         reader,
    609         name,
    610         language,
    611         remote_loc,
    612     )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py:27, in ExtendedJavaWrapper.__init__(self, java_obj, *args)
     25 super(ExtendedJavaWrapper, self).__init__(java_obj)
     26 self.sc = SparkContext._active_spark_context
---> 27 self._java_obj = self.new_java_obj(java_obj, *args)
     28 self.java_obj = self._java_obj
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py:37, in ExtendedJavaWrapper.new_java_obj(self, java_class, *args)
     36 def new_java_obj(self, java_class, *args):
---> 37     return self._new_java_obj(java_class, *args)
File /databricks/spark/python/pyspark/ml/wrapper.py:85, in JavaWrapper._new_java_obj(java_class, *args)
     83     java_obj = getattr(java_obj, name)
     84 java_args = [_py2java(sc, arg) for arg in args]
---> 85 return java_obj(*java_args)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
   1349 command = proto.CALL_COMMAND_NAME +\
   1350     self.command_header +\
   1351     args_command +\
   1352     proto.END_COMMAND_PART
   1354 answer = self.gateway_client.send_command(command)
-> 1355 return_value = get_return_value(
   1356     answer, self.gateway_client, self.target_id, self.name)
   1358 for temp_arg in temp_args:
   1359     if hasattr(temp_arg, "_detach"):
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:230, in capture_sql_exception.<locals>.deco(*a, **kw)
    226 converted = convert_exception(e.java_exception)
    227 if not isinstance(converted, UnknownException):
    228     # Hide where the exception came from that shows a non-Pythonic
    229     # JVM exception message.
--> 230     raise converted from None
    231 else:
    232     raise

Expected Behavior

I expect the embedding model to load without an error, since I'm following the instructions in the documentation.

Steps To Reproduce

You can reproduce using the following code:

from sparknlp.annotator.embeddings.xlm_roberta_sentence_embeddings import XlmRoBertaSentenceEmbeddings

embeddings = (
    XlmRoBertaSentenceEmbeddings.pretrained("bge_m3", "xx")
    .setInputCols(["embedding_text"])
    .setOutputCol("embedding")
)

Spark NLP version and Apache Spark

Spark NLP: 5.5.2
Apache Spark: 3.5.0
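
For anyone trying to reproduce this, here is a minimal sketch for collecting the same two version strings from a notebook cell. It assumes the `sparknlp` and `pyspark` modules are importable in the session; the helper names (`spark_nlp_version`, `spark_version`, `environment_report`) are hypothetical and not part of either library.

```python
from typing import Dict, Optional


def spark_nlp_version() -> Optional[str]:
    """Return the Spark NLP Python package version, or None if it is not importable."""
    try:
        import sparknlp
    except ImportError:
        return None
    # sparknlp.version() returns the library version as a string.
    return sparknlp.version()


def spark_version() -> Optional[str]:
    """Return the PySpark version, or None if it is not importable."""
    try:
        import pyspark
    except ImportError:
        return None
    return pyspark.__version__


def environment_report() -> Dict[str, Optional[str]]:
    # Hypothetical helper: gathers both versions under fixed labels.
    return {
        "Spark NLP": spark_nlp_version(),
        "Apache Spark": spark_version(),
    }


if __name__ == "__main__":
    for label, ver in environment_report().items():
        print(f"{label}: {ver if ver is not None else 'not installed'}")
```

This only reports the Python-side packages. The Spark NLP JAR attached to the cluster is versioned separately, so it is worth checking that it matches the Python package.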

Type of Spark Application

Python Application

Java Version

openjdk version "1.8.0_412"
OpenJDK Runtime Environment (Zulu 8.78.0.19-CA-linux64) (build 1.8.0_412-b08)
OpenJDK 64-Bit Server VM (Zulu 8.78.0.19-CA-linux64) (build 25.412-b08, mixed mode)

Java Home Directory

/usr/lib/jvm/zulu8-ca-amd64/jre/

Setup and installation

I followed the installation instructions under "Install Spark NLP on Databricks" from here: https://sparknlp.org/docs/en/install

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

ivanipenburg added the question label Feb 19, 2025

ivanipenburg assigned maziyarpanahi Feb 19, 2025

maziyarpanahi assigned ahmedlone127 Feb 20, 2025
