Embedding dimension mismatch with BGE-m3 model #14525

Open
1 task done
ivanipenburg opened this issue Feb 19, 2025 · 0 comments

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I'm using BGE-m3 as a sentence embedding model on Databricks to embed a text column, following the instructions on the model page:
https://sparknlp.org/2024/02/11/bge_m3_xx.html

Current Behavior

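Loading the pretrained model fails with an IllegalArgumentException. The error is raised by the pretrained() call itself, before the pipeline is fitted.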

The full error is as follows:

bge_m3 download started this may take some time.
Approximate size to download 391.8 MB
[OK!]
IllegalArgumentException: requirement failed: Embedding dimension mismatch: expected 15, but found 1
File <command-4667109844782750>, line 6
      2 from sparknlp.annotator.embeddings.sentence_embeddings import SentenceEmbeddings
      3 from pyspark.ml import Pipeline
      5 embeddings = (
----> 6     XlmRoBertaSentenceEmbeddings.pretrained("bge_m3", "xx")
      7     .setInputCols(["embedding_text"])
      8     .setOutputCol("embedding")
      9 )
     11 pipeline = Pipeline().setStages([embeddings])
     12 pipeline_model = pipeline.fit(df)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/annotator/embeddings/xlm_roberta_sentence_embeddings.py:194, in XlmRoBertaSentenceEmbeddings.pretrained(name, lang, remote_loc)
    176 """Downloads and loads a pretrained model.
    177 
    178 Parameters
   (...)
    191     The restored model
    192 """
    193 from sparknlp.pretrained import ResourceDownloader
--> 194 return ResourceDownloader.downloadModel(XlmRoBertaSentenceEmbeddings, name, lang, remote_loc)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/pretrained/resource_downloader.py:95, in ResourceDownloader.downloadModel(reader, name, language, remote_loc, j_dwn)
     93 t1.start()
     94 try:
---> 95     j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
     96 except Py4JJavaError as e:
     97     sys.stdout.write("\n" + str(e))
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/internal/__init__.py:606, in _DownloadModel.__init__(self, reader, name, language, remote_loc, validator)
    605 def __init__(self, reader, name, language, remote_loc, validator):
--> 606     super(_DownloadModel, self).__init__(
    607         "com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel",
    608         reader,
    609         name,
    610         language,
    611         remote_loc,
    612     )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py:27, in ExtendedJavaWrapper.__init__(self, java_obj, *args)
     25 super(ExtendedJavaWrapper, self).__init__(java_obj)
     26 self.sc = SparkContext._active_spark_context
---> 27 self._java_obj = self.new_java_obj(java_obj, *args)
     28 self.java_obj = self._java_obj
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py:37, in ExtendedJavaWrapper.new_java_obj(self, java_class, *args)
     36 def new_java_obj(self, java_class, *args):
---> 37     return self._new_java_obj(java_class, *args)
File /databricks/spark/python/pyspark/ml/wrapper.py:85, in JavaWrapper._new_java_obj(java_class, *args)
     83     java_obj = getattr(java_obj, name)
     84 java_args = [_py2java(sc, arg) for arg in args]
---> 85 return java_obj(*java_args)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
   1349 command = proto.CALL_COMMAND_NAME +\
   1350     self.command_header +\
   1351     args_command +\
   1352     proto.END_COMMAND_PART
   1354 answer = self.gateway_client.send_command(command)
-> 1355 return_value = get_return_value(
   1356     answer, self.gateway_client, self.target_id, self.name)
   1358 for temp_arg in temp_args:
   1359     if hasattr(temp_arg, "_detach"):
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:230, in capture_sql_exception.<locals>.deco(*a, **kw)
    226 converted = convert_exception(e.java_exception)
    227 if not isinstance(converted, UnknownException):
    228     # Hide where the exception came from that shows a non-Pythonic
    229     # JVM exception message.
--> 230     raise converted from None
    231 else:
    232     raise

Expected Behavior

I expect the embedding model to load without error, since I'm following the instructions in the documentation.

Steps To Reproduce

The following code reproduces the error:

from sparknlp.annotator.embeddings.xlm_roberta_sentence_embeddings import XlmRoBertaSentenceEmbeddings

embeddings = (
    XlmRoBertaSentenceEmbeddings.pretrained("bge_m3", "xx")
    .setInputCols(["embedding_text"])
    .setOutputCol("embedding")
)

Spark NLP version and Apache Spark

Spark NLP: 5.5.2
Apache Spark: 3.5.0
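
For anyone trying to confirm they are on the same versions, here is a minimal sketch that reads the installed Python package versions. It assumes the packages are installed under their PyPI distribution names `spark-nlp` and `pyspark`; `installed_version` is a hypothetical helper, not part of Spark NLP. On Databricks the JVM library is attached to the cluster separately, so this only reflects the Python side and may report nothing for packages that ship without distribution metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if it is not found.

    `package` is the distribution name as published on PyPI (an assumption here:
    "spark-nlp" and "pyspark").
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Collect the versions relevant to this report.
report = {name: installed_version(name) for name in ("spark-nlp", "pyspark")}

for name, found in report.items():
    print(f"{name}: {found if found is not None else 'not found'}")
```

If both the Python package and the cluster's JVM library are installed, it is worth checking that their versions match, since a mismatch between the two is one possible source of loading problems.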

Type of Spark Application

Python Application

Java Version

openjdk version "1.8.0_412"
OpenJDK Runtime Environment (Zulu 8.78.0.19-CA-linux64) (build 1.8.0_412-b08)
OpenJDK 64-Bit Server VM (Zulu 8.78.0.19-CA-linux64) (build 25.412-b08, mixed mode)

Java Home Directory

/usr/lib/jvm/zulu8-ca-amd64/jre/

Setup and installation

I followed the installation instructions under "Install Spark NLP on Databricks" from here: https://sparknlp.org/docs/en/install

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response
