
Best practices for running a model for sentence embeddings on two different columns #14468


Open
macarran opened this issue Nov 28, 2024 · 7 comments

@macarran

Link to the documentation pages (if available)

No response

How could the documentation be improved?

Hi,
I tried searching for existing documentation or discussions on how to run a model for sentence embeddings over two separate columns and did not find any. I was wondering if there are any recommendations or known gotchas on the topic. Say I have a data frame with name and address columns and would like to use a RoBERTa model to compute sentence embeddings for both. The best I could come up with is something like the following:

def createPipeline(source: String): Pipeline = {
  val documentAssembler = new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol("document")

  val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

  val embeddings = XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(false)

  val sentenceEmbeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "embeddings"))
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

  val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols("sentence_embeddings")
    .setOutputCols("finished_embeddings")
    .setOutputAsVector(true)
    .setCleanAnnotations(false)

  new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    sentenceEmbeddings,
    embeddingsFinisher
  ))
}

And then basically doing something like this:

val testDataWithNameEmbeddings = createPipeline("name")
  .fit(testData)
  .transform(testData)
  .select($"name", $"address", $"finished_embeddings".alias("name_embeddings"))

val testDataWithBothEmbeddings = createPipeline("address")
  .fit(testDataWithNameEmbeddings)
  .transform(testDataWithNameEmbeddings)
  .select($"name", $"address", $"name_embeddings", $"finished_embeddings".alias("address_embeddings"))

This appears to work, but feels... wrong? The existence of MultiDocumentAssembler and setInputCols APIs on several of the stages led me down a rabbit hole of trying out different approaches to see if I could annotate and tokenize multiple columns in one stage, but I hit a variety of issues and assertions for different components of the pipeline. For example, calling setInputCols on Tokenizer with an array containing more than one column results in:

IllegalArgumentException: requirement failed: setInputCols in REGEX_TOKENIZER_2889f26665ad expecting 1 columns. Provided column amount: 2. Which should be columns from the following annotators: document.
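
For reference, a minimal sketch of the kind of call that triggers this assertion (the two document column names here are hypothetical):

// Hypothetical attempt: pointing a single Tokenizer at two document columns.
// Tokenizer expects exactly one DOCUMENT-type input, so setInputCols fails
// with the requirement error above as soon as it is called.
val tokenizer = new Tokenizer()
  .setInputCols(Array("document_name", "document_address"))
  .setOutputCol("token")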

The closest thing I stumbled upon is this old issue where someone was trying to run multiple models in one pipeline, but if I try to add two embeddings stages to the pipeline, spark-ml fails with:

IllegalArgumentException: requirement failed: Cannot have duplicate components in a pipeline.

I'm not sure how common a use case this is given there don't seem to be other issues like it; I'd appreciate some thoughts on the topic.

Thanks!

Environment: Spark 3.5.0, Scala 2.12, Spark NLP 5.5.1

@albertoandreottiATgmail
Contributor

Hey @macarran, thanks for reporting this! Are you the same guy who used to work for Intel in Cordoba?

@macarran
Author

@albertoandreottiATgmail, that is correct! Of all the places to find you, this is it; that's funny.

@albertoandreottiATgmail
Contributor

Funny indeed. OK, being a friend of the house, I'll push for this to be taken care of. Have an awesome night!

@macarran
Author

@albertoandreottiATgmail, @DevinTDHa hope you're well!

Perhaps this would have been better suited as a question than a documentation request, as there is some urgency on our side to resolve this. Let me know if you'd like me to close and re-submit with the proper label/form.

I guess my question boils down to whether the approach I shared above is the recommended way to compute embeddings over two (or more) columns when utilizing the same model, or if there are any better ways I might have missed in the documentation.

@albertoandreottiATgmail
Contributor

Hello Marcos,

Nice to hear from you again! I think that in terms of parallelism, if your dataset is big enough, computing one step at a time (first the name, then the address) is good.
Some things you could do:

  • After the first computation for name happens, don't pass the DataFrame straight to the second step; write it to disk in Parquet format first, e.g. testDataWithNameEmbeddings.write.parquet("./tmp123"). Then load it back into a DataFrame with spark.read.parquet("./tmp123") and feed this new DataFrame into the second step. This makes sure nothing from the first step is re-computed in the second one (see the sketch after this list).
  • Don't create a new pipeline every time. Keep one pipeline in memory and only change the input column name,

  val documentAssembler = new DocumentAssembler()
    .setInputCol(source) // <- here
    .setOutputCol("document")

  documentAssembler.setInputCol("address")

and then assemble everything back together with the same instances for all the other stages (don't create new ones). This avoids keeping two copies of the same model in memory.

  • In general, after the two steps have executed I would write everything to disk. This way you avoid re-computation, and if your Spark session crashes (and your DataFrame dies with it), you'll have a backup on disk. This is especially useful when you have a long-running computation, say a couple of hours.
  • Consider streaming mode if you need something more "event-driven": for example, you drop documents into some folder and collect the embeddings on the other side as they are computed (no need to wait for the whole batch to finish).
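
A minimal sketch of the checkpoint-and-reuse pattern described above, assuming the same stage instances from the original createPipeline are kept around and only the DocumentAssembler's input column is swapped between passes (the Parquet paths and variable names are illustrative):

// Sketch only: reuse the same stage instances for both passes, checkpointing
// to Parquet in between so the first pass is not re-computed.
val documentAssembler = new DocumentAssembler()
  .setInputCol("name") // switched to "address" before the second pass
  .setOutputCol("document")
// ... tokenizer, embeddings, sentenceEmbeddings, embeddingsFinisher defined once, as in the original pipeline ...

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, tokenizer, embeddings, sentenceEmbeddings, embeddingsFinisher))

// Pass 1: embeddings for "name", then checkpoint to disk.
pipeline.fit(testData).transform(testData)
  .select($"name", $"address", $"finished_embeddings".alias("name_embeddings"))
  .write.mode("overwrite").parquet("./tmp_name_embeddings")

// Reload so the second pass starts from materialized data rather than the first plan.
val withNameEmbeddings = spark.read.parquet("./tmp_name_embeddings")

// Pass 2: point the same assembler at "address"; the pipeline picks up the
// stage's current params at fit time, so no second copy of the model is created.
documentAssembler.setInputCol("address")
pipeline.fit(withNameEmbeddings).transform(withNameEmbeddings)
  .select($"name", $"address", $"name_embeddings",
    $"finished_embeddings".alias("address_embeddings"))
  .write.mode("overwrite").parquet("./tmp_both_embeddings")

Writing the intermediate result out and reading it back breaks the lineage, so the name pass isn't re-run when the address pass is computed.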

@macarran
Author

@albertoandreottiATgmail appreciate the response! We were observing the behavior you mention, where we'd pass in the same data frame on the second pass and it would recompute the first set of embeddings again. I'm heading out for PTO, but I'll pass this on to our team in the meantime; very helpful.

@albertoandreottiATgmail
Contributor

Glad to hear! Enjoy your vacation :)
