It examines current state-of-the-art (SOTA) models, namely USE, ELMo, and BERT.
It introduces different methods to evaluate those models based on the task.
It also gives a brief explanation of why the models differ.
They did not state which version of USE was used, and there are two versions: one built on a Deep Averaging Network (DAN) and one built on the Transformer architecture. The former is less accurate but more performant on longer sentences, since the DAN's compute grows roughly linearly with sentence length while the Transformer's grows quadratically.
Another thing to note is that ELMo, while contextual, is not deeply contextual, at least according to the people who created BERT: ELMo concatenates independently trained left-to-right and right-to-left representations, whereas BERT conditions on both directions jointly in every layer.
OpenAI’s GPT-2 is also absent from the comparison, and I would have liked to see it included.
There is some buzz about XLNet, but I have not read enough about it to comment in depth. What I can say is that it promises the ability to learn longer-term dependencies in text. Since a Transformer's compute cost grows quadratically with input length, I wonder how they handled that.
Other takeaways:
…without task-specific fine-tuning, it seems that BERT is not well suited to finding similar sentences (see the sketch after this list).
…USE is trained on a number of tasks, but one of the main ones is identifying the similarity between pairs of sentences. The authors note that the task was to identify “semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments”. This would help explain why USE is better at the similarity task.
…Pre-trained models are your friend: most of the models published now can be fine-tuned, but you should use the pre-trained model first to get a quick idea of its suitability.