This project explores fine-tuning five different language models for neologism sentiment analysis. The trained models are integrated into an ensemble.
Neologisms, or newly coined words and expressions, present unique challenges in natural language processing (NLP). This project fine-tunes pre-trained language models to identify and analyze neologisms effectively. The models used in this study include:
- DistilBERT
- GPT-2
- RoBERTa
- BERTweet
- VADER
- FinBERT
The code for the DistilBERT fine-tuning approaches can be found in DistilBERT_Fine_Tuning_For_Neologism.ipynb.
1. LoRA-only fine-tuning
- Description: Applied Low-Rank Adaptation (LoRA) to fine-tune specific layers of DistilBERT while freezing the rest (a minimal setup sketch follows this list).
- Result: Achieved good training accuracy but overfit on the test set due to the limited dataset size.
2. Task-specific fine-tuning
- Description: Fine-tuned DistilBERT for sentiment analysis, leveraging labeled data to align the model with the specific task.
- Result: Improved accuracy from 0.36 to 0.59 on the Reddit dataset, addressing some of the challenges in handling neologisms.
3. Domain-specific robust adaptation
- Description: Introduced domain-specific robust adaptation to enhance the model's ability to handle neologisms and reduce language drift.
- Result: Further improved accuracy to 0.62 on the Reddit dataset, but the model still struggled to generalize.
4. Task-specific fine-tuning with LoRA
- Description: Combined task-specific fine-tuning with LoRA to adapt the model to the neologism data efficiently.
- Result: Achieved the best performance, with an accuracy of 0.74 on the Reddit dataset and 0.86 on the Twitter dataset, while significantly reducing overfitting.
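For reference, the snippet below is a minimal sketch of the LoRA setup from approach 1, using the Hugging Face PEFT library; the rank, scaling factor, and target modules shown are illustrative assumptions rather than the exact values used in the notebook.

```python
# Minimal sketch of the LoRA setup (approach 1) using Hugging Face PEFT.
# The rank, scaling, and target modules are illustrative assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Low-rank adapters are injected into the attention projections;
# all original DistilBERT weights stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,                      # scaling factor for the update
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # only adapter weights are trainable
```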
To fine-tune the GPT-2 model, we first fine-tune it for the task-specific objective of sentiment analysis. Once the model performs well on sentiment analysis, we then fine-tune it to handle neologisms. The raw GPT-2 model achieved only 48% accuracy on sentiment analysis of neologisms; task-specific full fine-tuning for sentiment analysis raised this to 75%. Further fine-tuning on neologisms then brought the overall accuracy on neologisms to 83% with full fine-tuning. The same two-stage recipe is repeated with LoRA and AdaLoRA in the notebooks below (a configuration sketch follows the list).
- finetuning_gpt2_sentiment.ipynb - Fine-tunes the base GPT-2 model for the sentiment analysis task
- finetune_gpt2_neologism.ipynb - Fine-tunes the sentiment-analysis GPT-2 model to perform well on sentences with neologisms
- lora_gpt2_sentiment.ipynb - Fine-tunes the base GPT-2 model for sentiment analysis using Low-Rank Adaptation (LoRA)
- lora_gpt2_neologism.ipynb - Fine-tunes the LoRA sentiment-analysis model to perform well on sentences with neologisms
- ada_lora_gpt2_sentiment.ipynb - Fine-tunes the base GPT-2 model for sentiment analysis using Adaptive Budget Low-Rank Adaptation (AdaLoRA)
- ada_lora_gpt2_neologism.ipynb - Fine-tunes the AdaLoRA sentiment-analysis model to perform well on sentences with neologisms
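As a rough illustration of the staged recipe above, here is a minimal AdaLoRA setup for GPT-2 sentiment classification with Hugging Face PEFT; the ranks, step count, and label count are assumptions, not the notebooks' exact settings.

```python
# Minimal sketch of AdaLoRA on GPT-2 for sequence classification.
# Ranks, step count, and label count are assumptions, not the notebooks' values.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import AdaLoraConfig, get_peft_model, TaskType

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id

# AdaLoRA starts at a higher rank and prunes it during training,
# reallocating the parameter budget toward the most important layers.
adalora_config = AdaLoraConfig(
    task_type=TaskType.SEQ_CLS,
    init_r=12,                  # initial rank before budget reallocation
    target_r=4,                 # final average rank after pruning
    total_step=1000,            # total optimizer steps, drives the rank schedule
    target_modules=["c_attn"],  # GPT-2 fused attention projection
)
model = get_peft_model(model, adalora_config)
# Stage 1: train on a general sentiment dataset.
# Stage 2: continue training on the neologism dataset with the same adapters.
```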
1. Full fine-tuning and LoRA fine-tuning the RoBERTa base model for analyzing neologisms
The code to train and test the RoBERTa base model is in RoBERTa_roberta_base.ipynb. This covers both full fine-tuning and LoRA fine-tuning of the RoBERTa base model.
2. Full fine-tuning and LoRA fine-tuning an already fine-tuned RoBERTa for transfer learning on the downstream task of analyzing neologisms
The code to train and test the fine-tuned RoBERTa model is in RoBERTa_roberta_cardifnlp.ipynb. This covers full fine-tuning and LoRA fine-tuning of a RoBERTa model already trained on the Twitter dataset tweet_eval, and it gives the best accuracy of all the models used (91.2%).
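A minimal sketch of this transfer-learning setup, starting from the cardiffnlp/twitter-roberta-base-sentiment checkpoint and attaching LoRA adapters for the neologism task; the LoRA hyperparameters are assumptions.

```python
# Minimal sketch of the transfer-learning setup: start from a RoBERTa model
# already fine-tuned for sentiment on tweet_eval, then attach LoRA adapters
# for the neologism task. LoRA hyperparameters are assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base)  # 3 labels already

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],  # RoBERTa self-attention projections
)
model = get_peft_model(model, lora_config)
# Train on the neologism dataset with a standard Trainer loop.
```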
3. Fine-tuning RoBERTa by unfreezing only selective layers for the downstream task of analyzing neologisms
The file RoBERTa_roberta_freeze.ipynb contains the steps to fine-tune a RoBERTa model by freezing all layers and training only the classification layer. This is the fastest approach to train but gives sub-par performance compared to full fine-tuning.
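A minimal sketch of the freezing approach, assuming a roberta-base checkpoint and three sentiment labels:

```python
# Minimal sketch of the selective-freezing approach: freeze the whole RoBERTa
# encoder and train only the classification head. Checkpoint name is assumed.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3
)

# Freeze every encoder parameter; only the classifier head stays trainable.
for param in model.roberta.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only classifier.dense.* and classifier.out_proj.* remain
```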
The model performs relatively poorly, correctly identifying sentiment less than half the time (43% accuracy). While it is somewhat more precise (54% precision), it misses many of the correct sentiments (43% recall). This suggests that basic fine-tuning is not sufficient to handle the complexity of neologisms in sentiment analysis.
Adding LoRA improves the model's ability to identify sentiments correctly (51% accuracy) while reducing training time and hardware usage. Essentially, the model captures more true cases but also makes more errors, trading precision for recall.
This model demonstrates a significant improvement, achieving high accuracy (79%) while maintaining a good balance between precision (72%) and recall (79%), implying that the model is both accurate and consistent in detecting sentiment in neologisms.
This yielded the most accurate results (89% accuracy). The model is first fine-tuned on the tweet_eval dataset to improve sentiment analysis, then fine-tuned further on the neologism-specific dataset using DoRA, which improved the results at the cost of longer training time.
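A minimal sketch of the DoRA step: in Hugging Face PEFT, DoRA is enabled by setting use_dora=True on a LoraConfig. The base checkpoint and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of the DoRA step on BERTweet; in Hugging Face PEFT,
# DoRA is enabled via use_dora=True on a LoraConfig.
# The base checkpoint and hyperparameters are illustrative assumptions.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=3
)

dora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],  # BERTweet is RoBERTa-based
    use_dora=True,  # decompose weights into magnitude and direction (DoRA)
)
model = get_peft_model(model, dora_config)
```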
The ahmedrachid/FinancialBERT-Sentiment-Analysis model is a fine-tuned BERT model designed for sentiment analysis of financial text. It specializes in understanding and classifying sentiment in news, reports, and other financial documents. However, this model did not do well on Twitter data or neologisms, as it was fine-tuned on longer-form financial text.
We tried different fine-tuning methods (full fine-tuning, LoRA, and DoRA) on the FinancialBERT sentiment analysis model. The subsequent analysis uses the LoRA fine-tuned variant, as it achieved the best accuracy for this model. Training time: 938.61 seconds. The fine-tuned model predicts classes 2 and 3 better than the original model, but still overestimates class 1 (neutral).
The fine-tuned model achieves significantly higher accuracy across all text lengths, particularly for longer texts, indicating improved handling of complex or detailed inputs.
The fine-tuned model reduces the gap between correct and incorrect predictions for several frequent words, suggesting improved contextual understanding of common terms.
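A hypothetical sketch of the kind of length-bucketed evaluation behind the first claim: group test examples by word count and compare per-bucket accuracy. The toy rows, column names, and bucket edges are invented for illustration.

```python
# Hypothetical sketch of a length-bucketed evaluation: group test examples
# by word count and compare per-bucket accuracy. The toy rows, column names,
# and bucket edges are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "text": ["great rizz",
             "this update is mid honestly",
             "absolutely iconic release, no notes whatsoever today"],
    "label": [2, 0, 2],   # gold sentiment class per example
    "pred":  [2, 1, 2],   # model prediction per example
})
df["length"] = df["text"].str.split().str.len()
df["bucket"] = pd.cut(df["length"], bins=[0, 3, 7, 50],
                      labels=["short", "medium", "long"])
accuracy_by_length = (df["label"] == df["pred"]).groupby(df["bucket"]).mean()
print(accuracy_by_length)  # per-bucket accuracy on the toy data
```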
Each model is trained individually on the neologism dataset and contributes to the final prediction in the deployed application. The ensemble aggregates the outputs of all models to improve accuracy and robustness. We evaluate three ensemble techniques.
The simplest ensemble employs majority voting, where the final label is the one predicted by the majority of models in the ensemble. Accuracy achieved: 80%. The code is available in ensemble_learner_voting.ipynb; a minimal voting sketch follows the example below.
- Limitation: The accuracy can decrease when the individual model predictions diverge significantly.
- Reason: Higher divergence reduces the likelihood of the ensemble aligning with the true label.
- Prediction by Models:
- 3 models predict positive.
- 3 models predict negative.
- Ensemble Decision: Chooses either positive or negative randomly.
- Ideal Answer: Neutral (missed due to lack of consensus).
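A minimal sketch of the voting rule and the tie failure mode described above; the wiring to the five fine-tuned models is assumed and replaced here by hard-coded label strings.

```python
# Minimal sketch of the majority-voting rule; wiring to the five fine-tuned
# models is assumed and replaced here by hard-coded label strings.
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Return the most frequent label; ties are broken arbitrarily."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["positive", "positive", "negative", "positive"]))  # positive

# The failure mode above: a 3-3 positive/negative split picks an arbitrary
# winner and can never recover the ideal "neutral" label.
print(majority_vote(["positive"] * 3 + ["negative"] * 3))
```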
In weighted averaging, multiple models make predictions independently, and each model's prediction is assigned a weight, typically based on that model's performance and reliability. The outputs of all models are converted to a 0-1 scale, and the weighted result is returned.
- Strength: Improves robustness.
- Explanation: Weighted averaging reduces the impact of individual model errors by combining predictions, leading to a more balanced and reliable outcome.
- Limitation: Does not consider individual model confidence.
- Explanation: Weighted averaging applies fixed weights to model predictions regardless of how confident each model is about a specific input, potentially leading to less reliable decisions.
- Prediction by Models:
- Model A predicts positive with 70% probability.
- Model B predicts positive with 60% probability.
- Model C predicts positive with 80% probability.
- Ensemble Decision: The weighted average combines these predictions to confidently select positive, reducing the risk of relying on a single model's error.
- Ideal Answer: A robust prediction of positive, leveraging the strengths of multiple models to improve reliability.
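A minimal sketch of the weighted-averaging rule applied to the example above; the per-model weights are illustrative assumptions.

```python
# Minimal sketch of weighted averaging over the example above; the per-model
# weights are illustrative assumptions (e.g. derived from validation accuracy).
def weighted_average(probs: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-model positive-class probabilities on a 0-1 scale."""
    total = sum(weights.values())
    return sum(probs[m] * weights[m] for m in probs) / total

probs = {"A": 0.70, "B": 0.60, "C": 0.80}   # P(positive) from each model
weights = {"A": 1.0, "B": 0.8, "C": 1.2}    # fixed, performance-based weights
score = weighted_average(probs, weights)
label = "positive" if score >= 0.5 else "negative"
print(f"{score:.2f} -> {label}")            # 0.71 -> positive
```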
We build a meta-model with a multilayer perceptron to form a stacking ensemble. This achieved the highest ensemble accuracy of 87%. The code can be found in ensemble_learner.ipynb. This model captures the non-linear relations between the strengths and weaknesses of the underlying language models, and was even able to handle sentences with slang terms.
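A minimal sketch of the stacking idea, assuming the base models' class probabilities are concatenated into one feature vector per sentence; the data here is a random stand-in, and the MLP size is an assumption rather than the configuration in ensemble_learner.ipynb.

```python
# Minimal sketch of the stacking ensemble: concatenate the base models'
# class probabilities into one feature vector per sentence and train an MLP
# meta-model on top. Random stand-in data; the MLP size is an assumption.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-in for base-model outputs: 5 models x 3 class probabilities = 15 features.
X_train = rng.random((200, 15))
y_train = rng.integers(0, 3, size=200)

meta_model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                           random_state=0)
meta_model.fit(X_train, y_train)

x_new = rng.random((1, 15))        # probabilities for one new sentence
print(meta_model.predict(x_new))   # final ensemble label
```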
- DistilBERT - https://huggingface.co/docs/transformers/en/model_doc/distilbert
- GPT-2 - https://huggingface.co/openai-community/gpt2
- BERTweet - https://huggingface.co/vinai/bertweet-base
- FinBERT - https://huggingface.co/ProsusAI/finbert
- RoBERTa - https://huggingface.co/docs/transformers/en/model_doc/roberta
- VADER - https://github.com/cjhutto/vaderSentiment
- LoRA - https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
- AdaLoRA - https://huggingface.co/docs/peft/main/en/package_reference/adalora
- DoRA - https://github.com/NVlabs/DoRA
- RoBERTa
  - https://huggingface.co/AkhilaGP/roberta-senti-neologism-freeze-1
  - https://huggingface.co/AkhilaGP/roberta-senti-neologism-full-finetune
  - https://huggingface.co/AkhilaGP/roberta-base-neologism-full-finetune
  - https://huggingface.co/AkhilaGP/roberta-senti-neologism-LORA
  - https://huggingface.co/AkhilaGP/roberta-base-neologism-LORA
- GPT-2
  - https://huggingface.co/srinathsk18/ada-lora-finetuned-gpt-2-sentiment-analysis-neologism
  - https://huggingface.co/srinathsk18/finetuned-gpt-2-sentiment-analysis-neologism
  - https://huggingface.co/srinathsk18/lora-finetuned-gpt-2-sentiment-analysis-neologism
- BERTweet
  - https://huggingface.co/Himaparasa/bertweet-sentiment-model (most accurate: fine-tuned with DoRA on tweet_eval, then the neologism dataset)
- DistilBERT
  - https://huggingface.co/PriyadarshiniTamilselvan/finetuning-sentiment-model-neologisms-distilbert-lora
  - https://huggingface.co/PriyadarshiniTamilselvan/finetuning-sentiment-model-neologisms-distilbert-dora
  - https://huggingface.co/PriyadarshiniTamilselvan/finetuning-sentiment-model-neologisms-distilbert