Models hub finance (#687)

dcecchini · jsl-models · gadde5300 · web-flow · commit 11a114e1f35d · 2023-10-06T18:01:41.000-03:00
* Add model 2023-08-03-finner_bert_subpoenas_sm_en (#493)
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
* Delete subpoenas ner finance
* Add model 2023-08-30-finpipe_deid_en (#566)
  Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
* Add model 2023-08-30-finpipe_deid_en (#570)
  Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-08-30-finpipe_deid_en (#571)
  Co-authored-by: SKocer <samedkocer22@gmail.com>
* Delete 2023-08-30-finpipe_deid_en.md
* Add model 2023-08-30-finpipe_deid_en (#572)
  Co-authored-by: gokhanturer <mgturer@gmail.com>
* Add model 2023-08-30-finpipe_deid_en (#574)
  Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-09-01-finpipe_deid_en (#586)
  Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
* Add model 2023-09-01-finpipe_deid_en (#589)
  Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-09-01-finpipe_deid_en (#593)
  Co-authored-by: gokhanturer <mgturer@gmail.com>
* 2023-10-06-finembedding_e5_base_en (#685)
* Add model 2023-10-06-finembedding_e5_base_en
* Add model 2023-10-06-finner_absa_sm_en
* Add model 2023-10-06-finassertion_absa_sm_en
---------
Co-authored-by: dcecchini <dadachini@hotmail.com>
---------
Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
Co-authored-by: SKocer <samedkocer22@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
Co-authored-by: gokhanturer <mgturer@gmail.com>
diff --git a/docs/_posts/dcecchini/2023-10-06-finassertion_absa_sm_en.md b/docs/_posts/dcecchini/2023-10-06-finassertion_absa_sm_en.md
@@ -0,0 +1,151 @@
+---
+layout: model
+title: Financial Assertion of Sentiment (sm, Small)
+author: John Snow Labs
+name: finassertion_absa_sm
+date: 2023-10-06
+tags: [finance, assertion, en, sentiment_analysis, licensed]
+task: Assertion Status
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: AssertionDLModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This assertion model assigns a sentiment (`POSITIVE`, `NEGATIVE`, or `NEUTRAL`) to each financial entity detected in the text. It is designed to be used together with the associated NER model, `finner_absa_sm`.
+
+## Predicted Entities
+
+`POSITIVE`, `NEGATIVE`, `NEUTRAL`
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertion_absa_sm_en_1.0.0_3.0_1696606845902.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertion_absa_sm_en_1.0.0_3.0_1696606845902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
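+from johnsnowlabs import nlp, finance
+from pyspark.sql import functions as F
+
+# Assumes an active licensed session, e.g. spark = nlp.start()
+documentAssembler = (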
+    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
+)
+
+# Sentence Detector annotator, processes various sentences per line
+sentenceDetector = (
+    nlp.SentenceDetector()
+    .setInputCols(["document"])
+    .setOutputCol("sentence")
+)
+
+# Tokenizer splits words in a relevant format for NLP
+tokenizer = (
+    nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
+)
+
+bert_embeddings = (
+    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
+    .setInputCols(["sentence", "token"])
+    .setOutputCol("embeddings")
+    .setMaxSentenceLength(512)
+)
+
+ner_model = (
+    finance.NerModel.pretrained("finner_absa_sm", "en", "finance/models")
+    .setInputCols(["sentence", "token", "embeddings"])
+    .setOutputCol("ner")
+)
+
+ner_converter = (
+    finance.NerConverterInternal()
+    .setInputCols(["sentence", "token", "ner"])
+    .setOutputCol("ner_chunk")
+)
+
+assertion_model = (
+    finance.AssertionDLModel.pretrained("finassertion_absa_sm", "en", "finance/models")
+    .setInputCols(["sentence", "ner_chunk", "embeddings"])
+    .setOutputCol("assertion")
+)
+
+nlpPipeline = nlp.Pipeline(
+    stages=[
+        documentAssembler,
+        sentenceDetector,
+        tokenizer,
+        bert_embeddings,
+        ner_model,
+        ner_converter,
+        assertion_model,
+    ]
+)
+
+
+text = "Equity and earnings of affiliates in Latin America increased to $4.8 million in the quarter from $2.2 million in the prior year as the commodity markets in Latin America remain strong through the end of the quarter."
+
+spark_df = spark.createDataFrame([[text]]).toDF("text")
+
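+result = nlpPipeline.fit(spark_df).transform(spark_df)
+
+# Show each detected chunk with its NER label.
+# The predicted sentiment for each chunk is stored in the "assertion" column.
+result.select(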
+    F.explode(
+        F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")
+    ).alias("cols")
+).select(
+    F.expr("cols['0']").alias("entity"),
+    F.expr("cols['1']['entity']").alias("label"),
+).show(
+    50, truncate=False
+)
+```
+
+</div>
+
+## Results
+
+```bash
++--------+---------+
+|entity  |label    |
++--------+---------+
+|Equity  |LIABILITY|
+|earnings|PROFIT   |
++--------+---------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finassertion_absa_sm|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[document, chunk, embeddings]|
+|Output Labels:|[assertion]|
+|Language:|en|
+|Size:|2.7 MB|
+
+## References
+
+In-house annotations of earnings call transcripts.
+
+## Benchmarking
+
+```bash
+     label    precision    recall  f1-score   support
+
+    NEGATIVE       0.57      0.42      0.48        74
+     NEUTRAL       0.51      0.70      0.59       184
+    POSITIVE       0.75      0.64      0.69       324
+```
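The benchmark table for `finassertion_absa_sm` lists per-label scores but no averages. As a rough sketch, the macro and support-weighted F1 can be derived from the reported rows. Because the inputs are already rounded to two decimals, the results are approximations, not officially reported figures. The helper names `macro_f1` and `weighted_f1` are illustrative and not part of any library.

```python
# Per-label (f1-score, support) pairs copied from the benchmark table above.
REPORTED = {
    "NEGATIVE": (0.48, 74),
    "NEUTRAL": (0.59, 184),
    "POSITIVE": (0.69, 324),
}


def macro_f1(rows):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-label F1 scores, weighted by each label's support."""
    total_support = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total_support


if __name__ == "__main__":
    print(f"macro F1    ~ {macro_f1(REPORTED):.2f}")
    print(f"weighted F1 ~ {weighted_f1(REPORTED):.2f}")
```

With the rounded inputs this gives roughly 0.59 (macro) and 0.63 (weighted). The gap reflects the class imbalance: `POSITIVE` has the most examples and the highest F1, so it pulls the weighted average up.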
diff --git a/docs/_posts/dcecchini/2023-10-06-finembedding_e5_base_en.md b/docs/_posts/dcecchini/2023-10-06-finembedding_e5_base_en.md
@@ -0,0 +1,93 @@
+---
+layout: model
+title: Finance E5 Embedding Base
+author: John Snow Labs
+name: finembedding_e5_base
+date: 2023-10-06
+tags: [finance, en, licensed, e5, sentence_embedding, onnx]
+task: Embeddings
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+engine: onnx
+annotator: E5Embeddings
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This model is a financial version of the E5 base model, fine-tuned on earnings call transcripts and finance question-answering datasets. Reference: Wang, Liang, et al. "Text embeddings by weakly-supervised contrastive pre-training." arXiv preprint arXiv:2212.03533 (2022).
+
+## Predicted Entities
+
+
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finembedding_e5_base_en_1.0.0_3.0_1696603847700.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finembedding_e5_base_en_1.0.0_3.0_1696603847700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+from johnsnowlabs import nlp
+
+# Assumes an active licensed session, e.g. spark = nlp.start()
+document_assembler = (
+    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
+)
+
+E5_embedding = (
+    nlp.E5Embeddings.pretrained(
+        "finembedding_e5_base", "en", "finance/models"
+    )
+    .setInputCols(["document"])
+    .setOutputCol("E5")
+)
+pipeline = nlp.Pipeline(stages=[document_assembler, E5_embedding])
+
+data = spark.createDataFrame(
+    [["What is the best way to invest in the stock market?"]]
+).toDF("text")
+
+result = pipeline.fit(data).transform(data)
+result.select("E5.result").show()
+```
+
+</div>
+
+## Results
+
+```bash
++----------------------------------------------------------------------------------------------------+
+|                                                                                          embeddings|
++----------------------------------------------------------------------------------------------------+
+|[0.45521045, -0.16874692, -0.06179046, -0.37956607, 1.152633, 0.6849592, -0.9676384, 0.4624033, ...|
++----------------------------------------------------------------------------------------------------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finembedding_e5_base|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[document]|
+|Output Labels:|[E5]|
+|Language:|en|
+|Size:|398.5 MB|
+
+## References
+
+The model was fine-tuned on the following datasets:
+
+- [FiQA](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/) (publicly available)
+- In-house annotated earnings call transcripts
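Sentence embeddings such as those produced by `finembedding_e5_base` are commonly compared with cosine similarity, for example to rank candidate passages against a question. The sketch below is plain Python and independent of Spark NLP. The function names and the three-dimensional toy vectors are hypothetical stand-ins for real model output.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy vectors, not real model output.
    query = [1.0, 0.0, 1.0]
    candidates = {
        "passage_a": [2.0, 0.0, 2.0],  # same direction as the query
        "passage_b": [0.0, 1.0, 0.0],  # orthogonal to the query
        "passage_c": [1.0, 1.0, 0.0],  # partially aligned with the query
    }
    for name, score in rank_by_similarity(query, candidates):
        print(f"{name}: {score:.3f}")
```

Cosine similarity ignores vector length, so `passage_a` ranks first even though its magnitude differs from the query's.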
diff --git a/docs/_posts/dcecchini/2023-10-06-finner_absa_sm_en.md b/docs/_posts/dcecchini/2023-10-06-finner_absa_sm_en.md
@@ -0,0 +1,147 @@
+---
+layout: model
+title: Financial NER for Aspect-based Sentiment Analysis (sm, Small)
+author: John Snow Labs
+name: finner_absa_sm
+date: 2023-10-06
+tags: [finance, en, ner, licensed]
+task: Named Entity Recognition
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: FinanceNerModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This NER model identifies entities that can be associated with a financial sentiment. It is designed to be used with the associated Assertion Status model (`finassertion_absa_sm`), which classifies each entity into a sentiment category.
+
+## Predicted Entities
+
+`REVENUE`, `EXPENSE`, `PROFIT`, `KPI`, `GAINS`, `ASSET`, `LIABILITY`, `CASHFLOW`, `LOSSES`, `FREE_CASH_FLOW`
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_absa_sm_en_1.0.0_3.0_1696605316183.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_absa_sm_en_1.0.0_3.0_1696605316183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
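+from johnsnowlabs import nlp, finance
+from pyspark.sql import functions as F
+
+# Assumes an active licensed session, e.g. spark = nlp.start()
+document_assembler = nlp.DocumentAssembler()\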
+    .setInputCol("text")\
+    .setOutputCol("document")
+
+sentence_detector = nlp.SentenceDetector() \
+    .setInputCols(["document"]) \
+    .setOutputCol("sentence") \
+    .setCustomBounds(["\n\n"])
+
+tokenizer = nlp.Tokenizer()\
+    .setInputCols(["sentence"])\
+    .setOutputCol("token")
+
+embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
+    .setInputCols(["sentence", "token"])\
+    .setOutputCol("embeddings")\
+    .setCaseSensitive(True)\
+    .setMaxSentenceLength(512)
+
+ner_model = finance.NerModel.pretrained("finner_absa_sm", "en", "finance/models")\
+    .setInputCols(["sentence", "token", "embeddings"])\
+    .setOutputCol("ner")
+
+ner_converter = finance.NerConverterInternal()\
+    .setInputCols(["sentence", "token", "ner"])\
+    .setOutputCol("ner_chunk")
+
+pipeline = nlp.Pipeline(stages=[
+    document_assembler,
+    sentence_detector,
+    tokenizer,
+    embeddings,
+    ner_model,
+    ner_converter
+])
+
+model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
+
+
+text = "Equity and earnings of affiliates in Latin America increased to $4.8 million in the quarter from $2.2 million in the prior year as the commodity markets in Latin America remain strong through the end of the quarter."
+
+spark_df = spark.createDataFrame([[text]]).toDF("text")
+
+result = model.transform(spark_df)
+result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols")) \
+      .select(F.expr("cols['0']").alias("entity"),
+              F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)
+
+```
+
+</div>
+
+## Results
+
+```bash
++--------+---------+
+|entity  |label    |
++--------+---------+
+|Equity  |LIABILITY|
+|earnings|PROFIT   |
++--------+---------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finner_absa_sm|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|en|
+|Size:|16.3 MB|
+
+## References
+
+In-house annotations of earnings call transcripts.
+
+## Benchmarking
+
+```bash
+         label    precision    recall  f1-score   support
+
+         B-ASSET     0.6000    0.2400    0.3429        25
+      B-CASHFLOW     0.7000    0.5833    0.6364        12
+       B-EXPENSE     0.7222    0.6500    0.6842        60
+B-FREE_CASH_FLOW     1.0000    1.0000    1.0000         8
+         B-GAINS     0.7333    0.5946    0.6567        37
+           B-KPI     0.7143    0.5556    0.6250        36
+     B-LIABILITY     0.5000    0.2778    0.3571        18
+        B-LOSSES     0.7143    0.7143    0.7143         7
+        B-PROFIT     0.8462    0.8919    0.8684        37
+       B-REVENUE     0.7385    0.8000    0.7680        60
+         I-ASSET     0.8000    0.3636    0.5000        11
+      I-CASHFLOW     0.9091    0.9091    0.9091        11
+       I-EXPENSE     0.7451    0.6230    0.6786        61
+I-FREE_CASH_FLOW     1.0000    1.0000    1.0000        17
+         I-GAINS     0.8333    0.6667    0.7407        30
+           I-KPI     0.8500    0.5000    0.6296        34
+     I-LIABILITY     0.5000    0.5000    0.5000         6
+        I-LOSSES     0.7143    0.6250    0.6667         8
+        I-PROFIT     0.8621    0.9615    0.9091        26
+       I-REVENUE     0.7600    0.7308    0.7451        26
+               O     0.9839    0.9923    0.9880      8660
+```
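The `finner_absa_sm` benchmark is reported per IOB tag (`B-` and `I-` prefixes), while the pipeline output lists whole chunks. The step in between, which `NerConverterInternal` performs inside the pipeline, groups consecutive tags into labeled spans. The function below is a simplified, hypothetical illustration of that grouping in plain Python. It is not the library's implementation and ignores details such as character offsets and confidence scores. The tag sequence in the example is an assumption chosen to match the two chunks shown in the Results section.

```python
from typing import List, Optional, Tuple


def bio_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label extends the open chunk.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk:
            # close the open chunk and discard this token.
            flush()
            current_tokens, current_label = [], None

    flush()
    return chunks


if __name__ == "__main__":
    # Assumed tags for the first tokens of the example sentence.
    tokens = ["Equity", "and", "earnings", "of", "affiliates"]
    tags = ["B-LIABILITY", "O", "B-PROFIT", "O", "O"]
    for text, label in bio_to_chunks(tokens, tags):
        print(f"{text}\t{label}")
```

Dropping stray `I-` tags is one possible policy. A more lenient converter could instead treat them as the start of a new chunk.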