Commit 4bbc8d7

2024-01-17-hpo_resolver_pipeline_en (#883)
* Add model 2024-01-17-hpo_resolver_pipeline_en
* Add model 2024-01-17-ner_deid_name_multilingual_xx
* Update 2024-01-17-ner_deid_name_multilingual_xx.md

---------

Co-authored-by: bunyamin-polat <[email protected]>
Co-authored-by: Bünyamin Polat <[email protected]>
1 parent fd2bbed commit 4bbc8d7

2 files changed, +322 -0 lines changed
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
---
layout: model
title: Pipeline for Human Phenotype Ontology (HPO) Sentence Entity Resolver
author: John Snow Labs
name: hpo_resolver_pipeline
date: 2024-01-17
tags: [licensed, en, entity_resolution, clinical, pipeline, hpo]
task: [Entity Resolution, Pipeline Healthcare]
language: en
edition: Healthcare NLP 5.2.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline extracts human phenotype entities from clinical text and uses `sbiobert_base_cased_mli` Sentence BERT embeddings to map them to their corresponding Human Phenotype Ontology (HPO) codes. For each HPO code, it also returns the associated codes from the following vocabularies:

- MeSH (Medical Subject Headings)
- SNOMED
- UMLS (Unified Medical Language System)
- ORPHA (international reference resource for information on rare diseases and orphan drugs)
- OMIM (Online Mendelian Inheritance in Man)

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/hpo_resolver_pipeline_en_5.2.1_3.0_1705513067470.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/hpo_resolver_pipeline_en_5.2.1_3.0_1705513067470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
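```python
from sparknlp.pretrained import PretrainedPipeline

# Download the licensed pretrained pipeline from the clinical models repository.
ner_pipeline = PretrainedPipeline("hpo_resolver_pipeline", "en", "clinical/models")

# Run the whole pipeline on a single clinical note.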
result = ner_pipeline.annotate("""She is followed by Dr. X in our office and has a history of severe tricuspid regurgitation. On 05/12/08, preserved left and right ventricular systolic function, aortic sclerosis with apparent mild aortic stenosis. She has previously had a Persantine Myoview nuclear rest-stress test scan completed at ABCD Medical Center in 07/06 that was negative. She has had significant mitral valve regurgitation in the past being moderate, but on the most recent echocardiogram on 05/12/08, that was not felt to be significant. She does have a history of significant hypertension in the past. She has had dizzy spells and denies clearly any true syncope. She has had bradycardia in the past from beta-blocker therapy.""")
```
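```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Download the licensed pretrained pipeline from the clinical models repository.
val ner_pipeline = PretrainedPipeline("hpo_resolver_pipeline", "en", "clinical/models")

// Run the whole pipeline on a single clinical note.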
val result = ner_pipeline.annotate("""She is followed by Dr. X in our office and has a history of severe tricuspid regurgitation. On 05/12/08, preserved left and right ventricular systolic function, aortic sclerosis with apparent mild aortic stenosis. She has previously had a Persantine Myoview nuclear rest-stress test scan completed at ABCD Medical Center in 07/06 that was negative. She has had significant mitral valve regurgitation in the past being moderate, but on the most recent echocardiogram on 05/12/08, that was not felt to be significant. She does have a history of significant hypertension in the past. She has had dizzy spells and denies clearly any true syncope. She has had bradycardia in the past from beta-blocker therapy.""")
```
</div>

## Results

```bash
| | chunks | begin | end | entities | resolutions | description | all_codes |
|---:|:---------------------------|--------:|------:|:-----------|:--------------|:---------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | tricuspid regurgitation | 67 | 89 | HP | HP:0005180 | tricuspid regurgitation | MSH:D014262||SNOMED:111287006||UMLS:C0040961||ORPHA:228410:::MSH:D014264||SNOMED:49915006||UMLS:C0040963||ORPHA:391641:::MSH:D014263||SNOMED:253383003||UMLS:C0040962||ORPHA:1101:::UMLS:C4025753||ORPHA:1759:::UMLS:C4255215||ORPHA:1724::::::MSH:D018785||SNOMED:253455004,63042009||UMLS:C0243002||ORPHA:391641::::::::::::UMLS:C4023292||ORPHA:1880:::MSH:D008944||SNOMED:48724000||UMLS:C0026266,C3551535||ORPHA:363700:::MSH:D004437||SNOMED:204357006||UMLS:C0013481||ORPHA:466791:::MSH:D001022||SNOMED:60234000||UMLS:C0003504||ORPHA:2181::::::MSH:C562388||SNOMED:72352009||UMLS:C0149630||ORPHA:1772:::UMLS:C4023294||ORPHA:2255 |
| 1 | aortic stenosis | 197 | 211 | HP | HP:0001650 | aortic stenosis | MSH:D001024||SNOMED:60573004||UMLS:C0003507||ORPHA:536471:::MSH:D001020||SNOMED:204368006||UMLS:C0340375||ORPHA:1052:::MSH:D021921||SNOMED:268185002||UMLS:C0003499||ORPHA:391665:::UMLS:C1848978||ORPHA:3191:::UMLS:C3887554||OMIM:229310:::MSH:D023921||SNOMED:233970002||UMLS:C0242231||ORPHA:75565:::SNOMED:218728005||UMLS:C0152419||ORPHA:2255:::SNOMED:68109007||UMLS:C0038449||ORPHA:565:::MSH:D001022||SNOMED:60234000||UMLS:C0003504||ORPHA:2181:::MSH:D016893||SNOMED:64586002||UMLS:C0007282||ORPHA:536532:::SNOMED:54160000||UMLS:C2239253||ORPHA:1054:::MSH:D001014||SNOMED:67362008||UMLS:C0003486||ORPHA:1777:::SNOMED:81817003||UMLS:C0155733||ORPHA:412:::MSH:D017545||SNOMED:433068007||UMLS:C0162872||ORPHA:536467:::MSH:C562942||SNOMED:250978003||UMLS:C0428791||ORPHA:2072:::SNOMED:251036003||UMLS:C0238669||ORPHA:231160::: |
| 2 | mitral valve regurgitation | 373 | 398 | HP | HP:0001653 | mitral valve regurgitation | MSH:D008944||SNOMED:48724000||UMLS:C0026266,C3551535||ORPHA:363700:::MSH:D008946||SNOMED:79619009||UMLS:C0026269||ORPHA:2248:::UMLS:C4025759||ORPHA:1724:::MSH:D008945||SNOMED:409712001,8074002||UMLS:C0026267||ORPHA:536467:::::::::MSH:D001022||SNOMED:60234000||UMLS:C0003504||ORPHA:2181::::::MSH:D014262||SNOMED:111287006||UMLS:C0040961||ORPHA:228410:::SNOMED:473372009||UMLS:C0919718||ORPHA:363618::::::UMLS:C1835130||OMIM:154700:::SNOMED:23063005||UMLS:C0344760||ORPHA:2248::::::SNOMED:253402005||UMLS:C0344770:::UMLS:C4021142 |
| 3 | hypertension | 555 | 566 | HP | HP:0000822 | hypertension | MSH:D006973||SNOMED:24184005,38341003||UMLS:C0020538,C0497247||ORPHA:231160:::SNOMED:112222000||UMLS:C0234708||ORPHA:280921:::UMLS:C1857175||OMIM:171300:::SNOMED:706882009||UMLS:C0020546||ORPHA:94093:::MSH:D006975||SNOMED:34742003||UMLS:C0020541||ORPHA:228426:::MSH:D006976,D065627||SNOMED:11399002,697898008,70995007||UMLS:C0020542,C2973725,C3203102||ORPHA:79282:::MSH:D006978||SNOMED:123799005||UMLS:C0020545||ORPHA:3472:::MSH:D019586||SNOMED:271719001||UMLS:C0151740||ORPHA:247525:::MSH:D006983||SNOMED:271607001,29966009||UMLS:C0020555||ORPHA:79277:::SNOMED:288250001||UMLS:C0565599||ORPHA:439167:::UMLS:C3277940||ORPHA:93400:::MSH:D006937||SNOMED:13644009,166830008||UMLS:C0020443,C0595929||ORPHA:79237:::UMLS:C1846345,C3150267||ORPHA:89938:::UMLS:C1504382||OMIM:178600:::UMLS:C2265792||ORPHA:99736:::SNOMED:253920006||UMLS:C0431810||ORPHA:2346:::MSH:D058437||SNOMED:6962006||UMLS:C0152132||ORPHA:94080:::ORPHA:79259 |
| 4 | bradycardia | 655 | 665 | HP | HP:0001662 | bradycardia | MSH:D001919||SNOMED:48867003||UMLS:C0428977||ORPHA:330001:::SNOMED:49710005||UMLS:C0085610||ORPHA:439232:::MSH:D018476||SNOMED:399317006||UMLS:C0233565||ORPHA:33069::::::MSH:D007007||SNOMED:45004005||UMLS:C0020620||ORPHA:69085:::ORPHA:2388::::::::::::MSH:D007022||SNOMED:45007003||UMLS:C0020649||ORPHA:556030:::MSH:D012021||SNOMED:103254005||UMLS:C0151572||ORPHA:280071:::MSH:D007024||SNOMED:28651003||UMLS:C0020651||ORPHA:556030:::SNOMED:405946002||UMLS:C0700078||ORPHA:370079:::MSH:D018476||SNOMED:255385008,43994002||UMLS:C0086439||ORPHA:280071:::ORPHA:300373:::SNOMED:3006004||UMLS:C0234428||ORPHA:29822 |
```
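
The `all_codes` column packs several vocabularies into one string. Judging from the sample output above, `:::` appears to separate resolution candidates and `||` separates the vocabulary entries within a candidate, while an empty candidate carries no mapped codes. A minimal Python sketch under that assumption (the helper name `parse_all_codes` is hypothetical and not part of Healthcare NLP):

```python
from typing import Dict, List


def parse_all_codes(all_codes: str) -> List[Dict[str, List[str]]]:
    """Split an `all_codes` string into one {vocabulary: [codes]} dict per candidate.

    Assumes ':::' separates candidates and '||' separates vocabulary entries,
    as the sample output suggests.
    """
    candidates = []
    for candidate in all_codes.strip().split(":::"):
        vocab_codes: Dict[str, List[str]] = {}
        for entry in candidate.split("||"):
            if not entry:
                continue  # candidate without any mapped codes
            vocab, _, codes = entry.partition(":")
            vocab_codes[vocab] = codes.split(",")
        candidates.append(vocab_codes)
    return candidates


# Abridged from the first result row (tricuspid regurgitation);
# several candidates were dropped to keep the example short.
sample = (
    "MSH:D014262||SNOMED:111287006||UMLS:C0040961||ORPHA:228410"
    ":::UMLS:C4025753||ORPHA:1759"
    "::::::MSH:D018785||SNOMED:253455004,63042009||UMLS:C0243002||ORPHA:391641"
)

parsed = parse_all_codes(sample)
print(parsed[0]["SNOMED"])  # ['111287006']
print(parsed[2])            # {}
print(parsed[3]["SNOMED"])  # ['253455004', '63042009']
```

Keeping empty candidates as empty dicts preserves their position in the list, which matters if the positions are later aligned with other per-candidate outputs.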

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|hpo_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 5.2.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|2.2 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
---
layout: model
title: Detect Name for Deidentification (multilingual)
author: John Snow Labs
name: ner_deid_name_multilingual
date: 2024-01-17
tags: [en, ner, licensed, multilingual, name, xx]
task: Named Entity Recognition
language: xx
edition: Healthcare NLP 5.2.1
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Deidentification NER is a Named Entity Recognition model that annotates English, German, French, Italian, Spanish, Portuguese, and Romanian text to find protected health information (PHI) that may need to be de-identified. It was trained on in-house annotated datasets and detects `NAME` entities.

## Predicted Entities

`NAME`

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_name_multilingual_xx_5.2.1_3.0_1705512521448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_name_multilingual_xx_5.2.1_3.0_1705512521448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, XlmRoBertaEmbeddings, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes `spark` is an active Spark session with Healthcare NLP loaded.

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(False)

ner = MedicalNerModel.pretrained("ner_deid_name_multilingual", "xx", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level NER tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler,
                               sentence_detector,
                               tokenizer,
                               embeddings,
                               ner,
                               ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the model.
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

# One sample per supported language: English, French, German, Italian, Spanish, Portuguese, Romanian.
text_list = [
    """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",

    """J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",

    """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",

    """Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""",

    """Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""",

    """Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos""",

    """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
]

data = spark.createDataFrame(pd.DataFrame({"text": text_list}))

result = model.transform(data)
```
```scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes `spark` is an active session

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
    .setCaseSensitive(false)

val ner = MedicalNerModel.pretrained("ner_deid_name_multilingual", "xx", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

// Merge token-level NER tags into entity chunks.
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner,
    ner_converter
))

// One sample per supported language: English, French, German, Italian, Spanish, Portuguese, Romanian.
val text_list = Seq(
    """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",

    """J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",

    """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",

    """Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""",

    """Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""",

    """Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos""",

    """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
)

// text_list is already a Seq[String], so convert it directly to a one-column DataFrame.
val data = text_list.toDS.toDF("text")

// Fit the pipeline defined above, then annotate the samples.
val result = nlpPipeline.fit(data).transform(data)
```
</div>

## Results

```bash
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|David Hale             |NAME     |
|Hendrickson, Ora       |NAME     |
|Oliveira               |NAME     |
|Michel Martinez        |NAME     |
|Michael Berger         |NAME     |
|Berger                 |NAME     |
|Gastone Montanariello  |NAME     |
|Antonio Miguel Martínez|NAME     |
|Pedro Gonçalves        |NAME     |
|Maria Santos           |NAME     |
|BUREAN MARIA           |NAME     |
|Agota Evelyn Tımar     |NAME     |
+-----------------------+---------+
```
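
Once `NAME` chunks are detected, a typical next step in de-identification is to mask them in the source text. The following is a minimal sketch in plain Python, not the Healthcare NLP de-identification annotator. It assumes each chunk comes with `begin` and `end` character offsets where `end` is inclusive, which is the Spark NLP annotation convention. The helper `mask_spans` and the `<NAME>` placeholder are hypothetical.

```python
from typing import List, Tuple


def mask_spans(text: str, spans: List[Tuple[int, int]], placeholder: str = "<NAME>") -> str:
    """Replace each (begin, end) span with a placeholder; `end` is inclusive.

    Assumes the spans do not overlap.
    """
    masked = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for begin, end in sorted(spans, reverse=True):
        masked = masked[:begin] + placeholder + masked[end + 1:]
    return masked


# Second sentence of the German sample; "Berger" occupies characters 5 to 10.
sentence = "Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."
print(mask_spans(sentence, [(5, 10)]))
# Herr <NAME> ist 76 Jahre alt und hat zu viel Wasser in den Beinen.
```

Replacing from the last span backwards avoids recomputing offsets when the placeholder and the original name differ in length.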

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_name_multilingual|
|Compatibility:|Healthcare NLP 5.2.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|16.2 MB|

## References

It was trained with in-house annotated datasets.

## Benchmarking

```bash
       label  precision  recall  f1-score  support
        NAME       0.95    0.95      0.95     5068
   micro-avg       0.95    0.95      0.95     5068
   macro-avg       0.95    0.95      0.95     5068
weighted-avg       0.95    0.95      0.95     5068
```
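
The F1 score is the harmonic mean of precision and recall, so the reported 0.95 follows directly from the precision and recall columns. With a single label, the micro, macro, and weighted averages all reduce to the `NAME` row. A quick check in Python (the function name `f1_score` is only for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the NAME row of the benchmark table.
print(round(f1_score(0.95, 0.95), 2))  # 0.95
```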
