| 1 | +--- |
| 2 | +layout: model |
| 3 | +title: Detect Name for Deidentification (multilingual) |
| 4 | +author: John Snow Labs |
| 5 | +name: ner_deid_name_multilingual |
| 6 | +date: 2024-01-17 |
| 7 | +tags: [en, ner, licensed, multilingual, name, xx] |
| 8 | +task: Named Entity Recognition |
| 9 | +language: xx |
| 10 | +edition: Healthcare NLP 5.2.1 |
| 11 | +spark_version: 3.0 |
| 12 | +supported: true |
| 13 | +annotator: MedicalNerModel |
| 14 | +article_header: |
| 15 | + type: cover |
| 16 | +use_language_switcher: "Python-Scala-Java" |
| 17 | +--- |
| 18 | + |
| 19 | +## Description |
| 20 | + |
| 21 | +This Named Entity Recognition (NER) model annotates English, German, French, Italian, Spanish, Portuguese, and Romanian text to find protected health information (PHI) that may need to be de-identified. It was trained on in-house annotated datasets and detects `NAME` entities. |
| 22 | + |
| 23 | +## Predicted Entities |
| 24 | + |
| 25 | +`NAME` |
| 26 | + |
| 27 | +{:.btn-box} |
| 28 | +<button class="button button-orange" disabled>Live Demo</button> |
| 29 | +<button class="button button-orange" disabled>Open in Colab</button> |
| 30 | +[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_name_multilingual_xx_5.2.1_3.0_1705512521448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} |
| 31 | +[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_name_multilingual_xx_5.2.1_3.0_1705512521448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} |
| 32 | + |
| 33 | +## How to use |
| 34 | + |
| 35 | + |
| 36 | + |
| 37 | +<div class="tabs-box" markdown="1"> |
| 38 | +{% include programmingLanguageSelectScalaPythonNLU.html %} |
| 39 | + |
| 40 | +```python |
| 41 | +document_assembler = DocumentAssembler()\ |
| 42 | + .setInputCol("text")\ |
| 43 | + .setOutputCol("document") |
| 44 | + |
| 45 | +sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ |
| 46 | + .setInputCols(["document"])\ |
| 47 | + .setOutputCol("sentence") |
| 48 | + |
| 49 | +tokenizer = Tokenizer() \ |
| 50 | + .setInputCols(["sentence"]) \ |
| 51 | + .setOutputCol("token") |
| 52 | + |
| 53 | +embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \ |
| 54 | + .setInputCols("sentence", "token") \ |
| 55 | + .setOutputCol("embeddings")\ |
| 56 | + .setMaxSentenceLength(512)\ |
| 57 | + .setCaseSensitive(False) |
| 58 | + |
| 59 | +ner = MedicalNerModel.pretrained("ner_deid_name_multilingual", "xx", "clinical/models") \ |
| 60 | + .setInputCols(["sentence", "token", "embeddings"]) \ |
| 61 | + .setOutputCol("ner") |
| 62 | + |
| 63 | +ner_converter = NerConverter() \ |
| 64 | + .setInputCols(["sentence", "token", "ner"]) \ |
| 65 | + .setOutputCol("ner_chunk") |
| 66 | + |
| 67 | +nlpPipeline = Pipeline(stages=[document_assembler, |
| 68 | + sentence_detector, |
| 69 | + tokenizer, |
| 70 | + embeddings, |
| 71 | + ner, |
| 72 | + ner_converter]) |
| 73 | + |
| 74 | +empty_data = spark.createDataFrame([[""]]).toDF("text") |
| 75 | + |
| 76 | +model = nlpPipeline.fit(empty_data) |
| 77 | + |
| 78 | +text_list = ["""Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""", |
| 79 | + |
| 80 | +"""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""", |
| 81 | + |
| 82 | +"""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""", |
| 83 | + |
| 84 | +"""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""", |
| 85 | + |
| 86 | +"""Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""", |
| 87 | + |
| 88 | +"""Detalhes do paciente. |
| 89 | +Nome do paciente: Pedro Gonçalves |
| 90 | +NHC: 2569870. |
| 91 | +Endereço: Rua Das Flores 23. |
| 92 | +Cidade/ Província: Porto. |
| 93 | +Código Postal: 21754-987. |
| 94 | +Dados de cuidados. |
| 95 | +Data de nascimento: 10/10/1963. |
| 96 | +Idade: 53 anos Sexo: Homen |
| 97 | +Data de admissão: 17/06/2016. |
| 98 | +Doutora: Maria Santos""", |
| 99 | + |
| 100 | +"""Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România |
| 101 | +Tel: +40(235)413773 |
| 102 | +Data setului de analize: 25 May 2022 15:36:00 |
| 103 | +Nume si Prenume : BUREAN MARIA, Varsta: 77 |
| 104 | +Medic : Agota Evelyn Tımar |
| 105 | +C.N.P : 2450502264401""" |
| 106 | +] |
| 107 | + |
| 108 | +data = spark.createDataFrame([[text] for text in text_list]).toDF("text") |
| 109 | + |
| 110 | +result = model.transform(data) |
| 111 | +``` |
| 112 | +```scala |
| 113 | +val document_assembler = new DocumentAssembler() |
| 114 | + .setInputCol("text") |
| 115 | + .setOutputCol("document") |
| 116 | + |
| 117 | +val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") |
| 118 | + .setInputCols(Array("document")) |
| 119 | + .setOutputCol("sentence") |
| 120 | + |
| 121 | +val tokenizer = new Tokenizer() |
| 122 | + .setInputCols(Array("sentence")) |
| 123 | + .setOutputCol("token") |
| 124 | + |
| 125 | +val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") |
| 126 | + .setInputCols(Array("sentence", "token")) |
| 127 | + .setOutputCol("embeddings") |
| 128 | + .setMaxSentenceLength(512) |
| 129 | + .setCaseSensitive(false) |
| 130 | + |
| 131 | +val ner = MedicalNerModel.pretrained("ner_deid_name_multilingual", "xx", "clinical/models") |
| 132 | + .setInputCols(Array("sentence", "token", "embeddings")) |
| 133 | + .setOutputCol("ner") |
| 134 | + |
| 135 | +val ner_converter = new NerConverter() |
| 136 | + .setInputCols(Array("sentence", "token", "ner")) |
| 137 | + .setOutputCol("ner_chunk") |
| 138 | + |
| 139 | +val nlpPipeline = new Pipeline().setStages(Array( |
| 140 | + document_assembler, |
| 141 | + sentence_detector, |
| 142 | + tokenizer, |
| 143 | + embeddings, |
| 144 | + ner, |
| 145 | + ner_converter |
| 146 | +)) |
| 147 | + |
| 148 | +val text_list = Seq( |
| 149 | + """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""", |
| 150 | + |
| 151 | +"""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""", |
| 152 | + |
| 153 | +"""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""", |
| 154 | + |
| 155 | +"""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""", |
| 156 | + |
| 157 | +"""Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""", |
| 158 | + |
| 159 | +"""Detalhes do paciente. |
| 160 | +Nome do paciente: Pedro Gonçalves |
| 161 | +NHC: 2569870. |
| 162 | +Endereço: Rua Das Flores 23. |
| 163 | +Cidade/ Província: Porto. |
| 164 | +Código Postal: 21754-987. |
| 165 | +Dados de cuidados. |
| 166 | +Data de nascimento: 10/10/1963. |
| 167 | +Idade: 53 anos Sexo: Homen |
| 168 | +Data de admissão: 17/06/2016. |
| 169 | +Doutora: Maria Santos""", |
| 170 | + |
| 171 | +"""Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România |
| 172 | +Tel: +40(235)413773 |
| 173 | +Data setului de analize: 25 May 2022 15:36:00 |
| 174 | +Nume si Prenume : BUREAN MARIA, Varsta: 77 |
| 175 | +Medic : Agota Evelyn Tımar |
| 176 | +C.N.P : 2450502264401""" |
| 177 | +) |
| 178 | + |
| 179 | +val data = text_list.toDS.toDF("text") |
| 180 | + |
| 181 | +val result = nlpPipeline.fit(data).transform(data) |
| 182 | +``` |
| 183 | +</div> |
| 184 | + |
| 185 | +## Results |
| 186 | + |
| 187 | +```bash |
| 188 | ++-----------------------+---------+ |
| 189 | +|chunk |ner_label| |
| 190 | ++-----------------------+---------+ |
| 191 | +|David Hale |NAME | |
| 192 | +|Hendrickson, Ora |NAME | |
| 193 | +|Oliveira |NAME | |
| 194 | +|Michel Martinez |NAME | |
| 195 | +|Michael Berger |NAME | |
| 196 | +|Berger |NAME | |
| 197 | +|Gastone Montanariello |NAME | |
| 198 | +|Antonio Miguel Martínez|NAME | |
| 199 | +|Pedro Gonçalves |NAME | |
| 200 | +|Maria Santos |NAME | |
| 201 | +|BUREAN MARIA |NAME | |
| 202 | +|Agota Evelyn Tımar |NAME | |
| 203 | ++-----------------------+---------+ |
| 204 | +``` |
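
The snippet above stops at `model.transform(data)` and does not show how the `chunk` / `ner_label` table is derived or how the detected names might be masked. The following is a minimal sketch in plain Python, not the library's own de-identification method. It assumes each annotation exposes `begin`, `end`, `result`, and a `metadata` mapping with an `entity` key, and that `end` is an inclusive character offset; the helper names `chunks_to_rows` and `mask_chunks` and the sample annotation are hypothetical and written by hand for illustration.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def chunks_to_rows(annotations: Iterable[Mapping[str, Any]]) -> List[Tuple[str, str]]:
    """Flatten chunk annotations into (chunk, ner_label) pairs."""
    return [(ann["result"], ann["metadata"]["entity"]) for ann in annotations]


def mask_chunks(text: str, annotations: Iterable[Mapping[str, Any]]) -> str:
    """Replace each annotated span with <LABEL>.

    Spans are processed right to left so that earlier offsets stay valid.
    Assumption: `end` is the inclusive index of the last character.
    """
    masked = text
    for ann in sorted(annotations, key=lambda a: a["begin"], reverse=True):
        label = ann["metadata"]["entity"]
        masked = masked[: ann["begin"]] + f"<{label}>" + masked[ann["end"] + 1 :]
    return masked


# Hand-written stand-in for the chunks of one document (illustrative only).
text = "Doutora: Maria Santos"
sample = [
    {"begin": 9, "end": 20, "result": "Maria Santos", "metadata": {"entity": "NAME"}}
]

print(chunks_to_rows(sample))
print(mask_chunks(text, sample))
```

With a live pipeline, the same helpers could presumably be applied to each row's `ner_chunk` array after collecting `result.select("text", "ner_chunk")`, since collected rows support key-style access to their fields.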
| 205 | + |
| 206 | +{:.model-param} |
| 207 | +## Model Information |
| 208 | + |
| 209 | +{:.table-model} |
| 210 | +|---|---| |
| 211 | +|Model Name:|ner_deid_name_multilingual| |
| 212 | +|Compatibility:|Healthcare NLP 5.2.1+| |
| 213 | +|License:|Licensed| |
| 214 | +|Edition:|Official| |
| 215 | +|Input Labels:|[sentence, token, embeddings]| |
| 216 | +|Output Labels:|[ner]| |
| 217 | +|Language:|xx| |
| 218 | +|Size:|16.2 MB| |
| 219 | + |
| 220 | +## References |
| 221 | + |
| 222 | +This model was trained on in-house annotated datasets. |
| 223 | + |
| 224 | +## Benchmarking |
| 225 | + |
| 226 | +```bash |
| 227 | +label precision recall f1-score support |
| 228 | +NAME 0.95 0.95 0.95 5068 |
| 229 | +micro-avg 0.95 0.95 0.95 5068 |
| 230 | +macro-avg 0.95 0.95 0.95 5068 |
| 231 | +weighted-avg 0.95 0.95 0.95 5068 |
| 232 | +``` |
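
As a quick check on the table above, the F1 column can be recomputed from the reported precision and recall. The sketch below uses the standard harmonic-mean definition of F1; the function name `f1_score` is an illustrative choice, and the inputs are the rounded values from the table rather than raw counts. Because `NAME` is the only label, the micro, macro, and weighted averages all reduce to that single row.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the NAME row of the benchmark table.
name_f1 = f1_score(0.95, 0.95)

# With a single label, the macro average is just that label's score.
per_label_f1 = [name_f1]
macro_f1 = sum(per_label_f1) / len(per_label_f1)

print(round(name_f1, 2), round(macro_f1, 2))
```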