Hi @adbar, thanks for this awesome library.
While porting this library to Go, I noticed there are two Mediacloud tests that might be wrong:
"https://www.baltimoresun.com/opinion/columnists/zurawik/bs-ed-zontv-media-year-20201223-cnvrlhkhnrbihcxx6wxcxt2b7y-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {
"file": "1805697156.html",
"date": "2020-12-23"
},
"https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/": {
"file": "1806793639.html",
"date": "2020-12-25"
},
For baltimoresun, its JSON-LD contains the following snippet:
{
// ... omitted
"articleSection": "zurawik",
"dateCreated": "2020-12-22T01:06:41.361Z",
"datePublished": "2020-12-23T15:42:33.814Z",
"dateModified": "2020-12-23T15:42:34.197Z",
// ... omitted
}
From that snippet we can see that its creation date is 2020-12-22. Since we want the original date, I think we should use that one instead of 2020-12-23?
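To illustrate the suggestion, here is a minimal sketch that picks the earliest of the three JSON-LD date fields. The helper names (`parse_iso`, `earliest_date`) and the idea of taking the minimum are assumptions made for this example; this is not how the library currently chooses the date.

```python
from datetime import datetime
from typing import Optional

# Hypothetical helper for illustration only; not part of the library's API.
DATE_KEYS = ("dateCreated", "datePublished", "dateModified")


def parse_iso(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def earliest_date(json_ld: dict) -> Optional[str]:
    # Collect every date field that is present and return the earliest day.
    found = [parse_iso(json_ld[key]) for key in DATE_KEYS if key in json_ld]
    if not found:
        return None
    return min(found).date().isoformat()


# Values copied from the baltimoresun snippet above.
baltimoresun = {
    "dateCreated": "2020-12-22T01:06:41.361Z",
    "datePublished": "2020-12-23T15:42:33.814Z",
    "dateModified": "2020-12-23T15:42:34.197Z",
}

print(earliest_date(baltimoresun))  # 2020-12-22
```

With this rule the expected value for 1805697156.html would be 2020-12-22 rather than 2020-12-23.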
For elbaladtv.net, its JSON-LD contains the following snippet:
{
"@type": "WebPage",
"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#webpage",
"url": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/",
"name": "\u062a\u0631\u0643\u0649 \u0622\u0644 \u0627\u0644\u0634\u064a\u062e \u0628\u0639\u062f \u0625\u0635\u0627\u0628\u0629 \u064a\u0633\u0631\u0627 \u0628\u0643\u0648\u0631\u0648\u0646\u0627: \u064a\u0627\u0631\u0628 \u064a\u0631\u0641\u0639 \u0639\u0646\u0643 - \u0642\u0646\u0627\u0629 \u0635\u062f\u0649 \u0627\u0644\u0628\u0644\u062f",
"datePublished": "2020-12-25T01:59:50+02:00",
"dateModified": "2020-12-25T01:59:50+02:00",
"isPartOf": { "@id": "https://elbaladtv.net/#website" },
"primaryImageOfPage": {
"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#primaryImage"
},
"inLanguage": "ar"
}
It also contains the following meta tag:
<meta property="article:published_time" content="2020-12-24T23:59:50+00:00">
From those two, we can see that the published times in the JSON-LD and the meta tag refer to the same instant; the only difference is that the former is expressed in UTC+2 while the latter is in UTC+0.
So, for the extraction result, I think we should use 2020-12-24, since it is based on UTC time instead of local time.
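The comparison can be checked with a short sketch using only the standard library. The function name `to_utc` is an assumption made for this example, and normalizing to UTC is the behavior proposed here, not necessarily what the library does.

```python
from datetime import datetime, timezone

# Values copied from the JSON-LD snippet and the meta tag above.
json_ld_published = "2020-12-25T01:59:50+02:00"
meta_published = "2020-12-24T23:59:50+00:00"


def to_utc(value: str) -> datetime:
    # Parse an ISO 8601 timestamp with offset and convert it to UTC.
    return datetime.fromisoformat(value).astimezone(timezone.utc)


json_ld_utc = to_utc(json_ld_published)
meta_utc = to_utc(meta_published)

# Both sources describe the same instant.
print(json_ld_utc == meta_utc)  # True

# The calendar day depends on which offset is used.
local_day = datetime.fromisoformat(json_ld_published).date().isoformat()
utc_day = json_ld_utc.date().isoformat()
print(local_day)  # 2020-12-25
print(utc_day)    # 2020-12-24
```

So the current expected value (2020-12-25) matches the local-time reading, while a UTC-normalized extraction would give 2020-12-24.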