Possibly wrong Mediacloud test data? #154

@RadhiFadlillah

Hi @adbar, thanks for this awesome library.

While porting this library to Go, I noticed there are two Mediacloud tests that might be wrong:

"https://www.baltimoresun.com/opinion/columnists/zurawik/bs-ed-zontv-media-year-20201223-cnvrlhkhnrbihcxx6wxcxt2b7y-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {
	"file": "1805697156.html",
	"date": "2020-12-23"
},
"https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/": {
	"file": "1806793639.html",
	"date": "2020-12-25"
},

For baltimoresun, the JSON-LD contains the following snippet:

{
	// ... omitted
	"articleSection": "zurawik",
	"dateCreated": "2020-12-22T01:06:41.361Z",
	"datePublished": "2020-12-23T15:42:33.814Z",
	"dateModified": "2020-12-23T15:42:34.197Z",
	// ... omitted
}

That snippet shows a creation date of 2020-12-22. Since we want the original date, shouldn't we use that one instead of 2020-12-23?
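To illustrate the question, here is a minimal Python sketch that picks the earliest of `dateCreated` and `datePublished` from the fields quoted above. The helper names `parse_iso` and `earliest_date` are hypothetical and are not part of the library; this only shows what "use the original date" would yield for this page, not how the library actually selects a date.

```python
from datetime import datetime

# Fields copied from the baltimoresun JSON-LD snippet quoted in this issue.
json_ld = {
    "dateCreated": "2020-12-22T01:06:41.361Z",
    "datePublished": "2020-12-23T15:42:33.814Z",
    "dateModified": "2020-12-23T15:42:34.197Z",
}

def parse_iso(value):
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat() also accepts it on Python < 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def earliest_date(data, keys=("dateCreated", "datePublished")):
    # Hypothetical helper: earliest timestamp among the given keys,
    # returned as a YYYY-MM-DD string (or None if no key is present).
    candidates = [parse_iso(data[k]) for k in keys if k in data]
    return min(candidates).date().isoformat() if candidates else None

print(earliest_date(json_ld))  # 2020-12-22
```

With only `datePublished` considered, the same helper returns 2020-12-23, which matches the current test data.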


For elbaladtv.net, the JSON-LD contains the following snippet:

{
	"@type": "WebPage",
	"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#webpage",
	"url": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/",
	"name": "\u062a\u0631\u0643\u0649 \u0622\u0644 \u0627\u0644\u0634\u064a\u062e \u0628\u0639\u062f \u0625\u0635\u0627\u0628\u0629 \u064a\u0633\u0631\u0627 \u0628\u0643\u0648\u0631\u0648\u0646\u0627: \u064a\u0627\u0631\u0628 \u064a\u0631\u0641\u0639 \u0639\u0646\u0643 - \u0642\u0646\u0627\u0629 \u0635\u062f\u0649 \u0627\u0644\u0628\u0644\u062f",
	"datePublished": "2020-12-25T01:59:50+02:00",
	"dateModified": "2020-12-25T01:59:50+02:00",
	"isPartOf": { "@id": "https://elbaladtv.net/#website" },
	"primaryImageOfPage": {
		"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#primaryImage"
	},
	"inLanguage": "ar"
}

It also contains the following meta tag:

<meta property="article:published_time" content="2020-12-24T23:59:50+00:00">

Comparing the two, the published time in the JSON-LD and in the meta tag is actually the same instant; the former is expressed in UTC+2 while the latter is in UTC+0.

So for the extraction result, I think we should use 2020-12-24, since it is based on UTC rather than local time.
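A quick check with Python's standard `datetime` module, using the two timestamps quoted above, confirms that they denote the same instant and that only the calendar date differs depending on the time zone. The variable names are just for illustration.

```python
from datetime import datetime, timezone

# Timestamps copied from the JSON-LD snippet and the meta tag quoted in this issue.
json_ld_published = datetime.fromisoformat("2020-12-25T01:59:50+02:00")
meta_published = datetime.fromisoformat("2020-12-24T23:59:50+00:00")

# Timezone-aware datetimes compare by the instant they represent.
same_instant = json_ld_published == meta_published

# Calendar date as written in the page's local offset (UTC+2) ...
local_date = json_ld_published.date().isoformat()
# ... versus the calendar date after normalising to UTC.
utc_date = json_ld_published.astimezone(timezone.utc).date().isoformat()

print(same_instant, local_date, utc_date)  # True 2020-12-25 2020-12-24
```

Whether the expected value should be the local date or the UTC date is the open question here; the sketch only shows that both readings come from the same timestamp.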
