Allow storage of arbitrary state data with a URL #14

@mna


When a page is crawled, some data is extracted. Sometimes the complete information about a given item is split across several pages. It may be necessary to store some state while crawling a page so that this state is available when a "child" page is crawled.

For example, the page /author is crawled and information about the author is saved in a DB under an ID. The URL /author/book1 is then enqueued, but if that page is crawled statelessly, it has no way to link its information back to the previously crawled author. (It could find the author's name on the book page, but let's pretend it's not there; even if it were, several authors might share the same name, there might be a typo, etc.)

Not sure yet whether this should be managed by gocrawl. Should seed URLs also be allowed to have state? How much of a pain will it be to implement, and how much will it complicate the API?
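
To make the /author example concrete, here is a minimal standalone sketch of what "state attached to a URL" could look like. It does not use gocrawl's API: `Item`, `AuthorState`, `visit`, and `crawl` are hypothetical names invented for illustration, and the author ID 42 is a placeholder for whatever the DB would return.

```go
package main

import "fmt"

// Item pairs a URL with arbitrary caller-defined state.
// Hypothetical type for illustration, not part of gocrawl.
type Item struct {
	URL   string
	State interface{}
}

// AuthorState is an example of state a parent page hands to its children.
type AuthorState struct {
	AuthorID int
}

// visit simulates crawling one page and returns the child items to enqueue.
func visit(it Item, saved map[string]int) []Item {
	switch it.URL {
	case "/author":
		// Pretend the author was saved in a DB and assigned ID 42.
		const authorID = 42
		return []Item{{URL: "/author/book1", State: AuthorState{AuthorID: authorID}}}
	case "/author/book1":
		// The child recovers the author's ID from the state, not from the page.
		if st, ok := it.State.(AuthorState); ok {
			saved[it.URL] = st.AuthorID
		}
	}
	return nil
}

// crawl processes the queue in FIFO order, starting from the seeds, and
// returns a map from book URL to the author ID it was linked to.
func crawl(seeds []Item) map[string]int {
	saved := make(map[string]int)
	queue := append([]Item(nil), seeds...)
	for len(queue) > 0 {
		it := queue[0]
		queue = queue[1:]
		queue = append(queue, visit(it, saved)...)
	}
	return saved
}

func main() {
	// The seed carries nil state here; it could just as well carry a value.
	books := crawl([]Item{{URL: "/author"}})
	fmt.Println(books["/author/book1"]) // prints 42
}
```

In this sketch the state travels with the queued item, so seeds and child URLs are handled the same way and the crawler never needs to interpret the state itself. Whether gocrawl should own this is still the open question.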
