Feature request: Cheerio selector #704

@magick93

Currently the Cheerio document loader is hardcoded to extract the text of the entire page, using $("body"). Typically, however, a page has a main content area that matters and surrounding elements (navigation, sidebars, footers) that do not.

https://github.com/hwchase17/langchainjs/blob/main/langchain/src/document_loaders/cheerio_web_base.ts#L50-L55

  async load(): Promise<Document[]> {
    const $ = await this.scrape();
    const text = $("body").text();
    const metadata = { source: this.webPath };
    return [new Document({ pageContent: text, metadata })];
  }

It would be good if it were possible to pass in an optional jQuery-style selector to target the exact content.
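
As a rough sketch of what I have in mind (not a patch against the actual loader): the extraction step could take a selector that defaults to "body", so existing callers see no change. The names `buildDocuments`, `QueryFn`, `PageDocument` and `stubQuery` are made up for illustration, and `QueryFn` is only a local stand-in for the part of the Cheerio API used here, `$(selector).text()`, so the snippet runs without the cheerio package. I am assuming the object returned by `this.scrape()` could be passed in its place.

```typescript
// Local stand-in for the slice of the Cheerio API this sketch needs:
// calling $(selector) and reading .text() from the result.
interface Selection {
  text(): string;
}
type QueryFn = (selector: string) => Selection;

// Simplified stand-in for the loader's Document type (hypothetical name).
interface PageDocument {
  pageContent: string;
  metadata: { source: string };
}

// Proposed behaviour: extract text from `selector` instead of always "body".
// The default keeps the current whole-page behaviour for existing callers.
function buildDocuments(
  $: QueryFn,
  source: string,
  selector: string = "body"
): PageDocument[] {
  const text = $(selector).text();
  return [{ pageContent: text, metadata: { source } }];
}

// Stub page used only to demonstrate the function without fetching anything.
const stubPage: Record<string, string> = {
  body: "Nav Main article text Footer",
  "#main": "Main article text",
};
const stubQuery: QueryFn = (selector) => ({
  text: () => stubPage[selector] ?? "",
});

// Whole page (current behaviour) versus only the main content area.
const wholePage = buildDocuments(stubQuery, "https://example.com");
const mainOnly = buildDocuments(stubQuery, "https://example.com", "#main");
console.log(wholePage[0].pageContent);
console.log(mainOnly[0].pageContent);
```

In the loader itself this would presumably become an optional constructor argument that load() passes through to the $(...) call, but how to expose it is up to the maintainers.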

Labels: help wanted (This would make a good PR)