
Commit c7d061e

feat: Add ScrapyWebReader Integration (#20212)
1 parent 0def5d2 commit c7d061e

11 files changed: +795, -4 lines
Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 ::: llama_index.readers.web
     options:
-        members: - AgentQLWebReader - AsyncWebPageReader - BeautifulSoupWebReader - BrowserbaseWebReader - FireCrawlWebReader - HyperbrowserWebReader - KnowledgeBaseWebReader - MainContentExtractorReader - NewsArticleReader - OlostepWebReader - OxylabsWebReader - ReadabilityWebPageReader - RssNewsReader - RssReader - ScrapflyReader - SimpleWebPageReader - SitemapReader - SpiderReader - TrafilaturaWebReader - UnstructuredURLLoader - WholeSiteReader - ZenRowsWebReader
+        members: - AgentQLWebReader - AsyncWebPageReader - BeautifulSoupWebReader - BrowserbaseWebReader - FireCrawlWebReader - HyperbrowserWebReader - KnowledgeBaseWebReader - MainContentExtractorReader - NewsArticleReader - OlostepWebReader - OxylabsWebReader - ReadabilityWebPageReader - RssNewsReader - RssReader - ScrapflyReader - ScrapyWebReader - SimpleWebPageReader - SitemapReader - SpiderReader - TrafilaturaWebReader - UnstructuredURLLoader - WholeSiteReader - ZenRowsWebReader

docs/examples/data_connectors/WebPageDemo.ipynb

Lines changed: 128 additions & 0 deletions
@@ -1328,6 +1328,134 @@
     "\n",
     "print(response)"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "07117c04",
+   "metadata": {},
+   "source": [
+    "# Using Scrapy Web Reader 🕸️"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "22fd0310",
+   "metadata": {},
+   "source": [
+    "Scrapy is a popular web crawling framework for Python. The ScrapyWebReader allows you to leverage Scrapy's powerful crawling capabilities to extract data from websites. It can be used in two ways:\n",
+    "\n",
+    "1. By providing a Scrapy spider class.\n",
+    "2. By providing the path to a Scrapy project."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0462b632",
+   "metadata": {},
+   "source": [
+    "### 1. Using with Scrapy Spider Class"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "25da4f69",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scrapy.spiders import Spider\n",
+    "from llama_index.readers.web import ScrapyWebReader\n",
+    "\n",
+    "\n",
+    "class SampleSpider(Spider):\n",
+    "    name = \"sample_spider\"\n",
+    "    start_urls = [\"http://quotes.toscrape.com\"]\n",
+    "\n",
+    "    def parse(self, response):\n",
+    "        ...\n",
+    "\n",
+    "\n",
+    "reader = ScrapyWebReader()\n",
+    "docs = reader.load_data(SampleSpider)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e99c6e02",
+   "metadata": {},
+   "source": [
+    "### 2. Using with Scrapy Project Path"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1110e52e",
+   "metadata": {},
+   "source": [
+    "Downloading a sample Scrapy project:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "40060d02",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!git clone https://github.com/scrapy/quotesbot.git"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "91d304d4",
+   "metadata": {},
+   "source": [
+    "Using the Scrapy project with the spider named \"toscrape-css\":"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8cf448df",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.readers.web import ScrapyWebReader\n",
+    "\n",
+    "reader = ScrapyWebReader(project_path=\"./quotesbot\")\n",
+    "docs = reader.load_data(\"toscrape-css\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12c85cd4",
+   "metadata": {},
+   "source": [
+    "### Metadata"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce6769ec",
+   "metadata": {},
+   "source": [
+    "Some keys from the scraped items can be stored as metadata in the Document object. You can specify which keys to include as metadata using the `metadata_keys` parameter. If you want to keep the keys in both the content and the metadata, set the `keep_keys` parameter to `True`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1c3f6112",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "reader = ScrapyWebReader(\n",
+    "    project_path=\"./quotesbot\",\n",
+    "    metadata_keys=[\"author\", \"tags\"],\n",
+    "    keep_keys=True,\n",
+    ")\n",
+    "docs = reader.load_data(\"toscrape-css\")"
+   ]
   }
  ],
  "metadata": {

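The following is a short sketch of inspecting the documents produced by the metadata example above; it assumes the quotesbot `toscrape-css` spider yields items with `author` and `tags` fields and that `./quotesbot` has been cloned as shown in the notebook cells.

```python
from llama_index.readers.web import ScrapyWebReader

# Assumes ./quotesbot was cloned as above and its items expose "author" and "tags".
reader = ScrapyWebReader(
    project_path="./quotesbot",
    metadata_keys=["author", "tags"],
    keep_keys=True,
)
docs = reader.load_data("toscrape-css")

for doc in docs[:3]:
    # Document.text holds the JSON-serialized item; Document.metadata holds
    # the keys selected via metadata_keys.
    print(doc.metadata)
    print(doc.text[:80])
```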
llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -33,6 +33,9 @@
 from llama_index.readers.web.scrapfly_web.base import (
     ScrapflyReader,
 )
+from llama_index.readers.web.scrapy_web.base import (
+    ScrapyWebReader,
+)
 from llama_index.readers.web.simple_web.base import (
     SimpleWebPageReader,
 )
@@ -73,6 +76,7 @@
     "RssReader",
     "RssNewsReader",
     "ScrapflyReader",
+    "ScrapyWebReader",
     "SimpleWebPageReader",
     "SitemapReader",
     "SpiderWebReader",
Lines changed: 43 additions & 0 deletions
# LlamaIndex Scrapy Web Reader Integration

This integration provides the `ScrapyWebReader` class that allows you to use Scrapy to scrape data and load it into LlamaIndex.

## Installation

```bash
pip install llama-index llama-index-readers-web
```

## Usage

The `ScrapyWebReader` can be used in two ways:

1. By providing a Scrapy spider class.
2. By providing the path to a Scrapy project.

### 1. Using with Scrapy Spider Class

```python
from scrapy.spiders import Spider

from llama_index.readers.web import ScrapyWebReader


class SampleSpider(Spider):
    name = "sample_spider"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        ...


reader = ScrapyWebReader()
docs = reader.load_data(SampleSpider)
```

### 2. Using with Scrapy Project Path

```python
from llama_index.readers.web import ScrapyWebReader

reader = ScrapyWebReader(project_path="/path/to/scrapy/project")
docs = reader.load_data("spider_name")
```
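As a rough sketch of how the loaded documents fit into the rest of LlamaIndex (not part of this commit), they can be passed to `VectorStoreIndex.from_documents`. The spider below is a placeholder, and building the index assumes an embedding model is configured (for example via `OPENAI_API_KEY`).

```python
from scrapy.spiders import Spider

from llama_index.core import VectorStoreIndex
from llama_index.readers.web import ScrapyWebReader


class QuotesSpider(Spider):
    # Placeholder spider for illustration; any Scrapy spider that yields
    # dict-like items will work with ScrapyWebReader.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }


reader = ScrapyWebReader(metadata_keys=["author"])
docs = reader.load_data(QuotesSpider)

# Index the scraped documents and query them.
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
print(query_engine.query("Who is quoted most often?"))
```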

llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/scrapy_web/__init__.py

Whitespace-only changes.
Lines changed: 93 additions & 0 deletions
from typing import List, Optional, Union
from multiprocessing import Process, Queue

from scrapy.spiders import Spider

from llama_index.core.readers.base import BasePydanticReader
from llama_index.core.schema import Document

from .utils import run_spider_process, load_scrapy_settings


class ScrapyWebReader(BasePydanticReader):
    """
    Scrapy web page reader.

    Reads pages from the web.

    Args:
        project_path (Optional[str]): The path to the Scrapy project for
            loading the project settings (with middlewares and pipelines).
            The project path should contain the `scrapy.cfg` file.
            Settings will be empty if the path is not specified or not found.
            Defaults to "".

        metadata_keys (Optional[List[str]]): List of keys to use
            as document metadata from the scraped item. Defaults to [].

        keep_keys (bool): Whether to keep metadata keys in items.
            Defaults to False.

    """

    project_path: Optional[str] = ""
    metadata_keys: Optional[List[str]] = []
    keep_keys: bool = False

    def __init__(
        self,
        project_path: Optional[str] = "",
        metadata_keys: Optional[List[str]] = [],
        keep_keys: bool = False,
    ):
        super().__init__(
            project_path=project_path,
            metadata_keys=metadata_keys,
            keep_keys=keep_keys,
        )

    @classmethod
    def class_name(cls) -> str:
        return "ScrapyWebReader"

    def load_data(self, spider: Union[Spider, str]) -> List[Document]:
        """
        Load data from the input spider.

        Args:
            spider (Union[Spider, str]): The Scrapy spider class or
                the spider name from the project to use for scraping.

        Returns:
            List[Document]: List of documents extracted from the web pages.

        """
        if not self._is_spider_correct_type(spider):
            raise ValueError(
                "Invalid spider type. Provide a Spider class or spider name with project path."
            )

        documents_queue = Queue()

        config = {
            "keep_keys": self.keep_keys,
            "metadata_keys": self.metadata_keys,
            "settings": load_scrapy_settings(self.project_path),
        }

        # Run each spider in a separate process, since Scrapy uses the
        # Twisted reactor, which can only be started once per process.
        process = Process(
            target=run_spider_process, args=(spider, documents_queue, config)
        )

        process.start()
        process.join()

        if documents_queue.empty():
            return []

        return documents_queue.get()

    def _is_spider_correct_type(self, spider: Union[Spider, str]) -> bool:
        return not (isinstance(spider, str) and not self.project_path)
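A minimal sketch of the input validation above: a spider name (string) is only accepted when `project_path` is set, otherwise `load_data` raises `ValueError`. The `./quotesbot` path refers to the sample project used elsewhere in this commit.

```python
from llama_index.readers.web import ScrapyWebReader

reader = ScrapyWebReader()  # no project_path configured

try:
    # A bare spider name cannot be resolved without a project.
    reader.load_data("toscrape-css")
except ValueError as err:
    print(err)

# With a project path, the same spider name is accepted.
reader = ScrapyWebReader(project_path="./quotesbot")
docs = reader.load_data("toscrape-css")
```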
Lines changed: 1 addition & 0 deletions
Scrapy
Lines changed: 87 additions & 0 deletions
import json
import os
from multiprocessing import Queue
from typing import Dict

from scrapy.spiders import signals, Spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from llama_index.core.schema import Document


def load_scrapy_settings(project_path: str) -> Dict:
    """
    Load Scrapy settings from the given project path.
    """
    if not project_path:
        return {}

    if not os.path.exists(project_path):
        return {}

    cwd = os.getcwd()

    try:
        os.chdir(project_path)

        try:
            settings = get_project_settings() or {}
        except Exception:
            settings = {}
    finally:
        os.chdir(cwd)

    return settings


def run_spider_process(spider: Spider, documents_queue: Queue, config: Dict):
    """
    Run the Scrapy spider process and collect documents in the queue.
    """
    documents = []

    def item_scraped(item, response, spider):
        documents.append(item_to_document(dict(item), config))

    process = CrawlerProcess(settings=config["settings"])
    crawler = process.create_crawler(spider)
    crawler.signals.connect(item_scraped, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()

    documents_queue.put(documents)


def item_to_document(item: Dict, config: Dict) -> Document:
    """
    Convert a scraped item to a Document with metadata.
    """
    metadata = setup_metadata(item, config)
    item = remove_metadata_keys(item, config)

    return Document(text=json.dumps(item), metadata=metadata)


def setup_metadata(item: Dict, config: Dict) -> Dict:
    """
    Set up metadata for the document from the scraped item.
    """
    metadata = {}

    for key in config["metadata_keys"]:
        if key in item:
            metadata[key] = item[key]

    return metadata


def remove_metadata_keys(item: Dict, config: Dict) -> Dict:
    """
    Remove metadata keys from the scraped item if keep_keys is False.
    """
    if not config["keep_keys"]:
        for key in config["metadata_keys"]:
            item.pop(key, None)

    return item
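A small sketch of what `item_to_document` above produces, assuming the helper lives in `llama_index.readers.web.scrapy_web.utils` (the module that `base.py` imports as `.utils`).

```python
from llama_index.readers.web.scrapy_web.utils import item_to_document

item = {"text": "A quote.", "author": "Albert Einstein", "tags": ["life"]}
config = {"metadata_keys": ["author", "tags"], "keep_keys": False}

doc = item_to_document(item, config)
print(doc.metadata)  # {'author': 'Albert Einstein', 'tags': ['life']}
print(doc.text)      # '{"text": "A quote."}'
```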
