
Commit aa7a7e2

Authored by VinciGit00, semantic-release-bot, Levyathanus, and f-aguzzi

Pre/beta (#835)

* fix: error on fetching the code
* feat: revert search function
* feat: add api integration
* ci(release): 1.32.0-beta.1 [skip ci]
* fix: improved links extraction for parse_node, resolves #822
* ci(release): 1.32.0-beta.2 [skip ci]
* ci(release): 1.32.0-beta.3 [skip ci]
* chore: migrate from rye to uv
* feat: add sdk integration
* ci(release): 1.32.0-beta.4 [skip ci]

Co-authored-by: semantic-release-bot <semantic-release-bot@martynus.net>
Co-authored-by: Michele_Zenoni <michelezenoni1@gmail.com>
Co-authored-by: Federico Aguzzi <62149513+f-aguzzi@users.noreply.github.com>

1 parent bbc7184 · commit aa7a7e2

File tree

13 files changed: +5296 −165 lines

.github/workflows/pylint.yml (+5 −5)

@@ -9,15 +9,15 @@ jobs:
     runs-on: ubuntu-latest
     steps:
     - uses: actions/checkout@v3
-    - name: Install the latest version of rye
-      uses: eifinger/setup-rye@v3
+    - name: Install uv
+      uses: astral-sh/setup-uv@v3
     - name: Install dependencies
-      run: rye sync --no-lock
+      run: uv sync --frozen
     - name: Analysing the code with pylint
-      run: rye run pylint-ci
+      run: uv run poe pylint-ci
     - name: Check Pylint score
       run: |
-        pylint_score=$(rye run pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
+        pylint_score=$(uv run poe pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
         if (( $(echo "$pylint_score < 8" | bc -l) )); then
           echo "Pylint score is below 8. Blocking commit."
           exit 1
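The same substitution runs through both workflow files; a rough command-for-command mapping of the rye-to-uv migration, with the poe task names taken from the new `[tool.poe.tasks]` section of pyproject.toml (the variable names below are purely illustrative):

```shell
# Old rye commands vs. their uv replacements, as they appear in this
# commit's workflow diffs. Variable names are for illustration only.
new_install="uv sync --frozen"          # was: rye sync --no-lock
new_lint="uv run poe pylint-ci"         # was: rye run pylint-ci
new_score="uv run poe pylint-score-ci"  # was: rye run pylint-score-ci
new_build="uv build"                    # was: rye build
printf '%s\n' "$new_install" "$new_lint" "$new_score" "$new_build"
```

Note that `uv sync --frozen` refuses to update the lockfile, roughly matching the intent of rye's `--no-lock`, and that lint tasks now go through poethepoet (`poe`) rather than rye's built-in script runner.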

.github/workflows/release.yml (+4 −4)

@@ -14,8 +14,8 @@ jobs:
         run: |
           sudo apt update
           sudo apt install -y git
-      - name: Install the latest version of rye
-        uses: eifinger/setup-rye@v3
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
       - name: Install Node Env
         uses: actions/setup-node@v4
         with:
@@ -27,8 +27,8 @@ jobs:
           persist-credentials: false
       - name: Build app
         run: |
-          rye sync --no-lock
-          rye build
+          uv sync --frozen
+          uv build
         id: build_cache
         if: success()
       - name: Cache build

CHANGELOG.md (+34)

@@ -1,3 +1,37 @@
+## [1.32.0-beta.4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.32.0-beta.3...v1.32.0-beta.4) (2024-12-02)
+
+
+### Features
+
+* add api integration ([8aa9103](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/8aa9103f02af92d9e1a780450daa7bb303afc150))
+* add sdk integration ([209b445](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/209b4456fd668d9d124fd5586b32a4be677d4bf8))
+
+
+### chore
+
+* migrate from rye to uv ([5fe528a](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5fe528a7e7a3e230d8f68fd83ce5ad6ede5adfef))
+
+## [1.32.0-beta.3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.32.0-beta.2...v1.32.0-beta.3) (2024-11-26)
+
+
+### Bug Fixes
+
+* improved links extraction for parse_node, resolves [#822](https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/822) ([7da7bfe](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7da7bfe338a6ce53c83361a1f6cd9ea2d5bd797f))
+
+## [1.32.0-beta.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.32.0-beta.1...v1.32.0-beta.2) (2024-11-25)
+
+
+### Bug Fixes
+
+* error on fetching the code ([7285ab0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7285ab065bba9099ba2751c9d2f21ee13fed0d5f))
+
+## [1.32.0-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.31.1...v1.32.0-beta.1) (2024-11-24)
+
+
+### Features
+
+* revert search function ([faf0c01](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/faf0c0123b5e2e548cbd1917e9d1df22e1edb1c5))
+
 ## [1.31.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.31.0...v1.31.1) (2024-11-22)
(new file — filename not captured in this view, +44)

@@ -0,0 +1,44 @@
+"""
+Basic example of scraping pipeline using SmartScraper
+"""
+import os
+import json
+from dotenv import load_dotenv
+from scrapegraphai.graphs import SmartScraperGraph
+from scrapegraphai.utils import prettify_exec_info
+
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "model": "scrapegraphai/smart-scraper",
+        "api_key": os.getenv("SCRAPEGRAPH_API_KEY")
+    },
+    "verbose": True,
+    "headless": False,
+}
+
+# ************************************************
+# Create the SmartScraperGraph instance and run it
+# ************************************************
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="Extract me all the articles",
+    source="https://www.wired.com",
+    config=graph_config
+)
+
+result = smart_scraper_graph.run()
+print(json.dumps(result, indent=4))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = smart_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))

pyproject.toml (+15 −10)

@@ -3,7 +3,7 @@ name = "scrapegraphai"
-version = "1.31.1"
+version = "1.32.0b4"
@@ -43,7 +43,8 @@ dependencies = [
     "transformers>=4.44.2",
     "googlesearch-python>=1.2.5",
     "simpleeval>=1.0.0",
-    "async_timeout>=4.0.3"
+    "async_timeout>=4.0.3",
+    "scrapegraph-py>=0.0.4"
 ]

 license = "MIT"
@@ -91,7 +92,7 @@ other-language-models = [
     "langchain-anthropic>=0.1.11",
     "langchain-huggingface>=0.0.3",
     "langchain-nvidia-ai-endpoints>=0.1.6",
-    "langchain_together>=1.2.9"
+    "langchain_together>=0.2.0"
 ]

 # Group 2: More Semantic Options
@@ -116,17 +117,21 @@ screenshot_scraper = [
 requires = ["hatchling"]
 build-backend = "hatchling.build"

-[tool.rye]
-managed = true
+[dependency-groups]
+dev = [
+    "burr[start]==0.22.1",
+    "sphinx==6.0",
+    "furo==2024.5.6",
+]
+
+[tool.uv]
 dev-dependencies = [
+    "poethepoet>=0.31.1",
     "pytest==8.0.0",
     "pytest-mock==3.14.0",
-    "-e file:.[burr]",
-    "-e file:.[docs]",
     "pylint>=3.2.5",
 ]

-[tool.rye.scripts]
-pylint-local = "pylint scrapegraphai/**/*.py"
+[tool.poe.tasks]
+pylint-local = "pylint scraperaphai/**/*.py"
 pylint-ci = "pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py"
-update-requirements = "python 'manual deployment/autorequirements.py'"

requirements-dev.lock (+7 −2)

@@ -353,7 +353,7 @@ pyasn1==0.6.0
     # via rsa
 pyasn1-modules==0.4.0
     # via google-auth
-pydantic==2.8.2
+pydantic==2.10.1
     # via burr
     # via fastapi
     # via fastapi-pagination
@@ -368,7 +368,8 @@ pydantic==2.10.1
     # via openai
     # via pydantic-settings
     # via qdrant-client
-pydantic-core==2.20.1
+    # via scrapegraph-py
+pydantic-core==2.27.1
     # via pydantic
 pydantic-settings==2.5.2
     # via langchain-community
@@ -396,6 +397,7 @@ python-dateutil==2.9.0.post0
     # via pandas
 python-dotenv==1.0.1
     # via pydantic-settings
+    # via scrapegraph-py
     # via scrapegraphai
 pytz==2024.1
     # via pandas
@@ -424,6 +426,7 @@ requests==2.32.3
     # via langchain-community
     # via langsmith
     # via mistral-common
+    # via scrapegraph-py
     # via sphinx
     # via streamlit
     # via tiktoken
@@ -439,6 +442,8 @@ s3transfer==0.10.2
     # via boto3
 safetensors==0.4.5
     # via transformers
+scrapegraph-py==0.0.3
+    # via scrapegraphai
 semchunk==2.2.0
     # via scrapegraphai
 sentencepiece==0.2.0

requirements.lock (+7 −2)

@@ -257,7 +257,7 @@ pyasn1==0.6.0
     # via rsa
 pyasn1-modules==0.4.0
     # via google-auth
-pydantic==2.8.2
+pydantic==2.10.1
     # via google-generativeai
     # via langchain
     # via langchain-aws
@@ -269,7 +269,8 @@ pydantic==2.10.1
     # via openai
     # via pydantic-settings
     # via qdrant-client
-pydantic-core==2.20.1
+    # via scrapegraph-py
+pydantic-core==2.27.1
     # via pydantic
 pydantic-settings==2.5.2
     # via langchain-community
@@ -286,6 +287,7 @@ python-dateutil==2.9.0.post0
     # via pandas
 python-dotenv==1.0.1
     # via pydantic-settings
+    # via scrapegraph-py
     # via scrapegraphai
 pytz==2024.1
     # via pandas
@@ -313,6 +315,7 @@ requests==2.32.3
     # via langchain-community
     # via langsmith
     # via mistral-common
+    # via scrapegraph-py
     # via tiktoken
     # via transformers
 rpds-py==0.20.0
@@ -324,6 +327,8 @@ s3transfer==0.10.2
     # via boto3
 safetensors==0.4.5
     # via transformers
+scrapegraph-py==0.0.3
+    # via scrapegraphai
 semchunk==2.2.0
     # via scrapegraphai
 sentencepiece==0.2.0

scrapegraphai/docloaders/chromium.py (+4 −12)

@@ -100,18 +100,11 @@ async def ascrape_undetected_chromedriver(self, url: str) -> str:
 async def ascrape_playwright(self, url: str) -> str:
     """
     Asynchronously scrape the content of a given URL using Playwright's async API.
-
-    Args:
-        url (str): The URL to scrape.
-
-    Returns:
-        str: The scraped HTML content or an error message if an exception occurs.
     """
     from playwright.async_api import async_playwright
     from undetected_playwright import Malenia

     logger.info(f"Starting scraping with {self.backend}...")
-    results = ""
     attempt = 0

     while attempt < self.RETRY_LIMIT:
@@ -127,16 +120,15 @@ async def ascrape_playwright(self, url: str) -> str:
             await page.wait_for_load_state(self.load_state)
             results = await page.content()
             logger.info("Content scraped")
-            break
+            return results
         except (aiohttp.ClientError, asyncio.TimeoutError, Exception) as e:
             attempt += 1
             logger.error(f"Attempt {attempt} failed: {e}")
             if attempt == self.RETRY_LIMIT:
-                results = f"Error: Network error after {self.RETRY_LIMIT} attempts - {e}"
+                raise RuntimeError(f"Failed to fetch {url} after {self.RETRY_LIMIT} attempts: {e}")
         finally:
-            await browser.close()
-
-    return results
+            if 'browser' in locals():
+                await browser.close()

 async def ascrape_with_js_support(self, url: str) -> str:
     """

scrapegraphai/graphs/smart_scraper_graph.py (+10)

@@ -13,6 +13,7 @@
     ConditionalNode
 )
 from ..prompts import REGEN_ADDITIONAL_INFO
+from scrapegraph_py import SyncClient

 class SmartScraperGraph(AbstractGraph):
     """
@@ -59,6 +60,15 @@ def _create_graph(self) -> BaseGraph:
     Returns:
         BaseGraph: A graph instance representing the web scraping workflow.
     """
+    if self.llm_model == "scrapegraphai/smart-scraper":
+
+        sgai_client = SyncClient(api_key=self.config.get("api_key"))
+
+        response = sgai_client.smartscraper(
+            website_url=self.source,
+            user_prompt=self.prompt
+        )
+        return response

     fetch_node = FetchNode(
         input="url| local_dir",
