This project is a web scraper built with Scrapy that extracts data from Amazon product pages and saves it to JSON files.
- Clone the repository to your local machine.
- Install the required dependencies by running `pip install -r requirements.txt` in your terminal.
- Install Docker if you don't have it.
- Depending on your OS, start Splash on Docker. On Windows the following command works (a quick check that Splash is up is sketched right after this list): `docker run -d -p 8050:8050 --memory=2G --restart=always scrapinghub/splash:3.1 --maxrss 1600`. For other operating systems, check https://splash.readthedocs.io/en/stable/install.html
- Everything should be working now. Run any of the commands below and test it out!
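If you want to verify that Splash came up correctly, the sketch below hits Splash's standard `/render.html` endpoint on the default port from the Docker command above (it assumes the `requests` package is available, which may not be in `requirements.txt`):

```python
# Quick sanity check that Splash is reachable. Assumes the Docker command above,
# i.e. Splash listening on http://localhost:8050, and that `requests` is installed.
import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://www.amazon.com", "wait": 2},
    timeout=60,
)
print(resp.status_code)  # 200 means Splash rendered the page
```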
Quick documentation
```
scrapy crawl AmazonGetHighResImages -a prod_id=B002HAJQGA -o AmazonGetHighResImages.json
```

| Parameter | Type | Description |
|---|---|---|
| prod_id | string | Specify product id for scraper (Amazon ASIN) |
| testing | boolean | Testing mode on/off (true/false) |
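For reference, every `-a name=value` option is passed by Scrapy to the spider's constructor as a keyword argument and always arrives as a string. The snippet below is only an illustrative sketch of that mechanism; the real `AmazonGetHighResImages` spider in this repository may handle its arguments differently.

```python
# Illustrative sketch only -- the actual spider in this repo may differ.
# Scrapy passes each `-a name=value` option to __init__ as a string keyword argument.
import scrapy


class ExampleHighResImagesSpider(scrapy.Spider):
    name = "ExampleHighResImages"

    def __init__(self, prod_id=None, testing="false", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.prod_id = prod_id
        # `-a testing=true` arrives as the string "true", so convert it explicitly
        self.testing = str(testing).lower() == "true"

    def start_requests(self):
        url = f"https://www.amazon.com/dp/{self.prod_id}"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Real parsing logic lives in the project's spider; omitted here.
        pass
```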
```
scrapy crawl AmazonOneProductSpider -a prod_id=B002HAJQGA
```

| Parameter | Type | Description |
|---|---|---|
| prod_id | string | Specify product id for scraper (Amazon ASIN) |
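If you would rather launch a crawl from Python than from the shell, Scrapy's `CrawlerProcess` accepts the same spider arguments as keyword arguments. This is a sketch that assumes it runs from the project root (so `get_project_settings()` can find the project); the `FEEDS` setting shown is just the programmatic equivalent of `-o` and is optional.

```python
# Run AmazonOneProductSpider from Python. Keyword arguments to crawl() are the
# equivalent of `-a prod_id=...` on the command line.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Optional: same effect as `-o AmazonOneProductSpider.json`
settings.set("FEEDS", {"AmazonOneProductSpider.json": {"format": "json"}})

process = CrawlerProcess(settings)
process.crawl("AmazonOneProductSpider", prod_id="B002HAJQGA")
process.start()  # blocks until the crawl finishes
```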
```
scrapy crawl AmazonProductPrices -a fetch_prod_ids_from_db=True -o AmazonProductPrices.json
```

| Parameter | Type | Description |
|---|---|---|
| prod_id | string | Required (1 of 2): specify product id for scraper (Amazon ASIN) |
| fetch_prod_ids_from_db | boolean | Required (2 of 2): should the program fetch ids from the database (true/false) |
| instance_id | int | Current instance id (if using only one instance, leave blank or put 1) |
| max_instances | int | Number of working instances (if using only one instance, leave blank or put 1) |

Note: only ONE of the two parameters marked "Required" (`prod_id` or `fetch_prod_ids_from_db`) needs to be provided.
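The spiders' internal partitioning scheme for `instance_id`/`max_instances` isn't documented here, but a typical approach is a round-robin split of the ids fetched from the database. The hypothetical helper below only illustrates the idea:

```python
# Hypothetical illustration of how ids could be split across parallel instances.
# The actual partitioning inside AmazonProductPrices may differ.
def ids_for_instance(all_prod_ids, instance_id=1, max_instances=1):
    """Return the slice of product ids this instance should scrape.

    With instance_id=2 and max_instances=3, this keeps every third id
    starting from the second one.
    """
    return [
        prod_id
        for index, prod_id in enumerate(all_prod_ids)
        if index % max_instances == instance_id - 1
    ]


print(ids_for_instance(["A", "B", "C", "D", "E"], instance_id=2, max_instances=3))
# ['B', 'E']
```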
```
scrapy crawl AmazonReviewsSpider -a fetch_prod_ids_from_db=True -o AmazonReviewsSpider.json
```

| Parameter | Type | Description |
|---|---|---|
| prod_id | string | Required (1 of 2): specify product id for scraper (Amazon ASIN) |
| fetch_prod_ids_from_db | boolean | Required (2 of 2): should the program fetch ids from the database (true/false) |
| testing | boolean | Testing mode on/off (true/false) |
| instance_id | int | Current instance id (if using only one instance, leave blank or put 1) |
| max_instances | int | Number of working instances (if using only one instance, leave blank or put 1) |

Note: only ONE of the two parameters marked "Required" (`prod_id` or `fetch_prod_ids_from_db`) needs to be provided.
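Whichever spider you run with `-o <file>.json`, Scrapy's feed export writes the scraped items as a JSON array, so the output can be post-processed with plain `json`. The field names depend on the spider and aren't listed here:

```python
# Load the feed produced by `-o AmazonReviewsSpider.json`. Scrapy writes a JSON
# array of scraped items; the exact item fields depend on the spider.
import json

with open("AmazonReviewsSpider.json", encoding="utf-8") as f:
    items = json.load(f)

print(f"Scraped {len(items)} items")
if items:
    print("Fields on the first item:", sorted(items[0].keys()))
```

Note that in Scrapy 2.x, `-o` appends to an existing file (use `-O` to overwrite), so remove the old file between runs to keep the JSON valid.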
```
scrapy crawl AmazonProductSpider -o amazon_product_data.json
```

| Parameter | Type | Description |
|---|---|---|
| instance_id | int | Current instance id (if using only one instance, leave blank or put 1) |
| max_instances | int | Number of working instances (if using only one instance, leave blank or put 1) |
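To make use of `instance_id`/`max_instances` you have to start the instances yourself; the sketch below launches one `scrapy crawl` process per instance and writes each instance's items to its own file so the feeds don't clash (running it from the project root, next to `scrapy.cfg`, is assumed).

```python
# Sketch of launching several parallel instances of AmazonProductSpider,
# one subprocess per instance, each with its own output file.
import subprocess

MAX_INSTANCES = 3

processes = []
for instance_id in range(1, MAX_INSTANCES + 1):
    cmd = [
        "scrapy", "crawl", "AmazonProductSpider",
        "-a", f"instance_id={instance_id}",
        "-a", f"max_instances={MAX_INSTANCES}",
        "-o", f"amazon_product_data_{instance_id}.json",
    ]
    processes.append(subprocess.Popen(cmd))

for proc in processes:
    proc.wait()  # wait for all instances to finish
```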
Contributions to this project are welcome. To contribute, please follow these steps:
- Fork the repository.
- Create a new branch for your changes.
- Make your changes and commit them to your branch.
- Push your changes to your forked repository.
- Submit a pull request to the original repository.
This project was built by Highlighted-dev using Scrapy, a Python web scraping framework.
This project is licensed under the MIT License. See the LICENSE file for more information.
For questions or support, please contact Highlighted.