# Web Scraper CLI

A FastAPI-based web scraper with a command-line interface.
## Features

- REST API for managing web-scraping jobs
- CLI tool for scraping websites
- Extraction of metadata, links, and specific content using CSS selectors
- Scrape results stored in a SQLite database
- Background job processing
- Rate limiting to avoid overloading target websites
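The extraction internals are not shown in this README. As a rough illustration of the link-extraction feature, here is a standard-library-only sketch; the project itself presumably uses CSS-selector tooling, which `html.parser` is not, so treat this as a conceptual example rather than the project's implementation:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<p><a href="https://example.com">Example</a> <a href="/docs">Docs</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['https://example.com', '/docs']
```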
## Installation

### Local Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/webscrapercli.git
   cd webscrapercli
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the database migrations:

   ```bash
   alembic upgrade head
   ```
### Docker Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/webscrapercli.git
   cd webscrapercli
   ```

2. Build and run with Docker Compose:

   ```bash
   docker-compose up --build
   ```

This will:

- Build the Docker image with all dependencies
- Start the FastAPI server on port 8000
- Mount the `app/` and `storage/` directories as volumes for live code reloading
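The repository ships its own `docker-compose.yml`, whose exact contents are not reproduced in this README. Based only on the behavior described above (port 8000, live-reload volume mounts), a compose file along these lines would behave similarly; the service name, container paths, and command are assumptions:

```yaml
services:
  api:
    build: .                       # Dockerfile in the repository root (assumed build context)
    ports:
      - "8000:8000"                # FastAPI server on port 8000
    volumes:
      - ./app:/code/app            # mount source for live code reloading (path assumed)
      - ./storage:/code/storage    # persist the SQLite database (path assumed)
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```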
## Usage

### API Server

Start the API server:

```bash
# Development mode (auto-reloads on code changes)
uvicorn main:app --reload

# Production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

Once the server is running, the interactive API documentation is available at http://localhost:8000/docs.
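With the server running, jobs can also be created directly through the REST API. The sketch below builds a request for the `POST /api/v1/scrape-jobs/` endpoint using only the standard library; the payload field names (`url`, `selector`) are assumptions, since this README does not document the request schema — check the `/docs` page for the real one:

```python
import json
import urllib.request

# Hypothetical payload: the field names are assumptions, not documented here.
payload = {"url": "https://example.com", "selector": "div.content"}

req = urllib.request.Request(
    "http://localhost:8000/api/v1/scrape-jobs/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the server running, submit the job:
# with urllib.request.urlopen(req) as resp:
#     job = json.load(resp)

print(req.get_method(), req.full_url)
```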
### CLI Usage

The CLI provides several commands for scraping websites:

```bash
# Scrape a URL
python cli.py scrape https://example.com

# Scrape a URL with a specific CSS selector
python cli.py scrape https://example.com --selector "div.content"

# Save the results to a file
python cli.py scrape https://example.com --output results.json

# List all scrape jobs
python cli.py list

# List scrape jobs with a specific status
python cli.py list --status completed

# Show details of a specific job
python cli.py show 1

# Run a specific job
python cli.py run 1
```
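The actual `cli.py` implementation is not shown in this README and may use a different framework (e.g. Click or Typer). As a minimal sketch of the command surface above, here is how the same subcommands could be wired with `argparse`; option names match the examples, everything else is an assumption:

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of the CLI shown in the usage examples.
    parser = argparse.ArgumentParser(prog="cli.py")
    sub = parser.add_subparsers(dest="command", required=True)

    scrape = sub.add_parser("scrape", help="Scrape a URL")
    scrape.add_argument("url")
    scrape.add_argument("--selector", help="CSS selector to extract")
    scrape.add_argument("--output", help="Write results to this file")

    list_cmd = sub.add_parser("list", help="List scrape jobs")
    list_cmd.add_argument("--status", help="Filter by job status")

    show = sub.add_parser("show", help="Show details of a specific job")
    show.add_argument("job_id", type=int)

    run = sub.add_parser("run", help="Run a specific job")
    run.add_argument("job_id", type=int)

    return parser

args = build_parser().parse_args(
    ["scrape", "https://example.com", "--selector", "div.content"]
)
print(args.command, args.url, args.selector)
```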
## API Endpoints

- `GET /health`: Health check endpoint
- `POST /api/v1/scrape-jobs/`: Create a new scrape job
- `GET /api/v1/scrape-jobs/`: List scrape jobs
- `GET /api/v1/scrape-jobs/{job_id}`: Get a specific scrape job
- `PUT /api/v1/scrape-jobs/{job_id}`: Update a scrape job
- `DELETE /api/v1/scrape-jobs/{job_id}`: Delete a scrape job
- `POST /api/v1/scrape-jobs/{job_id}/run`: Run a scrape job
- `GET /api/v1/scrape-jobs/{job_id}/results`: Get the results of a scrape job
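As a small illustrative helper (not part of the project), the route table above can be turned into a URL builder for client scripts; the base URL is an assumption matching the default local server address:

```python
from urllib.parse import urljoin

BASE_URL = "http://localhost:8000"  # assumption: default local server

def job_url(job_id=None, action=None):
    """Build a scrape-job endpoint URL from the route table above.

    action is an optional sub-resource such as "run" or "results".
    """
    path = "/api/v1/scrape-jobs/"
    if job_id is not None:
        path += str(job_id)
        if action is not None:
            path += f"/{action}"
    return urljoin(BASE_URL, path)

print(job_url())             # collection endpoint (list/create)
print(job_url(1, "results"))  # results of job 1
```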
## Development

### Project Structure

```
webscrapercli/
├── alembic.ini             # Alembic configuration
├── app/                    # Application package
│   ├── api/                # API endpoints
│   ├── cli/                # CLI implementation
│   ├── core/               # Core functionality
│   ├── crud/               # CRUD operations
│   ├── db/                 # Database configuration
│   ├── models/             # SQLAlchemy models
│   ├── schemas/            # Pydantic schemas
│   ├── services/           # Business logic
│   └── utils/              # Utility functions
├── cli.py                  # CLI entry point
├── docker-compose.yml      # Docker Compose configuration
├── Dockerfile              # Docker configuration
├── main.py                 # API entry point
├── migrations/             # Alembic migrations
│   ├── env.py              # Alembic environment
│   ├── script.py.mako      # Alembic script template
│   └── versions/           # Migration scripts
├── requirements.txt        # Dependencies
└── storage/                # Storage for the database and other files
    └── db/                 # Database directory
```
### Running Tests

```bash
pytest
```
## License

This project is open source.