# Web Scraper CLI

A FastAPI-based web scraper with a CLI interface.

## Features

- REST API for managing scrape jobs
- CLI tool for scraping websites
- Extract metadata, links, and specific content using CSS selectors
- Store scraping results in a SQLite database
- Background job processing
- Rate limiting to avoid overloading target websites

## Installation

### Local Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/webscrapercli.git
   cd webscrapercli
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the database migrations:

   ```bash
   alembic upgrade head
   ```

### Docker Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/webscrapercli.git
   cd webscrapercli
   ```

2. Build and run using Docker Compose:

   ```bash
   docker-compose up --build
   ```

This will:

- Build the Docker image with all dependencies
- Start the FastAPI server on port 8000
- Mount the app and storage directories as volumes for live code reloading

## Usage

### API Server

Start the API server:

```bash
# Development mode
uvicorn main:app --reload

# Production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

Access the API documentation at http://localhost:8000/docs.

### CLI Usage

The CLI provides several commands for scraping websites:

```bash
# Scrape a URL
python cli.py scrape https://example.com

# Scrape a URL with a specific selector
python cli.py scrape https://example.com --selector "div.content"

# Save the results to a file
python cli.py scrape https://example.com --output results.json

# List all scrape jobs
python cli.py list

# List scrape jobs with a specific status
python cli.py list --status completed

# Show details of a specific job
python cli.py show 1

# Run a specific job
python cli.py run 1
```

## API Endpoints

- `GET /health`: Health check endpoint
- `POST /api/v1/scrape-jobs/`: Create a new scrape job
- `GET /api/v1/scrape-jobs/`: List scrape jobs
- `GET /api/v1/scrape-jobs/{job_id}`: Get a specific scrape job
- `PUT /api/v1/scrape-jobs/{job_id}`: Update a scrape job
- `DELETE /api/v1/scrape-jobs/{job_id}`: Delete a scrape job
- `POST /api/v1/scrape-jobs/{job_id}/run`: Run a scrape job
- `GET /api/v1/scrape-jobs/{job_id}/results`: Get the results of a scrape job

Hedged example sketches for calling these endpoints, extracting content, rate limiting, and testing appear at the end of this README.

## Development

### Project Structure

```
webscrapercli/
├── alembic.ini            # Alembic configuration
├── app/                   # Application package
│   ├── api/               # API endpoints
│   ├── cli/               # CLI implementation
│   ├── core/              # Core functionality
│   ├── crud/              # CRUD operations
│   ├── db/                # Database configuration
│   ├── models/            # SQLAlchemy models
│   ├── schemas/           # Pydantic schemas
│   ├── services/          # Business logic
│   └── utils/             # Utility functions
├── cli.py                 # CLI entry point
├── docker-compose.yml     # Docker Compose configuration
├── Dockerfile             # Docker configuration
├── main.py                # API entry point
├── migrations/            # Alembic migrations
│   ├── env.py             # Alembic environment
│   ├── script.py.mako     # Alembic script template
│   └── versions/          # Migration scripts
├── requirements.txt       # Dependencies
└── storage/               # Storage directory for the database and other files
    └── db/                # Database directory
```

### Running Tests

```bash
# Run tests
pytest
```

A minimal example test is sketched at the end of this README.

## License

This project is open source.
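
## Example Sketches

The sketches below are illustrative only; they are not part of the project's tested surface.

### Calling the API from Python

The endpoint list above maps naturally onto a small Python client. This sketch uses `requests` against the documented routes; the payload and response fields (`url`, `selector`, `id`) are assumptions, since the request/response schemas are not shown in this README.

```python
import requests

BASE = "http://localhost:8000/api/v1"

# Create a scrape job. The JSON fields are assumed, not confirmed by the schema.
resp = requests.post(
    f"{BASE}/scrape-jobs/",
    json={"url": "https://example.com", "selector": "div.content"},
)
resp.raise_for_status()
job_id = resp.json()["id"]  # assumes the response includes the new job's id

# Run the job, then fetch its results (both endpoints are documented above).
requests.post(f"{BASE}/scrape-jobs/{job_id}/run").raise_for_status()
results = requests.get(f"{BASE}/scrape-jobs/{job_id}/results").json()
print(results)
```

For one-off jobs, the CLI (`python cli.py scrape ...`) covers the same flow without writing any client code.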
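
### Extracting content with CSS selectors

The selector feature can be pictured as a standalone function. The sketch below uses BeautifulSoup, which is an assumption; the project's actual parsing code lives under `app/services/` and may work differently.

```python
import requests
from bs4 import BeautifulSoup  # assumed parser; not confirmed by this README


def extract(url: str, selector: str) -> list[str]:
    """Fetch a page and return the text of every element matching a CSS selector."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # select() applies a CSS selector such as "div.content" or "h1"
    return [el.get_text(strip=True) for el in soup.select(selector)]


print(extract("https://example.com", "h1"))
```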
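
### Per-domain rate limiting

Rate limiting is listed as a feature but not specified here. A minimal per-domain delay might look like the following hypothetical sketch; it is not the project's implementation.

```python
import time
from urllib.parse import urlparse

# Hypothetical sketch: enforce a minimum interval between hits to the same domain.
_last_hit: dict[str, float] = {}


def wait_for_domain(url: str, min_interval: float = 1.0) -> None:
    """Sleep so requests to the same domain are at least min_interval seconds apart."""
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - _last_hit.get(domain, 0.0)
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    _last_hit[domain] = time.monotonic()
    # ...perform the actual request after this call
```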
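
### A minimal API test

New tests can exercise the API in-process with FastAPI's `TestClient`. This sketch assumes `main.py` exposes `app` (consistent with `uvicorn main:app` above); the exact response body of `/health` is not documented, so only the status code is asserted.

```python
from fastapi.testclient import TestClient

from main import app  # assumes main.py exposes the FastAPI app

client = TestClient(app)


def test_health():
    # /health is the documented health-check endpoint
    resp = client.get("/health")
    assert resp.status_code == 200
```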