
# Web Scraper CLI
A FastAPI-based web scraper with a command-line interface (CLI).
## Features
- REST API for web scraping management
- CLI tool for scraping websites
- Extract metadata, links, and specific content using CSS selectors (see the sketch after this list)
- Store scraping results in an SQLite database
- Background job processing
- Rate limiting to avoid overloading target websites
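The selector-based extraction works roughly along these lines. The snippet below is a minimal sketch using `requests` and BeautifulSoup, not the project's actual implementation (the real logic lives in `app/services/` and may differ):
```python
# Minimal sketch of selector-based extraction (assumes requests + BeautifulSoup).
# The project's real implementation lives in app/services/ and may differ.
from typing import Optional

import requests
from bs4 import BeautifulSoup

def extract(url: str, selector: Optional[str] = None) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    result = {
        # Page metadata: title plus all <meta> name/content pairs
        "title": soup.title.string if soup.title else None,
        "meta": {m.get("name"): m.get("content")
                 for m in soup.find_all("meta") if m.get("name")},
        # Every hyperlink on the page
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }
    if selector:
        # CSS selector, e.g. "div.content"
        result["content"] = [el.get_text(strip=True)
                             for el in soup.select(selector)]
    return result
```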
## Installation
### Local Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Run the database migrations:
```bash
alembic upgrade head
```
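Steps 2 and 3 assume a working Python environment; creating a virtual environment first (optional, and not required by the project) keeps the dependencies isolated:
```bash
# Optional: isolate dependencies in a virtual environment
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```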
### Docker Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
```
2. Build and run using Docker Compose:
```bash
docker-compose up --build
```
This will:
- Build the Docker image with all dependencies
- Start the FastAPI server on port 8000
- Mount the app and storage directories as volumes for live code reloading
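The repository ships its own `docker-compose.yml`; as a rough illustration of the behavior described above (build, port 8000, volume mounts), such a file typically looks like the sketch below. The service name and container paths here are assumptions; the file in the repo is authoritative.
```yaml
# Illustrative sketch only; see the repository's docker-compose.yml for the
# actual configuration. Service name and container paths are assumptions.
services:
  api:
    build: .
    ports:
      - "8000:8000"               # FastAPI server on port 8000
    volumes:
      - ./app:/code/app           # live code reloading
      - ./storage:/code/storage   # persist the SQLite database
```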
## Usage
### API Server
Start the API server:
```bash
# Development mode
uvicorn main:app --reload
# Production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```
Access the API documentation at: http://localhost:8000/docs
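A quick way to verify the server is up is the health endpoint:
```bash
# Should return HTTP 200 once the server is running
curl http://localhost:8000/health
```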
### CLI Usage
The CLI provides several commands for scraping websites:
```bash
# Scrape a URL
python cli.py scrape https://example.com
# Scrape a URL with a specific selector
python cli.py scrape https://example.com --selector "div.content"
# Save the results to a file
python cli.py scrape https://example.com --output results.json
# List all scrape jobs
python cli.py list
# List scrape jobs with a specific status
python cli.py list --status completed
# Show details of a specific job
python cli.py show 1
# Run a specific job
python cli.py run 1
```
## API Endpoints
- `GET /health`: Health check endpoint
- `POST /api/v1/scrape-jobs/`: Create a new scrape job
- `GET /api/v1/scrape-jobs/`: List scrape jobs
- `GET /api/v1/scrape-jobs/{job_id}`: Get a specific scrape job
- `PUT /api/v1/scrape-jobs/{job_id}`: Update a scrape job
- `DELETE /api/v1/scrape-jobs/{job_id}`: Delete a scrape job
- `POST /api/v1/scrape-jobs/{job_id}/run`: Run a scrape job
- `GET /api/v1/scrape-jobs/{job_id}/results`: Get the results of a scrape job
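A typical flow through these endpoints, sketched with `curl`. The JSON field names `url` and `selector` are assumptions based on the CLI options; check http://localhost:8000/docs for the actual request schema.
```bash
# Create a scrape job (field names are assumed; see /docs for the real schema)
curl -X POST http://localhost:8000/api/v1/scrape-jobs/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "selector": "div.content"}'

# Run job 1, then fetch its results
curl -X POST http://localhost:8000/api/v1/scrape-jobs/1/run
curl http://localhost:8000/api/v1/scrape-jobs/1/results
```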
## Development
### Project Structure
```
webscrapercli/
├── alembic.ini          # Alembic configuration
├── app/                 # Application package
│   ├── api/             # API endpoints
│   ├── cli/             # CLI implementation
│   ├── core/            # Core functionality
│   ├── crud/            # CRUD operations
│   ├── db/              # Database configuration
│   ├── models/          # SQLAlchemy models
│   ├── schemas/         # Pydantic schemas
│   ├── services/        # Business logic
│   └── utils/           # Utility functions
├── cli.py               # CLI entry point
├── docker-compose.yml   # Docker Compose configuration
├── Dockerfile           # Docker configuration
├── main.py              # API entry point
├── migrations/          # Alembic migrations
│   ├── env.py           # Alembic environment
│   ├── script.py.mako   # Alembic script template
│   └── versions/        # Migration scripts
├── requirements.txt     # Dependencies
└── storage/             # Storage directory for database and other files
    └── db/              # Database directory
```
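`main.py` is what `uvicorn main:app` loads. Its actual contents are project-specific, but the wiring implied by the layout above looks roughly like this (the router module name is an assumption):
```python
# Sketch of the API entry point loaded by `uvicorn main:app`.
# The import path app.api.routes is an assumption; the real module may differ.
from fastapi import FastAPI

from app.api import routes  # assumed router module

app = FastAPI(title="Web Scraper CLI")
app.include_router(routes.router, prefix="/api/v1")

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}
```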
### Running Tests
```bash
# Run tests
pytest
```
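Tests can exercise the API in-process with FastAPI's `TestClient`. A minimal example against the health endpoint (the file name is illustrative, and only the status code is asserted since the response body isn't documented here):
```python
# tests/test_health.py - minimal sketch using FastAPI's TestClient
from fastapi.testclient import TestClient

from main import app

def test_health() -> None:
    client = TestClient(app)
    response = client.get("/health")
    assert response.status_code == 200
```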
## License
This project is open source.