# Web Scraper CLI

A FastAPI-based web scraper with a CLI interface.

## Features

- REST API for managing web scraping jobs
- CLI tool for scraping websites
- Extract metadata, links, and specific content using CSS selectors
- Store scraping results in a SQLite database
- Background job processing
- Rate limiting to avoid overloading target websites

## Installation
### Local Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Run the database migrations:

```bash
alembic upgrade head
```
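
If you change the SQLAlchemy models later, a new migration can be generated with the standard Alembic workflow. This is a sketch: it assumes the Alembic environment in `migrations/env.py` is set up for autogeneration, and the revision message is only an example.

```bash
# Generate a migration from model changes (message is an example)
alembic revision --autogenerate -m "describe your schema change"

# Apply the new migration
alembic upgrade head
```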
### Docker Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
```

2. Build and run using Docker Compose:

```bash
docker-compose up --build
```

This will:

- Build the Docker image with all dependencies
- Start the FastAPI server on port 8000
- Mount the app and storage directories as volumes for live code reloading
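
A few Compose commands cover most day-to-day tasks. The snippet below is a sketch: the service name `app` is an assumption, so check `docker-compose.yml` for the actual name before using the `exec` form.

```bash
# Start the stack in the background
docker-compose up -d --build

# Follow the API server logs
docker-compose logs -f

# Run the CLI inside the container (service name "app" is an assumption)
docker-compose exec app python cli.py scrape https://example.com

# Stop and remove the containers
docker-compose down
```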
## Usage

### API Server

Start the API server:

```bash
# Development mode
uvicorn main:app --reload

# Production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

Access the API documentation at: http://localhost:8000/docs
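
To confirm the server is running, the documented health-check endpoint can be queried with curl:

```bash
# A successful response means the API is up
curl http://localhost:8000/health
```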
### CLI Usage

The CLI provides several commands for scraping websites:

```bash
# Scrape a URL
python cli.py scrape https://example.com

# Scrape a URL with a specific selector
python cli.py scrape https://example.com --selector "div.content"

# Save the results to a file
python cli.py scrape https://example.com --output results.json

# List all scrape jobs
python cli.py list

# List scrape jobs with a specific status
python cli.py list --status completed

# Show details of a specific job
python cli.py show 1

# Run a specific job
python cli.py run 1
```
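
Several URLs can be scraped in one pass by driving the scrape command from a small shell loop. This sketch uses only the commands shown above; the URLs and output file naming are placeholders.

```bash
# Scrape a list of URLs, saving each result to its own JSON file
for url in https://example.com https://example.org; do
  name="${url#*://}"         # strip the scheme
  name="${name//[\/.]/_}"    # turn dots and slashes into underscores
  python cli.py scrape "$url" --output "${name}.json"
done
```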
## API Endpoints

- `GET /health`: Health check endpoint
- `POST /api/v1/scrape-jobs/`: Create a new scrape job
- `GET /api/v1/scrape-jobs/`: List scrape jobs
- `GET /api/v1/scrape-jobs/{job_id}`: Get a specific scrape job
- `PUT /api/v1/scrape-jobs/{job_id}`: Update a scrape job
- `DELETE /api/v1/scrape-jobs/{job_id}`: Delete a scrape job
- `POST /api/v1/scrape-jobs/{job_id}/run`: Run a scrape job
- `GET /api/v1/scrape-jobs/{job_id}/results`: Get the results of a scrape job
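
A typical request sequence with curl is sketched below. The JSON field names (`url`, `selector`) are assumptions based on the CLI options; check the OpenAPI schema at http://localhost:8000/docs for the actual request body.

```bash
# Create a scrape job (field names are assumptions; see /docs for the real schema)
curl -X POST http://localhost:8000/api/v1/scrape-jobs/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "selector": "div.content"}'

# Run job 1, then fetch its results
curl -X POST http://localhost:8000/api/v1/scrape-jobs/1/run
curl http://localhost:8000/api/v1/scrape-jobs/1/results
```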
## Development

### Project Structure

```
webscrapercli/
├── alembic.ini            # Alembic configuration
├── app/                   # Application package
│   ├── api/               # API endpoints
│   ├── cli/               # CLI implementation
│   ├── core/              # Core functionality
│   ├── crud/              # CRUD operations
│   ├── db/                # Database configuration
│   ├── models/            # SQLAlchemy models
│   ├── schemas/           # Pydantic schemas
│   ├── services/          # Business logic
│   └── utils/             # Utility functions
├── cli.py                 # CLI entry point
├── docker-compose.yml     # Docker Compose configuration
├── Dockerfile             # Docker configuration
├── main.py                # API entry point
├── migrations/            # Alembic migrations
│   ├── env.py             # Alembic environment
│   ├── script.py.mako     # Alembic script template
│   └── versions/          # Migration scripts
├── requirements.txt       # Dependencies
└── storage/               # Storage directory for the database and other files
    └── db/                # Database directory
```
### Running Tests

```bash
# Run tests
pytest
```
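
Standard pytest options apply as usual; for example (generic pytest flags, nothing project-specific):

```bash
# Verbose output, stop at the first failure
pytest -v -x

# Run only tests whose names match a keyword
pytest -k scrape
```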
## License

This project is open source.