Pipelines¶
- class city_scrapers_core.pipelines.AzureDiffPipeline(crawler, output_format)[source]¶
Implements DiffPipeline for Azure Blob Storage
- class city_scrapers_core.pipelines.DefaultValuesPipeline[source]¶
Pipeline for setting default values on scraped Item objects
- class city_scrapers_core.pipelines.DiffPipeline(crawler, output_format)[source]¶
Class for loading and comparing previous feed export results in OCD format. Either merges UIDs for consistency or marks upcoming meetings that no longer appear as cancelled.
Provider-specific backends can be created by subclassing and implementing the load_previous_results method.
- classmethod from_crawler(crawler)[source]¶
Classmethod for creating a pipeline object from a Crawler
- Parameters
crawler (Crawler) – Crawler currently being run
- Raises
ValueError – If an output format is not supplied
- Returns
Instance of DiffPipeline
- load_previous_results()[source]¶
Method that must be implemented for loading previously-scraped results
- Raises
NotImplementedError – Required to be implemented on subclasses
- Return type
List[Mapping]
- Returns
Items previously scraped and loaded from a storage backend
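As an illustration of the subclassing pattern, here is a minimal sketch of a backend that reads a previous feed export from local disk. It is not part of city_scrapers_core: `DiffPipelineStandIn` replaces the real `DiffPipeline` so the example runs without Scrapy, and `LocalFileDiffPipeline`, its simplified constructor, the `feed_path` argument, and the JSON Lines file layout are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import List, Mapping


class DiffPipelineStandIn:
    """Stand-in for city_scrapers_core.pipelines.DiffPipeline (assumption)."""

    def load_previous_results(self) -> List[Mapping]:
        # Mirrors the documented contract: subclasses must override this
        raise NotImplementedError


class LocalFileDiffPipeline(DiffPipelineStandIn):
    """Hypothetical backend that loads a JSON Lines feed export from disk."""

    def __init__(self, feed_path: str):
        # Simplified constructor; the real class takes (crawler, output_format)
        self.feed_path = Path(feed_path)

    def load_previous_results(self) -> List[Mapping]:
        # No previous export means there is nothing to compare against
        if not self.feed_path.exists():
            return []
        with self.feed_path.open() as f:
            # One JSON object per non-empty line
            return [json.loads(line) for line in f if line.strip()]
```

In a real project the subclass would inherit from `DiffPipeline` itself and keep its constructor signature, changing only where the previous results come from.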
- process_item(item, spider)[source]¶
Processes Item objects or general dict-like objects and compares them to previously scraped values.
- Parameters
item (Mapping) – Dict-like item to process from a scraper
spider (Spider) – Spider currently being scraped
- Raises
DropItem – Drops items with IDs that have already been scraped
DropItem – Drops items that are in the past and have already been scraped
- Return type
Mapping
- Returns
Returns the item, merged with previous values if found
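The comparison described above can be sketched roughly as follows. This is one reading of the documented behavior, not the library's implementation: the field names `id`, `uid`, `start`, and `status`, the helper names `diff_item` and `mark_missing_cancelled`, and the `DropItemStandIn` exception (standing in for `scrapy.exceptions.DropItem`) are all assumptions chosen so the example runs on its own.

```python
from datetime import datetime
from typing import Dict, List, Mapping, Set


class DropItemStandIn(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


def diff_item(item: Mapping, previous_by_id: Dict[str, Mapping],
              seen_ids: Set[str], now: datetime) -> dict:
    """Compare one scraped item against previously stored results."""
    item_id = item["id"]
    # Drop items whose ID was already handled during this run
    if item_id in seen_ids:
        raise DropItemStandIn(f"{item_id} has already been scraped")
    seen_ids.add(item_id)

    previous = previous_by_id.get(item_id)
    if previous is None:
        # Nothing stored for this ID, so the item passes through unchanged
        return dict(item)
    # Drop items that are in the past and were already stored
    if datetime.fromisoformat(item["start"]) < now:
        raise DropItemStandIn(f"{item_id} is in the past and already stored")
    # Reuse the stored UID so the item keeps a consistent identifier
    return {**item, "uid": previous["uid"]}


def mark_missing_cancelled(previous_by_id: Dict[str, Mapping],
                           seen_ids: Set[str], now: datetime) -> List[dict]:
    """Return stored upcoming items that did not reappear, marked as cancelled."""
    return [
        {**prev, "status": "cancelled"}
        for prev_id, prev in previous_by_id.items()
        if prev_id not in seen_ids and datetime.fromisoformat(prev["start"]) >= now
    ]
```

Keying the stored results by ID keeps each lookup constant-time, and tracking `seen_ids` during the run is what makes it possible to tell afterwards which upcoming meetings disappeared.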
- class city_scrapers_core.pipelines.GCSDiffPipeline(crawler, output_format)[source]¶
Implements DiffPipeline for Google Cloud Storage
- class city_scrapers_core.pipelines.MeetingPipeline[source]¶
General pipeline for setting some defaults on meetings. Can be subclassed for additional processing.
- class city_scrapers_core.pipelines.OpenCivicDataPipeline[source]¶
Pipeline for transforming Meeting items into the Open Civic Data Event format.
- class city_scrapers_core.pipelines.S3DiffPipeline(crawler, output_format)[source]¶
Implements DiffPipeline for AWS S3
- class city_scrapers_core.pipelines.ValidationPipeline[source]¶
Pipeline for validating whether a scraper’s results match the expected schema.
- close_spider(spider)[source]¶
Run validation report when Spider is closed
- Parameters
spider (Spider) – Spider object being run
- classmethod from_crawler(crawler)[source]¶
Create pipeline from crawler
- Parameters
crawler (Crawler) – Current Crawler object
- Returns
Created pipeline
- open_spider(spider)[source]¶
Set initial item count and error count for tracking
- Parameters
spider (Spider) – Spider object being run
- city_scrapers_core.decorators.ignore_processed(func)[source]¶
Method decorator to ignore processed items passed to pipeline by middleware.
This should be used on the process_item method of any additional custom pipelines used to handle Meeting objects to make sure that dict items passed by DiffPipeline don’t cause issues.
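A minimal sketch of how such a decorator might be applied to a custom pipeline. To stay runnable without Scrapy it uses stand-ins: `ignore_processed_sketch` is an assumed approximation of `ignore_processed` (it passes plain `dict` items through untouched), `MeetingStandIn` is a placeholder for the real `Meeting` item, and `TitleCasePipeline` is a hypothetical custom pipeline.

```python
from collections import UserDict
from functools import wraps


class MeetingStandIn(UserDict):
    """Placeholder for the Meeting item; dict-like but not a dict subclass."""


def ignore_processed_sketch(func):
    """Assumed behavior: skip plain dict items, process everything else."""

    @wraps(func)
    def wrapper(self, item, spider):
        # Plain dicts are treated as already processed and returned as-is
        if isinstance(item, dict):
            return item
        return func(self, item, spider)

    return wrapper


class TitleCasePipeline:
    """Hypothetical custom pipeline that only makes sense for Meeting items."""

    @ignore_processed_sketch
    def process_item(self, item, spider):
        item["title"] = item["title"].title()
        return item
```

With the real library, the pipeline would import `ignore_processed` from `city_scrapers_core.decorators` and decorate `process_item` the same way.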