Pipelines
-
class city_scrapers_core.pipelines.AzureDiffPipeline(crawler, output_format)
Implements DiffPipeline for Azure Blob Storage.
-
class city_scrapers_core.pipelines.DefaultValuesPipeline
Pipeline for setting default values on scraped Item objects.
-
class city_scrapers_core.pipelines.DiffPipeline(crawler, output_format)
Class for loading previous feed export results in OCD format and comparing them with the current scrape. It either merges UIDs for consistency or marks upcoming meetings that no longer appear as cancelled.
Provider-specific backends can be created by subclassing and implementing the load_previous_results method.
-
classmethod from_crawler(crawler)
Classmethod for creating a pipeline object from a Crawler.
- Parameters
crawler (Crawler) – Crawler currently being run
- Raises
ValueError – Raises an error if an output format is not supplied
- Returns
Instance of DiffPipeline
-
load_previous_results()
Method that must be implemented for loading previously scraped results.
- Raises
NotImplementedError – Required to be implemented on subclasses
- Return type
List[Mapping]
- Returns
Items previously scraped and loaded from a storage backend
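As a rough illustration of the subclassing pattern described for load_previous_results, the sketch below uses a stand-in base class instead of importing city_scrapers_core, so that it runs on its own. The names DiffPipelineSketch and LocalFileDiffPipeline, the path argument, and the JSON Lines file layout are all assumptions made for this example and are not part of the library.

```python
import json
import os
import tempfile
from typing import List, Mapping


class DiffPipelineSketch:
    """Stand-in for DiffPipeline (assumption: only the documented hook is modeled)."""

    def load_previous_results(self) -> List[Mapping]:
        raise NotImplementedError


class LocalFileDiffPipeline(DiffPipelineSketch):
    """Hypothetical backend that reads earlier results from a JSON Lines file."""

    def __init__(self, path: str):
        self.path = path

    def load_previous_results(self) -> List[Mapping]:
        # No earlier export means there is nothing to compare against.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


# Write two fake OCD-style records, then load them back through the hook.
with tempfile.TemporaryDirectory() as tmp:
    feed = os.path.join(tmp, "previous.json")
    with open(feed, "w", encoding="utf-8") as f:
        f.write(json.dumps({"_id": "a/1", "name": "Board"}) + "\n")
        f.write(json.dumps({"_id": "a/2", "name": "Committee"}) + "\n")
    previous = LocalFileDiffPipeline(feed).load_previous_results()
    missing = LocalFileDiffPipeline(os.path.join(tmp, "none.json")).load_previous_results()
```

A cloud backend would presumably follow the same shape, with the file read swapped for a call to the provider's storage client.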
-
process_item(item, spider)
Processes Item objects or general dict-like objects and compares them to previously scraped values.
- Parameters
item (Mapping) – Dict-like item to process from a scraper
spider (Spider) – Spider currently being scraped
- Raises
DropItem – Drops items with IDs that have already been scraped
DropItem – Drops items that are in the past and already scraped
- Return type
Mapping
- Returns
The item, merged with previous values if found
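The drop-or-merge behavior described for process_item can be sketched roughly as follows. This is a simplified, dependency-free illustration and not the library's implementation: DropItemSketch stands in for Scrapy's DropItem, the "_id" key and the diff_item helper are assumptions, and the handling of past or cancelled meetings is left out.

```python
from typing import Dict, Mapping, Set


class DropItemSketch(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the example has no dependencies."""


def diff_item(item: Mapping, previous_by_id: Dict[str, Mapping], seen_ids: Set[str]) -> dict:
    """Drop repeated IDs; otherwise merge the item over its previous version."""
    item_id = item["_id"]  # assumption: OCD-style items carry an "_id" key
    if item_id in seen_ids:
        raise DropItemSketch(f"Already scraped {item_id}")
    seen_ids.add(item_id)
    # Values from the new scrape win; keys found only in the old record are kept.
    return {**previous_by_id.get(item_id, {}), **item}


previous_by_id = {"a/1": {"_id": "a/1", "name": "Board", "extras": {"uid": "x"}}}
seen: Set[str] = set()
merged = diff_item({"_id": "a/1", "name": "Board of Directors"}, previous_by_id, seen)
```

Passing the same item a second time raises DropItemSketch, which mirrors the documented rule that already-scraped IDs are dropped.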
-
class city_scrapers_core.pipelines.GCSDiffPipeline(crawler, output_format)
Implements DiffPipeline for Google Cloud Storage.
-
class city_scrapers_core.pipelines.MeetingPipeline
General pipeline for setting some defaults on meetings. Can be subclassed for additional processing.
-
class city_scrapers_core.pipelines.OpenCivicDataPipeline
Pipeline for transforming Meeting items into the Open Civic Data Event format.
-
class city_scrapers_core.pipelines.S3DiffPipeline(crawler, output_format)
Implements DiffPipeline for AWS S3.
-
class city_scrapers_core.pipelines.ValidationPipeline
Pipeline for validating whether a scraper’s results match the expected schema.
-
close_spider(spider)
Run validation report when Spider is closed.
- Parameters
spider (Spider) – Spider object being run
-
classmethod from_crawler(crawler)
Create pipeline from crawler.
- Parameters
crawler (Crawler) – Current Crawler object
- Returns
Created pipeline
-
open_spider(spider)
Set initial item count and error count for tracking.
- Parameters
spider (Spider) – Spider object being run
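The open_spider and close_spider hooks above suggest a counter-based lifecycle: reset counts when the spider opens, tally failures per item, and report when it closes. The sketch below illustrates that pattern only. CountingValidationSketch, its REQUIRED_FIELDS check, and the report method are hypothetical and do not reflect the schema or report format that ValidationPipeline actually uses.

```python
from collections import defaultdict


class CountingValidationSketch:
    """Hypothetical pipeline that tracks how many items fail simple checks."""

    REQUIRED_FIELDS = ("title", "start")  # assumption: illustrative schema only

    def open_spider(self, spider):
        # Reset counters at the start of each spider run.
        self.item_count = 0
        self.error_count = defaultdict(int)

    def process_item(self, item, spider):
        self.item_count += 1
        for field in self.REQUIRED_FIELDS:
            if not item.get(field):
                self.error_count[field] += 1
        return item

    def report(self):
        # Share of items that failed each check, keyed by field name.
        if self.item_count == 0:
            return {}
        return {
            field: self.error_count[field] / self.item_count
            for field in self.REQUIRED_FIELDS
        }

    def close_spider(self, spider):
        for field, share in self.report().items():
            print(f"{field}: {share:.0%} of items invalid")


pipeline = CountingValidationSketch()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Board", "start": "2024-01-01T09:00"}, spider=None)
pipeline.process_item({"title": "", "start": "2024-01-02T09:00"}, spider=None)
summary = pipeline.report()
pipeline.close_spider(spider=None)
```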
-
city_scrapers_core.decorators.ignore_processed(func)
Method decorator to ignore processed items passed to pipeline by middleware.
This should be used on the process_item method of any additional custom pipelines used to handle Meeting objects to make sure that dict items passed by DiffPipeline don’t cause issues.
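To show the idea behind such a decorator without depending on the library, here is a minimal sketch that passes plain dict items through untouched and only runs the wrapped method on other objects. The names ignore_dicts, MeetingSketch, and TitlePipeline are stand-ins invented for this example. The isinstance check is an assumption about how processed items might be detected, not the confirmed behavior of ignore_processed.

```python
from functools import wraps


def ignore_dicts(func):
    """Hypothetical analogue of ignore_processed: skip items that are plain dicts."""

    @wraps(func)
    def wrapper(self, item, spider):
        if isinstance(item, dict):
            # Already-processed item: hand it on unchanged.
            return item
        return func(self, item, spider)

    return wrapper


class MeetingSketch:
    """Stand-in for a Meeting object with a single attribute."""

    def __init__(self, title):
        self.title = title


class TitlePipeline:
    @ignore_dicts
    def process_item(self, item, spider):
        # Would fail on a dict, which has no .title attribute.
        item.title = item.title.strip()
        return item


pipeline = TitlePipeline()
processed = {"name": "  Already exported  "}
same = pipeline.process_item(processed, spider=None)
cleaned = pipeline.process_item(MeetingSketch("  Board  "), spider=None)
```

With the real decorator, the equivalent usage would be to apply city_scrapers_core.decorators.ignore_processed to process_item in a custom pipeline.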