Pipelines
-
class city_scrapers_core.pipelines.AzureDiffPipeline(crawler, output_format)
Implements DiffPipeline for Azure Blob Storage
-
class city_scrapers_core.pipelines.DefaultValuesPipeline
Pipeline for setting default values on scraped Item objects
-
class city_scrapers_core.pipelines.DiffPipeline(crawler, output_format)
Class for loading previous feed export results in Open Civic Data (OCD) format and comparing them with newly scraped items. Either merges UIDs for consistency or marks upcoming meetings that no longer appear as cancelled.
Provider-specific backends can be created by subclassing and implementing the load_previous_results method.
-
classmethod from_crawler(crawler)
Classmethod for creating a pipeline object from a Crawler
- Parameters
crawler (Crawler) – Crawler currently being run
- Raises
ValueError – Raises an error if an output format is not supplied
- Returns
Instance of DiffPipeline
-
load_previous_results()
Method that must be implemented for loading previously-scraped results
- Raises
NotImplementedError – Required to be implemented on subclasses
- Return type
List[Mapping]
- Returns
Items previously scraped and loaded from a storage backend
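As a rough illustration of the subclassing pattern described for load_previous_results, the sketch below uses a local stand-in base class rather than the real DiffPipeline, so it runs without Scrapy installed. The names DiffPipelineStandIn and LocalFileDiffPipeline, the feed_dir argument, and the JSON Lines file layout are assumptions made for this example and are not part of city_scrapers_core.

```python
import json
from pathlib import Path
from typing import List, Mapping


class DiffPipelineStandIn:
    """Hypothetical stand-in for DiffPipeline; mirrors only the documented contract."""

    def load_previous_results(self) -> List[Mapping]:
        # Documented behavior: subclasses are required to implement this method.
        raise NotImplementedError


class LocalFileDiffPipeline(DiffPipelineStandIn):
    """Hypothetical backend reading previous results from local JSON Lines files."""

    def __init__(self, feed_dir: str):
        # feed_dir is an assumed argument, not a city_scrapers_core setting.
        self.feed_dir = Path(feed_dir)

    def load_previous_results(self) -> List[Mapping]:
        results: List[Mapping] = []
        # Read every *.json file in the directory, one JSON object per line.
        for path in sorted(self.feed_dir.glob("*.json")):
            with path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        results.append(json.loads(line))
        return results
```

A real backend would subclass DiffPipeline itself and fetch the previous feed export from its storage provider instead of a local directory.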
-
process_item(item, spider)
Processes Item objects or general dict-like objects and compares them to previously scraped values.
- Parameters
item (Mapping) – Dict-like item to process from a scraper
spider (Spider) – Spider currently being scraped
- Raises
DropItem – Drops items with IDs that have been already scraped
DropItem – Drops items that are in the past and already scraped
- Return type
Mapping
- Returns
Returns the item, merged with previous values if found
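The comparison that process_item documents can be pictured with a simplified sketch. This is not the library's implementation. The function name process_item_sketch, the "id" and "start" keys, the merge order, and the DropItemStandIn exception (standing in for Scrapy's DropItem so the code runs without Scrapy) are all assumptions for illustration.

```python
from datetime import datetime
from typing import Dict, Mapping, Set


class DropItemStandIn(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


def process_item_sketch(
    item: Mapping,
    previous_by_id: Dict[str, Mapping],
    seen_ids: Set[str],
    now: datetime,
) -> Mapping:
    """Simplified comparison; the "id" and "start" keys are assumed field names."""
    item_id = item["id"]
    # Drop items whose ID was already handled during this run.
    if item_id in seen_ids:
        raise DropItemStandIn(f"Already scraped: {item_id}")
    previous = previous_by_id.get(item_id)
    # Drop items that are in the past and already present in previous results.
    if previous is not None and item["start"] < now:
        raise DropItemStandIn(f"Past item already scraped: {item_id}")
    seen_ids.add(item_id)
    if previous is None:
        return dict(item)
    # Assumed merge order: previous values fill gaps, fresh values win on conflict.
    return {**previous, **item}
```

In this sketch an upcoming item that matches a previous result keeps any extra keys stored earlier (for example a stable UID), while a new item passes through unchanged.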
-
class city_scrapers_core.pipelines.GCSDiffPipeline(crawler, output_format)
Implements DiffPipeline for Google Cloud Storage
-
class city_scrapers_core.pipelines.MeetingPipeline
General pipeline for setting some defaults on meetings. Can be subclassed for additional processing.
-
class city_scrapers_core.pipelines.OpenCivicDataPipeline
Pipeline for transforming Meeting items into the Open Civic Data Event format.
-
class city_scrapers_core.pipelines.S3DiffPipeline(crawler, output_format)
Implements DiffPipeline for AWS S3
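Pipelines such as these are enabled through Scrapy's ITEM_PIPELINES setting, which maps each class path to an integer order (lower values run first). The fragment below is a hypothetical settings sketch: the class paths come from this page, but the choice of pipelines and the order values are assumptions and should be checked against the project's own settings.

```python
# Hypothetical Scrapy settings fragment; the order values are illustrative assumptions.
ITEM_PIPELINES = {
    # Lower numbers run earlier in Scrapy's item pipeline ordering.
    "city_scrapers_core.pipelines.S3DiffPipeline": 200,
    "city_scrapers_core.pipelines.MeetingPipeline": 300,
    "city_scrapers_core.pipelines.OpenCivicDataPipeline": 400,
}
```

Swapping in AzureDiffPipeline or GCSDiffPipeline would follow the same shape with the other storage provider.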
-
class city_scrapers_core.pipelines.ValidationPipeline
Pipeline for validating whether a scraper’s results match the expected schema.
-
close_spider(spider)
Run validation report when Spider is closed
- Parameters
spider (Spider) – Spider object being run
-
classmethod from_crawler(crawler)
Create pipeline from crawler
- Parameters
crawler (Crawler) – Current Crawler object
- Returns
Created pipeline
-
open_spider(spider)
Set initial item count and error count for tracking
- Parameters
spider (Spider) – Spider object being run
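The open_spider and close_spider hooks described above follow Scrapy's usual pipeline lifecycle: counters are reset when the spider opens, updated per item, and summarized when it closes. The sketch below shows that shape only. The class name ValidationSketch, the required_keys tuple, and the report string are assumptions, and the real ValidationPipeline validates against its own schema.

```python
from typing import Mapping


class ValidationSketch:
    """Hypothetical pipeline showing the open/process/close lifecycle with counters."""

    # Assumed required keys; the real schema is defined by city_scrapers_core.
    required_keys = ("title", "start")

    def open_spider(self, spider) -> None:
        # Reset the counters when the spider starts.
        self.item_count = 0
        self.error_count = 0

    def process_item(self, item: Mapping, spider) -> Mapping:
        self.item_count += 1
        # Count an item as invalid if any assumed required key is missing.
        if any(key not in item for key in self.required_keys):
            self.error_count += 1
        return item

    def close_spider(self, spider) -> str:
        # Build a one-line report once the spider has finished.
        valid = self.item_count - self.error_count
        return f"{valid}/{self.item_count} items passed validation"
```

Scrapy ignores the return value of close_spider. The sketch returns the report only so it can be inspected directly.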
-
city_scrapers_core.decorators.ignore_processed(func)
Method decorator to ignore processed items passed to pipeline by middleware.
This should be used on the process_item method of any additional custom pipelines used to handle Meeting objects to make sure that dict items passed by DiffPipeline don’t cause issues.
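To make the decorator's purpose concrete, the sketch below reimplements the idea with stand-ins so it runs without the library. ignore_processed_sketch, MeetingStandIn, and TitleCasePipeline are hypothetical names, and the rule "pass plain dict items through untouched" is an assumption based on the description above, not a copy of the library's code.

```python
from functools import wraps


def ignore_processed_sketch(func):
    """Illustrative stand-in, not the library's decorator: skip plain dict items."""

    @wraps(func)
    def wrapper(self, item, spider):
        # Assumption: already-processed items arrive as plain dicts; pass them through.
        if isinstance(item, dict):
            return item
        return func(self, item, spider)

    return wrapper


class MeetingStandIn:
    """Hypothetical stand-in for a Meeting item."""

    def __init__(self, title):
        self.title = title


class TitleCasePipeline:
    """Hypothetical custom pipeline that only touches unprocessed items."""

    @ignore_processed_sketch
    def process_item(self, item, spider):
        item.title = item.title.title()
        return item
```

With the library installed, the same pattern would apply ignore_processed from city_scrapers_core.decorators to the process_item method of the custom pipeline.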