Pipelines

class city_scrapers_core.pipelines.AzureDiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for Azure Blob Storage

load_previous_results()[source]

Loads previously scraped items from Azure Blob Storage

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.DefaultValuesPipeline[source]

Pipeline for setting default values on scraped Item objects

process_item(item, spider)[source]

Pipeline hook for setting multiple default values for scraped Item objects

Parameters
  • item (Item) – An individual Item that’s been scraped

  • spider (Spider) – Spider passed to the pipeline

Return type

Item

Returns

Item with defaults set
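
The default-filling idea can be sketched with plain dicts. This is a minimal illustration, not the library's actual code; the specific default keys ("description", "all_day") are assumptions for the example:

```python
# Sketch of the default-filling behavior. The keys in DEFAULTS are
# illustrative assumptions, not the library's actual default values.
DEFAULTS = {"description": "", "all_day": False}

def set_defaults(item: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of the item with any missing keys filled from defaults."""
    # Keys already present on the item take precedence over defaults
    return {**defaults, **item}

meeting = {"title": "Board of Directors"}
filled = set_defaults(meeting)
```

Values the scraper actually set are never overwritten; only missing keys are filled in.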

class city_scrapers_core.pipelines.DiffPipeline(crawler, output_format)[source]

Class for loading and comparing previous feed export results in OCD format. Either merges UIDs for consistency or marks previously upcoming meetings that no longer appear in current results as cancelled.

Provider-specific backends can be created by subclassing and implementing the load_previous_results method.
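
As an illustration of that backend contract, here is a hypothetical local-filesystem backend. It is a standalone sketch (in practice the class would subclass DiffPipeline; the newline-delimited JSON storage format and path handling are assumptions):

```python
import json
from pathlib import Path

class LocalFileDiffPipeline:
    """Hypothetical backend sketch. A real implementation would subclass
    DiffPipeline, as the Azure, GCS, and S3 backends do; shown standalone
    here for illustration only."""

    def __init__(self, path):
        self.path = Path(path)

    def load_previous_results(self):
        """Load previously scraped items from a newline-delimited JSON file."""
        if not self.path.exists():
            # No prior scrape: nothing to diff against
            return []
        with self.path.open() as f:
            return [json.loads(line) for line in f if line.strip()]
```

The method's only obligation is to return a list of mappings representing prior results; DiffPipeline handles the comparison itself.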

classmethod from_crawler(crawler)[source]

Classmethod for creating a pipeline object from a Crawler

Parameters

crawler (Crawler) – Crawler currently being run

Raises

ValueError – Raises an error if an output format is not supplied

Returns

Instance of DiffPipeline

load_previous_results()[source]

Method that must be implemented for loading previously-scraped results

Raises

NotImplementedError – Required to be implemented on subclasses

Return type

List[Mapping]

Returns

Items previously scraped and loaded from a storage backend

process_item(item, spider)[source]

Processes Item objects or general dict-like objects and compares them to previously scraped values.

Parameters
  • item (Mapping) – Dict-like item to process from a scraper

  • spider (Spider) – Spider currently being run

Raises
  • DropItem – Drops items with IDs that have already been scraped

  • DropItem – Drops items that are in the past and have already been scraped

Return type

Mapping

Returns

Returns the item, merged with previous values if found
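
The comparison can be sketched roughly as follows. This is a simplified illustration of the merge-or-cancel idea, not the pipeline's actual code; the field names ("id", "uid", "status") are assumptions:

```python
def merge_item(item, previous_by_id):
    """Sketch: when a previous item shares this item's ID, merge them so
    previously assigned values (e.g. UIDs) are kept for consistency."""
    prev = previous_by_id.get(item["id"])
    if prev:
        # Current values win; missing fields fall back to the prior scrape
        return {**prev, **item}
    return item

def mark_cancelled(previous, current_ids):
    """Sketch: previous items missing from the current scrape get a
    cancelled status."""
    return [
        {**prev, "status": "cancelled"}
        for prev in previous
        if prev["id"] not in current_ids
    ]
```
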

spider_idle(spider)[source]

Adds _previous_results to the spider queue when current results finish

Parameters

spider (Spider) – Spider being scraped

Raises

DontCloseSpider – Keeps the spider open so that prior results are processed

class city_scrapers_core.pipelines.GCSDiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for Google Cloud Storage

load_previous_results()[source]

Loads previously scraped items from Google Cloud Storage

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.MeetingPipeline[source]

General pipeline for setting some defaults on meetings; can be subclassed for additional processing.

process_item(item, spider)[source]

Custom processing to set defaults on a meeting, including cleaning up the title and setting a default end time if one is not provided

Parameters
  • item (Item) – Scraped item passed to pipeline

  • spider (Spider) – Spider passed to the pipeline

Return type

Item

Returns

Processed item
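
The default-end behavior can be sketched as filling in a missing end time a fixed duration after the start. This is an illustrative sketch; the three-hour duration is an assumption, not necessarily the library's actual default:

```python
from datetime import datetime, timedelta

def set_default_end(item, duration=timedelta(hours=3)):
    """Sketch: if the item has no end time, assume one a fixed duration
    after the start. The three-hour default here is an assumption."""
    if not item.get("end"):
        item = {**item, "end": item["start"] + duration}
    return item
```
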

class city_scrapers_core.pipelines.OpenCivicDataPipeline[source]

Pipeline for transforming Meeting items into the Open Civic Data Event format.

create_location(item)[source]

Creates an OCD-formatted location from a scraped item’s data

Parameters

item (Mapping) – Item from which to build the location

Return type

Mapping

Returns

Dict representing the location
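
The general shape of an OCD event location can be sketched as follows. The field names follow the Open Civic Data event schema, but this is a hypothetical simplification, not the pipeline's actual implementation:

```python
def create_location_sketch(item):
    """Hypothetical sketch of building an OCD-style location dict from a
    scraped item's "location" field; not the library's actual code."""
    loc = item.get("location") or {}
    return {
        "url": "",                      # optional venue URL
        "name": loc.get("name", ""),    # human-readable venue name
        "coordinates": None,            # {"latitude": ..., "longitude": ...} when known
    }
```
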

process_item(item, spider)[source]

Takes a dict-like object and converts it into an Open Civic Data Event.

Parameters
  • item (Mapping) – Item to be converted

  • spider (Spider) – Current spider being run

Return type

Mapping

Returns

Dict formatted as an OCD event

class city_scrapers_core.pipelines.S3DiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for AWS S3

load_previous_results()[source]

Loads previously scraped items from AWS S3

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.ValidationPipeline[source]

Pipeline for validating whether a scraper’s results match the expected schema.

close_spider(spider)[source]

Run validation report when Spider is closed

Parameters

spider (Spider) – Spider object being run

classmethod from_crawler(crawler)[source]

Create pipeline from crawler

Parameters

crawler (Crawler) – Current Crawler object

Returns

Created pipeline

open_spider(spider)[source]

Set initial item count and error count for tracking

Parameters

spider (Spider) – Spider object being run

process_item(item, spider)[source]

Check whether each item scraped matches the schema

Parameters
  • item (Mapping) – Item to be processed; ignored if not a Meeting

  • spider (Spider) – Spider object being run

Return type

Mapping

Returns

Item with modifications for validation

validation_report(spider)[source]

Print the results of validating Spider output against a required schema

Parameters

spider (Spider) – Spider object to validate

Raises

ValueError – Raises error if validation fails
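
These pipelines are enabled like any Scrapy item pipeline, via ITEM_PIPELINES in a project's settings module. A hypothetical configuration (the priority numbers are arbitrary examples, not recommended values):

```python
# settings.py (sketch; priorities are illustrative assumptions)
ITEM_PIPELINES = {
    "city_scrapers_core.pipelines.MeetingPipeline": 200,
    "city_scrapers_core.pipelines.ValidationPipeline": 300,
}
```

Lower numbers run earlier, so defaults are set before validation is applied.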

city_scrapers_core.decorators.ignore_processed(func)[source]

Method decorator to ignore processed items passed to pipeline by middleware.

This should be used on the process_item method of any additional custom pipelines that handle Meeting objects, so that dict items passed through by DiffPipeline don’t cause issues.
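
Its behavior can be sketched as a pass-through for items that are not Meeting objects. This is a simplified reimplementation for illustration only; the Meeting stand-in class and the exact type check are assumptions, not the library's code:

```python
from functools import wraps

class Meeting(dict):
    """Stand-in for city_scrapers_core's Meeting item type (assumption)."""

def ignore_processed(func):
    """Sketch: skip processing for already-processed plain dict items,
    returning them unchanged; only Meeting objects are handled."""
    @wraps(func)
    def wrapper(self, item, spider):
        if not isinstance(item, Meeting):
            return item  # pass dict items through untouched
        return func(self, item, spider)
    return wrapper

class TitleCasePipeline:
    """Hypothetical custom pipeline using the decorator."""

    @ignore_processed
    def process_item(self, item, spider):
        item["title"] = item.get("title", "").title()
        return item
```
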