Pipelines

class city_scrapers_core.pipelines.AzureDiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for Azure Blob Storage

load_previous_results()[source]

Loads previously scraped items from Azure Blob Storage

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.DefaultValuesPipeline[source]

Pipeline for setting default values on scraped Item objects

process_item(item, spider)[source]

Pipeline hook for setting multiple default values for scraped Item objects

Parameters
  • item (Item) – An individual Item that’s been scraped

  • spider (Spider) – Spider passed to the pipeline

Return type

Item

Returns

Item with defaults set
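
The default-filling idea can be sketched with plain dicts. This is a minimal illustration, not the library's actual code; the specific default keys ("description", "all_day") are assumptions for the example:

```python
# Sketch of the default-filling behavior. The keys in DEFAULTS are
# illustrative assumptions, not the library's actual default values.
DEFAULTS = {"description": "", "all_day": False}

def set_defaults(item: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of the item with any missing keys filled from defaults."""
    # Keys already present on the item take precedence over defaults
    return {**defaults, **item}

meeting = {"title": "Board of Directors"}
filled = set_defaults(meeting)
```

Values the scraper actually set are never overwritten; only missing keys are filled in.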

class city_scrapers_core.pipelines.DiffPipeline(crawler, output_format)[source]

Class for loading and comparing previous feed export results in OCD format. Either merges UIDs for consistency or marks previously upcoming meetings that no longer appear in current results as cancelled.

Provider-specific backends can be created by subclassing and implementing the load_previous_results method.
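
As an illustration of that backend contract, here is a hypothetical local-filesystem backend. It is a standalone sketch (in practice the class would subclass DiffPipeline; the newline-delimited JSON storage format and path handling are assumptions):

```python
import json
from pathlib import Path

class LocalFileDiffPipeline:
    """Hypothetical backend sketch. A real implementation would subclass
    DiffPipeline, as the Azure, GCS, and S3 backends do; shown standalone
    here for illustration only."""

    def __init__(self, path):
        self.path = Path(path)

    def load_previous_results(self):
        """Load previously scraped items from a newline-delimited JSON file."""
        if not self.path.exists():
            # No prior scrape: nothing to diff against
            return []
        with self.path.open() as f:
            return [json.loads(line) for line in f if line.strip()]
```

The method's only obligation is to return a list of mappings representing prior results; DiffPipeline handles the comparison itself.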

classmethod from_crawler(crawler)[source]

Classmethod for creating a pipeline object from a Crawler

Parameters

crawler (Crawler) – Crawler currently being run

Raises

ValueError – Raises an error if an output format is not supplied

Returns

Instance of DiffPipeline

load_previous_results()[source]

Method that must be implemented for loading previously-scraped results

Raises

NotImplementedError – Required to be implemented on subclasses

Return type

List[Mapping]

Returns

Items previously scraped and loaded from a storage backend

process_item(item, spider)[source]

Processes Item objects or general dict-like objects and compares them to previously scraped values.

Parameters
  • item (Mapping) – Dict-like item to process from a scraper

  • spider (Spider) – Spider currently being run

Raises
  • DropItem – Drops items with IDs that have already been scraped

  • DropItem – Drops items that are in the past and have already been scraped

Return type

Mapping

Returns

Returns the item, merged with previous values if found
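
The comparison can be sketched roughly as follows. This is a simplified illustration of the merge-or-cancel idea, not the pipeline's actual code; the field names ("id", "uid", "status") are assumptions:

```python
def merge_item(item, previous_by_id):
    """Sketch: when a previous item shares this item's ID, merge them so
    previously assigned values (e.g. UIDs) are kept for consistency."""
    prev = previous_by_id.get(item["id"])
    if prev:
        # Current values win; missing fields fall back to the prior scrape
        return {**prev, **item}
    return item

def mark_cancelled(previous, current_ids):
    """Sketch: previous items missing from the current scrape get a
    cancelled status."""
    return [
        {**prev, "status": "cancelled"}
        for prev in previous
        if prev["id"] not in current_ids
    ]
```
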

spider_idle(spider)[source]

Adds _previous_results to the spider queue when current results finish

Parameters

spider (Spider) – Spider being scraped

Raises

DontCloseSpider – Keeps the spider open so that prior results are processed

class city_scrapers_core.pipelines.GCSDiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for Google Cloud Storage

load_previous_results()[source]

Loads previously scraped items from Google Cloud Storage

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.MeetingPipeline[source]

General pipeline for setting some defaults on meetings; can be subclassed for additional processing.

process_item(item, spider)[source]

Custom processing to set defaults on a meeting, including cleaning up the title and setting a default end time if one is not provided

Parameters
  • item (Item) – Scraped item passed to pipeline

  • spider (Spider) – Spider passed to the pipeline

Return type

Item

Returns

Processed item
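
The default-end behavior can be sketched as filling in a missing end time a fixed duration after the start. This is an illustrative sketch; the three-hour duration is an assumption, not necessarily the library's actual default:

```python
from datetime import datetime, timedelta

def set_default_end(item, duration=timedelta(hours=3)):
    """Sketch: if the item has no end time, assume one a fixed duration
    after the start. The three-hour default here is an assumption."""
    if not item.get("end"):
        item = {**item, "end": item["start"] + duration}
    return item
```
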

class city_scrapers_core.pipelines.OpenCivicDataPipeline[source]

Pipeline for transforming Meeting items into the Open Civic Data Event format.

create_location(item)[source]

Creates an OCD-formatted location from a scraped item’s data

Parameters

item (Mapping) – Item from which to build the location

Return type

Mapping

Returns

Dict representing the location
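
The general shape of an OCD event location can be sketched as follows. The field names follow the Open Civic Data event schema, but this is a hypothetical simplification, not the pipeline's actual implementation:

```python
def create_location_sketch(item):
    """Hypothetical sketch of building an OCD-style location dict from a
    scraped item's "location" field; not the library's actual code."""
    loc = item.get("location") or {}
    return {
        "url": "",                      # optional venue URL
        "name": loc.get("name", ""),    # human-readable venue name
        "coordinates": None,            # {"latitude": ..., "longitude": ...} when known
    }
```
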

process_item(item, spider)[source]

Takes a dict-like object and converts it into an Open Civic Data Event.

Parameters
  • item (Mapping) – Item to be converted

  • spider (Spider) – Current spider being run

Return type

Mapping

Returns

Dict formatted as an OCD event

class city_scrapers_core.pipelines.S3DiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for AWS S3

load_previous_results()[source]

Loads previously scraped items from AWS S3

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.ValidationPipeline[source]

Pipeline for validating whether a scraper’s results match the expected schema.

close_spider(spider)[source]

Run validation report when Spider is closed

Parameters

spider (Spider) – Spider object being run

classmethod from_crawler(crawler)[source]

Create pipeline from crawler

Parameters

crawler (Crawler) – Current Crawler object

Returns

Created pipeline

open_spider(spider)[source]

Set initial item count and error count for tracking

Parameters

spider (Spider) – Spider object being run

process_item(item, spider)[source]

Check whether each item scraped matches the schema

Parameters
  • item (Mapping) – Item to be processed; ignored if not a Meeting

  • spider (Spider) – Spider object being run

Return type

Mapping

Returns

Item with modifications for validation

validation_report(spider)[source]

Print the results of validating Spider output against a required schema

Parameters

spider (Spider) – Spider object to validate

Raises

ValueError – Raises error if validation fails
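
These pipelines are enabled like any Scrapy item pipeline, via ITEM_PIPELINES in a project's settings module. A hypothetical configuration (the priority numbers are arbitrary examples, not recommended values):

```python
# settings.py (sketch; priorities are illustrative assumptions)
ITEM_PIPELINES = {
    "city_scrapers_core.pipelines.MeetingPipeline": 200,
    "city_scrapers_core.pipelines.ValidationPipeline": 300,
}
```

Lower numbers run earlier, so defaults are set before validation is applied.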

city_scrapers_core.decorators.ignore_processed(func)[source]

Method decorator to ignore processed items passed to pipeline by middleware.

This should be used on the process_item method of any additional custom pipelines that handle Meeting objects, so that dict items passed through by DiffPipeline don’t cause issues.
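
Its behavior can be sketched as a pass-through for items that are not Meeting objects. This is a simplified reimplementation for illustration only; the Meeting stand-in class and the exact type check are assumptions, not the library's code:

```python
from functools import wraps

class Meeting(dict):
    """Stand-in for city_scrapers_core's Meeting item type (assumption)."""

def ignore_processed(func):
    """Sketch: skip processing for already-processed plain dict items,
    returning them unchanged; only Meeting objects are handled."""
    @wraps(func)
    def wrapper(self, item, spider):
        if not isinstance(item, Meeting):
            return item  # pass dict items through untouched
        return func(self, item, spider)
    return wrapper

class TitleCasePipeline:
    """Hypothetical custom pipeline using the decorator."""

    @ignore_processed
    def process_item(self, item, spider):
        item["title"] = item.get("title", "").title()
        return item
```
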