City Scrapers Core

Constants

This package defines several constants to standardize the values of meeting classifications and status.

Classification

Meeting classifications are used to describe the type of meeting taking place. The Open Civic Data Event specification requires this field but doesn’t specify allowed values. These categories are based off of the meetings we’ve encountered and are an attempt to simplify the information we’re scraping.

For many agencies all of their meetings will have the same classification, but the most common example of needing to use multiple classifications would be boards that hold separate committee meetings. In that case, meetings of the overall board would have the BOARD classification while each committee would be classified with COMMITTEE.

ADVISORY_COMMITTEE

Advisory bodies that typically don’t directly oversee the administration of any governmental functions. Examples would be citizen’s advisory councils or technical advisory committees.

BOARD

Any board of directors or body that oversees an agency or governmental function. In most cases “Board” will be in the name.

CITY_COUNCIL

Any local government legislative body, also including county-level agencies like the Cook County Board of Commissioners. This is mainly distinguished from the others in that meetings with this classification will consist of elected rather than appointed members.

COMMISSION

Similar to boards, but typically commissions are set up for more focused purposes. Should generally be used if “Commission” is in an agency name.

COMMITTEE

Represents a committee of a BOARD or COMMISSION. This will rarely be used as an agency’s default classification, and in most cases will only be set when the meeting title indicates that a committee will be meeting instead of the full body.

FORUM

Any town hall, feedback session, or other type of meeting where a wide audience of the public is invited outside of a general public comment period. These meetings usually don’t include binding votes.

POLICE_BEAT

Meetings of police beats, only used for police departments.

NOT_CLASSIFIED

Default value for meetings that don’t fit into the other categories. This should almost never be used, even as a default, since most agencies have a classification that fits better when one can’t be clearly determined for an individual meeting.
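
As an illustration of the BOARD/COMMITTEE distinction described above, a spider might choose a classification per meeting with a small helper like the following sketch. The _parse_classification helper and the title check are hypothetical; the constants come from city_scrapers_core.constants.

    from city_scrapers_core.constants import BOARD, COMMITTEE

    def _parse_classification(title):
        """Hypothetical helper for a board that also holds committee meetings."""
        # Committee meetings are identified by their title; everything else
        # falls back to the agency's default BOARD classification
        if "committee" in title.lower():
            return COMMITTEE
        return BOARD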

Status

All allowed status values come from the allowed values in the Open Civic Data Event specification. In general, these are set in pipelines to handle logic around cancellation, and CANCELLED is the only one you might need to interact with directly outside of testing.

CANCELLED

Indicates that a particular instance of a meeting has been cancelled. This also applies to the originally planned time of a rescheduled meeting, since the meeting no longer occurs at that time and the new time is treated as a separate meeting.

TENTATIVE

An internal status indicating that a meeting is far enough into the future that the details may change.

CONFIRMED

Indicates that a meeting’s details have been confirmed. In our case this is automatically set when the meeting is in the near future.

PASSED

Meetings that have already happened. Will mostly be set automatically.
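
These constants are most often referenced directly in tests, for example to assert that the original time of a rescheduled meeting was marked as cancelled. A minimal sketch, assuming parsed_items holds the Meeting items returned by a spider under test:

    from city_scrapers_core.constants import CANCELLED, TENTATIVE

    def test_status():
        # The first fixture meeting was rescheduled, so its original time is CANCELLED
        assert parsed_items[0]["status"] == CANCELLED
        # Meetings far enough in the future remain TENTATIVE
        assert parsed_items[-1]["status"] == TENTATIVE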

Spiders

class city_scrapers_core.spiders.CityScrapersSpider(*args, **kwargs)[source]

Base Spider class for City Scrapers projects. Provides a few utilities for common tasks like creating a meeting ID and checking the status based on meeting details.

get_id(item, identifier=None)[source]

Create an ID for a meeting based on its details like title and start time as well as any agency-provided unique identifiers.

Parameters
  • item (Mapping) – Meeting to generate an ID for

  • identifier (Optional[str]) – Optional unique meeting identifier if available, defaults to None

Return type

str

Returns

ID string based on meeting details

get_status(item, text='')[source]

Determine the status of a meeting based off of its details as well as any additional text that may indicate whether it has been cancelled.

Parameters
  • item (Mapping) – Meeting to get the status for

  • text (str) – Any additional text not included in the meeting details that may indicate whether it’s been cancelled, defaults to “”

Return type

str

Returns

Status constant
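
Both helpers are typically called at the end of a spider’s parse loop once the rest of the meeting details have been assembled. A minimal sketch (the scraped detail dict item and its "notes" key are hypothetical):

    # Inside a CityScrapersSpider subclass's parse method, after building `meeting`
    meeting["status"] = self.get_status(meeting, text=item.get("notes", ""))
    meeting["id"] = self.get_id(meeting)
    yield meeting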

class city_scrapers_core.spiders.LegistarSpider(*args, **kwargs)[source]

Subclass of CityScrapersSpider that handles processing Legistar sites, which almost always share the same components and general structure.

Any of its methods that don’t pull the correct values for a particular site can be overridden in a subclass.

legistar_links(item)[source]

Pulls relevant links from a Legistar item

Parameters

item (Dict) – Scraped item from Legistar

Return type

List[Dict]

Returns

List of meeting links

legistar_source(item)[source]

Pulls the source URL from a Legistar item. Pulls a specific meeting URL if available, otherwise defaults to the general Legistar calendar page.

Parameters

item (Dict) – Scraped item from Legistar

Return type

str

Returns

Source URL

legistar_start(item)[source]

Pulls the start time from a Legistar item

Parameters

item (Dict) – Scraped item from Legistar

Return type

datetime

Returns

Meeting start datetime

parse(response)[source]

Creates initial event requests for each queried year.

Parameters

response (Response) – Scrapy response to be ignored

Return type

Iterable[Request]

Returns

Iterable of Request objects for event pages

parse_legistar(events)[source]

Method to be implemented by Spider classes that will handle the response from Legistar. Functions similarly to parse for other Spider classes.

Parameters

events (Iterable[Dict]) – Iterable consisting of a dict of scraped results from Legistar

Raises

NotImplementedError – Must be implemented in subclasses

Return type

Iterable[Meeting]

Returns

Meeting objects that will be passed to pipelines and output
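
A minimal sketch of a LegistarSpider subclass is shown below. The spider name, agency, start URL, and _parse_title helper are hypothetical; legistar_start, legistar_links, and legistar_source are the methods documented above.

    from city_scrapers_core.constants import COMMISSION
    from city_scrapers_core.items import Meeting
    from city_scrapers_core.spiders import LegistarSpider

    class ExampleLegistarSpider(LegistarSpider):
        # Hypothetical agency details for illustration only
        name = "example_legistar"
        agency = "Example Plan Commission"
        timezone = "America/Chicago"
        start_urls = ["https://example.legistar.com/Calendar.aspx"]

        def parse_legistar(self, events):
            for event in events:
                meeting = Meeting(
                    title=self._parse_title(event),  # hypothetical helper
                    description="",
                    classification=COMMISSION,
                    start=self.legistar_start(event),
                    end=None,
                    all_day=False,
                    time_notes="",
                    location={"name": "", "address": ""},
                    links=self.legistar_links(event),
                    source=self.legistar_source(event),
                )
                meeting["status"] = self.get_status(meeting)
                meeting["id"] = self.get_id(meeting)
                yield meeting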

Items

class city_scrapers_core.items.Meeting(*args, **kwargs)[source]

Main scrapy Item subclass used for handling meetings.

clear() → None. Remove all items from D.

deepcopy()

Return a deepcopy() of this item.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.

items() → a set-like object providing a view on D’s items

keys() → a set-like object providing a view on D’s keys

pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k, d), also set D[k] = d if k not in D

update([E, ]**F) → None. Update D from mapping/iterable E and F.

If E is present and has a .keys() method, does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k, v in F.items(): D[k] = v.

values() → an object providing a view on D’s values
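
A Meeting is constructed like any other scrapy Item, with a keyword argument for each field. A minimal sketch using the standard City Scrapers field names (the specific values are illustrative):

    from datetime import datetime

    from city_scrapers_core.constants import BOARD
    from city_scrapers_core.items import Meeting

    meeting = Meeting(
        title="Board of Directors",
        description="",
        classification=BOARD,
        start=datetime(2019, 1, 1, 10),
        end=None,
        all_day=False,
        time_notes="",
        location={"name": "City Hall", "address": "121 N LaSalle St, Chicago, IL 60602"},
        links=[{"href": "https://example.com/agenda.pdf", "title": "Agenda"}],
        source="https://example.com/meetings",
    )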

Pipelines

class city_scrapers_core.pipelines.AzureDiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for Azure Blob Storage

load_previous_results()[source]

Loads previously scraped items on Azure Blob Storage

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.DefaultValuesPipeline[source]

Pipeline for setting default values on scraped Item objects

process_item(item, spider)[source]

Pipeline hook for setting multiple default values for scraped Item objects

Parameters
  • item (Item) – An individual Item that’s been scraped

  • spider (Spider) – Spider passed to the pipeline

Return type

Item

Returns

Item with defaults set

class city_scrapers_core.pipelines.DiffPipeline(crawler, output_format)[source]

Class for loading and comparing previous feed export results in OCD format. Either merges UIDs for consistency or marks upcoming meetings that no longer appear as cancelled.

Provider-specific backends can be created by subclassing and implementing the load_previous_results method.
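
As a sketch of what a provider-specific backend can look like, the subclass below loads previous results from a local JSON Lines file. The CITY_SCRAPERS_DIFF_FILE setting name, the file layout, and the assumption that the crawler is stored on self.crawler are all illustrative; the real backends (AzureDiffPipeline, GCSDiffPipeline, and S3DiffPipeline) read from their respective cloud storage providers.

    import json
    from typing import List, Mapping

    from city_scrapers_core.pipelines import DiffPipeline

    class LocalFileDiffPipeline(DiffPipeline):
        """Sketch of a DiffPipeline backend reading previous results from disk."""

        def load_previous_results(self) -> List[Mapping]:
            # Hypothetical setting pointing at a JSON Lines file of prior output;
            # assumes the crawler passed to __init__ is available as self.crawler
            path = self.crawler.settings.get("CITY_SCRAPERS_DIFF_FILE")
            if not path:
                return []
            with open(path) as f:
                return [json.loads(line) for line in f if line.strip()]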

classmethod from_crawler(crawler)[source]

Classmethod for creating a pipeline object from a Crawler

Parameters

crawler (Crawler) – Crawler currently being run

Raises

ValueError – Raises an error if an output format is not supplied

Returns

Instance of DiffPipeline

load_previous_results()[source]

Method that must be implemented for loading previously-scraped results

Raises

NotImplementedError – Required to be implemented on subclasses

Return type

List[Mapping]

Returns

Items previously scraped and loaded from a storage backend

process_item(item, spider)[source]

Processes Item objects or general dict-like objects and compares them to previously scraped values.

Parameters
  • item (Mapping) – Dict-like item to process from a scraper

  • spider (Spider) – Spider currently being scraped

Raises
  • DropItem – Drops items with IDs that have already been scraped

  • DropItem – Drops items that are in the past and already scraped

Return type

Mapping

Returns

Returns the item, merged with previous values if found

spider_idle(spider)[source]

Adds _previous_results to the spider queue when the current results finish

Parameters

spider (Spider) – Spider being scraped

Raises

DontCloseSpider – Prevents the spider from closing until prior results are processed

class city_scrapers_core.pipelines.GCSDiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for Google Cloud Storage

load_previous_results()[source]

Load previously scraped items on Google Cloud Storage

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.MeetingPipeline[source]

General pipeline for setting some defaults on meetings, can be subclassed for additional processing.

process_item(item, spider)[source]

Custom processing to set defaults on a meeting, including cleaning up the title and setting a default end time if one is not provided

Parameters

item (Item) – Scraped item passed to pipeline

Return type

Item

Returns

Processed item
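
Because MeetingPipeline can be subclassed for additional processing, an agency-specific tweak might look like the following sketch, which strips a hypothetical boilerplate prefix from titles after the default processing runs:

    from city_scrapers_core.decorators import ignore_processed
    from city_scrapers_core.pipelines import MeetingPipeline

    class ExampleMeetingPipeline(MeetingPipeline):
        @ignore_processed
        def process_item(self, item, spider):
            item = super().process_item(item, spider)
            # Hypothetical cleanup of a repeated title prefix
            item["title"] = item["title"].replace("Regular Meeting of the", "").strip()
            return item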

class city_scrapers_core.pipelines.OpenCivicDataPipeline[source]

Pipeline for transforming Meeting items into the Open Civic Data Event format.

create_location(item)[source]

Creates an OCD-formatted location from a scraped item’s data

Parameters

item (Mapping) – Item to create the location from

Return type

Mapping

Returns

Dict of the location

process_item(item, spider)[source]

Takes a dict-like object and converts it into an Open Civic Data Event.

Parameters
  • item (Mapping) – Item to be converted

  • spider (Spider) – Current spider being run

Return type

Mapping

Returns

Dict formatted as an OCD event

class city_scrapers_core.pipelines.S3DiffPipeline(crawler, output_format)[source]

Implements DiffPipeline for AWS S3

load_previous_results()[source]

Load previously scraped items on AWS S3

Return type

List[Mapping]

Returns

Previously scraped results

class city_scrapers_core.pipelines.ValidationPipeline[source]

Pipeline for validating whether a scraper’s results match the expected schema.

close_spider(spider)[source]

Run validation report when Spider is closed

Parameters

spider (Spider) – Spider object being run

classmethod from_crawler(crawler)[source]

Create pipeline from crawler

Parameters

crawler (Crawler) – Current Crawler object

Returns

Created pipeline

open_spider(spider)[source]

Set initial item count and error count for tracking

Parameters

spider (Spider) – Spider object being run

process_item(item, spider)[source]

Check whether each item scraped matches the schema

Parameters
  • item (Mapping) – Item to be processed, ignored if not Meeting

  • spider (Spider) – Spider object being run

Return type

Mapping

Returns

Item with modifications for validation

validation_report(spider)[source]

Print the results of validating Spider output against a required schema

Parameters

spider (Spider) – Spider object to validate

Raises

ValueError – Raises error if validation fails

city_scrapers_core.decorators.ignore_processed(func)[source]

Method decorator to ignore processed items passed to pipeline by middleware.

This should be used on the process_item method of any additional custom pipelines used to handle Meeting objects to make sure that dict items passed by DiffPipeline don’t cause issues.
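
A custom pipeline using the decorator might look like the following sketch, where dict items re-queued by DiffPipeline are ignored and only Meeting items are modified (the time_notes default is illustrative):

    from city_scrapers_core.decorators import ignore_processed

    class CustomMeetingPipeline:
        @ignore_processed
        def process_item(self, item, spider):
            # Only Meeting items reach this point; dict items passed along by
            # DiffPipeline are ignored by the decorator
            item["time_notes"] = item.get("time_notes") or "Details may change"
            return item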

Extensions

class city_scrapers_core.extensions.StatusExtension(crawler)[source]

Scrapy extension for maintaining an SVG badge for each scraper’s status.

create_status_svg(spider, status)[source]

Format a template status SVG string based on a spider and status information

Parameters
  • spider (Spider) – Spider to determine the status for

  • status (str) – String indicating scraper status, one of “running”, “failing”

Return type

str

Returns

An SVG string formatted for a given spider and status

classmethod from_crawler(crawler)[source]

Generate an extension from a crawler

Parameters

crawler (Crawler) – Current scrapy crawler

spider_closed()[source]

Updates the status SVG with a “running” status unless the spider has encountered an error, in which case it exits without updating

spider_error()[source]

Sets the has_error flag on the first spider error and immediately updates the SVG with a “failing” status

update_status_svg(spider, svg)[source]

Method for updating the status button SVG for a storage provider. Must be implemented on subclasses.

Parameters
  • spider (Spider) – Spider with the status being tracked

  • svg (str) – Templated SVG string

Raises

NotImplementedError – Raises if not implemented on subclass
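
The provider-specific subclasses below implement this for Azure Blob Storage, AWS S3, and Google Cloud Storage. As an illustration, a sketch of a local-filesystem implementation might look like the following (the output directory is hypothetical):

    from pathlib import Path

    from city_scrapers_core.extensions import StatusExtension

    class LocalStatusExtension(StatusExtension):
        """Sketch of a StatusExtension writing badges to a local directory."""

        def update_status_svg(self, spider, svg):
            # Hypothetical output directory; real subclasses write to cloud storage
            output_dir = Path("status_badges")
            output_dir.mkdir(exist_ok=True)
            (output_dir / f"{spider.name}.svg").write_text(svg)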

class city_scrapers_core.extensions.AzureBlobStatusExtension(crawler)[source]

Implements StatusExtension for Azure Blob Storage

create_status_svg(spider, status)

Format a template status SVG string based on a spider and status information

Parameters
  • spider (Spider) – Spider to determine the status for

  • status (str) – String indicating scraper status, one of “running”, “failing”

Return type

str

Returns

An SVG string formatted for a given spider and status

classmethod from_crawler(crawler)

Generate an extension from a crawler

Parameters

crawler (Crawler) – Current scrapy crawler

spider_closed()

Updates the status SVG with a “running” status unless the spider has encountered an error, in which case it exits without updating

spider_error()

Sets the has_error flag on the first spider error and immediately updates the SVG with a “failing” status

update_status_svg(spider, svg)[source]

Implements writing templated status SVG to Azure Blob Storage

Parameters
  • spider (Spider) – Spider with the status being tracked

  • svg (str) – Templated SVG string

class city_scrapers_core.extensions.S3StatusExtension(crawler)[source]

Implements StatusExtension for AWS S3

create_status_svg(spider, status)

Format a template status SVG string based on a spider and status information

Parameters
  • spider (Spider) – Spider to determine the status for

  • status (str) – String indicating scraper status, one of “running”, “failing”

Return type

str

Returns

An SVG string formatted for a given spider and status

classmethod from_crawler(crawler)

Generate an extension from a crawler

Parameters

crawler (Crawler) – Current scrapy crawler

spider_closed()

Updates the status SVG with a “running” status unless the spider has encountered an error, in which case it exits without updating

spider_error()

Sets the has_error flag on the first spider error and immediately updates the SVG with a “failing” status

update_status_svg(spider, svg)[source]

Implements writing templated status SVG to AWS S3

Parameters
  • spider (Spider) – Spider with the status being tracked

  • svg (str) – Templated SVG string

class city_scrapers_core.extensions.GCSStatusExtension(crawler)[source]

Implements StatusExtension for Google Cloud Storage

create_status_svg(spider, status)

Format a template status SVG string based on a spider and status information

Parameters
  • spider (Spider) – Spider to determine the status for

  • status (str) – String indicating scraper status, one of “running”, “failing”

Return type

str

Returns

An SVG string formatted for a given spider and status

classmethod from_crawler(crawler)

Generate an extension from a crawler

Parameters

crawler (Crawler) – Current scrapy crawler

spider_closed()

Updates the status SVG with a “running” status unless the spider has encountered an error, in which case it exits without updating

spider_error()

Sets the has_error flag on the first spider error and immediately updates the SVG with a “failing” status

update_status_svg(spider, svg)[source]

Implements writing templated status SVG to Google Cloud Storage

Parameters
  • spider (Spider) – Spider with the status being tracked

  • svg (str) – Templated SVG string

class city_scrapers_core.extensions.AzureBlobFeedStorage(uri)[source]

Subclass of scrapy.extensions.feedexport.BlockingFeedStorage for writing scraper results to Azure Blob Storage.

Parameters

uri (str) – Azure Blob Storage URL including an account name, credentials, container, and filename

Testing

city_scrapers_core.utils.file_response(file_name, mode='r', url=None)[source]

Create a Scrapy fake HTTP response from an HTML file. Based on https://stackoverflow.com/a/12741030

Parameters
  • file_name (str) – The relative or absolute filename from the tests directory

  • url (Optional[str]) – The URL of the response

  • mode (str) – The mode the file should be opened with, defaults to “r”

Return type

Union[Response, HtmlResponse, TextResponse]

Returns

A scrapy HTTP response which can be used for testing
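
A typical pytest module uses file_response to load a saved copy of an agency’s page and run a spider’s parse method against it. A minimal sketch, where the spider class, import path, fixture filename, and URL are hypothetical:

    from os.path import dirname, join

    from city_scrapers_core.utils import file_response

    # Hypothetical spider from a City Scrapers project
    from city_scrapers.spiders.chi_plan_commission import ChiPlanCommissionSpider

    test_response = file_response(
        join(dirname(__file__), "files", "chi_plan_commission.html"),
        url="https://chicago.gov/",
    )
    spider = ChiPlanCommissionSpider()
    parsed_items = [item for item in spider.parse(test_response)]

    def test_meeting_count():
        assert len(parsed_items) > 0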

Commands

City Scrapers has several custom Scrapy commands to streamline common tasks.

genspider

  • Syntax: scrapy genspider <name> <agency> <start_url>

  • Example: scrapy genspider chi_planning "Chicago Plan Commission" "https://chicago.gov/"

Scrapy’s genspider command is subclassed for this project to handle creating the boilerplate code.

The command accepts the Spider slug, the full agency name, and a URL that should be initially scraped. It will use this information to create a Spider, an initial Pytest test file, and fixtures for the tests. If the site uses Legistar (based on the URL), it will use a separate template specific to Legistar sites that simplifies some common functionality.

The boilerplate files won’t work for all sites, and in particular they won’t cover cases where multiple pages need to be scraped, but they provide a starting point for some setup tasks that can cause confusion.

combinefeeds

  • Syntax: scrapy combinefeeds

Combines output files written to a storage backend into latest.json (all meetings scraped), upcoming.json (only meetings in the future), and a file for each agency slug (i.e. chi_plan_commission.json) at the top level of the storage backend containing the most recently scraped meetings for that agency.

runall

  • Syntax: scrapy runall

This will load all spiders and run them in the same process.

validate

  • Syntax: scrapy validate <name>

  • Example: scrapy validate chi_plan_commission

This command is used to run the ValidationPipeline and ensure that a scraper is returning valid output. This is predominantly used for CI.