City Scrapers Core
Constants
This package defines several constants to standardize the values of meeting classifications and statuses.
Classification
Meeting classifications describe the type of meeting taking place. The Open Civic Data Event specification requires this field but doesn’t specify allowed values. These categories are based on the meetings we’ve encountered and are an attempt to simplify the information we’re scraping.
For many agencies all of their meetings will have the same classification, but the most common example of needing multiple classifications is a board that holds separate committee meetings. In that case, meetings of the overall board would have the BOARD classification while each committee’s meetings would be classified as COMMITTEE, as in the sketch at the end of this list.
ADVISORY_COMMITTEE
Advisory bodies that typically don’t directly oversee the administration of any governmental functions. Examples would be citizens’ advisory councils or technical advisory committees.
BOARD
Any board of directors or body that oversees an agency or governmental function. In most cases “Board” will be in the name.
CITY_COUNCIL
Any local government legislative body, including county-level agencies like the Cook County Board of Commissioners. This is mainly distinguished from the other classifications in that these meetings consist of elected rather than appointed members.
COMMISSION
Similar to boards, but commissions are typically set up for more focused purposes. This should generally be used if “Commission” is in an agency’s name.
COMMITTEE
Represents a committee of a BOARD or COMMISSION. This will rarely be used as an agency’s default classification, and in most cases will only be set when the meeting title indicates that a committee will be meeting instead of the full body.
FORUM
Any town hall, feedback session, or other meeting where a broad public audience is invited outside of a general public comment period. These meetings usually don’t include binding votes.
POLICE_BEAT
Meetings of police beats; only used for police departments.
NOT_CLASSIFIED
Default value for meetings that don’t fit the other categories. This should almost never be used, even as a default, since most agencies will have a better-fitting default classification when one can’t be clearly determined for a specific meeting.
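As a minimal sketch of the board and committee pattern described above (the _parse_classification helper name is a common convention in City Scrapers projects, not part of this package):

    from city_scrapers_core.constants import BOARD, COMMITTEE

    def _parse_classification(title):
        # Committee meetings are usually flagged in the meeting title;
        # otherwise fall back to the agency's default classification
        if "committee" in title.lower():
            return COMMITTEE
        return BOARD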
Status
All allowed status values come from the Open Civic Data Event specification. In general, these are set in pipelines to handle logic around cancellation, and CANCELLED is the only one you might need to interact with directly outside of testing.
CANCELLED
Indicates that a particular instance of a meeting has been cancelled. This also applies to the originally planned time of a meeting that was rescheduled, because the meeting is no longer occurring at that time and the new time will be treated as a separate meeting.
TENTATIVE
An internal status indicating that a meeting is far enough into the future that the details may change.
CONFIRMED
Indicates that a meeting’s details have been confirmed. In our case this is automatically set when the meeting is in the near future.
PASSED
Meetings that have already happened. This will mostly be set automatically.
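Since statuses are mostly set automatically, spider and pipeline code usually only compares against these constants. A minimal sketch with a hypothetical helper:

    from city_scrapers_core.constants import CANCELLED, PASSED

    def is_active(meeting):
        # A meeting is "active" when it hasn't been cancelled and
        # hasn't already happened
        return meeting["status"] not in (CANCELLED, PASSED)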
Spiders
class city_scrapers_core.spiders.CityScrapersSpider(*args, **kwargs)
Base Spider class for City Scrapers projects. Provides a few utilities for common tasks like creating a meeting ID and checking the status based on meeting details.
get_id(item, identifier=None)
Create an ID for a meeting based on its details, like title and start time, as well as any agency-provided unique identifiers.
Parameters:
- item (Mapping) – Meeting to generate an ID for
- identifier (Optional[str]) – Optional unique meeting identifier if available, defaults to None
Return type: str
Returns: ID string based on meeting details
get_status(item, text='')
Determine the status of a meeting based on its details as well as any additional text that may indicate whether it has been cancelled.
Parameters:
- item (Mapping) – Meeting to get the status for
- text (str) – Any additional text not included in the meeting details that may indicate whether it’s been cancelled, defaults to ""
Return type: str
Returns: Status constant
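A sketch of how these helpers are typically called when building a Meeting in a spider (the CSS selectors, start date, and agency details are hypothetical; the agency and timezone attributes follow City Scrapers project conventions rather than anything required by this class):

    from datetime import datetime

    from city_scrapers_core.items import Meeting
    from city_scrapers_core.spiders import CityScrapersSpider

    class ExampleSpider(CityScrapersSpider):
        name = "example_agency"           # hypothetical spider slug
        agency = "Example Agency"         # conventional in City Scrapers projects
        timezone = "America/Chicago"      # conventional in City Scrapers projects
        start_urls = ["https://example.com/meetings"]

        def parse(self, response):
            for item in response.css(".meeting"):  # hypothetical page structure
                meeting = Meeting(
                    title=item.css(".title::text").get(default=""),
                    start=datetime(2024, 1, 16, 9, 0),  # stand-in for real date parsing
                )
                # get_status accepts extra text that may signal cancellation
                meeting["status"] = self.get_status(meeting, text=item.get())
                meeting["id"] = self.get_id(meeting)
                yield meeting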
class city_scrapers_core.spiders.LegistarSpider(*args, **kwargs)
Subclass of CityScrapersSpider that handles processing Legistar sites, which almost always share the same components and general structure. Any methods that don’t pull the correct values can be overridden.
legistar_links(item)
Pulls relevant links from a Legistar item.
Parameters:
- item (Dict) – Scraped item from Legistar
Return type: List[Dict]
Returns: List of meeting links
legistar_source(item)
Pulls the source URL from a Legistar item. Pulls a specific meeting URL if available, otherwise defaults to the general Legistar calendar page.
Parameters:
- item (Dict) – Scraped item from Legistar
Return type: str
Returns: Source URL
legistar_start(item)
Pulls the start time from a Legistar item.
Parameters:
- item (Dict) – Scraped item from Legistar
Return type: datetime
Returns: Meeting start datetime
parse_legistar(events)
Method to be implemented by Spider classes that will handle the response from Legistar. Functions similarly to parse for other Spider classes.
Parameters:
- events (Iterable[Dict]) – Iterable of dicts of scraped results from Legistar
Raises: NotImplementedError – Must be implemented in subclasses
Return type: Iterable[Meeting]
Returns: Meeting objects that will be passed to pipelines and output
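A sketch of a LegistarSpider subclass implementing parse_legistar with the legistar_* helpers above (the "Name" field and the spider details are assumptions about a particular Legistar site, not guarantees of this package):

    from city_scrapers_core.items import Meeting
    from city_scrapers_core.spiders import LegistarSpider

    class ExampleLegistarSpider(LegistarSpider):
        name = "example_legistar"        # hypothetical spider slug
        agency = "Example Agency"        # conventional in City Scrapers projects
        timezone = "America/Chicago"
        start_urls = ["https://example.legistar.com/Calendar.aspx"]  # hypothetical URL

        def parse_legistar(self, events):
            for event in events:
                meeting = Meeting(
                    title=event.get("Name", ""),     # assumed Legistar column name
                    start=self.legistar_start(event),
                    links=self.legistar_links(event),
                    source=self.legistar_source(event),
                )
                meeting["status"] = self.get_status(meeting)
                meeting["id"] = self.get_id(meeting)
                yield meeting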
Items
class city_scrapers_core.items.Meeting(*args, **kwargs)
Main scrapy Item subclass used for handling meetings.
- clear() → None. Remove all items from D.
- deepcopy() → Return a deepcopy() of this item.
- get(k[, d]) → D[k] if k in D, else d. d defaults to None.
- items() → a set-like object providing a view on D’s items
- keys() → a set-like object providing a view on D’s keys
- pop(k[, d]) → v, remove specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised.
- popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; raise KeyError if D is empty.
- setdefault(k[, d]) → D.get(k, d), also set D[k] = d if k not in D
- update([E, ]**F) → None. Update D from mapping/iterable E and F. If E is present and has a .keys() method, does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k, v in F.items(): D[k] = v.
- values() → an object providing a view on D’s values
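A sketch of constructing a Meeting directly, using the field names conventional across City Scrapers projects (this field list is based on convention and may not exactly match every project’s schema):

    from datetime import datetime

    from city_scrapers_core.constants import BOARD, TENTATIVE
    from city_scrapers_core.items import Meeting

    meeting = Meeting(
        title="Board of Directors",
        description="",
        classification=BOARD,
        status=TENTATIVE,
        start=datetime(2024, 1, 16, 9, 0),
        end=None,
        all_day=False,
        time_notes="",
        location={"name": "City Hall", "address": "121 N LaSalle St, Chicago, IL 60602"},
        links=[{"href": "https://example.com/agenda.pdf", "title": "Agenda"}],
        source="https://example.com/meetings",
    )

Because Meeting subclasses scrapy’s Item, the dict-style methods listed above work as expected; for example, meeting.get("end") returns None here.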
Pipelines
class city_scrapers_core.pipelines.AzureDiffPipeline(crawler, output_format)
Implements DiffPipeline for Azure Blob Storage.
class city_scrapers_core.pipelines.DefaultValuesPipeline
Pipeline for setting default values on scraped Item objects.
class city_scrapers_core.pipelines.DiffPipeline(crawler, output_format)
Class for loading and comparing previous feed export results in OCD format. Either merges UIDs for consistency or marks upcoming meetings that no longer appear as cancelled.
Provider-specific backends can be created by subclassing and implementing the load_previous_results method; see the sketch after this class’s method listing.
classmethod from_crawler(crawler)
Classmethod for creating a pipeline object from a Crawler.
Parameters:
- crawler (Crawler) – Crawler currently being run
Raises: ValueError – Raised if an output format is not supplied
Returns: Instance of DiffPipeline
load_previous_results()
Method that must be implemented for loading previously scraped results.
Raises: NotImplementedError – Required to be implemented on subclasses
Return type: List[Mapping]
Returns: Items previously scraped and loaded from a storage backend
process_item(item, spider)
Processes Item objects or general dict-like objects and compares them to previously scraped values.
Parameters:
- item (Mapping) – Dict-like item to process from a scraper
- spider (Spider) – Spider currently being run
Raises:
- DropItem – Drops items with IDs that have already been scraped
- DropItem – Drops items that are in the past and already scraped
Return type: Mapping
Returns: The item, merged with previous values if found
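As mentioned above, a storage backend only needs to implement load_previous_results. A sketch of a hypothetical backend reading a previous feed export from local disk (the file path and JSON-lines layout are illustrative assumptions, not part of this package):

    import json
    from pathlib import Path
    from typing import List, Mapping

    from city_scrapers_core.pipelines import DiffPipeline

    class LocalFileDiffPipeline(DiffPipeline):
        """Hypothetical DiffPipeline backend for local development."""

        def load_previous_results(self) -> List[Mapping]:
            # Assumes the previous run's export was written to this path
            # as JSON lines, one meeting per line
            path = Path("output/previous.json")
            if not path.exists():
                return []
            with path.open() as f:
                return [json.loads(line) for line in f if line.strip()]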
class city_scrapers_core.pipelines.GCSDiffPipeline(crawler, output_format)
Implements DiffPipeline for Google Cloud Storage.
class city_scrapers_core.pipelines.MeetingPipeline
General pipeline for setting some defaults on meetings; can be subclassed for additional processing.
class city_scrapers_core.pipelines.OpenCivicDataPipeline
Pipeline for transforming Meeting items into the Open Civic Data Event format.
class city_scrapers_core.pipelines.S3DiffPipeline(crawler, output_format)
Implements DiffPipeline for AWS S3.
class city_scrapers_core.pipelines.ValidationPipeline
Pipeline for validating whether a scraper’s results match the expected schema.
close_spider(spider)
Run a validation report when the Spider is closed.
Parameters:
- spider (Spider) – Spider object being run
classmethod from_crawler(crawler)
Create a pipeline from a crawler.
Parameters:
- crawler (Crawler) – Current Crawler object
Returns: Created pipeline
open_spider(spider)
Set initial item count and error count for tracking.
Parameters:
- spider (Spider) – Spider object being run
city_scrapers_core.decorators.ignore_processed(func)
Method decorator to ignore processed items passed to a pipeline by middleware.
This should be used on the process_item method of any additional custom pipelines used to handle Meeting objects, to make sure that dict items passed by DiffPipeline don’t cause issues.
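A sketch of a custom pipeline using the decorator (the whitespace cleanup is an arbitrary example of extra processing; register the class in ITEM_PIPELINES as with any Scrapy pipeline):

    from city_scrapers_core.decorators import ignore_processed

    class CustomMeetingPipeline:
        @ignore_processed
        def process_item(self, item, spider):
            # dict items passed through by DiffPipeline skip this body,
            # so only Meeting objects are modified here
            item["title"] = " ".join(item["title"].split())
            return item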
Extensions
class city_scrapers_core.extensions.StatusExtension(crawler)
Scrapy extension for maintaining an SVG badge for each scraper’s status.
create_status_svg(spider, status)
Format a template status SVG string based on a spider and status information.
Parameters:
- spider (Spider) – Spider to determine the status for
- status (str) – String indicating scraper status, one of "running" or "failing"
Return type: str
Returns: An SVG string formatted for the given spider and status
classmethod from_crawler(crawler)
Generate an extension from a crawler.
Parameters:
- crawler (Crawler) – Current scrapy crawler
spider_closed()
Updates the status SVG with a running status unless the spider has encountered an error, in which case it exits.
spider_error()
Sets the has_error flag on the first spider error and immediately updates the SVG with a "failing" status.
update_status_svg(spider, svg)
Method for updating the status badge SVG for a storage provider. Must be implemented on subclasses.
Parameters:
- spider (Spider) – Spider with the status being tracked
- svg (str) – Templated SVG string
Raises: NotImplementedError – Raised if not implemented on a subclass
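A sketch of a hypothetical StatusExtension subclass that writes badges to local disk rather than a cloud provider (the badges/ directory layout is illustrative):

    from pathlib import Path

    from city_scrapers_core.extensions import StatusExtension

    class LocalStatusExtension(StatusExtension):
        """Hypothetical StatusExtension writing badge SVGs to a local directory."""

        def update_status_svg(self, spider, svg):
            # Write the templated SVG to badges/<spider name>.svg
            badge_dir = Path("badges")
            badge_dir.mkdir(exist_ok=True)
            (badge_dir / f"{spider.name}.svg").write_text(svg)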
class city_scrapers_core.extensions.AzureBlobStatusExtension(crawler)
Implements StatusExtension for Azure Blob Storage. Inherits create_status_svg(), from_crawler(), spider_closed(), and spider_error() from StatusExtension.
class city_scrapers_core.extensions.S3StatusExtension(crawler)
Implements StatusExtension for AWS S3. Inherits create_status_svg(), from_crawler(), spider_closed(), and spider_error() from StatusExtension.
class city_scrapers_core.extensions.GCSStatusExtension(crawler)
Implements StatusExtension for Google Cloud Storage. Inherits create_status_svg(), from_crawler(), spider_closed(), and spider_error() from StatusExtension.
Testing
city_scrapers_core.utils.file_response(file_name, mode='r', url=None)
Create a fake Scrapy HTTP response from an HTML file. Based on https://stackoverflow.com/a/12741030
Parameters:
- file_name (str) – The relative or absolute filename from the tests directory
- url (Optional[str]) – The URL of the response
- mode (str) – The mode the file should be opened with, defaults to "r"
Return type: Union[Response, HtmlResponse, TextResponse]
Returns: A Scrapy HTTP response which can be used for testing
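A sketch of a typical pytest test built on file_response (the fixture file, URL, and spider import are hypothetical, following City Scrapers project conventions):

    from city_scrapers_core.constants import BOARD
    from city_scrapers_core.utils import file_response

    # Hypothetical spider import path following City Scrapers conventions
    from city_scrapers.spiders.chi_plan_commission import ChiPlanCommissionSpider

    def test_classification():
        response = file_response(
            "files/chi_plan_commission.html",  # fixture saved under the tests directory
            url="https://chicago.gov/",
        )
        spider = ChiPlanCommissionSpider()
        items = list(spider.parse(response))
        assert items[0]["classification"] == BOARD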
Commands
City Scrapers has several custom Scrapy commands to streamline common tasks.
genspider
Syntax: scrapy genspider <name> <agency> <start_url>
Example: scrapy genspider chi_planning "Chicago Plan Commission" "https://chicago.gov/"
Scrapy’s genspider command is subclassed for this project to handle creating the boilerplate code.
The command accepts the Spider slug, the full agency name, and a URL that should be initially scraped. It will use this information to create a Spider, an initial pytest test file, and fixtures for the tests. If the site uses Legistar (based on the URL), it will use a separate template specific to Legistar sites that simplifies some common functionality.
The boilerplate files won’t work for all sites, and in particular they won’t cover cases where multiple pages need to be scraped, but they provide a starting point for some setup tasks that can cause confusion.
combinefeeds
Syntax: scrapy combinefeeds
Combines output files written to a storage backend into latest.json, which contains all meetings scraped; upcoming.json, which only includes meetings in the future; and a file for each agency slug (e.g. chi_plan_commission.json) at the top level of the storage backend with the most recently scraped meetings for that agency.
validate
Syntax: scrapy validate <name>
Example: scrapy validate chi_plan_commission
This command runs the ValidationPipeline to ensure that a scraper is returning valid output. It is predominantly used for CI.