There are several roles you can play in Open Data

Diagram of Open Data Roles

This talk will focus on the API Developer role

Diagram of Open Data Roles with API Developer highlighted

Focus on government transparency is not new

Timeline of Open Government Ideas in Law
  • Idea that government should be open to public scrutiny is long established
  • Started in the US with the First Amendment (1791): Freedom of the Press
  • Focus in the US picked up with the Freedom of Information Act (1966)
  • Renewed (recent) focus with the Open Government Initiative (2009-present)
  • Open Government Initiative has specifically given focus to data
source: en.wikipedia.org/wiki/Open_government#History

So where's the data now?

Piles of Files by Malcolm Craig, on Flickr

Why?

  • Many data exist only in files on individuals' desktops or shared drives
  • Within an agency there are many data owners, each owning a portion of enterprise data
  • Government agencies need to determine what data can be made public
    • Legitimate concerns about citizen privacy and national security
    • Some information is pre-decisional or could have adverse impacts
    • Review processes are time-consuming and costly
  • Once review processes are complete, agencies release data as quickly as possible, often without any modification of format
  • Many researchers request bulk data formats, which agencies interpret as uploading flat files to their web servers

A Short Trip Back to 2007

A look at what we mean by "open" and how APIs help

8 Principles for Open Government Data (2007)

Drafted by Open Government Working Group, December 7-8, 2007.

Government data shall be considered open if it is made public in a way that complies with the principles below:

Complete All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
Primary Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
Timely Data is made available as quickly as necessary to preserve the value of the data.
Accessible Data is available to the widest range of users for the widest range of purposes.
Machine processable Data is reasonably structured to allow automated processing.
Non-discriminatory Data is available to anyone, with no requirement of registration.
Non-proprietary Data is available in a format over which no entity has exclusive control.
License-free Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
source: (1) public.resource.org/open_government_meeting.html (2) public.resource.org/8_principles.html

A well-designed API can easily help with 5 of 8

The five principles covered on the following slides: Primary, Timely, Accessible, Machine processable, Non-proprietary.

Open Data Principle: Primary

Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.

APIs can help agencies to deliver more granular data by:

  • Connecting directly to transactional data stores
  • Connecting indirectly to transactional data stores via ETL processes
  • Allowing partial retrieval of data customized to the user's needs
  • Allowing retrieval of aggregate data via on-the-fly computation
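
As a rough illustration of the last two points, partial retrieval and on-the-fly aggregation can be sketched in plain Ruby over an in-memory list of records. The field names (agency, year, amount), the sample values, and the helper names are all made up for this sketch; a real API would query a data store instead.

```ruby
# Hypothetical records; field names and values are invented for illustration.
records = [
  { agency: "Agency A", year: 2012, amount: 100 },
  { agency: "Agency A", year: 2013, amount: 120 },
  { agency: "Agency B", year: 2013, amount: 50 }
]

# Partial retrieval: keep only rows matching the filters,
# and only the requested fields of each row.
def retrieve(records, filters: {}, fields: nil)
  rows = records.select { |r| filters.all? { |k, v| r[k] == v } }
  fields ? rows.map { |r| r.select { |k, _| fields.include?(k) } } : rows
end

# On-the-fly aggregation: sum one field, grouped by another.
def total_by(records, group_key, sum_key)
  records.group_by { |r| r[group_key] }
         .transform_values { |rows| rows.sum { |r| r[sum_key] } }
end

subset = retrieve(records, filters: { year: 2013 }, fields: [:agency, :amount])
totals = total_by(records, :agency, :amount)
```

Here subset holds only the 2013 rows with two fields each, and totals maps each agency to its summed amount.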

Open Data Principle: Timely

Data is made available as quickly as necessary to preserve the value of the data.

APIs can help agencies to deliver data more quickly by:

  • Automating the release of data through intelligent marking of database records
  • Pulling data from transactional data stores without human intervention
  • Reducing time required to aggregate data by providing on-the-fly computation in the API
  • Easing the burden of disclosure review processes by enabling automated testing via a private version of the same API
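
One way to read "intelligent marking of database records" is a release flag on each record that the public API filters on, so clearing a record publishes it without a separate export step. A minimal sketch, assuming a hypothetical :cleared flag and invented records:

```ruby
# Hypothetical records; the :cleared flag stands in for whatever marking
# the disclosure review process sets once a record may be published.
records = [
  { id: 1, amount: 100, cleared: true },
  { id: 2, amount: 250, cleared: false },
  { id: 3, amount: 75,  cleared: true }
]

# The public view returns only cleared records and hides the flag itself.
public_view = lambda do |rows|
  rows.select { |r| r[:cleared] }.map { |r| r.reject { |k, _| k == :cleared } }
end

# The private view (for reviewers) returns everything, flag included.
private_view = ->(rows) { rows }

published = public_view.call(records)
under_review = private_view.call(records)

# Clearing record 2 makes it visible on the next request, with no re-export.
records[1][:cleared] = true
published_after_review = public_view.call(records)
```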

Open Data Principle: Accessible

Data is available to the widest range of users for the widest range of purposes.

APIs allow a broad set of users to determine the subset of data they need by:

  • Providing for granularity, subset retrieval, and on-the-fly aggregation
  • Providing several data formats
  • Allowing for simultaneous availability of bulk download and partial or streamed data
  • Providing the opportunity for multiple versions of the same API

Open Data Principle: Machine processable

Data is reasonably structured to allow automated processing.

APIs by definition are intended to enable automated processing:

  • APIs can return a variety of data formats based on parameters in the API call
  • Calls can be RESTful over the web or use heavier web service models
  • Providing API wrappers for common languages is a simple task that can further enable users to do automated processing
  • Agencies can use the same APIs to build visualizations and explanations of the data for their websites and publications
  • Agencies can automate portions of their disclosure review processes by using internal-facing versions of the same APIs
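
To make the first bullet concrete, here is a small sketch of returning the same records in different formats depending on a request parameter. It uses only Ruby's standard json and csv libraries; the render helper, the format names, and the sample records are assumptions for illustration.

```ruby
require 'json'
require 'csv'

# Hypothetical records; field names and values are invented for illustration.
records = [
  { "agency" => "Agency A", "year" => 2013, "amount" => 120 },
  { "agency" => "Agency B", "year" => 2013, "amount" => 50 }
]

# Render the same records in the format named by a request parameter.
def render(records, format)
  case format
  when "json"
    JSON.generate(records)
  when "csv"
    CSV.generate do |csv|
      csv << records.first.keys              # header row
      records.each { |r| csv << r.values }   # one line per record
    end
  else
    raise ArgumentError, "unsupported format: #{format}"
  end
end

as_json = render(records, "json")
as_csv  = render(records, "csv")
```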

Open Data Principle: Non-proprietary

Data is available in a format over which no entity has exclusive control.

Well-designed APIs provide data in a variety of formats:

  • APIs can provide data in well-used open standards-based data formats like JSON, XML, and OData
  • Providing more proprietary formats is also possible where necessary, but does not preclude the open standards-based formats
  • Data format documentation can be baked into the API as a structure or schema call
  • Creation of documentation can be automated so that the API can self-generate and deploy documentation to a website

The OMB Public Budget Database

About the dataset

Dataset: the OMB Public Budget Database

Each year, alongside the President's Budget, the Office of Management and Budget publishes the Public Budget Database, a dataset containing historical numbers and projections summarized from the Budget's published tables.

  • Consists of a User's Guide and three data files
    • Budget Authority
    • Outlays
    • Receipts
  • Data files are available as bulk downloads in CSV and Microsoft XLS formats
  • Data files are comparatively small, and data is aggregated at the account level
  • The unique key for each record in each data file is a composite of 7 or more columns
  • Several limitations on use inherent in the data are discussed in the User's Guide
source: (1) www.whitehouse.gov/omb/budget/Supplemental (2) 2013 Public Budget Database User's Guide, OMB
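
To make the composite-key point concrete, here is a small sketch of building a record key from several columns and checking that it is unique. The column names and rows are placeholders, not the actual Public Budget Database columns (the User's Guide lists those).

```ruby
# Placeholder key columns; the real files use 7 or more columns.
key_columns = [:agency_code, :bureau_code, :account_code, :subfunction_code]

rows = [
  { agency_code: "001", bureau_code: "05", account_code: "0100", subfunction_code: "801", amount: 10 },
  { agency_code: "001", bureau_code: "05", account_code: "0100", subfunction_code: "802", amount: 20 },
  { agency_code: "002", bureau_code: "10", account_code: "0200", subfunction_code: "801", amount: 30 }
]

# Build a composite key by pulling the key columns out of a row, in order.
composite_key = ->(row) { row.values_at(*key_columns) }

# Index rows by composite key; duplicates would show up as groups larger than one.
index = rows.group_by(&composite_key)
duplicates = index.select { |_, group| group.size > 1 }

# Look up one record by its full key.
match = index[["001", "05", "0100", "802"]]
```

An API that addresses single records would need every key column in the request, which is one reason to consider surrogate ids when loading the data.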

Dataset: Quirks and Considerations

  • Data is generally consistent with what's in the Treasury Monthly Statement
    • Corrections to reporting and classification errors
    • Analytical Perspectives volume of Budget addresses inconsistencies
  • Historical records adjusted to match current budget structure
  • Current & budget year estimates prepared by agencies per OMB Circ. A-11
  • Data have sufficient detail to produce totals consistent with those published in the corresponding Budget document
    • Outlay totals by agency, subfunction, and BEA category
    • Receipt totals by source
    • The deficit (on-budget, off-budget, and unified budget basis)
source: 2013 Public Budget Database User's Guide, OMB

Dataset: Limitations

  • Data files do not include
    • Data by object classes
    • Data for program and financing accounts
    • Data for character class (other than grants to State/local governments)
    • Data for personnel summaries
    • Data for credit schedules
  • Account-level details do not include current services estimates (based on proposed Budget only)
  • Receipts data prior to 1982 are aggregates, not true account-level detail
  • Budget authority data prior to 1976 not available
  • Outlay data for 1962-1981 tend to be account-level details
  • Outlay data for Legislative Branch & DoD only bureau-level before 1982
source: 2013 Public Budget Database User's Guide, OMB

Building an Open Data API

Thoughts on what to do and how

Making the API Help with Openness (A Recap)

Principles: Primary, Timely, Accessible, Machine processable, Non-proprietary

  • Easing data release processes, clearance and development
  • Allowing partial retrieval of data
  • Allowing retrieval of aggregated data via on-the-fly computation
  • Providing delivery of data in open data formats
  • Allowing for delivery of data in custom data formats
  • Allowing management of multiple simultaneous versions
  • Allowing access to data via multiple API views
  • Pulling data from a database
  • Automating documentation of data and testing of data access
  • Enabling data use and visualization

Making the API Help with Openness (Examples)

Topics I'll cover

  • Spinning up an API in no time
  • Defining API endpoints
  • Using different data formats
  • Versioning the API
  • Creating different API components
  • Working with the data
  • Connecting to a database
  • Self-generating and API-testing documentation

Note: My API is a work in progress, so nothing here is final. These are just thoughts along the way.

 

My development stack

Grape Rack Ruby

Spinning up an API in no time

Ruby + Rack + Grape

Use github.com/dblock/grape-on-rack as a great starting point.

The API is defined by subclassing the Grape::API class.

module BudgetData
  class API < Grape::API
    ...
  end
end

Add the API to your Rack configuration file (config.ru).

run BudgetData::API

Defining API endpoints

Create resources which contain endpoints.

module BudgetData
  class API < Grape::API
    ...
    resource :outlays do
      desc "Return outlays for the requested year"
      params do
        requires :year, :type => String, :desc => "Fiscal year"
      end
      get 'fy/:year' do  # Responds to route: .../outlays/fy/1994
        Outlays.find(params[:year])
      end
      ...
    end
  end
end

Defining API endpoints: input validation

Define validations on the endpoint parameters.

resource :outlays do
  desc "Return outlays for the requested year"
  params do
    requires :year, :type => String,
             # Regular expression below accepts years 1962-2017 and TQ;
             # \A and \z anchor the whole string (^ and $ only anchor lines)
             :regexp => /\A(196[2-9]|19[7-9][0-9]|200[0-9]|201[0-7]|TQ)\z/,
             :desc => "Fiscal year"
  end
  get 'fy/:year' do  # Responds to route: .../outlays/fy/1994
    Outlays.find(params[:year])
  end
  ...
end
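
To see which values a pattern like this accepts, it can be exercised on its own in plain Ruby. The sketch below anchors the pattern with \A and \z rather than ^ and $, because in Ruby ^ and $ match at line boundaries, so a multi-line string could otherwise slip through. The sample inputs are invented.

```ruby
# Same alternatives as the fiscal-year pattern, anchored to the whole string.
fiscal_year = /\A(196[2-9]|19[7-9][0-9]|200[0-9]|201[0-7]|TQ)\z/

# Line-anchored variant, kept only to show the difference.
line_anchored = /^(196[2-9]|19[7-9][0-9]|200[0-9]|201[0-7]|TQ)$/

samples = ["1962", "1994", "2017", "TQ", "1961", "2018", "94", "1994\nDROP"]
accepted = samples.select { |s| fiscal_year.match?(s) }

# The line-anchored form matches the first line of a multi-line string.
loophole = line_anchored.match?("1994\nDROP")
```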

Using different data formats

Grape, by default, responds to requests for text, JSON and XML. To request one from the API, simply add the file extension to the end of the request.

To limit the formats available, use the format method.

class BudgetData::API < Grape::API
  format :json # Now this API will only respond to requests with JSON format
  ...
end

But we don't want to limit the formats, we just want to specify a default!

class BudgetData::API < Grape::API
  default_format :json # Now, JSON format requires no extension on the path; 
                       # for other formats, specify an extension (.xml, etc)
  ...
end

Using different data formats: custom formatters

For proprietary or unsupported formats, use a custom formatter.

module XlsFormatter
  def self.call(object, env)
    object.to_xls # this method can be defined by 
                  # some other Gem or custom coded
  end
end

class BudgetData::API < Grape::API
  content_type :xls, "application/vnd.ms-excel"
  # The symbol specifies the extension this will respond to (:xls => .xls)
  formatter :xls, XlsFormatter
  ...
end
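
A formatter is just an object that responds to call(object, env) and returns the response body as a string. The sketch below shows that shape with a CSV formatter built on Ruby's standard csv library, called directly rather than through Grape. The CsvFormatter name, the sample rows, and the empty env hash are assumptions for illustration.

```ruby
require 'csv'

# Hypothetical formatter following the same call(object, env) shape as XlsFormatter.
module CsvFormatter
  def self.call(object, env)
    # Accept either a single hash or a list of hashes.
    rows = object.is_a?(Hash) ? [object] : object
    CSV.generate do |csv|
      csv << rows.first.keys                 # header row from the first record
      rows.each { |row| csv << row.values }  # one line per record
    end
  end
end

rows = [
  { "account" => "Sample account 1", "fy2013" => 100 },
  { "account" => "Sample account 2", "fy2013" => 250 }
]
body = CsvFormatter.call(rows, {})
```

Inside the API class it would presumably be registered the same way as the XLS example, with content_type :csv, "text/csv" and formatter :csv, CsvFormatter.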

Versioning the API

We can version the API using the version method.

class BudgetData::APIv1 < Grape::API
  version 'v1', :using => :path # Could be :header or :param
end

You can mount multiple versions in a single API for simultaneous use.

class BudgetData::API < Grape::API
  mount BudgetData::APIv1 # Could also specify a custom prefix via => '/prefix'
  mount BudgetData::APIv2
end

Creating different API components

We can similarly create components of the API.

For example, we can create Outlays, BudgetAuthority, and Receipts.

class BudgetData::OutlaysAPIv1 < Grape::API
  ...
end

class BudgetData::BudgetAuthorityAPIv1 < Grape::API
  ...
end

...

class BudgetData::API < Grape::API
  mount BudgetData::OutlaysAPIv1 => '/outlays/v1'
  mount BudgetData::OutlaysAPIv2 => '/outlays/v2'
  mount BudgetData::BudgetAuthorityAPIv1 => '/budauth/v1'
  mount BudgetData::BudgetAuthorityAPIv2 => '/budauth/v2'
  mount BudgetData::ReceiptsAPIv1 => '/receipts/v1'
  mount BudgetData::ReceiptsAPIv2 => '/receipts/v2'
end
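
Conceptually, mounting routes each request to a sub-application by path prefix. The sketch below imitates that idea with plain Ruby lambdas that follow the Rack calling convention (an env hash in, a [status, headers, body] array out). It is not how Grape implements mount, and the handlers and their responses are invented.

```ruby
# Each "component" is a callable taking a Rack-style env hash.
outlays_v1 = ->(env) { [200, { "content-type" => "text/plain" }, ["outlays v1: #{env['PATH_INFO']}"]] }
budauth_v1 = ->(env) { [200, { "content-type" => "text/plain" }, ["budauth v1: #{env['PATH_INFO']}"]] }

# Prefix => component, in the spirit of `mount Component => '/prefix'`.
mounts = {
  "/outlays/v1" => outlays_v1,
  "/budauth/v1" => budauth_v1
}

# Dispatch to the component whose prefix matches, passing it the remaining path.
router = lambda do |env|
  path = env["PATH_INFO"]
  prefix, app = mounts.find { |p, _| path == p || path.start_with?("#{p}/") }
  return [404, { "content-type" => "text/plain" }, ["not found"]] unless app
  app.call(env.merge("SCRIPT_NAME" => prefix, "PATH_INFO" => path[prefix.length..-1]))
end

status, _headers, body = router.call({ "PATH_INFO" => "/outlays/v1/fy/1994" })
missing_status, = router.call({ "PATH_INFO" => "/receipts/v9/fy/1994" })
```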

Working with the data

Drake (future) + Ruby + CSV + a little SQLite3

To get the data ready, we need to preprocess it. Drake is a new tool for data workflow management.

First, create a workflow file called workflow.d. When ready, call drake in the shell to run.

BASE=/data/budget-db-2013

; Pre-process each CSV to pull metadata into dimension tables and store in DB
budauth.sqlite <- budauth.csv [ruby-script file:./scripts/budauth_fact_dims.rb]
outlays.sqlite <- outlays.csv [ruby-script file:./scripts/outlays_fact_dims.rb]
receipts.sqlite <- receipts.csv [ruby-script file:./scripts/receipts_fact_dims.rb]

; Combine dimension tables
budget_data.sqlite <- budauth.sqlite, outlays.sqlite, receipts.sqlite
  ; run SQL script as shell command, use external executable, etc.

; Check data integrity against CSV files
budget_data_check.log <- budget_data.sqlite, budauth.csv, outlays.csv, receipts.csv
  # Python code goes here...
...

Working with the data: an example script

Drake (future) + Ruby + CSV + a little SQLite3

Then, for each step, define the script. First is budauth_fact_dims.rb (in my case, in Ruby).

require 'csv'
require 'sqlite3'

# Read the CSV, treating its first row as the column headers
budauth_csv = CSV.read ARGV[0], :headers => true
# Strip the file extension from the second argument to get the database name
get_db_name = ->(arg) { parts = arg.split('.'); parts.pop; parts.join('.') }
budauth_db = SQLite3::Database.new get_db_name.call(ARGV[1])

# Create tables and insert data
budauth_db.transaction do |db|
  db.execute "create table budauth_facts ( ... );"
  ...
end
# Insert all rows inside a single transaction
budauth_db.transaction do |db|
  budauth_csv.each do |row|
    # Pull out metadata, save metadata to dimension tables...
    db.execute "insert into budauth_facts values ( ?, ... )", row.fields
  end
end
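
The "pull out metadata, save to dimension tables" step can be pictured without a database. The sketch below splits CSV rows into a dimension lookup and a fact list in memory, using only Ruby's csv library. The column names and sample data are invented and much simpler than the real budget authority file.

```ruby
require 'csv'

# Invented sample data standing in for a budget CSV file.
csv_text = <<~ROWS
  agency_name,account_name,fy2012,fy2013
  Agency A,Account 1,100,120
  Agency A,Account 2,40,45
  Agency B,Account 3,50,50
ROWS

agencies = {}   # dimension table: agency name => surrogate id
facts = []      # fact table: one row per account, referencing the agency id

CSV.parse(csv_text, headers: true).each do |row|
  # Assign the next id the first time an agency name is seen.
  agency_id = (agencies[row["agency_name"]] ||= agencies.size + 1)
  facts << {
    agency_id: agency_id,
    account_name: row["account_name"],
    fy2012: row["fy2012"].to_i,
    fy2013: row["fy2013"].to_i
  }
end
```

Each repeated agency name is stored once in the dimension, and the fact rows carry only its id.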

Connecting to a database: the roadmap

ActiveRecord + Grape::Entity

First, a roadmap of what we're trying to do:

  • Models connect to the database tables via ActiveRecord.
    Model classes subclass ActiveRecord::Base.
  • Entities provide views that expose model fields to the API.
    Entity classes subclass Grape::Entity.
  • Resources are subsets of the API that can be mounted (discussed earlier).
    Resource classes subclass Grape::API.
- api
  * api.rb (main API definition)
  - entities
    * budget_authority_entity.rb (entity definition)
  - models
    * budget_authority.rb (model definition)
  - resources
    * budget_authority_api.rb (resource definition)

Connecting to a database: model classes

ActiveRecord + Standalone Migrations

So, a model might look like:

class BudgetData::BudgetAuthority < ActiveRecord::Base
  self.table_name = 'budauth_facts'
  attr_readonly :agency, :bureau, :bea_category, :budget_year, ...
  has_one :agency
  has_one :bureau
  has_one :bea_category
  ...
  # Class method, so the API can call it on the model directly
  def self.budget_authority_by_year(year)
    where(:budget_year => year)
  end
end

Bonus: the standalone_migrations gem provides the migrations framework used in Rails as a standalone capability.

Connecting to a database: entity classes

ActiveRecord + Grape::Entity

You can define a new Entity class by:

  • subclassing Grape::Entity to create a standalone Entity class
  • adding include Grape::Entity::DSL to a class
  • adding a nested/namespaced entity class that subclasses Grape::Entity

class BudgetData::BudgetAuthorityEntity < Grape::Entity # standalone entity
  expose :agency, :bureau, :bea_category, :budget_year, ...
  ...
end

class BudgetData::BudgetAuthority < ActiveRecord::Base # nested/namespaced
  ...
  def entity
    Entity.new(self)
  end
  class Entity < Grape::Entity
    expose :agency, :bureau, :bea_category, :budget_year, ...
  end
end

Connecting to a database: using entities

ActiveRecord + Grape::Entity

You can use entity classes in your API via the present method:

class BudgetData::BudgetAuthorityAPI < Grape::API
  version 'v1', :using => :path
  resource :budget_authority do
    desc "Return budget authority for the requested year"
    params do
      requires :year, :type => String,
               :regexp => /\A(196[2-9]|19[7-9][0-9]|200[0-9]|201[0-7]|TQ)\z/,
               :desc => "Fiscal year"
    end
    get 'fy/:year' do  # Responds to route: .../budget_authority/fy/1994
      budauth = BudgetData::BudgetAuthority.budget_authority_by_year(params[:year])
      present budauth, with: BudgetData::BudgetAuthorityEntity
    end
    ...
  end
end
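
The idea behind an entity, a view that exposes only chosen fields of a model, can be sketched in plain Ruby. This is a stand-in to show the concept, not Grape::Entity itself. The SimpleEntity class, the Struct-based model, and the field values are all hypothetical.

```ruby
require 'json'

# Stand-in for a model row; internal_notes is a field we do not want to publish.
BudgetRow = Struct.new(:agency, :bureau, :budget_year, :amount, :internal_notes)

# A tiny entity-like wrapper: it serializes only the exposed fields.
class SimpleEntity
  def self.expose(*fields)
    exposed.concat(fields)
  end

  def self.exposed
    @exposed ||= []
  end

  def initialize(model)
    @model = model
  end

  def to_h
    self.class.exposed.map { |f| [f, @model.public_send(f)] }.to_h
  end
end

class BudgetRowEntity < SimpleEntity
  expose :agency, :bureau, :budget_year, :amount
end

row = BudgetRow.new("Agency A", "Bureau 1", "2013", 120, "pre-decisional")
payload = JSON.generate(BudgetRowEntity.new(row).to_h)
```

Fields that are never exposed, such as internal_notes here, cannot leak into the response even though the model carries them.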

Self-generating documentation

Swagger specification + Grape-Swagger

The grape-swagger gem provides an auto-documenting mechanism for Grape APIs that registers /swagger_doc.json as an API mount point for the documentation.

class BudgetData::API < Grape::API
  mount BudgetData::OutlaysAPIv1 => '/outlays/v1'
  mount BudgetData::OutlaysAPIv2 => '/outlays/v2'
  mount BudgetData::BudgetAuthorityAPIv1 => '/budauth/v1'
  mount BudgetData::BudgetAuthorityAPIv2 => '/budauth/v2'
  mount BudgetData::ReceiptsAPIv1 => '/receipts/v1'
  mount BudgetData::ReceiptsAPIv2 => '/receipts/v2'
  add_swagger_documentation
end

To explore the docs, set up an instance of Swagger UI or use the online demo and enter your localhost URL for the documentation route.

Swagger and Grape-Swagger each have many options. For more on these, see swagger-core and grape-swagger projects, respectively, on GitHub.

API-testing documentation

Apiary.io Blueprints

Apiary provides a service and a custom DSL for describing an API, aimed at collaboration, merging and version management.

The API's Blueprint allows Apiary to generate docs, a debugging proxy and bug reports for your API.

HOST: http://my.budgetdata.host.com/

--- Public Budget Database API v1 ---
---
Generic information about the API v1 goes here.
---

--- Budget Authority ---
GET /budauth/v1/fy/2003
> Accept: application/json
< 200
< Content-Type: application/json
{ ... expected JSON goes here ... }

...

But I don't code in Ruby...

Options exist for other languages!

So now you've built an API... let others find it!

Allowing the community to build API libraries puts the spotlight on developers, increases the chance for competition, and spurs innovation. Developers will rise to the occasion.

So, to the developers out there:

Don't wait around for someone else to build an API for you!

Build it yourself!

Ryan Harvey's avatar

<Thank You!>