Contributing to the Data Science Ontology

From this short guide, you will learn how to contribute new concepts and annotations to the Data Science Ontology. We assume you already understand the basic ideas behind the ontology, as explained in the introductory guide. Here we explain the contribution process and the data format for concepts and annotations.

How to contribute

The Data Science Ontology source is hosted on GitHub. To submit a new concept or annotation, or improve an existing one, you should:

Write the concept or annotation, in the data format described below.
Run the validation script, to ensure it conforms to the schema.
Open a pull request on GitHub.

We welcome contributions of all kinds and we will make every effort to give each pull request a fair and timely review.

Data format

Concepts and annotations are expressed in YAML, a data serialization language designed to be easy for humans to read and write. To simplify machine processing, they are also converted automatically into JSON. The documents can then be loaded directly into a database or processed by other tools, such as Catlab.

Concepts

YAML defines a simple syntax for expressing key-value pairs, reminiscent of the OBO file format popular among biomedical ontologists. For example, the concept read a data table is expressed in YAML as:

schema: concept
id: read-table
name: read table
description: read tabular data from a data source
kind: function
is-a: read-data
inputs:
  - type: tabular-data-source
outputs:
  - type: table
    name: data

The correspondence between the web page and the YAML content should be clear enough.
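
Since concepts are also converted into JSON, it may help to see the same record in that form. The following is a minimal sketch, not the project's actual conversion tooling: it writes the read-table concept by hand as a Python dictionary and serializes it with the standard `json` module. The variable names and the indentation setting are assumptions made for illustration, and the JSON produced by the project's own build may be laid out differently.

```python
import json

# The read-table concept from the YAML example, written by hand as a dict.
# (Illustrative only: the project's own YAML-to-JSON conversion is not shown here.)
concept = {
    "schema": "concept",
    "id": "read-table",
    "name": "read table",
    "description": "read tabular data from a data source",
    "kind": "function",
    "is-a": "read-data",
    "inputs": [{"type": "tabular-data-source"}],
    "outputs": [{"type": "table", "name": "data"}],
}

# Serialize to JSON text; each YAML key-value pair becomes a JSON member.
concept_json = json.dumps(concept, indent=2)
print(concept_json)
```

Each YAML key maps to a JSON object member, and each YAML list (such as inputs and outputs) maps to a JSON array.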

Annotations

Annotations are also expressed in YAML, with a twist provided by the definition field. Recall that an annotation defines a chunk of code by an expression written in the ontology language and built out of the ontology's concepts. The expression trees depicted in the introductory guide are represented as S-expressions in JSON or YAML. For instance, the product of function compositions

product
  compose
    f
    g
  compose
    h
    k

is represented as the S-expression

[ product,
  [ compose, f, g],
  [ compose, h, k] ]
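
To make the link between the nested list and the expression tree concrete, here is a small sketch in Python. It assumes the S-expression has already been loaded from YAML or JSON into nested Python lists of strings, and it treats the first element of each list as the node and the remaining elements as its children. The function name `render_tree` is hypothetical and is not part of the ontology's tooling.

```python
def render_tree(expr, depth=0):
    """Return the lines of an indented tree for a nested-list S-expression."""
    indent = "  " * depth
    if isinstance(expr, str):
        # A bare string is a leaf, e.g. a concept or variable name.
        return [indent + expr]
    # The first element is the node; the rest are its children.
    head, *args = expr
    lines = [indent + head]
    for arg in args:
        lines.extend(render_tree(arg, depth + 1))
    return lines


# The product of function compositions from the text, as nested lists.
expr = ["product", ["compose", "f", "g"], ["compose", "h", "k"]]
tree_lines = render_tree(expr)
print("\n".join(tree_lines))
```

Printing the result shows product at the top level, each compose indented beneath it, and the function names f, g, h and k indented beneath their respective compose.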

As a complete example, here is the YAML source for the annotation read data frame from SQL table:

schema: annotation
language: python
package: pandas
id: read-sql-table
name: read data frame from SQL table
description: read pandas data frame from table in SQL database
function: pandas.io.sql.read_sql_table
kind: function
definition: [
  compose,
  [ construct, [ pair, sql-table-database, sql-table-name ] ],
  read-table
]
inputs:
  - slot: 1
    name: database
  - slot: 0
    name: table-name
    description: name of SQL table
outputs:
  - slot: __return__

Other resources

We hope this brief introduction to the data format is helpful, but it is not an exhaustive reference. The authoritative specification of the data format is given by the JSON schemas for concepts and annotations. Perhaps the easiest way to get started is to look at existing examples of concepts and annotations and adapt them to your purposes. If you get stuck, feel free to ask us questions by opening a GitHub issue.