CSV on the Web with csvw
Overview
Exploring CSVW described data
>>> from csvw import TableGroup
>>> tg = TableGroup.from_url(
... 'https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures/csv.txt-metadata.json')
>>> len(tg.tables)
1
>>> len(tg.tables[0].tableSchema.columns)
2
>>> tg.tables[0].tableSchema.columns[0].datatype.base
'string'
>>> tg.tables[0].tableSchema.columns[0].name
'ID'
>>> list(tg.tables[0])[0]['ID']
'first'
Creating CSVW described data
>>> from csvw import Table, Column
>>> t = Table(url='data.csv')
>>> t.tableSchema.columns.append(Column.fromvalue(dict(name='ID', datatype='integer')))
>>> t.write([dict(ID=1), dict(ID=2)])
2
>>> t.to_file('data.csv-metadata.json')
PosixPath('data.csv-metadata.json')
>>> list(Table.from_file('data.csv-metadata.json').iterdicts())
[OrderedDict([('ID', 1)]), OrderedDict([('ID', 2)])]
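For orientation, the metadata file written above is a plain JSON document. The following sketch uses only the standard library rather than csvw: it writes a CSV file and a hand-built table description of roughly the same shape. The helper name `write_described_csv` is hypothetical, and the exact JSON emitted by `to_file` may differ (for example in its `@context` value).

```python
import csv
import json
import pathlib
import tempfile


def write_described_csv(directory, rows):
    """Write data.csv plus a minimal, hand-built CSVW table description.

    Hypothetical helper for illustration; not part of csvw.
    """
    directory = pathlib.Path(directory)
    with (directory / 'data.csv').open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['ID'])
        writer.writeheader()
        writer.writerows(rows)

    # A table description names the data file and describes its columns.
    description = {
        '@context': 'http://www.w3.org/ns/csvw',
        'url': 'data.csv',
        'tableSchema': {'columns': [{'name': 'ID', 'datatype': 'integer'}]},
    }
    md_path = directory / 'data.csv-metadata.json'
    md_path.write_text(json.dumps(description, indent=4), encoding='utf-8')
    return md_path


with tempfile.TemporaryDirectory() as tmp:
    md_path = write_described_csv(tmp, [{'ID': 1}, {'ID': 2}])
    loaded = json.loads(md_path.read_text(encoding='utf-8'))
    column_names = [c['name'] for c in loaded['tableSchema']['columns']]
    csv_text = (md_path.parent / 'data.csv').read_text(encoding='utf-8')
```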
Where’s the “on the Web” part?
While it’s nice that we can fetch data descriptions from URLs with TableGroup and Table, CSVW also includes a specification for locating metadata.
This is meant to support use cases where you find the data first - and might be missing out on the description without such mechanisms. This part of the spec is implemented in csvw.CSVW.
>>> from csvw import CSVW
>>> data = CSVW('https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures/csv.txt')
>>> data.t
TableGroup(...)
- class csvw.CSVW(url, md_url=None, validate=False)[source]
Python API to read CSVW described data and convert it to JSON.
- Parameters:
url (str)
md_url (typing.Optional[str])
validate (bool)
- property is_valid: bool
Performs CSVW validation.
Note
For this to also catch problems during metadata location, the CSVW instance must be initialized with validate=True.
- static locate_metadata(url=None)[source]
Implements metadata discovery as specified in §5. Locating Metadata
- Return type:
typing.Tuple[dict, bool]
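As a rough illustration of one step of metadata discovery, the sketch below computes the default candidate locations that the CSVW specification names for a data URL (`{+url}-metadata.json` and `csv-metadata.json`, resolved against the data URL). It uses only the standard library; the function name is hypothetical, and the sketch covers neither `Link` headers nor site-wide configuration, so it is not a stand-in for `locate_metadata`.

```python
from urllib.parse import urljoin

# Default location templates named by the CSVW specification.
DEFAULT_TEMPLATES = ('{+url}-metadata.json', 'csv-metadata.json')


def default_metadata_urls(csv_url):
    """Return candidate metadata URLs for csv_url, in lookup order.

    Hypothetical helper for illustration; not part of csvw.
    """
    candidates = []
    for template in DEFAULT_TEMPLATES:
        expanded = template.replace('{+url}', csv_url)
        # Relative results are resolved against the URL of the data file.
        candidates.append(urljoin(csv_url, expanded))
    return candidates


candidates = default_metadata_urls(
    'https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures/csv.txt')
```

The first candidate is the `csv.txt-metadata.json` URL used in the examples at the top of this page.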
- to_json(minimal=False)[source]
Implements algorithm described in https://w3c.github.io/csvw/csv2json/#standard-mode
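To give a feel for the two output shapes, here is a deliberately simplified sketch of what minimal and standard mode produce for a single table. It is based on a reading of the CSV2JSON document, assumes one header row, and ignores aboutUrl, propertyUrl, virtual columns and nesting; the function names are hypothetical and this is not how `to_json` is implemented.

```python
import json


def to_minimal_json(rows):
    """Minimal mode: one JSON object per row, with null cells omitted."""
    return [{k: v for k, v in row.items() if v is not None} for row in rows]


def to_standard_json(table_url, rows, header_rows=1):
    """Standard mode: the same objects wrapped in table and row context."""
    return {
        'tables': [{
            'url': table_url,
            'row': [
                {
                    # Row URLs use the source line number, so headers count.
                    'url': f'{table_url}#row={rownum + header_rows}',
                    'rownum': rownum,
                    'describes': [obj],
                }
                for rownum, obj in enumerate(to_minimal_json(rows), start=1)
            ],
        }]
    }


rows = [{'ID': 'first', 'note': None}, {'ID': 'second', 'note': 'x'}]
minimal = to_minimal_json(rows)
standard = to_standard_json('https://example.org/csv.txt', rows)
print(json.dumps(standard, indent=2))
```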
Top-level descriptions
The csvw package provides a Python API to read and write CSVW described data. The main building blocks of this API are Python classes representing the top-level objects of a CSVW description.
- class csvw.Table(common_props=_Nothing.NOTHING, at_props=_Nothing.NOTHING, parent=None, aboutUrl=None, datatype=None, default='', lang='und', null=_Nothing.NOTHING, ordered=None, propertyUrl=None, required=None, separator=None, textDirection=None, valueUrl=None, dialect=None, notes=_Nothing.NOTHING, tableDirection='auto', tableSchema=None, transformations=_Nothing.NOTHING, url=None, fname=None, suppressOutput=False)[source]
A table description is an object that describes a table within a CSV file.
Table objects provide access to schema manipulation either by manipulating the tableSchema property directly or via higher-level methods like add_foreign_key().
Table objects also mediate read/write access to the actual data through iterdicts() and write().
- add_foreign_key(colref, ref_resource, ref_colref)[source]
Add a foreign key constraint to tableSchema.foreignKeys.
- Parameters:
colref – Column reference for the foreign key.
ref_resource – Referenced table.
ref_colref – Column reference of the key in the referenced table.
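For reference, a foreign key in CSVW metadata is a small JSON object with a `columnReference` and a `reference` to the target resource. The sketch below builds such an object from the same three arguments using plain dicts; the helper name and the example table and column names are hypothetical, and the code does not call csvw.

```python
def foreign_key_description(colref, ref_resource, ref_colref):
    """Build a CSVW foreign key definition as a plain dict.

    Hypothetical helper for illustration; not part of csvw.
    """
    def as_list(value):
        # Column references may name one column or several.
        return [value] if isinstance(value, str) else list(value)

    return {
        'columnReference': as_list(colref),
        'reference': {
            'resource': ref_resource,
            'columnReference': as_list(ref_colref),
        },
    }


fk = foreign_key_description('parent_id', 'parents.csv', 'ID')
```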
- iterdicts(log=None, with_metadata=False, fname=None, _Row=<class 'collections.OrderedDict'>, strict=True)[source]
Iterate over the rows of the table
Create an iterator that maps the information in each row to a dict whose keys are the column names of the table and whose values are the values in the corresponding table cells, or for virtual columns (which have no values) the valueUrl for that column. This includes columns not specified in the table specification.
Note: If the resolved data filename does not exist but there is a zip file of the form fname + '.zip', the data is read from this file after unzipping.
- Parameters:
log – Logger object (default None) The object that reports parsing errors. If none is given, parsing errors raise ValueError instead.
with_metadata (bool) – (default False) Also yield fname and lineno
fname – file-like, pathlib.Path, or str (default None) The file to be read. Defaults to inheriting from a parent object, if one exists.
strict – Flag signaling whether data is read strictly, i.e. raising ValueError when invalid data is encountered, or not, i.e. only issuing a warning and returning invalid data as str as provided by the underlying DSV reader.
- Return type:
typing.Generator[dict, None, None]
- Returns:
A generator of dicts or triples (fname, lineno, dict) if with_metadata
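The interplay of strict and with_metadata can be illustrated with a small standard-library sketch. It mimics the documented behaviour (typed values, ValueError in strict mode, a warning plus the raw string otherwise, optional (fname, lineno, dict) triples) but is not csvw's implementation; the function name, the `converters` mapping and the `'<memory>'` placeholder are assumptions made for the example.

```python
import csv
import io
import warnings


def iter_typed_rows(text, converters, strict=True, with_metadata=False,
                    fname='<memory>'):
    """Yield one dict per CSV row, converting cells with `converters`."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    for row in reader:
        lineno = reader.line_num  # physical line number, header included
        item = {}
        for name, raw in zip(header, row):
            try:
                item[name] = converters.get(name, str)(raw)
            except ValueError:
                if strict:
                    raise ValueError(
                        f'{fname}:{lineno}: invalid {name} value {raw!r}')
                warnings.warn(
                    f'{fname}:{lineno}: keeping invalid {name} value as str')
                item[name] = raw
        yield (fname, lineno, item) if with_metadata else item


text = 'ID,name\n1,first\nx,second\n'

# Non-strict: the invalid cell is reported and passed through as a string.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    lenient = list(iter_typed_rows(text, {'ID': int}, strict=False))

# Strict: the same input raises.
try:
    list(iter_typed_rows(text, {'ID': int}, strict=True))
    strict_failed = False
except ValueError:
    strict_failed = True

with_meta = list(iter_typed_rows('ID\n7\n', {'ID': int}, with_metadata=True))
```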
- write(items, fname=<object object>, base=None, strict=False, _zipped=False)[source]
Write row items to a CSV file according to the table schema.
- Parameters:
items (typing.Iterable[typing.Union[dict, list, tuple]]) – Iterator of dict storing the data per row.
fname (typing.Union[str, pathlib.Path, None]) – Name of the file to which to write the data.
base (typing.Union[str, pathlib.Path, None]) – Base directory relative to which to interpret table urls.
strict (typing.Optional[bool]) – Flag signaling to use strict mode when writing. This will raise ValueError if any row (dict) passed in items contains unspecified fieldnames.
_zipped (typing.Optional[bool]) – Flag signaling whether the resulting data file should be zipped.
- Return type:
typing.Union[str, int]
- Returns:
The CSV content if fname==None else the number of rows written.
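The dual return value and the strict flag can be pictured with the following standard-library sketch. It handles dict rows only and relies on `csv.DictWriter`'s `extrasaction` option to reject unspecified fieldnames; the function name and its `fieldnames` argument are hypothetical, and csvw's own `write` derives the columns from the table schema instead.

```python
import csv
import io
import pathlib
import tempfile


def write_rows(items, fieldnames, fname=None, strict=False):
    """Return the CSV text if fname is None, else the number of rows written."""
    # In strict mode, unknown keys make DictWriter raise ValueError.
    extras = 'raise' if strict else 'ignore'
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction=extras)
    writer.writeheader()
    count = 0
    for item in items:
        writer.writerow(item)
        count += 1
    if fname is None:
        return buffer.getvalue()
    with open(fname, 'w', newline='', encoding='utf-8') as f:
        f.write(buffer.getvalue())
    return count


content = write_rows([{'ID': 1}, {'ID': 2}], ['ID'])

with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / 'data.csv'
    rows_written = write_rows([{'ID': 1}, {'ID': 2}], ['ID'], fname=target)

try:
    write_rows([{'ID': 1, 'extra': 'x'}], ['ID'], strict=True)
    strict_error = False
except ValueError:
    strict_error = True
```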
- class csvw.TableGroup(common_props=_Nothing.NOTHING, at_props=_Nothing.NOTHING, parent=None, aboutUrl=None, datatype=None, default='', lang='und', null=_Nothing.NOTHING, ordered=None, propertyUrl=None, required=None, separator=None, textDirection=None, valueUrl=None, dialect=None, notes=_Nothing.NOTHING, tableDirection='auto', tableSchema=None, transformations=_Nothing.NOTHING, url=None, fname=None, tables=_Nothing.NOTHING)[source]
A table group description is an object that describes a group of tables.
A TableGroup delegates most of its responsibilities to the Table objects listed in its tables property. For convenience, TableGroup provides methods to
- read data from all tables: TableGroup.read()
- write data for all tables: TableGroup.write()
It also provides a method to check the referential integrity of data in related tables via TableGroup.check_referential_integrity()
- check_referential_integrity(data=None, log=None, strict=False)[source]
Strict validation does not allow for nullable foreign key columns.
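Conceptually, a referential integrity check verifies that every foreign key value occurs among the keys of the referenced table. The sketch below shows that idea on plain lists of dicts. It assumes that strict mode means an empty (None) foreign key value is itself reported, which is one reading of the sentence above; the function, table and column names are hypothetical and the code does not reflect csvw's internals.

```python
def dangling_references(child_rows, fk_column, parent_rows, pk_column,
                        strict=False):
    """Return (row number, value) pairs for foreign keys without a target."""
    known = {row[pk_column] for row in parent_rows}
    problems = []
    for number, row in enumerate(child_rows, start=1):
        value = row.get(fk_column)
        if value is None:
            # Assumption: only strict mode treats a null foreign key as an error.
            if strict:
                problems.append((number, value))
            continue
        if value not in known:
            problems.append((number, value))
    return problems


parents = [{'ID': 'p1'}, {'ID': 'p2'}]
children = [
    {'ID': 'c1', 'parent_id': 'p1'},
    {'ID': 'c2', 'parent_id': None},
    {'ID': 'c3', 'parent_id': 'p9'},
]
lenient_problems = dangling_references(children, 'parent_id', parents, 'ID')
strict_problems = dangling_references(
    children, 'parent_id', parents, 'ID', strict=True)
```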
- copy(dest)[source]
Write a TableGroup’s data and metadata to files relative to dest, adapting the base attribute.
- Parameters:
dest (typing.Union[pathlib.Path, str])
- Returns:
- write(fname, strict=False, _zipped=False, **items)[source]
Write a TableGroup’s data and metadata to files.
- Parameters:
fname (typing.Union[str, pathlib.Path]) – Filename for the metadata file.
items (typing.Iterable[typing.Union[dict, list, tuple]]) – Keyword arguments are used to pass iterables of rows per table, where the table URL is specified as keyword.
strict (typing.Optional[bool])
_zipped (typing.Optional[bool])
Reading and writing top-level descriptions
Both types of objects are recognized as top-level descriptions, i.e. may be encountered as JSON objects on the web or on disk. Thus, they share some factory and serialization methods inherited from a base class:
- class csvw.metadata.TableLike(common_props=_Nothing.NOTHING, at_props=_Nothing.NOTHING, parent=None, aboutUrl=None, datatype=None, default='', lang='und', null=_Nothing.NOTHING, ordered=None, propertyUrl=None, required=None, separator=None, textDirection=None, valueUrl=None, dialect=None, notes=_Nothing.NOTHING, tableDirection='auto', tableSchema=None, transformations=_Nothing.NOTHING, url=None, fname=None)[source]
A CSVW description object as encountered “in the wild”, i.e. identified by URL on the web or as file on disk.
Since TableLike objects may be instantiated from “externally referenced” objects (via file paths or URLs), they have the necessary means to resolve link properties
>>> from csvw import Table, TableGroup, Link
>>> t = Table.from_file('tests/fixtures/csv.txt-table-metadata.json')
>>> Link('abc.txt').resolve(t.base)
PosixPath('tests/fixtures/abc.txt')
>>> tg = TableGroup.from_url(
... 'https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures/'
... 'csv.txt-metadata.json')
>>> str(tg.tables[0].url)
'csv.txt'
>>> tg.tables[0].url.resolve(tg.base)
'https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures/csv.txt'
and URI template properties (see expand()).
- property base: str | Path
The “base” to resolve relative links against.
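The two flavours of base (a filesystem path or a URL string) lead to two flavours of link resolution. A minimal standard-library sketch of that distinction follows; it assumes the base denotes a directory or directory-like URL, and the function name is hypothetical rather than taken from csvw.

```python
import pathlib
from urllib.parse import urljoin


def resolve_link(link, base):
    """Resolve a relative link against a path base or a URL base."""
    if isinstance(base, pathlib.PurePath):
        # File-based descriptions: join below the base directory.
        return base / link
    # URL-based descriptions: make sure the base is treated as a directory.
    if not base.endswith('/'):
        base += '/'
    return urljoin(base, link)


local = resolve_link('abc.txt', pathlib.PurePosixPath('tests/fixtures'))
remote = resolve_link(
    'csv.txt',
    'https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures')
```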
- expand(tmpl, row, _row, _name=None, qname=False, uri=False)[source]
Expand a URITemplate using row, _row and _name as context and resolving the result against TableLike.url.
>>> from csvw import URITemplate, TableGroup
>>> tg = TableGroup.from_url(
... 'https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures/'
... 'csv.txt-metadata.json')
>>> tg.expand(URITemplate('/path?{a}{#b}'), dict(a='1', b='2'), None)
'https://raw.githubusercontent.com/path?1#2'
- Parameters:
tmpl (csvw.metadata.URITemplate)
row (dict)
- Return type:
str
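The result in the example above can be reproduced by hand in two steps: expand the template, then resolve the outcome against the description's URL. The sketch below supports only the simple `{var}` and fragment `{#var}` forms of RFC 6570, which is a small subset of what a real URI template library handles; both function names are hypothetical.

```python
import re
from urllib.parse import quote, urljoin

# Characters that fragment expansion leaves unescaped (RFC 3986 reserved set).
RESERVED = ":/?#[]@!$&'()*+,;="


def expand_simple(template, variables):
    """Expand {var} and {#var} expressions; undefined variables vanish."""
    def substitute(match):
        operator, name = match.group(1), match.group(2)
        value = variables.get(name)
        if value is None:
            return ''
        if operator == '#':
            return '#' + quote(str(value), safe=RESERVED)
        return quote(str(value), safe='')

    return re.sub(r'\{(#?)(\w+)\}', substitute, template)


def expand_and_resolve(template, variables, base_url):
    """Expand the template, then resolve the result against base_url."""
    return urljoin(base_url, expand_simple(template, variables))


expanded = expand_and_resolve(
    '/path?{a}{#b}',
    {'a': '1', 'b': '2'},
    'https://raw.githubusercontent.com/cldf/csvw/master/tests/fixtures/'
    'csv.txt-metadata.json')
```

Because the template starts with `/`, resolution replaces the whole path of the base URL, which is why only the host survives in the result.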
- classmethod from_file(fname, data=None)[source]
Instantiate a CSVW Table or TableGroup description from a metadata file.
- Parameters:
fname (typing.Union[str, pathlib.Path])
- Return type:
- classmethod from_url(url, data=None)[source]
Instantiate a CSVW Table or TableGroup description from a metadata file specified by URL.
- Parameters:
url (str)
- Return type:
- to_file(fname, omit_defaults=True)[source]
Write a CSVW Table or TableGroup description as JSON object to a local file.
- Parameters:
omit_defaults – The CSVW spec specifies defaults for most properties of most description objects. If omit_defaults==True, these properties will be pruned from the JSON object.
fname (typing.Union[str, pathlib.Path])
- Return type:
pathlib.Path
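What pruning defaults amounts to can be sketched on a plain JSON-like dict. The defaults table below is limited to values visible in the constructor signatures on this page (default='', lang='und', tableDirection='auto', suppressOutput=False); the function name is hypothetical and the rules csvw applies may be more fine-grained.

```python
# Property defaults taken from the constructor signatures shown on this page.
DEFAULTS = {
    'default': '',
    'lang': 'und',
    'tableDirection': 'auto',
    'suppressOutput': False,
}


def prune_defaults(obj, defaults=DEFAULTS):
    """Recursively drop properties whose value equals the known default."""
    if isinstance(obj, dict):
        return {
            key: prune_defaults(value, defaults)
            for key, value in obj.items()
            if not (key in defaults and value == defaults[key])
        }
    if isinstance(obj, list):
        return [prune_defaults(value, defaults) for value in obj]
    return obj


description = {
    'url': 'data.csv',
    'tableDirection': 'auto',
    'tableSchema': {
        'columns': [
            {'name': 'ID', 'datatype': 'integer', 'lang': 'und', 'default': ''},
            {'name': 'note', 'lang': 'en', 'suppressOutput': False},
        ]
    },
}
pruned = prune_defaults(description)
```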
Table schema descriptions
A table’s schema is described using a hierarchy of description objects:
- csvw.metadata.Schema - accessed through Table.tableSchema
- csvw.Column - accessed through Table.tableSchema.columns
- csvw.Datatype - accessed through Column.datatype
- class csvw.metadata.Schema(common_props=_Nothing.NOTHING, at_props=_Nothing.NOTHING, parent=None, aboutUrl=None, datatype=None, default='', lang='und', null=_Nothing.NOTHING, ordered=None, propertyUrl=None, required=None, separator=None, textDirection=None, valueUrl=None, columns=_Nothing.NOTHING, foreignKeys=_Nothing.NOTHING, primaryKey=None, rowTitles=_Nothing.NOTHING)[source]
A schema description is an object that encodes the information about a schema, which describes the structure of a table.
- Variables:
columns – list of Column descriptions.
foreignKeys – list of ForeignKey descriptions.
- class csvw.Column(common_props=_Nothing.NOTHING, at_props=_Nothing.NOTHING, parent=None, aboutUrl=None, datatype=None, default='', lang='und', null=_Nothing.NOTHING, ordered=None, propertyUrl=None, required=None, separator=None, textDirection=None, valueUrl=None, name=None, suppressOutput=False, titles=None, virtual=False, number=None)[source]
A column description is an object that describes a single column.
The description provides additional human-readable documentation for a column, as well as additional information that may be used to validate the cells within the column, create a user interface for data entry, or inform conversion into other formats.
- class csvw.Datatype(common_props=_Nothing.NOTHING, at_props=_Nothing.NOTHING, base=None, format=None, length=None, minLength=None, maxLength=None, minimum=None, maximum=None, minInclusive=None, maxInclusive=None, minExclusive=None, maxExclusive=None)[source]
A datatype description
Cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell.
- classmethod fromvalue(v)[source]
- Parameters:
v (typing.Union[str, dict, csvw.metadata.Datatype]) – Initialization data for cls; either a single string that is the main datatype of the values of the cell or a datatype description object, i.e. a dict or a cls instance.
- Return type:
- Returns:
An instance of cls
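The string and dict forms accepted here correspond to the two ways a datatype can be written in CSVW metadata. The following sketch normalizes both into one dict shape; it works on plain dicts rather than Datatype instances, the function name is hypothetical, and the fallback to `'string'` follows the default base named in the CSVW metadata vocabulary.

```python
def normalize_datatype(value):
    """Normalize a datatype given as a name or as a description dict."""
    if isinstance(value, str):
        # Shorthand: just the name of a built-in datatype.
        return {'base': value}
    if isinstance(value, dict):
        normalized = dict(value)  # copy, so the caller's dict is untouched
        normalized.setdefault('base', 'string')
        return normalized
    raise ValueError(f'unsupported datatype value: {value!r}')


from_name = normalize_datatype('integer')
from_dict = normalize_datatype({'base': 'integer', 'minimum': 0})
from_bare_dict = normalize_datatype({'format': '[a-z]+'})
```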
For a list of datatypes supported by csvw, see csvw.datatypes