csvw.datatypes

We model the hierarchy of basic datatypes using a class hierarchy.

Derived datatypes are implemented via csvw.Datatype which is composed of a basic datatype and additional behaviour.

class csvw.datatypes.anyAtomicType[source]

A basic datatype consists of a bag of attributes, most importantly a name which matches the name or alias of one of the CSVW built-in datatypes, and three staticmethods controlling marshalling and unmarshalling of Python objects to strings.

These methods are orchestrated from csvw.Datatype in its read and formatted methods.
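To illustrate the shape described above, here is a minimal sketch of a basic datatype as a class with a name attribute and three staticmethods. The class ToyBoolean and the method names settings, to_python and to_string are assumptions made for this illustration; they are not taken from the csvw source.

```python
class ToyBoolean:
    """Hypothetical stand-in for a basic datatype; not csvw's own class."""

    name = "boolean"  # matches the name of a CSVW built-in datatype

    @staticmethod
    def settings(fmt=None):
        # Turn a "true|false"-style format annotation into conversion keywords.
        true_label, false_label = (fmt or "true|false").split("|")
        return {"true": true_label, "false": false_label}

    @staticmethod
    def to_python(lexical, true="true", false="false"):
        # Unmarshal: string from a CSV cell -> Python object.
        if lexical == true:
            return True
        if lexical == false:
            return False
        raise ValueError(f"invalid lexical value for boolean: {lexical}")

    @staticmethod
    def to_string(value, true="true", false="false"):
        # Marshal: Python object -> string for a CSV cell.
        return true if value else false


# A derived datatype could orchestrate the three methods roughly like this.
kw = ToyBoolean.settings("Yea|Nay")
parsed = ToyBoolean.to_python("Nay", **kw)
written = ToyBoolean.to_string(True, **kw)
```

Keeping the conversion logic in staticmethods means the basic datatype holds no per-column state; in this sketch, everything column-specific travels in the keyword arguments.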

class csvw.datatypes.string[source]: Maps to str.

The lexical and value spaces of xs:string are the set of all possible strings composed of any character allowed in an XML 1.0 document, with no whitespace processing applied.

class csvw.datatypes.anyURI[source]: Maps to rfc3986.URIReference.

This datatype corresponds normatively to the XLink href attribute. Its value space includes the URIs defined by the RFCs 2396 and 2732, but its lexical space doesn’t require the character escapes needed to include non-ASCII characters in URIs.

Note

We normalize URLs according to the rules in RFC 3986 when serializing to str. Thus roundtripping isn’t guaranteed.
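The following sketch shows why normalization prevents exact roundtripping. It uses only the standard library's urllib.parse and applies just one RFC 3986 rule (lowercasing the scheme and host); it is not how csvw or rfc3986 implement normalization, and the helper name normalize_case is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(uri):
    # Lowercase scheme and host only. Assumes the authority carries no
    # userinfo part, since lowercasing netloc would alter that as well.
    parts = urlsplit(uri)
    return urlunsplit(
        parts._replace(scheme=parts.scheme.lower(), netloc=parts.netloc.lower())
    )


original = "HTTP://Example.COM/Path"
normalized = normalize_case(original)
# The normalized form identifies the same resource but is a different string.
```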

class csvw.datatypes.NMTOKEN[source]: Maps to str.

The lexical and value spaces of xs:NMTOKEN are the set of XML 1.0 “name tokens,” i.e., tokens composed of characters, digits, “.”, “:”, “-”, and the characters defined by Unicode, such as “combining” or “extender”.

This type is usually called a “token.”

Valid values include “Snoopy”, “CMS”, “1950-10-04”, or “0836217462”.

Invalid values include “brought classical music to the Peanuts strip” (spaces are forbidden) or “bold,brash” (commas are forbidden).
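The valid and invalid examples above can be checked with a rough approximation of the name-token rule. The regular expression below is an assumption for illustration: Python's \w is not identical to the XML 1.0 name character class (for example, it may reject some combining characters), so treat this as a sketch rather than a conformant validator.

```python
import re

# Approximation: word characters plus ".", ":" and "-".
NMTOKEN_APPROX = re.compile(r"[\w.:\-]+")


def looks_like_nmtoken(value):
    # fullmatch ensures the whole string consists of allowed characters.
    return NMTOKEN_APPROX.fullmatch(value) is not None


valid = ["Snoopy", "CMS", "1950-10-04", "0836217462"]
invalid = ["brought classical music to the Peanuts strip", "bold,brash"]
```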

class csvw.datatypes.base64Binary[source]: Maps to bytes.

class csvw.datatypes.hexBinary[source]: Maps to bytes.

Note

We normalize to uppercase hex digits when serializing to str. Thus, roundtripping is limited.
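The effect of this normalization can be reproduced with the bytes type alone. This is a standard-library sketch of the behaviour, not the csvw code path, and the helper name hex_roundtrip is hypothetical.

```python
def hex_roundtrip(lexical):
    # Parse hex digits into bytes, then serialize with uppercase digits.
    value = bytes.fromhex(lexical)
    return value.hex().upper()


lower_result = hex_roundtrip("0a1b")
upper_result = hex_roundtrip("0A1B")
# Both inputs decode to the same bytes, so the original casing is lost.
```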

class csvw.datatypes.boolean[source]

Maps to bool.

>>> from csvw import Datatype
>>> dt = Datatype.fromvalue({"base": "boolean", "format": "Yea|Nay"})
>>> dt.read('Nay')
False
>>> dt.formatted(True)
'Yea'

class csvw.datatypes.dateTime[source]: Maps to datetime.datetime.

class csvw.datatypes.date[source]: Maps to datetime.datetime (in order to be able to preserve timezone information).

class csvw.datatypes.dateTimeStamp[source]: Maps to datetime.datetime.

class csvw.datatypes._time[source]: Maps to datetime.datetime (in order to be able to preserve timezone information).

class csvw.datatypes.duration[source]: Maps to datetime.timedelta.

class csvw.datatypes.dayTimeDuration[source]: Maps to datetime.timedelta.

class csvw.datatypes.yearMonthDuration[source]: Maps to datetime.timedelta.

class csvw.datatypes.decimal[source]

Maps to decimal.Decimal.

xs:decimal is the datatype that represents the set of all the decimal numbers with arbitrary lengths. Its lexical space allows any number of insignificant leading and trailing zeros (after the decimal point).

There is no support for scientific notations.

Valid values include: “123.456”, “+1234.456”, “-1234.456”, “-.456”, or “-456”.

The following values would be invalid: […] “1234.456E+2” (scientific notation (“E+2”) is forbidden).

XML-Schema restricts the lexical space by disallowing “thousand separator” and forcing the decimal separator to be “.”. But these limitations can be overcome within CSVW using a derived datatype:

>>> from csvw import Datatype
>>> dt = Datatype.fromvalue(
...     {"base": "decimal", "format": {"groupChar": ".", "decimalChar": ","}})
>>> dt.read("1.234,5")
Decimal('1234.5')

Note

While mapping to decimal.Decimal rather than float makes handling of the Python object somewhat cumbersome, it makes sure we can roundtrip values correctly.
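A short standard-library comparison shows what is at stake. It does not involve csvw at all; it only contrasts how decimal.Decimal and float treat the same lexical values.

```python
from decimal import Decimal

lexical = "1.10"

# Decimal keeps the trailing zero, so the string survives a roundtrip.
via_decimal = str(Decimal(lexical))

# float discards the trailing zero when converted back to a string.
via_float = str(float(lexical))

# Decimal arithmetic is exact for decimal fractions; binary floats are not.
decimal_sum_exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
float_sum_exact = 0.1 + 0.2 == 0.3
```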

class csvw.datatypes.integer[source]: Maps to int.

class csvw.datatypes.unsignedInt[source]: Maps to int.

The value space of xs:unsignedInt is the integers between 0 and 4294967295, i.e., the unsigned values that can fit in a word of 32 bits. Its lexical space allows an optional “+” sign and leading zeros before the significant digits.

The decimal point (even when followed only by insignificant zeros) is forbidden.

Valid values include “4294967295”, “0”, “+0000000000000000000005”, or “1”.

Invalid values include “-1” and “1.”.

class csvw.datatypes.unsignedShort[source]: Maps to int.

The value space of xs:unsignedShort is the integers between 0 and 65535, i.e., the unsigned values that can fit in a word of 16 bits. Its lexical space allows an optional “+” sign and leading zeros before the significant digits.

The decimal point (even when followed only by insignificant zeros) is forbidden.

Valid values include “65535”, “0”, “+0000000000000000000005”, or “1”.

Invalid values include “-1” and “1.”.

class csvw.datatypes.unsignedLong[source]: Maps to int.

The value space of xs:unsignedLong is the integers between 0 and 18446744073709551615, i.e., the unsigned values that can fit in a word of 64 bits. Its lexical space allows an optional “+” sign and leading zeros before the significant digits.

The decimal point (even when followed only by insignificant zeros) is forbidden.

Valid values include “18446744073709551615”, “0”, “+0000000000000000000005”, or “1”.

Invalid values include “-1” and “1.”.

class csvw.datatypes.unsignedByte[source]: Maps to int.

The value space of xs:unsignedByte is the integers between 0 and 255, i.e., the unsigned values that can fit in a word of 8 bits. Its lexical space allows an optional “+” sign and leading zeros before the significant digits.

The lexical space does not allow values expressed in other numeration bases (such as hexadecimal, octal, or binary).

The decimal point (even when followed only by insignificant zeros) is forbidden.

Valid values include “255”, “0”, “+0000000000000000000005”, or “1”.

Invalid values include “-1” and “1.”.

class csvw.datatypes.short[source]: Maps to int.

The value space of xs:short is the set of common short integers (16 bits), i.e., the integers between -32768 and 32767; its lexical space allows any number of insignificant leading zeros.

The decimal point (even when followed only by insignificant zeros) is forbidden.

Valid values include “-32768”, “0”, “-0000000000000000000005”, or “32767”.

Invalid values include “32768” and “1.”.

class csvw.datatypes.long[source]: Maps to int.

The value space of xs:long is the set of common double-size integers (64 bits), i.e., the integers between -9223372036854775808 and 9223372036854775807; its lexical space allows any number of insignificant leading zeros.

The decimal point (even when followed only by insignificant zeros) is forbidden.

Valid values for xs:long include “-9223372036854775808”, “0”, “-0000000000000000000005”, or “9223372036854775807”.

Invalid values include “9223372036854775808” and “1.”.

class csvw.datatypes.byte[source]: Maps to int.

The value space of xs:byte is the integers between -128 and 127, i.e., the signed values that can fit in a word of 8 bits. Its lexical space allows an optional sign and leading zeros before the significant digits.

The lexical space does not allow values expressed in other numeration bases (such as hexadecimal, octal, or binary).

Valid values for byte include “27”, “-34”, “+105”, and “0”.

Invalid values include “0A”, “1524”, and “INF”.
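The bounded integer types described above differ only in their value range, so their rules can be sketched with one lookup table and one check. The table BOUNDS and the function is_valid_integer are hypothetical names for this illustration and do not reflect how csvw implements these types. The sketch accepts an optional sign for every type and relies on the range check to reject negative values for the unsigned types.

```python
import re

# Optional sign followed by ASCII digits only: no decimal point, no
# exponent, no other numeration bases.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

# Inclusive (min, max) bounds, derived from the bit widths given above.
BOUNDS = {
    "unsignedByte": (0, 2 ** 8 - 1),
    "unsignedShort": (0, 2 ** 16 - 1),
    "unsignedInt": (0, 2 ** 32 - 1),
    "unsignedLong": (0, 2 ** 64 - 1),
    "byte": (-(2 ** 7), 2 ** 7 - 1),
    "short": (-(2 ** 15), 2 ** 15 - 1),
    "long": (-(2 ** 63), 2 ** 63 - 1),
}


def is_valid_integer(lexical, type_name):
    # First check the lexical form, then the value range.
    if INTEGER_LEXICAL.fullmatch(lexical) is None:
        return False
    low, high = BOUNDS[type_name]
    return low <= int(lexical) <= high
```

For example, is_valid_integer("+0000000000000000000005", "unsignedInt") holds because leading zeros are insignificant, while is_valid_integer("1.", "short") fails at the lexical check.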

class csvw.datatypes.nonNegativeInteger[source]: Maps to int.

class csvw.datatypes.positiveInteger[source]: Maps to int.

class csvw.datatypes.nonPositiveInteger[source]: Maps to int.

class csvw.datatypes.negativeInteger[source]: Maps to int.

class csvw.datatypes._float[source]: Maps to float.

Note

Due to the well-known issues with representing floating point numbers, roundtripping may not work correctly.

See also

https://docs.python.org/3/tutorial/floatingpoint.html

class csvw.datatypes.number[source]: Maps to float.

class csvw.datatypes.double[source]: Maps to float.

class csvw.datatypes.normalizedString[source]: Maps to str.

The lexical space of xs:normalizedString is unconstrained (any valid XML character may be used), and its value space is the set of strings after whitespace replacement (i.e., after any occurrence of #x9 (tab), #xA (linefeed), and #xD (carriage return) have been replaced by an occurrence of #x20 (space) without any whitespace collapsing).

Note

The CSVW test suite (specifically in test036 and test037) requires that normalizedString is also trimmed, i.e. stripped of leading and trailing whitespace. So that’s what we do.
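The two steps, whitespace replacement followed by trimming, can be sketched with plain str methods. This is an illustration of the described behaviour under the assumption that trimming happens after replacement; the function name normalize_string is hypothetical and the code is not taken from csvw.

```python
# Map tab, linefeed and carriage return to a space each, one for one.
WHITESPACE_TO_SPACE = str.maketrans("\t\n\r", "   ")


def normalize_string(value):
    # Replace without collapsing, then strip leading and trailing whitespace.
    return value.translate(WHITESPACE_TO_SPACE).strip()


result = normalize_string("  a\tb\r\nc ")
# Interior runs are preserved: "\r\n" becomes two spaces, not one.
```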

class csvw.datatypes.QName[source]: Maps to str.

class csvw.datatypes.gDay[source]: Maps to str.

class csvw.datatypes.gMonth[source]: Maps to str.

class csvw.datatypes.gMonthDay[source]: Maps to str.

class csvw.datatypes.gYear[source]: Maps to str.

class csvw.datatypes.gYearMonth[source]: Maps to str.

class csvw.datatypes.xml[source]: Maps to str.

class csvw.datatypes.html[source]: Maps to str.

class csvw.datatypes.json[source]

Maps to str, list or dict, i.e. to the result of json.loads.

>>> from csvw import Datatype
>>> dt = Datatype.fromvalue("json")
>>> d = dt.read("{}")
>>> d["a"] = '123'
>>> dt.formatted(d)
'{"a": "123"}'

Additional constraints on JSON data can be imposed by specifying a JSON Schema document as the format annotation:

>>> from csvw import Datatype
>>> dt = Datatype.fromvalue({"base": "json", "format": '{"type": "object"}'})
>>> dt.read('{}')
OrderedDict()
>>> dt.read('4')
...
jsonschema.exceptions.ValidationError: 4 is not of type 'object'
...
ValueError: invalid lexical value for json: 4

Note

To ensure proper roundtripping, we load the JSON strings using the object_pairs_hook=collections.OrderedDict keyword.
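The effect of that keyword can be seen with the standard library alone. The snippet below calls json.loads and json.dumps directly rather than going through csvw, so it only demonstrates the key-order preservation the note refers to.

```python
import collections
import json

lexical = '{"b": 1, "a": 2}'

# Load object members into an OrderedDict, keeping their document order.
value = json.loads(lexical, object_pairs_hook=collections.OrderedDict)

# Serializing again yields the members in the same order.
serialized = json.dumps(value)
```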