Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion format/Message.fbs
Original file line number Diff line number Diff line change
Expand Up @@ -28,12 +28,13 @@ table Int {
is_signed: bool;
}

enum Precision:short {SINGLE, DOUBLE}
enum Precision:short {HALF, SINGLE, DOUBLE}

table FloatingPoint {
precision: Precision;
}

/// Unicode with UTF-8 encoding
table Utf8 {
}

Expand Down
258 changes: 258 additions & 0 deletions format/Metadata.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,258 @@
# Metadata: Logical types, schemas, data headers

This is documentation for the Arrow metadata specification, which enables
systems to communicate the

* Logical array types (which are implemented using the physical memory layouts
specified in [Layout.md][1])

* Schemas for table-like collections of Arrow data structures

* "Data headers" indicating the physical locations of memory buffers sufficient
to reconstruct a Arrow data structures without copying memory.

## Canonical implementation

We are using [Flatbuffers][2] for low-overhead reading and writing of the Arrow
metadata. See [Message.fbs][3].

## Schemas

The `Schema` type describes a table-like structure consisting of any number of
Arrow arrays, each of which can be interpreted as a column in the table. A
schema by itself does not describe the physical structure of any particular set
of data.

A schema consists of a sequence of **fields**, which are metadata describing
the columns. The Flatbuffers IDL for a field is:

```
table Field {
// Name is not required, in i.e. a List
name: string;
nullable: bool;
type: Type;
children: [Field];
}
```

The `type` is the logical type of the field. Nested types, such as List,
Struct, and Union, have a sequence of child fields.

## Record data headers

A record batch is a collection of top-level named, equal length Arrow arrays
(or vectors). If one of the arrays contains nested data, its child arrays are
not required to be the same length as the top-level arrays.

One can be thought of as a realization of a particular schema. The metadata
describing a particular record batch is called a "data header". Here is the
Flatbuffers IDL for a record batch data header

```
table RecordBatch {
length: int;
nodes: [FieldNode];
buffers: [Buffer];
}
```

The `nodes` and `buffers` fields are produced by a depth-first traversal /
flattening of a schema (possibly containing nested types) for a given in-memory
data set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important to flatten the representation of nodes and buffers here?
How about nesting the buffers inside FieldNode?
This would make the metadata more explicit and less error prone.
Right now the type in the fieldnode implies how many Buffers are found in the corresponding buffers list and what each one is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed and resolved in #124


### Buffers

A buffer is metadata describing a contiguous memory region relative to some
virtual address space. This may include:

* Shared memory, e.g. a memory-mapped file
* An RPC message received in-memory
* Data in a file

The key form of the Buffer type is:

```
struct Buffer {
offset: long;
length: long;
}
```

In the context of a record batch, each field has some number of buffers
associated with it, which are derived from their physical memory layout.

Each logical type (separate from its children, if it is a nested type) has a
deterministic number of buffers associated with it. These will be specified in
the logical types section.

### Field metadata

The `FieldNode` values contain metadata about each level in a nested type
hierarchy.

```
struct FieldNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct can not be extended with new fields in the future.
Is it what we want? That means we can not add optional metadata in it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed and resolved in #124

/// The number of value slots in the Arrow array at this level of a nested
/// tree
length: int;

/// The number of observed nulls.
null_count: int;
}
```

## Flattening of nested data

Nested types are flattened in the record batch in depth-first order. When
visiting each field in the nested type tree, the metadata is appended to the
top-level `fields` array and the buffers associated with that field (but not
its children) are appended to the `buffers` array.

For example, let's consider the schema

```
col1: Struct<a: Int32, b: List<Int64>, c: Float64>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is Struct what we call Tuple in Message.fbs ?

Copy link
Member Author

@wesm wesm Aug 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Struct is a reserved word in some of the flatbuffers code generators (need to add a comment about this)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be consistent in the code.
Either we call it Struct_ (or something similar) in the fbs file or Tuple everywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an opinion here. I am OK with either Struct or Tuple. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like Struct is pretty consistently used for this concept in databases at least (in some databases, like BigQuery, the struct fields need not be named), so changing the metadata to Struct_ seems reasonable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created ARROW-278 so this is not blocked

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

col2: Utf8
```

The flattened version of this is:

```
FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name=a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item' # arbitrary
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'
```

For the buffers produced, we would have the following (as described in more
detail for each type below):

```
buffer 0: field 0 validity bitmap

buffer 1: field 1 validity bitmap
buffer 2: field 1 values <int32_t*>

buffer 3: field 2 validity bitmap
buffer 4: field 2 list offsets <int32_t*>

buffer 5: field 3 validity bitmap
buffer 6: field 3 values <int64_t*>

buffer 7: field 4 validity bitmap
buffer 8: field 4 values <double*>

buffer 9: field 5 validity bitmap
buffer 10: field 5 offsets <int32_t*>
buffer 11: field 5 data <uint8_t*>
```

## Logical types

A logical type consists of a type name and metadata along with an explicit
mapping to a physical memory representation. These may fall into some different
categories:

* Types represented as fixed-width primitive arrays (for example: C-style
integers and floating point numbers)
* Types having equivalent memory layout to a physical nested type (e.g. strings
use the list representation, but logically are not nested types)

### Integers

In the first version of Arrow we provide the standard 8-bit through 64-bit size
standard C integer types, both signed and unsigned:

* Signed types: Int8, Int16, Int32, Int64
* Unsigned types: UInt8, UInt16, UInt32, UInt64

The IDL looks like:

```
table Int {
bitWidth: int;
is_signed: bool;
}
```

The integer endianness is currently set globally at the schema level. If a
schema is set to be little-endian, then all integer types occurring within must
be little-endian. Integers that are part of other data representations, such as
list offsets and union types, must have the same endianness as the entire
record batch.

### Floating point numbers

We provide 3 types of floating point numbers as fixed bit-width primitive array

- Half precision, 16-bit width
- Single precision, 32-bit width
- Double precision, 64-bit width

The IDL looks like:

```
enum Precision:int {HALF, SINGLE, DOUBLE}

table FloatingPoint {
precision: Precision;
}
```

### Boolean

The Boolean logical type is represented as a 1-bit wide primitive physical
type. The bits are numbered using least-significant bit (LSB) ordering.

Like other fixed bit-width primitive types, boolean data appears as 2 buffers
in the data header (one bitmap for the validity vector and one for the values).

### List

The `List` logical type is the logical (and identically-named) counterpart to
the List physical type.

In data header form, the list field node contains 2 buffers:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little bit confusing because lists also have child buffers. We probably also want to state that you can't have a list of no type (e.g. a list that doesn't have a child). Actually, we may want to clarify whether that is allowed in the union and struct cases

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steven's PR (#99) is adding a Null type. would that mean we allow a list of nothing? (all values null)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding a section for nested data to make the handling of child arrays more explicit


* Validity bitmap
* List offsets

The buffers associated with a list's child field are handled recursively
according to the child logical type (e.g. `List<Utf8>` vs. `List<Boolean>`).

### Utf8 and Binary

We specify two logical types for variable length bytes:

* `Utf8` data is unicode values with UTF-8 encoding
* `Binary` is any other variable length bytes

These types both have the same memory layout as the nested type `List<UInt8>`,
with the constraint that the inner bytes can contain no null values. From a
logical type perspective they are primitive, not nested types.

In data header form, while `List<UInt8>` would appear as 2 field nodes (`List`
and `UInt8`) and 4 buffers (2 for each of the nodes, as per above), these types
have a simplified representation single field node (of `Utf8` or `Binary`
logical type, which have no children) and 3 buffers:

* Validity bitmap
* List offsets
* Byte data

### Decimal

TBD

### Timestamp

TBD

## Dictionary encoding

[1]: https://github.com/apache/arrow/blob/master/format/Layout.md
[2]: http://github.com/google/flatbuffers
[3]: https://github.com/apache/arrow/blob/master/format/Message.fbs
1 change: 1 addition & 0 deletions format/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@

Currently, the Arrow specification consists of these pieces:

- Metadata specification (see Metadata.md)
- Physical memory layout specification (see Layout.md)
- Metadata serialized representation (see Message.fbs)

Expand Down