2
2
3
3
HdkOnNative execution
4
4
=====================
5
- .. raw :: html
6
-
7
- <style >.red {color :red ; font-weight :bold ;} </style >
8
-
9
- .. role :: red
10
-
11
- :red: `Note\: After migration to HDK, this documentation is temporarily
12
- out-of-date and will be fixed in the next release! `
13
5
14
6
HDK is a low-level execution library for data analytics processing.
15
7
HDK is used as a fast execution backend in Modin. The HDK library provides
@@ -20,8 +12,8 @@ OmniSciDB is an open-source SQL-based relational database designed for the
20
12
massive parallelism of modern CPU and GPU hardware. Its execution engine
21
13
is built on LLVM JIT compiler.
22
14
23
- OmniSciDB can be embedded into an application as a dynamic library that
24
- provides both C++ and Python APIs . A specialized in-memory storage layer
15
+ HDK can be embedded into an application as a python module - `` pyhdk ``. This module
16
+ provides Python APIs to the HDK library . A specialized in-memory storage layer
25
17
provides an efficient way to import data in Arrow table format.
26
18
27
19
`HdkOnNative ` execution uses HDK for both as a storage format and for
@@ -58,7 +50,7 @@ A partition holds data in either ``pandas.DataFrame`` or ``pyarrow.Table``
58
50
format. ``pandas.DataFrame `` is preferred only when we detect unsupported
59
51
data type and therefore have to use ``pandas `` framework for processing.
60
52
In other cases ``pyarrow.Table `` format is preferred. Arrow tables can be
61
- zero-copy imported into OmniSciDB . A query execution result is also
53
+ zero-copy imported into HDK . A query execution result is also
62
54
returned as an Arrow table.
63
55
64
56
Data Ingress
@@ -90,16 +82,16 @@ wrapped into a high-level Modin DataFrame, which is returned to the user.
90
82
.. figure :: /img/hdk/hdk_ingress.svg
91
83
:align: center
92
84
93
- Note that during this ingress flow, no data is actually imported to the OmniSciDB . The need for
94
- importing to OmniSci is decided later at the execution stage by the Modin Core Dataframe layer.
95
- If the query requires for the data to be placed in OmniSciDB , the import is triggered.
85
+ Note that during this ingress flow, no data is actually imported to HDK . The need for
86
+ importing to HDK is decided later at the execution stage by the Modin Core Dataframe layer.
87
+ If the query requires for the data to be placed in HDK , the import is triggered.
96
88
:py:class: `~modin.experimental.core.execution.native.implementations.hdk_on_native.dataframe.dataframe.HdkOnNativeDataframe `
97
89
passes partition to import to the
98
90
:py:class: `~modin.experimental.core.execution.native.implementations.hdk_on_native.partitioning.partition_manager.HdkOnNativeDataframePartitionManager `
99
- that extracts a partition's underlying object and sends a request to import it to the OmniSci
100
- Server. The response for the request is a unique identifier for the just imported table
101
- at OmniSciDB , this identifier is placed in the partition. After that, the partition has
102
- a reference to the concrete table in OmniSciDB to query, and the data is considered to be
91
+ that extracts a partition's underlying object and sends a request to import it to HDK.
92
+ The response for the request is a unique identifier for the just imported table
93
+ at HDK , this identifier is placed in the partition. After that, the partition has
94
+ a reference to the concrete table in HDK to query, and the data is considered to be
103
95
fully imported.
104
96
105
97
.. figure :: /img/hdk/hdk_import.svg
@@ -133,7 +125,7 @@ lazy computation tree or executed immediately.
133
125
Lazy execution
134
126
""""""""""""""
135
127
136
- OmniSciDB has a powerful query optimizer and an execution engine that
128
+ HDK has a powerful query optimizer and an execution engine that
137
129
combines multiple operations into a single execution module. E.g. join,
138
130
filter and aggregation can be executed in a single data scan.
139
131
@@ -142,7 +134,7 @@ overheads, all of the operations that don't require data materialization
142
134
are performed lazily.
143
135
144
136
Lazy operations on a frame build a tree which is later translated into
145
- a query executed by OmniSci . Each of the tree nodes has its input node(s)
137
+ a query executed by HDK . Each of the tree nodes has its input node(s)
146
138
- a frame argument(s) of the operation. When a new node is appended to the
147
139
tree, it becomes its root. The leaves of the tree are always a special node
148
140
type, whose input is an actual materialized frame to execute operations
@@ -174,30 +166,29 @@ Execution of a computation tree
174
166
175
167
Frames are materialized (executed) when their data is accessed. E.g. it
176
168
happens when we try to access the frame's index or shape. There are two ways
177
- to execute required operations: through Arrow or through OmniSciDB .
169
+ to execute required operations: through Arrow or through HDK .
178
170
179
171
Arrow execution
180
172
'''''''''''''''
181
173
182
174
For simple operations which don't include actual computations, execution can use
183
175
Arrow API. We can use it to rename columns, drop columns and concatenate
184
176
frames. Arrow execution is preferable since it doesn't require actual data import/export
185
- to the OmniSciDB .
177
+ from/ to HDK .
186
178
187
- OmniSciDB execution
188
- '''''''''''''''''''
179
+ HDK execution
180
+ '''''''''''''
189
181
190
- To execute query in OmniSciDB engine we need to import data first. We should
182
+ To execute a query in the HDK engine we need to import data first. We should
191
183
find all leaves of an operation tree and import their Arrow tables. Partitions
192
184
with imported tables hold corresponding table names used to refer to them in
193
185
queries.
194
186
195
- OmniSciDB is SQL-based. SQL query parsing is done in a separate process using
196
- the Apache Calcite framework. A parsed query is serialized into JSON format
197
- and is transferred back to OmniSciDB. In Modin, we don't generate SQL queries
198
- for OmniSciDB but use this JSON format instead. Such queries can be directly
199
- executed by OmniSciDB and also they can be transferred to Calcite server for
200
- optimizations.
187
+ HDK executes queries expressed in HDK-specific intermediate representation (IR) format.
188
+ It also provides components to translate SQL queries to relational algebra JSON format
189
+ which can be later optimized and translated to HDK IR. Modin generates queries in relational
190
+ algebra JSON format. These queries are optionally optimized with Apache Calcite
191
+ based optimizer provided by HDK (:py:class: `~pyhdk.sql.Calcite `) and then executed.
201
192
202
193
Operations used by Calcite in its intermediate representation are implemented
203
194
in classes derived from
@@ -235,7 +226,7 @@ Rowid column and sub-queries
235
226
A special case of an index is the default index - 0-based numeric sequence.
236
227
In our representation, such an index is represented by the absence of index columns.
237
228
If we need to access the index value we can use the virtual ``rowid `` column provided
238
- by OmniSciDB . Unfortunately, this special column is available for physical
229
+ by HDK . Unfortunately, this special column is available for physical
239
230
tables only. That means we cannot access it for a node that is not a tree leaf.
240
231
That makes us execute trees with such nodes in several steps. First, we
241
232
materialize all frames that require ``rowid `` column and only after that we can
@@ -257,7 +248,7 @@ by ``DFAlgNode`` based trees. Scalar computations are described by ``BaseExpr``
257
248
* :doc: `Frame nodes <df_algebra >`
258
249
* :doc: `Expression nodes <expr >`
259
250
260
- Interactions with OmniSci engine are done using ``OmnisciWorker `` class. Queries use serialized
251
+ Interactions with HDK engine are done using ``HdkWorker `` class. Queries use serialized
261
252
Calcite relational algebra format. Calcite algebra nodes are based on ``CalciteBaseNode `` class.
262
253
Translation is done by ``CalciteBuilder `` class. Serialization is performed by ``CalciteSerializer ``
263
254
class.
0 commit comments