
Commit 88f7b27

AndreyPavlenko, YarShev, and ienkovich
authored
DOCS-#5019: Update HDK on native documentation (#5088)
Co-authored-by: Iaroslav Igoshev <[email protected]>
Co-authored-by: ienkovich <[email protected]>
Signed-off-by: Andrey Pavlenko <[email protected]>
1 parent abcf1e9 commit 88f7b27

File tree

1 file changed

+24
-33
lines changed
  • docs/flow/modin/experimental/core/execution/native/implementations/hdk_on_native


docs/flow/modin/experimental/core/execution/native/implementations/hdk_on_native/index.rst

Lines changed: 24 additions & 33 deletions
@@ -2,14 +2,6 @@
 
 HdkOnNative execution
 =====================
-.. raw:: html
-
-    <style>.red {color:red; font-weight:bold;}</style>
-
-.. role:: red
-
-:red:`Note\: After migration to HDK, this documentation is temporarily
-out-of-date and will be fixed in the next release!`
 
 HDK is a low-level execution library for data analytics processing.
 HDK is used as a fast execution backend in Modin. The HDK library provides
@@ -20,8 +12,8 @@ OmniSciDB is an open-source SQL-based relational database designed for the
 massive parallelism of modern CPU and GPU hardware. Its execution engine
 is built on LLVM JIT compiler.
 
-OmniSciDB can be embedded into an application as a dynamic library that
-provides both C++ and Python APIs. A specialized in-memory storage layer
+HDK can be embedded into an application as a Python module - ``pyhdk``. This module
+provides Python APIs to the HDK library. A specialized in-memory storage layer
 provides an efficient way to import data in Arrow table format.
 
 `HdkOnNative` execution uses HDK for both as a storage format and for
@@ -58,7 +50,7 @@ A partition holds data in either ``pandas.DataFrame`` or ``pyarrow.Table``
 format. ``pandas.DataFrame`` is preferred only when we detect unsupported
 data type and therefore have to use ``pandas`` framework for processing.
 In other cases ``pyarrow.Table`` format is preferred. Arrow tables can be
-zero-copy imported into OmniSciDB. A query execution result is also
+zero-copy imported into HDK. A query execution result is also
 returned as an Arrow table.
 
 Data Ingress
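The partition-format rule in the hunk above (Arrow preferred, pandas only for unsupported data types) can be sketched in plain Python. This is a hypothetical illustration: the ``choose_partition_format`` helper and the supported-dtype set below are invented for clarity and are not part of Modin's actual API.

```python
# Hypothetical sketch of the partition format choice described above:
# Arrow is preferred; pandas is used only when a column has a dtype
# the backend cannot handle. The names and the dtype set are illustrative.

SUPPORTED_DTYPES = {"int64", "float64", "bool", "string"}  # assumed set

def choose_partition_format(column_dtypes):
    """Return 'pyarrow.Table' unless an unsupported dtype forces pandas."""
    if any(dt not in SUPPORTED_DTYPES for dt in column_dtypes):
        return "pandas.DataFrame"
    return "pyarrow.Table"

print(choose_partition_format(["int64", "float64"]))  # Arrow path
print(choose_partition_format(["int64", "object"]))   # falls back to pandas
```

The zero-copy property mentioned above is the payoff of this preference: an Arrow-backed partition can be handed to HDK without materializing a second copy of the data.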
@@ -90,16 +82,16 @@ wrapped into a high-level Modin DataFrame, which is returned to the user.
 .. figure:: /img/hdk/hdk_ingress.svg
    :align: center
 
-Note that during this ingress flow, no data is actually imported to the OmniSciDB. The need for
-importing to OmniSci is decided later at the execution stage by the Modin Core Dataframe layer.
-If the query requires for the data to be placed in OmniSciDB, the import is triggered.
+Note that during this ingress flow, no data is actually imported to HDK. The need for
+importing to HDK is decided later at the execution stage by the Modin Core Dataframe layer.
+If the query requires the data to be placed in HDK, the import is triggered.
 :py:class:`~modin.experimental.core.execution.native.implementations.hdk_on_native.dataframe.dataframe.HdkOnNativeDataframe`
 passes partition to import to the
 :py:class:`~modin.experimental.core.execution.native.implementations.hdk_on_native.partitioning.partition_manager.HdkOnNativeDataframePartitionManager`
-that extracts a partition's underlying object and sends a request to import it to the OmniSci
-Server. The response for the request is a unique identifier for the just imported table
-at OmniSciDB, this identifier is placed in the partition. After that, the partition has
-a reference to the concrete table in OmniSciDB to query, and the data is considered to be
+that extracts a partition's underlying object and sends a request to import it to HDK.
+The response for the request is a unique identifier for the just-imported table
+in HDK; this identifier is placed in the partition. After that, the partition has
+a reference to the concrete table in HDK to query, and the data is considered to be
 fully imported.
 
 .. figure:: /img/hdk/hdk_import.svg
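The deferred-import flow described in this hunk can be sketched as a small state machine: a partition starts with only its ingested data, and an HDK table identifier appears the first time a query actually needs the data inside HDK. The ``Partition`` and ``PartitionManager`` classes below are hypothetical stand-ins, not Modin's real classes.

```python
import itertools

# Hypothetical sketch of the deferred import flow described above.
# A partition initially holds only its Arrow data; a table identifier
# is assigned on the first query that requires the data in HDK.

_table_ids = itertools.count(1)  # stands in for identifiers returned by HDK

class Partition:
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data
        self.table_id = None  # not imported yet

class PartitionManager:
    def ensure_imported(self, partition):
        """Import the partition's data into HDK on first use."""
        if partition.table_id is None:
            # In the real flow this sends the Arrow table to HDK and
            # receives back a unique identifier for the imported table.
            partition.table_id = next(_table_ids)
        return partition.table_id

p = Partition(arrow_data=["row0", "row1"])
mgr = PartitionManager()
assert p.table_id is None             # ingress alone imports nothing
tid = mgr.ensure_imported(p)          # first query triggers the import
assert tid == mgr.ensure_imported(p)  # later queries reuse the same table
```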
@@ -133,7 +125,7 @@ lazy computation tree or executed immediately.
 Lazy execution
 """"""""""""""
 
-OmniSciDB has a powerful query optimizer and an execution engine that
+HDK has a powerful query optimizer and an execution engine that
 combines multiple operations into a single execution module. E.g. join,
 filter and aggregation can be executed in a single data scan.
 
@@ -142,7 +134,7 @@ overheads, all of the operations that don't require data materialization
 are performed lazily.
 
 Lazy operations on a frame build a tree which is later translated into
-a query executed by OmniSci. Each of the tree nodes has its input node(s)
+a query executed by HDK. Each of the tree nodes has its input node(s)
 - a frame argument(s) of the operation. When a new node is appended to the
 tree, it becomes its root. The leaves of the tree are always a special node
 type, whose input is an actual materialized frame to execute operations
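The tree-building rule in this hunk (each appended node becomes the new root; leaves wrap materialized frames) can be sketched in a few lines of Python. The ``LeafNode``/``OpNode`` classes are illustrative simplifications of Modin's ``DFAlgNode``-style trees, not the actual implementation.

```python
# Hypothetical sketch of the lazy operation tree described above.
# Leaves wrap materialized frames; other nodes record an operation and
# its input node(s); appending a node makes it the new root.

class LeafNode:
    def __init__(self, frame):
        self.frame = frame  # an actual materialized frame
        self.inputs = []

class OpNode:
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = list(inputs)

# A chain like join -> filter -> agg builds a tree without executing:
left, right = LeafNode("frame_a"), LeafNode("frame_b")
root = OpNode("join", left, right)
root = OpNode("filter", root)  # appending a node makes it the root
root = OpNode("agg", root)

def leaves(node):
    """Collect the materialized frames this tree would need to read."""
    if not node.inputs:
        return [node.frame]
    return [f for child in node.inputs for f in leaves(child)]

print(leaves(root))  # ['frame_a', 'frame_b']
```

Walking the tree from the root to its leaves, as ``leaves`` does here, mirrors the step the text describes next: finding the materialized inputs that must be imported before the whole tree can be executed.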
@@ -174,30 +166,29 @@ Execution of a computation tree
 
 Frames are materialized (executed) when their data is accessed. E.g. it
 happens when we try to access the frame's index or shape. There are two ways
-to execute required operations: through Arrow or through OmniSciDB.
+to execute required operations: through Arrow or through HDK.
 
 Arrow execution
 '''''''''''''''
 
 For simple operations which don't include actual computations, execution can use
 Arrow API. We can use it to rename columns, drop columns and concatenate
 frames. Arrow execution is preferable since it doesn't require actual data import/export
-to the OmniSciDB.
+from/to HDK.
 
-OmniSciDB execution
-'''''''''''''''''''
+HDK execution
+'''''''''''''
 
-To execute query in OmniSciDB engine we need to import data first. We should
+To execute a query in the HDK engine we need to import data first. We should
 find all leaves of an operation tree and import their Arrow tables. Partitions
 with imported tables hold corresponding table names used to refer to them in
 queries.
 
-OmniSciDB is SQL-based. SQL query parsing is done in a separate process using
-the Apache Calcite framework. A parsed query is serialized into JSON format
-and is transferred back to OmniSciDB. In Modin, we don't generate SQL queries
-for OmniSciDB but use this JSON format instead. Such queries can be directly
-executed by OmniSciDB and also they can be transferred to Calcite server for
-optimizations.
+HDK executes queries expressed in an HDK-specific intermediate representation (IR) format.
+It also provides components to translate SQL queries to a relational algebra JSON format,
+which can later be optimized and translated to HDK IR. Modin generates queries in the
+relational algebra JSON format. These queries are optionally optimized with the Apache
+Calcite-based optimizer provided by HDK (:py:class:`~pyhdk.sql.Calcite`) and then executed.
 
 Operations used by Calcite in its intermediate representation are implemented
 in classes derived from
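To make the "relational algebra JSON" idea above concrete, the sketch below builds and round-trips a toy query. The node shapes and field names (``rels``, ``relOp``, etc.) are invented for illustration and do not match HDK's or Calcite's actual wire format.

```python
import json

# Hypothetical sketch of a query in a relational algebra JSON form,
# in the spirit of the Calcite-style serialization described above.
# Field names here are illustrative, not HDK's real schema.

query = {
    "rels": [
        {"id": "0", "relOp": "Scan", "table": "modin_table_1"},
        {"id": "1", "relOp": "Filter", "input": "0",
         "condition": {"op": ">", "operands": ["a", 10]}},
        {"id": "2", "relOp": "Project", "input": "1", "fields": ["a", "b"]},
    ]
}

serialized = json.dumps(query)  # what would be handed off for execution
decoded = json.loads(serialized)
print([rel["relOp"] for rel in decoded["rels"]])  # ['Scan', 'Filter', 'Project']
```

In the real pipeline this serialization is the job of the ``CalciteSerializer``-style classes the hunk goes on to mention, with the operation nodes produced by a builder from the lazy operation tree.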
@@ -235,7 +226,7 @@ Rowid column and sub-queries
 A special case of an index is the default index - 0-based numeric sequence.
 In our representation, such an index is represented by the absence of index columns.
 If we need to access the index value we can use the virtual ``rowid`` column provided
-by OmniSciDB. Unfortunately, this special column is available for physical
+by HDK. Unfortunately, this special column is available for physical
 tables only. That means we cannot access it for a node that is not a tree leaf.
 That makes us execute trees with such nodes in several steps. First, we
 materialize all frames that require ``rowid`` column and only after that we can
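The default-index idea in this hunk reduces to a simple rule: when no index columns are stored, a row's index value is just its 0-based position, which is what the virtual ``rowid`` column exposes for physical tables. A minimal sketch (with a hypothetical ``with_rowid`` helper, not Modin's code):

```python
# Hypothetical sketch of the default-index / rowid idea described above.
# With a default index there are no index columns; the index value is
# the 0-based position of the row in the materialized table.

def with_rowid(rows):
    """Attach a 0-based rowid to each row of a materialized table."""
    return [{"rowid": i, **row} for i, row in enumerate(rows)]

table = [{"a": 10}, {"a": 20}, {"a": 30}]
print(with_rowid(table))
# the index value of the second row is its rowid, i.e. 1
```

Because such a position is only well-defined for a materialized table, this sketch also shows why the text says trees whose inner nodes need ``rowid`` must first materialize those frames.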
@@ -257,7 +248,7 @@ by ``DFAlgNode`` based trees. Scalar computations are described by ``BaseExpr``
 * :doc:`Frame nodes <df_algebra>`
 * :doc:`Expression nodes <expr>`
 
-Interactions with OmniSci engine are done using ``OmnisciWorker`` class. Queries use serialized
+Interactions with the HDK engine are done using the ``HdkWorker`` class. Queries use serialized
 Calcite relational algebra format. Calcite algebra nodes are based on ``CalciteBaseNode`` class.
 Translation is done by ``CalciteBuilder`` class. Serialization is performed by ``CalciteSerializer``
 class.
