-
Hi, thank you in advance,
-
Hi Alex,

When you read string typed data with the low-level API, values are read as an array of `ByteArray` values. These just store a pointer and a length, and point to memory allocated by the C++ Arrow library to read string data into. This is the same type used to write string values with the low-level API, so you can pass these directly back to a column writer without having to marshal the data to .NET strings. You just need to be aware that the `ByteArray` values are only guaranteed to remain valid until the next call to `ReadBatch`.

Here's an example of re-writing a Parquet file with string typed data using the low-level API:

```csharp
// Write test file using high level API
{
    var columns = new Column[]
    {
        new Column<string>("s"),
    };
    using var writer = new ParquetFileWriter("test.parquet", columns);
    using var rowGroup = writer.AppendRowGroup();
    using var col = rowGroup.NextColumn().LogicalWriter<string>();
    col.WriteBatch(new[] {"a", "b", "c"});
}

// Read and re-write row groups using low level API, without converting
// byte array values to .NET strings.
{
    using var reader = new ParquetFileReader("test.parquet");
    using var schema = (ParquetSharp.Schema.GroupNode) reader.FileMetaData.Schema.SchemaRoot;
    using var writerPropertiesBuilder = new WriterPropertiesBuilder();
    using var writerProperties = writerPropertiesBuilder.Build();
    using var writer = new ParquetFileWriter("modified.parquet", schema, writerProperties);

    for (var rg = 0; rg < reader.FileMetaData.NumRowGroups; rg++)
    {
        using var readerRowGroup = reader.RowGroup(rg);
        using var writerRowGroup = writer.AppendRowGroup();
        using var readerCol = (ColumnReader<ByteArray>) readerRowGroup.Column(0);
        using var writerCol = (ColumnWriter<ByteArray>) writerRowGroup.NextColumn();

        const int bufferSize = 1024;
        var valuesBuffer = new ByteArray[bufferSize];
        var defLevels = new short[bufferSize];
        var repLevels = new short[bufferSize];
        long valuesRead = 0;
        while (valuesRead < readerCol.ColumnChunkMetaData.NumValues)
        {
            var numLevels = readerCol.ReadBatch(
                valuesBuffer.Length, defLevels, repLevels, valuesBuffer, out var numValues);
            valuesRead += numValues;
            writerCol.WriteBatch(
                (int) numValues,
                defLevels.AsSpan(0, (int) numLevels),
                repLevels.AsSpan(0, (int) numLevels),
                valuesBuffer.AsSpan(0, (int) numValues));
        }
        writerRowGroup.Close();
    }
    writer.Close();
}

// Test reading back with high level API
{
    using var reader = new ParquetFileReader("modified.parquet");
    for (var rg = 0; rg < reader.FileMetaData.NumRowGroups; rg++)
    {
        using var rowGroup = reader.RowGroup(rg);
        using var col = rowGroup.Column(0).LogicalReader<string>();
        var values = col.ReadAll((int) rowGroup.MetaData.NumRows);
        foreach (var value in values)
        {
            Console.WriteLine(value);
        }
    }
}
```
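
A note on the lifetime caveat above: if values need to outlive the next `ReadBatch` call, the bytes must be copied out of the native buffer first. Here is a minimal sketch, assuming `ByteArray` exposes its pointer and length as `Pointer` and `Length` members (the member names are an assumption here) and reusing the `valuesBuffer` and `numValues` names from the loop above:

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Sketch only: copy each ByteArray's bytes into managed memory so they
// survive past the next ReadBatch call, which may invalidate the native
// buffer the ByteArray values point into.
var retained = new List<byte[]>();
for (var i = 0; i < (int) numValues; i++)
{
    var value = valuesBuffer[i];  // assumed ByteArray with Pointer/Length members
    var copy = new byte[value.Length];
    Marshal.Copy(value.Pointer, copy, 0, value.Length);
    retained.Add(copy);
}
```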
-
Thank you for your prompt reply.

Your code relies on `ColumnChunkMetaData`, which is declared as internal, so I used `NumRows` from `RowGroupReader` instead. Another small tweak to the code: for columns defined as Optional that have null values, `numValues` returns the number of non-null values, not the number of fetched values.

The solution you proposed works when the repetition of the source and target columns is the same. In my case we can have a scenario where the source column is `Repetition.Required` and the target is `Repetition.Optional`. What I see is that once I read data from a column defined as Required, the `defLevels` are returned as 0, so if I send the fetched data to the writer it generates nulls for all values. Is this expected behaviour and do I need to populate `defLevels` with 1 on my side, or could it be handled on the library side?

Alex
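
For reference, the two tweaks described above might look like this in the copy loop: a minimal sketch, reusing the variable names from the earlier example, that drives the loop with `NumRows` from the `RowGroupReader` metadata and counts the returned levels instead of `numValues` (valid for a flat, non-repeated column, where the number of levels equals the number of rows):

```csharp
// Sketch: terminate on the row count rather than the internal
// ColumnChunkMetaData.NumValues. numLevels counts every slot read,
// including nulls; numValues only counts the non-null values.
long levelsRead = 0;
var numRows = readerRowGroup.MetaData.NumRows;
while (levelsRead < numRows)
{
    var numLevels = readerCol.ReadBatch(
        valuesBuffer.Length, defLevels, repLevels, valuesBuffer, out var numValues);
    levelsRead += numLevels;
    writerCol.WriteBatch(
        (int) numLevels,  // number of levels written, not the non-null count
        defLevels.AsSpan(0, (int) numLevels),
        repLevels.AsSpan(0, (int) numLevels),
        valuesBuffer.AsSpan(0, (int) numValues));
}
```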
Ah, possibly `ColumnChunkMetaData.NumValues` should be exposed publicly somehow to make this possible then, as this has a different meaning to `NumRows`. `ColumnChunkMetaData.NumValues` is the number of non-null leaf values (e.g. for list typed columns it will be the total number of non-null elements in all lists).

Yes, this is expected behaviour and is something you would need to handle if you are doing th…