
19/06/2020 Cinchoo ETL - Parquet Writer - CodeProject

Cinchoo ETL - Parquet Writer


Cinchoo

17 Jun 2020 CPOL

Simple Parquet writer for .NET

ChoETL is an open source ETL (extract, transform and load) framework for .NET. It is a code based library for extracting data from
multiple sources, transforming, and loading into your very own data warehouse in .NET environment. You can have data in your data
warehouse in no time.

Download source code


Download binary (.NET Standard / .NET Core)

Contents
1. Introduction
2. Requirement
3. "Hello World!" Sample

3.1. Quick write - Data First Approach


3.2. Code First Approach
3.3. Configuration First Approach

4. Writing All Records


5. Write Records Manually
6. Customize Parquet Record
8. Customize Parquet Fields

8.1. DefaultValue
8.2. ChoFallbackValue
8.3. Type Converters

8.3.1. Declarative Approach


8.3.2. Configuration Approach
8.3.3. Custom Value Converter Approach

8.4. Validations
8.5. ChoIgnoreMember
8.6. StringLength
8.7. Display
8.8. DisplayName

10. Callback Mechanism

10.1 Using ParquetWriter events


10.2 Implementing IChoNotifyRecordWrite interface
10.1 BeginWrite
10.2 EndWrite
10.3 BeforeRecordWrite
10.4 AfterRecordWrite
10.5 RecordWriteError
10.6 BeforeRecordFieldWrite
10.7 AfterRecordFieldWrite
https://www.codeproject.com/Articles/5271468/Cinchoo-ETL-Parquet-Writer?display=Print 1/35

10.8 RecordWriteFieldError

11. Customization
12. Using Dynamic Object
13. Exceptions
15. Using MetadataType Annotation
16. Configuration Choices

16.1 Manual Configuration
16.2 Auto Map Configuration
16.3 Attaching MetadataType class

18. Writing DataReader Helper Method


19. Writing DataTable Helper Method
20. Advanced Topics

20.1 Override Converters Format Specs


20.2 Currency Support
20.3 Enum Support
20.4 Boolean Support
20.5 DateTime Support

21. Fluent API

21.1. NullValueHandling
21.2. Formatting
21.3 WithFields
21.4 WithField
21.5. IgnoreFieldValueMode
21.6 ColumnCountStrict
21.7. Configure
21.8. Setup

22. FAQ

22.1. How to serialize an object?


22.2. How to serialize collection of objects?
22.3. How to serialize dynamic object?
22.4. How to serialize anonymous object?
22.5. How to serialize collection?
22.6. How to serialize dictionary?
22.7. How to serialize DataTable?
22.8. How to serialize Parquet to a file?
22.9. How to serialize byte array to a file?
22.10. How to serialize enum as integer to a file?
22.11. How to exclude property from Serialization?
22.12. How to convert Xml to Parquet?
22.13. How to convert CSV to Parquet?
22.14. How to convert JSON to Parquet?

1. Introduction
ChoETL is an open source ETL (extract, transform and load) framework for .NET. It is a code based library for extracting data from
multiple sources, transforming, and loading into your very own data warehouse in .NET environment. You can have data in your data
warehouse in no time.

Apache Parquet is an open source file format for Hadoop. Parquet stores nested data structures in a flat columnar format.
Compared to a traditional approach where data is stored in a row-oriented format, Parquet is more efficient in terms of storage
and performance.

This article talks about using ChoParquetWriter component offered by ChoETL framework. It is a simple utility class to save
Parquet data to a file / external data source.

The corresponding ChoParquetReader article, covering the Parquet reader, can be found here.

Features:

Uses the Parquet.NET library under the hood to generate Parquet files in seconds.
Supports culture-specific date, currency and number formats while generating files.
Provides fine control of date, currency, enum, boolean and number formats when writing files.
Detailed and robust error handling, allowing you to quickly find and fix problems.
Shortens your development time.

2. Requirement
This framework library is written in C# using the .NET 4.5 Framework / .NET Core 2.x.

3. "Hello World!" Sample


Open VS.NET 2017 or higher
Create a sample VS.NET (.NET Framework 4.x / .NET Core 2.x) Console Application project
Install ChoETL via the Package Manager Console using the NuGet command for your .NET version:

Install-Package ChoETL.Parquet

Use the ChoETL namespace

Let's begin by looking at a simple example of generating the below Parquet file having 2 columns.

Image 3.1 Sample Parquet data file (emp.parquet)

There are a number of ways you can generate the Parquet file with minimal setup.

3.1. Quick write - Data First Approach


This is the zero-config and quickest approach to create a Parquet file in no time. No typed POCO object is needed. The sample code
below shows how to generate a sample Parquet file using dynamic objects.

Listing 3.1.1 Write list of objects to Parquet file

List<ExpandoObject> objs = new List<ExpandoObject>();

dynamic rec1 = new ExpandoObject();
rec1.Id = 1;
rec1.Name = "Mark";
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();
rec2.Id = 2;
rec2.Name = "Jason";
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))
{
    parser.Write(objs);
}

In the above sample, we give the list of dynamic objects to ParquetWriter in one pass to write them to the Parquet file.

Listing 3.1.2 Write each object to Parquet file

using (var parser = new ChoParquetWriter("emp.parquet"))
{
    dynamic rec1 = new ExpandoObject();
    rec1.Id = 1;
    rec1.Name = "Mark";
    parser.Write(rec1);

    dynamic rec2 = new ExpandoObject();
    rec2.Id = 2;
    rec2.Name = "Jason";
    parser.Write(rec2);
}

In the above sample, we take control of constructing and passing each individual dynamic record to the ParquetWriter to generate
the Parquet file using the Write overload.

3.2. Code First Approach


This is another zero-config way to generate a Parquet file, using a typed POCO class. First, define a simple POCO class to match the
underlying Parquet file layout.

Listing 3.2.1 Simple POCO entity class

public partial class EmployeeRecSimple
{
    public int Id { get; set; }
    public string Name { get; set; }
}

In the above, the POCO class defines two properties matching the sample Parquet file template.

Listing 3.2.2 Saving to Parquet file

List<EmployeeRecSimple> objs = new List<EmployeeRecSimple>();

EmployeeRecSimple rec1 = new EmployeeRecSimple();
rec1.Id = 1;
rec1.Name = "Mark";
objs.Add(rec1);

EmployeeRecSimple rec2 = new EmployeeRecSimple();
rec2.Id = 2;
rec2.Name = "Jason";
objs.Add(rec2);

using (var parser = new ChoParquetWriter<EmployeeRecSimple>("emp.parquet"))
{
    parser.Write(objs);
}

The above sample shows how to create a Parquet file from typed POCO class objects.

3.3. Configuration First Approach
In this model, we define the Parquet configuration with all the necessary parameters along with Parquet columns required to
generate the sample Parquet file.

Listing 3.3.1 Define Parquet configuration

ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();

config.ParquetRecordFieldConfigurations.Add(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add(new ChoParquetRecordFieldConfiguration("Name"));

In the above, the configuration defines two Parquet columns matching the sample Parquet file template.

Listing 3.3.2 Generate Parquet file without POCO object

List<ExpandoObject> objs = new List<ExpandoObject>();

dynamic rec1 = new ExpandoObject();
rec1.Id = 1;
rec1.Name = "Mark";
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();
rec2.Id = 2;
rec2.Name = "Jason";
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet", config))
{
    parser.Write(objs);
}

The above sample code shows how to generate a Parquet file from a list of dynamic objects using a predefined Parquet configuration
setup. In the ParquetWriter constructor, we passed the Parquet configuration object so that the Parquet layout schema is obeyed
while creating the file. Any mismatch in the name or count of Parquet columns will be reported as an error and the writing process
will stop.

Listing 3.3.3 Saving Parquet file with POCO object

List<EmployeeRecSimple> objs = new List<EmployeeRecSimple>();

EmployeeRecSimple rec1 = new EmployeeRecSimple();
rec1.Id = 1;
rec1.Name = "Mark";
objs.Add(rec1);

EmployeeRecSimple rec2 = new EmployeeRecSimple();
rec2.Id = 2;
rec2.Name = "Jason";
objs.Add(rec2);

using (var parser = new ChoParquetWriter<EmployeeRecSimple>("emp.parquet", config))
{
    parser.Write(objs);
}

The above sample code shows how to generate a Parquet file from a list of POCO objects with a Parquet configuration object. In the
ParquetWriter constructor, we passed the Parquet configuration object.

3.4. Code First with declarative configuration


This is the combined approach: define the POCO entity class along with attaching Parquet configuration parameters declaratively. Id
is a required column and Name is an optional column with the default value "XXXX". If Name is not present, the default
value is used.

Listing 3.4.1 Define POCO Object

public class EmployeeRec
{
    [ChoParquetRecordField]
    [Required]
    public int? Id
    {
        get;
        set;
    }

    [ChoParquetRecordField]
    [DefaultValue("XXXX")]
    public string Name
    {
        get;
        set;
    }

    public override string ToString()
    {
        return "{0}. {1}.".FormatString(Id, Name);
    }
}

The code above illustrates defining a POCO object with the necessary attributes required to generate the Parquet file. First, define
a property for each record field with ChoParquetRecordFieldAttribute to qualify it for Parquet record mapping.
Id is a required property; we decorated it with RequiredAttribute. Name is given a default value using
DefaultValueAttribute. It means that if the Name value is not set in the object, ParquetWriter writes the default value
'XXXX' to the file.

It is very simple and ready to save Parquet data in no time.

Listing 3.4.2 Saving Parquet file with POCO object

List<EmployeeRec> objs = new List<EmployeeRec>();

EmployeeRec rec1 = new EmployeeRec();
rec1.Id = 1;
rec1.Name = "Mark";
objs.Add(rec1);

EmployeeRec rec2 = new EmployeeRec();
rec2.Id = 2;
rec2.Name = "Jason";
objs.Add(rec2);

using (var parser = new ChoParquetWriter<EmployeeRec>("emp.parquet"))
{
    parser.Write(objs);
}

We start by creating a new instance of the ChoParquetWriter object. That's all. All the heavy lifting of generating Parquet data from
the objects is done by the writer under the hood.

By default, ParquetWriter discovers and uses default configuration parameters while saving the Parquet file. These can be
overridden according to your needs. The following sections will give you in-depth details about each configuration attribute.

4. Writing All Records


It is as easy as setting up a POCO object to match the Parquet file structure, constructing the list of objects, and passing it to
ParquetWriter's Write method. This will write the entire list of objects into the Parquet file in one single call.

Listing 4.1 Write to Parquet File


List<EmployeeRec> objs = new List<EmployeeRec>();

//Construct and attach objects to this list
...

using (var parser = new ChoParquetWriter<EmployeeRec>("emp.parquet"))
{
    parser.Write(objs);
}

or:

Listing 4.2 Write to Parquet file stream

List<EmployeeRec> objs = new List<EmployeeRec>();

//Construct and attach objects to this list
...

using (var tx = File.OpenWrite("emp.parquet"))
{
    using (var parser = new ChoParquetWriter<EmployeeRec>(tx))
    {
        parser.Write(objs);
    }
}

This model keeps your code elegant, clean, easy to read and maintain.

5. Write Records Manually
This is an alternative way to write each individual record to the Parquet file when the POCO objects are constructed in a
disconnected way.

Listing 5.1 Writing to Parquet file

using (var writer = new ChoParquetWriter<EmployeeRec>("emp.parquet"))
{
    EmployeeRec rec1 = new EmployeeRec();
    rec1.Id = 1;
    rec1.Name = "Mark";
    writer.Write(rec1);

    EmployeeRec rec2 = new EmployeeRec();
    rec2.Id = 2;
    rec2.Name = "Jason";
    writer.Write(rec2);
}

6. Customize Parquet Record


Using ChoParquetRecordObjectAttribute, you can customize the POCO entity object declaratively.

Listing 6.1 Customizing POCO object for each record

[ChoParquetRecordObject]
public class EmployeeRec
{
    [ChoParquetRecordField]
    public int Id { get; set; }

    [ChoParquetRecordField]
    [Required]
    [DefaultValue("ZZZ")]
    public string Name { get; set; }
}

Here are the available attributes to carry out customization of the Parquet write operation on a file.

Culture - The culture info used to read and write.
ColumnCountStrict - This flag indicates if an exception should be thrown if the Parquet field configuration mismatches
the data object members.
ErrorMode - This flag indicates if an exception should be thrown if an expected field fails to write. This
can be overridden per property. Possible values are:

IgnoreAndContinue - Ignore the error, skip the record and continue with the next.
ReportAndContinue - Report the error to the POCO entity if it is of IChoNotifyRecordWrite type.
ThrowAndStop - Throw the error and stop the execution.

ObjectValidationMode - A flag to let the writer know about the type of validation to be performed on the record
object. Possible values are:

Off - No object validation performed. (Default)
MemberLevel - Validation performed before each Parquet property gets written to the file.
ObjectLevel - Validation performed before all the POCO properties are written to the file.
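As a sketch of how these record-level settings can be applied declaratively (the attribute property names ErrorMode and ObjectValidationMode are assumptions based on the options listed above; verify them against your ChoETL version):

```csharp
// Hypothetical sketch: record-level write behavior configured declaratively.
// The ErrorMode / ObjectValidationMode property names are assumed, not confirmed.
[ChoParquetRecordObject(ErrorMode = ChoErrorMode.IgnoreAndContinue,
    ObjectValidationMode = ChoObjectValidationMode.MemberLevel)]
public class EmployeeRec
{
    [ChoParquetRecordField]
    public int Id { get; set; }

    [ChoParquetRecordField]
    public string Name { get; set; }
}
```

With MemberLevel validation, each property is validated just before it is written; IgnoreAndContinue skips failing records instead of throwing.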

8. Customize Parquet Fields


For each Parquet column, you can specify the mapping in the POCO entity property using
ChoParquetRecordFieldAttribute.

Listing 8.1 Customizing POCO object for Parquet columns

public class EmployeeRec
{
    [ChoParquetRecordField]
    public int Id { get; set; }

    [ChoParquetRecordField]
    [Required]
    [DefaultValue("ZZZ")]
    public string Name { get; set; }
}

Here are the available members to customize each property:

FieldName - Parquet field name. If not specified, the POCO object property name is used as the field name.
Size - Size of the Parquet column value.
NullValue - Special null value text expected to be treated as a null value at the field level.
ErrorMode - This flag indicates if an exception should be thrown if an expected field fails to convert and
write. Possible values are:

IgnoreAndContinue - Ignore the error and continue to load other properties of the record.
ReportAndContinue - Report the error to the POCO entity if it is of IChoRecord type.
ThrowAndStop - Throw the error and stop the execution.
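As a hedged sketch, these members can be set on the field attribute; the exact parameter names below are assumptions based on the list above:

```csharp
public class EmployeeRec
{
    // Map the Id property to a differently named Parquet column.
    [ChoParquetRecordField(FieldName = "employee_id")]
    public int Id { get; set; }

    // Per-field ErrorMode overrides the record-level setting (name assumed).
    [ChoParquetRecordField(FieldName = "employee_name", ErrorMode = ChoErrorMode.IgnoreAndContinue)]
    public string Name { get; set; }
}
```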

8.1. DefaultValue
Any POCO entity property can be given a default value using
System.ComponentModel.DefaultValueAttribute. It is the value written when the Parquet value is null
(controlled via IgnoreFieldValueMode).

8.2. ChoFallbackValue

Any POCO entity property can be given a fallback value using ChoETL.ChoFallbackValueAttribute. It is the
value used when the property fails to write to Parquet. The fallback value is only used when ErrorMode is either
IgnoreAndContinue or ReportAndContinue.
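Putting the two together, a minimal sketch of a POCO using both attributes, with the behavior following the rules described above:

```csharp
public class EmployeeRec
{
    [ChoParquetRecordField]
    public int Id { get; set; }

    // "XXXX" is written when Name is null (subject to IgnoreFieldValueMode);
    // "Unknown" is used only if writing the value fails and ErrorMode is
    // IgnoreAndContinue or ReportAndContinue.
    [ChoParquetRecordField]
    [DefaultValue("XXXX")]
    [ChoFallbackValue("Unknown")]
    public string Name { get; set; }
}
```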

8.3. Type Converters


Most primitive types are automatically converted to string/text and saved to the Parquet file. If the value of a Parquet field
can't be converted automatically, you can specify custom / built-in .NET converters to convert the value to text.
These can be IValueConverter, IChoValueConverter or TypeConverter converters.

There are a couple of ways you can specify the converters for each field:

Declarative Approach
Configuration Approach

8.3.1. Declarative Approach

This model is applicable to POCO entity objects only. If you have a POCO class, you can specify converters on each property to
carry out the necessary conversions. The samples below show how.

Listing 8.3.1.1 Specifying type converters

public class EmployeeRec
{
    [ChoParquetRecordField]
    [ChoTypeConverter(typeof(IntConverter))]
    public int Id { get; set; }

    [ChoParquetRecordField]
    [Required]
    [DefaultValue("ZZZ")]
    public string Name { get; set; }
}

Listing 8.3.1.2 IntConverter implementation

public class IntConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
    {
        return value;
    }

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
    {
        int intValue = (int)value;
        return intValue.ToString("D4");
    }
}

In the example above, we defined a custom IntConverter class and showed how to format the 'Id' Parquet property with leading
zeros.

8.3.2. Configuration Approach

This model is applicable to both dynamic and POCO entity objects. It gives you the freedom to attach converters to each property at
runtime. This takes precedence over the declarative converters on POCO classes.

Listing 8.3.2.1 Specifying TypeConverters

ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();

ChoParquetRecordFieldConfiguration idConfig = new ChoParquetRecordFieldConfiguration("Id");
idConfig.AddConverter(new IntConverter());
config.ParquetRecordFieldConfigurations.Add(idConfig);

config.ParquetRecordFieldConfigurations.Add(new ChoParquetRecordFieldConfiguration("Name"));
config.ParquetRecordFieldConfigurations.Add(new ChoParquetRecordFieldConfiguration("Name1"));

In the above, we construct and attach the IntConverter to the 'Id' field using the AddConverter helper method on the
ChoParquetRecordFieldConfiguration object.

Likewise, if you want to remove any converter, you can use RemoveConverter on the ChoParquetRecordFieldConfiguration
object.

8.3.3. Custom Value Converter Approach

This approach allows you to attach a value converter to each Parquet member using the fluent API. This is a quick way to handle any
odd conversion process and avoids creating a value converter class.

Listing 8.3.3.1 POCO class

public class EmployeeRec
{
    [ChoParquetRecordField]
    public int Id { get; set; }

    [ChoParquetRecordField(2, FieldName = "Name", QuoteField = true)]
    [Required]
    [DefaultValue("ZZZ")]
    public string Name { get; set; }
}

With the fluent API, the sample below shows how to attach a value converter to the Id column.

Listing 8.3.3.2 Attaching Value Converter

using (var w = new ChoParquetWriter<EmployeeRec>(@"Test.parquet")
    .WithField(c => c.Id, valueConverter: (v) => ((int)v).ToString("C3", CultureInfo.CurrentCulture))
    )
{
    w.Write(objs);
}

8.4. Validations
ParquetWriter leverages both System.ComponentModel.DataAnnotations and Validation
Block validation attributes to specify validation rules for individual fields of a POCO entity. Refer to the MSDN site for a list of
available DataAnnotations validation attributes.

Listing 8.4.1 Using validation attributes in POCO entity

[ChoParquetRecordObject]
public partial class EmployeeRec
{
[ChoParquetRecordField(1, FieldName = "id")]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }

[ChoParquetRecordField]
[Required]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
}

In the example above, we used the Range validation attribute for the Id property and the Required validation attribute for the
Name property. ParquetWriter performs validation on them before saving the data to file when
Configuration.ObjectValidationMode is set to ChoObjectValidationMode.MemberLevel or
ChoObjectValidationMode.ObjectLevel.

In some cases, you may want to take control and perform manual self-validation within the POCO entity class. This can be achieved
by implementing the IChoValidatable interface on the POCO object.

Listing 8.4.2 Manual validation on POCO entity

[ChoParquetRecordObject]
public partial class EmployeeRec : IChoValidatable
{
    [ChoParquetRecordField]
    [ChoTypeConverter(typeof(IntConverter))]
    [Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
    [ChoFallbackValue(1)]
    public int Id { get; set; }

    [ChoParquetRecordField]
    [Required]
    [DefaultValue("ZZZ")]
    [ChoFallbackValue("XXX")]
    public string Name { get; set; }

    public bool TryValidate(object target, ICollection<ValidationResult> validationResults)
    {
        return true;
    }

    public bool TryValidateFor(object target, string memberName, ICollection<ValidationResult> validationResults)
    {
        return true;
    }
}

Sample above shows how to implement custom self-validation in POCO object.

IChoValidatable interface exposes the below methods:

TryValidate - Validates the entire object; returns true if all validations passed, otherwise false.
TryValidateFor - Validates a specific property of the object; returns true if all validations passed, otherwise false.

8.5. ChoIgnoreMember
If you want to ignore a POCO class member from Parquet writing in OptOut mode, decorate it with
ChoIgnoreMemberAttribute. The sample below shows the Title member being ignored from the Parquet writing process.

Listing 8.5.1 Ignore a member


public class EmployeeRec
{
    public int Id { get; set; }
    public string Name { get; set; }
    [ChoIgnoreMember]
    public string Title { get; set; }
}

8.6. StringLength
In OptOut mode, you can specify the size of the Parquet column by using
System.ComponentModel.DataAnnotations.StringLengthAttribute. 

Listing 8.6.1 Specifying Size of Parquet member


public class EmployeeRec
{
    public int Id { get; set; }
    [StringLength(25)]
    public string Name { get; set; }
    [ChoIgnoreMember]
    public string Title { get; set; }
}

8.7. Display
In OptOut mode, you can specify the name of the Parquet column mapped to a member using
System.ComponentModel.DataAnnotations.DisplayAttribute.

Listing 8.7.1 Specifying name of Parquet field

public class EmployeeRec
{
    public int Id { get; set; }
    [Display(Name="FullName")]
    [StringLength(25)]
    public string Name { get; set; }
    [ChoIgnoreMember]
    public string Title { get; set; }
}

8.8. DisplayName
In OptOut mode, you can specify the name of the Parquet column mapped to a member using
System.ComponentModel.DisplayNameAttribute.

Listing 8.8.1 Specifying name of Parquet field

public class EmployeeRec
{
    public int Id { get; set; }
    [DisplayName("FullName")]
    [StringLength(25)]
    public string Name { get; set; }
    [ChoIgnoreMember]
    public string Title { get; set; }
}

10. Callback Mechanism


ParquetWriter offers industry-standard Parquet data file generation out of the box to handle most needs. If the
generation process does not handle any of your needs, you can use the callback mechanism offered by ParquetWriter to
handle such situations. In order to participate in the callback mechanism, you can use either of the following models:

Using event handlers exposed by ParquetWriter via the IChoWriter interface.
Inheriting the POCO entity object from the IChoNotifyRecordWrite / IChoNotifyFileWrite
/ IChoNotifyRecordFieldWrite interfaces.
Inheriting DataAnnotation's MetadataType type object from the IChoNotifyRecordWrite /
IChoNotifyFileWrite / IChoNotifyRecordFieldWrite interfaces.

In order to participate in the callback mechanism, either the POCO entity object or DataAnnotation's MetadataType type object must
implement the IChoNotifyRecordWrite interface.

Tip: Any exceptions raised out of these interface methods will be ignored.

IChoWriter exposes the below events:

BeginWrite - Invoked at the beginning of the Parquet file write
EndWrite - Invoked at the end of the Parquet file write
BeforeRecordWrite - Raised before the Parquet record write
AfterRecordWrite - Raised after the Parquet record write
RecordWriteError - Raised when a Parquet record errors out while writing
BeforeRecordFieldWrite - Raised before a Parquet column value write
AfterRecordFieldWrite - Raised after a Parquet column value write
RecordFieldWriteError - Raised when a Parquet column value errors out while writing

IChoNotifyRecordWrite exposes the below methods:

BeforeRecordWrite - Raised before the Parquet record write
AfterRecordWrite - Raised after the Parquet record write
RecordWriteError - Raised when a Parquet record write errors out

IChoNotifyFileWrite exposes the below methods:

BeginWrite - Invoked at the beginning of the Parquet file write
EndWrite - Invoked at the end of the Parquet file write

IChoNotifyRecordFieldWrite exposes the below methods:

BeforeRecordFieldWrite - Raised before a Parquet column value write
AfterRecordFieldWrite - Raised after a Parquet column value write
RecordFieldWriteError - Raised when a Parquet column value write errors out

IChoNotifyFileHeaderArrange exposes the below methods:

FileHeaderArrange - Raised before the Parquet file header is written to the file, an opportunity to rearrange the Parquet
columns

IChoNotifyFileHeaderWrite exposes the below methods:

FileHeaderWrite - Raised before the Parquet file header is written to the file, an opportunity to customize the header.

10.1 Using ParquetWriter events


This is the most direct and simplest way to subscribe to the callback events and handle odd situations when generating Parquet files.
The downside is that the code can't be reused the way it can by implementing IChoNotifyRecordWrite on the POCO record object.

The sample below shows how to use the BeforeRecordWrite callback event to skip records that contain a 'name1' field.

Listing 10.1.1 Using ParquetWriter callback events

static void IgnoreLineTest()
{
    using (var parser = new ChoParquetWriter("emp.parquet"))
    {
        parser.BeforeRecordWrite += (o, e) =>
        {
            if (e.Source != null)
            {
                e.Skip = ((IDictionary<string, object>)e.Source).ContainsKey("name1");
            }
        };

        parser.Write(objs); // objs - the list of records to write
    }
}

Likewise, you can use the other callback events with ParquetWriter.

10.2 Implementing IChoNotifyRecordWrite interface


The sample below shows how to implement the IChoNotifyRecordWrite interface directly on the POCO class.

Listing 10.2.1 Direct POCO callback mechanism implementation

[ChoParquetRecordObject]
public partial class EmployeeRec : IChoNotifyRecordWrite
{
    [ChoParquetRecordField]
    [ChoTypeConverter(typeof(IntConverter))]
    [Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
    [ChoFallbackValue(1)]
    public int Id { get; set; }

    [ChoParquetRecordField]
    [Required]
    [DefaultValue("ZZZ")]
    [ChoFallbackValue("XXX")]
    public string Name { get; set; }

    public bool AfterRecordWrite(object target, int index, object source)
    {
        throw new NotImplementedException();
    }

    public bool BeforeRecordWrite(object target, int index, ref object source)
    {
        throw new NotImplementedException();
    }

    public bool RecordWriteError(object target, int index, object source, Exception ex)
    {
        throw new NotImplementedException();
    }
}

The sample below shows how to attach a Metadata class to the POCO class by using MetadataTypeAttribute on it.

Listing 10.2.2 MetaDataType based callback mechanism implementation

[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordWrite
{
    [ChoParquetRecordField]
    [ChoTypeConverter(typeof(IntConverter))]
    [Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
    [ChoFallbackValue(1)]
    public int Id { get; set; }

    [ChoParquetRecordField]
    [Required]
    [DefaultValue("ZZZ")]
    [ChoFallbackValue("XXX")]
    public string Name { get; set; }

    public bool AfterRecordWrite(object target, int index, object source)
    {
        throw new NotImplementedException();
    }

    public bool BeforeRecordWrite(object target, int index, ref object source)
    {
        throw new NotImplementedException();
    }

    public bool RecordWriteError(object target, int index, object source, Exception ex)
    {
        throw new NotImplementedException();
    }
}

[MetadataType(typeof(EmployeeRecMeta))]
public partial class EmployeeRec
{
    public int Id { get; set; }
    public string Name { get; set; }
}

The sample below shows how to attach a Metadata class to a sealed or third-party POCO class by using
ChoMetadataRefTypeAttribute on it.

Listing 10.2.3 ChoMetadataRefType based callback mechanism implementation

[ChoMetadataRefType(typeof(EmployeeRec))]
[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordWrite
{
    [ChoParquetRecordField]
    [ChoTypeConverter(typeof(IntConverter))]
    [Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
    [ChoFallbackValue(1)]
    public int Id { get; set; }

    [ChoParquetRecordField]
    [Required]
    [DefaultValue("ZZZ")]
    [ChoFallbackValue("XXX")]
    public string Name { get; set; }

    public bool AfterRecordWrite(object target, int index, object source)
    {
        throw new NotImplementedException();
    }

    public bool BeforeRecordWrite(object target, int index, ref object source)
    {
        throw new NotImplementedException();
    }

    public bool RecordWriteError(object target, int index, object source, Exception ex)
    {
        throw new NotImplementedException();
    }
}

public partial class EmployeeRec
{
    public int Id { get; set; }
    public string Name { get; set; }
}

10.1 BeginWrite
This callback is invoked once at the beginning of the Parquet file write. source is the Parquet file stream object. Here you have a
chance to inspect the stream; return true to continue the Parquet generation, or false to stop it.

Listing 10.1.1 BeginWrite Callback Sample

public bool BeginWrite(object source)
{
    StreamWriter sw = source as StreamWriter;
    return true;
}

10.2 EndWrite
This callback is invoked once at the end of the Parquet file generation. source is the Parquet file stream object. Here you have a
chance to inspect the stream and perform any post steps on it.

Listing 10.2.1 EndWrite Callback Sample

public void EndWrite(object source)
{
    StreamWriter sw = source as StreamWriter;
}

10.3 BeforeRecordWrite
This callback is invoked before each POCO record object is written to the Parquet file. target is the instance of the POCO record
object. index is the record index. source is the Parquet record. Here you have a chance to inspect the POCO object, and
modify the Parquet record if needed.

Tip: If you want to skip the record from writing, set the source to null.

Return true to continue the write process, otherwise return false to stop it.

Listing 10.3.1 BeforeRecordWrite Callback Sample

public bool BeforeRecordWrite(object target, int index, ref object source)
{
    return true;
}
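As a variant, the tip above (skipping a record by nulling the source) can be sketched as follows; ShouldSkip is a hypothetical predicate standing in for your own skip logic:

```csharp
public bool BeforeRecordWrite(object target, int index, ref object source)
{
    // Null the source to skip writing this record, per the tip above.
    if (ShouldSkip(target))   // ShouldSkip: hypothetical user-defined check
        source = null;
    return true;              // keep processing subsequent records
}
```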

10.4 AfterRecordWrite
This callback is invoked after each POCO record object is written to the Parquet file. target is the instance of the POCO record object. index is the record index in the file. source is the Parquet record. Here you have a chance to do any post-step operation with the record.

Return true to continue the write process; otherwise, return false to stop it.

Listing 10.4.1 AfterRecordWrite Callback Sample

public bool AfterRecordWrite(object target, int index, object source)
{
return true;
}

10.5 RecordWriteError
This callback is invoked if an error is encountered while writing a POCO record object. target is the instance of the POCO record object. index is the record index in the file. source is the Parquet record. ex is the exception object. Here you have a chance to handle the exception. This method is invoked only when Configuration.ErrorMode is ReportAndContinue.

Return true to continue the write process; otherwise, return false to stop it.

Listing 10.5.1 RecordWriteError Callback Sample

public bool RecordWriteError(object target, int index, object source, Exception ex)
{
return true;
}

10.6 BeforeRecordFieldWrite
This callback is invoked before each Parquet record column is written to the Parquet file. target is the instance of the POCO record object. index is the record index in the file. propName is the Parquet record property name. value is the Parquet column value. Here, you have a chance to inspect the Parquet record property value and perform any custom validations, etc.

Return true to continue the write process; otherwise, return false to stop it.

Listing 10.6.1 BeforeRecordFieldWrite Callback Sample

public bool BeforeRecordFieldWrite(object target, int index, string propName, ref object value)
{
return true;
}

10.7 AfterRecordFieldWrite
This callback is invoked after each Parquet record column value is written to the Parquet file. target is the instance of the POCO record object. index is the record index in the file. propName is the Parquet record property name. value is the Parquet column value. Any post-field operation can be performed here, like computing other properties, validations, etc.

Return true to continue the write process; otherwise, return false to stop it.

Listing 10.7.1 AfterRecordFieldWrite Callback Sample

public bool AfterRecordFieldWrite(object target, int index, string propName, object value)
{
return true;
}

10.8 RecordWriteFieldError
This callback is invoked when an error is encountered while writing a Parquet record column value. target is the instance of the POCO record object. index is the record index in the file. propName is the Parquet record property name. value is the Parquet column value. ex is the exception object. Here you have a chance to handle the exception. This method is invoked only after the below two steps are performed by the ParquetWriter:

ParquetWriter looks for the FallbackValue of each Parquet property. If present, it tries to use it for writing.
If the FallbackValue is not present and Configuration.ErrorMode is specified as ReportAndContinue, this callback will be executed.

Return true to continue the write process; otherwise, return false to stop it.

Listing 10.8.1 RecordFieldWriteError Callback Sample

public bool RecordFieldWriteError(object target, int index, string propName, object value, Exception ex)
{
return true;
}

11. Customization
ParquetWriter automatically detects and loads the configuration settings from the POCO entity. At runtime, you can customize and tweak these parameters before Parquet generation. ParquetWriter exposes the Configuration property, of type ChoParquetRecordConfiguration. Using this property, you can perform the customization.
Listing 11.1 Customizing ParquetWriter at run-time

class Program
{
static void Main(string[] args)
{
List<ExpandoObject> objs = new List<ExpandoObject>();
dynamic rec1 = new ExpandoObject();
rec1.Id = 1;
rec1.Name = "Mark";
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 2;
rec2.Name = "Jason";
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))


{
parser.Configuration.ColumnCountStrict = true;
parser.Write(objs);
}
}
}

12. Using Dynamic Object


So far, the article explained using ParquetWriter with POCO objects. ParquetWriter also supports generating a Parquet file without POCO entity objects, leveraging the .NET dynamic feature. The Parquet schema is determined from the first object. If a mismatch is found in the dynamic objects' member values, an error will be raised and the generation process stopped.

The sample below shows it:

Listing 12.1 Generating Parquet file from dynamic objects

class Program
{
static void Main(string[] args)
{
List<ExpandoObject> objs = new List<ExpandoObject>();
dynamic rec1 = new ExpandoObject();
rec1.Id = 1;
rec1.Name = "Mark";
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 2;
rec2.Name = "Jason";
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))


{
parser.Configuration.ColumnCountStrict = true;
parser.Write(objs);
}
}
}

13. Exceptions
ParquetWriter throws different types of exceptions in different situations.

ChoParserException - the Parquet file is bad and the parser is not able to recover.
ChoRecordConfigurationException - raised when any invalid configuration settings are specified.
ChoMissingRecordFieldException - raised when a property is missing for a Parquet column.
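A minimal sketch of guarding a write against these exceptions is shown below; objs is a record list as in the earlier samples, and the message handling is illustrative only:

```csharp
try
{
    using (var w = new ChoParquetWriter("emp.parquet"))
    {
        w.Configuration.ColumnCountStrict = true;
        w.Write(objs);
    }
}
catch (ChoRecordConfigurationException ex)
{
    // Raised for invalid configuration settings
    Console.WriteLine("Bad configuration: " + ex.Message);
}
catch (ChoMissingRecordFieldException ex)
{
    // Raised when a property is missing for a Parquet column
    Console.WriteLine("Missing field: " + ex.Message);
}
```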

15. Using MetadataType Annotation


Cinchoo ETL works better with the data annotation MetadataType model. It is a way to attach a metadata class to a data model class. In this associated class, you provide additional metadata information that is not in the data model. Its role is to add attributes to a class without having to modify it. This is useful when the POCO classes are auto-generated by tools (Entity Framework, MVC, etc.). This is why the second class comes into play: you can add new behavior without touching the generated file. It also promotes modularization by separating the concerns into multiple classes.

For more information about it, please search in MSDN.

Listing 15.1 MetadataType annotation usage sample

[MetadataType(typeof(EmployeeRecMeta))]
public class EmployeeRec
{
public int Id { get; set; }
public string Name { get; set; }
}

[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordWrite, IChoValidatable
{
[ChoParquetRecordField]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }

[ChoParquetRecordField]
[StringLength(1)]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }

public bool AfterRecordWrite(object target, int index, object source)


{
throw new NotImplementedException();
}

public bool BeforeRecordWrite(object target, int index, ref object source)


{
throw new NotImplementedException();
}

public bool RecordWriteError(object target, int index, object source, Exception ex)
{
throw new NotImplementedException();
}

public bool TryValidate(object target, ICollection<ValidationResult> validationResults)


{
return true;
}

public bool TryValidateFor(object target, string memberName, ICollection<ValidationResult> validationResults)
{
return true;

}
}

In the above, EmployeeRec is the data class, containing only domain-specific properties and operations. Note how simple the class is.

We separate the validation, callback mechanism, configuration, etc. into the metadata type class, EmployeeRecMeta.

16. Configuration Choices


If the POCO entity class is auto-generated, exposed via a library, or sealed, it limits you from attaching the Parquet schema definition to it declaratively. In such cases, you can choose one of the options below to specify the Parquet layout configuration:

Manual Configuration
Auto Map Configuration
Attaching MetadataType class 

I'm going to show you how to configure the below POCO entity class using each approach.

Listing 16.1 Sealed POCO entity class

public sealed class EmployeeRec


{
public int Id { get; set; }
public string Name { get; set; }
}

16.1 Manual Configuration
Define a brand new configuration object from scratch and add all the necessary Parquet fields to the ChoParquetRecordConfiguration.ParquetRecordFieldConfigurations collection property. This option gives you greater flexibility to control the configuration of Parquet parsing. The downside is the possibility of making mistakes, and such configurations are hard to manage if the Parquet file layout is large.

Listing 16.1.1 Manual Configuration

ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();


config.ThrowAndStopOnMissingField = true;
config.ParquetRecordFieldConfigurations.Add(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add(new ChoParquetRecordFieldConfiguration("Name"));
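The configuration built above is then passed to the writer; this is a sketch, assuming the constructor overload that accepts a configuration object (the same overload used in Listing 16.2.2):

```csharp
// objs is the record list from the earlier samples
using (var w = new ChoParquetWriter("emp.parquet", config))
{
    w.Write(objs);
}
```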

16.2 Auto Map Configuration
This is an alternative and much less error-prone approach to auto map the Parquet columns for the POCO entity class.

First, define a schema class for the EmployeeRec POCO entity class as below:

Listing 16.2.1 Auto Map class

public class EmployeeRecMap


{
[ChoParquetRecordField]
public int Id { get; set; }

[ChoParquetRecordField]
public string Name { get; set; }
}

Then you can use it to auto map the Parquet columns by using the ChoParquetRecordConfiguration.MapRecordFields method:

Listing 16.2.2 Using Auto Map configuration

ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.MapRecordFields<EmployeeRecMap>();

EmployeeRec rec1 = new EmployeeRec();
rec1.Id = 2;
rec1.Name = "Jason";

using (var w = new ChoParquetWriter<EmployeeRec>("emp.parquet", config))
    w.Write(rec1);

16.3 Attaching MetadataType class


This is another approach: attach a MetadataType class to the POCO entity object. The previous approach simply takes care of auto mapping of Parquet columns only; other configuration properties like property converters, parser parameters, default/fallback values, etc. are not considered.

This model accounts for everything by defining a MetadataType class and specifying the Parquet configuration parameters declaratively. This is useful when your POCO entity is a sealed, non-partial class. It is also a favorable and less error-prone approach to configure Parquet parsing of a POCO entity.

Listing 16.3.1 Define MetadataType class

[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordWrite, IChoValidatable
{
[ChoParquetRecordField]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
public int Id { get; set; }

[ChoParquetRecordField]
[StringLength(1)]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }

public bool AfterRecordWrite(object target, int index, object source)


{
throw new NotImplementedException();
}

public bool BeforeRecordWrite(object target, int index, ref object source)


{
throw new NotImplementedException();
}

public bool RecordWriteError(object target, int index, object source, Exception ex)
{
throw new NotImplementedException();
}

public bool TryValidate(object target, ICollection<ValidationResult> validationResults)


{
return true;
}

public bool TryValidateFor(object target, string memberName, ICollection<ValidationResult> validationResults)
{
return true;
}
}

Listing 16.3.2 Attaching MetadataType class

//Attach metadata
ChoMetadataObjectCache.Default.Attach<EmployeeRec>(new EmployeeRecMeta());

using (var tx = File.OpenWrite("emp.parquet"))


{
using (var parser = new ChoParquetWriter<EmployeeRec>(tx))
{
parser.Write(objs);
}
}

18. Writing DataReader Helper Method


This helper method lets you create a Parquet file / stream from an ADO.NET DataReader.

static void WriteDataReaderTest()


{
    string connString = @"Data Source=(localdb)\v11.0;Initial Catalog=TestDb;Integrated Security=True";
    SqlConnection conn = new SqlConnection(connString);
conn.Open();
SqlCommand cmd = new SqlCommand("SELECT * FROM Members", conn);
IDataReader dr = cmd.ExecuteReader();

using (var parser = new ChoParquetWriter("emp.parquet"))


{
parser.Write(dr);
}
}

19. Writing DataTable Helper Method


This helper method lets you create a Parquet file / stream from an ADO.NET DataTable.

static void WriteDataTableTest()


{
string connString = @"Data Source=(localdb)\v11.0;Initial Catalog=TestDb;Integrated Security=True";

SqlConnection conn = new SqlConnection(connString);


conn.Open();
SqlCommand cmd = new SqlCommand("SELECT * FROM Members", conn);
SqlDataAdapter da = new SqlDataAdapter(cmd);
DataTable dt = new DataTable();
da.Fill(dt);

using (var parser = new ChoParquetWriter("emp.parquet"))
{
parser.Write(dt);
}
}

20. Advanced Topics

20.1 Override Converters Format Specs


Cinchoo ETL automatically converts each Parquet column value to the corresponding Parquet column's underlying data type seamlessly. Most of the basic .NET types are handled automatically without any setup needed.

This is achieved through two key settings in the ETL system:

1. ChoParquetRecordConfiguration.CultureInfo - Represents information about a specific culture including the names of the
culture, the writing system, and the calendar used, as well as access to culture-specific objects that provide information for
common operations, such as formatting dates and sorting strings. Default is 'en-US'.
2. ChoTypeConverterFormatSpec - It is global format specifier class holds all the intrinsic .NET types formatting specs.

In this section, I'm going to talk about changing the default format specs for each .NET intrinsic data types according to parsing
needs.

ChoTypeConverterFormatSpec is a singleton class; the instance is exposed via the 'Instance' static member. It is thread-local, meaning a separate instance copy is kept on each thread.

There are two sets of format spec members for each intrinsic type, one for loading and another for writing the value, except for the Boolean, Enum, and DateTime types. These types have only one member for both loading and writing operations.

Specifying each intrinsic data type's format specs through ChoTypeConverterFormatSpec will impact the system-wide behavior; i.e., setting ChoTypeConverterFormatSpec.IntNumberStyle = NumberStyles.AllowParentheses will cause all integer members of Parquet objects to allow parentheses. If you want to override this behavior and take control of a specific Parquet data member's handling apart from the global system-wide setting, it can be done by specifying a TypeConverter at the Parquet field member level. Refer to section 8.3 for more information.
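As a sketch, a field-level converter attached declaratively overrides the global format specs for that member only; IntConverter here is a hypothetical converter type (the same name used in Listing 15.1):

```csharp
public class EmployeeRec
{
    // Hypothetical converter: handles this member's formatting on its own,
    // leaving the global ChoTypeConverterFormatSpec settings untouched
    [ChoParquetRecordField]
    [ChoTypeConverter(typeof(IntConverter))]
    public int Id { get; set; }

    [ChoParquetRecordField]
    public string Name { get; set; }
}
```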

Listing 20.1.1 ChoTypeConverterFormatSpec Members

public class ChoTypeConverterFormatSpec


{
public static readonly ThreadLocal<ChoTypeConverterFormatSpec> Instance = new ThreadLocal<ChoTypeConverterFormatSpec>(() => new ChoTypeConverterFormatSpec());

public string DateTimeFormat { get; set; }


public ChoBooleanFormatSpec BooleanFormat { get; set; }
public ChoEnumFormatSpec EnumFormat { get; set; }

public NumberStyles? CurrencyNumberStyle { get; set; }


public string CurrencyFormat { get; set; }

public NumberStyles? BigIntegerNumberStyle { get; set; }


public string BigIntegerFormat { get; set; }

public NumberStyles? ByteNumberStyle { get; set; }


public string ByteFormat { get; set; }

public NumberStyles? SByteNumberStyle { get; set; }


public string SByteFormat { get; set; }

public NumberStyles? DecimalNumberStyle { get; set; }


public string DecimalFormat { get; set; }

public NumberStyles? DoubleNumberStyle { get; set; }


public string DoubleFormat { get; set; }

public NumberStyles? FloatNumberStyle { get; set; }


public string FloatFormat { get; set; }

public string IntFormat { get; set; }


public NumberStyles? IntNumberStyle { get; set; }

public string UIntFormat { get; set; }


public NumberStyles? UIntNumberStyle { get; set; }

public NumberStyles? LongNumberStyle { get; set; }


public string LongFormat { get; set; }

public NumberStyles? ULongNumberStyle { get; set; }


public string ULongFormat { get; set; }

public NumberStyles? ShortNumberStyle { get; set; }


public string ShortFormat { get; set; }

public NumberStyles? UShortNumberStyle { get; set; }


public string UShortFormat { get; set; }
}

The sample below shows how to override the default DateTime and Boolean format specs before writing Parquet output using ParquetWriter.

Listing 20.1.2 Using ChoTypeConverterFormatSpec in code

static void FormatSpecDynamicTest()


{
ChoTypeConverterFormatSpec.Instance.DateTimeFormat = "d";
ChoTypeConverterFormatSpec.Instance.BooleanFormat = ChoBooleanFormatSpec.YOrN;

List<ExpandoObject> objs = new List<ExpandoObject>();


dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))


{
parser.Write(objs);
}
}

20.2 Currency Support
Cinchoo ETL provides the ChoCurrency object to read and write currency values in Parquet files. ChoCurrency is a wrapper class that holds the currency value in a decimal type, along with support for serializing it in text format during Parquet write.

Listing 20.2.1 Using Currency members in dynamic model

static void CurrencyDynamicTest()


{
ChoTypeConverterFormatSpec.Instance.CurrencyFormat = "C2";

List<ExpandoObject> objs = new List<ExpandoObject>();


dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
rec1.Salary = new ChoCurrency(100000);
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
rec2.Salary = new ChoCurrency(150000);
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))


{

parser.Write(objs);
}
}

The sample above shows how to output currency values using the dynamic object model.

PS: The format of the currency value is determined by ParquetWriter through ChoRecordConfiguration.Culture and ChoTypeConverterFormatSpec.CurrencyFormat.

The sample below shows how to use a ChoCurrency Parquet field in a POCO entity class.

Listing 20.2.2 Using Currency members in POCO model

public class EmployeeRecWithCurrency


{
public int Id { get; set; }
public string Name { get; set; }
public ChoCurrency Salary { get; set; }
}

static void CurrencyPOCOTest()


{
List<EmployeeRecWithCurrency> objs = new List<EmployeeRecWithCurrency>();
EmployeeRecWithCurrency rec1 = new EmployeeRecWithCurrency();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.Salary = new ChoCurrency(100000);
objs.Add(rec1);

EmployeeRecWithCurrency rec2 = new EmployeeRecWithCurrency();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.Salary = new ChoCurrency(150000);
objs.Add(rec2);

using (var parser = new ChoParquetWriter<EmployeeRecWithCurrency>("emp.parquet"))


{
parser.Write(objs);
}
}

20.3 Enum Support
Cinchoo ETL implicitly handles writing of enum column values to Parquet files. If you want fine control over the formatting of these values, you can specify it globally via ChoTypeConverterFormatSpec.EnumFormat. Default is ChoEnumFormatSpec.Value.

FYI, changing this value will impact system-wide behavior.

There are three possible values that can be used:

1. ChoEnumFormatSpec.Value - The enum value is used for writing.
2. ChoEnumFormatSpec.Name - The enum key name is used for writing.
3. ChoEnumFormatSpec.Description - If each enum key is decorated with DescriptionAttribute, its value will be used for writing.

Listing 20.3.1 Specifying Enum format specs during parsing

public enum EmployeeType


{
[Description("Full Time Employee")]
Permanent = 0,
[Description("Temporary Employee")]
Temporary = 1,
[Description("Contract Employee")]
Contract = 2
}
static void EnumTest()


{
ChoTypeConverterFormatSpec.Instance.EnumFormat = ChoEnumFormatSpec.Description;

List<ExpandoObject> objs = new List<ExpandoObject>();


dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
rec1.Salary = new ChoCurrency(100000);
rec1.Status = EmployeeType.Permanent;
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
rec2.Salary = new ChoCurrency(150000);
rec2.Status = EmployeeType.Contract;
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))


{
parser.Write(objs);
}
}

20.4 Boolean Support
Cinchoo ETL implicitly handles writing of boolean Parquet column values to Parquet files. If you want fine control over the formatting of these values, you can specify it globally via ChoTypeConverterFormatSpec.BooleanFormat. Default value is ChoBooleanFormatSpec.ZeroOrOne.

FYI, changing this value will impact system-wide behavior.

There are four possible values that can be used:

1. ChoBooleanFormatSpec.ZeroOrOne - '0' for false, '1' for true.
2. ChoBooleanFormatSpec.YOrN - 'Y' for true, 'N' for false.
3. ChoBooleanFormatSpec.TrueOrFalse - 'True' for true, 'False' for false.
4. ChoBooleanFormatSpec.YesOrNo - 'Yes' for true, 'No' for false.

Listing 20.4.1 Specifying boolean format specs during parsing

static void BoolTest()


{
ChoTypeConverterFormatSpec.Instance.BooleanFormat = ChoBooleanFormatSpec.YOrN;

List<ExpandoObject> objs = new List<ExpandoObject>();


dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
rec1.Salary = new ChoCurrency(100000);
rec1.Status = EmployeeType.Permanent;
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
rec2.Salary = new ChoCurrency(150000);

rec2.Status = EmployeeType.Contract;
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))


{
parser.Write(objs);
}
}

20.5 DateTime Support
Cinchoo ETL implicitly handles writing of datetime Parquet column values to Parquet files using the system culture or a custom-set culture. If you want fine control over the formatting of these values, you can specify it globally via ChoTypeConverterFormatSpec.DateTimeFormat. Default value is 'd'.

FYI, changing this value will impact system-wide behavior.

You can use any valid standard or custom datetime .NET format specification to write the datetime values to the file.

Listing 20.5.1 Specifying datetime format specs during parsing

static void DateTimeDynamicTest()


{
ChoTypeConverterFormatSpec.Instance.DateTimeFormat = "MMM dd, yyyy";

List<ExpandoObject> objs = new List<ExpandoObject>();


dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
rec1.Salary = new ChoCurrency(100000);
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
rec2.Salary = new ChoCurrency(150000);
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet"))


{
parser.Write(objs);
}
}

The sample above shows how to generate custom datetime values to a Parquet file.

21. Fluent API


ParquetWriter exposes a few frequently used configuration parameters via fluent API methods. This makes programming the generation of Parquet files quicker.

21.1. NullValueHandling
Specifies null value handling options for the ChoParquetWriter

Ignore - Ignore null values while writing Parquet

Default - Include null values while writing Parquet
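A sketch of selecting a handling mode through the Configure method shown in section 21.7; the ChoNullValueHandling enum name and the NullValueHandling configuration property are assumptions based on the option names above:

```csharp
using (var w = new ChoParquetWriter("emp.parquet")
    .Configure(c => c.NullValueHandling = ChoNullValueHandling.Ignore)
    )
{
    w.Write(objs);
}
```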


21.2. Formatting
Specifies formatting options for the ChoParquetWriter

None - No special formatting is applied. This is the default.

Indented - Causes child objects to be indented.

21.3 WithFields
This API method specifies the list of Parquet fields to be considered for writing the Parquet file. Other fields will be discarded. Field names are case-insensitive.

static void QuickDynamicTest()


{
List<ExpandoObject> objs = new List<ExpandoObject>();
dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
rec1.Salary = new ChoCurrency(100000);
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
rec2.Salary = new ChoCurrency(150000);
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet")


.WithFields("Id", "Name")
)
{
parser.Write(objs);
}
}

21.4 WithField
This API method is used to add a Parquet column with a specific data type. It is helpful in the dynamic object model, letting you specify each individual Parquet column with the appropriate data type.

static void QuickDynamicTest()


{
List<ExpandoObject> objs = new List<ExpandoObject>();
dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
rec1.Salary = new ChoCurrency(100000);
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
rec2.Salary = new ChoCurrency(150000);
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet")
.WithField("Id", typeof(int))
.WithField("Name")
)
{
parser.Write(objs);
}
}

21.5. IgnoreFieldValueMode
Specifies the ignore field value mode for the ChoParquetWriter:

None - Ignore field value is turned off. This is the default.

DbNull- DBNull value will be ignored.

Empty - Empty text value will be ignored.

WhiteSpace - Whitespace text will be ignored.
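A sketch of applying one of these modes via the Configure method from section 21.7; the ChoIgnoreFieldValueMode enum name and the IgnoreFieldValueMode configuration property are assumptions based on the option names above:

```csharp
using (var w = new ChoParquetWriter("emp.parquet")
    .Configure(c => c.IgnoreFieldValueMode = ChoIgnoreFieldValueMode.DbNull)
    )
{
    w.Write(objs);
}
```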

21.6 ColumnCountStrict
This API method instructs the ParquetWriter to perform a column count check before writing the Parquet file.

static void ColumnCountTest()


{
List<ExpandoObject> objs = new List<ExpandoObject>();
dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
rec1.JoinedDate = new DateTime(2001, 2, 2);
rec1.IsActive = true;
rec1.Salary = new ChoCurrency(100000);
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
rec2.JoinedDate = new DateTime(1990, 10, 23);
rec2.IsActive = false;
rec2.Salary = new ChoCurrency(150000);
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet")


.ColumnCountStrict()
)
{
parser.Write(objs);
}
}

21.7. Configure
This API method is used to configure any configuration parameters which are not exposed via the fluent API.

static void ConfigureTest()


{
List<ExpandoObject> objs = new List<ExpandoObject>();
dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();

rec2.Id = 200;
rec2.Name = "Lou";
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet")


.Configure(c => c.ErrorMode = ChoErrorMode.ThrowAndStop)
)
{
parser.Write(objs);
}
}

21.8. Setup
This API method is used to set up the writer's parameters / events via the fluent API.

static void SetupTest()


{
List<ExpandoObject> objs = new List<ExpandoObject>();
dynamic rec1 = new ExpandoObject();
rec1.Id = 10;
rec1.Name = "Mark";
objs.Add(rec1);

dynamic rec2 = new ExpandoObject();


rec2.Id = 200;
rec2.Name = "Lou";
objs.Add(rec2);

using (var parser = new ChoParquetWriter("emp.parquet")


.Setup(r => r.BeforeRecordWrite += (o, e) =>
{
})
)
{
parser.Write(objs);
}
}

22. FAQ

22.1. How to serialize an object?


This sample serializes an object to Parquet:

byte[] payload = ChoParquetWriter.Serialize(new Account


{
Email = "james@example.com",
Active = true,
Roles = new List<string>()
{
"DEV",
"OPS"
}
});

22.2. How to serialize collection of objects?


This sample serializes a collection of objects to Parquet:

byte[] payload = ChoParquetWriter.SerializeAll<Account>(new Account[] {


new Account
{
Email = "james@example.com",
Active = true,
Roles = new List<string>()
{
"DEV",
"OPS"
}
}
}
);

22.3. How to serialize dynamic object?


This sample serializes a dynamic object to Parquet:

dynamic obj = new ExpandoObject();


obj.Email = "james@example.com";
obj.Active = true;
obj.Roles = new List<string>()
{
"DEV",
"OPS"
};

byte[] payload = ChoParquetWriter.Serialize(obj);

22.4. How to serialize anonymous object?


This sample serializes an anonymous object to Parquet:

byte[] payload = ChoParquetWriter.Serialize(new


{
Email = "james@example.com",
Active = true,
Roles = new List<string>()
{
"DEV",
"OPS"
}
});

22.5. How to serialize collection?


This sample serializes a collection to Parquet:

byte[] payload = ChoParquetWriter.SerializeAll(new int[] { 1, 2, 3 });

22.6. How to serialize dictionary?


This sample serializes a dictionary to Parquet:

byte[] payload = ChoParquetWriter.SerializeAll(new Dictionary<string, int>[] {


new Dictionary<string, int>()
{
["key1"] = 1,
["key2"] = 2

}
});

22.7. How to serialize DataTable?


This sample serializes a DataTable to Parquet:

string connectionstring = @"Data Source=(localdb)\MSSQLLocalDB;Initial Catalog=Northwind;Integrated Security=True";
using (var conn = new SqlConnection(connectionstring))
{
conn.Open();
var comm = new SqlCommand("SELECT TOP 2 * FROM Customers", conn);
SqlDataAdapter adap = new SqlDataAdapter(comm);

DataTable dt = new DataTable("Customer");


adap.Fill(dt);

using (var parser = new ChoParquetWriter("emp.parquet"))


parser.Write(dt);
}
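The same Write(dt) call accepts any DataTable, not just one filled from a database. A minimal sketch building the table in memory (the column names and rows here are illustrative, not from the Northwind sample):

```csharp
// Build a small DataTable in memory; no database connection required.
DataTable dt = new DataTable("Customer");
dt.Columns.Add("Id", typeof(int));
dt.Columns.Add("Name", typeof(string));
dt.Rows.Add(1, "Tom");
dt.Rows.Add(2, "Mark");

// Write it out with the same API used above.
using (var parser = new ChoParquetWriter("emp.parquet"))
    parser.Write(dt);
```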

22.8. How to serialize Parquet to a file?

This sample serializes an object to a Parquet file.

// serialize Parquet to a byte array, then write it to a file
File.WriteAllBytes(@"c:\emp.parquet", ChoParquetWriter.Serialize(employee));

The sample below shows how to write directly to a file:

using (var r = new ChoParquetWriter(@"c:\emp.parquet"))
    r.Write(employee);

22.9. How to serialize a byte array to a file?

This sample serializes byte arrays to a Parquet file.

using (var w = new ChoParquetWriter("ByteArrayTest.parquet"))
{
    w.Write(new Dictionary<int, byte[]>()
    {
        [1] = Encoding.Default.GetBytes("Tom"),
        [2] = Encoding.Default.GetBytes("Mark")
    });
}

22.10. How to serialize an enum as an integer to a file?

This sample serializes an enum value as an integer to a Parquet file.

using (var w = new ChoParquetWriter("EnumTest.parquet")
    .WithField("Id")
    .WithField("Name")
    .WithField("EmpType", valueConverter: o => (int)o, fieldType: typeof(int))
    )
{
    w.Write(new
    {
        Id = 1,
        Name = "Tom",
        EmpType = EmployeeType.Permanent
    });
}

22.11. How to exclude a property from serialization?

These samples show how to exclude a property from Parquet serialization.

Sample 1: Using ChoIgnoreMemberAttribute

public class Account
{
    [ChoIgnoreMember]
    public string Email { get; set; }
    public bool Active { get; set; }
    public DateTime CreatedDate { get; set; }
    public IList<string> Roles { get; set; }
}

static void ExcludePropertyTest()
{
    byte[] payload = ChoParquetWriter.Serialize(new Account
    {
        Email = "james@example.com",
        Active = true,
        Roles = new List<string>()
        {
            "DEV",
            "OPS"
        }
    });
}

Sample 2: Using Ignore on ChoParquetRecordConfiguration

public class Account
{
    public string Email { get; set; }
    public bool Active { get; set; }
    public DateTime CreatedDate { get; set; }
    public IList<string> Roles { get; set; }
}

static void ExcludePropertyTest()
{
    byte[] payload = ChoParquetWriter.Serialize(new Account
    {
        Email = "james@example.com",
        Active = true,
        Roles = new List<string>()
        {
            "DEV",
            "OPS"
        }
    }, new ChoParquetRecordConfiguration<Account>().Ignore(f => f.Email));
}

Sample 3: Using IgnoreField on ChoParquetWriter

public class Account
{
    public string Email { get; set; }
    public bool Active { get; set; }
    public DateTime CreatedDate { get; set; }
    public IList<string> Roles { get; set; }
}

static void ExcludePropertyTest()
{
    using (var w = new ChoParquetWriter<Account>("emp.parquet")
        .IgnoreField(f => f.Email)
        )
    {
        w.Write(new Account
        {
            Email = "james@example.com",
            Active = true,
            Roles = new List<string>()
            {
                "DEV",
                "OPS"
            }
        });
    }
}

22.12. How to convert XML to Parquet?

This sample shows converting XML to Parquet.

string xml = @"<Employees xmlns=""http://company.com/schemas"">
  <Employee>
    <FirstName>name1</FirstName>
    <LastName>surname1</LastName>
  </Employee>
  <Employee>
    <FirstName>name2</FirstName>
    <LastName>surname2</LastName>
  </Employee>
  <Employee>
    <FirstName>name3</FirstName>
    <LastName>surname3</LastName>
  </Employee>
</Employees>";

using (var r = ChoXmlReader.LoadText(xml))
{
    using (var w = new ChoParquetWriter(@"c:\emp.parquet"))
        w.Write(r);
}

22.13. How to convert CSV to Parquet?

This sample shows converting CSV to Parquet.

string csv = @"Id, First Name
1, Tom
2, Mark";

using (var r = ChoCSVReader.LoadText(csv)
    .WithFirstLineHeader()
    .WithMaxScanRows(2)
    )
{
    using (var w = new ChoParquetWriter(@"emp.parquet"))
    {
        w.Write(r);
    }
}

22.14. How to convert JSON to Parquet?

This sample shows converting JSON to Parquet. (Note: inside a C# verbatim string, the quotes in the JSON must be escaped by doubling them.)

string json = @"[
  {
    ""Id"": 1,
    ""Name"": ""Mark""
  },
  {
    ""Id"": 2,
    ""Name"": ""Jason""
  }
]";

using (var r = ChoJSONReader.LoadText(json))
{
    using (var w = new ChoParquetWriter(@"emp.parquet"))
    {
        w.Write(r);
    }
}

License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Cinchoo, United States. No biography provided.


Article Copyright 2020 by Cinchoo
Everything else Copyright © CodeProject, 1999-2020
