A Portable and Efficient Generic Parser for Flat Files


Andrew Rissing, first posted 19 Sep 2005, last updated 17 Jul 2012

CPOL

GenericParser is a C# implementation of a parser for delimited and fixed width format files.

Download version 1.1.3 binaries for .NET 2.0 (281.7 KB)
Download version 1.1.3 source for .NET 2.0 (904 KB)

Introduction
As developers, we are often faced with converting data from one format to another. For a project at work, I
needed a portable solution that was efficient, had minimal external requirements, and parsed both delimited and
fixed-width data. As shown below, GenericParser is a good replacement for any Microsoft-provided solution and
offers some unique functionality. The code is well organized and easy to follow, so it can be modified as
necessary.
Note: The project was built using Visual Studio 2010, but the code is designed for .NET 2.0.

Definitions
Delimited data: Data whose columns are separated by a specific character (e.g., CSV, Comma Separated Values).
Fixed-width data: Data whose columns are each a set number of characters wide.
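To make the two definitions concrete, here is a minimal, self-contained sketch that splits a single row in each format. It does not use GenericParser; the class and method names (FlatFileFormats, SplitDelimited, SplitFixedWidth) are hypothetical and exist only for illustration, and trimming the padding from fixed-width columns is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper for illustration only; not part of GenericParser.
public static class FlatFileFormats
{
    // Delimited: columns are separated by a single delimiter character.
    public static string[] SplitDelimited(string line, char delimiter)
    {
        return line.Split(delimiter);
    }

    // Fixed-width: each column occupies a set number of characters.
    public static string[] SplitFixedWidth(string line, int[] widths)
    {
        string[] columns = new string[widths.Length];
        int position = 0;
        for (int i = 0; i < widths.Length; i++)
        {
            // Guard against rows that are shorter than the declared widths.
            int start = Math.Min(position, line.Length);
            int available = Math.Max(0, Math.Min(widths[i], line.Length - position));

            // Trailing padding is trimmed here (an assumption of this sketch).
            columns[i] = line.Substring(start, available).TrimEnd();
            position += widths[i];
        }
        return columns;
    }
}
```

For example, SplitDelimited("1,Anna,Active", ',') yields three columns, and SplitFixedWidth("1    Anna ", new int[] { 5, 5 }) yields "1" and "Anna". Note that this naive delimited split does not handle text qualifiers or escape characters.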

Features
GenericParser and the derived GenericParserAdapter provide the following features:
- Efficient (see Benchmarking below for more details)
    - Time: approximately 3 to 10 times faster than any Microsoft-provided solution
    - Memory: approximately equal to or less than any Microsoft-provided solution
- Supports delimited and fixed-width formats
- Supports a custom delimiter character (single character only)
- Supports comment rows (single character marker)
- Supports escape characters (single character only)
- Supports a custom text qualifier to allow column/row delimiters to be ignored (e.g., multi-line data)
- Supports escaped text qualifiers by doubling them up
- Supports ignoring/including rows that contain no characters
- Supports a header row
- Supports the ability to dynamically add more columns to match the data
- Supports the ability to enforce the number of columns to a specific number
- Supports the ability to enforce the number of columns based on the first row
- Supports trimming the strings of a column
- Supports stripping off control characters
- Supports reusing the same instance of the parser for different data sources
- Supports TextReader and String (the file location) as data sources
- Supports limiting the maximum number of rows to read
- Supports customizing the size of the internal buffer
- Supports skipping rows at the beginning of the data (after the header row)
- Supports XML configuration, which can be loaded/saved in numerous formats
- Supports access to data via column name (when a header row is supplied)
- Supports Unicode encoding
- GenericParserAdapter supports skipping rows at the end of the data
- GenericParserAdapter supports adding a line number to each row of output
- GenericParserAdapter supports the following outputs: XML, DataTable, and DataSet
- Thorough unit testing: 91.94% code coverage (tests supplied in the source download)
- Thorough XML documentation in code (including a .chm help file in the binary/source downloads)
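The text qualifier features listed above are easiest to see in a small example. The following is a minimal sketch of how a qualifier lets a delimiter appear inside a column, and how a doubled qualifier stands for a literal one. It is not GenericParser's implementation; the class and method names are hypothetical, and it handles a single row only (no multi-line data, comment rows, or escape characters).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch for illustration only; not GenericParser's actual code.
public static class QualifiedLineSketch
{
    public static List<string> ParseLine(string line, char delimiter, char qualifier)
    {
        List<string> columns = new List<string>();
        StringBuilder current = new StringBuilder();
        bool insideQualifier = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (insideQualifier)
            {
                if (c != qualifier)
                {
                    current.Append(c);          // delimiters are ordinary text here
                }
                else if (i + 1 < line.Length && line[i + 1] == qualifier)
                {
                    current.Append(qualifier);  // doubled qualifier means a literal one
                    i++;
                }
                else
                {
                    insideQualifier = false;    // closing qualifier
                }
            }
            else if (c == qualifier && current.Length == 0)
            {
                insideQualifier = true;         // opening qualifier at the start of a column
            }
            else if (c == delimiter)
            {
                columns.Add(current.ToString());
                current.Length = 0;             // start the next column
            }
            else
            {
                current.Append(c);
            }
        }

        columns.Add(current.ToString());        // last column has no trailing delimiter
        return columns;
    }
}
```

Given the row 1,"Smith, John","He said ""hi""" with ',' as the delimiter and '"' as the qualifier, this sketch returns three columns: 1, then Smith, John, then He said "hi".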

Benchmarking
To benchmark GenericParser, I chose to compare it to:
- Microsoft's Text Driver
- Microsoft's Text Field Parser
- Sebastien Lorion's CsvReader 3.7 (CSV only)
- GenericParser 1.0
To get a realistic data source for benchmarking, I took 10 rows of data from the Northwind database and
replicated them into successively larger sets of CSV and fixed-width data. Using
System.Diagnostics.Stopwatch to time each run (reported below as CPU usage), I executed each benchmark 10 times
and averaged the results to minimize instrumentation error. For memory usage, I used Visual Studio
2010's memory profiling and executed each benchmark only once.
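The timing approach described above can be sketched roughly as follows. This is not the benchmarking code from the source download; the class and method names are hypothetical, and the non-generic Action delegate assumes .NET 3.5 or later (it is not available in .NET 2.0).

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical sketch for illustration only; not the article's benchmark code.
public static class BenchmarkSketch
{
    // Repeats the sample rows until the requested number of rows is reached.
    public static string Replicate(string[] sampleRows, int totalRows)
    {
        StringBuilder data = new StringBuilder();
        for (int i = 0; i < totalRows; i++)
        {
            data.Append(sampleRows[i % sampleRows.Length]).Append("\r\n");
        }
        return data.ToString();
    }

    // Runs the action the given number of times and returns the mean
    // elapsed time in milliseconds, as measured by Stopwatch.
    public static double AverageMilliseconds(Action action, int runs)
    {
        double total = 0;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            action();
            watch.Stop();
            total += watch.Elapsed.TotalMilliseconds;
        }
        return total / runs;
    }
}
```

A caller would build a data set with Replicate(rows, 10000) and pass the parsing routine under test to AverageMilliseconds with runs set to 10. Keep in mind that Stopwatch reports elapsed wall-clock time, so other activity on the machine affects the numbers.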
I've tried to generate tests that exercise the code equally for each solution. As a caveat, these tests do not
cover every possible scenario, so your mileage may vary. Please feel free to use my code as a basis, or create
your own tests, to compare the code before you draw any conclusions.
For example, the tests below did not take escaped characters into account. GenericParser 1.0 allocated an
additional buffer for escaped characters, which essentially doubled its memory requirements. GenericParser
1.1 reuses the existing buffer to unescape the column of data. You wouldn't see this benefit unless you
specifically geared your tests to account for it.
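To illustrate the general idea of unescaping within an existing buffer rather than allocating a second one, here is a minimal sketch. It is not the code GenericParser 1.1 uses; the names are hypothetical, and the sketch assumes an escape character simply causes the next character to be taken literally.

```csharp
using System;

// Hypothetical sketch for illustration only; not GenericParser's actual code.
public static class UnescapeSketch
{
    // Removes escape characters from buffer[start .. start + length) in place
    // and returns the new length of the column. No second buffer is allocated.
    public static int UnescapeInPlace(char[] buffer, int start, int length, char escape)
    {
        int write = start;
        int end = start + length;
        for (int read = start; read < end; read++)
        {
            if (buffer[read] == escape && read + 1 < end)
            {
                read++; // skip the escape character, keep the character it protects
            }

            // The write index never passes the read index, so unread data is never overwritten.
            buffer[write++] = buffer[read];
        }
        return write - start;
    }
}
```

For a buffer holding a\,b with a backslash as the escape character, the call returns 3 and the first three characters of the buffer become a,b. An escape character at the very end of the column, with nothing after it, is kept as-is in this sketch.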
Because I know someone will comment about this: I am aware of FileHelpers, but I believe it fits into a
different category that doesn't map easily onto the solutions above for comparison. FileHelpers relies on a
declarative definition of the file schema through attributes on concrete classes. My solution depends on
defining the schema through properties or XML. Feel free to compare them if they fit your problem space.

CPU Usage

Memory Usage

Note: Because memory profiling was generating .vsp files upwards of 2 GB and the memory usage seemed
fairly stable, I only ran memory profiling for 10 to 10,000 rows of data.

Conclusion
As the charts show, GenericParser meets or exceeds anything Microsoft has put together in all areas.
Furthermore, version 1.1 considerably outperforms version 1.0, especially considering the bug
fixes and new features added.
The graphs also show that Sebastien Lorion's CsvReader is definitely the top contender for parsing delimited
files. So, if you only need to parse delimited files, I would highly recommend checking out his library.
Otherwise, I find my library to be an effective implementation for parsing both formats.
In the source download, you can find all of my performance tests and results, including an Excel 2010 workbook
that collects all of the raw data for charting purposes.

Using the Code


The code itself mimics most readers found within the .NET Framework, and its usage follows four basic steps:
1. Set the data source through either the constructor or the SetDataSource() method.
2. Configure the parser for the data source's format, either through properties or by loading an XML config
   file via the Load() method.
3. Call the Read() method and access the columns of data underneath, or for the GenericParserAdapter, call
   GetXml(), GetDataTable(), or GetDataSet() to extract data.
4. Call Close() or Dispose().

using System.Data;      // DataSet
using GenericParsing;   // GenericParser types (namespace assumed; check the library download)

DataSet dsResult;

// Using an XML config file.
using (GenericParserAdapter parser = new GenericParserAdapter("MyData.txt"))
{
    parser.Load("MyData.xml");
    dsResult = parser.GetDataSet();
}

// Or... programmatically setting up the parser for TSV.
string strID, strName, strStatus;
using (GenericParser parser = new GenericParser())
{
    parser.SetDataSource("MyData.txt");

    parser.ColumnDelimiter = '\t';      // a char? as of version 1.1 (see History)
    parser.FirstRowHasHeader = true;    // needed for access by column name
    parser.SkipStartingDataRows = 10;
    parser.MaxBufferSize = 4096;
    parser.MaxRows = 500;
    parser.TextQualifier = '"';

    while (parser.Read())
    {
        strID = parser["ID"];
        strName = parser["Name"];
        strStatus = parser["Status"];

        // Your code here...
    }
}

// Or... programmatically setting up the parser for fixed-width.
using (GenericParser parser = new GenericParser())
{
    parser.SetDataSource("MyData.txt");

    parser.ColumnWidths = new int[] { 10, 10, 10, 10 };
    parser.FirstRowHasHeader = true;    // added: column-name access requires a header row
    parser.SkipStartingDataRows = 10;
    parser.MaxRows = 500;

    while (parser.Read())
    {
        strID = parser["ID"];
        strName = parser["Name"];
        strStatus = parser["Status"];

        // Your code here...
    }
}

Acknowledgements
While GenericParser is not a derivative of Sebastien Lorion's CsvReader, I did borrow some concepts from the
functionality his CsvReader provides.

Tools Used
Visual Studio 2010 (including unit testing/profiling)
.NET Framework 2.0
HTML Help Workshop (documentation)
Sandcastle (documentation)
Sandcastle Help File Builder (documentation)
Microsoft Excel 2010

History
September 17, 2005 - 1.0 - First release
June 20, 2010 - 1.1
    New features:
    - Supports ignoring/including blank rows of data (no characters found in row)
    - Supports the ability to enforce the number of columns based on the first row
    - Supports stripping off control characters
    - GenericParserAdapter supports skipping rows at the end of the data
    - Reduced memory overhead when using escaped characters
    - Support for specifying the data's encoding
    Bug fixes:
    - Fixed a bug with parsing data with a header and no data
    - Fixed a bug in not handling text qualifiers/escape/comment characters consistently
    - Fixed a bug in reading a file across a high latency network
    - Fixed a bug with text qualifiers being interpreted in the middle of the column (only works if at
      the start and end of a column)
    - Fixed a bug with skipping row ends at the very end of the buffer
    Breaking changes:
    - Fixed-width parsing will no longer take text qualifiers or escape characters into account
    - The following properties have been converted to a char?:
        - ColumnDelimiter
        - CommentCharacter
        - EscapeCharacter
        - TextQualifier
    - RowDelimiter has been removed, and the code automatically handles looking for '\n' or '\r'
      to indicate a new row (assuming '\r' is not a column delimiter). If one of these characters is
      found, it will skip the paired '\n' or '\r' (assuming '\r' is not a column delimiter).
    - SkipDataRows has been renamed to SkipStartingDataRows
    - The FixedWidth property has been replaced by a property called TextFieldType, which is
      of the enum type FieldType
    - Due to the changes in the properties listed above, the XML produced by version 1.0 will not be
      100% compatible with version 1.1
    - Read() will return true if it parses a header row and no data
    - ParserSetupException has been replaced by InvalidOperationException
    - Reworked the messages supplied in the exceptions to be more descriptive
June 26, 2010 - 1.1.1
    New features:
    - Reworked benchmarking to be more representative of real world data and switched over
      testing to not use DataSets
    - Slightly more efficient loading of configuration files
February 5, 2012 - 1.1.2
    Bug fixes:
    - Fixed an issue where an exception was thrown claiming MaxBufferSize was too small
      when it was in fact large enough (reported by uberblue)
March 16, 2012 - 1.1.3
    Bug fixes:
    - Fixed an issue where control characters were being removed accidentally (reported by John
      Voelk)
    - Fixed an issue where data at the end of the stream wasn't extracted properly (introduced in
      version 1.1.2)

License
This article, along with any associated source code and files, is licensed under The Code Project Open License
(CPOL).


About the Author

Andrew Rissing
Software Developer (Senior)
United States

Since beginning my profession as a software developer, I've learned one important fact: change is inevitable.
Requirements change, code changes, and life changes.
So... if you're not moving forward, you're moving backwards.


Article Copyright 2005 by Andrew Rissing


Everything else Copyright CodeProject, 1999-2014
