
Loading documents and other file data into the Oracle Database

Roger Ford
PRODUCT MANAGER

Introduction

With recent versions of Oracle, there is more and more emphasis on handling
unstructured text, as well as the traditional fully structured data such as numbers, dates
and short text fields.

Here I want to consider various methods of getting "documents" into the Oracle
database.

None of these methods are completely trivial, and the choice of which to use will
depend on various factors.

What are the characteristics of a "document"? In general, I'm referring to text which may
be too long to fit into a PL/SQL VARCHAR2 variable. These variables are limited to
32767 bytes. The document most likely originates from a file on a file system, and it
may be in a character set which differs from the database character set. In many cases,
documents will have metadata (such as author and creation date) that we want to load
at the same time.
We should distinguish between four types of data:

 Metadata. This is short textual, numeric or date data which describes the main
document. For example: document id, author, date created.

 Semi-structured Data. These are textual strings in a specific format, such as
JSON or XML. We may be able to make assertions about such data, such as
"will always start with '<'", or "does not contain any newline characters".

 Full Text. This is text of arbitrary length. It can and usually will contain all
sorts of characters including newlines and quotes. It may consist of multibyte
characters in UTF-8 or UTF-16 (or other) character sets.

 Binary Data. Arbitrary length data which is not character encoded. May
contain any bytes or combinations of bytes including nulls and end-of-file
markers.
Loading a mixture of these types of data can be problematic. If we want to load
metadata and full text from a single file, then we need some method of determining
where the full text starts and finishes. The easiest way to do that is to enclose the full
text in some sort of quotes. But then what if the full text contains those quote
characters? So it's usually easier to arrange for each full text document to exist in a
separate file, referenced from the main loading method.

It would be impossible to load multiple binary data records from a single file, unless the
binary data was first encoded using a scheme such as BASE64, since by its nature you
can never specify a delimiter string that could not appear in the binary data.
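To illustrate why encoding helps, here is a minimal Python sketch (not part of any Oracle tooling) showing that Base64 makes a newline a safe record delimiter, since Base64 output contains only characters from a fixed alphabet and never a newline:

```python
import base64

def encode_records(records):
    # Base64 output contains only [A-Za-z0-9+/=], so a newline can
    # safely delimit records that themselves contain any bytes at all.
    return b"\n".join(base64.b64encode(r) for r in records)

def decode_records(blob):
    # Split on the delimiter and reverse the encoding.
    return [base64.b64decode(line) for line in blob.split(b"\n")]

# Binary records containing newlines and NUL bytes - unloadable as-is.
records = [b"\x00binary\nwith\x1anewline", b"\xffmore\x00bytes"]
encoded = encode_records(records)
```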

For these examples, we're going to consider three use-cases:

1. A set of JSON documents. These documents reside in a single file, one per
line.

2. A set of files, which contain one document per file. Each file/document has
metadata associated with it in a "summary" file.

3. As above, but one of the files contains multibyte (UTF-8) data.

Client / Server considerations - where is your source data?

If your client and server are the same machine, then any of these methods should work.
However, if you are connecting to your database from a remote client over SQL*Net,
you might first decide whether you want to load files from the client (the machine from
which you connect) or the server (the machine where the database resides).

Load Method       Data on Client   Data on Server

SQL*Plus          Yes              No
SQL*Loader        Yes              No
External Tables   No               Yes
BFILES            No               Yes
Method 1: Using SQL*Plus Scripts

Let's consider we have a table with several short columns and a CLOB column:

CREATE TABLE my_table
  ( id      NUMBER,
    author  VARCHAR2(30),
    created DATE,
    text    CLOB
  );

If I just want to do some "quick and dirty" loading of this table, I can just create a SQL
script file and run that in SQL*Plus. For example:

INSERT INTO my_table VALUES ( 1, 'Roger', SYSDATE, 'the quick brown fox
jumps over the lazy dog' );

INSERT INTO my_table VALUES ( 2, 'John', SYSDATE, 'the angry alligator ate
the antelope');

COMMIT;

Simple and quick, and fairly easy to generate from source files using a language such
as Perl or Python. But do this with any quantity of data and you'll rapidly hit limitations:

 Any single quotes in the text need to be doubled up:

   INSERT INTO my_table VALUES ( 3, 'Roger', SYSDATE, 'my best
   friend''s brother said ''hi''');

 Any ampersand ("&") characters in the script will cause problems unless you've
put SET DEFINE OFF somewhere in the script

 An input line to SQL*Plus cannot be longer than 2499 characters. If you want
to include more than this you'll need to concatenate strings:

   INSERT INTO my_table VALUES ( 4, 'Roger', SYSDATE, 'this is a long
   line that continues over more than one line'
   || ' in the source code file');

 Even then, that concatenation method will only allow you to assemble up to
4000 characters. To continue longer than that would require you to create
temporary lobs, and use DBMS_LOB.APPEND for each chunk.

 It's slower than most other methods, which will be an issue if you're loading
large quantities of data.


Let's take our use case 1 - the JSON documents. We could cut-and-paste from the file
into a hand-coded script, but it's probably better to do it using a text-processing
language such as Perl. The attached file jsonToSql.pl replaces any single quotes with
two single quotes, and breaks each line into chunks no longer than 2400 characters
(allowing some space for syntactic overheads).
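The Perl script isn't reproduced here, but the core transformation might look like the following Python sketch. The single-column jsondocs table and the chunk size are illustrative; note that the raw text is chunked before escaping, so a doubled quote is never split across two literals:

```python
def to_sqlplus_insert(line, table="jsondocs", chunk_size=1200):
    """Build one SQL*Plus-safe INSERT statement for a JSON document.

    1200 raw characters escape to at most 2400, keeping every emitted
    line comfortably under the 2499-character SQL*Plus line limit."""
    chunks = [line[i:i + chunk_size]
              for i in range(0, len(line), chunk_size)] or [""]
    # Escape each chunk independently, then join with the SQL
    # concatenation operator, one literal per line.
    literals = ["'%s'" % c.replace("'", "''") for c in chunks]
    return "INSERT INTO %s VALUES ( %s );" % (table, "\n|| ".join(literals))
```

Running each input line through this function and appending a final COMMIT gives a script equivalent in spirit to the output of jsonToSql.pl.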

Assuming we have perl installed and available, we can call it from the command line
(Windows or Unix) using
perl jsonToSql.pl jsonsource.txt > json.sql

And we can then load the resulting file json.sql (after creating the necessary table - see
above) using
sqlplus user/pass @json.sql

SQL*Plus isn't very good for loading from separate files, so we'll leave scenarios two
and three out for this method.
Multibyte Considerations
SQL*Plus isn't very good at dealing with multibyte / UTF-8 characters. To get correct
(and viewable!) output you will probably want to use SQL Developer to run the
scripts.

Method 2: Using SQL*Loader

SQL*Loader is a utility from the very early days of Oracle. It's reasonably simple to use,
but the vast number of options can make it intimidating for the uninitiated.

SQL*Loader loads files from the client. It has no access to files on the server.

It is normal to provide SQL*Loader with two inputs: A data file containing the data to be
loaded to the database, and a control file which specifies how the data file is laid out,
and which columns the data should be written to (it's possible to include the data inside
the control file, but we'll stick with separate files for this example).

Use case 1: JSON documents in a single file


For this use case we just need a simple table with a single CLOB column:

CREATE TABLE jsondocs ( jsontext CLOB );

This is fairly straightforward. We create a control file called jsonloader.ctl which
consists of:

LOAD DATA
INFILE 'jsonsource.txt'
INTO TABLE jsondocs
FIELDS
( jsontext CHAR(32000) )

Then we load it using the command:

sqlldr roger/roger control=jsonloader.ctl log=loader.log bad=loader.bad


The "log" and "bad" arguments aren't required, but can be most useful for debugging.

Note that we've limited the length of the jsontext field to 32000 characters, but you can
increase that size if you need to.
Use case 2: Loading from separate files

A quick reminder about our second scenario. We have a table which we'll define as:
CREATE TABLE my_table
  ( id      NUMBER,
    author  VARCHAR2(30),
    created DATE,
    text    CLOB
  );

and into this table we want to load the files file1.txt and file2.txt

file1.txt:
The quick brown fox jumps over the lazy dog

file2.txt:

The angry alligator ate the antelope.

These files only have one line in each, but they could just as easily have multiple lines
per file. One file is loaded into each database record.

If we want to use SQL*Loader to load CLOB or BLOB data from separate files, then the
filenames for those files must be included in the data file. We use the "FILLER" syntax
to specify that the filename isn't to be loaded into any column, and then the LOBFILE
syntax to load the text into a column.
So here's our control file. We can see that it's similar to the previous control file but the
"FIELDS" section is more complicated, to allow for multiple metadata fields, and the
reference to the external file to be loaded.

loader.ctl:

LOAD DATA
INFILE 'loader.txt'
INTO TABLE my_table
FIELDS TERMINATED BY ','
( id      CHAR(10),
  author  CHAR(30),
  created DATE "YYYY-MM-DD",
  fname   FILLER CHAR(80),
  text    LOBFILE(fname) TERMINATED BY EOF
)

"fname" is an arbitrary label, we could have used "fred" and it would have worked
exactly the same. It just needs to be the same on the two lines where it is used.

Note that the control file includes the name of our data file, loader.txt. loader.txt contains
the metadata associated with each file, and the filename.

loader.txt:
1,John Smith,2015-04-29,file1.txt
2,Pete Jones,2013-01-31,file2.txt
So we're loading two files, along with the metadata for each.

Now we can run SQL*Loader from the command line as follows:


sqlldr roger/roger control=loader.ctl log=loader.log bad=loader.bad

(again, the "log" and "bad" arguments aren't required, but can be most useful for
debugging).

Then we can use SQL*Plus or SQL Developer to examine our loaded table:

SQL> select * from my_table;

        ID AUTHOR           CREATED   TEXT
---------- ---------------- --------- ---------------------------------------
         1 John Smith       29-APR-15 The quick brown fox jumps over the lazy dog
         2 Pete Jones       31-JAN-13 The angry alligator ate the antelope.

Common mistakes:
Do not be tempted to add spacing to format the loader.txt file nicely. If you use

1, John Smith, 2015-04-29, file1.txt

Then not only will the author be loaded with a space in front of "John", but you'll see an
obscure error:

SQL*Loader-509: System error: The device does not recognize the command.

SQL*Loader-502: unable to open data file ' file1.txt' for field TEXT table
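One way to avoid the stray-space trap altogether is to generate loader.txt programmatically rather than by hand. A minimal Python sketch (the csv module writes nothing after the commas); the row values are the ones used above:

```python
import csv

def write_loader_file(path, rows):
    # csv.writer emits exactly "field,field,..." with no padding after
    # the commas, so filenames arrive at SQL*Loader without a leading
    # space. Fields here must not themselves contain commas.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerows(rows)

write_loader_file("loader.txt", [
    (1, "John Smith", "2015-04-29", "file1.txt"),
    (2, "Pete Jones", "2013-01-31", "file2.txt"),
])
```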
Multibyte Considerations
So what if your file contains multi-byte characters? Perhaps some Chinese text mixed
up with English?
You will simply need to specify the character set of the file in the LOBFILE directive. For
example, I have a file encoded in UTF-8, so I modify the control file to read:
LOAD DATA
INFILE 'loader.txt'
INTO TABLE my_table
FIELDS TERMINATED BY ','
( id      CHAR(10),
  author  CHAR(30),
  created DATE "YYYY-MM-DD",
  fname   FILLER CHAR(80),
  text    LOBFILE(fname CHARACTERSET UTF8) TERMINATED BY EOF
)

Loading Binary Files with SQL*Loader


The above method (without the CHARACTERSET addition) will work exactly the same if
our column is a binary BLOB column rather than a character CLOB. There is no
difference in the loading mechanism - all that changes is that the database does not
attempt to do any kind of character set conversion on the data as it is loaded into the
database.

Method 3: Using BFILES and PL/SQL


A BFILE is a non-standard datatype which provides read-only access to data held on an
external file system.

NOTE: This method will only work with files on the server. BFILES have no access to
files on the client machine.
It's possible to work directly with BFILEs, but it's perhaps more common to use them as
a kind of temporary staging area to allow you to read data into database columns.

BFILEs require a DIRECTORY - which is an internal database pointer to a


folder/directory on the file system. To create a directory you must have CREATE ANY
DIRECTORY privilege - or you can have a DBA create the directory and grant you
access to it. This provides a layer of protection to prevent users from reading or writing
files on the database server.

The supplied package DBMS_LOB provides a number of utility functions,
including LOADFROMFILE.

On the face of it, this might seem all we need, but in fact there's some preparation work
we need to do.

LOADFROMFILE will only load into an open clob. So we need to prepare a lob locator,
open it, call LOADFROMFILE, then close it afterwards.

The following example shows a procedure which writes to our previously-created
my_table.

Firstly it creates a BFILE from the supplied directory and filename, then it inserts the
metadata into my_table, fetching back the lob locator for the newly created row. It then
uses LOADFROMFILE to write into that lob:
CREATE SEQUENCE loadtest_seq;

CREATE TABLE my_table( id NUMBER, author VARCHAR2(100), load_date DATE, text CLOB );

CREATE OR REPLACE PROCEDURE load_a_file
  ( dir_name  VARCHAR2,
    file_name VARCHAR2,
    author    VARCHAR2
  ) IS
  l_bfile BFILE;
  l_clob  CLOB;
BEGIN
  l_bfile := BFILENAME(dir_name, file_name);
  IF (dbms_lob.fileexists(l_bfile) = 1) THEN
    dbms_output.put_line('File Exists');
    INSERT INTO my_table
      VALUES ( loadtest_seq.NEXTVAL,
               author,
               sysdate,
               EMPTY_CLOB()
             ) RETURNING text INTO l_clob;
    dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY );
    dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength(l_bfile) );
    dbms_lob.fileclose( l_bfile );
    COMMIT;
  ELSE
    dbms_output.put_line('File does not exist');
  END IF;
END;
/

To use this procedure, we might do:

CREATE OR REPLACE DIRECTORY my_directory AS '/scratch/raford/Projects/loadingFiles';
-- remember that needs CREATE ANY DIRECTORY priv.

EXECUTE load_a_file( 'MY_DIRECTORY', 'file1.txt', 'Roger');

Note that in this case we're using SYSDATE for the creation date, which varies slightly
from previous examples. We could of course include create_date as another argument
to our procedure.

Common Mistakes:
Remember that the file must be found on the database server. You cannot use this
method to load a file which is located on a local machine if you are using SQL*Plus or
SQL Developer remotely.

The directory name is an identifier, similar to a username, table name or column name.
As such, if you specify it without double quotes, it will be automatically upper-cased. But
BFILENAME expects the directory name to be passed in as a string, so you
must specify it in upper-case if it was originally created without quotes.
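For example (a sketch; the directory path is illustrative):

```sql
-- Created without quotes, so the identifier is stored as LOAD_DIR:
CREATE OR REPLACE DIRECTORY load_dir AS '/tmp/docs';

-- Works: the string matches the stored (upper-case) identifier
EXECUTE load_a_file( 'LOAD_DIR', 'file1.txt', 'Roger');

-- Fails to find the directory: the lower-case string matches nothing
EXECUTE load_a_file( 'load_dir', 'file1.txt', 'Roger');
```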

No character set conversion is applied by LOADFROMFILE. This generally
means the above procedure will not work if your data is not simple single-byte
characters. Often you'll find the text is inexplicably truncated, because of the difference
between the number of characters and the number of bytes. See the next section for a
solution.

Multibyte Considerations

If we want to load multibyte character data, we must
forget DBMS_LOB.LOADFROMFILE and copy the contents of the BFILE into the CLOB
using a more "manual" technique. This involves reading chunks into a RAW buffer, and
then using UTL_RAW.CAST_TO_VARCHAR2 to allow us to write the RAW data into
the CLOB value:
CREATE OR REPLACE PROCEDURE load_multibyte_file
  ( dir_name  VARCHAR2,
    file_name VARCHAR2,
    author    VARCHAR2
  ) IS
  l_bfile  BFILE;
  l_clob   CLOB;
  buf      RAW(32767);
  vc       VARCHAR2(32767);
  maxsize  INTEGER := 8192;  -- a char can take up to 4 bytes, so this is
                             -- the maximum safe length in chars
  amt      INTEGER;
  amtvc    INTEGER;
  v_offset INTEGER := 1;
BEGIN
  l_bfile := BFILENAME(dir_name, file_name);
  IF (dbms_lob.fileexists(l_bfile) = 1) THEN
    dbms_output.put_line('File Exists');
    INSERT INTO my_table
      VALUES ( loadtest_seq.NEXTVAL,
               author,
               sysdate,
               EMPTY_CLOB()
             ) RETURNING text INTO l_clob;
    amt := dbms_lob.getlength( l_bfile );
    dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY );
    WHILE amt > 0 LOOP
      IF amt > maxsize THEN
        amt := maxsize;
      END IF;
      dbms_lob.read( l_bfile, amt, v_offset, buf );
      vc := utl_raw.cast_to_varchar2(buf);
      amtvc := LENGTH(vc);
      dbms_lob.writeappend( l_clob, amtvc, vc );
      v_offset := v_offset + amt;
      amt := dbms_lob.getlength( l_bfile ) - v_offset + 1;
    END LOOP;
    dbms_lob.fileclose( l_bfile );
    COMMIT;
  ELSE
    dbms_output.put_line('File does not exist');
  END IF;
END;
/
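To invoke this version we follow the same pattern as before; the filename here is just an example:

```sql
-- assumes MY_DIRECTORY was created as shown earlier, and that
-- utf8doc.txt is a UTF-8 encoded file in that directory
EXECUTE load_multibyte_file( 'MY_DIRECTORY', 'utf8doc.txt', 'Roger');
```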

Method 4: Using External Tables

External tables appear to the database much like normal tables, but their data resides
on the file system. The layout of the external files is described using a variation on
SQL*Loader syntax.

However, SQL*Loader loads from the client file system, whereas external tables expect
their data files to exist on the server file system.

External tables are particularly useful for loading large numbers of records from a single
file, so each record appears in its own row in the table.

CREATE TABLE my_ext_table ( id NUMBER, author VARCHAR2(30), created DATE, text CLOB )
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MY_DIRECTORY
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY 0x'0A'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    ( id      CHAR(10),
      author  CHAR(30),
      created DATE "YYYY-MM-DD",
      text    CHAR(131071)
    )
  )
  LOCATION ( MY_DIRECTORY:'external.dmp' )
);

And here's our data file external.dmp:


1,Roger,2015-08-08,The quick brown fox jumps over the lazy dog
2,John,2012-01-01,"The angry alligator, acting alone, ate the antelope"

Note the second line requires quotes around the full text string, as otherwise the field
would be terminated at the comma. Of course we meet the usual problem that the full
text string may contain quotes. We can counter that by doubling them:

2,John,2012-01-01,"The angry alligator, acting ""alone"", ate the antelope"

But we still have the issue that the full text might contain the record delimiter character
(0x'0A', or newline). So as with SQL*Loader, we can specify a separate file for each
document. External tables do not support the FILLER directive, so instead we have to
use a COLUMN TRANSFORMS line to specify that our fname field contains the name
of our external file:

DROP TABLE my_ext_table2;

CREATE TABLE my_ext_table2 ( id NUMBER, author VARCHAR2(30), created DATE, text CLOB )
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MY_DIRECTORY
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY 0x'0A'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    ( id      CHAR(10),
      author  CHAR(30),
      created DATE "YYYY-MM-DD",
      fname   CHAR(100)
    )
    COLUMN TRANSFORMS ( text FROM LOBFILE(fname) FROM (MY_DIRECTORY) )
  )
  LOCATION ( 'loader.txt' )
);

select * from my_ext_table2;

Note: the "FROM (MY_DIRECTORY)" clause is not necessary here since we've already
specified that as the default directory.
However if the loader.txt file and the individual clob files were in different locations, it
would be needed.

Once your data is in an external table, you can either leave it there and query it in
place (note that external tables are read-only, so DML is not possible), or you can use
the external table as a "staging" table to get the data into a normal table. You can simply do

CREATE TABLE normaltable AS SELECT * FROM externaltable;

to create a new normal table. And of course you can similarly do:

INSERT INTO normaltable SELECT * FROM externaltable;

Conclusion
There are several ways to load documents into LOB columns in Oracle database tables.
The choice of method will depend on the nature and location of your data, the
performance required for loading, and perhaps the loading method you're already most
familiar with.
