Oracle Database
Roger Ford
PRODUCT MANAGER
Introduction
With recent versions of Oracle, there is more and more emphasis on handling
unstructured text, as well as the traditional fully structured data such as numbers, dates
and short text fields.
Here I want to consider various methods of getting "documents" into the Oracle
database.
None of these methods are completely trivial, and the choice of which to use will
depend on various factors.
What are the characteristics of a "document"? In general, I'm referring to text which may
be too long to fit into a PL/SQL VARCHAR2 variable. These variables are limited to
32767 bytes. The document most likely originates from a file on a file system, and it
may be in a character set which differs from the database character set. In many cases,
documents will have metadata (such as author and creation date) that we want to load
at the same time.
We should distinguish between four types of data:
Metadata. This is short textual, numeric or date data which describes the main
document. For example: document id, author, date created.
JSON or XML. We may be able to make assertions about such data, such as
"will always start with '<'", or "does not contain any newline characters".
Full Text. This is text of arbitrary length. It can, and usually will, contain all
sorts of characters, including newlines and quotes, and it may contain
multibyte characters.
Binary. Non-textual data such as images, PDFs or Word documents, typically
destined for BLOB columns.
Loading a mixture of these types of data can be problematic. If we want to load
metadata and full text from a single file, then we need some method of determining
where the full text starts and finishes. The easiest way to do that is to enclose the full
text in some sort of quotes. But then what if the full text contains those quote
characters? So it's usually easier to arrange for each full text document to exist in a
separate file, referenced from the main loading method.
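One standard answer to that embedded-quote problem is the CSV convention of doubling any quote character that appears inside a quoted field. A small Python sketch of the convention (the sample row is made up, not from the article):

```python
# Python's csv writer quotes any field containing the delimiter or the
# quote character, and doubles embedded quotes (doublequote=True is the
# default) - so commas and quotes survive a round-trip.
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(
    [1, "John Smith", 'He said "hello", then left']
)
print(buf.getvalue().strip())
# -> 1,John Smith,"He said ""hello"", then left"
```

Even with this convention, a newline inside the full text still breaks one-record-per-line formats, which is why the separate-file approach is usually simpler.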
It would be impossible to load multiple binary data records from a single file, unless the
binary data was first encoded using a scheme such as BASE64, since by its nature you
can never specify a delimiter string that could not appear in the binary data.
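To make that concrete, here's a short Python sketch showing why an encoding such as BASE64 sidesteps the delimiter problem: the encoded alphabet cannot contain a newline, so newline becomes a safe record separator.

```python
# BASE64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', so a newline
# can never occur inside an encoded record and is safe as a delimiter.
import base64

records = [b"\x00\x01binary\ndata\xff", b"more\x00bytes\n\n"]
encoded = "\n".join(base64.b64encode(r).decode("ascii") for r in records)
decoded = [base64.b64decode(chunk) for chunk in encoded.split("\n")]
assert decoded == records  # round-trips cleanly despite embedded newlines
```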
We will consider the following scenarios:
1. A set of JSON documents. These documents reside in a single file, one per
line.
2. A set of files, which contain one document per file. Each file/document has
associated metadata (such as author and creation date) that we want to load at
the same time.
If your client and server are the same machine, then any of these methods should work.
However, if you are connecting to your database from a remote client over SQL*Net,
you might first decide whether you want to load files from the client (the machine from
which you connect) or the server (the machine where the database resides).
Let's say we have a table with several short columns and a CLOB column:
CREATE TABLE my_table
(id NUMBER,
author VARCHAR2(30),
created DATE,
text CLOB
);
If I just want to do some "quick and dirty" loading of this table, I can just create a SQL
script file and run that in SQL*Plus. For example:
INSERT INTO my_table VALUES ( 1, 'Roger', SYSDATE, 'the quick brown fox jumps over the lazy dog' );
INSERT INTO my_table VALUES ( 2, 'John', SYSDATE, 'the angry alligator ate the antelope');
COMMIT;
Simple and quick, and fairly easy to generate from source files using a language such
as Perl or Python. But do this with any quantity of data and you'll rapidly hit limitations:
An input line to SQL*Plus cannot be longer than 2499 characters. If you want
longer text, you must split it across multiple lines and concatenate the strings.
Even then, that concatenation method will only allow you to assemble up to
4000 characters, the limit for a SQL string literal. To continue longer than that
would require you to create a temporary LOB and append to it in PL/SQL.
It's slower than most other methods, which will be an issue if you're loading
large amounts of data.
Assuming we have Perl installed and available, along with a conversion script
jsonToSql.pl that turns each JSON line into an INSERT statement, we can call it
from the command line (Windows or Unix) using
perl jsonToSql.pl jsonsource.txt > json.sql
And we can then load the resulting file json.sql (after creating the necessary table - see
above) using
sqlplus user/pass @json.sql
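The jsonToSql.pl script itself isn't shown here; a rough equivalent in Python (the single-column target and the exact escaping are illustrative assumptions, not the original script) could look like:

```python
# Rough, illustrative equivalent of jsonToSql.pl: wrap each line of a
# JSON source file in an INSERT statement, doubling any single quotes
# so the text is safe inside a SQL string literal.
import sys

def json_to_sql(lines):
    statements = []
    for line in lines:
        doc = line.rstrip("\n").replace("'", "''")  # '' escapes ' in SQL
        statements.append(f"INSERT INTO my_table (text) VALUES ('{doc}');")
    statements.append("COMMIT;")
    return "\n".join(statements)

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as src:
        print(json_to_sql(src))
```

Run as python jsonToSql.py jsonsource.txt > json.sql, mirroring the Perl invocation above.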
SQL*Plus isn't very good for loading from separate files, so we'll leave the
separate-file scenarios out for this method.
Multibyte Considerations
SQL*Plus isn't very good at dealing with multibyte / UTF-8 characters. To get useful
correct (and viewable!) output you will probably want to use SQL Developer to run the
scripts.
SQL*Loader
SQL*Loader is a utility from the very early days of Oracle. It's reasonably simple to use,
but the vast number of options can make it intimidating for the uninitiated.
SQL*Loader loads files from the client. It has no access to files on the server.
It is normal to provide SQL*Loader with two inputs: A data file containing the data to be
loaded to the database, and a control file which specifies how the data file is laid out,
and which columns the data should be written to (it's possible to include the data inside
the control file, but we'll stick with separate files for this example).
Use case 1: Loading JSON documents from a single file
A minimal control file might look like this (the target table, with a single CLOB
column called jsontext, is assumed):
LOAD DATA
INFILE 'jsonsource.txt'
APPEND INTO TABLE my_json_table
FIELDS
( jsontext CHAR(32000) )
Note that we've limited the length of the jsontext field to 32000 characters, but you can
increase that size if you need to.
Use case 2: Loading from separate files
A quick reminder about our second scenario. We have a table which we'll define as:
CREATE TABLE my_table
(id NUMBER,
author VARCHAR2(30),
created DATE,
text CLOB
);
and into this table we want to load the files file1.txt and file2.txt
file1.txt:
The quick brown fox jumps over the lazy dog
file2.txt:
The angry alligator ate the antelope
These files only have one line in each, but they could just as easily have multiple lines
per file. One file is loaded into each database record.
If we want to use SQL*Loader to load CLOB or BLOB data from separate files, then the
filenames for those files must be included in the data file. We use the "FILLER" syntax
to specify that the filename isn't to be loaded into any column, and then the LOBFILE
syntax to load the text into a column.
So here's our control file. We can see that it's similar to the previous control file but the
"FIELDS" section is more complicated, to allow for multiple metadata fields, and the
reference to the external file to be loaded.
loader.ctl:
LOAD DATA
INFILE 'loader.txt'
INTO TABLE my_table
FIELDS TERMINATED BY ','
( id CHAR(10),
author CHAR(30),
created DATE "YYYY-MM-DD",
fname FILLER CHAR(80),
text LOBFILE(fname) TERMINATED BY EOF
)
"fname" is an arbitrary label, we could have used "fred" and it would have worked
exactly the same. It just needs to be the same on the two lines where it is used.
Note that the control file includes the name of our data file, loader.txt. loader.txt contains
the metadata associated with each file, and the filename.
loader.txt:
1,John Smith,2015-04-29,file1.txt
2,Pete Jones,2013-01-31,file2.txt
So we're loading two files, along with the metadata for each.
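If you have many files to load, generating the data file programmatically avoids formatting mistakes. A minimal sketch in Python (the tuple layout simply mirrors loader.txt above):

```python
# Generate SQL*Loader data-file lines like loader.txt. Note that there
# are no spaces after the commas: SQL*Loader would treat a leading
# space as part of the field value.
def make_loader_lines(rows):
    return "\n".join(",".join(str(field) for field in row) for row in rows)

rows = [
    (1, "John Smith", "2015-04-29", "file1.txt"),
    (2, "Pete Jones", "2013-01-31", "file2.txt"),
]
print(make_loader_lines(rows))
```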
We run SQL*Loader from the command line:
sqlldr user/pass control=loader.ctl log=loader.log bad=loader.bad
(again, the "log" and "bad" arguments aren't required, but can be most useful for
debugging).
Then we can use SQL*Plus or SQL Developer to examine our loaded table:
1 John Smith 29-APR-15 The quick brown fox jumps over the lazy dog
Common mistakes:
Do not be tempted to add spacing to format the loader.txt file nicely. If you use,
for example:
1, John Smith, 2015-04-29, file1.txt
then not only will the author be loaded with a space in front of "John", but you'll
see an obscure error:
SQL*Loader-509: System error: The device does not recognize the command.
SQL*Loader-502: unable to open data file ' file1.txt' for field TEXT table MY_TABLE
Multibyte Considerations
So what if your file contains multi-byte characters? Perhaps some Chinese text mixed
up with English?
You will simply need to specify the character set of the file in the LOBFILE directive. For
example, I have a file encoded in UTF-8, so I modify the control file to read:
LOAD DATA
INFILE 'loader.txt'
INTO TABLE my_table
FIELDS TERMINATED BY ','
( id CHAR(10),
author CHAR(30),
created DATE "YYYY-MM-DD",
fname FILLER CHAR(80),
text LOBFILE(fname) CHARACTERSET UTF8 TERMINATED BY EOF
)
Loading with PL/SQL and BFILEs
NOTE: This method will only work with files on the server. BFILES have no access to
files on the client machine.
It's possible to work directly with BFILEs, but it's perhaps more common to use them as
a kind of temporary staging area to allow you to read data into database columns.
On the face of it, this might seem all we need, but in fact there's some preparation work
we need to do.
LOADFROMFILE will only load into an open clob. So we need to prepare a lob locator,
open it, call LOADFROMFILE, then close it afterwards.
The procedure below first creates a BFILE from the supplied directory and
filename, then inserts the metadata into my_table, fetching back the LOB locator
for the newly created row. It then uses LOADFROMFILE to write into that LOB:
CREATE SEQUENCE loadtest_seq;

CREATE OR REPLACE PROCEDURE load_file_to_clob  -- procedure name is illustrative
( dir_name VARCHAR2,
file_name VARCHAR2,
author VARCHAR2
) IS
l_bfile BFILE;
l_clob CLOB;
BEGIN
l_bfile := bfilename( dir_name, file_name );
IF (dbms_lob.fileexists(l_bfile) = 1) THEN
dbms_output.put_line('File Exists');
INSERT INTO my_table (id, author, created, text)
VALUES (loadtest_seq.NEXTVAL,
author,
sysdate,
EMPTY_CLOB()
)
RETURNING text INTO l_clob;
dbms_lob.open( l_clob, dbms_lob.lob_readwrite );
dbms_lob.fileopen( l_bfile, dbms_lob.file_readonly );
dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength(l_bfile) );
dbms_lob.fileclose( l_bfile );
dbms_lob.close( l_clob );
COMMIT;
ELSE
dbms_output.put_line('File does not exist');
END IF;
END;
/
Note that in this case we're using SYSDATE for the creation date, which differs
slightly from previous examples. We could of course include create_date as
another argument to our procedure.
Common Mistakes:
Remember that the file must be found on the database server. You cannot use this
method to load a file which is located on a local machine if you are using SQL*Plus or
SQL Developer remotely.
The directory name is an identifier, similar to a username, table name or column name.
As such, if you specify it without double quotes when you create it, it is
automatically upper-cased. But BFILENAME expects the directory name to be
passed in as a string, so you must specify it in upper case if the directory was
originally created without quotes.
Multibyte Considerations
LOADFROMFILE copies raw bytes and performs no character set conversion. One
approach for multibyte data is to read the BFILE in chunks through a RAW buffer,
cast each chunk to VARCHAR2, and append it to the CLOB:
CREATE OR REPLACE PROCEDURE load_multibyte_file  -- procedure name is illustrative
( dir_name VARCHAR2,
file_name VARCHAR2,
author VARCHAR2
) IS
l_bfile BFILE;
l_clob CLOB;
buf RAW(32767);
vc VARCHAR2(32767);
amt INTEGER;
amtvc INTEGER;
v_offset INTEGER := 1;
maxsize CONSTANT INTEGER := 32767;
l_length INTEGER;
BEGIN
l_bfile := bfilename( dir_name, file_name );
IF (dbms_lob.fileexists(l_bfile) = 1) THEN
dbms_output.put_line('File Exists');
INSERT INTO my_table (id, author, created, text)
VALUES (loadtest_seq.NEXTVAL,
author,
sysdate,
EMPTY_CLOB()
)
RETURNING text INTO l_clob;
dbms_lob.fileopen( l_bfile, dbms_lob.file_readonly );
l_length := dbms_lob.getlength( l_bfile );
WHILE v_offset <= l_length LOOP
IF l_length - v_offset + 1 < maxsize THEN
amt := l_length - v_offset + 1;
ELSE
amt := maxsize;
END IF;
dbms_lob.read( l_bfile, amt, v_offset, buf );
v_offset := v_offset + amt;
vc := utl_raw.cast_to_varchar2(buf);
amtvc := LENGTH(vc);
dbms_lob.writeappend( l_clob, amtvc, vc );
END LOOP;
dbms_lob.fileclose( l_bfile );
COMMIT;
ELSE
dbms_output.put_line('File does not exist');
END IF;
END;
/
External Tables
External tables appear to the database much like normal tables, but their data resides
on the file system. The layout of the external files is described using a variation on
SQL*Loader syntax.
However, SQL*Loader loads from the client file system, whereas external tables expect
their data files to exist on the server file system.
External tables are particularly useful for loading large numbers of records from a single
file, so each record appears in its own row in the table.
Here's an external table definition matching our earlier table:
CREATE TABLE my_external_table
(id NUMBER, author VARCHAR2(30), created DATE, text CLOB)
ORGANIZATION EXTERNAL (
TYPE ORACLE_LOADER
DEFAULT DIRECTORY MY_DIRECTORY
ACCESS PARAMETERS (
RECORDS DELIMITED BY NEWLINE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( id CHAR(10),
author CHAR(30),
created CHAR(10) DATE_FORMAT DATE MASK "YYYY-MM-DD",
text CHAR(131071)
)
)
LOCATION ( MY_DIRECTORY:'external.dmp' )
);
For example, external.dmp might contain:
1,John Smith,2015-04-29,The quick brown fox jumps over the lazy dog
2,Pete Jones,2013-01-31,"The angry alligator, they say, ate the antelope"
Note the second line requires quotes around the full text string, as otherwise the field
would be terminated at the comma. Of course we meet the usual problem that the full
text string may contain quotes. We can counter that by doubling them:
2,Pete Jones,2013-01-31,"The ""angry"" alligator, they say, ate the antelope"
But we still have the issue that the full text might contain the record delimiter character
(0x'0A', or newline). So as with SQL*Loader, we can specify a separate file for each
document. External tables do not support the FILLER directive, so instead we have to
use a COLUMN TRANSFORMS line to specify that our fname field contains the name
of our external file:
ORGANIZATION EXTERNAL (
TYPE ORACLE_LOADER
ACCESS PARAMETERS (
( id CHAR(10),
author CHAR(30),
fname char(100)
)
LOCATION (
'loader.txt'
);
Note: the "FROM (MY_DIRECTORY)" clause is not necessary here since we've already
specified that as the default directory.
However if the loader.txt file and the individual clob files were in different locations, it
would be needed.
Once your data is in an external table, you can either leave it there and query it
like any other table (note that ORACLE_LOADER external tables are read-only, so
you can SELECT from them but not run DML against them), or you can use the
external table as a "staging" table to get the data into a normal table. You can simply do
CREATE TABLE normaltable AS SELECT * FROM externaltable;
to create a new normal table. And of course you can similarly do:
INSERT INTO normaltable SELECT * FROM externaltable;
to append the data to an existing table.
Conclusion
There are several ways to load documents into LOB columns in Oracle database tables.
The choice of method will depend on the nature and location of your data, the
performance required for loading, and perhaps the loading method you're already most
familiar with.