
Word-to-LaTeX specification
Michal Kebrt
April 19, 2005

Contents

1 Introduction
  1.1 Text processors
  1.2 TeX and LaTeX
  1.3 Word vs. LaTeX
2 Word to LaTeX conversion
  2.1 Internal conversion
  2.2 External conversion
  2.3 What to expect
3 Word-to-LaTeX
  3.1 Introduction
  3.2 Figures
  3.3 Mathematical equations
  3.4 Structural parts of a document
  3.5 Formatting
  3.6 Interaction with LaTeX
  3.7 Output options
  3.8 Miscellaneous options and features
  3.9 Program settings
  3.10 Libraries
  3.11 Future improvements
4 Conversion programs
  4.1 Word2TeX
  4.2 RTF to LaTeX convertors
  4.3 Other convertors

1 Introduction

Word-to-LaTeX will be a program for converting documents written in Microsoft Word into the LaTeX format. The program will be written as a software project (PRG033) at Charles University, Faculty of Mathematics and Physics.

1.1 Text processors

Microsoft Word, WordPerfect, OpenOffice.org Writer and many more are examples of word processors, so-called WYSIWYG¹ text editors. These programs enable a user to create documents and their design and layout interactively by selecting from a wide variety of commands in the program menu. The user always sees the document in its final form: all the document formatting is displayed on the screen (for example, a heading appears in a bold and bigger font).

1.2 TeX and LaTeX

LaTeX is a typographic system used for typesetting scientific and mathematical documents in high typographic quality. The system is also appropriate for creating many different kinds of documents, from plain letters to large books. LaTeX is also a standard for submitting manuscripts to many (scientific) conferences.

The main difference between LaTeX and Word is that when you make a document in LaTeX you must usually write all the text and commands (for example, the \textbf{foo} command makes the text foo bold) directly into a plain text file, and you can't see the final look of the document until you run a program which generates a PostScript or PDF file.

LaTeX uses TeX [1], a computer program for typesetting which was developed by Professor Donald E. Knuth.

1.3 Word vs. LaTeX

It's not easy to say which is better, Word (as an example of a word processor) or LaTeX. Everybody needs and likes different approaches to writing documents. Here are some advantages of LaTeX and Word.

LaTeX advantages:
• Users can use predefined document templates (e.g. for articles and books) with a professional look.
• Great means for writing mathematical expressions.
• It's not necessary to specify the document formatting and look; they depend on the document style. The user writes only commands determining the logical structure of a document (e.g. sections and footnotes).
• Many add-ons (e.g. for inserting graphics or hyperlinks).
• Wide portability of the TeX and LaTeX systems.
• It's free.

Word advantages:
• Easy to use and learn for most people.
• The user always sees the document in its final form.
¹ what you see is what you get

2 Word to LaTeX conversion

There are two possible ways to convert Word documents to the LaTeX format. Much of the information, as well as the terms internal and external conversion, comes from the article [3].

2.1 Internal conversion

Internal conversion is carried out within the Word program using its object model. It's not significant whether you use a VBA macro or some external program. The most important thing is that all document parts and all document information, including formatting, Word application settings, etc., are accessible and usable. An example of a program using internal conversion is Word2TeX [4].

2.2 External conversion

External conversion is performed without the Word application by an external program. There are at least two ways to externally convert a Word document into LaTeX: either access the Word document directly as a binary file, or save the document in a more accessible format (typically RTF) and then convert it into LaTeX.

External conversion has one big disadvantage in comparison with internal conversion: it's usually impossible to retrieve all the document information, especially about the logical structure of the document.

The first method is completely independent of a Word installation, so it can be performed outside the Windows environment. Although the idea of parsing the Word binary format is rather hard to imagine, there are a few programs that use this method, e.g. word2x [5] or Antiword [8].

Programs that convert RTF into LaTeX include rtf2latex2e [6] and w2latex [7].

2.3 What to expect

It's not possible to perform a 1:1 conversion, as Word and LaTeX are very different document preparation systems. The most important task is surely to convert all the text content, which especially means translating special characters correctly (e.g. % to \%). The better the Word document is structured and formatted, the better the results conversion programs will generate. This is the reason why users should use paragraph styles and the appropriate Word functions for inserting footnotes, bibliography, index, etc. Once users follow these rules, conversion programs can properly convert almost every part of a document.

Another important question is how to convert figures (including embedded ActiveX objects) and mathematical equations. This issue is not easy and will be described in detail in the next section.
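The translation of special characters mentioned above can be illustrated with a minimal Python sketch. Only the pair % to \% comes from this specification; the rest of the table and the names LATEX_SPECIALS and escape_latex are assumptions made for illustration, not the program's actual mapping.

```python
# Hypothetical default table: only "%" -> "\%" is taken from the specification.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "%": r"\%",
    "&": r"\&",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Translate plain text one character at a time.

    Working per character means the backslashes produced by one
    replacement are never escaped a second time.
    """
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(escape_latex("100% of A&B"))
```

A table of this kind is easy to expose to users, which matches the requirement in section 3.8 that the default mapping can be changed.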

3 Word-to-LaTeX

3.1 Introduction

Word-to-LaTeX will perform so-called internal conversion, since it will use the Word Object Model to access all the document parts and information. Microsoft Visual Studio 2003 and the C# language were chosen as the development environment. Word-to-LaTeX will run only on Windows with Microsoft Word installed.

The following sections describe the Word-to-LaTeX features and the options that can be set.

3.2 Figures

One of the most important things to convert are figures: images, ActiveX controls (e.g. Microsoft Excel graphs), automatic shapes and so on. Word-to-LaTeX will support two different kinds of figure conversion: as an EPS image (containing PostScript commands) or as an image in its original format (a JPEG photo will be exported as a JPEG file, an Excel graph or automatic shape as a GIF file). The user will have to choose one type of conversion that will be used for the whole document.

Word-to-LaTeX will have an option to export only figures (not text, lists, etc.) so users can first save all figures as EPS, then as raster images, and finally choose what's better for each figure.

Conversion to the EPS format will be performed by an external PostScript printer driver (e.g. Apple LaserWriter II) which can be easily installed on Windows. The conversion procedure is rather cumbersome: the figure will first be copied into the clipboard, then pasted into a temporary Word document, which will be printed into an EPS file using the PostScript printer driver. Once this is done, the Bounding Box property specifying the picture size must be edited to match the picture size in the Word document. The Unix command-line program ps2eps [10] can edit the Bounding Box property automatically, but on Windows it requires a few dependencies, so I will edit this property without any external program. This means changing four numbers in the head of an EPS file, which is a plain ASCII text file.

On the other hand, the export to the picture's original format is quite easy. When the document is saved as a web page, all the figures (including ActiveX objects etc.) are exported as JPEG, PNG or GIF files.
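Editing the Bounding Box by hand amounts to rewriting one header comment. The following Python sketch shows the idea under stated assumptions: the function name set_bounding_box is hypothetical, the four values are expected in PostScript points, and the sketch handles only a numeric %%BoundingBox comment (not the deferred "(atend)" form).

```python
import re

# Numeric form of the header comment: %%BoundingBox: llx lly urx ury
BBOX_RE = re.compile(
    r"^%%BoundingBox:[ \t]*(-?\d+)[ \t]+(-?\d+)[ \t]+(-?\d+)[ \t]+(-?\d+)",
    re.MULTILINE,
)


def set_bounding_box(eps_text: str, llx: int, lly: int, urx: int, ury: int) -> str:
    """Return eps_text with the first numeric %%BoundingBox comment replaced.

    Only the matched comment is rewritten, so the original line ending
    (LF or CRLF) and the rest of the file stay untouched.
    """
    match = BBOX_RE.search(eps_text)
    if match is None:
        raise ValueError("no numeric %%BoundingBox comment found")
    new_comment = f"%%BoundingBox: {llx} {lly} {urx} {ury}"
    return eps_text[:match.start()] + new_comment + eps_text[match.end():]


if __name__ == "__main__":
    sample = "%!PS-Adobe-3.0 EPSF-3.0\n%%BoundingBox: 0 0 612 792\n%%EndComments\n"
    print(set_bounding_box(sample, 72, 500, 272, 650))
```

How the four numbers are derived from the picture size in the Word document is left open here, as the specification does not describe it.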

3.3 Mathematical equations

There are three ways to insert mathematical equations into a Word document. The first is the EQ field (Insert → Field), which can be used even for quite complicated expressions containing sums, brackets, matrices, fractions, etc. Expressions prepared with the EQ field must be written in a source code similar to LaTeX (e.g. \f(5;3) for a fraction), but it has some limitations; for example, you can't create a triple integral. As there is no API for the EQ field, the conversion to LaTeX must be performed by parsing the source code of the expressions.

Equation Editor (typically in version 3) is a part of the Microsoft Office package. It's a visual editor without any mode for writing expressions in source code as in the EQ field. In spite of this, Equation Editor can convert EQ field expressions into its own format, but not back. The only way to convert Equation Editor expressions to LaTeX is to parse their binary format (there's no public API). Although this format is public [11], I find this approach hard to imagine.

MathType [12] is a professional (and commercial) version of Equation Editor with some great improvements: numbered equations, automatic recognition of variables, functions and constants, export to GIF or EPS, and conversion to MathML or LaTeX. MathType also has an API for basic work with expressions (setting conversion translators, converting and saving expressions, etc.). As MathType can work with Equation Editor and EQ field expressions, this API makes it possible to convert all the expressions within a Word document to LaTeX. But the MathType SDK Derivative Works Distribution License Agreement [13] states that programs using the SDK must be distributed free of charge and only within the programmer's company or faculty.

So, what mathematical expressions will Word-to-LaTeX convert? There are two possibilities: either I'll make a convertor for EQ field expressions and all the other mathematical expressions will be converted as figures (typically EPS), or I can use the MathType SDK for converting all the expressions to LaTeX.
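Parsing EQ field source code can be sketched for the one switch quoted above, the fraction \f(5;3). The Python sketch below is only an illustration: the function names are hypothetical, the ";" separator is taken from the example in the text (other Word configurations may use a different list separator), and every other EQ switch is copied through unchanged.

```python
def eq_to_latex(src: str) -> str:
    r"""Translate \f(a;b) fractions to \frac{a}{b}; copy all other text as is."""
    out, i = [], 0
    while i < len(src):
        if src.startswith("\\f(", i):
            args, i = _read_args(src, i + 3)
            if len(args) != 2:
                raise ValueError("the fraction switch expects two arguments")
            # Arguments may contain nested fractions, so translate them too.
            out.append("\\frac{%s}{%s}" % (eq_to_latex(args[0]), eq_to_latex(args[1])))
        else:
            out.append(src[i])
            i += 1
    return "".join(out)


def _read_args(src: str, i: int):
    """Split the text after '(' at top-level ';' up to the matching ')'.

    Returns the list of arguments and the index just past the ')'.
    """
    args, current, depth = [], [], 0
    while i < len(src):
        ch = src[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                args.append("".join(current))
                return args, i + 1
            depth -= 1
        elif ch == ";" and depth == 0:
            args.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    raise ValueError("unbalanced parentheses in EQ field source")


if __name__ == "__main__":
    print(eq_to_latex(r"\f(1;\f(a;b))+x"))
```

A full convertor would need one such rule per EQ switch (sums, brackets, matrices and so on); the recursive structure stays the same.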

3.4 Structural parts of a document

• footnotes
• cross-references (to headings, figures, etc.)
• titles (for figures, tables, etc.)
• hyperlinks (will be converted to footnotes or references)
• tables
• lists (ordered and unordered)
• headings (built-in Word paragraph styles Heading x will be converted to appropriate section commands by default, e.g. Heading 1 to \section; this mapping can be changed)
• index (inserted by the XE field)
• table of contents (not converted; LaTeX makes the table of contents automatically with the \tableofcontents command)
• table of figures
• references (inserted by the TA field)
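The heading mapping from the list above could look roughly like the following Python sketch. Only Heading 1 to \section is stated in this specification; the deeper levels, the names DEFAULT_HEADING_MAP and heading_to_latex, and the override mechanism are assumptions.

```python
# Default mapping of built-in Word heading styles to sectioning commands.
# Only "Heading 1" -> \section comes from the specification.
DEFAULT_HEADING_MAP = {
    "Heading 1": r"\section",
    "Heading 2": r"\subsection",
    "Heading 3": r"\subsubsection",
}


def heading_to_latex(style_name, title, overrides=None):
    """Return the sectioning command for a paragraph, or None if the
    paragraph style is not a mapped heading style.

    `overrides` lets the user change the default mapping. The title is
    inserted as given; escaping special characters is a separate step.
    """
    mapping = {**DEFAULT_HEADING_MAP, **(overrides or {})}
    command = mapping.get(style_name)
    if command is None:
        return None
    return f"{command}{{{title}}}"


if __name__ == "__main__":
    print(heading_to_latex("Heading 1", "Introduction"))
    print(heading_to_latex("Heading 1", "Introduction", {"Heading 1": r"\chapter"}))
```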

3.5 Formatting

• user can define a mapping of paragraph and character styles to LaTeX commands (e.g. style preformatted to \verb); optionally a special environment for each style can be created
• text in italics, bold and other styles will be converted to appropriate LaTeX commands (e.g. \emph, \textbf); the default mapping can be changed
• the font size can't be set exactly in LaTeX, so there'll be a point range for each command (e.g. 8 to 10 for \small); the default ranges can be changed
• page breaks
• paragraph indenting and aligning +²
• text boxes +


² The + sign marks features which Word2TeX [4] (the best Word-to-LaTeX convertor) doesn't have.
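The point ranges for font sizes from the Formatting list can be pictured as a small lookup table. In this Python sketch only the range 8 to 10 for \small is taken from the text; the other rows, the treatment of the boundaries (lower bound included, upper bound excluded) and the name size_command are assumptions.

```python
# (lower bound, upper bound, command), sizes in points.
# Only the "8 to 10 -> \small" row comes from the specification.
DEFAULT_SIZE_RANGES = (
    (0.0, 8.0, r"\footnotesize"),
    (8.0, 10.0, r"\small"),
    (10.0, 12.0, r"\normalsize"),
    (12.0, 14.0, r"\large"),
)


def size_command(points, ranges=DEFAULT_SIZE_RANGES):
    """Return the size command whose range contains `points`.

    Each range includes its lower bound and excludes its upper bound.
    None means that no range matched and no size command is emitted.
    """
    for low, high, command in ranges:
        if low <= points < high:
            return command
    return None


if __name__ == "__main__":
    print(size_command(9))
```

Passing a different `ranges` table corresponds to the requirement that the default ranges can be changed.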

3.6 Interaction with LaTeX

• editable document preamble; macros like %Author and %Title used in the preamble will be replaced with the respective information from the Word document
• commands of the form [LATEX:cmd1]...[\\LATEX:cmd2] can be used in a Word document; for example, [LATEX:\textbf{]foo[\\LATEX:}] results in bold foo +
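Both features above are simple text substitutions, as the following Python sketch suggests. It assumes that the closing marker is written with two literal backslashes exactly as printed in the example, that a command never contains the character "]", and that macro names start with an upper-case letter; the names expand_markers and fill_preamble are hypothetical.

```python
import re

# Opening marker [LATEX:cmd] or closing marker [\\LATEX:cmd].
MARKER_RE = re.compile(r"\[(?:\\\\)?LATEX:(.*?)\]")


def expand_markers(text, escape=lambda s: s):
    """Copy marker payloads to the output verbatim.

    The text between markers goes through `escape`, so ordinary
    document text can still have its special characters translated.
    """
    parts, pos = [], 0
    for match in MARKER_RE.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        parts.append(match.group(1))
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


def fill_preamble(preamble, properties):
    """Replace %Name macros with document properties.

    Unknown macros are left alone, so ordinary LaTeX comments survive.
    """
    def substitute(match):
        return properties.get(match.group(1), match.group(0))

    return re.sub(r"%([A-Z][A-Za-z]*)", substitute, preamble)


if __name__ == "__main__":
    print(expand_markers(r"[LATEX:\textbf{]foo[\\LATEX:}]"))
    print(fill_preamble(r"\author{%Author}", {"Author": "Michal Kebrt"}))
```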

3.7 Output options

• the program will produce a LaTeX 2ε output file, but it'll be designed so that other formats can be easily added
• character set of the output LaTeX file
• symbol for the end of a line (CRLF, LF, CR) in the output file
• wrap paragraphs in the output file after x characters or not
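The character set, end-of-line and wrapping options listed above can be combined in a few lines. This Python sketch is an illustration only: the function names and parameter names are hypothetical, and paragraphs are assumed to be separated by one empty line in the output.

```python
import textwrap


def format_paragraph(paragraph, wrap_at=None, eol="\n"):
    """Wrap one paragraph after `wrap_at` characters (None disables
    wrapping) and terminate every line with the chosen end-of-line symbol."""
    if wrap_at:
        lines = textwrap.wrap(
            paragraph, width=wrap_at,
            break_long_words=False, break_on_hyphens=False,
        )
    else:
        lines = [paragraph]
    return eol.join(lines) + eol


def write_output(path, paragraphs, encoding="utf-8", wrap_at=None, eol="\n"):
    """Write paragraphs separated by an empty line, in the given character set."""
    body = eol.join(format_paragraph(p, wrap_at, eol) for p in paragraphs)
    # newline="" stops Python from translating the end-of-line symbols again.
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(body)


if __name__ == "__main__":
    print(repr(format_paragraph("aaa bbb ccc", wrap_at=7, eol="\r\n")))
```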

3.8 Miscellaneous options and features

• automatic detection of the page size, or setting it symbolically (A4, etc.) +
• automatic detection of page margins +
• option for setting the document class (article, book, etc.) +
• translation of special characters and symbols (the default mapping can be changed)

3.9 Program settings

All the program settings will be saved in an XML file with a public format, so users will be able to edit it and adapt the program's behaviour to their needs. It'll also be possible to configure the program through a dialog. The option for saving and loading the program settings will enable users to create several conversion styles for different kinds of documents and then just select one and use it.
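To give an idea of what such a settings file might contain, here is a Python sketch that reads a made-up example. The XML layout (element and attribute names) is entirely hypothetical, since the specification does not define the format yet.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings file; the real format is not fixed by the specification.
SAMPLE_SETTINGS = """\
<settings>
  <output encoding="utf-8" eol="LF" wrap="80"/>
  <headings>
    <map style="Heading 1" command="\\section"/>
    <map style="Heading 2" command="\\subsection"/>
  </headings>
</settings>
"""


def load_settings(xml_text):
    """Parse the settings XML into a plain dictionary with defaults
    for anything the file leaves out."""
    root = ET.fromstring(xml_text)
    output = root.find("output")
    attrs = output.attrib if output is not None else {}
    return {
        "encoding": attrs.get("encoding", "utf-8"),
        "eol": attrs.get("eol", "LF"),
        "wrap": int(attrs.get("wrap", "0")),  # 0 means: do not wrap
        "headings": {
            m.get("style"): m.get("command")
            for m in root.iterfind("headings/map")
        },
    }


if __name__ == "__main__":
    print(load_settings(SAMPLE_SETTINGS))
```

Keeping each conversion style in its own file of this kind would let users switch between styles by loading a different file.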

3.10 Libraries

• MathType SDK will probably be used for converting mathematical equations to LaTeX
• .NET System.Xml library for parsing and creating files with program settings
• .NET System.Text.Encoding class for converting between different character sets

3.11 Future improvements

• processing of numbered equations
• better recognition of mathematical expressions in regular text

4 Conversion programs

4.1 Word2TeX

Word2TeX [4] is surely the best Word-to-LaTeX convertor. It has all the features from the previous section which weren't marked with + and a few additional functions:
• output file in several formats (e.g. LaTeX 2ε, LaTeX 2.09, AMS-LaTeX)
• converts coloured text using a special package
• converts hyperlinks using the hypertex package
• numbered equations
• inserting extra commands for pdfTeX
• user can define their own mapping for mathematical expressions

4.2 RTF to LaTeX convertors

• rtf2latex2e [6] produces quite nice LaTeX output; it processes font styles, footnotes, tables, paragraph styles, Equation Editor 3.0 equations and some figures
• other RTF to LaTeX convertors can be found at CTAN sites [9], but they usually can't process new versions of the RTF format

4.3 Other convertors

• word2x [5]: widely portable external convertor
• Antiword [8]: converts only to plain text or PostScript, performs external conversion, widely portable; processes font styles and sizes, footnotes, lists, tables, etc.; has problems with figures
• a couple of very old converters (e.g. Word TeX) can be found at CTAN sites [9]

References

[1] Donald E. Knuth. The TeXbook, Volume A of Computers and Typesetting, Addison-Wesley Publishing Company (1984), ISBN 0-201-13448-9.
[2] Ne příliš stručný úvod do systému LaTeX 2ε. http://www.penguin.cz/~kocer
[3] Marion Neubauer. Conversion from WORD/WordPerfect to LaTeX, MAPS 14, 1995, 120-124, http://www.ntg.nl/maps/maps14.html.
[4] Word2TeX, http://www.chikrii.com
[5] word2x, http://word2x.sourceforge.net
[6] rtf2latex2e, http://sourceforge.net/projects/rtf2latex2e
[7] w2latex, http://www.tug.org/utilities/texconv/w2latex.html
[8] Antiword, http://www.winfield.demon.nl
[9] CTAN, ftp://ftp.cstug.cz/pub/tex/CTAN
[10] ps2eps, http://www.telematik.informatik.uni-karlsruhe.de/~bless/ps2eps.html
[11] Equation Editor expressions format, http://www.dessci.com/en/reference/sdk/default.htm#MTEF
[12] MathType, http://www.dessci.com/en/products/mathtype
[13] MathType SDK Derivative Works Distribution License Agreement, http://www.dessci.com/en/support/eula/mathtype/mtderivlic.htm