

Chapter 11
Importing, Exporting and Deploying Documents

The RichDoc framework can import and export documents from/to several formats. It also contains a component for presenting documents on the Web using the Apache Tomcat technology. This component may be readily used as a stand-alone web application, or may be easily integrated into an existing Tomcat application.

In particular, the RichDoc framework includes quality components for importing LaTeX (see Section 11.7) and exporting HTML (Section 11.2). This makes it possible to use RichDoc as a LaTeX-to-HTML converter, with the RichDoc format as the intermediate format.

11.1 Interface to Import/Export modules

All import/export modules share a uniform interface for setting the properties of the conversion process. There are two ways to invoke a conversion: from the command line, or from the user interface of the BookEditor.

Each module has a different set of parameters that affect its operation. You can define named sets of these parameters, called profiles. For example, if you frequently export a certain document with particular settings, you can save these settings under an appropriate name and recall them by that name whenever needed. If you invoke the conversion process from the BookEditor, for example with the command Export → LaTeX, a visual profile management tool appears; see Figure 11.1.


Figure 11.1 Profile Management Window

The top part of the window contains the profile management controls. You can select the active profile from the drop-down list, create a new profile, save a modified profile, or delete a profile. Below are the details of the selected profile. A profile consists of a list of options for which you can specify custom values. Values shown in italics (text) or blended (check boxes) are defaults. A red ‘A’ indicates that the value is computed automatically from other values. For instance, the Main File value is derived from the Input Path value: it is the file title of the Input Path. When you change a value, the Save button becomes available so that you can save the modified settings. Pressing the OK button starts the conversion.
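The relationship between default, custom and automatic values can be pictured with a short Python sketch. It is only an illustration: the default values, the resolve_profile function and the sample path are hypothetical, and "file title" is assumed here to mean the last component of the Input Path.

```python
from pathlib import PurePosixPath

# Hypothetical defaults; the real option names and default values may differ.
DEFAULTS = {"characterEncoding": "UTF-8", "splitLevel": 0}

def resolve_profile(custom):
    """Merge custom values over the defaults and fill in automatic values."""
    resolved = dict(DEFAULTS)
    resolved.update(custom)
    # Automatic value: derived from Input Path unless it was set explicitly.
    if "mainFile" not in resolved and "inputPath" in resolved:
        resolved["mainFile"] = PurePosixPath(resolved["inputPath"]).name
    return resolved

profile = resolve_profile({"inputPath": "books/manual", "splitLevel": 1})
print(profile)
```

Setting mainFile explicitly in the custom values would override the automatic value, which mirrors how the red ‘A’ marks only values that have not been customized.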

Alternatively, you may invoke the conversion from the command line using the command

richDocIo -mode mode [-profile profile] [-interactive] profile_options

where mode may be any of the values listed in Table 11.1. The options are documented separately for each mode.

Table 11.1 Export / Import Modes

Mode          Section         Mode            Section
exportHtml    Section 11.2    importHtml      Section 11.6
exportLatex   Section 11.3    importLatex     Section 11.7
exportPdf     Section 11.4    importDocbook   Section 11.8
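As a rough illustration of the synopsis, the following Python sketch assembles a richDocIo argument list. It assumes that every profile option is written as "-name value" (the guide does not spell out the value syntax, in particular for check-box options), and the profile name "web" and the file names are made up for the example.

```python
import shlex

def build_command(mode, profile=None, interactive=False, **options):
    """Assemble a richDocIo argument list following the synopsis above."""
    argv = ["richDocIo", "-mode", mode]
    if profile is not None:
        argv += ["-profile", profile]
    if interactive:
        argv.append("-interactive")
    # Assumption: each profile option is passed as "-name value".
    for name, value in options.items():
        argv += [f"-{name}", str(value)]
    return argv

argv = build_command("exportHtml", profile="web",
                     inputPath="manual", outputPath="site.zip", splitLevel=1)
print(shlex.join(argv))
```

The list can be handed to subprocess.run once the actual option syntax has been confirmed against the installed tool.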

11.2 Exporting HTML

This module converts RichDoc documents into the Hypertext Markup Language (HTML) format. The module translates the logical structure of the RichDoc document into appropriate HTML markup, including lists, tables, hyperlinks, etc. Embedded figures and formulas are automatically converted into inline images linked from the generated HTML files. The document may be split into separate files at a specified level. Optionally, navigation may be added to the generated files. A complete list of options follows.

General
Input Path (-inputPath)
Input path to the source RichDoc document.
Output Path (-outputPath)
Output path to the desired HTML output. It may be either a directory or a ZIP file.
Character Encoding (-characterEncoding)
Desired character encoding of the generated HTML files.
Split Level (-splitLevel)
Specifies the level at which sections are split into separate HTML files. A value of 0 corresponds to the level of chapters, a value of 1 to the level of sections, and so on.
Condensed Code (-condensedCode)
Specifies whether the generated HTML should omit any whitespace to reduce size.
Style Sheet
Style Sheet Path (-styleSheetPath)
Path to the style sheet file. The file is put into the generated directory or ZIP file, and the generated HTML files are linked to the style sheet. If not specified, the default style sheet is used.
Style Sheet Local Path (-styleSheetLocalPath)
Specifies a different local path to the generated style sheet.
Navigation
Create Top Navigation (-createTopNavigation)
Add a banner to the top of each generated HTML file with links to previous, next and up sections.
Create Bottom Navigation (-createBottomNavigation)
Add a banner to the bottom of each generated HTML file with links to previous, next and up sections.
Create List Of Child Links (-createListOfChildLinks)
Add links to immediate child sections to the end of each HTML page.
Frame Set
Create Table Of Contents Frame (-createTableOfContentsFrame)
Create a series of HTML files representing an expandable Table of Contents for the document. The generated index.html file contains an HTML frameset with the Table of Contents on the left and the main document body on the right.
contentFrameName (-contentFrameName)
Name of HTML frame containing the document body.
Decoration
Bottom Line (-bottomLine)
Specify text that should be added to the end of each generated HTML page, such as authorship or copyright information.
Animation
imageMagickConvert (-imageMagickConvert)
Microsoft Html Help
Create Chm File (-createChmFile)
Compile generated HTML files into Microsoft HTML Help format.
Chm Output Path (-chmOutputPath)
Hhw Title Page (-hhwTitlePage)
Hhw Executable (-hhwExecutable)
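The effect of the Split Level option can be illustrated with a small Python sketch. It assumes that a heading starts a new HTML file whenever its level number does not exceed the split level (0 for chapters, 1 for sections); the function and the sample outline are hypothetical and do not reflect the module's internals.

```python
def assign_files(outline, split_level):
    """Map each (level, title) heading to the index of the file that holds it."""
    assignment = []
    file_index = -1
    for level, title in outline:
        # A heading at or above the split level opens a new file.
        if level <= split_level or file_index < 0:
            file_index += 1
        assignment.append((title, file_index))
    return assignment

outline = [(0, "Intro"), (1, "Scope"), (1, "Terms"), (0, "Usage"), (1, "Setup")]
for split_level in (0, 1):
    print(split_level, assign_files(outline, split_level))
```

With a split level of 0 the five headings land in two files (one per chapter); with a split level of 1 each heading gets its own file.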

11.3 Exporting LaTeX

Sometimes you may want to export a RichDoc document into the LaTeX typesetting format. For instance, you have prepared an article using the BookEditor and want to send it to a publisher that requires a particular visual style for articles and provides a LaTeX-compatible style file for that purpose. You could tune the BookEditor's style sheets to match the required style, but the easiest way is to export the document to LaTeX, add the publisher's LaTeX style, and use LaTeX to generate the final form.

This export module is still under construction and not all problems have been solved yet. In general, however, most features of the RichDoc framework can be converted to LaTeX easily. The issues that deserve particular attention are listed below.

Support for non-English languages
The language of the document is used to generate the reference to the LaTeX babel package. This ensures that titles in the appropriate language are used and that the correct hyphenation patterns are activated. For non-English characters, two options are available. Accented characters (such as Č or ü) can be converted to LaTeX macros, such as \v{C} and \"{u}. Alternatively, the document can be saved in an eight-bit encoding (such as iso-8859-1 for Western European languages or iso-8859-2 for Central European languages). The exported document then uses the inputenc package to declare the encoding.
Support for equations
Most equations should be exported to LaTeX without major problems, but some of the more exotic features may not be supported.
Support for 2D drawing
Embedded pictures are converted to the Encapsulated PostScript format. If you have the Ghostscript package installed, they may optionally be converted to the PDF format as well (you need this if you want to process your LaTeX document with PDFLaTeX). Non-English characters are a problem here, because the PostScript format has no uniform support for them. The export module therefore converts words containing non-English characters to curves, which usually yields acceptable results.
Support for Tables
Converting complex tables to LaTeX may cause major problems, as the LaTeX table layout algorithm has quite limited capabilities. It cannot, for example, automatically determine appropriate width of columns containing long paragraphs. In fact, if you want to have a word-wrapped paragraph in a table, you must set the width of the table column to some fixed value.
Support for Index and Bibliography
The index and bibliography are automatically converted to LaTeX conventions. Moreover, the finishing phase, if enabled, automatically calls the bibtex and makeindex programs to produce a correct index and bibliography. The LaTeX-style glossary is not yet supported.
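The two ways of handling accented characters can be sketched in Python. This is an illustration only, not the export module's code: the accent table covers just a few marks, and the to_latex helper is a hypothetical name.

```python
import unicodedata

# Combining marks mapped to LaTeX accent commands (a small, incomplete subset).
ACCENTS = {
    "\u0301": "'",   # acute
    "\u0300": "`",   # grave
    "\u0302": "^",   # circumflex
    "\u0308": '"',   # diaeresis (umlaut)
    "\u030c": "v",   # caron
}

def to_latex(text):
    """Replace accented letters with LaTeX accent macros where a rule is known."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in ACCENTS:
            out.append("\\" + ACCENTS[decomposed[1]] + "{" + decomposed[0] + "}")
        else:
            out.append(ch)
    return "".join(out)

# Option 1: escape accented characters as macros.
print(to_latex("Čapek, Müller"))
# Option 2: store each character as a single byte in an eight-bit encoding.
print("Č".encode("iso-8859-2"), "ü".encode("iso-8859-1"))
```

In the second case the LaTeX preamble has to declare the matching encoding through inputenc, for example \usepackage[latin2]{inputenc} for iso-8859-2.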

A complete list of options follows.

General
Input Path (-inputPath)
Input path to the source RichDoc document.
Output Path (-outputPath)
Output path to the desired LaTeX output. It may be either a directory or a ZIP file.
Main File (-mainFile)
Name of the file corresponding to the LaTeX document root, i.e. file to be processed with LaTeX.
Character Encoding (-characterEncoding)
Specifies the character encoding to be used for non-ASCII characters. If the "LaTeX" encoding is selected, non-English characters are escaped using the common TeX/LaTeX convention, e.g. ü (u with umlaut) is escaped as \"{u}. If another encoding is selected, characters are encoded as single bytes, and an appropriate command is added to the document preamble.
Local Path (-localPath)
Specifies a path offset from the root file to other files. This path is used to generate inclusion commands, such as \include or \includegraphics.
Process Embedded Images (-processDrawings)
Specifies whether Encapsulated PostScript files should be generated for embedded 2D pictures. This option may be useful for temporarily disabling picture generation if a large document needs to be converted several times and picture generation slows down the conversion too much.
Process Inline Bitmaps (-processInlineBitmaps)
Specifies whether Encapsulated Postscript files should be generated for embedded inline bitmaps, see also Process Embedded Images option.
Generate PDF For Embedded Images (-generatePdfForEmbedded)
Generates embedded images in the PDF format besides the Encapsulated Postscript format. This is needed if you want to process the generated LaTeX document with PDFLaTeX to generate a PDF file.
Options
Font Size (-fontSize)
Specifies the font size of the target document.
Sloppy (-sloppy)
Specifies whether the LaTeX \sloppy mode should be turned on. In this mode, LaTeX breaks paragraphs even if the break creates very large spaces between words. When the mode is disabled, a problematic word is not moved to the next line but extends beyond the printable area instead.
Max Width (-maxWidth)
Specifies the maximum width of tables and figures, in points (1 pt = 0.35 mm or 1/72"). If a figure or table exceeds the maximum width, it is automatically scaled down to fit.
Max Height (-maxHeight)
Specifies the maximum height of tables and figures, in points (1 pt = 0.35 mm or 1/72"). If a figure or table exceeds the maximum height, it is automatically scaled down to fit.
Finishing
Finishing Mode (-finishingMode)
Specifies whether the generated LaTeX document should be processed with LaTeX to generate a DVI file, with LaTeX + DVIPS to generate a PostScript file, or PDFLaTeX to generate a PDF file.
Finish Dir (-finishDir)
Specifies the directory where the LaTeX processor is invoked. Changing the directory may be useful if you want to use a main file different from the automatically generated one.
Finish Main File (-finishMainFile)
Specifies the main file for which the LaTeX processor is invoked.
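The scaling behaviour described for Max Width and Max Height can be sketched as follows. The sketch assumes that a figure or table is scaled uniformly, so that both limits are respected and the aspect ratio is preserved; the guide does not state this explicitly, and the function name and sample dimensions are illustrative.

```python
MM_PER_PT = 25.4 / 72  # based on the 1/72 inch point used in the option text

def fit_scale(width_pt, height_pt, max_width_pt, max_height_pt):
    """Return the factor (at most 1.0) that fits a box inside both limits."""
    return min(1.0, max_width_pt / width_pt, max_height_pt / height_pt)

# A 600 x 300 pt figure with limits of 450 pt (width) and 650 pt (height).
scale = fit_scale(600, 300, 450, 650)
print(scale, round(450 * MM_PER_PT, 2))
```

A figure that already fits within both limits gets a factor of 1.0 and is left unchanged.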

11.4 Exporting PDF

This module converts a RichDoc document into the PDF format. The result is similar to what you would get by printing the document on a regular printer. Optionally, hyperlinks in the document are converted to PDF hyperlinks, and an interactive table of contents (called bookmarks in PDF terminology) is added to the PDF document.

General
Input Path (-inputPath)
Input path to the source RichDoc document.
Output Path (-outputPath)
Output path to the desired PDF output.
Options
orientation (-orientation)
paperSize (-paperSize)
Hyperlinks (-hyperlinks)
Whether RichDoc hyperlinks should be converted to PDF hyperlinks.
Bookmarks (-bookmarks)
Whether PDF bookmarks, i.e. interactive table of contents, should be added to the PDF document.
Incremental (-incremental)

11.5 Exporting SCORM

This module converts a RichDoc document into a file conforming to the ADL SCORM standard.

11.6 Importing HTML

This module converts HTML documents into a RichDoc document.

General
Input Path (-inputPath)
Specifies the path to the HTML files to be imported. The path may be either a directory containing HTML files, a ZIP file containing HTML files, or output from the KSMSA Web Crawler.
Output Path (-outputPath)
Specifies the path to the RichDoc document to be generated.
Language (-language)
Specifies the primary language of the generated RichDoc document.
Character Encoding (-characterEncoding)
Specifies the character encoding of the input HTML documents, if they contain encoded non-ASCII characters. Note that known HTML entities corresponding to non-ASCII characters (such as &uuml;) are automatically converted to single Unicode characters. If an encoding is recorded in the Web Crawler database (i.e. the HTML document was downloaded from the web and the web server returned an encoding for it in the HTTP response header), that value is used instead. Note that the encoding value from the http-equiv header tag is not used.
Document Class (-documentClass)
Specifies the desired document class of the target document.
File Filter
Include File (-includeFile)
Specifies a list of HTML files that should be imported. You may use wildcards, such as *.html for all files with html extension in the main directory, or **/*.html for all HTML files in all directories. If this list is empty, all files are included.
Exclude File (-excludeFile)
Specifies a list of files that should not be imported. You may use wildcards.
Content Filter
Content Start (-contentStart)
Specifies a regular expression defining a position in a file where the conversion should start, such as <body>. This is useful to omit e.g. navigation code or banners. If not specified, or the start sequence is not found in the file, conversion starts from the beginning of the file. Otherwise, the conversion starts from the first occurrence of the start sequence.
Include Content Start (-includeContentStart)
Specifies whether the start sequence should be converted.
Content End (-contentEnd)
Specifies a regular expression defining a position in a file where the conversion should end, such as </body>. If not specified, or the end sequence is not found in the file, conversion continues to the end of the file. Otherwise, the conversion ends at the last occurrence of the end sequence.
Include Content End (-includeContentEnd)
Specifies whether the end sequence should be converted.
Exclude Fragment (-excludeFragment)
Specifies a regular expression of fragments that should be excluded from the conversion process.
Replace Fragment (-replaceFragment)
Specifies regular expression/replacement pairs that are used to pre-process the input HTML file. All occurrences of each regular expression are replaced with the corresponding replacement.
Ignore Tags (-ignoreTags)
List of tag names (without the < and > delimiters) that should be ignored. Note that only the tag delimiters are ignored; the content of the tag is processed normally. If you want to ignore an entire tag, use the Exclude Fragment option with the regular expression <tag[^>]*>.*</tag>.
Output Filtered (-outputFiltered)
Specifies a path to a ZIP file that will contain HTML files that were actually used for conversion, when filtering and replacements were applied. Since any problems found during conversion are reported w.r.t. the filtered content, the filtered files may be useful to
Content Modification
Number Sections (-numberSections)
Whether the sections generated from the <h*> tags should be numbered by the RichDoc framework.
Remove Original Section Numbers (-removeOriginalSectionNumbers)
Whether original section numbers from titles should be removed. This option should be checked if you also checked the Number Sections option.
Detect Unnumbered Sections (-detectUnnumberedSections)
Marks sections that did not contain a number in the original HTML source as unnumbered. Otherwise, they are numbered automatically unless you disabled the Number Sections option.
Misc
Print Font Size (-printFontSize)
Specifies the font size used when the generated RichDoc document is printed. If not specified, the default font size is used.
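To make the Content Filter options more concrete, here is a minimal Python sketch of how Content Start, Content End, Exclude Fragment and Ignore Tags might combine on a single file. It uses Python's re module, whose syntax may differ from the regular expression dialect the importer accepts, and the order in which the real module applies the filters is an assumption. The filter_content function and the sample page are hypothetical.

```python
import re

def filter_content(html, start=None, end=None, include_start=False,
                   include_end=False, exclude=None, ignore_tags=()):
    """Apply Content Start/End, Exclude Fragment and Ignore Tags to one file."""
    if start:
        m = re.search(start, html)              # first occurrence of the start
        if m:
            html = html[m.start() if include_start else m.end():]
    if end:
        matches = list(re.finditer(end, html))  # last occurrence of the end
        if matches:
            last = matches[-1]
            html = html[:last.end() if include_end else last.start()]
    if exclude:
        html = re.sub(exclude, "", html, flags=re.DOTALL)
    for tag in ignore_tags:
        # Drop only the delimiters; the text inside the tag is kept.
        html = re.sub(rf"</?{re.escape(tag)}\b[^>]*>", "", html)
    return html

page = ('<html><body><div class="nav">Home</div>'
        '<p>Hello <font color="red">world</font></p></body></html>')
result = filter_content(page, start=r"<body>", end=r"</body>",
                        exclude=r"<div[^>]*>.*?</div>", ignore_tags=["font"])
print(result)
```

The exclude pattern uses the non-greedy .*? so that two separate div fragments are removed individually rather than together with everything between them.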

11.7 Importing LaTeX

Documentation not yet available.

General
Input Path (-inputPath)
Specifies the path to the main LaTeX file to import.
Output Path (-outputPath)
Specifies the path to the RichDoc document to be generated.
Language (-language)
Specifies the primary language of the generated RichDoc document.
Character Encoding (-characterEncoding)
Specifies the character encoding of the input LaTeX document, if it contains encoded non-ASCII characters. Note that LaTeX-escaped non-ASCII characters (such as \"{u}) are automatically converted to single Unicode characters.
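The conversion of LaTeX-escaped characters to single Unicode characters can be illustrated with the following Python sketch. It handles only a handful of accent commands in the braced form and is not the importer's implementation; the names COMBINING, ACCENT_MACRO and from_latex are hypothetical.

```python
import re
import unicodedata

# LaTeX accent commands mapped to Unicode combining marks (incomplete subset).
COMBINING = {"'": "\u0301", "`": "\u0300", "^": "\u0302",
             '"': "\u0308", "v": "\u030c"}

# Matches a backslash, one accent command, and a single braced letter.
ACCENT_MACRO = re.compile(r"""\\(['`^"v])\{([A-Za-z])\}""")

def from_latex(text):
    """Turn braced accent macros into single precomposed characters."""
    def compose(match):
        accent, letter = match.groups()
        # NFC merges the base letter and the combining mark into one character.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return ACCENT_MACRO.sub(compose, text)

print(from_latex(r'\v{C}apek, M\"{u}ller'))
```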

11.8 Importing DocBook

This module imports documents that use the DocBook [1] XML markup; see http://www.docbook.org.

General
Input Path (-inputPath)
Specifies the path to the XML files with DocBook markup to be imported. The path may be either a directory containing XML files, a ZIP file containing XML files, or output from the KSMSA Web Crawler.
Main File (-mainFile)
The main XML file corresponding to the root of the DocBook document.
Output Path (-outputPath)
Specifies the path to the RichDoc document to be generated.
Language (-language)
Specifies the primary language of the generated RichDoc document.
Character Encoding (-characterEncoding)
Specifies the character encoding of the input XML documents, if they contain encoded non-ASCII characters.
Document Class (-documentClass)
Specifies the document class of the generated RichDoc document.
Create TOC (-createToc)
Whether Table of Contents should be added to the generated RichDoc document.
Create Short TOC (-createShortToc)
Whether short Table of Contents (chapters only) should be added to the generated RichDoc document.
File Filter
Include File (-includeFile)
Specifies a list of files or wildcards to be imported.
Exclude File (-excludeFile)
Specifies a list of files or wildcards to be excluded from the conversion process.
Misc
Print Font Size (-printFontSize)
Specifies the font size used when the generated RichDoc document is printed. If not specified, the default font size is used.
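The Include File and Exclude File options take wildcards. Assuming they follow the convention described for the HTML importer in Section 11.6 (a single * stays within one directory, while **/ spans any number of directories), the selection can be sketched in Python as below. Only this subset of the syntax is covered, and the helper names and file list are hypothetical.

```python
import re

def glob_to_regex(pattern):
    """Translate a wildcard pattern into an anchored regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")   # any number of directories, including none
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")      # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")

def select(paths, include=(), exclude=()):
    """Keep paths matching any include pattern (all, if none) and no exclude."""
    inc = [glob_to_regex(p) for p in include]
    exc = [glob_to_regex(p) for p in exclude]
    return [p for p in paths
            if (not inc or any(r.match(p) for r in inc))
            and not any(r.match(p) for r in exc)]

files = ["book.xml", "ch1/intro.xml", "ch1/logo.png", "ch2/draft.xml"]
print(select(files, include=["**/*.xml"], exclude=["ch2/*"]))
print(select(files, include=["*.xml"]))
```

An empty include list keeps every file, matching the rule that all files are included when the list is empty.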

11.9 Deploying Documents

Documentation not yet available.

