Converting Word to ThML
The conversion from Microsoft Word to an eventual XML document has
several steps. First, the document is converted to XML, using
the Transdoc DTD. This is done with the free @rtf2xml package
(v0.4 or v0.5).
Then, these XML documents are converted to SGML that conforms to the
ThML DTD, using custom Perl scripts. The scripts are development-level:
not entirely finished, with some known bugs, but they have
successfully converted several documents. To keep each
script small and manageable, I split the work into several
smaller jobs:
- trn2thm -- basic Transdoc to ThML conversion. Results in ThML
documents that can be validated, but more cleanup remains to be done.
- classify -- change all of the style attributes spread throughout
the document into classes and a document stylesheet.
- ncrease -- interpolate n attributes in and
tags, e.g.
if one page is
and the next page is
, change it to
. Also interpolate href in pb elements, for images of
pages.
- parsescr -- parse scripture references; store the parsed result in
the parsed="..." attribute of , , or .
- identify -- add id attribute to each element in body.
- index -- convert elements to actual indexes.
- thm2htm -- convert to HTML for the web
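To make the classify step above more concrete, here is a minimal sketch of the idea: every distinct style="..." value is replaced by a generated class, and the collected rules form the document stylesheet. It is written in Python for illustration only and is not the Perl script from the zip file. The function name, the class names (c1, c2, ...) and the CSS-like rule syntax are assumptions, and elements that already carry a class attribute are not handled.

```python
import re

def classify(doc: str):
    """Replace each style="..." attribute with class="cN" and collect the rules.

    Hypothetical sketch: the class naming scheme and the rule syntax are
    assumptions, not taken from the original classify script.
    """
    classes = {}  # style text -> generated class name, in order of first use

    def swap(match):
        style = match.group(1).strip()
        # Reuse the class if this style was seen before, else create the next one.
        name = classes.setdefault(style, f"c{len(classes) + 1}")
        return f'class="{name}"'

    body = re.sub(r'\bstyle="([^"]*)"', swap, doc)
    rules = [f".{name} {{ {style} }}" for style, name in classes.items()]
    return body, "\n".join(rules)


if __name__ == "__main__":
    sample = ('<p style="font-weight: bold">A</p>'
              '<p style="color: red">B</p>'
              '<p style="font-weight: bold">C</p>')
    body, stylesheet = classify(sample)
    print(body)
    print(stylesheet)
```

Identical style strings share one class, so a document with many repeated inline styles ends up with a short stylesheet.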
You can download these Perl scripts as a zip file:
@trn2thm.zip.
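As an illustration of the ncrease step described in the list above, the following Python sketch fills in missing n attributes on pb elements by counting up from the last numbered page. This is only one reading of "interpolate": the real Perl script may behave differently. The function name is hypothetical, only Arabic page numbers are handled, and href interpolation is left out.

```python
import re

PB = re.compile(r'<pb\b([^>]*)>')      # a pb tag and its attribute text
N_ATTR = re.compile(r'\bn="(\d+)"')    # an existing numeric n attribute

def fill_page_numbers(doc: str) -> str:
    """Give each pb element without an n attribute the previous page number + 1.

    Assumption: pages are numbered consecutively with Arabic numerals.
    """
    last = None  # most recent page number seen so far

    def fix(match):
        nonlocal last
        attrs = match.group(1)
        found = N_ATTR.search(attrs)
        if found:
            last = int(found.group(1))
            return match.group(0)      # already numbered: leave unchanged
        if last is None:
            return match.group(0)      # nothing to count from yet
        last += 1
        return f'<pb n="{last}"{attrs}>'

    return PB.sub(fix, doc)


if __name__ == "__main__":
    print(fill_page_numbers('<pb n="5">text<pb>more<pb id="x">end'))
```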
Some jobs that remain:
- Map some known fonts to Unicode
- do lists right, with
- instead of
- make parsescr handle scripture references with roman numerals
- make index work with manually added IDs
- speed up identify program with better regexes
- write an XSL stylesheet for some future browser that supports it
- in thm2htm, link scriprefs to Bible Gateway
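The Roman-numeral item in the list above could be approached roughly as follows. This Python sketch converts a lowercase or uppercase Roman chapter number and parses references of the simple form "Book chapter. verse" or "Book chapter:verse". The accepted reference shapes, the function names and the (book, chapter, verse) return value are all assumptions; the format that parsescr actually writes into the parsed attribute is not described here.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100}

def roman_to_int(text: str) -> int:
    """Convert a Roman numeral such as 'viii' or 'XIX' to an integer."""
    values = [ROMAN[ch] for ch in text.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller numeral before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

# Book name, then a Roman or Arabic chapter, then '.' or ':', then the verse.
REF = re.compile(r'^\s*(\S.*?)\.?\s+([ivxlc]+|\d+)[.:]\s*(\d+)\s*$',
                 re.IGNORECASE)

def parse_reference(ref: str):
    """Return (book, chapter, verse), or None if the text does not match."""
    m = REF.match(ref)
    if m is None:
        return None
    book, chapter, verse = m.groups()
    chap = int(chapter) if chapter.isdigit() else roman_to_int(chapter)
    return book, chap, int(verse)


if __name__ == "__main__":
    print(parse_reference("Rom. viii. 28"))
    print(parse_reference("John 3:16"))
```

Verse ranges, multiple references in one string, and chapter-only references are outside the scope of this sketch.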
This document (last modified November 25, 1998) from the
Christian Classics Ethereal Library server, at @Wheaton College