Week 4 - Markup Languages
The Virtue of Text
As mentioned in the first week's notes, text was chosen as the medium of
choice on the Internet because it is a "least common denominator" which any
computer can understand. (Binary formats suffer from endianness and wordsize
incompatibilities between architectures.)
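
As a rough illustration of the endianness problem (a minimal sketch; the
value 1998 is an arbitrary example, not something from these notes), the same
integer can be packed as raw bytes in two byte orders. A reader that assumes
the wrong order gets a different number, while the text form survives intact:

```python
import struct

value = 1998  # arbitrary example value

# The same 32-bit unsigned integer as raw bytes in two byte orders.
big_endian = struct.pack(">I", value)
little_endian = struct.pack("<I", value)

# A reader that assumes the wrong byte order silently gets another number.
misread = struct.unpack("<I", big_endian)[0]

# The text form is just a run of digit characters; there is no byte order
# or word size for the reader to get wrong.
as_text = str(value).encode("ascii")
round_trip = int(as_text.decode("ascii"))

print(big_endian, little_endian)
print(misread == value, round_trip == value)
```

The two packed forms hold the same four bytes in reverse order, which is
exactly the ambiguity a text format sidesteps.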
The Virtue of Markup Languages
In order to apply typographic features to otherwise boring text, markup
languages have been invented which describe the text they will modify. The
following is by no means a complete list.
- troff / nroff - This form of markup was originally written at AT&T Bell
Labs by Joe Ossanna, a colleague of the guys who wrote the original UNIX
(Ritchie & Thompson); Kernighan later reworked troff. This format is still
used for man pages.
- PostScript - This format was created by Adobe. It was originally a
proprietary format, but its proliferation led to greater and greater
openness. Many printers support PostScript to format text when printing.
- TeX & LaTeX - A Mathematics and Computer Science professor by the name of
Donald E. Knuth set about the task of writing a textbook for his students.
Along the way, he learned that he would need some kind of markup language
to format his textbook. Seven years later, the TeX typesetting language
emerged. Many believe that it was well worth the wait, as it is a markup
language which gives authors a high level of control in the formatting of
mathematical equations, and can handle very large documents. LaTeX is a
macro package, written by Leslie Lamport, which is layered on top of TeX.
- SGML - The Standard Generalized Markup Language is an ISO standard (ISO
8879); one effort to promote it, the SGML Project, is funded by the
Information Systems Committee of the UFC. SGML is an attempt to create a
Mother-Of-All markup language which allows you to author a document
rigorously and precisely using document type definitions (DTDs). The idea
here is that you can create a document which is completely independent of
the tools used to author it. In addition to authoring documents, you can
also define new markup languages with SGML. This format was submitted to
all major word processor makers as the format of choice for document
saving, and practically none of them went for it.
- HTML - As the World Wide Web was being developed, it was apparent that a
markup language was needed. SGML, however, was just too fat and didn't have
built-in support for things like hyperlinks. A stripped-down application of
SGML was made, and lo, HTML was born.
Shortcomings of HTML
HTML worked well as the ML for the World Wide Web: It's small, fast, supports
embedded documents like sounds and images, and supports hyperlinks to other
sites. It is not without its disadvantages, however.
- It is not precise - This is by design, actually, and from a certain
standpoint cannot be considered a shortcoming. For people who want
control over their document's layout, this is a real bugaboo.
- It does not extend easily - This shortcoming was manifested most
prominently during the "browser wars" where the two prominent browser
vendors did their level best to break or harass each other's browsers
by adding non-standard extensions, all in the name of "product
distinction".
- It does not allow for re-use of formatting (or styles) - Desktop
publishing software has supported style setting and reuse for years.
Long-time users of said software were aghast to find this feature
missing from the HTML spec.
- It does not have built-in support for dynamic content - Lots of people
that do programming and want to make spiffy, neat-o Websites complained
about this.
The Answer: XML
The W3C took these criticisms to heart and came up with a new standard for
document markup which would address these problems. XML provides authors with
the following features:
- The ability to define and re-use styles - This is a feature that is
present in SGML but was excluded from HTML in the interest of efficiency
(remember, bandwidth was (and still is) scarce).
- A standard way to make non-standard extensions - In XML proper, this is
done by letting authors define their own tags (optionally declared in a
DTD) rather than waiting for a browser vendor to invent them.
- Support for scripting - Most notably, event attributes such as
ONMOUSEOVER and ONMOUSEOUT (as defined in HTML 4.0), which allow you to
handle mouse events.
- A graceful transition from HTML to XML - XML supports default styles for
the existing HTML tags that you have already learned. This is
particularly bright of them as an abrupt transition would have been
rather painful.
You can look at styles and how they are used in the styles section
of the HTML 4.0 documentation.
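
As a small illustration of what extensible markup means in practice, the
sketch below parses a document whose tags were made up for this example
(course, week, and topic are hypothetical names, not part of any standard).
It assumes Python's standard xml.etree.ElementTree module as the parser:

```python
import xml.etree.ElementTree as ET

# A hypothetical document using tags that HTML never defined.
document = """<course>
  <week number="4">
    <topic>Markup Languages</topic>
    <topic>Unicode</topic>
  </week>
</course>"""

# Any well-formed XML document can be parsed, whatever its tag names are.
root = ET.fromstring(document)

# Collect the text of every <topic> element, wherever it appears.
topics = [topic.text for topic in root.iter("topic")]

# Attributes are read by name from the element that carries them.
week_number = root.find("week").get("number")

print(week_number, topics)
```

The parser needs no prior knowledge of the tag names; it only requires that
every opening tag is properly closed and nested.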
Beyond ASCII
The Shortcomings of ASCII
One last little problem with Internet communications is ASCII. As many of you
already know, the 'A' in ASCII stands for 'American'. Well, as it turns out,
people other than Americans use the Internet too, and a lot of them use
characters outside the Roman/Latin set. What's worse, ASCII proper is a 7-bit
code with only 128 characters, and even an 8-bit character format does not
allow for more than 256 characters, and whoops! all of those spots are taken.
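
The limitation is easy to see by trying to encode a few words (a minimal
sketch; the sample words and the helper name fits are chosen only for this
illustration). Latin-1 stands in here for a typical 8-bit character set:

```python
def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Sample words written with escapes: English, French, Hebrew, Kanji.
samples = ["hello", "caf\u00e9", "\u05e9\u05dc\u05d5\u05dd", "\u6f22\u5b57"]

# For each word: (fits in 7-bit ASCII, fits in 8-bit Latin-1).
results = {text: (fits(text, "ascii"), fits(text, "latin-1")) for text in samples}

for text, (in_ascii, in_latin1) in results.items():
    print(repr(text), in_ascii, in_latin1)
```

Only the plain English word fits in ASCII; the accented word needs the 8-bit
set, and the Hebrew and Kanji words fit in neither.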
The Answer: Unicode
A new, universal (well, global anyway) character set based on a 16-bit size
has been created. This allows for 65,536 (64K) different characters, which is
enough to hold not only the existing Roman/Latin characters (still available
at their original 256 slots), but other character sets like Arabic and Hebrew,
and very large character sets like Kanji.
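
A quick arithmetic check of the 16-bit figure, plus a look at where a few
characters land (a sketch only; the four sample characters are arbitrary
picks for this illustration):

```python
# Number of distinct values a 16-bit character code can hold.
slots_16_bit = 2 ** 16

# One Latin letter, one accented Latin letter, one Hebrew letter, one Kanji.
sample_chars = ["A", "\u00e9", "\u05e9", "\u5b57"]

# ord() gives the Unicode code point of a single character.
code_points = {ch: ord(ch) for ch in sample_chars}

# Characters from the old 8-bit Latin set keep their original numbers.
latin1_byte = "\u00e9".encode("latin-1")[0]

print(slots_16_bit)
print(code_points)
print(latin1_byte == code_points["\u00e9"])
```

All four sample code points fall below 65,536, and the accented letter keeps
the same number it had in the 8-bit Latin set.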
Changelog
2/3/98 - Initial revision