markup languages. Introduction to XML A special markup language for text documents is called

Logical and visual markup

A distinction is drawn between logical and visual markup. Logical markup states only what role a part of the document plays in its overall structure (for example, "this line is a heading"). Visual markup defines exactly how the element will be displayed (for example, "this line should be displayed in bold"). The idea behind markup languages is that the visual presentation of a document should be derived automatically from its logical markup and should not depend on its immediate content. This simplifies automatic processing of the document and its display under different conditions (for example, the same file may be displayed differently on a computer screen, on a mobile phone, and in print, since the properties of these output devices differ significantly). However, this rule is often violated: for example, when creating a document in an editor such as MS Word, the user can set headings in bold without indicating anywhere that the line is a heading.
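As a minimal sketch of this idea, the Python fragment below keeps a document as a list of (role, text) pairs and derives two different presentations from the same logical markup. The role names and both rendering functions are invented for illustration and do not correspond to any particular markup language.

```python
# Logical markup: each block records only its role, not its appearance.
document = [
    ("heading", "Markup languages"),
    ("paragraph", "Logical markup records the role of each part."),
]


def render_screen(doc):
    """Plain-text presentation: headings in upper case with an underline."""
    lines = []
    for role, text in doc:
        if role == "heading":
            lines.append(text.upper())
            lines.append("=" * len(text))
        else:
            lines.append(text)
    return "\n".join(lines)


def render_html(doc):
    """HTML presentation of the same roles."""
    tags = {"heading": "h1", "paragraph": "p"}
    return "\n".join(f"<{tags[role]}>{text}</{tags[role]}>" for role, text in doc)


print(render_screen(document))
print(render_html(document))
```

Changing the presentation means changing only a rendering function; the document itself stays untouched.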

Examples of markup languages

Markup languages are used wherever formatted text output is required: in typography (SGML, TeX, PostScript, PDF), computer user interfaces (Microsoft Word, OpenOffice, troff), the World Wide Web (HTML, XHTML, XML, WML, VML, PGML, SVG, XBRL).

Lightweight markup languages

Languages designed for writing text quickly and easily in a simple text editor are called lightweight markup languages. Features of such languages:

Minimum features.
Small set of supported tags.
Easy to learn.
The source text in such a language is read with the same ease as the finished document.

They are used where a person has to prepare text in a regular text editor (blogs, forums, wikis), or where it is important that a user with a regular text editor can also read the text. Here are some widely used lightweight markup languages:

Wiki markup (see Wikipedia:How to edit articles)
Various automatic documentation systems (e.g. Javadoc).

History

The term "markup" comes from the English phrase "marking up", taken from the traditional publishing practice of putting special conventional marks in the margins and text of a manuscript or proof before sending it to print. In this way the "markup men" indicated the typeface, style, and font size for each part of the text. Nowadays, text markup is handled by editors, proofreaders, graphic designers and, of course, the authors themselves.

GenCode

The idea of using markup languages in computer word processing was most likely first presented by William W. Tunnicliffe at a conference in 1967. He called his proposal "generic coding". During the 1970s, Tunnicliffe led the development of the GenCode standard for the publishing industry and later became chairman of the International Organization for Standardization (ISO) committee that created SGML, the first descriptive markup language. Brian Reid, in the dissertation he defended in 1980 at Carnegie Mellon University, developed the concept further and produced a practical implementation of descriptive markup.

However, IBM researcher Charles Goldfarb is now commonly called the "father" of markup languages. The basic concept came to him in 1969 while he was working on a primitive document management system designed for law firms. In the same year, he took part in the creation of IBM's GML language, which was first presented in 1973.

Some early implementations of computer markup languages can be found in UNIX typesetting utilities such as troff and nroff. They allow formatting commands to be inserted into the text of a document so that it is formatted according to the editor's requirements.

The availability of publishing software with WYSIWYG ("what you see is what you get") functionality has displaced most of these languages among general users, although serious publishing work still uses markup to specify the non-visual structure of texts, and WYSIWYG editors now most commonly save documents in formats based on markup languages.

TeX

Another important publishing standard is TeX, created and subsequently improved by Donald Knuth in the 1970s and 1980s. TeX brought together high-quality text formatting and font description capabilities, especially for professional-quality mathematical books. TeX is currently the de facto standard in many scientific disciplines. Alongside TeX there is LaTeX, a widely used descriptive markup system built on TeX.

Scribe, GML and SGML

In the early 1980s, the idea that markup should focus on the structural aspects of a document and leave its visual presentation to the interpreter led to the creation of SGML. The language was developed by a committee chaired by Goldfarb. It combined ideas from many sources, including Tunnicliffe's project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also key members of the SGML committee.

SGML precisely defined the syntax for including markup in text and, separately, described which tags are allowed and where (the DTD, Document Type Definition). This allowed authors to create and use any markup they wanted, choosing which tags to use and giving them names in natural language. SGML should therefore be regarded as a metalanguage; many special-purpose markup languages derive from it. The late 1980s were the most significant period for the emergence of new markup languages based on SGML, such as TEI and DocBook.

In 1986, SGML was published by ISO as an international standard, ISO 8879. SGML found wide acceptance and was used in very large projects. However, it was generally found to be cumbersome and difficult to learn, a side effect of the language trying to do too much and be too flexible. For example, SGML made end tags (or start tags, or even both) optional in certain contexts, on the assumption that markup would be added by hand by project support staff who would appreciate the saved keystrokes.

HTML

By 1991, the use of SGML was limited to business programs and databases, while WYSIWYG tools (which saved documents in proprietary binary formats) were used for other document processing. The situation changed when Sir Tim Berners-Lee, who had learned about SGML from his colleague Anders Berglund and others at CERN, used SGML syntax to create HTML. The language resembled other markup languages based on SGML syntax, but it was much easier to get started with, even for developers who had never worked with markup before. Steven DeRose has argued that HTML's use of descriptive markup (and of SGML in particular) was a major factor in the development of the Web, because of the flexibility and extensibility it provided (other factors include the notion of URLs and the free distribution of browsers). HTML is quite likely the most widely used markup language in the world today.

However, HTML's status as a markup language has been disputed by some computer scientists. Their main argument is that HTML restricts the placement of tags by requiring every tag to be fully nested within other tags or within the root tag of the document. As a result, these scholars consider HTML to be a container language that follows a hierarchical model.

XML

XML (Extensible Markup Language) is a meta markup language that is widely used today. XML was developed by the World Wide Web Consortium, in a committee chaired by Jon Bosak. The main purpose of XML is to be simpler than SGML and to focus on one specific problem: documents on the web. Like SGML, XML is a metalanguage: users may create any tags they need (hence "extensible"). The rise of XML was helped by the fact that every XML document can be written so that it is also an SGML document, so programs and users working with SGML could migrate to XML fairly easily.

However, XML dropped many of the human-oriented features of SGML in order to simplify implementation, at the cost of larger markup and reduced readability and editability. Other improvements fixed some problems SGML had in international settings and made it possible to parse a document's hierarchy even when no DTD is available.

XML was designed primarily for semi-structured environments such as documents and publications. However, it hit a sweet spot between flexibility and simplicity and was quickly adopted by many users. Nowadays, XML is widely used for passing data between programs. Like HTML, it can be described as a "container" language.
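The following sketch shows what "passing data between programs" can look like in practice, using Python's standard xml.etree.ElementTree module. The playlist and track tag names and the length attribute are made up for the example; XML itself prescribes none of them.

```python
import xml.etree.ElementTree as ET

# Producer side: build a document using invented tag names.
playlist = ET.Element("playlist", name="morning")
for title, seconds in [("First song", 215), ("Second song", 187)]:
    track = ET.SubElement(playlist, "track", length=str(seconds))
    track.text = title
payload = ET.tostring(playlist, encoding="unicode")  # text sent to the other program

# Consumer side: parse the text back into a tree and read the data out.
received = ET.fromstring(payload)
titles = [track.text for track in received.findall("track")]
total_seconds = sum(int(track.get("length")) for track in received.findall("track"))

print(payload)
print(titles, total_seconds)
```

The two sides share nothing but the text of the document and an agreement on what the tags mean.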

XHTML

Since January 2000, all W3C Recommendations for HTML have been based on XML rather than SGML, under the name XHTML (Extensible HyperText Markup Language). The language specification requires XHTML documents to be well-formed XML documents, which makes for stricter and more robust documents while still using the tags familiar from HTML.

One of the most noteworthy differences between HTML and XHTML is the rule that all tags must be closed: empty tags such as <br> must either be closed with a regular end tag or written in a special form, <br /> (the space before the "/" is optional, but is often used because it lets some pre-XML browsers and SGML parsers accept the tag). Another is that all attribute values in tags must be quoted. Finally, all tag and attribute names must be written in lowercase to be valid, whereas HTML is case-insensitive.
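The closing and quoting rules can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample fragments are illustrative only. It tests XML well-formedness, not full XHTML validity, so it catches mismatched case between a start and end tag but would not reject consistently uppercase names.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


html_style = "<p>first line<br>second line</p>"      # <br> is never closed
xhtml_style = "<p>first line<br />second line</p>"   # empty-element form
unquoted = "<p class=note>text</p>"                  # attribute value not quoted
mixed_case = "<P>text</p>"                           # start and end tags differ in case

for sample in (html_style, xhtml_style, unquoted, mixed_case):
    print(is_well_formed(sample), sample)
```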

Other XML based developments

Many XML-based developments are now in use, such as RDF (Resource Description Framework), XForms, DocBook, SOAP, and OWL (Web Ontology Language).

Features

A common feature of markup languages is that they mix the text of a document with markup instructions in the same data stream or file. This is not necessary: markup can be kept apart from the text by means of pointers, labels, identifiers, or other coordination methods. Such "separated markup" is typical of the internal representations used by programs that work with marked-up documents. However, embedded or "inline" markup is more common elsewhere. For example, here is a small piece of text marked up with HTML:

<h1>Anatidae</h1>

<p>The family <em>Anatidae</em> includes ducks, geese, and swans, but <em>not</em> the closely related screamers.</p>

Markup instructions (known as tags) are enclosed in angle brackets <like this>. The text between these instructions is the text of the document. The codes h1, p, and em are examples of structural markup: they describe the position, purpose, or meaning of the text they enclose.

More precisely, h1 means "this is a first-level heading", p means "this is a paragraph", and em means "this is an emphasized word or phrase". The interpreter can apply its own rules or styles to display the different parts of the text, using different typefaces, font sizes, indentation, color, or other styles as needed. A tag such as h1 may, for example, be rendered in a large, bold typeface, or be underlined in a document with monospaced (typewriter-like) text, or not change the appearance at all.

By contrast, the i tag in HTML is an example of visual markup: it is usually used to specify a particular property of the text ("use an italic typeface in this block") without stating the reason.

The TEI (Text Encoding Initiative) has published comprehensive guidelines specifying how to encode texts for the benefit of the humanities and social sciences. These guidelines have been used to encode historical documents, the works of particular scholars, periodicals, and so on.

Alternative uses

As the idea of using markup languages for text documents developed, their use spread to other areas: they have been proposed for representing many kinds of information, including playlists, vector graphics, web services, and user interfaces. Most of these applications are based on XML because it is a well-structured and extensible language.

Technical Translator's Handbook

markup language (06/23/33): a language consisting of embedded commands that supports the marking up of text during its processing.

In word processing systems, additional information, called markup, is included in the document and performs the following functions:

Identifying the logical elements of the document;

Specifying the processing functions for the identified elements.

Conventional word processors have embedded commands for switching fonts on and off and the like, similar to the commands that control the placement of information on the screen or in print (so-called escape sequences). This approach is called command or procedural markup (Table 2.1).

An alternative way to mark up text is to delimit a portion of it without specifying how that portion is to be processed. Separate commands then assign processing to the fragments. This kind of markup is called descriptive. It uses labels (tags) that mark the start and end of a text element and indicate how the fragment is to be interpreted.

By changing the set of procedures corresponding to the descriptive markup, it is possible to change the external representation of the same document. The development of the ideas of descriptive markup led to the definition of markup as a formal language. This allows you to check the correctness of the markup and minimize its volume by substituting defaults.

The main advantage of descriptive markup is its flexibility: because pieces of text are marked as "what they are" (and not "how they should be displayed"), software may later be written to handle these fragments in ways the language designers never envisaged. For example, HTML hyperlinks, originally intended to let users navigate between linked documents on the web, were later used by web search and indexing engines, for estimating the popularity of resources, and so on.
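As an illustration of such reuse, the sketch below collects the targets of all hyperlinks in a page, as an indexing program might, using Python's standard html.parser module. The class name LinkCollector, the sample page, and the example.org URLs are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag in the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


page = (
    '<p>See <a href="https://example.org/a">page A</a> and '
    '<a href="https://example.org/b">page B</a>.</p>'
)
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Nothing in the page was written with indexing in mind; the program relies only on the fact that links are marked as links.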

Descriptive markup also makes a document easier to reformat. Italics, for example, may be used for emphasis, for foreign words, or for other purposes; if the words are simply marked (descriptively or procedurally) as italic, this ambiguity cannot be fully resolved. If the cases are labeled differently from the outset, each can be reformatted independently of the others. Generic markup is another name for descriptive markup.

In practice, elements of different markup classes usually coexist in any particular system. For example, HTML contains both procedural markup elements (b for bold) and descriptive ones (blockquote, or the href attribute). HTML also includes the pre element, which marks a region of text that is to be laid out exactly as typed.

Most modern descriptive markup systems treat documents as hierarchical structures (trees) and also provide some means for embedding cross-references. Such documents can therefore be treated and processed as databases whose structure is fairly well defined (however, since they do not have schemas as strict as those of relational databases, they are usually called "semi-structured databases").

At the start of the third millennium, interest arose in documents with non-hierarchical structures. For example, ancient and religious literature usually has a rhetorical or prose structure (story, section, paragraph, etc.) and also a reference structure (books, chapters, verses, lines). Because the boundaries of these units often overlap, they cannot be fully encoded using a tree-structured markup system alone. Document modeling systems that support such structures include MECS, the TEI Guidelines, LMNL, and CLIX.
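The overlap problem can be made concrete with a small sketch. Here each structural unit is recorded as a (label, start, end) character range kept apart from the text, and a set of ranges fits into a single tree only if every pair is either disjoint or nested. The function names and the sample offsets are invented for illustration and are not taken from MECS, TEI, LMNL, or CLIX.

```python
def overlaps_improperly(a, b):
    """True if two (label, start, end) ranges overlap without one containing the other."""
    (_, a_start, a_end), (_, b_start, b_end) = a, b
    intersect = a_start < b_end and b_start < a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return intersect and not (a_in_b or b_in_a)


def fits_in_tree(spans):
    """True if all ranges can be nested into one hierarchy."""
    return not any(
        overlaps_improperly(a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
    )


# Reference structure only: two verses inside a chapter.
reference = [("chapter", 0, 100), ("verse", 0, 40), ("verse", 40, 100)]
# Add a sentence that starts in the first verse and ends in the second.
with_sentence = reference + [("sentence", 30, 60)]

print(fits_in_tree(reference))
print(fits_in_tree(with_sentence))
```

The second set cannot be written as properly nested tags, which is why such documents need either markup kept separate from the text or a model that allows overlap.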

The term "markup" comes from the traditional practice of marking up manuscripts before publication, that is, adding symbolic instructions in the margins and between the lines of a paper manuscript. For centuries this was done by publishing staff (editors and proofreaders), who noted the typeface, style, and size in which each fragment of text should be set. The manuscript was then handed over to typesetters, who set the text by hand, following the markup.

Currently, there are many markup languages (Table 2.2); among the most widely known are DocBook, MathML, SVG, Open eBook, XBRL, etc. They are mainly designed to represent various kinds of text documents, but specialized languages can be used in many other areas. By far the best-known markup language is HTML (Hypertext Markup Language), one of the foundations of the WWW (World Wide Web).

Consider some of the markup systems.

RUNOFF was the first text formatting system to gain wide recognition. It was developed in 1964 for the CTSS operating system by Jerome H. Saltzer using the MAD assembler.

The product actually consisted of a couple of programs:

TYPSET, which was basically a document editor;

RUNOFF - output processor.

RUNOFF provided support for pagination and heading placement, as well as text alignment. RUNOFF is the direct predecessor of the Multics document formatter, which in turn was the ancestor of the Unix formatters (roff and nroff) and their descendants. It was also the ancestor of FORMAT for IBM's OS/360 and, indirectly, of all subsequent word processing programs and systems. The name is said to come from a phrase popular at the time: "I'll run off a copy."

TeX is an abbreviation of τεχνη (TEXNH - techne), the Greek term for "art, craft, skill", the source for the word "technical". In English, it is pronounced "tech" (as in the word technology).

TeX is a typesetting system created by Donald Knuth. Together with the METAFONT language for font description and the Computer Modern typeface, it was designed with two main goals: first, to let anyone produce high-quality books with a reasonable amount of effort, and second, to give identical results on any computer, now and in the future. TeX is free software and is popular in the academic community, especially among mathematicians, computer scientists, and economists, and in technical communities. It competes strongly with the other popular formatter, Unix troff, and the two are used side by side in many Unix installations.

TeX is recognized as the best way to create and print complex mathematical formulas, but it is now also used for many other typesetting tasks, especially in the form of LaTeX and other formatting software.

TeX commands usually start with a backslash, and their arguments are grouped with curly braces. However, almost all of TeX's syntactic properties can be changed while it runs, which makes TeX input difficult for other programs to process. TeX is a macro- and token-based language: many commands, including most user-defined ones, are expanded at run time until only unexpandable tokens remain, which are then executed.

The basic version of TeX includes about 300 commands, called primitives. However, users rarely use these low-level commands directly; most functionality is provided by format files (memory images of TeX saved after large sets of macros have been loaded). Knuth's original (default) format, which adds about 600 commands, is called Plain TeX. A more widely used format is LaTeX, originally developed by Leslie Lamport, which includes document styles for books, letters, slides, and so on, and adds support for cross-references and automatic numbering of formulas and sections.

Another widely used format is AMS-TeX, developed by the American Mathematical Society, which provides many more user-friendly commands that publishers can modify to suit their house style. Most AMS-TeX features can be used in LaTeX through the AMS "packages" (referred to as AMS-LaTeX).

To write a program to print the string "Programming" in Plain TeX, you need to create a file myfile.tex with the following content:

Programming
\bye % end of the file; not shown in the final output.

By default, everything following a percent sign on a line is a comment and is ignored by TeX. If TeX is run on this file (for example, by typing tex myfile.tex on the command line), it creates an output file named myfile.dvi, which represents the contents of the page in the device-independent (DVI) format. The result can either be printed directly from a DVI viewer or converted to a more common format such as PostScript using the dvips program. Variants of TeX such as pdfTeX produce PDF files directly.

Consider formatting a mathematical formula. For example, to write a well-known expression for the root of a quadratic equation, you can enter:

The quadratic formula is $-b \pm \sqrt{b^2 - 4ac} \over 2a$ \bye

This will output the sentence with the formula typeset as a fraction: -b ± √(b² - 4ac) in the numerator above 2a in the denominator.

Several document processing systems are based on TeX, notably jadeTeX, which uses TeX internally for printing from the output of James Clark's DSSSL Engine, and Texinfo, the GNU system's documentation processor. TeX has been the official typesetting package for the GNU operating system since 1984.

Numerous extensions and companion programs for TeX exist, among them BibTeX for bibliographies (distributed with LaTeX), pdfTeX, which bypasses the DVI format and outputs directly to Adobe Systems' Portable Document Format (PDF), and Omega, which allows TeX to use the Unicode character set. Most TeX extensions can be obtained for free from the Comprehensive TeX Archive Network (CTAN). There is also an editor for scientific literature based on TeX that supports WYSIWYG editing and is designed to be compatible with TeX and Emacs.

In many technical fields, such as applied computer science, mathematics, and physics, TeX has become the de facto standard. Many thousands of books have been published using TeX by publishers such as Addison-Wesley, Cambridge University Press, Elsevier, Oxford University Press or Springer. Numerous journals in these fields are produced using TeX or LaTeX, with authors allowed to submit manuscripts in TeX format.

Since version 3, TeX has used an idiosyncratic version numbering system in which each update adds an extra digit to the decimal expansion, so that the version number asymptotically approaches π. This reflects the fact that TeX is very stable and only minor updates are expected. The current version of TeX is 3.141592; it was last updated in December 2002.
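The numbering scheme can be sketched in a few lines of Python. The function name next_version is invented for the example, and the constant simply holds enough digits of π for the illustration; it says nothing about which versions have actually been released.

```python
# Digits of pi, enough for this illustration.
PI_DIGITS = "3.14159265358979"


def next_version(current: str) -> str:
    """Return the version after `current` under the 'one more digit of pi' scheme."""
    if not PI_DIGITS.startswith(current) or len(current) >= len(PI_DIGITS):
        raise ValueError(f"not a version this sketch can extend: {current!r}")
    return PI_DIGITS[: len(current) + 1]


print(next_version("3.14"))  # 3.141
```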

All documents accessible via the Web are written in a language specifically designed for this purpose called Hyper Text Markup Language (HTML). HTML is a simple markup language that allows you to mark up fragments of text and set links to other documents, highlight headings of several levels, break text into paragraphs, center them, etc., turning plain text into a formatted hypermedia document.

The core of HTML is its tags, the instructions of the language, of which there are about a hundred. They sit inside the hypertext document and determine its entire structure and styling down to the fine details. When such a document is viewed in a browser, the tags are invisible. Likewise, when a web page is created with specialized software tools, which are present in almost all office applications (Word, Excel, Access, PowerPoint, Outlook, etc.), the user does not see the tags: they are inserted automatically.

Tags are written in angle brackets, for example <center> or </center>. Here, the first tag is the opening tag, and the second, with a slash, is the closing tag. The effect of this pair of tags is that the text between them is centered in the window in which the document is viewed. Tags range from simple ones (for structure, text styling and alignment, color, size, font style, etc.) to special ones (for including graphic and multimedia objects in a document). Complex tags have, in addition to a name, attributes that specify how they are to be applied.

HTML tags do not define absolute document formatting like word processor codes, but only relative formatting. For example, a tag that causes a line of text to be centered will work equally well on a wide screen and a narrow one, and if the text does not fit the width of the screen, it will automatically wrap to the second line, third, and so on.
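A rough way to see what relative formatting means is to apply the same "center this text" instruction at two different window widths. The sketch below does this for plain text with Python's standard textwrap module; the function name render_centered and the sample text are invented, and a real browser's layout is of course far more involved.

```python
import textwrap


def render_centered(text: str, width: int) -> str:
    """Wrap text to the given width and center each resulting line."""
    return "\n".join(line.center(width) for line in textwrap.wrap(text, width))


text = "This heading stays centered in any window"

wide = render_centered(text, 60)
narrow = render_centered(text, 16)

print(wide)
print(narrow)
```

The instruction is the same in both cases; only the available width changes, and the text wraps onto as many lines as it needs.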

You can view Web pages in any text editor, but this is extremely inconvenient, since the page is not formatted, but its tags are visible.

Programs for viewing documents in HTML format are called browsers. Viewing Web documents is one of the main, although not the only, functions of a browser.

Several years have passed since the development of the first version of the language (HTML 1.0). During this time, there was a rather serious development of the language. The number of markup elements has almost doubled, the design of documents is increasingly approaching the design of high-quality printed publications, the means of describing non-text information resources and ways of interacting with application software are developing. The mechanism for developing typical styles is being improved. In fact, HTML is currently evolving towards creating a standard interface development language for both local and distributed systems.

In early February 1998, the W3C approved the specification "Extensible Markup Language (XML) 1.0", which laid the foundation for many new XML-based markup languages for transmitting information over the Internet. In effect, this was a new step in the development of hypertext markup languages. In the four years of its existence, XML has not only attracted considerable attention from ordinary users and web designers but has become an integral part of the Internet. Already there are practically no servers that do not use this technology to some extent as a counterpart to HTML. However, it is still premature, to say the least, to claim that XML is becoming the main way of transmitting hypertext over the global network. The language is still quite young, and some of its components are still under development. So far only a general framework has been created that may replace HTML in the future, and it is still impossible to say in what specific form.

From the beginning

In November 1990, when Internet users first heard about a new technology whose name fit into just three letters, almost no one could have imagined that before long it would become practically the only way of delivering information on the global network. Today many inexperienced users firmly identify the Internet with the WWW, although the two, while related, are not the same thing.

By and large, it is the incredible popularity of the World Wide Web and of HTML, its integral part, that has drawn so much attention to hypertext document markup.

The concept of hypertext was first introduced by V. Bush in 1945. However, real applications using such data structures appeared only in the 1960s, and a truly extraordinary surge of activity around the technology began only when a real need arose for a mechanism that could combine many information resources and make it possible to create and view non-linear text. The WWW is an example of such a mechanism.

A document markup language is a set of special instructions called tags (in some translated publications, labels), designed to give documents a structure and to define the relationships between the elements of that structure. The tags, or control descriptors as they are sometimes called, are encoded in such documents in a specific way that sets them apart from the main content, and they serve as instructions for the program that interprets the document and displays it to the person viewing it (in Internet terms, that person is the client, and the interpreting program is most commonly a browser). Already in the very first systems, the characters "<" and ">" were chosen to enclose the names of instructions and their parameters. Today this way of writing tags is a generally recognized standard.

The use of hypertext to break up a text document in modern information systems is largely due to the fact that hypertext provides a mechanism for so-called non-linear viewing of information. In such systems, data is presented not as a continuous stream of text but as a set of interrelated components, between which the reader moves by means of hyperlinks.

HTML, the most popular and best-known hypertext markup language today, was created specifically for structuring and transmitting information on the Internet and is undoubtedly a key component of WWW technology. With the hypertext document model, the presentation of information resources on the web became more orderly, and users gained a convenient mechanism for searching for and viewing the information they need. However, the pioneer in this area is considered to be a much older language, SGML.

SGML (Standard Generalized Markup Language) was officially adopted in 1986 as an international standard (ISO 8879:1986) for describing input/output and computer-independent methods for representing textual information in electronic form. The basis for its creation was the rather old markup language GML (Generalized Markup Language), developed by IBM back in the days of the first personal computers. To be precise, SGML is a metalanguage designed to describe other markup languages.

Initially, the word markup was generally used to describe annotations or other indications within text that were intended to indicate to the document writer or, as it is sometimes called, the "typesetter" exactly how a particular place should be printed. Such methods may include underlining with a squiggle to indicate italics, some special icons to skip certain phrases or print them in a specific font, and so on. When formatting and printing became automated over time, the term already covered all kinds of special markup codes that were inserted into electronic text documents to control formatting, printing, or other processing.

A markup language is thus a set of conventions about formatting principles that are used to encode text blocks. The markup language should clearly indicate what markup is allowed in a given document, what markup is required, how to distinguish its elements from plain text, and what the markup means. SGML was able to solve the first three tasks, the solution of the last one assumed the existence of an informal description.

SGML, like the markup languages based on it, uses the principle of so-called descriptive markup rather than procedural markup. Such a system uses markup elements that simply provide names to categorize the individual parts of a document. In other words, tags like <para> or \end{list} simply identify a portion of the document and assert that "this part is a paragraph" or that "this part is the end of the list that was started", and so on. A system that uses procedural markup (a word processor such as Microsoft Word, for example) defines what processing is to be carried out at a particular point in a text document: "at this point, call such-and-such a procedure with parameters 5, e, and z" or "move the document margin 7 mm to the right, skip one line, start the next one with an indent", and so on. In SGML, the instructions needed to process a document for a specific purpose (such as formatting) are clearly separated from the descriptive markup that occurs within the document. They are usually collected outside the document in separate procedures or programs.

When descriptive rather than procedural markup is used, the same document can be processed by different programs, each of which can apply its own processing instructions to the parts it considers relevant. For example, a content analysis program might ignore footnotes entirely, while a formatter might extract them and collect them for printing at the end of each section. Different kinds of processing instructions can be associated with the same part of a file. For example, one program might extract the names of people and places from a document to create an index or database, while another, processing the same text, might print names and titles in a different font.

SGML also introduces the concept of a document type and, accordingly, a way to define it (the document type definition, DTD). Documents are considered to have types, just like other computer-processed objects. The type of a document is formally defined by its constituent parts and their structure. For example, you can define a report as a document that consists of a title and possibly the author's name, followed by an abstract and a sequence of one or more paragraphs. Any document without a title will, according to this formal definition, not be a report, nor will a sequence of paragraphs followed by an abstract, no matter how much such a document resembles a report to a human reader.

Because documents are of known types, a special program called a parser can be used to process a document claiming to be of a particular type and verify that all the elements required for that document type are indeed present, in the correct sequence, and correctly structured. More importantly, different documents of the same type can be processed in a uniform manner. It becomes possible to write programs that use the knowledge contained in the document's information structure and can therefore behave more intelligently.
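The check described above can be approximated in Python. The sketch below is not a DTD validator; it only illustrates the principle by expressing the "report" content model from the text (a title, an optional author, an abstract, then one or more paragraphs) as a regular expression over child element names. The element names report, title, author, abstract, and para are assumptions chosen for the example.

```python
import re
import xml.etree.ElementTree as ET

# Content model: title, optional author, abstract, one or more paragraphs.
# Each child name is followed by a space so that names cannot run together.
REPORT_MODEL = re.compile(r"title (author )?abstract (para )+")


def is_report(xml_text):
    """Return True if the document's top-level structure matches the report model."""
    root = ET.fromstring(xml_text)
    if root.tag != "report":
        return False
    children = "".join(child.tag + " " for child in root)
    return REPORT_MODEL.fullmatch(children) is not None


good = "<report><title>T</title><abstract>A</abstract><para>P</para></report>"
no_title = "<report><abstract>A</abstract><para>P</para></report>"
wrong_order = "<report><title>T</title><para>P</para><abstract>A</abstract></report>"

print(is_report(good), is_report(no_title), is_report(wrong_order))
```

A real SGML or XML validating parser reads the content model from the DTD instead of having it hard-coded, but the decision it makes is of the same kind.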

SGML, as a metalanguage, allows specific languages (often referred to as "SGML applications") to be defined to target specific applications. An example of this is the HTML language, which is widely used on the WWW. Each such language is described in the form of a DTD, defining elements and their attributes. Given such a DTD, SGML software can correctly process documents written in accordance with this DTD.

From its earliest design, HTML was conceived specifically to implement the model of information transfer over the global network that we have today. In other words, HTML is a product of the Internet. In fact, though, HTML is a simplified application of the standard markup metalanguage SGML (Standard Generalized Markup Language), which ISO approved as a standard back in the 1980s. SGML is not a language in itself but rather a set of rules and descriptions for creating other languages, which define the allowed set of tags, their attributes, and the internal structure of the document. Correct use of tags is enforced by a special set of rules, the DTD, which the client program uses when parsing a document. Each class of documents defines its own set of rules that describe the grammar of the corresponding markup language. Using SGML, you can organize the information contained in documents, describe structured data, and present this information in a standardized format for later use. However, because of its complexity, SGML was used mainly to describe the syntax of other languages (the best known of which is HTML), and few applications dealt with SGML documents directly.

HTML is a much more convenient and easier language to use than SGML. It does not allow you to define other languages. Using HTML means marking up a document according to a standard that defines a rather limited set of instructions, or tags. These instructions are intended, first of all, to control how the contents of the document are displayed on the screen of the client program, and thus they determine how the document is presented but not its overall structure. In most cases, HTML data is stored in a plain text file that can easily be transferred over a network using the HTTP protocol.

However, as requirements on popular technologies grow more stringent, modern applications need not only a language for presenting data on the client screen but also a mechanism for defining the structure of a document and describing the elements it contains. HTML has a simple set of commands and copes quite well with describing textual information and displaying it in a browser. However, the displayed data has no connection to the tags used to format it, so parsing programs cannot use HTML tags to find the document fragments we need. Consider, for example, a description such as:

<font color="red">rose</font>

The viewer will know what color to use for the text inside the tags and will most likely display it correctly, but it is entirely indifferent to where in the document the tag occurs, which other tags enclose the fragment, whether other fragments are nested in it, and whether the relationships between objects are constructed correctly. Because of this "indifference" to document structure, searching or analyzing information inside the document is no different from working with a continuous text file that is not divided into elements. And that, as you know, is not the most effective way to work with information.

Another significant drawback of the very idea behind HTML is its limited set of tags. The DTD rules for HTML define a fixed set of tags, so the developer has no way to introduce special tags of his own. New extensions to the language do appear from time to time (at the time of writing, the latest version is HTML 4.0), but the long road to standardizing them, accompanied by constant disagreements between the major browser manufacturers, makes it almost impossible to adapt the language quickly or to use it to display specialized information (for example, multimedia, or mathematical and chemical formulas).

To summarize, it can be argued that even today HTML does not fully satisfy the requirements that modern developers place on languages of this kind. A new hypertext markup language was proposed to replace it: the powerful, flexible and, at the same time, convenient XML.

XML (Extensible Markup Language) is a markup language that describes a whole class of data objects called XML documents. It is used as a means of describing the grammar of other languages and of checking that documents are composed correctly. That is, XML itself does not contain any tags for marking up documents; it simply defines the rules by which they are created. Thus, if we decide, for example, that the rose element in a document should be denoted by the tag <flower>, then XML allows us to use the tag we have defined freely, and we can include snippets like this in the document:

<flower>rose</flower>

The set of tags can be easily extended. If, suppose, we also want to indicate that the description of the flower should go inside the description of the greenhouse in which it blooms, then we simply set new tags and choose the order in which they appear:

<greenhouse>
<flower>rose</flower>
</greenhouse>

If we want to plant a few more flowers there, we must make the following changes:

<greenhouse>
<flower>rose</flower>
<flower>tulip</flower>
<flower>cactus</flower>
</greenhouse>

As you can see, creating an XML document is very simple and requires only basic knowledge of HTML and an understanding of the tasks we want to accomplish by using XML as a markup language. Developers thus have a unique opportunity to define their own tags, which lets them describe the data contained in the document as effectively as possible. The author of the document creates its structure, builds the necessary links between elements using tags that meet his requirements, and arrives at the kind of markup he needs for viewing, searching, and analyzing the document.
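The same kind of document can also be generated programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the tag names greenhouse and flower are assumptions taken from the wording of the example above, not names defined by XML itself.

```python
import xml.etree.ElementTree as ET

# Build a greenhouse element and plant three flowers in it.
greenhouse = ET.Element("greenhouse")
for name in ("rose", "tulip", "cactus"):
    flower = ET.SubElement(greenhouse, "flower")
    flower.text = name

# Serialize the tree to a string (no XML declaration is added in this mode).
xml_text = ET.tostring(greenhouse, encoding="unicode")
print(xml_text)

# Reading the text back yields the same flowers in the same order.
flowers = [f.text for f in ET.fromstring(xml_text)]
print(flowers)
```

Adding another flower is a matter of appending one more child element; nothing else in the program or the document has to change.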

Another obvious advantage of XML is that it can serve as a general-purpose language for querying information stores. A working draft of the XML-QL (or XQL) standard is currently under consideration within the W3C and may in the future become a serious competitor to SQL. In addition, XML documents can act as a unique way of storing data that includes both the means of parsing the information and the means of presenting it on the client side. One of the promising directions here is the integration of Java and XML, which makes it possible to use the power of both technologies to build machine-independent applications that also use a universal data format for information exchange.

XML also allows you to control the correctness of the data stored in documents, check hierarchical relationships within the document and establish a single standard for the structure of documents, the contents of which can be a variety of data. This means that it can be used in building complex information systems, in which the issue of information exchange between different applications running in the same system is very important. By creating the structure of the information exchange mechanism at the very beginning of work on the project, the manager can save himself in the future from many problems associated with the incompatibility of the data formats used by various components of the system.

A further advantage of XML is that programs that process XML documents are simple, and all kinds of software products for working with XML documents are freely distributed today. XML is supported in all browsers of the Microsoft Internet Explorer family starting with version 4.0. Support has also been announced for subsequent versions of Netscape Communicator, for the Oracle and DB2 database systems, and for MS Office applications. All this suggests that XML will most likely become the main information exchange language for information systems in the near future, thus replacing HTML. Such well-known specialized markup languages as SMIL, CDF, MathML, and XSL have already been created on the basis of XML, and the list of working drafts of new languages under consideration by the W3C is constantly growing.

What does an XML document look like?

If you're familiar with HTML, learning XML won't require much effort on your part. Although XML certainly differs greatly from the hypertext markup language in its capabilities and purpose, both languages derive from SGML and therefore inherit its basic principles.

Document structure

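The simplest XML document might look like Example 1:

<?xml version="1.0" encoding="UTF-8"?>
<list_of_items>
<item id="1"><first/>The first</item>
<item id="2">Second <sub_item>subparagraph 1</sub_item></item>
<item id="3">Third</item>
<item id="4"><last/>Last</item>
</list_of_items>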

Note that this document is very similar to a regular HTML page. Just as in HTML, the instructions enclosed in angle brackets are called tags and serve to mark up the body of the document. XML has start tags, end tags, and empty tags (HTML also has the concept of an empty tag, but there it does not need to be specially marked).

The body of an XML document consists of markup elements (markup) and the actual content of the document - data (content). XML tags are designed to define document elements, their attributes, and other language constructs. We will talk more about the types of markup used in documents a little later.

An XML document should begin with the <?xml ... ?> declaration, which can also specify the language version number, the character encoding, and other parameters that the parser needs while parsing the document.

Rules for creating an XML document

In general, XML documents must meet the following requirements:

An XML declaration is placed in the header of the document, which specifies the markup language of the document, its version number, and additional information.

Each opening tag that defines a certain data area in the document must have its own closing "partner", i.e., unlike HTML, closing tags cannot be omitted.

XML is case sensitive.

All attribute values used in tag definitions must be enclosed in quotation marks.

Nesting of tags in XML is strictly controlled, so the order of opening and closing tags must be monitored.

All information between the start and end tags is treated as data in XML, and therefore all formatting characters are taken into account (i.e., spaces, newlines, and tabs are not ignored, as they are in HTML).

If an XML document does not violate the above rules, it is called formally correct (well-formed), and all parsers designed for XML documents will be able to work with it correctly.
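Several of the rules above can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which reports a ParseError for input that is not formally correct; the sample snippets and the helper name is_well_formed are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as formally correct XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "ok": '<item id="1">First</item>',
    "missing end tag": '<item id="1">First',
    "case mismatch": "<Item>First</item>",
    "unquoted attribute": "<item id=1>First</item>",
    "bad nesting": "<a><b>text</a></b>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))

# Whitespace between the tags is part of the data and is preserved.
padded = ET.fromstring("<item>  First\n</item>")
print(repr(padded.text))
```

Only the first sample is accepted; each of the others breaks exactly one of the rules listed above.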

However, in addition to the check for formal compliance with the grammar of the language, a document may carry means of controlling its content, that is, of enforcing the rules that define the necessary relationships between elements and form the structure of the document. For example, the following text, while formally correct XML, would be completely meaningless:

Russia Novosibirsk</country>

To ensure that XML documents are correct in this sense, you need parsers that perform such a check; they are called validating parsers.

At present there are two main ways to control the correctness of an XML document: DTDs (Document Type Definitions) and data schemas (Semantic Schema). We'll talk more about using DTDs and schemas next time. Unlike in SGML, DTD rules are not mandatory in XML, which lets us create XML documents for now without puzzling over the rather complicated DTD syntax.

The basic principle

An element is the basic structural unit of an XML document. By enclosing the word rose in the tags <flower> and </flower>, we define a non-empty element called <flower> whose content is rose. In general, the content of an element can be plain text, other nested elements, CDATA sections, processing instructions, or comments, i.e., virtually any part of an XML document.

Any non-empty element must consist of a start tag, an end tag, and the data enclosed between them.

The set of all elements contained in the document defines its structure and defines all hierarchical relationships. A flat data model is transformed using elements into a complex hierarchical system with many possible relationships between elements.

When a client program later searches a document, it relies on the information embedded in the document's structure, that is, on its elements. If, for example, you want to find the right university in the right city, you need to look at the contents of a particular <university> element located inside a specific <city> element. Such a search will, of course, be much more efficient than looking for the desired character sequence throughout the whole document.
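A structure-aware search of this kind might look as follows in Python. This is a sketch under stated assumptions: the country, city, and university tag names, the name attribute, and the placeholder university names are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; tag names, the name attribute and the values are illustrative only.
DOC = """<country name="Russia">
  <city name="Novosibirsk">
    <university>University A</university>
    <university>University B</university>
  </city>
  <city name="Tomsk">
    <university>University C</university>
  </city>
</country>"""

root = ET.fromstring(DOC)


def universities_in(root, city_name):
    """Return the universities listed inside the city element with the given name."""
    path = f"./city[@name='{city_name}']/university"
    return [u.text for u in root.findall(path)]


print(universities_in(root, "Novosibirsk"))
print(universities_in(root, "Tomsk"))
```

The query never scans text outside the chosen city element, which is the efficiency gain the paragraph above describes.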

Every XML document defines one element called the root, and parsers begin reading the document from it. In Example 1, this is the outermost element, which encloses all the others.

In some cases, tags can change and refine the semantics of certain fragments of a document, defining the same information in different ways and thereby telling the application that parses the document about the context in which the data is used. For example, on reading the snippet <city>Hollywood</city> we can guess that this part of the document is about a city, whereas the fragment <restaurant>Hollywood</restaurant> is about an eatery.

Conclusion

HTML, the formatting language for Web pages, was originally introduced as an application of SGML. Later, with the rapid development of the WWW, HTML was extended in every possible way to give authors more control over the visual presentation of information. New elements and attributes focused on visual formatting were introduced. Tools that are not part of the markup language proper appeared and came into active use: image maps, Java and JavaScript, plugins, and so on. There are also many HTML elements that are supported only by a particular browser or that work differently in different browsers. It is therefore now difficult to say whether HTML is still an application of SGML. Very few pages are built according to the HTML specifications and the corresponding DTDs.

Cascading Style Sheets (CSS), standardized by the W3C, are partly designed to alleviate this problem. CSS1 separates the style that defines the visual appearance of elements from the markup of the elements themselves.

Of great interest is XML, which is expected to replace HTML as the markup language for Web pages. It is a variant of SGML aimed primarily at WWW applications. It does not require a DTD, and the language itself has been simplified by dropping rarely used complex constructs. This keeps parsers simple, which should allow XML to be used actively in browsers. (The likelihood of this is quite high, given the overtures both major browser vendors have made toward XML.)


In word processing systems, additional information called markup is included in the document. It performs the following functions:

identifying the logical elements of the document;
specifying the processing to be applied to the identified elements.

Conventional word processors have built-in commands for switching fonts on and off and other similar commands for controlling how information is placed on the screen or on the printed page (so-called escape sequences). This approach is called command, or procedural, markup.

An alternative approach is to mark a portion of the text without specifying how it is to be handled. Other commands then assign the processing to the marked fragments. This kind of markup is called descriptive. It consists of marks (tags) at the beginning and end of a text element that indicate how the fragment is to be interpreted.

Advantages

The main advantage of descriptive markup is its flexibility: pieces of text are marked as "what they are" (and not "how they should be displayed"), so software may later be written to handle these fragments in ways the language designers never envisaged. For example, HTML hyperlinks, originally intended for users to navigate from one web resource to another, have since been used by web search and indexing engines, for estimating the popularity of resources, and so on.

Descriptive markup also makes it easier to reformat the document if needed, because the description of the format is not tied to the content. For example, italics can be used to emphasize text, to mark foreign (or slang) words, or for other purposes. However, if the words are simply marked as italic (whether descriptively or procedurally), this ambiguity cannot be fully resolved. If the two cases are labeled differently from the outset, each can be reformatted independently of the other. Generic markup is another name for descriptive markup.
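The italics example can be sketched in Python. The snippet assumes two hypothetical descriptive tags, emph and foreign, and a toy render function that maps each tag to a pair of plain-text markers; none of these names come from a real standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph with two descriptively tagged fragments.
TEXT = "<p>This is <emph>very</emph> important, <foreign>n'est-ce pas</foreign>?</p>"


def render(xml_text, styles):
    """Render a flat paragraph, wrapping each child element in its style markers."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        open_mark, close_mark = styles.get(child.tag, ("", ""))
        parts.append(open_mark + (child.text or "") + close_mark)
        parts.append(child.tail or "")
    return "".join(parts)


# First style sheet: both kinds of fragment are shown the same way.
both_italic = {"emph": ("*", "*"), "foreign": ("*", "*")}
# Second style sheet: the two kinds are restyled independently.
restyled = {"emph": ("**", "**"), "foreign": ("_", "_")}

print(render(TEXT, both_italic))
print(render(TEXT, restyled))
```

Because the source distinguishes the two cases, switching to the second style sheet requires no change to the document. Had both fragments been marked simply as italic, the second rendering could not be produced automatically.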

In practice, elements of different markup classes usually coexist in any particular system. For example, HTML contains both markup elements that are procedural (b for bold) and others that are descriptive (blockquote, or the href attribute). HTML also includes the PRE element, which marks a region of text to be laid out exactly as it was typed.

Descriptive markup systems

Most modern descriptive markup systems treat documents as hierarchical structures (trees) and also provide some means for embedding cross-references. Such documents can therefore be interpreted and processed as databases whose structure is fairly well defined (however, since they do not have schemas as strict as those of relational databases, they are commonly referred to as "semi-structured databases").

Around the turn of the third millennium, interest arose in documents with non-hierarchical structures. For example, ancient and religious literature usually has a rhetorical or prose structure (story, section, paragraph, etc.) and also a reference structure (books, chapters, verses, lines). Because the boundaries of these units often overlap, they cannot be fully encoded using only a tree-structured markup system. Document modeling systems that support such structures include MECS, the TEI Guidelines, LMNL, and CLIX.

The term "markup" comes from the traditional practice of marking up manuscripts before publication (that is, adding symbolic instructions in the margins and between the lines of a paper manuscript). For centuries this was done by publishing house staff (editors and proofreaders), who noted the typeface, style, and size in which fragments of text were to be set and then handed the manuscript to typesetters, who set the text by hand according to the markup.

There are currently many markup languages; among the most widely known are DocBook, MathML, SVG, Open eBook, and XBRL. They are mainly intended for representing various kinds of text documents, but specialized languages can be used in many other areas. By far the best-known markup language is HTML (Hypertext Markup Language), one of the foundations of the WWW (World Wide Web).