Most specialist terms are explained the first time they occur, but here are a few which are only mentioned in passing:

Ancestor a collective term for the elements within which a child element is nested; thus the root element is the ancestor of every other element in an XML file


assign to assign a variable is to give it a value


Attribute a container for information about the contents of an element, held in an opening tag, separated from the element name by a space and followed by an equals sign and a quoted value, e.g. <date style="new style">; elements can have multiple attributes

Attribute value the value given to an attribute: attribute names are fixed but their values can vary in each instance; values follow the equals sign and must be within quotation marks, e.g. <person id="123456">



Boolean a data type in Python that can only have the values True or False. More generally, Boolean operators create logical structures by combining statements with and, not and or.
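
As an illustration, a minimal sketch of Boolean values and operators in Python; the variable names are invented for this example.

```python
# Two Boolean values (hypothetical names, chosen for illustration)
is_well_formed = True
is_valid = False

# Boolean operators combine or negate values
both = is_well_formed and is_valid    # True only if both are True
either = is_well_formed or is_valid   # True if at least one is True
negated = not is_valid                # flips False to True

# Comparisons also produce Boolean values
age = 21
is_adult = age >= 18

print(both, either, negated, is_adult)
```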


Character data text that is not parsed by the XML editor or other software; this means that characters such as & or < will be treated literally and need not be rendered as entities


Child the immediate descendant of an element, i.e. an element nested directly within another

Closing tag the second part of an element, denoted by a forward slash after the opening angle bracket, e.g. </body>


codepoint a unique number assigned to a character in a character set such as Unicode; the number of bytes needed to store the character depends on the encoding used (e.g. UTF-8)
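
A short sketch of how a codepoint can be inspected in Python, using the built-in functions ord and chr. The euro sign is just an example character; the point is that the same codepoint occupies a different number of bytes in different encodings.

```python
char = "€"

# ord() gives the codepoint of a character; chr() goes the other way
codepoint = ord(char)
same_char = chr(codepoint)

# The codepoint stays the same, but its size in bytes depends on the encoding
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-le")

print(codepoint, hex(codepoint))
print(len(utf8_bytes), len(utf16_bytes))
```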


concatenation the joining together of multiple things: strings, variables, files, etc.
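
For instance, strings and lists can be concatenated in Python roughly as follows; the values are illustrative only.

```python
# Joining two strings with the + operator
first = "Early English"
second = "Books Online"
title = first + " " + second

# Joining a list of strings with a separator
words = ["Text", "Encoding", "Initiative"]
phrase = " ".join(words)

# Lists can be concatenated too
combined = [1, 2] + [3, 4]

print(title)
print(phrase)
print(combined)
```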


Crowdsourcing the use of volunteers to contribute to a research project



Descendant an element nested inside another, at however deep a level – so all other elements in an XML document are descendants of the root element


DocBook an XML format for encoding books and papers


DTD Document Type Definition, a declaration which specifies which elements and attributes are allowed in an XML file, and how they can be used



EAD (Encoded Archival Description) an XML markup scheme for encoding archival finding aids


EEBO (Early English Books Online) a project to photograph all books published in England or in English between the beginning of printing and 1700. Originally a microfilm product, EEBO is now published on the web by ProQuest.


EEBO-TCP a project to take a selection of page images from EEBO and produce lightly encoded TEI-conformant XML transcriptions of the texts


Element a discrete piece of markup, usually consisting of an opening and closing tag

Entity reference an encoding for a particular character that begins with & and ends with a semicolon, e.g. &amp;. Used in HTML for reserved characters, they also have wider applications.
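
As a sketch, Python's standard html module can convert reserved characters to entity references and back; the sample text is made up for this example.

```python
import html

raw = "Marks & Spencer <plc>"

# escape() replaces reserved characters with entity references
escaped = html.escape(raw)

# unescape() turns entity references back into the characters they stand for
restored = html.unescape(escaped)

print(escaped)
print(restored)
```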



Join a database manipulation technique to combine multiple tables into a new table
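
A minimal sketch of a join, using Python's built-in sqlite3 module and an in-memory database. The table names, column names and rows are all invented for illustration.

```python
import sqlite3

# A temporary database held in memory (nothing is written to disk)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE books (title TEXT, author_id INTEGER)")

conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [(1, "Author A"), (2, "Author B")],
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("First Book", 1), ("Second Book", 2), ("Third Book", 1)],
)

# The join matches each book to its author through the shared id
rows = conn.execute(
    "SELECT books.title, authors.name "
    "FROM books JOIN authors ON books.author_id = authors.id "
    "ORDER BY books.title"
).fetchall()
conn.close()

for title, name in rows:
    print(title, "-", name)
```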


Latent Dirichlet Allocation (LDA) a topic modelling algorithm that groups words into topics on the basis of probability (using the Dirichlet distribution)

list comprehension a compact, readable syntax provided by Python for creating lists
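
A brief sketch of the syntax, with an equivalent loop for comparison; the word list is only an example.

```python
words = ["Sepultus", "erat", "tertio", "die", "Octobris"]

# A list comprehension: one expression evaluated for every item
lengths = [len(word) for word in words]

# An optional condition filters the items
long_words = [word.lower() for word in words if len(word) > 4]

# The same filter written as an ordinary loop
long_words_loop = []
for word in words:
    if len(word) > 4:
        long_words_loop.append(word.lower())

print(lengths)
print(long_words)
```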



markup embedded annotations to a text which provide instructions on how elements of it should be presented, structured or interpreted


metadata data that describes other data; the file properties given in an operating system, such as when the file was created, modified, and so on, are a basic form of metadata



Nest an element that opens and closes inside another element (its parent) is nested within it, so date is nested within lang here: <lang name="Latin">Sepultus erat <date value="3-10-1609">tertio die Octobris</date></lang>


node an element in a data structure which is linked to other nodes, often hierarchically as in a tree diagram


object serialisation the conversion of a complex data structure into a series of bytes
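
One way to see this in Python is the standard pickle module, sketched below with an invented record. Other formats, such as JSON, serve a similar purpose.

```python
import pickle

# A nested data structure (hypothetical example record)
record = {"id": "123456", "names": ["Ann", "Anne"], "year": 1609}

# Serialise: the structure becomes a sequence of bytes
data = pickle.dumps(record)

# Deserialise: the bytes are turned back into an equivalent structure
restored = pickle.loads(data)

print(type(data))
print(restored == record)
```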

Opening tag the first part of an element, bounded by angle brackets, e.g. <p>



Parent the immediate ancestor element of another element


Parse to read the structure of an XML document, element by element; any XML-aware software needs to parse the document before acting upon it
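
As a sketch of what parsing looks like in practice, the standard xml.etree.ElementTree module can read a small XML fragment and expose its elements, attributes and text. The fragment is a short burial-register style example with a lang element containing a date element.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<lang name="Latin">Sepultus erat '
    '<date value="3-10-1609">tertio die Octobris</date></lang>'
)

# Parsing turns the text into a tree of element objects
root = ET.fromstring(xml_text)

# Walk the tree: the root element first, then its descendants
tags = [element.tag for element in root.iter()]

# Attributes and text are available on each element
date = root.find("date")
language = root.get("name")
date_value = date.get("value")

print(tags)
print(language, date_value, date.text)
```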

Parsed character data text that is read by the XML editor or other software; this means that any characters which are part of XML syntax, such as & or <, will need to be rendered as entities if they are to be represented literally

Plain text text without any markup. Note that text in word processors, such as Word, does have markup – you just can’t see it.

Processing instruction an instruction that takes the form <? … ?>, which calls upon software to act – for example by referring to another file, such as a stylesheet



Quantifier a symbol in a DTD specifying how many times an element may occur: ? = none or one; + = one or many; * = none, one or many.



RelaxNG an alternative rules file format to a DTD or XML Schema; although we don’t cover it in this course there is plenty of information about it on the web.

Root element the single element which wraps all other elements: everything in an XML file, apart from the declaration and other header information, must go inside it

Rules file a generic term for a DTD, XML Schema or other format specifying the rules of an XML document

Running text text in paragraphs or other long units of narrative, as opposed to text in tables, lists, headings etc.


scripting language a programming language that does not need to be compiled before it is run

stop word a word that is deliberately ignored when analysing text, creating indexes, etc., usually because it is too common to be informative
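
A minimal sketch of stop word filtering in Python. The stop word list here is a tiny invented one; real projects usually rely on a much longer list.

```python
# A very small, hypothetical stop word list
stop_words = {"the", "of", "and", "a", "in"}

text = "The history of the book in England and Wales"

# Lower-case the text and split it into words
tokens = text.lower().split()

# Keep only the words that are not stop words
content_words = [token for token in tokens if token not in stop_words]

print(content_words)
```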


string a data type in Python, entered in quotation marks and treated as literal text; for example the string "21" cannot be divided by 2, whereas the integer 21 can be
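
The difference can be sketched as follows; the variable names are invented for this example.

```python
text_value = "21"    # a string
number_value = 21    # an integer

# Dividing the integer works
half = number_value / 2

# Dividing the string raises a TypeError
try:
    text_value / 2
    string_division_failed = False
except TypeError:
    string_division_failed = True

# Converting the string to an integer first makes division possible
converted_half = int(text_value) / 2

print(half, string_division_failed, converted_half)
```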



Tag part of an element, bounded by angle brackets, e.g. <h1>

Text file a file that can be read by any text editor, usually having the file extension .txt


tuple a sequence of any number of values (the name is formed from the suffix of words like quintuple). In Python, once a tuple has been created it cannot be changed
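
A short sketch of creating a tuple, unpacking it, and what happens on an attempt to change it; the date values are illustrative.

```python
# A tuple of three values: day, month, year
burial_date = (3, 10, 1609)

# Values can be read by position or unpacked into variables
day, month, year = burial_date

# Trying to change a value raises a TypeError
try:
    burial_date[0] = 4
    change_failed = False
except TypeError:
    change_failed = True

print(day, month, year, change_failed)
```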



Unicode a standard for encoding characters in the world’s writing systems



Valid an XML document is valid if it follows the rules specified by the rules file to which it is linked; additionally a document must be well formed in order to be valid.


variable a name for a value; for example, in Python myage = 21 assigns the value 21 to the variable myage


Web scraping automated collection of content from web pages


Well-formed conforming to the structural rules of XML, i.e. properly nested elements, matching case for elements, and quoted attribute values
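
One rough way to test for well-formedness in Python is to try parsing the text with the standard xml.etree.ElementTree module and see whether it raises an error. The helper function is_well_formed below is a hypothetical name written for this sketch, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = is_well_formed("<body><p>text</p></body>")
badly_nested = is_well_formed("<body><p>text</body></p>")
case_mismatch = is_well_formed("<Body>text</body>")
unquoted_value = is_well_formed("<p id=123>text</p>")

print(good, badly_nested, case_mismatch, unquoted_value)
```

This checks structure only; it says nothing about whether the document is valid against a rules file.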



XML (Extensible Markup Language) a markup language which gives great flexibility to its users in defining content and structure

XML declaration a declaration, written in the same form as a processing instruction, that goes at the top of an XML file; the most minimal form is <?xml version="1.0"?>

XML Schema a schema language: an expression of the rules for a particular XML document, written in XML itself, and following the syntax specified by the W3C

