ANNIS Tutorial

Introduction to ANNIS

Coptic Scriptorium uses the ANNIS search and visualization tool. You can access Coptic Scriptorium's corpora in ANNIS in multiple ways:

This tutorial will:

ANNIS Corpus Browser

When you arrive at https://corpling.uis.georgetown.edu/annis/scriptorium, you will see the list of publicly available corpora on the lower left of your screen. (On the right, you will see a list of sample queries for our corpora -- more on that in a minute.)

Each corpus contains multiple documents.

  • Each corpus has its own metadata (information about the corpus, such as all the editors/annotators who worked on the corpus, license information, the date this version of the corpus was released, etc.)
  • Each document within the corpus also has its own metadata (title of the document, manuscript information if the text is from a manuscript, specific editors/annotators for that document, translators, the date this version of the document was released, etc.)

Look at the list of corpora:

  1. To find out more information about any corpus, click the "i" information button for that corpus. A window will appear with:
    • a dropdown menu at the top listing all the documents in a corpus
    • on the left, metadata (information about the corpus, such as annotators, translators, the date this version of the corpus was released)
    • on the right, all the annotations available for texts in this corpus. These annotations will make more sense to you after you run a few searches. (We also have documentation on our wiki if you're really interested.)
  2. To see a list of documents in any corpus, click the document icon 📄 for that corpus name.
    • Click the "i" information button for any document for more information
    • You'll also see a list of visualizations for each document (the same visualizations available at http://data.copticscriptorium.org)

➡️Try it: What happens when you click on a link for a visualization?

corpus document list

  3. You can also filter the list of corpora:

➡️Try it: What happens when you type in "shenoute." (without the quotation marks) in the Filter box above the list of corpora?

➡️What happens when you click the 🔄 button?

Basic Search

Let's start by using the example queries provided for any given corpus.

➡️1. Try it: Click on the first corpus, apophthegmata.patrum. Then play around with the sample queries or follow the steps below:

  • In the panel on the right, click on the query to search for the word "ⲁⲡⲁ". Note: this searches for the normalized word, meaning spelling variants have been normalized, diacritics removed, and missing/damaged letters reconstructed.
  • In the search results on the right, your query term should appear in red (possibly within a Coptic bound group).
  • See the phrase "Base text" at the top of the list of results?
    • Change the base text from "norm_group" to "norm". How does this change how the results appear on the screen?
    • What happens when you change the base text to "orig"? (Note: orig is an abbreviation for the "original text" transcription.)
  • You should see the same visualizations we've seen before (Analytic, Diplomatic, Normalized views). Click on the + next to "analytic view".
    • Can you see your search result in red again?
    • This view visualizes three annotations of the textual data: part of speech annotations, the normalized Coptic text, and an English translation.
    • Check out the other two visualizations. What information is available?
  • What happens when you click the "i" information icon for the first search result? What information does this give you?
  • To view ALL the annotations for any given query result, click on "annotations (grid)".
    • All annotations for that stretch of text will appear as layers below.
    • Some annotations have been manually encoded; others have been added using our Natural Language Processing tools

ANNIS uses a multi-layer annotation model, where a base text appears followed by layers of annotations on that base text. You can have any number of customized annotations. All our paleographic and manuscript annotations (lacunae, page breaks, column breaks) follow a set of annotation and encoding standards known as TEI XML (the Text Encoding Initiative standards for extensible markup language). Specifically, we use the EpiDoc subset of TEI XML, the same encoding standards that Papyri.info uses.

➡️2. Try it: Let's create your own simple searches for words.

  • Modify the search we just did by typing your own favorite Coptic word where "ⲁⲡⲁ" appears. Click "Search". Check out the results.
  • Don't have a Coptic font installed on your computer? Click on the little keyboard to the right of the search pane!
  • Let's now search for your favorite word in more than one corpus. Control-click on a Mac/right-click on a PC on another corpus name in the corpus list in the lower left. Click Search.

➡️3. Try it: Create simple queries for information other than words.

  • Search for norm="ⲥⲟⲛ" in your chosen corpora
  • Now search for lemma="ⲥⲟⲛ". What's the difference in the results?
  • Search for all Greek words in Shenoute's "I See Your Eagerness": click on the shenoute.eagerness corpus and search for lang="Greek" (link)
  • Search for all words with the morpheme "ⲙⲛⲧ" in Shenoute's "Not Because a Fox Barks": click on the shenoute.fox corpus and search for morph="ⲙⲛⲧ" (link)
  • Search for all proper names in Warren Wells' Sahidica edition of the Gospel of Mark: click on the sahidica.mark corpus and search for pos="NPROP" (link)
  • Play around with some simple searches.

➡️4. Try it: You can click on the History button to see all the previous queries you've run in your current ANNIS session.

Complex Searches

You can also use regular expressions and the ANNIS Query Language to create complex queries: searches for sequences of characters, queries for two or more annotations, etc.

➡️5. Try it: Select a corpus, like 1 Corinthians, and try the following queries. (Type or cut-and-paste.) What kind of results do you get?

  • norm_group=/ⲡⲉⲧ.*/
  • norm=/.*ⲟⲥ/
  • norm=/ⲥ[ⲟⲱ]ⲧⲙ/

Hint: in the query syntax, "." stands for any single character and "*" means "zero or more of the preceding item", so .* matches any sequence of characters (including none).
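To get a feel for these patterns outside ANNIS, here is a minimal Python sketch. It assumes, as the .* padding in the queries above suggests, that a pattern has to match the whole annotation value, so it uses `re.fullmatch`. The sample strings are hypothetical stand-ins for annotation values, not data taken from any corpus, and Python's regular expression dialect may differ from ANNIS's in finer details.

```python
import re

# Hypothetical sample values standing in for norm / norm_group annotations.
samples = ["ⲡⲉⲧⲣⲟⲥ", "ⲡⲉⲧⲟⲩⲁⲁⲃ", "ⲥⲟⲧⲙ", "ⲥⲱⲧⲙ", "ⲥⲟⲛ"]


def matches(pattern, values):
    """Return the values that the pattern matches from start to end."""
    return [value for value in values if re.fullmatch(pattern, value)]


# norm_group=/ⲡⲉⲧ.*/ : values beginning with ⲡⲉⲧ
starts_with_pet = matches("ⲡⲉⲧ.*", samples)

# norm=/.*ⲟⲥ/ : values ending in ⲟⲥ
ends_with_os = matches(".*ⲟⲥ", samples)

# norm=/ⲥ[ⲟⲱ]ⲧⲙ/ : the bracket allows either ⲟ or ⲱ in that position
either_spelling = matches("ⲥ[ⲟⲱ]ⲧⲙ", samples)

print(starts_with_pet)
print(ends_with_os)
print(either_spelling)
```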

You can also search within a translation, if your corpus has a translation. (Not all do.)

➡️6. Try it: Select the 1 Corinthians corpus. Try the following queries. What's the difference?

translation=/.*brother.*/
translation=/.*[Bb]rother.*/
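The difference between the two queries comes down to case sensitivity. The Python sketch below illustrates it on a few made-up English strings (hypothetical examples, not actual translation annotations from the corpus), again assuming the pattern must match the whole value.

```python
import re

# Hypothetical translation values; real ones come from the corpus.
translations = [
    "Brothers, I appeal to you",
    "my beloved brothers",
    "the sister greets you",
]

# translation=/.*brother.*/ : only a lowercase "b" matches
lowercase_only = [t for t in translations if re.fullmatch(".*brother.*", t)]

# translation=/.*[Bb]rother.*/ : either "B" or "b" matches
either_case = [t for t in translations if re.fullmatch(".*[Bb]rother.*", t)]

print(lowercase_only)
print(either_case)
```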

You can search more than one field at the same time.

➡️7. Try it: Say you're interested in proper names. Select the corpus for Abraham Our Father. Compare the following queries:

  • pos="NPROP" _o_ lang="Greek"
  • pos="NPROP" _o_ lang="Hebrew"
  • pos="NPROP" _o_ lang=/.*/

Note: We tag loan words for language of origin based on the oldest possible language. To find all loan words, use the lang=/.*/ query.

You can also add metadata to your queries.

➡️8. Try it: Select the Abraham Our Father corpus. Search for all the appearances of "ϣⲉⲉⲣⲉ" in the codex MONB.YA: norm="ϣⲉⲉⲣⲉ" & meta::msName="MONB.YA"

  • Play around with other metadata fields. To find all words in documents edited by Rebecca S. Krawiec, select your corpora and search:
norm & meta::annotation=/.*Krawiec.*/
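If you build many metadata-restricted queries, it can help to assemble the query text programmatically before pasting it into ANNIS. The sketch below is only a string-building convenience: `with_metadata` is a hypothetical helper, not part of ANNIS, and it only handles exact-match metadata values (not regular expressions).

```python
def with_metadata(query, **meta):
    """Append meta::key="value" constraints to a query string."""
    parts = [query]
    for key, value in meta.items():
        parts.append(f'meta::{key}="{value}"')
    return " & ".join(parts)


# Reproduces the query from step 8 above.
query = with_metadata('norm="ϣⲉⲉⲣⲉ"', msName="MONB.YA")
print(query)
```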

There's lots of fun stuff you can do with regular expressions and the ANNIS Query Language:

  • Find either circumstantial converters or focalizing converters: pos=/CCIRC|CFOC/
  • Find either form of the same verb: norm=/ⲥ[ⲟⲱ]ⲧⲙ/
  • Query for things following each other: To search for a copular pron sentence (a copula following a pronoun): pos="PPERI" . pos="COP"
  • Query for nearness: To find "daughter" within 50 tokens after "son": norm="ϣⲏⲣⲉ" ^* norm="ϣⲉⲉⲣⲉ"
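The ideas behind these queries can be sketched in Python on a toy token sequence. This is only an illustration of the logic as described above (alternation, "directly follows", and "follows within 50 tokens"); it is not how ANNIS evaluates queries internally. The tag sequence and word list are hypothetical, and the function names are made up for this example.

```python
import re

# Hypothetical part-of-speech sequence, using tags from the examples above.
pos_tags = ["PPERI", "COP", "NPROP", "CCIRC", "V", "CFOC", "V"]

# pos=/CCIRC|CFOC/ : the | lets either tag match.
converters = [i for i, tag in enumerate(pos_tags)
              if re.fullmatch("CCIRC|CFOC", tag)]


def directly_follows(values, first, second):
    """Index pairs where `second` comes immediately after `first`."""
    return [(i, i + 1) for i in range(len(values) - 1)
            if values[i] == first and values[i + 1] == second]


def follows_within(values, first, second, max_distance=50):
    """Index pairs where `second` occurs 1..max_distance tokens after `first`."""
    pairs = []
    for i, value in enumerate(values):
        if value != first:
            continue
        last = min(i + max_distance + 1, len(values))
        for j in range(i + 1, last):
            if values[j] == second:
                pairs.append((i, j))
    return pairs


# Hypothetical sequence of normalized words.
norms = ["ϣⲏⲣⲉ", "ⲥⲟⲛ", "ϣⲉⲉⲣⲉ"]

print(converters)
print(directly_follows(pos_tags, "PPERI", "COP"))
print(follows_within(norms, norms[0], norms[2]))
```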

Know your corpus and annotations when doing research. For example, in our corpus, a compound word containing both Greek and Coptic carries a language tag only on the Greek morph within the compound. (E.g., in ⲣⲭⲣⲉⲓⲁ, only ⲭⲣⲉⲓⲁ receives the Greek tag.) Hence, we use the syntax for finding overlapping search fields ("_o_") rather than fields that cover exactly the same text ("_=_"). lang="Greek" _=_ pos="V" (link) finds all verbs that are Greek loanwords; lang="Greek" _o_ pos="V" (link) finds all verbs that are Greek loanwords or are compound words with a Greek loanword as part of the compound. Compare the results in the links.
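One way to picture the difference between the two operators is to treat each annotation as a span over positions in the text. The Python sketch below is a toy model under that assumption: the `Span` type, the positions, and the idea that the compound ⲣⲭⲣⲉⲓⲁ occupies two units (ⲣ and ⲭⲣⲉⲓⲁ) are illustrative choices, not ANNIS's actual data structures.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first covered unit
    end: int    # index one past the last covered unit


def identical_coverage(a, b):
    """Toy version of _=_ : both annotations cover exactly the same units."""
    return a == b


def overlaps(a, b):
    """Toy version of _o_ : the annotations share at least one unit."""
    return a.start < b.end and b.start < a.end


# Assumed layout for the compound ⲣⲭⲣⲉⲓⲁ: unit 0 is ⲣ, unit 1 is ⲭⲣⲉⲓⲁ.
verb_tag = Span(0, 2)   # pos="V" on the whole compound
greek_tag = Span(1, 2)  # lang="Greek" only on ⲭⲣⲉⲓⲁ

print(identical_coverage(greek_tag, verb_tag))  # the _=_ query misses the compound
print(overlaps(greek_tag, verb_tag))            # the _o_ query finds it
```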

Word Frequencies

ANNIS allows you to find word frequency lists for our corpora.

➡️1. Try it: Select the shenoute.eagerness corpus.

  • Type in the following query to find all the words in the corpus: norm
  • Below the query window, you should see a button for "More." Click on it and select "Frequency Analysis." Click "Perform Frequency Analysis"
  • Both a chart and a list of word frequencies will appear.
  • You can see your frequencies on a linear scale or a log scale
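Conceptually, the frequency analysis counts how often each distinct value occurs among the hits. The Python sketch below shows that idea on a tiny hypothetical list of normalized words (not real corpus output), along with what switching to a log scale does to the counts. It uses a base-10 logarithm as an assumption; the base ANNIS uses for its chart is not stated here.

```python
import math
from collections import Counter

# Hypothetical stream of normalized words returned by the query "norm".
norm_tokens = ["ⲁⲡⲁ", "ⲥⲟⲛ", "ⲁⲡⲁ", "ϣⲏⲣⲉ", "ⲁⲡⲁ", "ⲥⲟⲛ"]

# Count each distinct word and sort from most to least frequent.
frequencies = Counter(norm_tokens)
ranked = frequencies.most_common()

# A log scale compresses large counts so rare words stay visible on a chart.
log_scaled = [(word, math.log10(count)) for word, count in ranked]

for (word, count), (_, log_count) in zip(ranked, log_scaled):
    print(word, count, round(log_count, 2))
```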

➡️2. Try it: Download your frequency list by clicking the "Download as CSV" button

You can also produce frequencies for more refined lists. Be sure to close the "Frequency Analysis" pane to clear your data before you start a new analysis.

➡️3. Try it: Create lists for loan words in our corpus.

Remember: If you have just run a frequency analysis, then close the current "Frequency Analysis" pane first. Do this (or click "new analysis") between each new frequency analysis.

A. Find the Greek loan words in the shenoute.eagerness corpus using this query: lang="Greek" _o_ norm

  • Enter the query. (If you've closed the Frequency Analysis pane, click "More" then "Frequency Analysis")
  • Delete all rows EXCEPT "norm" (since you want the frequency of each normalized word)
  • Click "Perform Frequency Analysis"

B. Can you do the same to find all loan words in the shenoute.eagerness corpus (remember to hit "new analysis" first): lang=/.*/ _o_ norm

C. Can you do the same to find all loan words that are verbs (remember to hit "new analysis" first): lang=/.*/ _o_ norm _o_ pos="V"

BONUS question: What do you do about corpora that contain more than one manuscript witness to the same text? I See Your Eagerness is one such corpus. In some places, we have parallel manuscript witnesses to the same text. So if you run a straight word frequency list, you'll get duplicate "hits". For this corpus (and future versions of other corpora) we encode parallel witnesses in the metadata fields. When you click on the "i" information button for a document, you'll see metadata fields for "witness" and "redundant".

Image of metadata-witness

➡️4. Try it: Run a frequency analysis using the following query: lang=/.*/ _o_ norm _o_ pos="V" & meta::redundant="no"

  • Remember to click "new analysis" to clear your old frequency data first!
  • Remember to delete rows for everything except norm when you run the analysis.
  • Are the results different from the results from #3 above?

Again: know your corpus so you understand the numbers. Spend some time looking at the metadata, understanding the annotation layers, and running queries to see how the annotations and textual data work. In our corpora, we designate as redundant the witness(es) with the most damage or lacunae.
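The effect of the redundant filter can be illustrated with a small Python sketch. The hits, document names, and field layout below are hypothetical; the sketch only shows why counting across parallel witnesses inflates the numbers and how restricting to redundant="no" avoids that.

```python
from collections import Counter

# Hypothetical hits: the same passage survives in two manuscript witnesses.
hits = [
    {"doc": "witness_1", "redundant": "no", "norm": "ⲭⲣⲉⲓⲁ"},
    {"doc": "witness_2", "redundant": "yes", "norm": "ⲭⲣⲉⲓⲁ"},  # parallel copy
    {"doc": "witness_1", "redundant": "no", "norm": "ⲥⲟⲛ"},
]

# Without the metadata filter, the parallel passage is counted twice.
naive = Counter(hit["norm"] for hit in hits)

# With meta::redundant="no", each passage is counted once.
deduplicated = Counter(hit["norm"] for hit in hits if hit["redundant"] == "no")

print(naive)
print(deduplicated)
```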

Download Your Results

We encourage all researchers to keep records of their research in ANNIS. This includes queries, the corpora on which the queries are run, the version number and version date of the corpora, and the results.
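One lightweight way to keep such records is a small structured log that you save alongside your downloaded results. The Python sketch below is just one possible layout: the `QueryRecord` class and its field names are hypothetical, and the angle-bracket values are placeholders to be filled in from the corpus metadata.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QueryRecord:
    query: str          # the query exactly as entered in ANNIS
    corpora: list       # names of the corpora that were selected
    version_n: str      # version number from the corpus metadata
    version_date: str   # version date from the corpus metadata
    accessed: str       # date you ran the query
    result_file: str = ""  # where you saved the exported results


record = QueryRecord(
    query='lang="Greek" _o_ norm',
    corpora=["shenoute.eagerness"],
    version_n="<version number from the corpus metadata>",
    version_date="<version date from the corpus metadata>",
    accessed="<date you ran the query>",
)

# ensure_ascii=False keeps any Coptic text in queries readable in the file.
serialized = json.dumps(asdict(record), ensure_ascii=False, indent=2)
print(serialized)
```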

There are multiple ways you can download the results of your query by clicking More > Export underneath the query panel. Each way or format works well for a different discipline or research objective. For most people who work with texts as philologists, historians, or religious studies scholars, we recommend using the GridExporter. The GridExporter allows you to tell ANNIS which annotations and which metadata you want to export.

➡️Try it: Run a query (such as this one) and download your results.

  • Run the query
  • Click "More" > "Export"
  • In the Exporter dropdown menu select GridExporter
  • In the "annotation keys" box, type the annotations you want to export. Try: orig, norm, translation, pb_xml_id to export the original manuscript text, normalized text, translation (if available), and the page number of the manuscript
  • In the Parameters box type numbers=false;metakeys=title,version_n,version_date to export the document title, version number, and version date for EVERY hit in your search.
  • Click Perform Export
  • Click Download
  • You can open this text file in any text editor (such as TextEdit, TextWrangler, etc.)
  • If you want more annotations (such as part of speech tags) add them to the "annotation keys" box; be sure to use the correct name for the annotation
  • If you want more metadata (such as the names of editors or translators of each document), add them to the "Parameters" box; be sure to use the correct name for the metadata field
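If you export often with different combinations of annotations and metadata, you can generate the text for the two boxes rather than typing it by hand. The sketch below is a hypothetical helper (not part of ANNIS) that simply reproduces the formats shown in the steps above: a comma-separated list for "annotation keys" and a semicolon-separated string for "Parameters".

```python
def grid_export_settings(annotation_keys, metakeys, numbers=False):
    """Build the text for the "annotation keys" and "Parameters" boxes."""
    keys_box = ", ".join(annotation_keys)
    numbers_value = "true" if numbers else "false"
    parameters_box = f"numbers={numbers_value};metakeys={','.join(metakeys)}"
    return keys_box, parameters_box


keys_box, parameters_box = grid_export_settings(
    ["orig", "norm", "translation", "pb_xml_id"],
    ["title", "version_n", "version_date"],
)
print(keys_box)
print(parameters_box)
```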

Citing and Linking to Your Data

When researching our corpora for a future publication, please note the date and version number of the documents or corpora while you are conducting your research. (This information is in the corpus and document metadata accessed via the information button(s) for each corpus and each document within a corpus.) We update our corpora regularly and recommend all citations include the version number and date of the resources used, as described below. (If you conducted research in the past and did not note the version and date of the corpus at that time, then please cite the date you accessed the corpus.)

We have Citation Guidelines with examples for how to cite the project, the project site, individual corpora, and individual documents in your bibliography and footnotes. If you are using documents or queries on only one corpus, then you may cite only that corpus.

When citing more than one corpus, we recommend citing the corpora and versions of each corpus used.

You can save a link to a query or even to a query result.

screenshot of linking to query

If you want to embed a result in a blog, webpage, or other electronic publication, you can do that too!

screenshot of embed dialogue

Some DH researchers recommend providing access to your data when you publish your analysis. You can do this in a number of ways:

  1. Link to our project's raw data on our GitHub corpus repository.
    • Link to the version that corresponds to your data. See our release page.
      screenshot of GitHub corpora repository page
    • So, for research conducted in May 2017, link to the April 2017 version. For research conducted in January 2017, link to the December 2016 version.
  2. Download your query results using the process described above and post them on your own site; link to them in your publication.
  3. Link to your query on our ANNIS site.

Important note: the URLs for the query and result links are stable, but the core text data may change if we update the corpus or documents you are querying. We update regularly to add more documents to a corpus, to add new annotations, or to make corrections. We encourage all researchers to download query results and cite the version number(s) and date(s) of the data used.