Antconc Corpus
Repeating this many times is how you would build a corpus of plain text files; this process is called corpus construction, which very often involves addressing questions of sampling, representativeness and organization. Remember, each file you want to use in your corpus _must_ be a plain text file for AntConc to use it. It is customary to name files with the .txt suffix so that you know what kind of file it is.
Heather Froehlich is a PhD student at the University of Strathclyde (Glasgow, UK), where she studies gender in Early Modern London plays using computers. Her thesis draws heavily from sociohistoric linguistics and corpus stylistics, though she sustains an interest in digital methods for literary and linguistic inquiry.
Suggested Citation: Heather Froehlich, "Corpus Analysis with Antconc," Programming Historian 4 (2015),
A freeware, parallel concordancer that allows users to check word and phrase usage in an English and Japanese educational corpus. WebSCoRE is developed by Laurence ANTHONY (Waseda University, Japan) in collaboration with Kiyomi CHUJO (Nihon University, Japan).
AntConc is a useful tool for finding clusters (frequent patterns of word sequences) or n-grams (sequences of n words within your corpus or document). These may be particularly useful once you have established high-frequency words for a search strategy but need to increase the precision of your search, either by searching for phrases that contain those words or by establishing good collocates if you are selecting terms for adjacency searching.
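To make the idea of n-grams concrete, here is a minimal Python sketch that counts two-word sequences in a short text. It is not how AntConc itself is implemented; the `tokenize` and `ngram_counts` helpers and the sample sentence are hypothetical, and the tokenization rule (lowercase runs of letters, digits and apostrophes) is an assumption chosen for simplicity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text standing in for titles or abstracts.
sample = "heart failure patients and heart failure outcomes in chronic heart failure"
counts = ngram_counts(tokenize(sample), 2)
print(counts.most_common(3))
```

Phrases that rise to the top of such a count are candidates for phrase searching; AntConc's own tool adds options such as minimum frequency and cluster size on top of this basic idea.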
It is best used once you have established the high-frequency words that you would like to add to your strategy. Tools such as PubReMiner and the Systematic Review Accelerator's Word Frequency Analysis are easy to use for that purpose, as they take into account not only the occurrence of words but also the number of records in which those words appear, so they could be used before performing the analysis in AntConc (in fact, the Systematic Review Accelerator also identifies n-grams). The corpus or file containing relevant bibliographic records can then be opened in AntConc for text mining. Some authors suggest analyzing titles and abstracts separately and setting different cutoffs for inclusion (less strict for titles, stricter for abstracts). Stopword lists and lemma lists can also be added to the tool, for example, PubMed's list of 132 stopwords. Many such lists are available for reuse on the internet, and the choice depends on the context of the search.
One issue that can be overcome with some scripting is the fact that groups of bibliographic records are usually exported in one file, whereas ideally individual bibliographic records should be imported into AntConc as individual documents (one per record) within a corpus. I have not been able to figure out how to do this yet.
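One possible approach is sketched below in Python. It assumes that records in the export file are separated by one or more blank lines; that is an assumption about the export format, not something every database guarantees, so the splitting rule would need adjusting for other formats. The function names, the `record_0001.txt` naming scheme and the sample export text are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def split_records(export_text):
    """Split an export into records, assuming blank lines separate them."""
    chunks = re.split(r"\n\s*\n", export_text.strip())
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def write_records(records, out_dir):
    """Write each record to its own UTF-8 .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, record in enumerate(records, start=1):
        path = out / f"record_{number:04d}.txt"
        path.write_text(record + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical two-record export, written to a temporary folder for the demo.
sample_export = (
    "TI  - First title\nAB  - First abstract.\n\n"
    "TI  - Second title\nAB  - Second abstract.\n"
)
records = split_records(sample_export)
with tempfile.TemporaryDirectory() as tmp:
    written = write_records(records, tmp)
    names = [path.name for path in written]
print(names)
```

In real use the temporary folder would be replaced by a permanent corpus directory, and the resulting .txt files could then be loaded into AntConc as individual documents.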
This tool shows which words are unusually frequent (or infrequent) in the corpus in comparison with the words in a reference corpus. This allows you to identify characteristic words in the corpus, for example, as part of a genre or ESP study. The following steps produce a keyword list and demonstrate the main features of this tool.
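The comparison behind a keyword list can be illustrated with a small Python sketch. It uses the log-likelihood statistic, one widely used keyness measure; whether it matches the measure and settings selected in AntConc is an assumption, and the `log_likelihood` and `keywords` helpers and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word occurring a times in a target corpus
    of c tokens and b times in a reference corpus of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_target)
    if b > 0:
        total += b * math.log(b / expected_reference)
    return 2 * total


def keywords(target_tokens, reference_tokens):
    """Return (word, score) pairs for words relatively more frequent in the
    target corpus, highest score first."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    rows = []
    for word, a in target.items():
        b = reference[word]  # Counter returns 0 for unseen words
        if a / c > b / d:
            rows.append((word, log_likelihood(a, b, c, d)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy corpora: "patient" is frequent in the target and absent from the reference.
target_tokens = ["patient"] * 5 + ["the"] * 5
reference_tokens = ["the"] * 8 + ["dog"] * 2
print(keywords(target_tokens, reference_tokens))
```

A word with the same relative frequency in both corpora scores zero, which is why common function words tend to drop out of a keyword list unless the two corpora use them very differently.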
AntConc is an advanced text analysis application which provides details about the text inside one or multiple text files, should you opt for batch processing.
With AntConc, we're presented with a number of interesting text analysis tools which calculate and display the results of their analysis in a few different ways, including a concordance, a file viewer and a cluster tool.
Another tool that comes built in with AntConc is "Clusters/N-Grams", which can search the corpus for clusters of length N, essentially detecting different but similar word patterns.
Once all of the data has been collected, AntConc can export its results to a few different file formats, most notably text, HTML or Excel files.
In closing, AntConc has its specific niche but it may also be useful for web developers and search engine optimizers for its keyword analysis functionality.
Features of AntConc
Character Encoding: Support for various character encodings.
Cluster Analysis: Automatically group words together.
Collocation: View collocations (words commonly used together).
Compare: Compare two or more texts.
Concordance: View keyword in context.
KWIC: View keywords in context.
N-Gram: View patterns of words used together.
Part-of-Speech Tagging: Automatically tag words with parts of speech.
Phonetic Analysis: Search using phonetic patterns.
Regular Expressions: Use powerful search patterns.
Search: Search for words or phrases within the text.
Text Classification: Automatically classify text into categories.
Text Conversion: Convert text to other formats.
Word Frequency: View frequency of words in the text.
Word List: Generate and view a list of words in the text.
Compatibility and License
AntConc, categorized as language and translation software, is provided under a freeware license on Windows with no restrictions on usage. Download and installation of this PC software is free and 4.2.0 is the latest version last time we checked.
AntConc v.4 comes with a corpus builder, which means you can add raw files to create your own corpora, save them and then quickly pick the one you want to load for a particular query or project. Added bonus: compatible file types now include .pdf and .docx, so you no longer have to use AntFileConverter to convert .pdf files to .txt before processing them. Note that if you try to load improperly encoded files into the new version, you will see a warning and those files will be ignored. Try resaving those files as UTF-8 in a text editor to solve the problem as this is the default that AntConc uses.
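If many files need resaving, the step can be scripted. The Python sketch below rewrites a file as UTF-8 only when it is not already valid UTF-8. The `resave_as_utf8` helper is hypothetical, and the default source encoding of `cp1252` (common for older Windows text files) is an assumption: a file in some other legacy encoding would be converted incorrectly, so check a sample of the output before processing a whole corpus.

```python
import tempfile
from pathlib import Path


def resave_as_utf8(path, source_encoding="cp1252"):
    """Rewrite the file as UTF-8 if it is not already valid UTF-8.

    Returns True if the file was converted, False if it was left alone.
    source_encoding is an assumption about how the file was originally saved.
    """
    path = Path(path)
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)  # may raise if the guess is wrong
        path.write_bytes(text.encode("utf-8"))
        return True


# Demo on a temporary file saved in Windows-1252.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("café".encode("cp1252"))
    first = resave_as_utf8(sample)   # converts the file
    second = resave_as_utf8(sample)  # file is now UTF-8, so nothing changes
    content = sample.read_text(encoding="utf-8")
print(first, second, content)
```

Working on bytes rather than text mode keeps line endings exactly as they were, so only the character encoding changes.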
Unfortunately, the results pane in the KWIC tool displays source file names in the first column, which means that long names tend to gobble up screen space. (Before, file names could be scrolled out of sight to the right.) However, the shortcut Ctrl+H hides the file name column, kicking into action the next time a search is launched. And for more screen space, you can now completely collapse the left pane containing the target corpus details by dragging the double-headed arrow leftwards.
Once you save your newly made corpus, you need to change the file into a text file by simply renaming the .crp file to a .txt file. A good idea is to make a copy of the .crp file before renaming so that you can still use the original file in TextSTAT.
The AntConc GUI is conveniently subdivided into several tabs organized horizontally at the top of the program window. The tabs represent the functions of AntConc and offer the user relevant views of the corpus data. Down the left of the window there is a box labelled "Corpus Files" that lists the files the user has selected for analysis. The query area is at the bottom of the program window. The vertical area on the left-hand side also holds the functionality for loading your own files into the program.
Navigate to a directory with *.txt files and load some of them into the software. Note that AntConc expects *.txt as the input file type by default. If you do not have any files at hand, download this zip file with the American Inaugural Speeches from the NLTK_DATA collection and unpack it to your corpus directory. Note that those of you who have installed the NLTK_DATA set during the Python course can simply use the data already on your machines.
Can anyone recommend a tool for corpus analysis in Chinese (1) that works on the basis of words (not single characters, so it can find e.g. the frequency of 社会, not just 社), and (2) in which I can analyze documents I select myself? (Free software would be best ;).
Language-neutral software such as AntConc has difficulty recognizing Chinese words since there are no spaces between them. Tools like the Lancaster corpus can identify words, but they analyze a corpus from the internet, so I cannot, for example, analyze the language in a sub-corpus I have collected myself.
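One workaround is to insert spaces between words before loading the files, so that a space-based tool can treat each word as a token. The Python sketch below shows the idea with forward maximum matching against a word list. The `segment` function and the three-word vocabulary are hypothetical; a real project would need a large dictionary or a dedicated segmentation tool, and this simple method makes mistakes on ambiguous strings.

```python
def segment(text, vocabulary, max_len=4):
    """Split text into words by forward maximum matching.

    At each position the longest vocabulary entry (up to max_len characters)
    is taken; a character that starts no known word becomes its own token.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary; real use needs a full dictionary.
vocabulary = {"社会", "发展", "中国"}
segmented = " ".join(segment("中国社会发展", vocabulary))
print(segmented)
```

Saving the space-separated output as UTF-8 .txt files gives a corpus in which a search for 社会 counts the word rather than the single character 社.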
AntConc is a free and green corpus tool developed by Japanese scholar Laurence Anthony, featuring three main functions: concordance, wordlist and keywords. The author first describes the principles and process of constructing a web-based bilingual parallel corpus for the purpose of translation studies and ESP research based on AntConc. The construction principles include strict linguistic standards, balance of the corpus, and appropriate size. Based on these principles, language data is accumulated, processed and entered. After that, the language processing software CLAWS part-of-speech tagger and Wmatrix are used, respectively, for text marking and annotating, and for high-frequency vocabulary extraction and corpus distribution balancing. In the end, texts, paragraphs and sentences are aligned with the corpus tool ParaConc. After the construction, the author uses the retrieval software Wordsmith 4.0 and the statistical software SPSS 11.5 to demonstrate the feasibility and effectiveness of the corpus with a six-month experiment that covers 16 translators, 25 teachers and 285 students.
A word list is a list of the words used in a corpus (group of texts) ranked by how frequently they are used. The Concordance tab, by contrast, lets you see words in the contexts in which they are used in each document, but you must supply the terms for it to search for. So it is most useful if you already know which terms are of interest to you, rather than if you are trying to find the most frequently used words in a piece.
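The keyword-in-context view that the Concordance tab provides can be sketched in a few lines of Python. This is only an illustration of the idea, not AntConc's implementation; the `kwic` function, its `width` parameter (number of context words on each side) and the sample sentence are hypothetical.

```python
import re


def kwic(text, term, width=3):
    """Return (left context, hit, right context) for each occurrence of term.

    Matching is case-insensitive and on whole word tokens; width is the
    number of context words kept on each side.
    """
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = "The king spoke. Then the King left the hall."
for left, hit, right in kwic(sample, "king", width=2):
    print(f"{left:>12}  {hit}  {right}")
```

As with the Concordance tab, the search term has to be supplied by the user; the function only shows where and how that term is used.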
These tools above will let you see the context that your chosen words or phrases are used in for the works in your corpus. The tools in the next module Clusters/N-grams will let you know which phrases are most popular so that you might gather new phrases to search for.