Creating and Using Custom Corpora
For languages without built-in support, or when you prefer to work from your own corpus, you can create a custom corpus file for use with Jiwar.
Custom Corpus Requirements
File Format: CSV or Excel (.xlsx)
Minimum Required Columns:
- word: The word in the target language
Optional Columns for Frequency Information:
- frequency_*: Columns starting with “frequency_” will be used for neighborhood frequency calculations
Steps to Create a Custom Corpus
Prepare your word list in the target language.
Create a CSV or Excel file with at least a ‘word’ column.
If available, add frequency information in columns starting with ‘frequency_’.
Save the file in the data/corpus/user_loaded/ directory of your Jiwar installation.
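The steps above can be sketched in Python using only the standard library. This is an illustration, not part of Jiwar: the helper name write_corpus and the file name my_corpus.csv are hypothetical, and the per-million column is computed from the counts you supply, which only makes sense if those counts cover your whole corpus.

```python
import csv
from pathlib import Path


def write_corpus(rows, path):
    """Write (word, count) pairs to a CSV with 'word' and 'frequency_*' columns.

    Hypothetical helper for illustration; not part of Jiwar.
    """
    path = Path(path)
    # Create the target directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    total = sum(count for _, count in rows)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "frequency_count", "frequency_per_million"])
        for word, count in rows:
            # Scale the raw count to occurrences per million tokens.
            per_million = count / total * 1_000_000 if total else 0.0
            writer.writerow([word, count, round(per_million, 2)])
    return path


if __name__ == "__main__":
    # 'my_corpus.csv' is an assumed file name; run this from the Jiwar root.
    write_corpus(
        [("apple", 1000), ("banana", 800), ("cherry", 600)],
        "data/corpus/user_loaded/my_corpus.csv",
    )
```

Writing with encoding="utf-8" keeps non-ASCII words intact, which matters for most languages that lack built-in support.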
Example Custom Corpus Structure
word,frequency_count,frequency_per_million
apple,1000,50.5
banana,800,40.2
cherry,600,30.1
...
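Before handing a file to Jiwar, it can help to check that it matches the structure above. The sketch below is a standalone check under the stated requirements (a 'word' column, optional columns starting with 'frequency_'); the function inspect_corpus is hypothetical and does not reflect how Jiwar itself loads files.

```python
import csv
import io


def inspect_corpus(text):
    """Return (frequency_columns, row_count) for CSV text.

    Raises ValueError if the required 'word' column is missing.
    Hypothetical helper for illustration; not part of Jiwar.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    if "word" not in columns:
        raise ValueError("corpus needs a 'word' column")
    # Columns with this prefix are the ones used for frequency calculations.
    freq_cols = [c for c in columns if c.startswith("frequency_")]
    # Count only rows that actually contain a word.
    rows = [r for r in reader if (r["word"] or "").strip()]
    return freq_cols, len(rows)


example = """word,frequency_count,frequency_per_million
apple,1000,50.5
banana,800,40.2
cherry,600,30.1
"""

print(inspect_corpus(example))
# (['frequency_count', 'frequency_per_million'], 3)
```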
Using a Custom Corpus with Jiwar
Place your custom corpus file in the data/corpus/user_loaded/ directory.
Run Jiwar as usual:
python jiwar.py
When prompted, enter the name of your custom corpus file.
Proceed with your analysis as normal.
Note: If a built-in corpus is available for your chosen language, Jiwar will ask if you want to use it or your custom corpus.
Tips for Custom Corpora
Ensure your corpus is representative of the language or specific domain you’re studying.
Larger corpora generally give more reliable neighborhood measures, because a neighbor that is missing from the corpus cannot be counted.
If you’re using a custom corpus for a language with a built-in corpus, consider comparing results to validate your custom corpus.
For languages without built-in IPA support, you may need to provide IPA transcriptions in your custom corpus for phonological and phonographic measures.
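To see why corpus size matters for the tip above, the sketch below counts single-letter substitution neighbors of a word in a small and a larger vocabulary. This uses one common definition of an orthographic neighbor and is an assumption for illustration; Jiwar's own measures may be defined differently, and the function name and word lists are made up.

```python
def substitution_neighbors(target, vocabulary):
    """Words of the same length that differ from target in exactly one position.

    Illustrative definition only; not taken from Jiwar.
    """
    return sorted(
        w
        for w in vocabulary
        if len(w) == len(target)
        and sum(a != b for a, b in zip(w, target)) == 1
    )


small = {"cat", "dog", "cot"}
large = small | {"bat", "hat", "cut", "car"}

print(substitution_neighbors("cat", small))
# ['cot']
print(substitution_neighbors("cat", large))
# ['bat', 'car', 'cot', 'cut', 'hat']
```

The same target word has one neighbor in the small vocabulary and five in the larger one, so a sparse corpus will tend to underestimate neighborhood size.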