Creating and Using Custom Corpora

For languages without built-in support or if you want to use your own corpus, you can create a custom corpus file for use with Jiwar.

Custom Corpus Requirements

  1. File Format: CSV or Excel (.xlsx)

  2. Minimum Required Columns: - word: The word in the target language

  3. Optional Columns for Frequency Information: - frequency_*: Columns starting with “frequency_” will be used for neighborhood frequency calculations

Steps to Create a Custom Corpus

  1. Prepare your word list in the target language.

  2. Create a CSV or Excel file with at least a ‘word’ column.

  3. If available, add frequency information in columns starting with ‘frequency_’.

  4. Save the file in the data/corpus/user_loaded/ directory of your Jiwar installation.

Example Custom Corpus Structure

word,frequency_count,frequency_per_million
apple,1000,50.5
banana,800,40.2
cherry,600,30.1
...

Using a Custom Corpus with Jiwar

  1. Place your custom corpus file in the data/corpus/user_loaded/ directory.

  2. Run Jiwar as usual: python jiwar.py

  3. When prompted, enter the name of your custom corpus file.

  4. Proceed with your analysis as normal.

Note: If a built-in corpus is available for your chosen language, Jiwar will ask if you want to use it or your custom corpus.

Tips for Custom Corpora

  • Ensure your corpus is representative of the language or specific domain you’re studying.

  • The larger the corpus, the more accurate the neighborhood measures will be.

  • If you’re using a custom corpus for a language with a built-in corpus, consider comparing results to validate your custom corpus.

  • For languages without built-in IPA support, you may need to provide IPA transcriptions in your custom corpus for phonological and phonographic measures.