Chinese Dictionary Implementation

The ChineseDict struct loads Chinese characters from dict.txt and generates random Chinese characters and strings from them.

Features

  • Load Chinese characters: Reads dict.txt and extracts every character in the range U+4E00 to U+9FFF (the CJK Unified Ideographs block)
  • Random character generation: Returns a single random Chinese character
  • Random string generation: Generates a string of random Chinese characters of a specified length
  • Character counting: Returns the total number of unique Chinese characters loaded
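
The features above can be illustrated with a minimal sketch. It is not the code in dict.go: the names sketchDict, newSketchDict, and randomString are hypothetical, and the sketch assumes that duplicates are dropped with a map and that characters are drawn uniformly with math/rand.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sketchDict is a hypothetical stand-in for ChineseDict; the real struct in
// dict.go may be laid out differently.
type sketchDict struct {
	chars []rune // unique characters, in first-seen order
}

// newSketchDict keeps each rune in U+4E00..U+9FFF once and ignores the rest.
func newSketchDict(text string) *sketchDict {
	seen := make(map[rune]bool)
	d := &sketchDict{}
	for _, r := range text {
		if r >= 0x4E00 && r <= 0x9FFF && !seen[r] {
			seen[r] = true
			d.chars = append(d.chars, r)
		}
	}
	return d
}

// randomString returns n characters drawn uniformly, with repetition allowed.
// It returns "" when the dictionary is empty or n is not positive.
func (d *sketchDict) randomString(n int) string {
	if len(d.chars) == 0 || n <= 0 {
		return ""
	}
	out := make([]rune, n)
	for i := range out {
		out[i] = d.chars[rand.Intn(len(d.chars))]
	}
	return string(out)
}

func main() {
	// In a real program the text would come from os.ReadFile("dict.txt").
	d := newSketchDict("你好, world! 你好世界")
	fmt.Println(len(d.chars))      // 4 unique characters: 你 好 世 界
	fmt.Println(d.randomString(5)) // five random characters from that set
}
```

Storing the unique characters in a slice keeps random selection a single index operation, while the map is only needed during loading.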

Usage

Basic Usage

// Create a new dictionary instance
dict, err := NewChineseDict("dict.txt")
if err != nil {
    log.Fatalf("Error loading dictionary: %v", err)
}

// Get a single random Chinese character
randomChar := dict.GetRandomCharacter()
fmt.Printf("Random character: %c\n", randomChar)

// Get a random string of 5 Chinese characters
randomString := dict.GetRandomString(5)
fmt.Printf("Random string: %s\n", randomString)

// Get the total number of characters in dictionary
count := dict.GetCharacterCount()
fmt.Printf("Total characters: %d\n", count)

Demo

Run the demo to see the functionality in action:

go run . -dict

This will display:

  • Total number of Chinese characters loaded
  • 10 random single characters
  • Random strings of different lengths (3, 5, 8, 10 characters)

Integration with ASR

The dictionary is automatically integrated with the ASR (Automatic Speech Recognition) functionality. When processing speech recognition results, the system will:

  1. Try to load the dictionary from dict.txt
  2. Use dictionary characters for more realistic Chinese character replacement
  3. Fall back to random generation if dictionary loading fails
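
The load-then-fall-back behaviour in the steps above could look roughly like the following sketch. It is not the code in asr.go: loadPool and pickReplacement are hypothetical helpers, and "random generation" is assumed here to mean picking a random code point from U+4E00 to U+9FFF.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
)

// loadPool returns the characters in U+4E00..U+9FFF found in the file at
// path, or nil if the file cannot be read. Hypothetical helper.
func loadPool(path string) []rune {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil
	}
	var pool []rune
	for _, r := range string(data) {
		if r >= 0x4E00 && r <= 0x9FFF {
			pool = append(pool, r)
		}
	}
	return pool
}

// pickReplacement draws from pool when it is non-empty. Otherwise it falls
// back to a random code point in U+4E00..U+9FFF (an assumed fallback).
func pickReplacement(pool []rune) rune {
	if len(pool) > 0 {
		return pool[rand.Intn(len(pool))]
	}
	return rune(0x4E00 + rand.Intn(0x9FFF-0x4E00+1))
}

func main() {
	pool := loadPool("dict.txt") // nil if dict.txt is missing or unreadable
	fmt.Printf("%c\n", pickReplacement(pool))
}
```

Returning nil on a load error lets the caller use one code path for both cases, so a missing dict.txt degrades the output rather than stopping the program.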

File Structure

  • dict.go - Main dictionary implementation
  • dict.txt - Source file containing Chinese characters
  • asr.go - ASR functionality with dictionary integration
  • main.go - Main application with demo functionality

Requirements

  • Go 1.16 or later (uses os.ReadFile)
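  • dict.txt file in the current working directory (the relative path "dict.txt" used in the examples resolves against the working directory, not the executable's location)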

Character Statistics

The current dict.txt contains 479,939 Chinese characters. The range U+4E00 to U+9FFF holds only 20,992 code points, so this figure presumably counts every occurrence in the file rather than unique characters.
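
A character total can be checked against the number of distinct characters with a short count. The sketch below is illustrative only; countHan is a hypothetical helper and is not part of dict.go.

```go
package main

import "fmt"

// countHan reports how many runes of text fall in U+4E00..U+9FFF in total,
// and how many of those are distinct. Hypothetical helper.
func countHan(text string) (total, distinct int) {
	seen := make(map[rune]struct{})
	for _, r := range text {
		if r >= 0x4E00 && r <= 0x9FFF {
			total++
			seen[r] = struct{}{}
		}
	}
	return total, len(seen)
}

func main() {
	fmt.Println(0x9FFF - 0x4E00 + 1) // size of the range: 20992
	total, distinct := countHan("你好你好世界")
	fmt.Println(total, distinct) // 6 4
}
```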