# Chinese Dictionary Implementation
This implementation provides a `ChineseDict` struct that loads Chinese characters from `dict.txt` and generates random Chinese characters and strings from them.
## Features
- **Load Chinese characters**: Reads `dict.txt` and extracts all Chinese characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)
- **Random character generation**: Get single random Chinese characters
- **Random string generation**: Generate strings of random Chinese characters with specified length
- **Character counting**: Get the total number of unique Chinese characters loaded
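
The features above can be illustrated with a minimal sketch. It is not the code in `dict.go`: the helper names `isCJK`, `extractUnique`, and `randomString` are hypothetical, and de-duplicating on load is an assumption based on the "unique" wording in the list. The sketch filters runes to U+4E00 to U+9FFF, keeps each character once, and draws random characters with replacement.

```go
package main

import (
	"fmt"
	"math/rand"
)

// isCJK reports whether r lies in the CJK Unified Ideographs block
// (U+4E00 to U+9FFF), the range named in the feature list.
func isCJK(r rune) bool {
	return r >= 0x4E00 && r <= 0x9FFF
}

// extractUnique returns the distinct CJK characters in text, in order of
// first appearance. Hypothetical helper, not taken from dict.go.
func extractUnique(text string) []rune {
	seen := make(map[rune]bool)
	var chars []rune
	for _, r := range text {
		if isCJK(r) && !seen[r] {
			seen[r] = true
			chars = append(chars, r)
		}
	}
	return chars
}

// randomString draws n characters uniformly, with replacement, from chars.
// It returns "" when chars is empty or n <= 0.
func randomString(chars []rune, n int) string {
	if len(chars) == 0 || n <= 0 {
		return ""
	}
	out := make([]rune, n)
	for i := range out {
		out[i] = chars[rand.Intn(len(chars))]
	}
	return string(out)
}

func main() {
	// A short literal stands in for the contents of dict.txt.
	chars := extractUnique("你好, world! 你好世界 123")
	fmt.Println(len(chars))             // number of distinct CJK characters
	fmt.Println(string(chars))          // the characters that were kept
	fmt.Println(randomString(chars, 5)) // five random characters
}
```

Iterating with `range` over a string yields runes rather than bytes, which is what makes the code-point comparison in `isCJK` valid for multi-byte UTF-8 input.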
## Usage
### Basic Usage
```go
// Requires the "fmt" and "log" packages to be imported.

// Create a new dictionary instance
dict, err := NewChineseDict("dict.txt")
if err != nil {
	log.Fatalf("Error loading dictionary: %v", err)
}

// Get a single random Chinese character
randomChar := dict.GetRandomCharacter()
fmt.Printf("Random character: %c\n", randomChar)

// Get a random string of 5 Chinese characters
randomString := dict.GetRandomString(5)
fmt.Printf("Random string: %s\n", randomString)

// Get the total number of characters in dictionary
count := dict.GetCharacterCount()
fmt.Printf("Total characters: %d\n", count)
```
### Demo
Run the demo to see the functionality in action:
```bash
go run . -dict
```
This will display:
- Total number of Chinese characters loaded
- 10 random single characters
- Random strings of different lengths (3, 5, 8, 10 characters)
## Integration with ASR
The dictionary is automatically integrated with the ASR (Automatic Speech Recognition) functionality. When processing speech recognition results, the system will:
1. Try to load the dictionary from `dict.txt`
2. Use dictionary characters for more realistic Chinese character replacement
3. Fall back to random generation if dictionary loading fails
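
The load-then-fall-back behaviour in the steps above might look roughly like the sketch below. It does not reproduce `asr.go`: `loadChars` and `replacementSource` are hypothetical names, and choosing a uniformly random code point from U+4E00 to U+9FFF as the fallback is an assumption about what "random generation" means here.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
)

// loadChars reads the file at path and keeps every rune in U+4E00..U+9FFF.
// Hypothetical helper; it returns an error if the file cannot be read or
// contains no matching characters.
func loadChars(path string) ([]rune, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var chars []rune
	for _, r := range string(data) {
		if r >= 0x4E00 && r <= 0x9FFF {
			chars = append(chars, r)
		}
	}
	if len(chars) == 0 {
		return nil, fmt.Errorf("no Chinese characters found in %s", path)
	}
	return chars, nil
}

// replacementSource returns a function that yields replacement characters
// and a flag saying whether the dictionary was used.
func replacementSource(path string) (next func() rune, fromDict bool) {
	chars, err := loadChars(path)
	if err != nil {
		// Assumed fallback: any code point in the block, chosen uniformly.
		return func() rune {
			return rune(0x4E00 + rand.Intn(0x9FFF-0x4E00+1))
		}, false
	}
	return func() rune { return chars[rand.Intn(len(chars))] }, true
}

func main() {
	next, fromDict := replacementSource("dict.txt")
	fmt.Println("using dictionary:", fromDict)
	fmt.Printf("%c\n", next())
}
```

Returning a closure keeps the caller unaware of which source is in use, so the replacement code path is the same whether or not the dictionary loaded.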
## File Structure
- `dict.go` - Main dictionary implementation
- `dict.txt` - Source file containing Chinese characters
- `asr.go` - ASR functionality with dictionary integration
- `main.go` - Main application with demo functionality
## Requirements
- Go 1.16 or later (uses `os.ReadFile`)
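- `dict.txt` in the directory the program is run from (the relative path `dict.txt` resolves against the current working directory, not the location of the executable)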
## Character Statistics
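The current `dict.txt` contains **479,939** Chinese characters in total, providing a rich source for realistic random character generation. Because the U+4E00 to U+9FFF range holds only 20,992 code points, this figure necessarily counts repeated occurrences rather than distinct characters.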
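
To put a character count in context, it helps to separate total occurrences from distinct characters. The sketch below is illustrative only: `countCJK` is a hypothetical helper, not part of `dict.go`, and the sample string stands in for the file contents. It also prints the size of the U+4E00 to U+9FFF range, which is the upper bound on any distinct count.

```go
package main

import "fmt"

// countCJK returns how many runes of text fall in U+4E00..U+9FFF (total)
// and how many different ones there are (distinct).
func countCJK(text string) (total, distinct int) {
	seen := make(map[rune]struct{})
	for _, r := range text {
		if r >= 0x4E00 && r <= 0x9FFF {
			total++
			seen[r] = struct{}{}
		}
	}
	return total, len(seen)
}

func main() {
	// Number of code points in the range: 0x9FFF - 0x4E00 + 1 = 20992.
	fmt.Println(0x9FFF - 0x4E00 + 1)

	// Eight CJK characters in total, six of them distinct.
	total, distinct := countCJK("天天向上, 好好学习")
	fmt.Println(total, distinct) // prints "8 6"
}
```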