I think everyone my age who took a second language in elementary school remembers using Google Translate for their homework. And just about everyone has a funny story about Google Translate gone wrong. My version of the story involved the Clothing unit in Grade 7 French. Our homework assignment was to write a short composition on what clothes we liked to wear. At the time, I really liked tank tops. I handed in my assignment with something like, “J’aime porter des chars hauts.” Thanks, Google Translate. That makes no sense. What should’ve been “J’aime porter des débardeurs” actually read as “I like wearing high [army] tanks.” My French teacher was not fooled.
Fortunately, Google Translate has improved immensely over the years. Now if you type in “I like to wear tank tops,” it no longer makes the mistake it made for me in Grade 7. Google Translate is actually a really useful tool for those learning a language, and as of now it supports over 100 languages. While Google Translate supports Pinyin (Mandarin romanization), I found that it lacks something close to my heart: Cantonese romanization, or Jyutping.
Google Translate does do a great job of differentiating between traditional and simplified Chinese characters (for those who are not familiar, traditional characters are used for Cantonese, and for Mandarin in Taiwan, while simplified characters are used for Mandarin in China). Type “horse” in English, and you’ll be given “馬” if you choose traditional characters, and “马” if you choose simplified. However, for both traditional and simplified characters, the romanization given with the character is “Mǎ”, which is Pinyin. What we really should have is “馬, maa5” for Cantonese, and “马, mǎ” for Mandarin. In other words, Google Translate incorrectly uses Pinyin where a Cantonese learner needs Jyutping.
Cantonese isn’t supported by Google Translate
The use of Pinyin for both the traditional and simplified characters simply shows that by “Chinese,” Google means “Mandarin Chinese.” So actually, Cantonese isn’t supported at all. We don’t just want to add Jyutping; we want to add Cantonese as a supported language. Sadly, I don’t have a direct phone line to anyone at Google who could help me with this, so I came up with a workaround.
The best part about this is: by harnessing two websites, we don’t need any of our own data, and we don’t need to use any machine learning. This hack is very fast, and solves our problem. No neural networks required!
Baidu and Chineseconverter to the Rescue!
Baidu, which is pretty much China’s Google, has a great Baidu Translate tool that we can take advantage of. Not only does it support English to Mandarin Chinese translations, but it also supports English to Cantonese! Enter “horse” and you’ll get “馬”, but with no Jyutping to help a student actually pronounce it. We need another site for that: Chineseconverter.com. This site lets us enter traditional characters and get Jyutping as output. Finally, what we want! Now let’s write some code to do this for us.
Selenium to the Rescue!
Using Python and the Selenium package, we can automate the process of typing English text into Baidu Translate, copying the output as input into Chineseconverter.com, and finally taking the result as our final output.
We start by initializing the webdriver:
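The original snippet isn't reproduced here, so what follows is only a minimal sketch of what the initialization might look like. It assumes Selenium 4 with Chrome installed locally; the helper name `make_driver` and the `headless` flag are my own illustrative choices, not something the two websites or Selenium require.

```python
def make_driver(headless=False):
    """Return a Chrome WebDriver (hypothetical helper, assumes Selenium 4)."""
    # Imported inside the function so this file still loads on a machine
    # where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        # Run Chrome without opening a visible window.
        options.add_argument("--headless=new")
    # Recent Selenium 4 releases locate a matching chromedriver themselves;
    # older setups may need the driver binary on PATH.
    return webdriver.Chrome(options=options)
```

Calling `make_driver()` should open a Chrome window that the rest of the script can drive, and `driver.quit()` closes it when you're done.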
Next, we scrape! This part involved a bit of going into the source code of the two webpages and retrieving the necessary elements. I found that if I didn’t let the program “sleep” for 3 seconds after entering my English text into Baidu Translate, then Selenium wouldn’t find the output. Other than that, everything works well.
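The scraping code itself isn't shown here, so the following is a rough sketch of the two-step flow under some loud assumptions: the URLs and CSS selectors are placeholders that you would have to look up in each page's source (they are not the real ones), and the helper names `submit_and_read` and `english_to_jyutping` are hypothetical. The 3-second pause mirrors the sleep described above.

```python
import time

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS = "css selector"


def submit_and_read(driver, url, input_css, output_css, text,
                    submit_css=None, pause=3.0):
    """Open url, type text into the input box, wait, and return the output text."""
    driver.get(url)
    driver.find_element(CSS, input_css).send_keys(text)
    if submit_css is not None:
        # Some pages only convert after a button is clicked.
        driver.find_element(CSS, submit_css).click()
    time.sleep(pause)  # give the page time to render its result
    return driver.find_element(CSS, output_css).text


def english_to_jyutping(driver, english, baidu, converter, pause=3.0):
    """Chain the two sites: English -> Cantonese characters -> Jyutping."""
    characters = submit_and_read(driver, text=english, pause=pause, **baidu)
    jyutping = submit_and_read(driver, text=characters, pause=pause, **converter)
    return characters, jyutping


# Placeholder page descriptions: every value below is an assumption.
BAIDU = {"url": "https://example.com/baidu-translate",
         "input_css": "#input-placeholder",
         "output_css": "#output-placeholder"}
CONVERTER = {"url": "https://example.com/jyutping-converter",
             "input_css": "#input-placeholder",
             "output_css": "#output-placeholder",
             "submit_css": "#convert-button-placeholder"}
```

With a Selenium WebDriver instance in hand and the placeholders swapped for real values, `english_to_jyutping(driver, "horse", BAIDU, CONVERTER)` would return the characters and their Jyutping as a pair.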
Let’s look at how the program runs.
Enter your input: I love eating noodles
Traditional characters: 我钟意食面
Cantonese Jyutping: ngo5 zung1 ji3 sik6 min6
If we had used Google Translate with traditional Chinese characters, we would’ve gotten “我喜歡吃麵條”, which is not Cantonese: it gives us “吃, chī” for “to eat.” What we want is “食, sik6,” which is the Cantonese way of saying it. So this is great: it works!
Final Notes on the Run Time
Since we’re just automating the process of entering and extracting text from webpages, this takes a ridiculously long time to run (about 5 seconds). The lag gives a Cantonese student time to think about what their translation should be, but this is definitely too long to declare it a solution to the lack of Cantonese on Google Translate!
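Part of that time is the fixed sleep after typing into the page. One possible way to trim it, sketched below, is to poll for the output and return as soon as it shows up. This is not what the program above does; `wait_for_text` is a hypothetical helper, and Selenium's own `WebDriverWait` offers the same idea out of the box.

```python
import time

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS = "css selector"


def wait_for_text(driver, output_css, timeout=10.0, poll=0.25):
    """Poll the output element until it has non-empty text or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            text = driver.find_element(CSS, output_css).text.strip()
        except Exception:
            # Selenium raises NoSuchElementException while the element is
            # missing; caught broadly so this sketch needs no Selenium import.
            text = ""
        if text:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no output in {output_css!r} after {timeout} s")
        time.sleep(poll)
```

If the translation appears after one second, this returns after roughly one second instead of always waiting the full three, and it fails loudly if nothing ever appears.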
Closing Remarks, and Why Use Jyutping Instead of Just Learning the Characters?
Using Python and Selenium, it was quick and easy to come up with a way to translate from English to Cantonese Jyutping. I didn’t have to train any neural nets, or even get any training data; I just used what was already available. You may be asking yourself, though, why even do this? Why wouldn’t someone who’s learning to speak Cantonese also learn to read? Well, the problem with learning Cantonese, or any Chinese language, is that if you focus on learning to read, write, and speak all at once, you’ll be overwhelmed. In my opinion, the best way to learn these sorts of languages is to focus on speaking first; reading can come later. If your goal is to communicate with people (in person!), you won’t need to be able to read or write. Even young people in Chinese-speaking areas tend to prefer voice messages over texting, which adds even more to my argument that you don’t need to be able to read or write at first. Of course, it’d be nice, but you have to choose. It’s a trade-off.