Japanese Chinese tea web sites

In researching information for Babelcarp's database, I often run Web searches using Chinese characters. Typically you find vastly more hits (mainly mainland Chinese sites) this way than if you use the Pinyin name for a tea.

I've noticed often that a lot of hits will come from Japanese web sites. This isn't too surprising when you think about it: Japanese is written using (among other things) Chinese characters; why shouldn't Japanese people be interested in Chinese tea; and for those Japanese people who are interested in Chinese tea, why shouldn't they use Chinese characters to refer to them?[1]

One thing, though, puzzles me about these Japanese sites for Chinese teas: some of the teas they list can only be found on Japanese sites. If a tea really is Chinese, why wouldn't it be retrievable on some Chinese site? Here's an example. (This won't work, of course, if your Web browser has no access to Chinese characters.) On the site

formatting link

scroll down to the Jiangxi teas, where you'll find a tea whose Pinyin name (in the right-hand column) is zhou da tie cha. Search for it using the Chinese characters in the left-hand column. The results will be exclusively Japanese sites.

Anyone know what's going on here? Kuri?

/Lew

Lewis Perin

The charset=shift_jis on the web page indicates Japanese. Every two-byte pair is looked up in the Japanese font set, so the characters you see come from the Japanese fonts and not the Chinese ones. A given character may very well exist in the Chinese font set too, and vice versa, but the charset setting on the HTML page tells the browser where to look. Basically, non-Roman languages take two bytes to represent each character, plus a corresponding font set. For example, the cha character is 3567 in Japanese JIS and 1872 in simplified Chinese GB. The glyph looks the same in both, and the same goes for "zhou da tie cha" in Japanese JIS and Chinese GB: the glyphs look the same but the byte pairs do not. Google will find strings anywhere, and in your case they just happen to be on web pages whose charset indicates JIS.

It looks to me like you posted from Linux, which comes with international language support by default. In Windows you can optionally load the Unicode font set called CJK (for Chinese, Japanese, Korean); Unicode is the international standard meant to replace national language sets like JIS and GB.
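To make the numbers in this post concrete, here is a minimal Python sketch (an editorial illustration, not something posted in the thread). It assumes that 3567 and 1872 are the decimal row-cell positions of 茶 in JIS X 0208 and GB2312. The helper name `row_cell` is made up for the example. In the EUC form of both standards, each byte is the row or cell number plus 0xA0, so the position can be read back from the bytes.

```python
# Hypothetical helper: recover the decimal row-cell position of a character
# from its two-byte EUC encoding (each byte = row or cell number + 0xA0).
def row_cell(char: str, euc_codec: str) -> str:
    b1, b2 = char.encode(euc_codec)
    return f"{b1 - 0xA0:02d}{b2 - 0xA0:02d}"

CHA = "茶"

print(row_cell(CHA, "euc_jp"))   # position in the Japanese set (JIS X 0208)
print(row_cell(CHA, "gb2312"))   # position in the simplified Chinese set (GB2312)

# The byte pairs a page actually carries differ from charset to charset.
for codec in ("shift_jis", "euc_jp", "gb2312", "big5"):
    print(codec, CHA.encode(codec).hex())
```

The same character therefore sits at a different position, and is stored as different bytes, in each national charset.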

Jim


Space Cowboy

Yes, but it's still the same Unicode code point (33590, or 8336 in hex), which is why you get both .cn and .jp web sites if you Google for it.
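As a quick check of that arithmetic, and of why a search keyed on code points can match pages in either charset, here is a small Python sketch. The byte values are whatever Python's standard codecs produce for 茶. How Google actually indexes pages is not established anywhere in this thread, so treat the decode-then-compare step as an assumption.

```python
# 33590 decimal and 8336 hex name the same Unicode code point.
assert 33590 == 0x8336
cha = chr(0x8336)
print(cha)

# Bytes as they would appear on a Shift_JIS page and on a GB2312 page.
jp_bytes = cha.encode("shift_jis")
cn_bytes = cha.encode("gb2312")
print(jp_bytes.hex(), cn_bytes.hex())   # two different byte pairs

# Decoding each with its declared charset lands on the same code point.
jp_cp = ord(jp_bytes.decode("shift_jis"))
cn_cp = ord(cn_bytes.decode("gb2312"))
print(hex(jp_cp), hex(cn_cp))
```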

But Google, smart though it is, can't see the glyph; it can only see the codepoint in whatever encoding is there. I've run these through the Unihan database, and they're the Chinese codepoints that correspond to the Pinyin on the same line of the page.

BSD, actually, but I didn't post anything that wasn't ASCII.

Right, I use that a lot.

Thanks, Jim, for trying, but I don't see how this explains the phenomenon.

/Lew

Lewis Perin

kuri

They have no other characters anyway. It would be a pity to throw away the meaning and keep only a phonetic reading.

Isn't that the same for the different Chinese dialects?

In the first column, they give the kanji name of that tea (for Japan). In the second, they give the Japanese reading that people are supposed to use (real tea fans tend to know the pinyin actually used in China better than the Japanese reading). It is possible that some of the characters in the first column are used only in the Japanese name of that tea. One possibility is that the Chinese original uses a more simplified or more complicated character that is not on the list of kanji (characters used in Japan), so they replace it. Another is that they translated the Chinese meaning (here that would be the *rolled* thing) into Japanese, with different characters.

I have seen that many times. They obviously change certain names, and they don't say which ones... in the end the Japanese themselves believe that was the original Chinese. I suspect the pinyin on that list was added later, using the computer's automatic character conversion.

Also, Chinese sites about tea tend to be more basic and give very little information. Most of them exist only to sell tea. Probably fewer idle amateurs have access to the internet than in Japan. In Japanese, there are many more pages that aim at sharing some knowledge, and then the online shops copy from them... And you know what it's like on the internet: the first guy may have misspelled the name of a tea, 1000 others copy the mistake, and a new tea is invented. In this case, nobody says he or she has had the tea you've picked. None of these Japanese sites is selling that tea. They are listing all the green teas they have ever heard of.

Kuri

kuri

These lists of teas are given in Kanji (the Japanese character set), but Kanji uses a lot of simplified characters, which are the same simplified forms of Chinese used in the PRC. It is perfectly fine Chinese, and perfectly understandable. There is no switching of Chinese characters to muddle the meaning. But the encoding is charset=shift_jis, so it's in a Japanese encoding. Still, if you input these characters using a simplified Chinese IME, you can find them.

I just did a search in Chinese for that Zhou Da Tie Cha, and I found 199 matches. All are Chinese sources. Here is one example:

formatting link

The pinyin transliteration for these teas is mostly correct. But some of the teas do have mistakes in Pinyin.

That's not really true. There are many, many Chinese websites devoted to tea and to information about tea. Not all are tea vendor websites. In fact, most of the Chinese tea websites I visit don't sell tea at all. A lot of Chinese tea websites have very detailed information on many, many subjects. However, a lot of other tea websites just have garbage information too. And many tea websites copy their text from other websites, so what you get is many sites that contain exactly the same information.

I don't really know why you are getting only Japanese websites. Maybe your CJK IME is not set to Chinese.

niisonge

The *tie* (tetsu) used in Japanese is not the same *tie* used in the link you give. It's a different simplification. That could be enough to restrict the search to Japanese language pages.
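One way to see why the two simplifications could matter is to ask which charsets can write each form at all. The sketch below is an editorial illustration using the two forms Kuri quotes later in the thread. `encodable` is a hypothetical helper, and the answers are only as authoritative as Python's built-in codec tables.

```python
# Hypothetical helper: can this character be written in the given charset?
def encodable(char: str, codec: str) -> bool:
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

TIE_JP = "鉄"   # form used on the Japanese pages
TIE_CN = "铁"   # simplified form used on mainland Chinese pages

for codec in ("shift_jis", "gb2312", "utf-8"):
    print(codec, encodable(TIE_JP, codec), encodable(TIE_CN, codec))
```

According to Python's Shift_JIS codec, the mainland simplified form cannot be written on a Shift_JIS page, so such a page has to use the Japanese form, and a search for one form need not match the other.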

I've pasted the Chinese writing from this page and I got zero hits with Google. I have not restricted the search to any language (in theory, my browser preferences are French+English+Japanese+Chinese+Chinese). I need to do a search restricted to simplified Chinese to get something (I got 1000 matches).

It seems I don't get all the hits. Whatever I search for about tea, I get something like 10,000 hits in Japanese and only 1000 in Chinese. Maybe my browser is not impartial. Well, surely it isn't...

We didn't use IME in this story. Just copy and paste.

Kuri

kuri

In my browser, both characters show as exactly the same character. Must be just a configuration problem on your computer. There is a traditional Chinese form of the "tie" character. But Japanese don't use that one.

Google sucks. Forget about google. Why not use a Chinese search engine? Try Baidu:

formatting link

Try downloading NJ Star Communicator:

formatting link

You can then input Chinese into your browser. Copy paste just doesn't work very well, in my experience.

niisonge

By the way, if you use Baidu, you should get about 525 search results.

niisonge

There would be a problem if I saw both the same. That I confuse Chinese and Japanese is one thing, but I hope that my computer is more clever than I.

I don't know what *you* see (my post is in Unicode UTF-8). I get:

周打鉄茶 in Japanese
周打铁茶 in Chinese (simplified)

Baidu also finds pages in Japanese if you enter the first line.

Because of my level in Chinese... Maybe next year.

Most times I can't, but that's not a question of IME. I don't know when the Chinese character should be different from the Japanese one, and I don't know the pinyin. Well, there are dictionaries for that. I should get one. Also, I have a textbook that lists all the characters that differ between Japanese and Chinese. I should study...

Kuri

kuri

Only if the Chinese or Japanese websites use Unicode codepoints such as 8336. There are plenty of Chinese and Japanese sites that use charset=UTF-8. I'm not sure of the particulars, but you can also mix language sets on a webpage. I use Unicode strings for Google searches. I could get additional hits if I used JIS or GB strings, but I only track Unicode. On TaoBao I have to use GB strings. Ebay China uses Unicode. Babelfish doesn't accept Unicode strings.

The codepoints are Japanese and not Unihan which only accepts Unicode codepoints. You didn't run any Japanese codepoints from "zhou da tie cha" and get a valid hit on Unihan. At the minimum you would need Japanese JIS to Unicode codepoints. If anyone knows of a routine or website to do this let me know. You also don't plug in strings to Unihan just the 4 bit hex characters (0-9A-F) which represent each pair of ascii characters for a total of 16 bits.
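On the request for a routine that goes from JIS to Unicode code points: the following is a minimal sketch using only Python's built-in codecs, not a tool anyone in the thread mentions. The function name is hypothetical, and the example inputs assume 9283 and b2e8 are the Shift_JIS and GB2312 byte pairs for the tea character.

```python
# Hypothetical routine: turn hex byte values from a legacy CJK charset
# into Unicode code points, using only Python's built-in codecs.
def legacy_hex_to_codepoints(hex_bytes: str, charset: str) -> list[str]:
    text = bytes.fromhex(hex_bytes).decode(charset)
    return [f"U+{ord(ch):04X}" for ch in text]

# Assumed inputs: the Shift_JIS and GB2312 byte pairs for the tea character.
print(legacy_hex_to_codepoints("9283", "shift_jis"))
print(legacy_hex_to_codepoints("b2e8", "gb2312"))
```

The resulting four-digit hex values are in the form the Unihan database expects.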

I don't have JIS or GB or BIG5 loaded on this computer. The webpage you mentioned looks like gibberish. I also don't have Unicode loaded on this computer. Fortunately I can tell Unicode characters because MSIE indicates an "empty square". If I want to see the glyph, I insert the Unicode string into a routine which gives the character pair codepoints, which I then use in Unihan. This is the main reason I use Unicode. I previously posted a Zhongwen backdoor procedure using Unicode codepoints. I don't know of any Japanese or Chinese sites that let me do the same thing with their corresponding character pairs to see a glyph representation.

It's simple. The codepoints from any charset are different. I think you understand the character pairs that make up each non Roman language or the Unicode standard for all languages. Maybe there are some overlapping codepoints between JIS or GB or BIG5 meaning the same Glyph character but I haven't found that true for Unicode at least for tea terms. When you use cut and paste in Windows you keep intact the ascii character pairs for whatever language.

Jim

Space Cowboy

That's a good point. I suppose it applies for teas that aren't well known enough to have recognized names in one's own dialect.

/Lew

Lewis Perin

Thanks for finding this. That site's Tie character is different from the Japanese site's character, despite their being rendered with the same glyph. The Chinese site's character's Unicode codepoint is 38081, while the Japanese site's is 37444. When I search using the four characters I get 776 sites from Google.
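Those two numbers can be checked directly. The sketch below is an editorial illustration that compares the two spellings of the tea's name quoted elsewhere in the thread, character by character, and prints each code point in decimal.

```python
jp_name = "周打鉄茶"   # as written on the Japanese site
cn_name = "周打铁茶"   # as written on the Chinese site

# Walk both names in parallel and flag any position where they differ.
for i, (j, c) in enumerate(zip(jp_name, cn_name)):
    marker = "" if j == c else "  <-- differs"
    print(i, j, ord(j), c, ord(c), marker)

differing = [i for i, (j, c) in enumerate(zip(jp_name, cn_name)) if j != c]
print(differing)
```

Only the third character differs, which is enough for an exact-match search to treat the two names as different strings.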

/Lew

Lewis Perin

Thanks for the pointer. From where I browse, though, Google wins: 776 hits versus 497.

Are you sure you don't mean what that site calls Asian Explorer? The other products on the site all seem to have only *trial* versions available for free.

/Lew

Lewis Perin

But UTF-8 *is* Unicode. More pedantically, it's an encoding of Unicode. The codepoints exist at the abstract level of Unicode; the encodings, like UTF-8, mediate between that level and what you see in your browser. See

formatting link

for an explanation.
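A small Python sketch of that distinction, added as an illustration: the code point is one abstract number, while UTF-8 and UTF-16 are two different byte encodings of it.

```python
cha = "茶"
code_point = ord(cha)              # the abstract Unicode code point
utf8 = cha.encode("utf-8")         # one encoding of it
utf16 = cha.encode("utf-16-be")    # another encoding of the same code point

print(f"U+{code_point:04X}", utf8.hex(), utf16.hex())

# Both byte sequences decode back to the same character.
assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == cha
```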

JIS, GB, and Big5 are all parts of Unicode.

Do you mean Babelfish or Babelcarp? If it's the latter, and you want to try the alpha version that searches on Chinese characters, email me.

/Lew

Lewis Perin

You can download Asian Explorer if you want, but basically, it's just a cheaper version of Internet Explorer, just enhanced for Asian character sets. The copy/paste function is the most useful part of it.

But what I'm referring to is NJ Star Communicator - it's a CJK IME. The trial version is fully functional. It's supposedly only a 30 day trial. But it's still fully functional way beyond the trial date.

If you use NJ Star Communicator, it will automatically display the characters on the web page in whatever character format you set the software to load - GB, Big5, EUC, etc. So for me, traditional Chinese web pages (doesn't matter if encoded in Big5 or unicode or utf-8) all get loaded into simplified Chinese. If I want to change to traditional Chinese, then I change language settings.

And inputting characters into a web search using say, GB will also yield results in Big5, EUC, Chinese UTF simplified, Chinese UTF traditional, Japanese Shift-JIS, Japanese UTF8, etc. So all of this encoding stuff is really a moot point if you use a CJK IME.

If you know the Pinyin, you can find the character easily. If you are unsure of the Pinyin, you can also switch to English to Chinese input.

niisonge

I don't know what *you* see (my post is in Unicode UTF-8). I get:

周打鉄茶 in Japanese
周打铁茶 in Chinese (simplified)

I see what you mean there. Changing my settings to Japanese UTF 8 shows 2 different characters. The Japanese "tie" is the simplified character. And the Chinese "tie" is the traditional character. Switching to Chinese UTF Traditional also shows the last character as Chinese traditional. Switching to Chinese UTF Simplified shows both "tie" characters as Chinese simplified text.

I think if you want to search Chinese PRC websites, you better switch to Chinese Simplified. I learned that years ago. So I never have any problems. Of course, it meant I had to learn Chinese simplified characters along the way.

niisonge

Agreed UTF-8 is Unicode. Anytime you use the codepoint 8336 it means tea only if you find websites with charset=UTF-8. As I said before the tea codepoint for GB2312 is 1872, BIG5 AFF9, JIS 3567. So if the webpage said charset=JIS you would use 3567 to find the glyph meaning tea which is the reason you would only see Japanese websites. You won't see the Japanese websites using charset=UTF-8 or if you did there is a Unicode glyph for 3567 but not for tea. If I come across a webpage that says charset=UTF-8 and want to see the glyphs in my browser I load the MS Unicode CJK codeset. GB2312, BIG5, JIS have their codepoints and glyphs. Any specific codepoint only has meaning if you know what charset it uses to look up the glyph. As an aside I've been checking the Chinese webpages mentioned in this thread by you and the html says charset=GB2312. You indicate you derived a Unicode codepoint which I assume came from the webpage contents. I don't see how. That is only valid if charset=UTF-8.
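One point here can be shown without taking sides in the disagreement: raw bytes only mean something once you know which charset to read them with. The sketch below is an editorial illustration that takes the bytes a Shift_JIS page would carry for 茶 and reads them once with the declared charset and once with the wrong one.

```python
raw = "茶".encode("shift_jis")   # the bytes a Shift_JIS page would carry

as_declared = raw.decode("shift_jis")
# Misreading the same bytes as GB2312; errors="replace" substitutes U+FFFD
# for anything that is not valid in that charset instead of raising.
as_wrong = raw.decode("gb2312", errors="replace")

print(repr(as_declared), repr(as_wrong))
```

Once the bytes have been decoded with the declared charset, though, what comes out is an ordinary Unicode character, whatever the page's charset was.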

In what sense? They use different codepoints for language glyphs. You couldn't tell what codepoint produced the tea glyph if it exists in any of the language packs. Every scriptable language on Earth is part of Unicode, or that is the intent. There are language sets that exist only in Unicode because the computer linguists know of some isolated language group that hasn't seen a computer, but whose members could communicate with each other in Unicode when the Internet arrives.

It's AltaVista Babelfish. I would expect at the minimum to use Unicode strings to search your site. I'm not talking about the derived hex codepoints. As I said before, there is a mapping from the normally used codepoints in the CJK language packs to Unicode. If you could find the routine, if it exists, then internally you would store Unicode while accepting any external language pack characters in CJK or the default Unicode. It would be just as easy to display back in the language packs' codepoints.

Jim

PS: One doesn't care about different codepoints in language packs if you see the expected glyph. It is important because some Japanese website might be talking about Chinese teas using charset=JIS codepoints.

Space Cowboy

I can't believe NJSC will never expire. I have an old computer on which I load trial-dated software. I just reset the date if I want to use the software. Some of the products know about this, so they put a semaphore entry in the Registry. You simply keep track of the software's changes to the Registry, before and after, on the date of load versus the date of expiration. Or there is some mysterious hidden file entry you need to find. I'd love to find any routine that allows me to go from CJK language packs to Unicode.

Jim


Space Cowboy

Na, everyone and their grandmother spends time on the net in China now. Some people are failing out of school because of QQ, a Chinese chat program (they stole the code from ICQ). Anyway, it's the Chinese business style to give as little information about their products as possible to confuse the consumer. You can not imagine how many "ten fu" tea shop copies there are around here...

Mydnight
