April 09, 2007

Google admits word database came from third party

Admission comes as Google faces a deadline to stop allegedly infringing on Sohu.com's copyrights

Faced with mounting questions over similarities with a rival's software, Google on Sunday acknowledged that a dictionary of Chinese words used with one of its recently released software tools came from a third party. The statement came as Google faces a looming deadline to stop downloads of the software and issue an apology.

Google's Pinyin Input Method Editor (IME) "was built leveraging some non-Google database resources," Google China spokeswoman Cui Jin wrote in an e-mail response to questions. The IME allows users to enter Chinese characters by typing their Pinyin romanization equivalents.

"We are willing to face this issue of ours," Cui wrote. She did not describe the database or where it came from.

The admission comes as Google faces a deadline from Sohu.com to stop allegedly infringing on its copyrights. On Friday, the Chinese Internet company gave Google until Monday to stop downloads of its IME software and issue an apology. Sohu also wants compensation from Google. At the time of writing, Google's software remains available online.

Cui did not respond to questions concerning Sohu's letter.

Google's Pinyin IME bears an uncanny resemblance to Sohu's Sogou Pinyin IME, which draws on a database of popular search queries from Sohu's Sogou search engine to suggest characters that match the Pinyin entered by a user.
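To illustrate the general idea of ranking candidate words by search popularity, here is a minimal sketch in Python. It is not Sohu's or Google's implementation, which neither company has published; the class name `PinyinIndex`, its methods, and the query counts are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class PinyinIndex:
    """Hypothetical toy index: Pinyin string -> candidate words with a popularity count."""

    def __init__(self):
        # Maps a Pinyin string to {word: query_count}.
        self._by_pinyin = defaultdict(dict)

    def add(self, pinyin, word, query_count):
        """Register a word under its Pinyin with an (assumed) search-query count."""
        self._by_pinyin[pinyin][word] = query_count

    def suggest(self, pinyin, limit=5):
        """Return up to `limit` words for the Pinyin, most frequently queried first."""
        candidates = self._by_pinyin.get(pinyin, {})
        # Sort by descending count; break ties by the word itself for stable output.
        return sorted(candidates, key=lambda w: (-candidates[w], w))[:limit]


index = PinyinIndex()
# The counts below are made up for the example, not real query statistics.
index.add("guge", "骨骼", 120)
index.add("guge", "谷歌", 500)
index.add("souhu", "搜狐", 300)

suggestions = index.suggest("guge")
print(suggestions)
```

In this sketch, two words that share the Pinyin "guge" are offered in order of their assumed query counts, so the more frequently searched word is suggested first.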

The dictionaries shipped with the Google and Sohu software shared several mistakes, in which Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.

These names were added to the Sohu dictionary solely for the convenience of the engineers and would not otherwise need to appear in the dictionary, said Wang Xiaochuan, Sohu's vice president of technology and head of the company's research and development center, in an interview over the weekend.

A review of the first version of Google's Pinyin IME conducted by Sohu's engineers revealed a dictionary of around 330,000 words and their Pinyin equivalents, including more than 300,000 entries that are identical to entries in Sohu's dictionary, Wang said.
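Sohu has not described how it compared the two dictionaries. One plausible way to measure this kind of overlap, sketched below in Python, is to count the entries whose word and Pinyin both match. The function name `shared_fraction` and the toy entries are assumptions made for illustration; only the Feng Gong "pinggong" error and the 330,000 and 300,000 figures come from the reporting.

```python
def shared_fraction(candidate, reference):
    """Fraction of (word, pinyin) entries in `candidate` that appear identically in `reference`."""
    if not candidate:
        return 0.0
    shared = sum(
        1 for word, pinyin in candidate.items() if reference.get(word) == pinyin
    )
    return shared / len(candidate)


# Toy dictionaries (hypothetical apart from the reported "pinggong" error for Feng Gong).
reference = {"冯巩": "pinggong", "谷歌": "guge", "搜狐": "souhu", "拼音": "pinyin"}
candidate = {"冯巩": "pinggong", "谷歌": "guge", "搜狐": "souhu", "词典": "cidian"}

overlap = shared_fraction(candidate, reference)
print(overlap)

# Lower bound implied by the reported figures: 300,000 identical entries out of about 330,000.
reported = round(300_000 / 330_000, 3)
print(reported)
```

A shared error such as the "pinggong" entry counts as a match here, which is why identical mistakes are stronger evidence of a common source than identical correct entries. On the reported figures, more than about 91 percent of the entries in the first version matched.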

"We have never made this dictionary public or licensed anybody to use it," he said.

Google was slow to respond to questions over its dictionary late last week, even as it made changes to remove similarities with Sohu's Pinyin IME.

On Friday, Google released an updated version of its Pinyin IME that removed the names of the Sohu engineers from its dictionary. That update removed 600 words from the dictionary while adding just one, Sohu's Wang said. It did not, however, remove Pinyin errors, such as one mistake that required users to type the incorrect Pinyin -- pinggong -- to get the characters for the name of Feng Gong, an actor and comedian.

That error has been corrected in the latest version of Google's Pinyin IME, released on Sunday. "The new dictionary is now based on tens of thousands of entries Google's enormous search database has accumulated over the years," Cui wrote.

That claim was confirmed Monday by Sohu, which said the similarity between Google's dictionary and its own dictionary had fallen from 96 percent to 79 percent with the latest version of the software.

 

©1994-2009 Infoworld, Inc.