
InfoWorld

Data mining outside the firewall

Mine the Web

By Maggie Biggs
September 05, 2003
 

How do your company’s pricing models compare to those of competitors? Are customers making it to your site’s deep links or leaving shortly after visiting the home page?

In-depth answers to these questions can be found through mining the Web — that is, discovering and analyzing Web page content, descriptions found in Web documents, overall Web structure, and Web site usage and access patterns.

Web mining is an externally focused relative of business intelligence. Retrieving data from outside the firewall can be done via agent technology, by tapping into Web site logs, or by adding data retrieval methods to Web site applications. IT managers can turn to their existing data-mining tools to examine structured Web data and also use text-mining tools to examine unstructured Web data.

Eyeballs count. In setting up the Web-mining process, first define the business problem and the types of information desired. For example, with competition fierce for site visitors’ time and attention, link counts and page rankings can affect the number of page views and, ultimately, revenue, so it is worth comparing your company’s Web site with others on both measures. This data can be uncovered by mining search engine data, either via text-mining tools or through a data-mining wrappering strategy.

Analyze page weighting within your company’s sector to see which companies are most effectively drawing visitors and achieving high search-engine ranking. Then examine the content, site structure, and page layout of high- and low-ranking companies. Finally, consider taking a broader view, analyzing the Web as a whole and examining those sites that are the most effective in terms of traffic and page rankings.

Likewise, analyzing the structure of your Web pages can yield useful insights. Using available tools, you can analyze the number of links into and out of various content. And usually, the more links, the more useful the content.
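As a rough illustration of this kind of link analysis, the following Python sketch counts inbound and outbound links for each page in a small site map. The page paths and the `site_links` mapping are invented for the example; in practice a crawler or one of the available tools would supply the link data.

```python
from collections import Counter

# Hypothetical site map (an assumption for illustration):
# each page maps to the list of pages it links to.
site_links = {
    "/index.html": ["/products.html", "/pricing.html", "/about.html"],
    "/products.html": ["/pricing.html", "/index.html"],
    "/pricing.html": ["/index.html"],
    "/about.html": [],
}


def count_links(links):
    """Return (outbound, inbound) dictionaries of link counts per page."""
    # Outbound count is simply the number of targets listed for a page.
    outbound = {page: len(targets) for page, targets in links.items()}

    # Inbound count is how often a page appears as someone else's target.
    inbound = Counter()
    for targets in links.values():
        inbound.update(targets)

    # Pages that nobody links to should still show up with a count of 0.
    for page in links:
        inbound.setdefault(page, 0)
    return outbound, dict(inbound)


outbound, inbound = count_links(site_links)

# List pages from most to least linked-to.
for page in sorted(site_links, key=lambda p: inbound[p], reverse=True):
    print(f"{page}: {inbound[page]} links in, {outbound[page]} links out")
```

Pages with few inbound links are candidates for a closer look at how visitors are expected to reach them.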

Looking inside. Do visitors to your site hit the main page, but seldom go any deeper? Access trends can pinpoint a site structure that may need to be redesigned to increase traffic. The same tools and techniques used to mine outside the firewall can reveal how customers interact with your site. Analysis of this information might lead you to provide precise content dynamically, choose a tight or loose site structure, or opt for customized services, such as online customer representatives.

Web server logs can yield some of the information needed to perform usage and access analysis of your site. But additional data gathering with third-party tools or in-house scripting programs may be needed to capture enough elements to make the analysis useful.
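A minimal sketch of such an in-house script, in Python, is shown here. It assumes the server writes Common Log Format entries and treats each client address as one visitor, which is only a crude proxy for a real visitor or session. The sample log lines, the `pages_by_visitor` and `home_only_share` names, and the choice of `/` as the home page are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample entries in Common Log Format (invented data).
SAMPLE_LOG = """\
10.0.0.1 - - [05/Sep/2003:10:00:00 -0700] "GET / HTTP/1.0" 200 5120
10.0.0.1 - - [05/Sep/2003:10:00:30 -0700] "GET /products.html HTTP/1.0" 200 2048
10.0.0.2 - - [05/Sep/2003:10:01:00 -0700] "GET / HTTP/1.0" 200 5120
10.0.0.3 - - [05/Sep/2003:10:02:00 -0700] "GET / HTTP/1.0" 200 5120
10.0.0.3 - - [05/Sep/2003:10:02:45 -0700] "GET /pricing.html HTTP/1.0" 200 1024
"""

# Captures: client address, requested path, HTTP status code.
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3}) \S+'
)


def pages_by_visitor(log_text):
    """Map each client address to the set of pages it fetched successfully."""
    visits = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(3) == "200":
            visits[match.group(1)].add(match.group(2))
    return visits


def home_only_share(visits, home="/"):
    """Fraction of visitors who requested the home page and nothing else."""
    if not visits:
        return 0.0
    home_only = sum(1 for pages in visits.values() if pages == {home})
    return home_only / len(visits)


visits = pages_by_visitor(SAMPLE_LOG)
share = home_only_share(visits)
print(f"{share:.0%} of visitors never went past the home page")
```

A production version would likely need session handling, filtering of crawlers and image requests, and referrer data, which is where the additional data gathering comes in.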

Inside or out? Data gathering for Web-content mining can be handled in-house, but a fair number of service providers can also tackle the task and may offer the capability of notifying you when content changes. When large data sets are involved, consider using a service provider to reduce the data-gathering overhead on your network.

Quite a few commercial and open source tools exist to assist with Web-mining efforts. For example, NetGenesis from SPSS collects and analyzes Web data and transforms it into useful metrics, and QL2 Software’s WebQL includes a development interface, querying capabilities, and a deployment engine to extract the data needed.

Web mining extends data mining beyond the corporate walls. And including the Web in your mining strategy can improve your Web presence and increase your competitive intelligence.





 


 
Maggie Biggs is a senior contributing editor for the InfoWorld Test Center.
 
