Snip!t from collection of Alan Dix

QProber: Classifying and Searching "Hidden-Web" Text Databases
http://qprober.cs.columbia.edu/

Categories

/Channels/search

Full snip

Many valuable text databases on the web have non-crawlable
contents that are "hidden" behind search interfaces. Hence traditional search
engines do not index this valuable information. One way to facilitate access
to "hidden-web" databases is through commercial Yahoo!-like directories, which
organize these databases manually into categories that users can browse. Our
QProber system automates the classification of searchable text
databases (whether their contents are "hidden" or not) by adaptively
probing the databases with queries derived from document classifiers, without
retrieving any documents. A large-scale experimental evaluation over 130 real
web databases indicates that our technique produces highly accurate database
classification results using, on average, fewer than 200 queries of four words
or fewer to classify a database (TOIS'03 paper; SIGMOD'01 paper). Interestingly,
our technique is attractive for classifying even crawlable text databases
(i.e., databases whose contents are not "hidden"), as long as search interfaces
for the databases are available (IEEE Data Engineering Bulletin'02 paper).
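
The probing idea described above can be illustrated with a small sketch. This is not the QProber implementation: it is a simplified, flat (non-adaptive) version that sends short conjunctive queries to a search interface, reads back only the number of matches, and assigns the categories that account for a large enough share of all matches. The names `PROBES`, `ToyDatabase`, `count_matches`, `classify`, and the `min_share` threshold are all hypothetical, and the probe queries are hand-written here rather than derived from a trained document classifier.

```python
from collections import Counter

# Hypothetical probe queries per category. In the description above these
# would be derived from a document classifier; here they are hand-written.
PROBES = {
    "Health": [("cancer", "treatment"), ("diabetes",)],
    "Sports": [("soccer", "league"), ("basketball",)],
}


class ToyDatabase:
    """Stand-in for a search interface that reports only a match count."""

    def __init__(self, documents):
        self._docs = [set(doc.lower().split()) for doc in documents]

    def count_matches(self, query_words):
        # Conjunctive query: a document matches if it contains every word.
        # No document is returned to the caller, only the count.
        return sum(1 for doc in self._docs if all(w in doc for w in query_words))


def classify(db, probes, min_share=0.5):
    """Return (categories whose share of all matches >= min_share, counts)."""
    counts = Counter()
    for category, queries in probes.items():
        for query in queries:
            counts[category] += db.count_matches(query)
    total = sum(counts.values())
    if total == 0:
        return [], counts
    chosen = [c for c in probes if counts[c] / total >= min_share]
    return chosen, counts


db = ToyDatabase([
    "new cancer treatment trial results",
    "diabetes diet advice",
    "cancer screening and treatment options",
    "local soccer league scores",
])
labels, counts = classify(db, PROBES)
print(labels, dict(counts))
```

An adaptive variant would use the counts from one round of probes to decide which subcategories to probe next, rather than issuing every query up front.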