Yogita R. Chavan


Web data extraction is a popular research area that aims to extract useful information from web pages. The extracted information is then stored in a database, which allows faster access to the data. Several techniques have been proposed and used in the past; some operate at the record level, while others operate at the page level. This paper presents work on extracting useful information from web pages using the concepts of tags and values. A DOM tree is constructed from the source code of the page (that is, its HTML code). Data regions are formed from similar nodes in the tag tree. The data regions formed in this step are then checked against one another for similarity and merged if they are found to be similar. To avoid discarding a non-matching first node that carries non-auxiliary information in a data region, a method is proposed that keeps a copy of that node so that it can later be added to the database.
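The first two steps of the pipeline described above (building a tag tree from HTML and grouping similar adjacent nodes into data regions) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Node`, `TagTreeBuilder`, `parse_html`, `tag_sequence`, `similar`, `find_data_regions`, and `values` are hypothetical, the similarity measure (a `difflib.SequenceMatcher` ratio over flattened tag sequences) and the 0.8 threshold are assumptions chosen for illustration, and the merging of data regions and the handling of a non-matching first node are left out.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One node of the tag tree: a tag name, its children, and its own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML source (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse_html(source):
    builder = TagTreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def tag_sequence(node):
    """Flatten a subtree into its tag names in document order."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def similar(a, b, threshold=0.8):
    # Assumed similarity measure: ratio of matching tags in the two sequences.
    ratio = SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()
    return ratio >= threshold


def find_data_regions(node, threshold=0.8):
    """Return runs of two or more adjacent, similar sibling nodes."""
    regions, run = [], []
    for child in node.children:
        if run and similar(run[-1], child, threshold):
            run.append(child)
        else:
            if len(run) >= 2:
                regions.append(run)
            run = [child]
    if len(run) >= 2:
        regions.append(run)
    for child in node.children:
        regions.extend(find_data_regions(child, threshold))
    return regions


def values(node):
    """Collect the text values of a subtree (own text first, then children)."""
    out = list(node.text)
    for child in node.children:
        out.extend(values(child))
    return out
```

Under these assumptions, calling `find_data_regions(parse_html(source))` on a page whose list items share the same tag structure would return those items as one region, and `values` applied to each item would give the row of values to store in the database.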

