Updates from 22/09/14 to 05/10/14

Following up on the Python program activity plan:

  1. Crawled URLs should not be crawled again
  2. Normalisation, e.g. converting "Mr." to "Mister" and "30" to "thirty"

I have achieved the two activities set out above. However, the normalisation is still incomplete, in that not all texts have been normalised. More in-depth normalisation should follow in the next update.

Work completed from 22/09/14 to 05/10/14

The code has been cleaned up and split into functions grouped by the tasks they perform, for a more focused structure. For instance, an XML Handler package was created to hold the XML-related code. This makes the code more readable and modular.

Instead of using nested for loops to iterate through the dates, I have opted for Python's built-in datetime module, which handles the date iteration correctly with just one while loop.
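A minimal sketch of this kind of date iteration, assuming a hypothetical crawl_date() helper that fetches one day's archive page (the names are illustrative, not the actual ones in the program):

```python
from datetime import date, timedelta

def crawl_range(start, end):
    current = start
    while current <= end:            # single loop instead of nested day/month/year loops
        crawl_date(current)          # hypothetical helper: fetch the archive page for this date
        current += timedelta(days=1)

# e.g. crawl_range(date(1998, 1, 1), date.today())
```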

Crawled articles are also no longer crawled again. The program can skip articles it has already crawled and pick up from where it last stopped.
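One way this can be done is sketched below, under the assumption that visited URLs are kept in a plain text file; the file name and helper functions are illustrative, not the actual implementation:

```python
VISITED_FILE = "visited_urls.txt"    # assumed location of the visited-URL list

def load_visited():
    try:
        with open(VISITED_FILE, encoding="utf-8") as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

def mark_visited(url, visited):
    visited.add(url)
    with open(VISITED_FILE, "a", encoding="utf-8") as f:
        f.write(url + "\n")

# In the crawl loop:
#     if url in visited: continue
#     ... crawl the article ...
#     mark_visited(url, visited)
```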

A 5-step normalisation process using Perl scripts (provided by Chong Tze Yuang) has also been integrated into the Python program.

The 5-step process is:

  1. Clean unwanted characters
  2. Expand abbreviations
  3. Split sentences
  4. Remove punctuation
  5. Append <s></s> sentence tags

Step one has been modified to also create the folder and the normalised text file for each month in the raw data folder.
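The Perl scripts can be chained from Python roughly as sketched below; the script file names are assumptions for illustration, not the actual names of the provided scripts:

```python
import subprocess

PERL_STEPS = [
    "01_clean_characters.pl",        # assumed file names, one per normalisation step
    "02_expand_abbreviations.pl",
    "03_split_sentences.pl",
    "04_remove_punctuation.pl",
    "05_append_sentence_tags.pl",
]

def normalise(in_path, out_path):
    current = in_path
    for i, script in enumerate(PERL_STEPS):
        target = out_path if i == len(PERL_STEPS) - 1 else "%s.step%d" % (out_path, i)
        with open(current, "rb") as src, open(target, "wb") as dst:
            # assuming each Perl script reads from stdin and writes to stdout
            subprocess.check_call(["perl", script], stdin=src, stdout=dst)
        current = target
```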

A new news website from Indonesia is also being crawled: http://sp.beritasatu.com/home/

Activity plan

The next update should see:

  1. More in-depth normalisation that includes converting numerals into actual words
  2. Beginning research into the accuracy of the crawled data and the building of language models

Approaching Milestones

Normalise after each month of crawling is done.

Issues

There are no issues to report.

Updates from 07/09/14 to 21/09/14

The coding language has been switched to Python, with an existing program to build on. Source code is available at https://bitbucket.org/ntudatamining/ntu-webcrawler

The new program writes the crawled data out as XML files.

The core of the crawler still works the same way as the C# version. Essentially, it is preferable to locate a source news site whose archives are reachable through URL parameters. For example, in http://www.utusan.com.my/utusan/search.asp?paper=um&dd=01&mm=01&yy=1998&query=Search&stype=dt the parameters dd, mm and yy can be specified from within our Python program.
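Building the archive URL for a given date is then a matter of string formatting; a minimal sketch using the example URL above (the helper name is illustrative):

```python
from datetime import date

BASE = ("http://www.utusan.com.my/utusan/search.asp"
        "?paper=um&dd={dd:02d}&mm={mm:02d}&yy={yy}&query=Search&stype=dt")

def archive_url(d):
    # zero-pad day and month so they match the dd/mm format the site expects
    return BASE.format(dd=d.day, mm=d.month, yy=d.year)

# archive_url(date(1998, 1, 1)) reproduces the example URL above
```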

XPaths such as doc.xpath('//div[@class="search"]//ul//a') still need to be specified, as with the previous program, depending on the website we want to crawl.
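A sketch of how such an XPath might be used together with the requests and lxml libraries; only the XPath is taken from the example above, the surrounding details are assumptions:

```python
import requests
from lxml import html

def article_links(url):
    page = requests.get(url)
    doc = html.fromstring(page.content)
    # Each <a> element under the search results list points to one article
    return [a.get("href") for a in doc.xpath('//div[@class="search"]//ul//a')]
```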

Work completed from 07/09/14 to 21/09/14

Familiarising myself with the Python language and the use of certain libraries was required during the initial few days. The following libraries were installed:

  1. PyDev for Eclipse
  2. urllib3
  3. lxml
  4. requests

As the program was written in Python 2.x, I decided to upgrade it to Python 3.4 to make use of the newer features in Python. Adjustments to the code were required when upgrading from Python 2.x to Python 3.4.
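The adjustments were of the usual Python 2 to 3 kind; the snippet below illustrates typical changes generically and is not the project's actual diff:

```python
import urllib.request                 # Python 2's urllib2 became urllib.request in Python 3

url = "http://www.utusan.com.my/"
print("crawling", url)                # print is a function in Python 3, not a statement
page = urllib.request.urlopen(url)
text = page.read().decode("utf-8")    # read() returns bytes in Python 3, so decode explicitly
```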

Other major updates to the program include:

  1. Using xml.etree.ElementTree to modify XML files instead of opening them as plain text
    • By doing so we can accurately and easily control where certain tags or content are added (e.g. keeping all articles of the same category inside the same <category> tag)
  2. Also using xml.etree.ElementTree to read XML tags into variables instead of using readlines
  3. Allowing the end date to be left unspecified, in which case it defaults to today's date
  4. Adding a function to indent XML files to make them more readable (see the sketch after this list)
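A hedged sketch of the ElementTree handling and indentation helper described above; the tag names and function names are illustrative, not necessarily those in the actual program:

```python
import xml.etree.ElementTree as ET

def add_article(xml_path, category, title, body):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    # Reuse the existing <category> element of the same kind, or create one
    cat = root.find("category[@name='%s']" % category)
    if cat is None:
        cat = ET.SubElement(root, "category", name=category)
    article = ET.SubElement(cat, "article")
    ET.SubElement(article, "title").text = title
    ET.SubElement(article, "body").text = body
    indent(root)
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)

def indent(elem, level=0):
    # Recursively insert newlines and spaces so the written XML is readable
    pad = "\n" + level * "  "
    if len(elem):
        if not (elem.text or "").strip():
            elem.text = pad + "  "
        for child in elem:
            indent(child, level + 1)
        if not (child.tail or "").strip():
            child.tail = pad
    if level and not (elem.tail or "").strip():
        elem.tail = pad
```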

Below is a comparison between an unformatted XML file and a formatted one:

[Screenshot: unformatted XML output]

[Screenshot: formatted XML output]

Activity plan

The next update should see:

  1. Crawled URLs should not be crawled again
  2. Normalisation, e.g. converting "Mr." to "Mister" and "30" to "thirty"

 

Approaching Milestones

Will be trying out other news websites.

 

Issues

There were issues with using urllib3 and with installing the lxml library. Some of the code was corrected to align with urllib3 conventions. Installation instructions (including lxml) have been written to ensure that the program can be installed correctly in future.

Updates from 25/08/14 to 06/09/14

Work completed from 25/08/14 to 06/09/14

The code has been redesigned so that it is more streamlined, with the ability to write into an SQL database.

The following is an overview of the project, with possible adjustments to be discussed in future:

[Diagram: Mining Flow]

 

The following was achieved during the period from 25th August to 6th September 2014:

  1. Redesigned the code to simplify the crawler
    • The code was redesigned for better compatibility with the MySQL database and to streamline the crawling process.
  2. Cleaned the articles by removing unwanted HTML tags (a sketch follows this list)
    • Standard regular expressions matching HTML tags were used to strip the unwanted markup, so that the extracted text is left with only the raw data of the article.
  3. Stored the data in a database
    • Instead of being written to text files, the raw data is now stored in a MySQL database.
    • The following 2 tables have been created to store the raw data:

    [Screenshot: database tables]

    • Note that there is another table named "visited_url". It is used to check whether a URL has already been crawled so that the crawler can skip it in future.
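The program at this stage was written in C#; the following is only an illustrative Python sketch of the regex-based tag stripping described in point 2, not the actual code:

```python
import re

TAG_RE = re.compile(r"<[^>]+>")       # matches any HTML tag
SPACE_RE = re.compile(r"\s+")         # collapses runs of whitespace left behind

def strip_html(raw_html):
    text = TAG_RE.sub(" ", raw_html)  # drop the tags, keep the text between them
    return SPACE_RE.sub(" ", text).strip()

# strip_html("<p>Berita <b>terkini</b></p>") -> "Berita terkini"
```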

The following 3 classes have been created so far; together they provide the functionality described above:

[Class diagram as of 06/09/14]

The table below describes their use.

Class: MainInterface.cs
Methods: Main
Description: This is the main interface of the program. It lets the user issue simple commands through the Command Prompt and configure the crawling process, such as the start date of the crawl or the directories to which the processing results are written.

Class: Functions.cs
Methods: WebQueryCrawl
Description: The crawling functions, such as the actual crawling of a website, are contained in Functions.cs.

Class: MySQL.cs
Methods: initConn, insertRawData, insertUrl
Description: Database connections and queries are located here.

Activity plan

The next update should see:

  1. A user interface with options to select actions to perform
  2. Crawled URLs should not be crawled again
  3. Configuration items such as the crawling start date and output directories
  4. Normalisation, e.g. converting "Mr." to "Mister" and "30" to "thirty"

 

Approaching Milestones

  1. There is a plan to determine the latest crawled date and warn the user when an earlier date is set to be crawled. An error should be raised, but an override option will be available so that the data can be re-crawled; in that case the old data will be replaced by the newly crawled data even if the two are similar.
  2. Crawled URLs are already stored in the database, and work is being done to use them to prevent duplicated data.

 

Issues

There are no issues to report.