Following up on the activity plan for the Python program:
- Crawled URLs should not be crawled again
- Normalisation, e.g. converting Mr. to Mister and 30 to thirty
I have achieved both of the activities set out above. However, the normalisation is still incomplete, in that not all text has been normalised yet. More in-depth normalisation should appear in the next update.
Work completed from 22/09/14 to 05/10/14
The code has been cleaned up and split into separate functions, each responsible for a single task, for a more focused configuration. For instance, an XML Handler package has been created to hold the XML-related code. This makes the code more readable and modular.
Instead of using nested for loops to crawl through the dates, I have opted for Python’s built-in datetime module, which steps through the dates correctly with just one while loop.
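The single-loop date traversal can be sketched as follows. This is only a minimal illustration, not the actual crawler code: the function name `iter_dates` and the archive URL pattern are assumptions made for the example.

```python
from datetime import date, timedelta


def iter_dates(start, end):
    """Yield every date from start to end (inclusive), one day at a time."""
    current = start
    while current <= end:
        yield current
        # timedelta handles month lengths, year ends and leap years,
        # so no nested year/month/day loops are needed.
        current += timedelta(days=1)


# Hypothetical URL pattern; the real site's archive scheme may differ.
for day in iter_dates(date(2014, 9, 29), date(2014, 10, 2)):
    print(day.strftime("https://example.com/archive/%Y/%m/%d"))
```

Because `timedelta` does the calendar arithmetic, the loop crosses month boundaries without any special cases.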
Crawled articles are also no longer crawled again. The program skips articles that have already been crawled and resumes from the point where the previous crawl stopped.
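One way to get this skip-and-resume behaviour is to keep a plain-text record of crawled URLs on disk. The sketch below assumes that approach; the class name `CrawlHistory` and the file name are hypothetical, and the actual program may track its progress differently.

```python
from pathlib import Path


class CrawlHistory:
    """Remember crawled URLs across runs using a one-URL-per-line text file."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen = set()
        # Reload URLs recorded by earlier runs, if any.
        if self.path.exists():
            self.seen = set(self.path.read_text(encoding="utf-8").splitlines())

    def is_crawled(self, url):
        return url in self.seen

    def mark_crawled(self, url):
        if url not in self.seen:
            self.seen.add(url)
            # Append immediately so an interrupted run loses no progress.
            with self.path.open("a", encoding="utf-8") as handle:
                handle.write(url + "\n")
```

A crawler would call `is_crawled(url)` before fetching an article and `mark_crawled(url)` once the article has been saved, e.g. with `history = CrawlHistory("crawled_urls.txt")`.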
A five-step normalisation process using Perl scripts (provided by Chong Tze Yuang) has also been integrated into the Python program.
The five steps are:
- Clean unwanted characters
- Expand abbreviations
- Split sentences
- Remove punctuation
- Append <s></s> sentence tags
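To make the order of the five steps concrete, here is a rough pure-Python sketch of the same pipeline. It is not the Perl scripts themselves: the regular expressions, the tiny abbreviation table and the `<s> ... </s>` output format are all assumptions made for illustration.

```python
import re

# Illustrative table only; the real scripts use their own abbreviation list.
ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor"}


def clean(text):
    """Step 1: replace characters other than word characters, whitespace and basic punctuation."""
    return re.sub(r"[^\w\s.,!?'-]", " ", text)


def expand_abbreviations(text):
    """Step 2: expand known abbreviations before sentence splitting."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


def split_sentences(text):
    """Step 3: split on whitespace that follows ., ! or ?."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def remove_punctuation(sentence):
    """Step 4: drop punctuation and collapse repeated spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", sentence).split())


def tag(sentence):
    """Step 5: wrap the sentence in <s> </s> tags."""
    return "<s> " + sentence + " </s>"


def normalise(text):
    text = expand_abbreviations(clean(text))
    sentences = (remove_punctuation(s) for s in split_sentences(text))
    return [tag(s) for s in sentences if s]
```

Expanding abbreviations before splitting matters in this sketch: otherwise the full stop in "Mr." would be treated as the end of a sentence.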
Step one has been modified to also create the folder and normalised text file for each month in the raw data folder.
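A minimal sketch of creating the per-month output location is shown below. The directory layout (`<raw root>/<year>/<month>/normalised/normalised.txt`) and the function name `write_normalised` are assumptions for the example; the actual folder and file names may differ.

```python
from pathlib import Path


def write_normalised(raw_root, year, month, lines):
    """Write normalised sentences into a folder inside the given month's raw data folder."""
    # Assumed layout: raw_root/YYYY/MM/normalised/normalised.txt
    month_dir = Path(raw_root) / f"{year:04d}" / f"{month:02d}" / "normalised"
    # parents=True creates missing year/month folders; exist_ok=True allows re-runs.
    month_dir.mkdir(parents=True, exist_ok=True)
    out_path = month_dir / "normalised.txt"
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```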
A new news website from Indonesia is also currently being crawled. The website is http://sp.beritasatu.com/home/
Activity plan
The next update should see:
- More in-depth normalisation, including the conversion of numerals to words
- Initial research into the accuracy of the crawled data and the building of language models
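As a starting point for the numeral conversion in the first item, the sketch below spells out English numbers from 0 to 999 (e.g. 30 to thirty). It is only an assumption of how this could be done: the function names are hypothetical, larger numbers, decimals and years are deliberately left untouched, and Indonesian text would need its own word lists.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("only 0 to 999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_numerals(text):
    """Replace standalone one- to three-digit integers; skip decimals and longer numbers."""
    pattern = r"(?<![\w.,])\d{1,3}(?!\w)(?![.,]\d)"
    return re.sub(pattern, lambda m: number_to_words(int(m.group())), text)
```

This step would need to run before punctuation is removed, since the pattern relies on the decimal point to recognise numbers such as 3.5.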
Approaching Milestones
Normalise after each month of crawling is done.
Issues
There are no issues to report.