sweemeng's tech adventure: August 2011

Wednesday, August 31, 2011

Many Ways To Grep File Content

So not too long ago I have posted on twitter

This spin to a few other way to do grep.

A few have suggested on IRC and facebook, the i parameter is to make keyword not case sensitive.

grep -iR keyword directory

Another suggestion on IRC.

grep -iR --exclude=file-to-ignore keyword directory

Another tweet i have receive is,

Then the last one I discovered on google is ack-grep

ack-grep keyword directory

and again, -i make case insensitive search.

ack-grep -i keyword directory

ack-grep output is nicer, and automatically ignore binary. It is slightly different than grep. But both get the job done., to me anyway

Monday, August 29, 2011

Python Dateutil Redux

Not too long ago, I covered one use of python dateutil, on the blog here.

The library itself is pretty nifty in other case as well. In this case date difference. While python datetime module in the standard library, the datetime.timedelta is used to find difference in date, it counts up to the days. In my case, I want to count it to years.

That is where dateutil comes it. It have a module called, relativedelta. Which do actually count to years. To use it is a matter of import and use it

from dateutil.relativedelta import relativedelta
date_diff = relativedelta(date_from,date_to)
print date_diff

It as you can see does count up to years, also months. Which is useful if you wanted to find difference in date beyond just days.

Python Web Scraping

There is time where there is information in govt website of is very useful, but unfortunately the data is in form of website, it could be worst as it can be in PDF. So it can be a pain if we wanted to use information for programming, but there is no API.

On the other python is a pretty powerful language. It comes with many library, include those that can be use to do HTTP request. Introducing urllib2, it is part of standard library. To use it to download data from a website can be done in 3 line of code

import urllib2
page = urllib2.urlopen("url")

The problem, then is you get a whole set of HTML, which a bit hard to process. Then python have a few third party library, the one I use is Beautiful Soup. Beautiful Soup is nice that it is very forgiving in processing bad markup in HTML. So you don't need to worry about bad format and focus to get things done. The library itself can also parse XML, among other thing.

To use Beautiful Soup,

from BeautifulSoup
import BeautifulSoup
page = "html goes here"soup = BeautifulSoup(page)
value = soup.findAll('div')
print value[0].text

But you need to get the html first don't you?

import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("url")
soup = BeautifulSoup(page)
value = soup.findAll('div')
print value[0].text

To use it, just download the data using urllib2 and pass to to beautiful soup. To use it is pretty easy, to me anyway. Though, urllib2 is going to be re organized in python 3. So code need some modification.

To see how the scraper fare, here is a real world example, in github part of a bigger project. But hey it is open source. Just fork and use it, in the this link.

So enjoy go forth and extract some data, and promise to be nice, don't hammer their server.