Blog of Roy Chan

One-line shell command to show the top referers of your WWW site

First, let me show you the big monster:

egrep 'HTTP/[0-9.]*" *[23][0-9][0-9]' access.log | \
  egrep -iv '\.(js|css|jpg|jpeg|png|gif)[" ]' | \
  sed -e 's/^.*HTTP\/[0-9.]\+"\?[[:space:]]\+[0-9]\+[[:space:]]\+\(-\|[0-9]\+\)[[:space:]]\+"\?//i' \
    -e 's/".*$//' | \
  egrep -v '^-' | egrep -v 'blog\.xychen\.org' | \
  sed -e 's/^http:\/\/[^\/]*\(\.google\.\|google\.pchome\|google\.sina\).*q=\([^&]*\).*$/[Google]:\2/i' \
    -e 's/^http:\/\/[^\/]*search\.yahoo\..*[^a-z0-9]p=\([^&]*\).*$/[Yahoo]:\1/i' \
    -e 's/^http:\/\/[^\/]*\.soso\..*[^a-z0-9]w=\([^&]*\).*$/[SoSo]:\1/i' \
    -e 's/^http:\/\/[^\/]*\.baidu\..*[^a-z0-9]wd=\([^&]*\).*$/[Baidu]:\1/i' \
    -e 's/^http:\/\/[^\/]*\.hisearch\.hinet\..*[^a-z0-9]k=\([^&]*\).*$/[Hinet]:\1/i' \
    -e 's/=/=3D/g' -e 's/%\([0-9A-F][0-9A-F]\)/=\1/gi' | recode /QP.. | \
  sort | uniq -c | sort -nr | head

Here is the output for my blog:

     15 [Google]:中文字型
     12 [Yahoo]:速成字碼表
      9 [Google]:Beryl
      9 [Google]:beryl
      8 http://planet.debian.org.hk/
      7 [Google]: scim
      6 [Google]:色弱
      6 [Google]:任性
      5 http://www2.shoutmix.com/?sidekick
      4 http://www2.cbox.ws/box/?boxid=1129960&boxtag=9722&sec=main

It shows me that 15 visits came to my site from searching for “中文字型” on Google, and 12 from searching for “速成字碼表” on Yahoo. You could write a program to analyze the referer entries in your WWW site log and get the same result, but here I just show how to do it with a one-line shell command. Sure, it might be better to rewrite it in Perl, since the one-liner involves too many utilities and has become too complex. But I have already finished it, so why not share it?

First, I filter out unsuccessful visits and requests that only fetch images, JavaScript code or stylesheets:

egrep 'HTTP/[0-9.]*" *[23][0-9][0-9]' access.log | \
  egrep -iv '\.(js|css|jpg|jpeg|png|gif)[" ]'
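To see what this filter keeps, here is a small sketch run on four made-up log lines. The sample lines are hypothetical and assume Apache's "combined" log format; I use grep -E, which is the same thing as egrep.

```shell
# Hypothetical sample lines in Apache "combined" format (made up for illustration).
sample_log=$(cat <<'EOF'
1.2.3.4 - - [12/Jan/2007:10:00:00 +0800] "GET /post/1 HTTP/1.1" 200 5120 "http://www.google.com/search?q=beryl" "Mozilla/5.0"
1.2.3.4 - - [12/Jan/2007:10:00:01 +0800] "GET /style.css HTTP/1.1" 200 900 "http://blog.xychen.org/post/1" "Mozilla/5.0"
1.2.3.4 - - [12/Jan/2007:10:00:02 +0800] "GET /missing HTTP/1.1" 404 300 "-" "Mozilla/5.0"
1.2.3.4 - - [12/Jan/2007:10:00:03 +0800] "GET /old HTTP/1.0" 301 0 "-" "Mozilla/5.0"
EOF
)

# Keep only 2xx/3xx responses, then drop requests for static files.
kept=$(printf '%s\n' "$sample_log" |
  grep -E 'HTTP/[0-9.]*" *[23][0-9][0-9]' |
  grep -Eiv '\.(js|css|jpg|jpeg|png|gif)[" ]')

# Only the /post/1 and /old requests survive.
printf '%s\n' "$kept"
```

Note that the second grep looks at the whole line, so a page request whose referer happens to end in ".css" or ".png" would be dropped as well.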

Then I need to dig the referer out of the log. I strip the characters before and after the referer:

sed -e 's/^.*HTTP\/[0-9.]\+"\?[[:space:]]\+[0-9]\+[[:space:]]\+\(-\|[0-9]\+\)[[:space:]]\+"\?//i' \
  -e 's/".*$//' access.log >referer.log
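A quick way to check the substitution is to feed it a single line. The sketch below wraps it in a function; the name extract_referer and the sample line are my own inventions for illustration, and I rewrote the expression with sed -E (dropping the "i" flag) so it does not depend on GNU sed extensions.

```shell
# Strip everything before and after the referer field.
# Assumes Apache "combined" format: "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
extract_referer() {
  sed -E \
    -e 's/^.*HTTP\/[0-9.]+"?[[:space:]]+[0-9]+[[:space:]]+(-|[0-9]+)[[:space:]]+"?//' \
    -e 's/".*$//'
}

# Hypothetical log line, made up for illustration.
line='1.2.3.4 - - [12/Jan/2007:10:00:00 +0800] "GET /post/1 HTTP/1.1" 200 5120 "http://www.google.com/search?q=beryl" "Mozilla/5.0"'

printf '%s\n' "$line" | extract_referer
# prints: http://www.google.com/search?q=beryl
```

The first expression eats everything up to and including the opening quote of the referer; the second cuts from the closing quote to the end of the line.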

Of course, I don't want to count referers from my own site or requests without a referer:

 egrep -v '^-' referer.log | egrep -v 'blog\.xychen\.org'

At this point you can already pipe the result to the old shell trick, “sort | uniq -c | sort -nr | head”, to show the top 10. However, being a crazy guy, I would like to group the referers from search engines (this part is complex and would be better handled in Perl):

sed -e 's/^http:\/\/[^\/]*\(\.google\.\|google\.pchome\|google\.sina\).*q=\([^&]*\).*$/[Google]:\2/i'  \
  -e 's/^http:\/\/[^\/]*search\.yahoo\..*[^a-z0-9]p=\([^&]*\).*$/[Yahoo]:\1/i' \
  -e 's/^http:\/\/[^\/]*\.soso\..*[^a-z0-9]w=\([^&]*\).*$/[SoSo]:\1/i' \
  -e 's/^http:\/\/[^\/]*\.baidu\..*[^a-z0-9]wd=\([^&]*\).*$/[Baidu]:\1/i' \
  -e 's/^http:\/\/[^\/]*\.hisearch\.hinet\..*[^a-z0-9]k=\([^&]*\).*$/[Hinet]:\1/i' referer.log
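Here is a cut-down sketch of the grouping with only the Google and Yahoo rules, run on three made-up referers. The function name group_search_referers and the sample URLs are hypothetical; I use sed -E without the "i" flag, so this version assumes lower-case host names.

```shell
# Rewrite search-engine referers as "[Engine]:keyword"; leave everything else alone.
group_search_referers() {
  sed -E \
    -e 's/^http:\/\/[^\/]*(\.google\.|google\.pchome|google\.sina).*q=([^&]*).*$/[Google]:\2/' \
    -e 's/^http:\/\/[^\/]*search\.yahoo\..*[^a-z0-9]p=([^&]*).*$/[Yahoo]:\1/'
}

# Hypothetical referers, made up for illustration.
printf '%s\n' \
  'http://www.google.com/search?hl=en&q=scim' \
  'http://search.yahoo.com/search?p=debian&ei=UTF-8' \
  'http://planet.debian.org.hk/' | group_search_referers
# prints:
# [Google]:scim
# [Yahoo]:debian
# http://planet.debian.org.hk/
```

Because ".*q=" is greedy and has no guard in front of it, the Google rule picks the last "q=" in the URL, including one that is the tail of another parameter name. The other rules avoid that with "[^a-z0-9]" before the parameter.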

Since I'm insane, I don't even want to see things like “%2F%e6”. Please give me the “real” characters.

sed -e 's/=/=3D/g' -e 's/%\([0-9A-F][0-9A-F]\)/=\1/gi' referer.log | recode /QP..
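To see what the sed half hands over to recode, here is a sketch of just that part. The function name uri_to_qp and the sample string are hypothetical, and I spelled the hex class as [0-9A-Fa-f] instead of using the "i" flag so it runs on any sed.

```shell
# Turn URI percent-escapes into Quoted-Printable escapes.
# A literal "=" is escaped first, because "=" is QP's own escape character.
uri_to_qp() {
  sed -e 's/=/=3D/g' -e 's/%\([0-9A-Fa-f][0-9A-Fa-f]\)/=\1/g'
}

# Hypothetical input, made up for illustration.
printf '%s\n' '[Google]:a=b%E4%B8%AD' | uri_to_qp
# prints: [Google]:a=3Db=E4=B8=AD
```

The order of the two expressions matters: if the "%" escapes were converted first, the "=" characters they produce would then be rewritten to "=3D" by the other rule. Quoted-Printable specifies upper-case hex digits, so whether a lower-case escape such as "%e6" gets decoded depends on the decoder being lenient.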

I convert the URI encoding to Quoted-Printable and decode it with recode. OK, now I can use the old shell magic to get the top 10. First I sort the lines so that identical referers are grouped together.

sort referer.log

Collapse the repeated lines, keeping one copy of each together with the number of times it was repeated.

sort referer.log | uniq -c

Sort the result by the number of repeats, largest first:

sort referer.log | uniq -c | sort -nr

Show only the top 10:

sort referer.log | uniq -c | sort -nr | head
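The four steps can be tried without a real log. In this sketch the function name top_referers and the six input lines are made up for illustration.

```shell
# Count identical lines and list the most frequent ones first (top 10).
top_referers() {
  sort | uniq -c | sort -nr | head
}

# Hypothetical referer list, made up for illustration.
printf '%s\n' \
  'http://planet.debian.org.hk/' \
  '[Google]:beryl' \
  'http://planet.debian.org.hk/' \
  '[Yahoo]:debian' \
  'http://planet.debian.org.hk/' \
  '[Google]:beryl' | top_referers
# The counts come out as 3, 2 and 1, in that order.
```

The first sort is needed because uniq only merges adjacent lines. Upper and lower case are counted separately, which is why "Beryl" and "beryl" show up as two entries in my output; on GNU and BSD tools, "sort -f | uniq -ci" should merge them if that is what you want.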

Known problems

  1. It is only a quick trick. I haven't optimized it yet. Any recommendations?
  2. It fails to show the search keywords with the correct encoding. Most search engines use UTF-8, but CN engines love GB2312 and some TW engines still seem to use stupid Big5. Ummm… I nearly forgot zh-autoconvert… Oops, zh-autoconvert can't distinguish UTF-8 from Big5 or GB2312.


Comments

  1. 任我行
    January 12th, 2007 | 23:42

    It makes me dizzy just looking at it! %-/

  2. January 13th, 2007 | 2:20

    It really does take some effort to understand! -P

Based on Fluidity© 1998-2007 Roy Hiu-yeung Chan