Data Hacks

by @jehiah on 2010-10-20 19:00UTC
Filed under: All, Programming, Python, Data

Data Hacks is a new library we developed at bit.ly: a set of command line tools to assist in data analysis.

We love the beauty of command line tools that read from stdin and write to stdout. These utilities do exactly that, and they help explore large data sets.

Included: a tool to calculate 95th percentile values, a histogram display, a tool to sample a percentage of stdin, and a tool to pass stdin to stdout for a set time period.
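
To give a feel for what these filters do, here is a minimal Python sketch. It is not the code from the library: the function names (percentile_95, sample, run_for), the nearest-rank percentile method, and the rng/clock parameters are all my own assumptions for illustration.

```python
import random
import time


def percentile_95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers.

    Assumption: nearest-rank is one common definition; the real tool
    may compute the percentile differently.
    """
    ordered = sorted(values)
    # ceil(0.95 * n) using integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def sample(lines, percent, rng=random):
    """Yield each line independently with probability percent/100."""
    for line in lines:
        if rng.random() * 100 < percent:
            yield line


def run_for(lines, seconds, clock=time.monotonic):
    """Yield lines until `seconds` have elapsed, then stop.

    The deadline is only checked when a new line arrives, so a quiet
    input stream can keep this waiting past the deadline.
    """
    deadline = clock() + seconds
    for line in lines:
        if clock() >= deadline:
            break
        yield line
```

Because each function takes an iterable of lines and yields lines, they chain the same way the command line tools do, for example sample(run_for(sys.stdin, 30), 10).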

For example, you can now run this on the fly to get a histogram of request response times over a 30 second period. (In my case, awk '{print $NF}' gets the last column of an access log, which holds the response time.)

$ tail -f access.log | awk '{print $NF}' | run_for.py 30s | sample.py 10% | histogram.py --min=0 --max=0.05 --buckets=20

# NumSamples = 6809; Min = 0.00; Max = 0.05
# 313 values outside of min/max
# Mean = 0.014075; Variance = 0.001441; SD = 0.037954
# each * represents a count of 34
0.0000 -     0.0025 [   404]: ***********
0.0025 -     0.0050 [  2595]: ****************************************************************************
0.0050 -     0.0075 [  1099]: ********************************
0.0075 -     0.0100 [  1056]: *******************************
0.0100 -     0.0125 [   476]: **************
0.0125 -     0.0150 [   403]: ***********
0.0150 -     0.0175 [   122]: ***
0.0175 -     0.0200 [    81]: **
0.0200 -     0.0225 [    37]: *
0.0225 -     0.0250 [    32]: 
0.0250 -     0.0275 [    25]: 
0.0275 -     0.0300 [    26]: 
0.0300 -     0.0325 [     6]: 
0.0325 -     0.0350 [    29]: 
0.0350 -     0.0375 [    12]: 
0.0375 -     0.0400 [    25]: 
0.0400 -     0.0425 [    10]: 
0.0425 -     0.0450 [    28]: 
0.0450 -     0.0475 [    13]: 
0.0475 -     0.0500 [    17]: 
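
The bucketing behind output like this can be sketched in a few lines of Python. Again, this is an illustration under my own assumptions rather than the code in histogram.py: the function names, the choice to count a value equal to the maximum in the last bucket, and the line_width of 75 characters are all hypothetical.

```python
def histogram(values, minimum, maximum, buckets):
    """Count values into equal-width buckets between minimum and maximum.

    Returns (counts, outside), where outside is the number of values
    that fell below minimum or above maximum.
    """
    width = (maximum - minimum) / buckets
    counts = [0] * buckets
    outside = 0
    for value in values:
        if value < minimum or value > maximum:
            outside += 1
            continue
        # Assumption: a value equal to maximum lands in the last bucket.
        index = min(int((value - minimum) / width), buckets - 1)
        counts[index] += 1
    return counts, outside


def render(counts, minimum, maximum, line_width=75):
    """Format bucket counts as text lines, one row of '*' per bucket."""
    width = (maximum - minimum) / len(counts)
    # How many values one '*' stands for (hypothetical scaling rule).
    scale = max(1, max(counts) // line_width)
    lines = ["# each * represents a count of %d" % scale]
    for i, count in enumerate(counts):
        low = minimum + i * width
        stars = "*" * (count // scale)
        lines.append("%.4f - %10.4f [%6d]: %s" % (low, low + width, count, stars))
    return lines
```

With this scaling rule, a largest bucket of 2595 gives 2595 // 75 = 34 values per star, and small buckets (fewer than 34 values) show no stars at all, which is the same pattern you see in the tail of the output above.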

For more information and examples see http://github.com/bitly/data_hacks

Update 2010/10/20: I’ve also added a utility to generate an ASCII bar chart.

Jehiah Czebotar