The company I’m working for is looking to move some of its server equipment to Amazon Web Services (AWS)-style infrastructure, and in doing so is also re-evaluating the products it uses to ingest and search enterprise log data. Log file analysis has long been my favorite category of enterprise software (chalk that up to my days long ago as a support engineer for Webtrends), so I’m naturally interested in the differences between Splunk (the preeminent do-it-yourself solution), Splunk’s own hosted Splunk Storm, and up-and-coming competition like Sumo Logic.
Following are a few points, from my own usage of the products, that compare the three. Note that my Splunk install is a relatively small one, with a 40 GB/day data ingestion rate, so the problems I have and the features I like will be quite different from those of a big site.
| Feature | Splunk (Self-Hosted) | Splunk Storm | Sumo Logic |
| --- | --- | --- | --- |
| Auto source typing | Knows the source type of your data, automatically parses it, and extracts fields. Nearly every log file type except the really obscure ones (like CQ5 dual-line request logs) is automatically parsed and its fields extracted by Splunk. | Same as self-hosted Splunk. | Can’t parse data by itself; you have to tell it how to parse the data before it can extract any fields. The Sumo Logic demo rep we had said this is coming as a feature at some point. |
| Interactive field extraction | Easy as heck to extract fields from unknown log types using the interactive field extractor tool. Makes it dead easy to do more complicated lookups and averages on new log types. | Same as self-hosted Splunk. | Couldn’t figure out how to do this with Sumo. |
| Scripted input from Unix boxes | In my opinion, one of Splunk’s biggest selling features. Splunk’s \*NIX app includes, out of the box, nifty scripted inputs that grab the output of top, ps, netstat, df, etc. and dump it into a parsable, graphable index you can use to make nifty CPU and network graphs for dashboards, search to see when a particular process was actually running on a machine, and so on. | Splunk Storm presently does NOT allow you to run apps, which is far and away the biggest reason it’s still sort of a toy compared to the self-hosted product. | You’d have to build this yourself in Sumo Logic, which is a LOT of work. |
| App ecosystem | Self-hosted Splunk gives you access to all of the nifty apps folks have made for parsing F5 data, Nagios data, S3 buckets, etc. | Splunk Storm doesn’t let you use apps. | They’re working on an app infrastructure, but it’s nowhere compared to Splunk’s five-year head start. |
| Graphing | Splunk has sexy graphing libraries that let you make radial gauges, marker gauges, area graphs, scatter graphs, all sorts of sexy ways to visualize data. | Same as self-hosted Splunk. | Bar graphs, line graphs, and that’s about it. Pretty bare-bones, though dashboarding is pretty easy to accomplish. |
| Integration with on-premise data | A single web search head can query multiple indexers, including things on F5s, CCTV prod, etc. A search head at Amazon could transparently include on-site data. | You can’t really do this with Storm. | Can’t do this with Sumo. |
| Data retention | You can retain as much as you have storage for. | You pay for data retention. | You pay for data retention. |
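To give a feel for how low the bar is for Splunk-style scripted inputs: any executable that writes lines to stdout can feed an index. As a rough sketch (this script and its field names are my own invention, not part of the \*NIX app), a minimal Python scripted input might emit load-average stats as key=value pairs, which Splunk auto-extracts at search time:

```python
#!/usr/bin/env python
# Sketch of a Splunk-style scripted input (hypothetical example; the
# real *NIX app ships its own scripts): emit one key=value event per
# run so the indexer can extract fields automatically.
import os
import time

def collect():
    # 1-, 5-, and 15-minute load averages (Unix only)
    one, five, fifteen = os.getloadavg()
    return {
        "time": int(time.time()),  # epoch seconds, so no spaces in the value
        "load_1m": one,
        "load_5m": five,
        "load_15m": fifteen,
    }

def format_event(stats):
    # key=value pairs are auto-extracted by Splunk at search time
    return " ".join("%s=%s" % (k, v) for k, v in sorted(stats.items()))

if __name__ == "__main__":
    print(format_event(collect()))
```

You would then register the script with a `[script://...]` stanza and an `interval` in inputs.conf, and Splunk runs it on schedule and indexes each line of output.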
There’s more, but this is just what I could think of off the top of my head.
I’m really curious to know what folks think of Sumo Logic, especially for those who’ve used Splunk in production as well.
Ever since Flickr rolled out the feature to upload videos as well as photos, Flickr has been the first place I put up videos. Why? Most videos I take are part of the same stream as my photos – part of a set of vacation shots, or usual daily uploads like this one – and I love being able to view them in the same context as the rest of my photos, instead of telling people to go to YouTube if they want to see video.
But at this point, the video codec Flickr is using is a straight-up embarrassment: the distractingly bad compression artifacts run utterly contrary to Flickr’s whole stated reason for rolling out video in the first place.
See the above video. Now, granted, it is basically a worst-case scenario for a video compression codec, as with the leaves and trees and movement, you basically have 95% of the pixels of the frame changing from frame to frame. But still, the video quality is so bad that it’s painful to watch. It’s like it was compressed with Sorenson Spark, circa 2005, and made to be streamed to a first-generation color-screen cell phone.
I’m no stranger to the compute requirements of implementing a proper codec, the storage requirements of such, and the banks of new equipment that will probably have to be added in order to make it a reality.
But that’s the price of playing in today’s market, and YouTube’s encoding looks a million times better, even if it still has some distracting blurring.
Photo quality is most of the reason I still keep all of my photos here on Flickr. But the fact that Picasa/G+ encodes all of its video through YouTube starts tipping the scales away from my favorite platform.
Can someone please help?
Baron Schwartz giving an Epic Talk on Benchmarking, a photo by tadnkat on Flickr.
Baron Schwartz from Percona gave an amazing talk on Benchmarking. As someone who’s always loved reading about benchmarks, but having been pretty terrible at producing them myself, I found this talk fascinating — especially after my recent experience with attempting to run a bunch of inconclusive benchmarks on JBoss 4.2 vs JBoss 5.1 performance.
Put simply, Baron Schwartz is a benchmarking GOD. Listen to what he says. Read his blog. This guy is benchmarking sanity personified.
Bullets from his talk:
- It’s important to establish goals for a benchmark, reasons why, legend, distribution, response time, etc – not just throughput
- One needs a lot of info to think clearly about a benchmark
- Ideal benchmark report:
- Clear benchmark goals:
- Validating hardware config (disk / cpu / etc) – see if it matches expectations
- Compare two systems
- Checking for regressions
- Capacity planning (how will it perform at higher load than you have?)
- Reproduce bad behaviour to solve it
- With most systems you don’t want to push as far as max throughput, as at that point you’re beyond the threshold of “good behaviour”.
- Stress test to find bottlenecks
- Get specs:
- Get specs for CPU, disk, memory, network, including makes/models/etc.
- SSDs are EXTREMELY tricky to benchmark
- Versions of all software
- RAID controller / filesystem
- Disk queue scheduler: a lot of Linux defaults are tuned for desktops. CFQ is the standard disk scheduler (desktop-oriented; performance suffers for server workloads) instead of noop or others.
- Generate some plots to summarize
- Better Aggregate Measurement:
- Average / Percentiles
- Observation duration
- 95th percentile = you can throw away the worst 1/20 of your day, which means throwing away more than an hour of data per day. I.e. your system can perform at rock bottom for an hour a day and the number won’t show it. Not so good for establishing an SLA or SLO (service level objective).
- Scatter graphs can be much more telling than a single number, since you can see whether your performance is all over the map or comes back as a stable figure. E.g. SSDs have performance all over the map, with very different characteristics when empty vs. full, or at the start vs. the end of a benchmark.
- Two metrics: throughput and response time (tasks per unit time, or time per task)
- They are not reciprocals
- Resource consumption is NOT a good measure of performance – i.e. CPU% / Load Avg / etc. These are indicators. They are not the goal.
- Be very careful with tools that report utilization. At 100% utilization many systems are not actually saturated.
- Try pt-diskstats from Percona
- What is a system’s actual capacity?
- Max throughput at max achievable concurrency while being given acceptable performance (response time).
- Most benchmarks reveal little
- If 1/20 of the work is serialized, you’ll never get more than a 20x speedup from going parallel (Amdahl’s law).
- Isolating bottlenecks or iteratively optimizing them is one way – but don’t optimize things that don’t matter. Don’t try to optimize little things.
- Little’s law: concurrency = throughput * response time
- This holds regardless of queuing, arrival rate distribution, response time distribution, etc.
- Utilization law:
- Utilization = service time * throughput
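A couple of the bullet points above are easy to make concrete. Here is a small sketch, using synthetic numbers rather than any real benchmark data, of what the 95th percentile hides and of Little’s law relating concurrency, throughput, and response time:

```python
# Synthetic-number sketch: what the 95th percentile ignores, and
# Little's law (concurrency = throughput * response time).

def percentile(samples, pct):
    # Simple nearest-rank percentile, no interpolation
    ordered = sorted(samples)
    idx = int(round(pct / 100.0 * len(ordered))) - 1
    return ordered[max(idx, 0)]

# 100 response times (ms): mostly fast, with five terrible outliers
times = [10] * 95 + [5000] * 5
p95 = percentile(times, 95)   # reports 10 ms; the five 5000 ms samples vanish
worst = max(times)            # 5000 ms of real user pain, invisible in the p95

# Little's law: at 200 requests/sec with a 50 ms average response time,
# the system holds throughput * response_time requests in flight on average.
throughput = 200.0            # tasks per second
response_time = 0.050         # seconds per task
concurrency = throughput * response_time
```

The point of the scatter-graph advice above is exactly the `p95` vs. `worst` gap: a single aggregate number can look rock-solid while a slice of your users suffers.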
Ian White from Neustar – Performance Optimization & Build Process, a photo by tadnkat on Flickr.
Ian White from Neustar gave a talk on Dev, Prod & Developer environments & how to organize around the goal of making web performance better without making the development lifecycle suck.
- Web optimization
- Organize CSS&JS
- Multiple Domains
- Gzip compression
- Resource caching
- Far-future expiry
- Image optimization (not a lot of automated tools out there)
- CSS spriting
- CI tools should take care of optimization, but if optimization happens on a different machine, developers can’t test their optimizations locally
- Use mod_pagespeed
- Really good for server-generated content
- Takes a look at the HTML being sent from the server and then does a whole ton of optimizations on it. Handles multiple domains behind load balancers.
- Works great on WordPress & other such stuff. Not so good with Client-side stuff
- Making dev & ops work well together:
- parameterize environments
- Make it so when a dev changes something, it changes all environments or can affect all environments
- For static resources:
- Try nginx or lighttpd
- Simple / fast / low overhead / gzip & caching
- Have it sit in front of Apache and reverse-proxy back for some requests, or serve files directly for others
- Far-future expires:
- consider using MD5 as the version
- Only issue is the lack of human readability; it’s tough to tell whether you’re using the right version for a given build.
- Image Opt
- JPEGmini, or Smush.it from Yahoo
- PNGcrush, others
- Look for “7 image optimization mistakes”
- See http://developer.apple.com/library/ios/#documentation/2DDrawing/Conceptual/DrawingPrintingiOS/SupportingHiResScreens/SupportingHiResScreens.html for Generating 2X res images for apple retina displays
- Use Glue (github.com/jorgebastida/glue) for Sprite Manufacturing
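The MD5-as-version trick for far-future expiry can be sketched in a few lines of Python (the filename scheme here is just an illustration, not any particular build tool’s convention):

```python
# Sketch: use an MD5 content hash as the "version" in a static asset's
# filename, so a far-future Expires header is safe. Any content change
# produces a new URL, which busts the cache automatically.
import hashlib
import os

def hashed_name(path):
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()[:8]
    base, ext = os.path.splitext(path)
    return "%s.%s%s" % (base, digest, ext)

# e.g. "site.css" -> "site.d41d8cd9.css" for an empty file
```

This is where the human-readability complaint above comes from: `site.d41d8cd9.css` tells you nothing about which build it came from without a lookup table.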
John Cowie from Etsy gave insight on how they use Chef at Etsy. This is a turbo-overview of his talk and my impressions.
First, as a note, Etsy prefers running on bare metal as opposed to the cloud. There are cases where working in the cloud is better, but not EVERY case, as some techno-pundits seem to evangelize. Some managers I’ve dealt with have this idea too, so just know: the ENTIRE WORLD IS NOT MOVING TO AMAZON. Etsy has around 800 servers, and they’re real servers – not VMs and not AMIs.
Some of their rules of thumb for dealing with Chef:
- Never test chef in production
- Keep things as simple as possible
- For metrics, they use a Chef handler to send data to Graphite (in github.com/etsy/chef-handlers.git)
- They push their chef failures to IRC.
With respect to handling failures:
- Use “knife node lastrun <hostname>” to see what happened on the last run
- gem install knife-lastrun, then configure it in client.rb to get this data
- Try to keep conditions simple where possible – not huge regexes. Simplicity helps your readability at 3 a.m.
Standards in Chef:
- Foodcritic: This is a tool for enforcing rules & standards on Chef, and it sounds FULLY RAD. http://acrmp.github.com/foodcritic/
- Foodcritic integrates with Jenkins (!!)
- Supports custom rules
- Etsy standards (guidelines)
- Never have chef auto-upgrade packages
- If you want to send a :restart action, prefer :reload instead of restart
- (foodcritic can enforce)
- Some more rules of thumb:
- Don’t take Opscode’s word for it – if it doesn’t work for you, change it.
- 41 people have Chef access, most have keys to push to prod
- There is an unconstrained Test env. NEVER test in prod
- They tweaked the “environments” workflow with some tooling
- SPORK! (knife-spork)
- knife-spork is a wrapper around environments
- check / bump / upload / promote
- Spork has a bunch of safety checks
- Spork now does chat notifications, git support, default environments, etc
I’ve had my fair share of issues getting Nvidia-based 3D drivers to work on Linux desktops for the last several years, so it did give me a bit of a chuckle seeing Linus Torvalds’ reaction to a question from the audience on getting Linux to work properly with an Nvidia Optimus-based laptop.
See the clip below; the colour & fireworks are at about 49:59.
Yup – I just sat in a Tesla Model S. I happened to walk into Tesla’s downtown DC showroom after a nice morning studying at my church, thinking “Oh… what if they actually have a Model S in there…”, and they did.
If you read my post on Electric Car Power Sources, or what I thought of my first drive of a Chevy Volt, you’d know I LOVE electric cars. But there’s a big difference between lusting after a roadster that you’ll likely never have, and sitting in a car that comfortably seats 6 (with the RAD rear-facing seats), which can haul around my wife & kids, and actually isn’t even absurdly expensive.
The most defining feature I already knew about on the inside was the massive 17″ touchscreen in the dash. I thought it was overkill when I saw photos of it, but sitting in front of it, the first impression is: this is how big touch screens in cars SHOULD be. Big enough to actually see while you’re going down the street, and big enough for BIG FAT touch buttons that you can find without swerving off the road. It’s actually brilliant. And the fact that you can see full-size Google Maps, plus your A/V controls, etc., all in one display is frankly awesome.
Next: the fact that the hood (bonnet) is just extra storage, with no space given over to an engine compartment, makes it look like it would be a rad road trip machine. Provided the range is good enough. And provided I can find 240 V charging stations along the way.
Count me among the impressed by the Model S.
I wrote a few days ago that with the Chrome OS 20 dev release, old ChromeBooks like the CR-48 now get the new fancydancy Aura UI. This basically handled my biggest complaint about the Chrome OS, in that it didn’t ever feel like it had a “home” – there was an insufficient origin point in the UI to fall back to or to start from. The OS became a much more engaging companion, especially combined with its already-good combination of fantastic battery life and instant-on performance.
But that led me to my next complaint, which was that I couldn’t do any real work on the machine – i.e. I had no SSH client. I had found some Java-based SSH clients before, but that is no good on Chrome OS as obviously CrOS doesn’t run Java.
Recently, though, I found a proper native SSH client app for Chrome. The app is fast and clean, and lets you use the normal Chrome bookmarks function to bookmark your SSH connections. Connections are made straight to the target machine (no proxy or server-side SSH client action here), which is perfect. This means that when cruising around to the bazillion meetings I sometimes have at my IT job, I can use the corporate Wi-Fi to SSH in to local computers and still do work, leveraging the portability and battery life of the ChromeBook. It’s nice not to have to tote around a full-size laptop. Also, the fact that the CR-48 is basically as portable as an iPad but has a proper keyboard means I can actually work and type, and not be hobbled like a tablet user.
Makes one finally think that the Google boffins were on to something with this Chrome OS thing. It’s absolutely not a replacement for a desktop OS, but for portability and instant-on use cases like meetings, it’s starting to be quite nice.
I just opened my Chromebook to sit on the couch and do some blogging tonight, and lo and behold – on the Dev Channel, an update was waiting to bring the machine up from ChromeOS 18 straight to ChromeOS 20 – which brought me the Aura UI!
From what I had read earlier regarding the Aura UI, Google was going to skip right past Aura on the CR-48, giving CR-48 users only critical patches and updates. It looked like I was going to have to install Linux on the machine to keep it useful.
So, I was obviously surprised when I saw this tonight!