Andre Arko's blog post about dealing with request logs for the very busy web application behind RubyGems.org. A single day of request logs was usually around 500 gigabytes on disk. They tried a few hosted logging products, but at that volume the vendors could typically only offer retention measured in hours. The only thing they could think of to do with the full log firehose was to run it through gzip -9 and then drop it into AWS S3.
Every day they generated about 500 files, each 85MB on disk (compressed) and containing about a million streaming JSON objects that take up 1GB when uncompressed.
They first tried to extract statistics using AWS Glue, running Python for Spark directly against the S3 bucket of logs. That proved to be an expensive solution, at about $1,000 a month.
Then they tried Rust. It turns out serde, the Rust JSON library, is super fast: it tries very hard not to allocate, and it can deserialize the (uncompressed) 1GB of JSON into Rust structs in 2 seconds flat. Then they stumbled upon Rayon, the Rust parallel iteration library.
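To get a feel for that combination, here is a minimal sketch of parsing newline-delimited JSON request logs with serde and fanning the work across cores with Rayon. The `RequestLog` fields and the file name are made up for illustration; the real log schema isn't shown in the post.

```rust
// Sketch only: parse newline-delimited JSON request logs in parallel.
use rayon::prelude::*;
use serde::Deserialize;
use std::fs::File;
use std::io::{BufRead, BufReader};

// Hypothetical log schema; the real one has many more fields.
#[derive(Deserialize)]
struct RequestLog {
    path: String,
    status: u16,
}

fn main() -> std::io::Result<()> {
    let file = BufReader::new(File::open("requests.log")?);

    // Read the lines up front, then let Rayon spread the JSON parsing
    // across every available core.
    let lines: Vec<String> = file.lines().collect::<Result<_, _>>()?;

    let ok_requests = lines
        .par_iter()
        .filter_map(|line| serde_json::from_str::<RequestLog>(line).ok())
        .filter(|log| log.status == 200)
        .count();

    println!("{} successful requests", ok_requests);
    Ok(())
}
```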
Read how they came up with an inventive solution using rust-aws-lambda, a crate that lets your Rust program run on AWS Lambda by pretending to be a Go binary. As a nice bonus for their use case, it's only a few clicks to have AWS run a Lambda as a callback every time a new file is added to an S3 bucket.
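As a rough illustration of that callback, the sketch below parses the payload a Lambda receives for an S3 "object created" notification. The struct names are ours, but the `Records`/`s3`/`bucket`/`object` fields follow the S3 event notification format; wiring this into rust-aws-lambda's entry point is left out.

```rust
// Sketch of handling the S3 notification a Lambda receives when a new
// gzipped log file lands in the bucket.
use serde::Deserialize;

#[derive(Deserialize)]
struct S3Event {
    #[serde(rename = "Records")]
    records: Vec<Record>,
}

#[derive(Deserialize)]
struct Record {
    s3: S3Entity,
}

#[derive(Deserialize)]
struct S3Entity {
    bucket: Bucket,
    object: Object,
}

#[derive(Deserialize)]
struct Bucket {
    name: String,
}

#[derive(Deserialize)]
struct Object {
    key: String,
}

fn handle_event(payload: &str) -> serde_json::Result<()> {
    let event: S3Event = serde_json::from_str(payload)?;
    for record in event.records {
        // Each record names one newly uploaded log file: download it,
        // decompress, and run the serde/Rayon parsing pass over it.
        println!(
            "new log file: s3://{}/{}",
            record.s3.bucket.name, record.s3.object.key
        );
    }
    Ok(())
}
```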
And the cost? 500 log files, parsing 500GB of logs per day, fits within the AWS free tier! The code repository is also linked in the article. Perfect!
[Read More]