Chukwa is a data collection and analysis framework that works with Hadoop to process and analyze the large volumes of logs generated. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. It is a highly flexible tool that makes log analysis, processing, and monitoring easier, especially in distributed environments such as Hadoop clusters.
Components of Chukwa
Chukwa comprises the following components:
Agents that run on each machine to collect the logs generated by various applications.
Collectors that receive data from the agents and write it to stable storage (HDFS in the case of Hadoop).
MapReduce jobs for parsing and archiving the data.
How does Chukwa Work?
Chukwa agents run on every machine from which logs need to be transferred to Hadoop. The agents collect logs generated at the application layer using adaptors. One agent can have many adaptors, each doing a separate task of collecting logs.
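For example, an agent can be told which adaptors to start at boot through its initial_adaptors configuration file, one adaptor per line. A minimal sketch (the file paths and datatype names here are illustrative; the adaptor classes are Chukwa's file-tailing adaptors):

```
# conf/initial_adaptors — one adaptor per line:
# add <adaptor class> <datatype> <adaptor-specific params> <initial offset>
add filetailer.FileTailingAdaptor FooData /var/log/app/foo.log 0
add filetailer.CharFileTailingAdaptorUTF8 SysLog /var/log/messages 0
```

Each line starts one adaptor tailing one log file, so a single agent can watch several applications at once.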
Collectors do the actual job of persisting the logs. As described in the figure above, every application that generates logs has its own adaptor. The adaptor sends the logs to the agent, which in turn forwards them to the collector. The collector saves the logs gathered from the various agents into a data sink file in HDFS. Because the appending is done on the Chukwa side, huge volumes of logs can be processed successfully. Once a data sink file reaches a threshold, it is renamed and moved, and subsequent logs are written sequentially to the next data sink file.
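The collector's behavior is driven by its configuration file. A sketch of the relevant entries, with illustrative values (the property names below follow Chukwa's collector configuration, but verify them against the version you run):

```xml
<!-- chukwa-collector-conf.xml (illustrative values) -->
<property>
  <name>writer.hdfs.filesystem</name>
  <value>hdfs://namenode:9000/</value>  <!-- where sink files are written -->
</property>
<property>
  <name>chukwaCollector.outputDir</name>
  <value>/chukwa/logs/</value>          <!-- the data sink directory -->
</property>
<property>
  <name>chukwaCollector.rotateInterval</name>
  <value>300000</value>                 <!-- rotate sink files every 5 minutes -->
</property>
```

The rotate interval is the "threshold" mentioned above: when it elapses, the current sink file is closed and a new one is started.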
MapReduce Jobs in Chukwa
As data is collected, Chukwa dumps it into sink files in HDFS. By default, these are located in hdfs:///chukwa/logs. If the file name ends in ‘.chukwa’, the file is still being written to. Every few minutes, the collector closes the file and renames it to ‘*.done’, marking it as available for processing. Each sink file is a Hadoop sequence file containing a succession of key-value pairs, with periodic sync markers to facilitate MapReduce access. Chukwa has its own set of MapReduce jobs that process the logs. There are two basic types of jobs:
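The ‘.chukwa’ to ‘.done’ naming protocol can be sketched as follows. This is a simplified, stand-alone illustration of the lifecycle described above, not Chukwa's actual code, and it writes plain text rather than Hadoop sequence files:

```python
import os
import time

def write_sink_file(directory, records, basename=None):
    """Write records to an in-progress '.chukwa' file, then rename it
    to '.done' on close -- mirroring the collector's naming protocol."""
    basename = basename or str(int(time.time() * 1000))
    in_progress = os.path.join(directory, basename + ".chukwa")
    with open(in_progress, "w") as f:
        for rec in records:
            f.write(rec + "\n")
    done = os.path.join(directory, basename + ".done")
    # The rename is atomic, so readers only ever see complete '.done' files.
    os.rename(in_progress, done)
    return done

def ready_files(directory):
    """Downstream jobs only pick up files whose name ends in '.done'."""
    return [n for n in os.listdir(directory) if n.endswith(".done")]
```

The point of the two-suffix scheme is that downstream MapReduce jobs never have to guess whether a file is complete: anything ending in ‘.done’ is safe to process.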
The simple archiver is designed to consolidate a large number of data sink files into a small number of archive files, with the contents grouped in a useful way. Archive files, like raw sink files, are in Hadoop sequence file format; unlike the data sink, however, duplicates have been removed. The simple archiver moves every .done file out of the sink and then runs a MapReduce job to group the data. Output chunks are placed into files with names of the form hdfs:///chukwa/archive/clustername/Datatype_date.arc, where date corresponds to when the data was collected and Datatype is the datatype of each chunk. If archived data corresponds to an existing file name, a new file is created with a disambiguating suffix.
A key use for Chukwa is processing arriving data, in parallel, using MapReduce. The most common way to do this is with the Chukwa demux framework. As data flows through Chukwa, the demux job is often the first job that runs. By default, Chukwa uses the TsProcessor. This parser tries to extract the timestamp of the real log statement from each log entry using the ISO 8601 date format. If that fails, it falls back to the time at which the chunk was written to disk (the collector timestamp).
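The timestamp-with-fallback strategy can be sketched as follows. This is a simplified stand-in for the TsProcessor's behavior, matching only the ‘YYYY-MM-DD hh:mm:ss’ core of ISO 8601 and ignoring fractional seconds and time zones:

```python
import re
from datetime import datetime

# Matches the date-time core of an ISO 8601 timestamp, e.g. 2023-06-01 12:00:00
ISO8601 = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def entry_timestamp(log_line, collector_time):
    """Take the ISO 8601 timestamp from the log line itself, falling
    back to the collector timestamp when no parseable date is found.
    A simplified illustration of the TsProcessor strategy."""
    m = ISO8601.search(log_line)
    if m:
        return datetime.strptime(m.group(0).replace("T", " "),
                                 "%Y-%m-%d %H:%M:%S")
    return collector_time

# Usage: a well-formed entry keeps its own time; a malformed one
# inherits the time the chunk reached the collector.
now = datetime(2024, 1, 2, 3, 4, 5)
entry_timestamp("2023-06-01 12:00:00 INFO started", now)
entry_timestamp("malformed line", now)
```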