Flume
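Apache Flume is an open-source distributed log collection system designed to reliably move large volumes of log data from many kinds of data sources into storage systems such as HDFS, Kafka, and HBase. Its core model is a Source → Channel → Sink pipeline.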
Core Components
Data source (Web Server / App)
│
▼
Source (collects data)
│
▼
Channel (buffers events for reliability)
│
▼
Sink (writes to the target store)
│
▼
HDFS / Kafka / HBase
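| Component | Role | Common implementations |
|---|---|---|
| Source | Receives data from external systems | Taildir (tails files), Avro, Kafka, HTTP |
| Channel | Buffers events and decouples Source from Sink | Memory Channel (high throughput), File Channel (durable) |
| Sink | Writes events to the destination | HDFS Sink, Kafka Sink, Logger Sink |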
Configuration Example
Tail log files for changes and write them to HDFS:
agent.sources = tailSource
agent.channels = memChannel
agent.sinks = hdfsSink
# Source: tail log files in a directory; the position file lets tailing resume where it left off after a restart
agent.sources.tailSource.type = TAILDIR
agent.sources.tailSource.filegroups = f1
agent.sources.tailSource.filegroups.f1 = /var/log/app/.*\\.log
agent.sources.tailSource.positionFile = /tmp/flume_position.json
agent.sources.tailSource.channels = memChannel
# Channel: in-memory buffer
agent.channels.memChannel.type = memory
agent.channels.memChannel.capacity = 10000
agent.channels.memChannel.transactionCapacity = 1000
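# Sink: write to HDFS, one directory per hour
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode:9000/logs/%Y-%m-%d/%H
# The time escapes in hdfs.path need a timestamp; TAILDIR events carry no timestamp header by default, so use the agent's local time
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink.hdfs.fileType = DataStream
# Roll to a new file every hour or at 128 MB, whichever comes first; 0 disables rolling by event count
agent.sinks.hdfsSink.hdfs.rollInterval = 3600
agent.sinks.hdfsSink.hdfs.rollSize = 134217728
agent.sinks.hdfsSink.hdfs.rollCount = 0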
agent.sinks.hdfsSink.channel = memChannel
Start the agent:
flume-ng agent \
--conf /etc/flume/conf \
--conf-file /etc/flume/conf/app-to-hdfs.conf \
--name agent \
-Dflume.root.logger=INFO,console
Multi-Agent Cascading
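In large deployments, multiple Flume agents are chained together: the agent on each app server forwards events through an Avro Sink, and an aggregator agent receives them with an Avro Source and writes them out in one place:
App Server 1 → Agent1 (Avro Sink) ─┐
App Server 2 → Agent2 (Avro Sink) ─┼→ Aggregator Agent (Avro Source) → HDFS
App Server 3 → Agent3 (Avro Sink) ─┘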
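The link between the two tiers can be sketched as a pair of config fragments. This is a minimal sketch under stated assumptions, not a complete deployment: the agent names `edge` and `collector`, the host `collector-host`, and port 4545 are placeholders chosen for illustration. The source and channel definitions are omitted; the edge agent's TAILDIR source and the aggregator's HDFS sink would be configured along the lines of the single-agent example.

```properties
# --- Edge agent (one per app server); the name "edge" is a placeholder ---
edge.sources = tailSource
edge.channels = memChannel
edge.sinks = avroSink

# Avro Sink: forward events to the aggregator over Avro RPC
# (hostname and port are hypothetical values)
edge.sinks.avroSink.type = avro
edge.sinks.avroSink.hostname = collector-host
edge.sinks.avroSink.port = 4545
edge.sinks.avroSink.channel = memChannel

# --- Aggregator agent; the name "collector" is a placeholder ---
collector.sources = avroSource
collector.channels = fileChannel
collector.sinks = hdfsSink

# Avro Source: listen for events sent by the edge agents' Avro Sinks
# (the port must match the one used by the Avro Sinks above)
collector.sources.avroSource.type = avro
collector.sources.avroSource.bind = 0.0.0.0
collector.sources.avroSource.port = 4545
collector.sources.avroSource.channels = fileChannel
```

Each agent is started separately, with `--name edge` or `--name collector` matching the prefix used in its config file.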
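Choosing a Channel
| Channel | Pros | Cons | Best suited for |
|---|---|---|---|
| Memory | High throughput, low latency | Buffered events are lost if the agent crashes | Workloads that tolerate a small amount of loss |
| File | Persistent and reliable | Disk I/O, lower throughput | Workloads that cannot tolerate loss |
| Kafka | Distributed, high throughput, replayable | Depends on a Kafka cluster | Pipelines integrated with Kafka |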
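As an illustration of what switching channels involves, the fragment below replaces the memory channel from the configuration example with a File Channel. It is a sketch, assuming the same agent name `agent` and component names `tailSource` and `hdfsSink`; the channel name `fileChannel` and the directory paths are placeholders. A Kafka Channel alternative is shown commented out, using the property names documented for Flume 1.7 and later; the broker addresses are hypothetical.

```properties
# Option A: File Channel (buffered events survive an agent restart)
agent.channels = fileChannel
agent.channels.fileChannel.type = file
# Placeholder paths; put them on a disk with enough space and I/O headroom
agent.channels.fileChannel.checkpointDir = /data/flume/checkpoint
agent.channels.fileChannel.dataDirs = /data/flume/data

# Point the existing source and sink at the new channel
agent.sources.tailSource.channels = fileChannel
agent.sinks.hdfsSink.channel = fileChannel

# Option B: Kafka Channel (use instead of Option A, not together with it)
# agent.channels = kafkaChannel
# agent.channels.kafkaChannel.type = org.apache.flume.channel.kafka.KafkaChannel
# agent.channels.kafkaChannel.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
# agent.channels.kafkaChannel.kafka.topic = flume-channel
# agent.sources.tailSource.channels = kafkaChannel
# agent.sinks.hdfsSink.channel = kafkaChannel
```

Only the channel definition and the two lines that attach the source and sink change; the rest of the agent configuration stays as it was.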