Big Data Management and Applications

Inputs to collect

Domain context: Is this for learning/education, professional work, or project implementation?
Specific area: Does the user focus on data collection, storage, processing, analysis, or application?
Tech stack preference: Any specific tools or frameworks the user prefers (e.g., Hadoop, Spark, Flink)?
Problem type: Is this a theoretical question, practical implementation, or solution design?

Procedure

Core Knowledge Areas

1. Data Collection and Integration

Real-time data collection: Flume, Kafka, Kafka Connect, Logstash
Batch data ingestion: Sqoop, DataX
Data formats: JSON, CSV, Parquet, ORC, Avro
Data validation and quality checks
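
As a minimal sketch of the validation and quality checks listed above, the following Python function checks JSON records against a small schema. The field names (user_id, event_time, amount) and the rules are hypothetical and chosen only for illustration; a production pipeline would more likely rely on a dedicated data quality tool.

```python
import json
from datetime import datetime

# Hypothetical schema: field name -> (expected type(s), required?)
SCHEMA = {
    "user_id": (str, True),
    "event_time": (str, True),        # ISO 8601 timestamp stored as text
    "amount": ((int, float), False),
}

def validate_record(raw_line):
    """Return a list of problems found in one JSON line (empty list = valid)."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]

    problems = []
    for field, (expected_type, required) in SCHEMA.items():
        if field not in record or record[field] is None:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    # Format check: event_time must parse as an ISO 8601 timestamp.
    if isinstance(record.get("event_time"), str):
        try:
            datetime.fromisoformat(record["event_time"])
        except ValueError:
            problems.append("event_time is not ISO 8601")
    return problems
```

Records that return a non-empty list can be routed to a quarantine location instead of being dropped, so that rejected data stays available for inspection.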

2. Storage Architecture

Distributed file systems: HDFS, Ceph
Data lake table formats: Delta Lake, Iceberg, Hudi
NoSQL databases: HBase, MongoDB, Cassandra
Time-series databases: InfluxDB, TimescaleDB
Data warehouse: Hive, ClickHouse, StarRocks, Doris

3. Processing Frameworks

Batch processing: MapReduce, Spark SQL, Flink Batch
Stream processing: Kafka Streams, Flink, Spark Streaming, Storm
Workflow orchestration for ETL pipelines: Airflow, DolphinScheduler, Azkaban
Data transformation: Spark DataFrame, Flink Table API
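
To make the stream processing item above more concrete, here is a minimal pure-Python sketch of a tumbling-window count, the kind of aggregation that stream engines such as Flink provide. It is not the API of any framework: the event shape (epoch seconds, key) and the 60-second window size are assumptions made for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, key).

    `events` is an iterable of (epoch_seconds, key) tuples. Each event falls
    into exactly one window: [start, start + window_seconds).
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical sample events: (epoch_seconds, event type)
events = [(0, "click"), (30, "click"), (59, "view"), (60, "click"), (125, "view")]
result = tumbling_window_counts(events)
```

A real engine additionally has to deal with out-of-order events, state that survives failures, and the decision of when a window is complete, none of which this sketch covers.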

4. Analysis and Computing

SQL engines: Presto, Trino, Hive LLAP, Spark Thrift Server
OLAP engines: ClickHouse, Druid, Kylin, Doris
Machine learning: Spark MLlib, XGBoost on Spark, TensorFlow on Spark
Graph processing: GraphX, Neo4j, Apache TinkerPop (Gremlin)

5. Data Governance

Data catalog: Apache Atlas, DataHub, OpenMetadata
Data lineage: Apache Atlas, OpenLineage, Marquez
Data quality: Deequ, Great Expectations, Apache Griffin, Delta Lake schema enforcement
Data security: Ranger, Sentry, column-level encryption

6. Practical Application Scenarios

Real-time data dashboard and monitoring
User behavior analysis and recommendation systems
Risk control and fraud detection
Data assets and monetization
Business intelligence and reporting

Solution Design Framework

Assess requirements
- Data volume, velocity, variety assessment
- Latency requirements (real-time vs batch)
- Analytical complexity needs
Architecture selection
- Lambda architecture vs Kappa architecture
- Data mesh vs traditional data warehouse
- Cloud-native vs on-premise considerations
Technology stack recommendation
- Match specific requirements to appropriate tools
- Consider team expertise and learning curve
- Evaluate cost and operational complexity
Implementation roadmap
- Quick wins vs long-term architecture
- Migration strategy from legacy systems
- Performance tuning and optimization

Output contract

Provide:

Clear, actionable guidance or solution design
Technology recommendations with rationale
Code examples for implementation when needed
Architecture diagrams in text format when helpful
Comparison of alternatives when relevant

Failure handling

For highly specific technical questions outside current knowledge: acknowledge limitations and provide best-effort guidance
For emerging technologies not in training data: suggest official documentation and community resources
When user needs hands-on implementation: recommend specific tutorials or documentation

Examples

Example 1: Real-time data pipeline design
Input: "Design a real-time analytics system that processes 1 billion records per day"
Output: Provide an architecture covering Kafka for ingestion, Flink for processing, and ClickHouse for real-time OLAP, with data flow diagrams and key configurations
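
A response to Example 1 would typically start with back-of-the-envelope sizing. The sketch below does that arithmetic in Python. Only the figure of 1 billion records per day comes from the example; the average record size of 1,000 bytes and the peak factor of 3 are assumptions that should be replaced with measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def estimate_load(records_per_day, avg_record_bytes, peak_factor):
    """Return average and peak records per second and raw daily volume in GB."""
    avg_rps = records_per_day / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor
    daily_gb = records_per_day * avg_record_bytes / 1e9
    return {"avg_rps": avg_rps, "peak_rps": peak_rps, "daily_gb": daily_gb}

# 1 billion records/day is from the example; record size and peak factor are assumed.
load = estimate_load(
    records_per_day=1_000_000_000,
    avg_record_bytes=1_000,
    peak_factor=3,
)
```

Under these assumptions the system averages roughly 11,574 records per second and ingests about 1 TB of raw data per day, before compression and replication. Those numbers then drive choices such as Kafka partition count and storage capacity.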

Example 2: Data lake migration
Input: "How do I migrate a traditional data warehouse to a modern data lake architecture?"
Output: Provide a phased migration plan, tool selection rationale (Iceberg vs Hudi vs Delta Lake), and data governance recommendations

Example 3: Performance optimization
Input: "My Spark job runs very slowly. How do I troubleshoot and optimize it?"
Output: Provide a troubleshooting checklist covering shuffle optimization, partition tuning, memory configuration, and data skew handling, with specific parameter recommendations
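
For the data skew item in Example 3, one common technique is key salting with a two-stage aggregation. The pure-Python sketch below illustrates the idea only and is not Spark code; the salt count of 4, the fixed seed, and the sample keys are assumptions made for illustration. In Spark the same pattern is usually expressed by adding a salt column before the first aggregation and aggregating again after removing it.

```python
import random
from collections import Counter

def salted_count(keys, num_salts=4, seed=0):
    """Count occurrences per key in two stages using salted keys."""
    rng = random.Random(seed)

    # Stage 1: pair each key with a random salt so that one hot key is
    # spread over up to `num_salts` distinct partial-aggregation keys.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)

    # Stage 2: drop the salt and merge the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final

# Hypothetical skewed input: one hot key dominates.
keys = ["hot"] * 1000 + ["cold"] * 10
partial, final = salted_count(keys)
```

The first stage keeps any single task from receiving every record of the hot key, and the second stage works on the much smaller set of partial results, so the final totals are unchanged.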

Reference Resources

For detailed implementation guides, refer to:

Apache official documentation (Hadoop, Spark, Flink, Kafka)
Cloud provider big data services (AWS EMR, Azure Databricks, GCP Dataproc)
Open source project GitHub repositories and best practices
Industry case studies and architecture patterns