Category: Other. No API key required.

bigdata-management

Author: user_a045eab5hubcommunity

Big Data Management and Applications

Inputs to collect

  • Domain context: Is this for learning/education, professional work, or project implementation?
  • Specific area: Does the user focus on data collection, storage, processing, analysis, or application?
  • Tech stack preference: Any specific tools or frameworks the user prefers (e.g., Hadoop, Spark, Flink)?
  • Problem type: Is this a theoretical question, practical implementation, or solution design?

Procedure

Core Knowledge Areas

1. Data Collection and Integration

  • Real-time data collection: Flume, Kafka, Kafka Connect, Logstash
  • Batch data ingestion: Sqoop, DataX
  • Data formats: JSON, CSV, Parquet, ORC, Avro
  • Data validation and quality checks
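
As a minimal sketch of the validation and quality-check step above, the following shows one way to screen JSON event lines before they reach storage. The field names (`user_id`, `event`, `ts`) and the rules are hypothetical and chosen only for illustration; a production pipeline would more likely rely on a schema registry or a dedicated data quality tool.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"user_id": str, "event": str, "ts": int}


def validate_record(line):
    """Return (is_valid, reason) for one JSON-encoded event line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False, "malformed JSON"
    if not isinstance(record, dict):
        return False, "not a JSON object"
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            return False, f"missing field: {field}"
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False, f"wrong type for field: {field}"
    if record["ts"] < 0:
        return False, "negative timestamp"
    return True, "ok"


def split_valid_invalid(lines):
    """Route valid lines onward and keep rejected lines with their reason."""
    valid, rejected = [], []
    for line in lines:
        ok, reason = validate_record(line)
        if ok:
            valid.append(line)
        else:
            rejected.append((line, reason))
    return valid, rejected
```

Keeping the rejected lines together with a reason, rather than dropping them silently, makes it possible to route them to a quarantine location and track rejection rates over time.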

2. Storage Architecture

  • Distributed file systems: HDFS, Ceph
  • Data lake table formats: Delta Lake, Iceberg, Hudi
  • NoSQL databases: HBase, MongoDB, Cassandra
  • Time-series databases: InfluxDB, TimescaleDB
  • Data warehouse: Hive, ClickHouse, StarRocks, Doris

3. Processing Frameworks

  • Batch processing: MapReduce, Spark SQL, Flink Batch
  • Stream processing: Kafka Streams, Flink, Spark Structured Streaming, Storm
  • Workflow orchestration for ETL pipelines: Airflow, DolphinScheduler, Azkaban
  • Data transformation: Spark DataFrame, Flink Table API
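
To make the batch versus stream distinction more concrete, here is a plain-Python sketch of a tumbling-window count, the kind of aggregation that stream processors such as Flink provide as a built-in. It is an illustration of the idea only: it holds everything in memory and ignores watermarks, late events, state backends, and parallelism, all of which a real engine handles. The function name and the event shape are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = defaultdict(int)
    for ts, key in events:
        # Align each event to the start of the window that contains it.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]
    for (start, key), n in sorted(tumbling_window_counts(sample, 60).items()):
        print(f"window starting at {start}s, key={key}: {n} event(s)")
```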

4. Analysis and Computing

  • SQL engines: Presto, Trino, Hive LLAP, Spark Thrift Server
  • OLAP engines: ClickHouse, Druid, Kylin, Doris
  • Machine learning: Spark MLlib, XGBoost on Spark, TensorFlowOnSpark
  • Graph processing: GraphX, Neo4j, Gremlin (Apache TinkerPop)

5. Data Governance

  • Data catalog: Apache Atlas, DataHub, OpenMetadata
  • Data lineage: Apache Atlas, DataHub, OpenLineage
  • Data quality: Deequ, Apache Griffin, Great Expectations, Delta Lake schema enforcement
  • Data security: Ranger, Sentry (retired Apache project), column-level encryption
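
Data lineage, as listed above, comes down to recording which datasets are derived from which and then walking that graph. The sketch below shows the traversal on a small hand-written mapping. The dataset names and the `LINEAGE` structure are hypothetical; catalog tools expose lineage through their own APIs and metadata models, which this does not attempt to reproduce.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets it is derived from.
LINEAGE = {
    "dashboard.daily_revenue": ["dw.orders_agg"],
    "dw.orders_agg": ["ods.orders", "ods.users"],
    "ods.orders": ["mysql.shop.orders"],
    "ods.users": ["mysql.shop.users"],
}


def upstream_datasets(target, lineage):
    """Return every dataset that `target` depends on, directly or transitively."""
    seen = set()
    queue = deque(lineage.get(target, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # Already visited; also guards against cycles.
        seen.add(node)
        queue.extend(lineage.get(node, []))
    return seen


if __name__ == "__main__":
    for name in sorted(upstream_datasets("dashboard.daily_revenue", LINEAGE)):
        print(name)
```

The same traversal run on reversed edges answers the impact-analysis question: which downstream tables and dashboards are affected if a source table changes.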

6. Practical Application Scenarios

  • Real-time data dashboard and monitoring
  • User behavior analysis and recommendation systems
  • Risk control and fraud detection
  • Data asset management and monetization
  • Business intelligence and reporting

Solution Design Framework

  1. Assess requirements

    • Data volume, velocity, variety assessment
    • Latency requirements (real-time vs batch)
    • Analytical complexity needs
  2. Architecture selection

    • Lambda architecture vs Kappa architecture
    • Data mesh vs traditional data warehouse
    • Cloud-native vs on-premise considerations
  3. Technology stack recommendation

    • Match specific requirements to appropriate tools
    • Consider team expertise and learning curve
    • Evaluate cost and operational complexity
  4. Implementation roadmap

    • Quick wins vs long-term architecture
    • Migration strategy from legacy systems
    • Performance tuning and optimization

Output contract

Provide:

  • Clear, actionable guidance or solution design
  • Technology recommendations with rationale
  • Code examples for implementation when needed
  • Architecture diagrams in text format when helpful
  • Comparison of alternatives when relevant

Failure handling

  • For highly specific technical questions outside current knowledge: acknowledge limitations and provide best effort guidance
  • For emerging technologies not in training data: suggest official documentation and community resources
  • When user needs hands-on implementation: recommend specific tutorials or documentation

Examples

Example 1: Real-time data pipeline design

  • Input: "Design a real-time analytics system that processes 1 billion records per day"
  • Output: Provide an architecture covering Kafka for ingestion, Flink for processing, and ClickHouse for real-time OLAP, with data flow diagrams and key configurations
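
For a request at the scale of Example 1 (1 billion records per day), a quick back-of-the-envelope estimate helps ground the architecture discussion. The sketch below converts a daily record count into average and peak rates. The 500-byte average record size and the 3x peak factor are assumptions picked for illustration and should be replaced with measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_ingest_load(records_per_day, avg_record_bytes, peak_factor):
    """Estimate ingest rates from a daily record count.

    avg_record_bytes and peak_factor are inputs the caller must supply;
    the values used in the example below are illustrative assumptions.
    """
    avg_rps = records_per_day / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor
    return {
        "avg_records_per_s": avg_rps,
        "peak_records_per_s": peak_rps,
        "peak_mb_per_s": peak_rps * avg_record_bytes / 1_000_000,
        "raw_gb_per_day": records_per_day * avg_record_bytes / 1_000_000_000,
    }


if __name__ == "__main__":
    load = estimate_ingest_load(1_000_000_000, avg_record_bytes=500, peak_factor=3)
    for name, value in load.items():
        print(f"{name}: {value:,.1f}")
```

One billion records per day averages out to about 11,574 records per second regardless of the other assumptions; the peak rate and byte volumes depend entirely on the assumed record size and peak factor.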

Example 2: Data lake migration

  • Input: "How do I migrate a traditional data warehouse to a modern data lake architecture?"
  • Output: Provide a phased migration plan, tool selection rationale (Iceberg vs Hudi vs Delta Lake), and data governance recommendations

Example 3: Performance optimization

  • Input: "My Spark job runs very slowly. How do I troubleshoot and optimize it?"
  • Output: Provide a troubleshooting checklist covering shuffle optimization, partition tuning, memory configuration, and data skew handling, with specific parameter recommendations
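
Two items from the checklist in Example 3, partition tuning and data skew, can be approximated with simple arithmetic before touching the cluster. The sketch below is plain Python, not Spark code. The 128 MiB target partition size is a commonly used rule of thumb treated here as an assumption, and the helper names are hypothetical; the result would typically inform a setting such as `spark.sql.shuffle.partitions`.

```python
import math
from statistics import median

# Assumed rule-of-thumb target size per shuffle partition (128 MiB).
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=DEFAULT_TARGET_BYTES):
    """Suggest a partition count so each shuffle partition is near the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the median; values far above 1 hint at skew."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    mid = median(partition_sizes)
    if mid == 0:
        return math.inf if max(partition_sizes) > 0 else 1.0
    return max(partition_sizes) / mid


if __name__ == "__main__":
    print(suggest_shuffle_partitions(100 * 1024**3))  # 100 GiB of shuffle data
    print(skew_ratio([100, 100, 100, 1000]))
```

The partition sizes passed to `skew_ratio` would come from per-task metrics, for example the shuffle read sizes shown for a stage in the Spark UI.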

Reference Resources

For detailed implementation guides, refer to:

  • Apache official documentation (Hadoop, Spark, Flink, Kafka)
  • Cloud provider big data services (AWS EMR, Azure Databricks, GCP Dataproc)
  • Open source project GitHub repositories and best practices
  • Industry case studies and architecture patterns