Category: Data & Analysis. No API Key required.

Information Extraction

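Extracts structured information from unstructured text through a semi-automatic pipeline. Supports entity extraction, relation extraction, attribute extraction...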

Author: quqxuihubclawhub

Extract entity, relation, attribute, and event information from text into a normalized intermediate structure, then export triples in JSON, JSONL, or TSV.

Core workflow

  1. Define extraction scope and output granularity.
  2. Segment input text into sentences and paragraphs.
  3. Extract entities with evidence.
  4. Extract relations, attributes, and events.
  5. Normalize aliases, predicates, and duplicated records.
  6. Export triples. Default output is JSON.
  7. Review ambiguities before treating output as final.
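
Step 2 of the workflow can be sketched as follows. This is a minimal illustration using only the standard library, not the logic of the skill's own scripts; the function name `segment` and the regex-based sentence boundary are assumptions made for this example.

```python
import re

def segment(text: str) -> list[list[str]]:
    """Split text into paragraphs, then each paragraph into sentences.

    Hypothetical helper for illustration only.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence boundary: '.', '!' or '?' followed by whitespace.
    # This mis-splits abbreviations such as "Dr. Smith".
    splitter = re.compile(r"(?<=[.!?])\s+")
    return [[s for s in splitter.split(p) if s] for p in paragraphs]

text = (
    "OpenAI published the GPT-4 Technical Report. "
    "The report was released in 2023.\n\n"
    "A second paragraph."
)
# Print (paragraph index, sentence index, sentence) so evidence can be traced back.
for p_idx, sentences in enumerate(segment(text)):
    for s_idx, sentence in enumerate(sentences):
        print(p_idx, s_idx, sentence)
```

Keeping paragraph and sentence indices alongside each sentence makes it easier to attach evidence to the records extracted in steps 3 and 4.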

Input scope

Prefer this skill for:

  • Plain text strings
  • Markdown text
  • Text copied from webpages, notes, reports, transcripts, or documents

If the user provides a file in another format, convert it to text first, then use this skill.

Output contract

Default output should contain:

{
  "triples": [],
  "entities": [],
  "attributes": [],
  "events": [],
  "ambiguities": []
}

Supported export formats:

  • JSON (default)
  • JSONL
  • TSV
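
The three formats differ only in serialization. The sketch below shows one way to write them with the Python standard library. It assumes each triple is a dict with `subject`, `predicate`, and `object` keys, which this document does not specify, and the function name `export_triples` is hypothetical; it is not the code in export_triples.py. For brevity the JSON branch emits only the `triples` key of the output contract.

```python
import csv
import io
import json

def export_triples(triples: list[dict], fmt: str = "json") -> str:
    """Serialize triples as JSON (default), JSONL, or TSV. Illustrative only."""
    if fmt == "json":
        # One document wrapping the whole list.
        return json.dumps({"triples": triples}, ensure_ascii=False, indent=2)
    if fmt == "jsonl":
        # One JSON object per line.
        return "\n".join(json.dumps(t, ensure_ascii=False) for t in triples)
    if fmt == "tsv":
        buf = io.StringIO()
        # csv.writer quotes any field that itself contains a tab.
        writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
        writer.writerow(["subject", "predicate", "object"])
        for t in triples:
            writer.writerow([t["subject"], t["predicate"], t["object"]])
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")

triples = [
    {"subject": "OpenAI", "predicate": "published", "object": "GPT-4 Technical Report"}
]
print(export_triples(triples, "tsv"))
```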

Extraction principles

  • Extract explicit facts before inference.
  • Preserve evidence spans for important records.
  • Prefer controlled predicates from references/relation-taxonomy.md.
  • Keep attributes and events separate internally, even when final output is triples.
  • Do not flatten complex events too early.
  • Normalize before exporting.
  • Record unresolved ambiguity instead of pretending certainty.
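
As a rough illustration of the last two principles together with controlled predicates, the sketch below normalizes relation records, merges duplicates, and routes anything it cannot map into an ambiguity list instead of guessing. The vocabulary `CONTROLLED`, the synonym table `SYNONYMS`, the function name, and the shape of the ambiguity entries are all invented for this example; the real predicate list lives in references/relation-taxonomy.md.

```python
# Hypothetical controlled vocabulary and synonym table (assumptions for this sketch).
CONTROLLED = {"published", "founded", "located_in"}
SYNONYMS = {"released": "published", "publishes": "published"}

def normalize_relations(relations: list[dict]) -> tuple[list[dict], list[dict]]:
    """Map predicates to the controlled set, dedupe, and collect ambiguities."""
    kept = {}          # (subject, predicate, object) -> best record seen so far
    ambiguities = []
    for rel in relations:
        raw = rel["predicate"].strip().lower()
        predicate = SYNONYMS.get(raw, raw)
        if predicate not in CONTROLLED:
            # Record the uncertainty rather than forcing a predicate.
            ambiguities.append({"record": rel, "reason": f"unknown predicate: {raw}"})
            continue
        key = (rel["subject"], predicate, rel["object"])
        candidate = {**rel, "predicate": predicate}
        # On duplicates, keep the record with the highest confidence.
        if key not in kept or candidate["confidence"] > kept[key]["confidence"]:
            kept[key] = candidate
    return list(kept.values()), ambiguities

relations = [
    {"subject": "ent_001", "predicate": "published", "object": "ent_002", "confidence": 0.93},
    {"subject": "ent_001", "predicate": "Released", "object": "ent_002", "confidence": 0.80},
    {"subject": "ent_001", "predicate": "admires", "object": "ent_003", "confidence": 0.40},
]
normalized, ambiguities = normalize_relations(relations)
print(len(normalized), len(ambiguities))
```

Here the two records about `ent_001` and `ent_002` collapse into one, and the `admires` record is held back for review.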

Minimal internal schema

Use these record shapes during extraction.

Entity

{
  "id": "ent_001",
  "mention": "OpenAI",
  "canonical_name": "OpenAI",
  "type": "Organization",
  "evidence": "OpenAI published the GPT-4 Technical Report.",
  "confidence": 0.95
}

Relation

{
  "subject": "ent_001",
  "predicate": "published",
  "object": "ent_002",
  "evidence": "OpenAI published the GPT-4 Technical Report.",
  "confidence": 0.93
}

Attribute

{
  "entity_id": "ent_002",
  "attribute": "year",
  "value": "2023",
  "evidence": "The report was released in 2023.",
  "confidence": 0.87
}

Event

{
  "id": "ev_001",
  "type": "Publication",
  "trigger": "published",
  "participants": {
    "agent": "ent_001",
    "object": "ent_002"
  },
  "time": "2023",
  "location": null,
  "evidence": "OpenAI published the GPT-4 Technical Report in 2023.",
  "confidence": 0.92
}
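
To show how these record shapes can stay separate internally and still end up as triples, here is one possible mapping. It is a sketch under assumptions: the actual rules are in references/triple-mapping.md, which is not reproduced here. The function name `to_triples`, the `type` predicate for events, and the entity record for `ent_002` are invented for illustration. Events are reified, meaning the event id becomes the subject so that participants and time stay tied to the same event.

```python
def to_triples(entities, relations, attributes, events):
    """Flatten internal records into subject/predicate/object dicts (illustrative)."""
    # Resolve entity ids to canonical names for the exported triples.
    name = {e["id"]: e["canonical_name"] for e in entities}
    triples = []
    for r in relations:
        triples.append({"subject": name[r["subject"]], "predicate": r["predicate"],
                        "object": name[r["object"]]})
    for a in attributes:
        triples.append({"subject": name[a["entity_id"]], "predicate": a["attribute"],
                        "object": a["value"]})
    for ev in events:
        # One triple per event facet, all hanging off the event id.
        triples.append({"subject": ev["id"], "predicate": "type", "object": ev["type"]})
        for role, ent_id in ev["participants"].items():
            triples.append({"subject": ev["id"], "predicate": role, "object": name[ent_id]})
        for field in ("time", "location"):
            if ev.get(field) is not None:  # skip null time or location
                triples.append({"subject": ev["id"], "predicate": field, "object": ev[field]})
    return triples

entities = [
    {"id": "ent_001", "canonical_name": "OpenAI"},
    # ent_002 is not spelled out in this document; this record is assumed.
    {"id": "ent_002", "canonical_name": "GPT-4 Technical Report"},
]
relations = [{"subject": "ent_001", "predicate": "published", "object": "ent_002"}]
attributes = [{"entity_id": "ent_002", "attribute": "year", "value": "2023"}]
events = [{"id": "ev_001", "type": "Publication",
           "participants": {"agent": "ent_001", "object": "ent_002"},
           "time": "2023", "location": None}]

for t in to_triples(entities, relations, attributes, events):
    print(t["subject"], t["predicate"], t["object"], sep="\t")
```

Flattening only at export time means a complex event can gain more participants or a location later without rewriting triples that were already derived.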

How to use references

  • Read references/pipeline.md for the end-to-end procedure.
  • Read references/schema.md for types and intermediate record structure.
  • Read references/relation-taxonomy.md before inventing new predicates.
  • Read references/triple-mapping.md when exporting final triples.
  • Read references/event-modeling.md when text describes complex events.
  • Read references/quality-checklist.md before final delivery.

Scripts

Extract

python3 skills/information-extraction/scripts/extract.py --text "OpenAI published GPT-4." --output out.json

Or read from stdin:

echo "OpenAI published GPT-4." | python3 skills/information-extraction/scripts/extract.py --stdin --output out.json

Normalize

python3 skills/information-extraction/scripts/normalize.py --input out.json --output normalized.json

Export triples

python3 skills/information-extraction/scripts/export_triples.py --input normalized.json --format json --output triples.json
python3 skills/information-extraction/scripts/export_triples.py --input normalized.json --format jsonl --output triples.jsonl
python3 skills/information-extraction/scripts/export_triples.py --input normalized.json --format tsv --output triples.tsv

Notes on automation

This is a semi-automatic pipeline, not a claim of perfect extraction. The scripts provide scaffolding, normalization, and export. For high-stakes outputs, keep evidence and perform manual review.