DataHub Extension
This tutorial covers how to add the DataHub MCP Server as a goose extension to enable AI-powered data discovery, lineage exploration, and metadata querying across your data ecosystem.
Command
uvx mcp-server-datahub@latest
Environment Variables
DATAHUB_GMS_URL: <your-datahub-url>
DATAHUB_GMS_TOKEN: <your-datahub-token>
What is DataHub?
DataHub is an open-source metadata platform that provides a unified view of your data ecosystem, cataloging datasets, dashboards, pipelines, and more with rich metadata including ownership, lineage, usage statistics, and data quality information.
The DataHub MCP Server enables AI agents to:
- Find trustworthy data using natural language search with trust signals like popularity, quality, and lineage
- Explore data lineage to understand upstream and downstream dependencies at table and column level
- Understand business context through glossaries, domains, data products, and organizational metadata
- Generate SQL queries with help from documentation, lineage, and popular query patterns
Learn more: DataHub MCP Server Guide | GitHub Repository
Prerequisites
Before using the DataHub MCP Server, ensure you have:
- Python 3.10+ and uv package manager installed
- A DataHub instance: DataHub Cloud or self-hosted DataHub
- A Personal Access Token from your DataHub instance
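The first prerequisite can be checked with a short script. This is a minimal sketch, not part of the DataHub or goose tooling: the helper name `check_prerequisites` is hypothetical, and it only reports whether the running Python meets the 3.10 minimum and whether `uv` and `uvx` can be found on your PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Hypothetical helper: report which local prerequisites appear to be met."""
    return {
        # Compare the running interpreter against the minimum version.
        "python_ok": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "uv_found": shutil.which("uv") is not None,
        "uvx_found": shutil.which("uvx") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

The script does not check your DataHub instance or token; those still need to be confirmed in DataHub itself.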
Configuration
Note that you'll need uv installed on your system to run this command, as it uses uvx.
goose Desktop
- Launch the installer
- Click Yes to confirm the installation
- Get your DataHub Personal Access Token and paste it in
- Click Add Extension
- Click the button in the top-left to open the sidebar
- Navigate to the chat
goose CLI
- Run the configure command:
goose configure
- Choose to add a Command-line Extension.
┌   goose-configure
│
◇  What would you like to configure?
│  Add Extension
│
◆  What type of extension would you like to add?
│  ○ Built-in Extension
│  ● Command-line Extension (Run a local command or script)
│  ○ Remote Extension (SSE)
│  ○ Remote Extension (Streaming HTTP)
└
- Give your extension a name.
┌   goose-configure
│
◇  What would you like to configure?
│  Add Extension
│
◇  What type of extension would you like to add?
│  Command-line Extension
│
◆  What would you like to call this extension?
│  DataHub
└
- Enter the command to run when this extension is used.
┌   goose-configure
│
◇  What would you like to configure?
│  Add Extension
│
◇  What type of extension would you like to add?
│  Command-line Extension
│
◇  What would you like to call this extension?
│  DataHub
│
◆  What command should be run?
│  uvx mcp-server-datahub@latest
└
- Enter the number of seconds goose should wait for actions to complete before timing out. Default is 300 seconds.
┌   goose-configure
│
◇  What would you like to configure?
│  Add Extension
│
◇  What type of extension would you like to add?
│  Command-line Extension
│
◇  What would you like to call this extension?
│  DataHub
│
◇  What command should be run?
│  uvx mcp-server-datahub@latest
│
◆  Please set the timeout for this tool (in secs):
│  300
└
- Enter a description for this extension.
┌   goose-configure
│
◇  What would you like to configure?
│  Add Extension
│
◇  What type of extension would you like to add?
│  Command-line Extension
│
◇  What would you like to call this extension?
│  DataHub
│
◇  What command should be run?
│  uvx mcp-server-datahub@latest
│
◇  Please set the timeout for this tool (in secs):
│  300
│
◆  Enter a description for this extension:
│  Data discovery and metadata platform integration
└
- Add environment variables for this extension.
┌   goose-configure
│
◇  What would you like to configure?
│  Add Extension
│
◇  What type of extension would you like to add?
│  Command-line Extension
│
◇  What would you like to call this extension?
│  DataHub
│
◇  What command should be run?
│  uvx mcp-server-datahub@latest
│
◇  Please set the timeout for this tool (in secs):
│  300
│
◇  Enter a description for this extension:
│  Data discovery and metadata platform integration
│
◆  Would you like to add environment variables?
│  Yes
│
◇  Environment variable name:
│  DATAHUB_GMS_URL
│
◇  Environment variable value:
│  https://your-instance.acryl.io
│
◇  Add another environment variable?
│  Yes
│
◇  Environment variable name:
│  DATAHUB_GMS_TOKEN
│
◇  Environment variable value:
│  ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
│
◇  Add another environment variable?
│  No
│
└  Added DataHub extension
Example Usage
Finding Trustworthy Data
Find datasets related to your project by describing what you need in natural language.
goose Prompt
Find all datasets related to customer transactions that are owned by the analytics team
goose Output
The DataHub extension will search across your data catalog and return relevant datasets with their metadata, including:
- Dataset names and descriptions
- Column names, types, descriptions, and labels
- Owners
- Tags, properties, and glossary terms
- Usage statistics
- Data quality status
Exploring Data Lineage
I want to remove the "timestamp_seconds" column from the customer_orders table. What will break?
goose Prompt
Show me the upstream lineage for the customer_orders table
goose Output
The extension will traverse the lineage graph and show any:
- Source tables and datasets
- Transformation pipelines
- ETL jobs and workflows
- Downstream columns
These are the assets that would be impacted by removing the column.
Generating SQL Queries
How do I calculate the number of orders made in the USA last year?
goose Prompt
What are the most common queries run against the customer_orders dataset?
goose Output
The extension will retrieve SQL query history showing:
- Frequently executed queries
- Common join patterns
- Filter conditions
- Aggregation patterns
The extension also returns column names, types, descriptions, and any labels, which enables the agent to generate high-quality SQL that answers the question.
Understanding Data Quality & Freshness
Determine whether a dataset is trustworthy before using it.
goose Prompt
Is the customer_orders table fresh and free of data quality issues?
goose Output
The extension will fetch:
- Latest data quality assertions and test results
- Freshness / staleness metrics
- Schema change history
- SLA or SLO metadata
- Owner-provided health status
This allows the agent to warn the user or confirm that the data is trustworthy.
Capabilities
The DataHub MCP Server provides the following tools:
search
Search DataHub using structured keyword search (/q syntax) with boolean logic, filters, pagination, and optional sorting by usage metrics.
get_lineage
Retrieve upstream or downstream lineage for any entity (datasets, columns, dashboards, etc.) with filtering, query-within-lineage, pagination, and hop control.
get_dataset_queries
Fetch real SQL queries referencing a dataset or column (manual or system-generated) to understand usage patterns, joins, filters, and aggregation behavior.
get_entities
Fetch detailed metadata for one or more entities by URN; supports batch retrieval for efficient inspection of search results.
list_schema_fields
List schema fields for a dataset with keyword filtering and pagination, useful when search results truncate fields or when exploring large schemas.
get_lineage_paths_between
Retrieve the exact lineage paths between two assets or columns, including intermediate transformations and SQL query information.
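Several of these tools, such as get_entities, identify assets by URN. As an illustration, here is a minimal sketch of how a DataHub dataset URN is put together. The helper name `dataset_urn`, the `snowflake` platform, and the `analytics.public.customer_orders` table name are assumptions for the example, not values from this guide. In practice the agent usually gets URNs from search results rather than building them by hand.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Hypothetical helper: build a dataset URN from its three parts.

    platform: data platform identifier, e.g. "snowflake" (assumed example)
    name:     fully qualified dataset name on that platform
    env:      environment (fabric) label, e.g. "PROD"
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


if __name__ == "__main__":
    # Example values are placeholders, not real assets.
    print(dataset_urn("snowflake", "analytics.public.customer_orders"))
```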
Troubleshooting
Connection Issues
If you're having trouble connecting to DataHub:
- Verify your DATAHUB_GMS_URL is correct:
  - For DataHub Cloud: https://your-tenant.acryl.io
  - For local instances: http://localhost:8080
  - For on-premises: https://datahub.your-company.com
- Confirm your Personal Access Token is valid and has appropriate permissions
- Check network connectivity and firewall rules
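Before looking at the network, it can help to rule out a malformed or missing setting. The sketch below is one possible local check, assuming the two environment variable names used in this guide. The function name `check_datahub_env` is hypothetical. It only inspects the values and does not contact DataHub or validate the token.

```python
import os
from urllib.parse import urlparse


def check_datahub_env(env=None):
    """Hypothetical helper: return a list of problems found in the settings."""
    env = os.environ if env is None else env
    problems = []

    url = env.get("DATAHUB_GMS_URL", "").strip()
    token = env.get("DATAHUB_GMS_TOKEN", "").strip()

    if not url:
        problems.append("DATAHUB_GMS_URL is not set")
    else:
        parsed = urlparse(url)
        # The examples in this guide all use http:// or https:// URLs.
        if parsed.scheme not in ("http", "https"):
            problems.append("DATAHUB_GMS_URL should start with http:// or https://")
        if not parsed.hostname:
            problems.append("DATAHUB_GMS_URL has no host name")

    if not token:
        problems.append("DATAHUB_GMS_TOKEN is not set")

    return problems


if __name__ == "__main__":
    issues = check_datahub_env()
    print("No obvious problems found" if not issues else "\n".join(issues))
```

An empty list means the values look well-formed; the token and network checks above still apply.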
Installation Issues
If uvx is not found:
- Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
- Restart your terminal or source your shell configuration
- Verify installation: which uvx