Apache Atlas 1.2.0 Setup and Hive Integration - 白红宇

Published: 2019-03-09

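Introduction

Note: the following is excerpted from the official website.

Atlas is a scalable and extensible set of core foundational governance services. It enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and allows integration with the whole enterprise data ecosystem.

Atlas provides organizations with open metadata management and governance capabilities to build a catalog of their data assets, classify and govern those assets, and give data scientists, analysts, and the data governance team collaboration capabilities around them.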

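Features:

Metadata types and instances

Predefined types for various Hadoop and non-Hadoop metadata

Ability to define new types for the metadata to be managed

Types can have primitive attributes, complex attributes, and object references, and can inherit from other types

Instances of types, called entities, capture metadata object details and their relationships

REST APIs for working with types and instances allow easier integration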

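Classification:

Ability to create classifications dynamically, such as PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE

Classifications can include attributes, such as the expiry_date attribute of the EXPIRES_ON classification

Entities can be associated with multiple classifications, which makes discovery and security enforcement easier

Propagation of classifications via lineage automatically ensures that classifications follow the data as it passes through various processing steps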
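As a rough illustration of a classification that carries an attribute, the sketch below builds the JSON body one might POST to the Atlas v2 type-definition endpoint (/api/atlas/v2/types/typedefs). The description text and the choice of the "date" type for expiry_date are assumptions made for illustration; the function only builds the payload and does not contact a server.

```python
import json


def classification_typedef(name, attribute_name, attribute_type):
    """Build a typedefs payload holding one classification with one attribute."""
    return {
        "classificationDefs": [
            {
                "name": name,
                # Description text is illustrative only.
                "description": f"Illustrative {name} classification",
                "superTypes": [],
                "attributeDefs": [
                    {
                        "name": attribute_name,
                        "typeName": attribute_type,
                        "isOptional": True,
                        "cardinality": "SINGLE",
                        "valuesMinCount": 0,
                        "valuesMaxCount": 1,
                        "isUnique": False,
                        "isIndexable": False,
                    }
                ],
            }
        ]
    }


# Assumption: expiry_date is modelled with Atlas's built-in "date" type.
payload = classification_typedef("EXPIRES_ON", "expiry_date", "date")
print(json.dumps(payload, indent=2))
```

The printed JSON can then be sent with any HTTP client once the server is running.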

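Lineage:

Intuitive UI for viewing where data comes from and where it goes as it moves through various processes

REST APIs to access and update lineage

Search/Discovery
Intuitive UI to search entities by type, classification, attribute value, or free text
Rich REST APIs to search by complex criteria
SQL-like query language for searching entities: a domain-specific language (DSL)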
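To make the DSL search a little more concrete, here is a minimal sketch that builds the URL for the Atlas v2 DSL search endpoint (/api/atlas/v2/search/dsl). The host node01:21000 matches the web UI address used later in this article; the query string itself is only an illustrative example, and the helper name build_dsl_search_url is hypothetical.

```python
from urllib.parse import urlencode


def build_dsl_search_url(base_url, query, limit=10, offset=0):
    """Return the GET URL for an Atlas DSL search; nothing is sent."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/dsl?{params}"


# Illustrative query: find Hive tables named t_host_info.
url = build_dsl_search_url(
    "http://node01:21000", 'hive_table where name = "t_host_info"'
)
print(url)
```

The request needs credentials when sent; the URL builder is kept separate so the query encoding can be checked on its own.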

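Atlas mainly relies on HBase to store its data.

The corresponding indexes are stored in Solr, on top of which Atlas exposes a rich RESTful API for querying the index.

The relationships between the various kinds of metadata are stored in the graph database JanusGraph.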

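1. Download the source

Choose a mirror and download the source package.

After downloading, upload it to your Linux environment and extract it:

tar -zxvf apache-atlas-1.2.0-sources.tar.gz

This gives you the Atlas source directory.

Next, the source needs to be compiled.

Prepare the build environment:

Java 1.8

Maven 3.5 or later

Node.js (the build downloads packages through npm, so set this up yourself beforehand)

For the build process, refer to the official website.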
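Before starting a long build it can help to confirm the prerequisites listed above. The following is a minimal sketch that compares a version banner against a minimum version; the sample banner strings are illustrative stand-ins for what `mvn -v` or `java -version` might print, not captured output, and the helper names are hypothetical.

```python
import re


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_text, minimum):
    """True if the version found in version_text is at least `minimum`."""
    return parse_version(version_text) >= minimum


# Illustrative banners (assumed format, not real captured output).
print(meets_minimum("Apache Maven 3.6.0", (3, 5)))       # Maven 3.5 or later
print(meets_minimum('java version "1.8.0_201"', (1, 8)))  # Java 1.8
```

In practice the banner text would come from running the tools, for example through subprocess.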

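2. Compile the source

Go into the Atlas source directory and build it with Maven:

mvn clean -DskipTests package -Pdist,embedded-hbase-solr

The build takes quite a while, because it also downloads HBase and Solr.

When the build finishes, look in the following directory under the source tree:

/export/servers/apache-atlas-sources-1.2.0/distro/target

It now contains several new packages. Of these, apache-atlas-1.2.0-server is the one used for configuration and startup.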

3. Modify the configuration

After compiling, we need to edit the configuration file atlas-application.properties:

atlas.graph.storage.backend=hbase
atlas.graph.storage.hbase.table=apache_atlas_janus
atlas.graph.storage.hostname=localhost
atlas.graph.storage.hbase.regions-per-server=1
atlas.graph.storage.lock.wait-time=10000
atlas.EntityAuditRepository.impl=org.apache.atlas.repository.audit.HBaseBasedAuditRepository
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=localhost:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=true
atlas.graph.index.search.max-result-set-size=150

#########  Notification Configs  #########
atlas.notification.embedded=true
atlas.kafka.data=${sys:atlas.home}/data/kafka
atlas.kafka.zookeeper.connect=localhost:9026
atlas.kafka.bootstrap.servers=localhost:9027
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000
atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000
atlas.server.http.port=21000
atlas.enableTLS=false
atlas.authentication.method.kerberos=false
atlas.authentication.method.file=true
#### ldap.type= LDAP or AD
atlas.authentication.method.ldap.type=none
#### user credentials file
atlas.authentication.method.file.filename=${sys:atlas.home}/conf/users-credentials.properties
atlas.rest.address=http://localhost:21000
# If enabled and set to true, this will run setup steps when the server starts
#atlas.server.run.setup.on.start=false

#########  Entity Audit Configs  #########
atlas.audit.hbase.tablename=apache_atlas_entity_audit
atlas.audit.zookeeper.session.timeout.ms=1000
atlas.audit.hbase.zookeeper.quorum=localhost:2181

#########  High Availability Configuration ########
atlas.server.ha.enabled=false
atlas.authorizer.impl=simple
atlas.authorizer.simple.authz.policy.file=atlas-simple-authz-policy.json
atlas.rest-csrf.enabled=true
atlas.rest-csrf.browser-useragents-regex=^Mozilla.*,^Opera.*,^Chrome.*
atlas.rest-csrf.methods-to-ignore=GET,OPTIONS,HEAD,TRACE
atlas.rest-csrf.custom-header=X-XSRF-HEADER
atlas.metric.query.cache.ttlInSecs=900

#########  Gremlin Search Configuration  #########
#Set to false to disable gremlin search.
atlas.search.gremlin.enable=false

########## Add http headers ###########
#atlas.headers.Access-Control-Allow-Origin=*
#atlas.headers.Access-Control-Allow-Methods=GET,OPTIONS,HEAD,PUT,POST
#atlas.headers.<headerName>=<headerValue>

You can adjust these settings to suit your needs; see the official website for reference.

Since I am only testing locally, I left both HBase and Solr pointing at the default localhost. If you run your own cluster, change these settings to point to it.
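When switching from the local setup to a cluster, it is easy to miss one of the localhost entries. The sketch below parses properties text into a dictionary and lists the keys that still mention localhost. It handles only simple key=value lines and # comments, which is a simplification of the full Java properties format, and the short sample text is an excerpt of the configuration used in this article.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def local_endpoints(props):
    """Return the sorted keys whose value still mentions localhost."""
    return sorted(key for key, value in props.items() if "localhost" in value)


SAMPLE = """\
atlas.graph.storage.backend=hbase
atlas.graph.storage.hostname=localhost
atlas.graph.index.search.solr.zookeeper-url=localhost:2181
#atlas.server.run.setup.on.start=false
atlas.server.http.port=21000
"""

props = parse_properties(SAMPLE)
for key in local_endpoints(props):
    print(key, "=", props[key])
```

To check a real installation, read the contents of atlas-application.properties and pass them to parse_properties in place of SAMPLE.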

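4. Start Atlas

In local mode you do not need to set up the Solr nodes yourself. The Atlas startup script

/export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/bin/atlas_start.py

already creates them for us by default.

Go to the bin directory and start Atlas.

Then open the web UI:

http://node01:21000/

Log in with the initial username and password, admin/admin.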
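The same admin/admin credentials work for the REST API through HTTP Basic authentication. As a minimal sketch, the code below prepares (but does not send) a request to the Atlas version endpoint, /api/atlas/admin/version, which is a convenient way to confirm the server is up. The helper name build_version_request is hypothetical.

```python
import base64
import urllib.request


def build_version_request(base_url, user, password):
    """Prepare a GET request for the Atlas version endpoint with Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/atlas/admin/version"
    )
    request.add_header("Authorization", "Basic " + token)
    return request


request = build_version_request("http://node01:21000", "admin", "admin")
print(request.full_url)
# To actually call the server (requires a running Atlas instance):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```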

You can also check the Solr UI at http://192.168.100.210:9838/

At first it contains no data. You can import test data using the script provided in the bin directory.

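5. Integrate Hive

We use Hive as the example here. In day-to-day work, without a dedicated metadata management platform it is hard to keep track of how the tables in your data warehouse relate to one another, or whether a given table is referenced anywhere.

With Atlas you can effectively monitor Hive jobs and their lineage, so that your data does not turn into data silos or a data swamp.

In other words, a data middle platform built on big data development is incomplete without a metadata management platform.

About the current environment: I have deployed three machines as my own test cluster, and everything below assumes that Hive starts normally.

So how does Atlas integrate with Hive? It uses a hook to listen for your Hive events and sends the event messages to a designated Kafka queue.

Atlas then uses these messages to build the relationships between entities.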

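So integrating Hive requires changes to two configuration files:

hive-site.xml

hive-env.sh

Go to the /export/servers/apache-hive-2.1.1-bin/conf directory.

Add the following configuration to hive-site.xml:

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>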
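Because the same hive-site.xml change has to be made on every machine, it can be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree that adds the hook property if it is missing and updates it if it is already there. It works on an in-memory document here; the helper name ensure_property is hypothetical.

```python
import xml.etree.ElementTree as ET

HOOK_NAME = "hive.exec.post.hooks"
HOOK_CLASS = "org.apache.atlas.hive.hook.HiveHook"


def ensure_property(root, name, value):
    """Set <property> `name` to `value`, creating the element if needed."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return prop
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


# In-memory stand-in for the contents of hive-site.xml.
root = ET.fromstring("<configuration></configuration>")
ensure_property(root, HOOK_NAME, HOOK_CLASS)
ensure_property(root, HOOK_NAME, HOOK_CLASS)  # second call adds no duplicate
print(ET.tostring(root, encoding="unicode"))
```

Note that this sketch overwrites any existing value. If other post-execution hooks are already configured, they would need to be kept in the same comma-separated value.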

Add the following configuration to hive-env.sh:

export HIVE_AUX_JARS_PATH=/export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook/hive

If you are running a multi-machine cluster, the following directory must exist on every machine that runs Hive:

/export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook/hive

So we use scp to distribute the directory to the other machines.

Go to /export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook/ and run:

scp -r hive root@node02:/export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook/
scp -r hive root@node03:/export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook/

The corresponding Hive configuration changes must then be made on each of those machines as well.
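With more than a couple of nodes, typing each scp command by hand becomes error-prone. As a small sketch, the function below generates one scp argument list per target node. The node names and the root user come from this article, and the helper name scp_commands is hypothetical. The commands are only built here, not executed.

```python
HOOK_PARENT = (
    "/export/servers/apache-atlas-sources-1.2.0/distro/target/"
    "apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook/"
)


def scp_commands(nodes, user="root", source="hive", target_dir=HOOK_PARENT):
    """Return one scp argument list per node for copying the hook directory."""
    return [
        ["scp", "-r", source, f"{user}@{node}:{target_dir}"] for node in nodes
    ]


commands = scp_commands(["node02", "node03"])
for command in commands:
    print(" ".join(command))
    # To run it from the hook/ directory: subprocess.run(command, check=True)
```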

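The integration may fail because certain Jackson jars are missing.

The missing jars are the following, all at version 2.9.9:

jackson-jaxrs-json-provider
jackson-jaxrs-base
jackson-module-jaxb-annotations

You can download the matching versions of these jars manually and put them in

/export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook/hive/atlas-hive-plugin-impl/

and then distribute the directory again.

Then restart Hive and confirm that it restarts normally.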
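A quick way to confirm the jars are in place on each machine is to compare the directory listing against the three required names. The sketch below assumes the usual Maven file naming, artifact-version.jar, and works on a list of file names so it can be tried without the real directory; the sample listing is illustrative.

```python
REQUIRED_JARS = (
    "jackson-jaxrs-json-provider",
    "jackson-jaxrs-base",
    "jackson-module-jaxb-annotations",
)
VERSION = "2.9.9"


def missing_jars(filenames, required=REQUIRED_JARS, version=VERSION):
    """Return the required artifacts whose <artifact>-<version>.jar is absent."""
    present = set(filenames)
    return [name for name in required if f"{name}-{version}.jar" not in present]


# Illustrative listing; in practice use os.listdir() on atlas-hive-plugin-impl/.
listing = ["jackson-jaxrs-base-2.9.9.jar", "some-other-library.jar"]
print(missing_jars(listing))
```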

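Check whether the current Hive databases already contain data. Atlas can only listen for operations that happen from now on, so the historical data has to be imported separately.

Then restart Atlas.

After the restart, verify that Solr and HBase have started properly. You can use jps to check whether Java processes for these two applications exist. HBase is the main storage database, so if it is not running, our metadata cannot be retrieved.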

Import the historical Hive data

Go to /export/servers/apache-atlas-sources-1.2.0/distro/target/apache-atlas-1.2.0-server/apache-atlas-1.2.0/hook-bin/

and run the import-hive.sh script.

It asks for the default username and password.

After the import succeeds, the corresponding data shows up in the UI.

With the data visible, you can add classifications and other information to the tables as needed.

Next, let's create a table and insert some data to see whether Atlas picks up our operations:

-- Create a table
create table t_host_event (ip string, event string);

-- Insert data
insert into t_host_event values ('192.168.100.210','login on hive');

-- Join query
select * from t_host_info left join t_host_event on t_host_info.ip=t_host_event.ip;

-- Join query written to a new table
create table t_host_info2event as select host,t_host_info.ip ,event from t_host_info left join t_host_event on t_host_info.ip=t_host_event.ip;
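Besides the UI, the result can be checked through the REST API. The sketch below builds two URLs: one to look up the new table by its qualifiedName, and one to fetch lineage for an entity GUID. It assumes the table lives in the default database and that the Atlas cluster name is primary, which gives the qualifiedName default.t_host_info2event@primary. Both are assumptions about this environment, and the helper names are hypothetical. The GUID shown is a placeholder that would come from the lookup response.

```python
from urllib.parse import quote

BASE_URL = "http://node01:21000"


def table_lookup_url(base_url, qualified_name):
    """URL to fetch a hive_table entity by its unique qualifiedName."""
    return (
        f"{base_url.rstrip('/')}/api/atlas/v2/entity/uniqueAttribute/type/"
        f"hive_table?attr:qualifiedName={quote(qualified_name, safe='')}"
    )


def lineage_url(base_url, guid, direction="BOTH", depth=3):
    """URL to fetch lineage around the entity with the given GUID."""
    return (
        f"{base_url.rstrip('/')}/api/atlas/v2/lineage/{guid}"
        f"?direction={direction}&depth={depth}"
    )


# Assumed database ("default") and cluster name ("primary").
lookup = table_lookup_url(BASE_URL, "default.t_host_info2event@primary")
# Placeholder GUID; the real value is returned by the lookup call.
lineage = lineage_url(BASE_URL, "<guid-from-lookup>")
print(lookup)
print(lineage)
```

If the hook is working, the lineage for t_host_info2event should link it back to t_host_info and t_host_event through the create-table-as-select statement.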