1.hive vs spark-sql#为了给熟悉RDBMS但又不理解MapReduce的技术人员提供快速上手的工具,hive应运而生,它是运行在Hadoop上的SQL-on-hadoop工具。但是MapReduce计算过程中大量的中间磁盘落地过程消耗了大量的I/O,运行效率底,spark sql而是采用内存存储可以减少大量的中间磁盘落地数据。相比hive速度能提高10到100倍。
data:image/s3,"s3://crabby-images/2484c/2484c44ad073a666b6453869989500f12e72167e" alt=""
2.集群配置#拷贝hive的配置文件到spark conf的目录下,并删除不必要的信息,增加thrift server配置
data:image/s3,"s3://crabby-images/3c06e/3c06e624849b1ef52899f80867f73d802230c4fa" alt=""
data:image/s3,"s3://crabby-images/92539/9253907a71e77cff51b9fec473ed699f1325934a" alt=""
修改conf/hive-site.xml 中的 “hive.metastore.schema.verification”值为false即可解决 “Caused by: MetaException(message:Version information not found in metastore.)
减少日志输出
data:image/s3,"s3://crabby-images/a7919/a7919d1c5d700675e1a299412d8d23f9d9f93e9c" alt=""
spark-env.sh
data:image/s3,"s3://crabby-images/2cd90/2cd902dbdfcd0e053ec760189d44c66945ddf550" alt=""
spark-defaults.conf,yarn模式运行所需要的Libs
/usr/local/spark/jars
data:image/s3,"s3://crabby-images/0faac/0faacd01d71ec9c30736c8a50f10cc1724eb1269" alt=""
3.spark-sql#这种方式每个人一个driver彼此之间的数据无法共享
spark-sql --master yarn --queue hainiu --num-executors 12 --executor-memory 5G
data:image/s3,"s3://crabby-images/e90e4/e90e4d3a5b2f135b43622cdae9f01ffd48c71973" alt=""
data:image/s3,"s3://crabby-images/e5b9d/e5b9d0c5e75a6e14aa7e7c6af1eaf8113e0b4e85" alt=""
data:image/s3,"s3://crabby-images/fd746/fd7464ee92f858e82f13d168ad90f38ec22878ea" alt=""
data:image/s3,"s3://crabby-images/118aa/118aaa15eb08932fcf27a47be0c4fc20f0c681b4" alt=""
data:image/s3,"s3://crabby-images/fc365/fc365edebd338be03f8fd1a549916d0395f8518f" alt=""
默认的任务partitions为200
data:image/s3,"s3://crabby-images/33116/33116daa5c4a9c15783f2805344f9bea0d547d8a" alt=""
SET spark.sql.shuffle.partitions=20;
可以减少shuffle的次数
data:image/s3,"s3://crabby-images/8d8e2/8d8e2cf733bbd3eca4374ee7aba66b7aa1a96382" alt=""
data:image/s3,"s3://crabby-images/b7d92/b7d92decd0e8a8b4d803b40fee228a73cb5e0f5a" alt=""
data:image/s3,"s3://crabby-images/8561c/8561c8a6d5e997ba0c75bb5269b51aa95a0f4c0b" alt=""
spark-sql --help可以查看CLI命令参数:
4.spart thriftserver#这种方式所有人可以连接driver彼此之间的数据可以共享
ThriftServer是一个JDBC/ODBC接口,用户可以通过JDBC/ODBC连接ThriftServer来访问SparkSQL的数据。ThriftServer在启动的时候,会启动了一个SparkSQL的应用程序,而通过JDBC/ODBC连接进来的客户端共同分享这个SparkSQL应用程序的资源,也就是说不同的用户之间可以共享数据;ThriftServer启动时还开启一个侦听器,等待JDBC客户端的连接和提交查询。所以,在配置ThriftServer的时候,至少要配置ThriftServer的主机名和端口,如果要使用Hive数据的话,还要提供Hive Metastore的uris。
使用hadoop用户,不然不能创建logs目录权限
/usr/local/spark/sbin/start-thriftserver.sh --master yarn --queue hainiu
data:image/s3,"s3://crabby-images/c550b/c550b82afd0a8040ba87d2361ac347de6a011341" alt=""
使用任意业务用户来使用beeline连接thriftserver
/usr/local/hive/bin/beeline
这是使用了hive的beeline,因为与sparkserver提供的jdbc版本不一致所以,提示这个错误
data:image/s3,"s3://crabby-images/972e0/972e092558cbae7e564e56582f1d51b9d1340e91" alt=""
/usr/local/spark/bin/beeline
!connect jdbc:hive2://nn2.hadoop:20000
用spark的beeline就可以连接成功,因为使用的版本是一致的
data:image/s3,"s3://crabby-images/5a4bf/5a4bfd849eac5127fbca133c30a75790735663fa" alt=""
data:image/s3,"s3://crabby-images/72557/725574467bfc8e8659fbc181bd57e3b90c41a8ef" alt=""
缓存表,内存不够就刷到硬盘
cache table user_install_status;
data:image/s3,"s3://crabby-images/0244d/0244d4f8a641f82d8efa1f238684864ba03f902a" alt=""
data:image/s3,"s3://crabby-images/810de/810de2f80b310122c4896f5ca213dd7cf9c23cb3" alt=""
这时数据缓存在硬盘,查询用时30秒
data:image/s3,"s3://crabby-images/8bedc/8bedc9b5066e6518213105d49242ec3c87a06012" alt=""
可以调大thriftserver的executor缓存和executor数量 --num-executors 12 --executor-memory 5G
/usr/local/spark/sbin/start-thriftserver.sh --master yarn --queue hainiu --num-executors 12 --executor-memory 5G
data:image/s3,"s3://crabby-images/81b5e/81b5ec771a36eb971bdb4bb2eb529b03d61106cb" alt=""
data:image/s3,"s3://crabby-images/f27d9/f27d9602c9a83fdbd4da0df06a6ce12812096763" alt=""
data:image/s3,"s3://crabby-images/052de/052de2e6e12881ce4e1ce4d84811a3eb9b624d8c" alt=""
5.spark-jdbc编程#通过JDBC连接thriftserver
pom里添加spark的hive-jdbc
data:image/s3,"s3://crabby-images/fc05e/fc05ed075c87bea61c3e6aaf24abf4b15af8bf68" alt=""
data:image/s3,"s3://crabby-images/ffe7a/ffe7a00e6615ba221554b465e9b126636dd556b2" alt=""
data:image/s3,"s3://crabby-images/5dca4/5dca44c8ef725e54b4e3cf8f2871aa53fe7cc340" alt=""