Problem Description

After the Hadoop cluster installation was finished, the YARN console showed both the node ID and the node address as localhost.

[hadoop@master sbin]$ yarn node -list
20/12/17 12:21:19 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.8.42:18040
Total Nodes:1
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
 localhost:43141                RUNNING    localhost:8042                                  0

When a job was submitted, the YARN logs also printed the node address as 127.0.0.1 and used that IP as the node IP, so the connection inevitably failed.

2020-12-17 00:53:30,721 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1607916354082_0008_01_000001, AllocationRequestId: 0, Version: 0, NodeId: localhost:43141, NodeHttpAddress: localhost:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 127.0.0.1:35845 }, ExecutionType: GUARANTEED, ] for AM appattempt_1607916354082_0008_000001
2020-12-17 00:56:30,801 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1607916354082_0008_000001. Got exception: java.net.ConnectException: Call From master/172.16.8.42 to localhost:43141 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:827)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
        at org.apache.hadoop.ipc.Client.call(Client.java:1495)
        at org.apache.hadoop.ipc.Client.call(Client.java:1394)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)

Root Cause

In the Hadoop source code, the node information is built by the following method:

private NodeId buildNodeId(InetSocketAddress connectAddress, String hostOverride) {
    if (hostOverride != null) {
        connectAddress = NetUtils.getConnectAddress(
                new InetSocketAddress(hostOverride, connectAddress.getPort()));
    }
    return NodeId.newInstance(
            connectAddress.getAddress().getCanonicalHostName(),
            connectAddress.getPort());
}

Here the hostname is obtained via connectAddress.getAddress().getCanonicalHostName(). A hostname can also be obtained via getHostName(), so what is the difference? getCanonicalHostName() returns the fully qualified domain name, while getHostName() returns only the host name. For example, the host name might be definesys while the name configured in DNS is definesys.com; getCanonicalHostName() performs a reverse lookup against DNS to resolve the full name. In this case the address returned by getAddress() was actually 127.0.0.1, and the hosts file contained the following entry:

127.0.0.1     localhost       localhost.localdomain

so the address was resolved to localhost.
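
This resolution behaviour is easy to reproduce outside of Hadoop. The sketch below (the class ResolveDemo and the port number are illustrative, not taken from the Hadoop code base) follows the same lookup chain as buildNodeId: with the hosts entry above it typically prints localhost (or localhost.localdomain, depending on the resolver):

import java.net.InetAddress;
import java.net.InetSocketAddress;

public class ResolveDemo {
    public static void main(String[] args) {
        // Same chain as buildNodeId: socket address -> InetAddress -> canonical host name
        InetSocketAddress connectAddress = new InetSocketAddress("127.0.0.1", 43141);
        InetAddress addr = connectAddress.getAddress();

        // getHostName() may just echo back the literal used to create the address,
        // while getCanonicalHostName() does a reverse lookup via DNS / /etc/hosts
        System.out.println("getHostName:          " + addr.getHostName());
        System.out.println("getCanonicalHostName: " + addr.getCanonicalHostName());
    }
}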

Solution

The recommendation in the Hadoop wiki reads as follows:

  • If the error message says the remote service is on "127.0.0.1" or "localhost" that means the configuration file is telling the client that the service is on the local server. If your client is trying to talk to a remote system, then your configuration is broken.
  • Check that there isn't an entry for your hostname mapped to 127.0.0.1 or 127.0.1.1 in /etc/hosts (Ubuntu is notorious for this).

In short, it recommends removing the 127.0.0.1 and 127.0.1.1 hostname entries from /etc/hosts. After removing them, everything returned to normal and the problem was solved.
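
For reference, a rough sketch of what /etc/hosts could look like after the change; the 172.16.8.42 / master entry is taken from the logs above, while any additional node entries would of course use your own IPs and hostnames:

# map the real hostname to the real interface IP, not to a loopback address
172.16.8.42    master
# 127.0.0.1    localhost    localhost.localdomain    <- entry removed as per the fix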