Big Data Development: Working with the HDFS API

Create a Maven project and import the JAR dependencies

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0-mr1-cdh5.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.0-cdh5.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.6.0-cdh5.14.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>RELEASE</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
                <!--    <verbose>true</verbose>-->
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <minimizeJar>true</minimizeJar>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

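Accessing data through the FileSystem API

Working with HDFS from Java mainly involves the following classes:

Configuration: an object of this class encapsulates the configuration of a client or server.

FileSystem: an object of this class represents a file system, and its methods operate on the files in it. Obtain an instance through the static method FileSystem.get:

FileSystem fs = FileSystem.get(conf);

The get method decides which type of file system to return from the value of the fs.defaultFS parameter in conf. If the code does not set fs.defaultFS and no configuration file on the project classpath sets it either, the value comes from core-default.xml inside the Hadoop JAR, where the default is file:///. In that case get returns a client for the local file system rather than a DistributedFileSystem instance.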
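
The selection rule described above can be tried without a cluster. The following is a minimal JDK-only sketch, not Hadoop's actual implementation: the class DefaultFsSketch and its method schemeOf are hypothetical names, and the only fact it relies on is the file:/// fallback mentioned in the text.

```java
import java.net.URI;

// Hypothetical model of how the scheme of fs.defaultFS selects the file system type.
final class DefaultFsSketch {
    // Fallback value that the text attributes to core-default.xml.
    static final String BUILT_IN_DEFAULT = "file:///";

    // Returns the URI scheme that decides which kind of file system is used.
    static String schemeOf(String configuredDefaultFs) {
        String effective = (configuredDefaultFs == null) ? BUILT_IN_DEFAULT : configuredDefaultFs;
        return URI.create(effective).getScheme();
    }

    public static void main(String[] args) {
        System.out.println(schemeOf(null));                          // prints "file"
        System.out.println(schemeOf("hdfs://192.168.47.100:8020"));  // prints "hdfs"
    }
}
```

A scheme of file leads to the local file system and a scheme of hdfs leads to DistributedFileSystem, which is why every example in this article supplies an hdfs:// address in one way or another.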

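Several ways to obtain a FileSystem

First way to obtain a FileSystem

The snippets in this article are methods of a single test class. They assume the following imports (IOUtils is the commons-io class, and @Test is assumed to be the JUnit annotation):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.junit.Test;

@Test
public void getFileSystem() throws URISyntaxException, IOException {
    Configuration configuration = new Configuration();
    // The hdfs:// URI selects the distributed file system and the NameNode address
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), configuration);
    System.out.println(fileSystem.toString());
}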

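Second way to obtain a FileSystem

@Test
public void getFileSystem2() throws URISyntaxException, IOException {
    Configuration configuration = new Configuration();
    // Set the default file system explicitly instead of relying on core-default.xml
    configuration.set("fs.defaultFS", "hdfs://192.168.47.100:8020");
    // The URI "/" has no scheme or authority, so fs.defaultFS decides the file system
    FileSystem fileSystem = FileSystem.get(new URI("/"), configuration);
    System.out.println(fileSystem.toString());
}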

Third way to obtain a FileSystem

@Test
public void getFileSystem3() throws URISyntaxException, IOException {
    Configuration configuration = new Configuration();
    // newInstance takes the same arguments as get
    FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://192.168.47.100:8020"), configuration);
    System.out.println(fileSystem.toString());
}

Fourth way to obtain a FileSystem

@Test
public void getFileSystem4() throws Exception {
    Configuration configuration = new Configuration();
    // With no URI argument, newInstance reads the address from fs.defaultFS
    configuration.set("fs.defaultFS", "hdfs://192.168.47.100:8020");
    FileSystem fileSystem = FileSystem.newInstance(configuration);
    System.out.println(fileSystem.toString());
}
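
The four variants above differ in one respect that the snippets do not show. In Apache Hadoop, FileSystem.get normally hands back an instance from an internal cache, whereas FileSystem.newInstance creates a new object on every call. The sketch below is a deliberately simplified JDK-only model of that difference, not Hadoop's code: ClientHandle and FsCacheSketch are hypothetical names, and the model keys its cache on the URI string alone, while the real cache key also takes the user into account.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a file system client; only object identity matters here.
final class ClientHandle {
    final String uri;

    ClientHandle(String uri) {
        this.uri = uri;
    }
}

// Simplified model of get (cached, shared) versus newInstance (always new).
final class FsCacheSketch {
    private static final Map<String, ClientHandle> CACHE = new HashMap<>();

    // Models FileSystem.get: one shared handle per URI.
    static synchronized ClientHandle get(String uri) {
        return CACHE.computeIfAbsent(uri, ClientHandle::new);
    }

    // Models FileSystem.newInstance: a fresh handle on every call.
    static ClientHandle newInstance(String uri) {
        return new ClientHandle(uri);
    }

    public static void main(String[] args) {
        String uri = "hdfs://192.168.47.100:8020";
        System.out.println(get(uri) == get(uri));                 // true: the same shared object
        System.out.println(newInstance(uri) == newInstance(uri)); // false: two separate objects
    }
}
```

Because get shares one object, calling close() on it also closes it for any other code in the same JVM that obtained it through get. An instance from newInstance can be closed without that side effect.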

Recursively traversing all files in the file system

Traversing HDFS with manual recursion

@Test
public void listFile() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
    FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/"));
    for (FileStatus fileStatus : fileStatuses) {
        if (fileStatus.isDirectory()) {
            // Descend into the subdirectory
            Path path = fileStatus.getPath();
            listAllFiles(fileSystem, path);
        } else {
            System.out.println("File path: " + fileStatus.getPath().toString());
        }
    }
}

public void listAllFiles(FileSystem fileSystem, Path path) throws Exception {
    FileStatus[] fileStatuses = fileSystem.listStatus(path);
    for (FileStatus fileStatus : fileStatuses) {
        if (fileStatus.isDirectory()) {
            // Recurse until only files remain
            listAllFiles(fileSystem, fileStatus.getPath());
        } else {
            Path path1 = fileStatus.getPath();
            System.out.println("File path: " + path1);
        }
    }
}

Traversing directly with the API that Hadoop provides

/**
 * Recursive traversal using the API that Hadoop provides.
 * @throws Exception
 */
@Test
public void listMyFiles() throws Exception {
    // Obtain the FileSystem
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
    // listFiles returns a RemoteIterator over files only (directories are not returned).
    // The first argument is the path to traverse, the second enables recursion.
    RemoteIterator<LocatedFileStatus> locatedFileStatusRemoteIterator = fileSystem.listFiles(new Path("/"), true);
    while (locatedFileStatusRemoteIterator.hasNext()) {
        LocatedFileStatus next = locatedFileStatusRemoteIterator.next();
        System.out.println(next.getPath().toString());
    }
    fileSystem.close();
}
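
The two traversal styles, hand-written recursion and a library call that recurses for you, can be compared on the local disk without a cluster. The following is a JDK-only analogue rather than HDFS code: LocalTraversalSketch and its methods are hypothetical names, Path here is java.nio.file.Path and not Hadoop's Path, and Files.walk merely plays the role that listFiles(path, true) plays above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LocalTraversalSketch {
    // Manual recursion: descend into directories and collect everything else.
    static List<Path> listAllFiles(Path dir) {
        List<Path> out = new ArrayList<>();
        collect(dir, out);
        return out;
    }

    private static void collect(Path dir, List<Path> out) {
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    collect(child, out);
                } else {
                    out.add(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Library traversal: Files.walk performs the recursion internally.
    static List<Path> walkFiles(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a small temporary tree: top.txt and a/b/deep.txt.
    static Path demoTree() {
        try {
            Path root = Files.createTempDirectory("traversal");
            Files.createDirectories(root.resolve("a").resolve("b"));
            Files.write(root.resolve("top.txt"), new byte[0]);
            Files.write(root.resolve("a").resolve("b").resolve("deep.txt"), new byte[0]);
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = demoTree();
        System.out.println(listAllFiles(root).size() + " files found by recursion");
        System.out.println(walkFiles(root).size() + " files found by Files.walk");
    }
}
```

Both methods report the same two files for the demo tree. The library version is shorter and leaves less room for mistakes, which is the same trade-off as between listStatus with manual recursion and listFiles on HDFS.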

Downloading a file to the local machine

Test method that performs the download

/**
 * Copies a file from HDFS to the local machine.
 * @throws Exception
 */
@Test
public void getFileToLocal() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
    // Open the HDFS file for reading and a local file for writing
    FSDataInputStream open = fileSystem.open(new Path("/test/input/install.log"));
    FileOutputStream fileOutputStream = new FileOutputStream(new File("c:\\install.log"));
    // IOUtils is org.apache.commons.io.IOUtils
    IOUtils.copy(open, fileOutputStream);
    IOUtils.closeQuietly(open);
    IOUtils.closeQuietly(fileOutputStream);
    fileSystem.close();
}
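
The download boils down to copying an input stream into an output stream. The following JDK-only sketch shows that copy loop with try-with-resources, so both streams are closed even when the copy fails. In the method above, the closeQuietly calls are skipped if IOUtils.copy throws. StreamCopySketch and copy are hypothetical names, and in-memory streams stand in for the HDFS file and the local file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamCopySketch {
    // Copies everything from in to out, closes both, and returns the byte count.
    static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[4096];
        long total = 0;
        try (InputStream source = in; OutputStream target = out) {
            int read;
            while ((read = source.read(buffer)) != -1) {
                target.write(buffer, 0, read);
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = "install.log contents".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

In the HDFS version, the same pattern would wrap fileSystem.open(...) and the FileOutputStream in one try-with-resources statement.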

Creating a directory on HDFS

@Test
public void mkdirs() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
    // mkdirs also creates any missing parent directories, like mkdir -p
    boolean mkdirs = fileSystem.mkdirs(new Path("/hello/mydir/test"));
    fileSystem.close();
}

Uploading a file to HDFS

@Test
public void putData() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
    // Copy the local file into the HDFS directory /hello/mydir/test
    fileSystem.copyFromLocalFile(new Path("file:///c:\\install.log"), new Path("/hello/mydir/test"));
    fileSystem.close();
}

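Merging small files on HDFS

Hadoop is good at storing large files, because large files carry relatively little metadata. If a cluster holds a large number of small files, each one needs its own metadata entry, which greatly increases the memory the cluster needs to manage that metadata. In practice, therefore, small files should be merged into large files and processed together whenever necessary.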
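
A rough calculation makes the memory pressure concrete. The sketch below assumes an often-quoted rule of thumb of about 150 bytes of NameNode heap per file object and per block object, plus a 128 MB block size. Both figures are assumptions of this example rather than values from the article, and real usage varies by Hadoop version. NameNodeMemorySketch and estimateBytes are hypothetical names.

```java
final class NameNodeMemorySketch {
    // Assumed rule of thumb: roughly 150 bytes of heap per file or block object.
    static final long BYTES_PER_OBJECT = 150L;

    // Counts one object per file plus one object per block of that file.
    static long estimateBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 1 GB stored as 1024 files of 1 MB, one block each
        System.out.println(estimateBytes(1024, 1)); // prints 307200
        // The same 1 GB stored as one file of 8 blocks (128 MB block size assumed)
        System.out.println(estimateBytes(1, 8));    // prints 1350
    }
}
```

Under these assumptions the same gigabyte of data costs over two hundred times more metadata memory when it is stored as small files.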

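In the HDFS shell, a single command can merge many HDFS files into one large file while downloading it to the local machine:

cd /export/servers

hdfs dfs -getmerge /config/*.xml ./hello.xml

Since small files can be merged into one large file on download, they can likewise be merged into one large file on upload.

The code is as follows: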

/**
 * Uploads multiple local files to HDFS, merging them into one large file.
 * @throws Exception
 */
@Test
public void mergeFile() throws Exception {
    // Obtain the distributed file system as user root
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration(), "root");
    FSDataOutputStream outputStream = fileSystem.create(new Path("/bigfile.xml"));
    // Obtain the local file system
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    // List the files in the local source directory
    FileStatus[] fileStatuses = local.listStatus(new Path("file:///F:\\上传小文件合并"));
    for (FileStatus fileStatus : fileStatuses) {
        // Append each local file to the single HDFS output stream
        FSDataInputStream inputStream = local.open(fileStatus.getPath());
        IOUtils.copy(inputStream, outputStream);
        IOUtils.closeQuietly(inputStream);
    }
    IOUtils.closeQuietly(outputStream);
    local.close();
    fileSystem.close();
}
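
The core of the merge, appending several inputs to one output stream, can be exercised locally before pointing it at a cluster. The following JDK-only sketch is an analogue of the method above, not a replacement for it: MergeSketch, mergeInto, and demoDir are hypothetical names. Skipping subdirectories and sorting by file name are choices of this sketch that the original loop does not make.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class MergeSketch {
    // Appends every regular file directly under dir to out, in file-name order.
    // Returns the number of files merged.
    static int mergeInto(Path dir, OutputStream out) {
        try (Stream<Path> entries = Files.list(dir)) {
            List<Path> files = entries.filter(Files::isRegularFile)
                                      .sorted()
                                      .collect(Collectors.toList());
            for (Path file : files) {
                Files.copy(file, out);
            }
            return files.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a temporary directory holding two small files.
    static Path demoDir() {
        try {
            Path dir = Files.createTempDirectory("small-files");
            Files.write(dir.resolve("a.xml"), "<a/>".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("b.xml"), "<b/>".getBytes(StandardCharsets.UTF_8));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        int count = mergeInto(demoDir(), merged);
        System.out.println(count + " files merged: "
                + new String(merged.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

In the HDFS version, the OutputStream would be the FSDataOutputStream returned by fileSystem.create, so the merged bytes go straight to the cluster.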
