How to Sort a Large File


Sorting the lines of a file programmatically is often as simple as reading the file's contents into memory, sorting the lines by the desired criteria, then writing the sorted data back out to a file.

I recently encountered a problem in which I needed to sort a large file: a file gigabytes in size (in this case 10 GB). Reading all the data into memory, as traditional programmatic sorting requires, may not be ideal - or even possible - in many contexts. In this instance my requirement was a single-use, non-production solution, so I used a Linux shell script to pipe together commands that parsed the file into sortable columns, sorted the file on those columns, and stitched the columns back together*. At the same time, however, I began to consider what a solution could look like as programmatic, production-level code that runs independently of other tools.

One way to do this mirrors how the Unix sort command is implemented: sort uses external sorting, a technique similar in concept to merge sort. In this approach, the file to be sorted is read in chunks; each chunk is sorted independently of the others and written to its own temporary file. Afterwards, one is left with several files - each with its contents sorted - that must be spliced back together. This can be accomplished by reading lines from each temporary file into memory and merging them in sorted order, writing to the final output as the algorithm advances through the temporary files.
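The two phases can be sketched as a small, self-contained Java class. This is an illustrative sketch, not production code: the names (ExternalSortSketch, sortExternally) are invented for the example, and the merge step uses a PriorityQueue to pick the smallest head line across the chunk files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSortSketch {

    // One entry per chunk file: the current head line plus the reader it came from.
    private static final class Head {
        final String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    public static List<String> sortExternally(List<String> input, int maxChunkLines) throws IOException {
        // Phase 1: sort each chunk independently and write it to its own temporary file.
        List<Path> chunkFiles = new ArrayList<>();
        for (int i = 0; i < input.size(); i += maxChunkLines) {
            List<String> chunk = new ArrayList<>(input.subList(i, Math.min(i + maxChunkLines, input.size())));
            Collections.sort(chunk);
            Path tmp = Files.createTempFile("chunk", ".txt");
            Files.write(tmp, chunk);
            chunkFiles.add(tmp);
        }
        // Phase 2: k-way merge - the heap always holds one head line per chunk file,
        // so polling it yields the globally smallest remaining line.
        List<String> result = new ArrayList<>();
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        List<BufferedReader> readers = new ArrayList<>();
        try {
            for (Path p : chunkFiles) {
                BufferedReader r = Files.newBufferedReader(p);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
            }
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                result.add(smallest.line);
                String next = smallest.reader.readLine();
                if (next != null) heap.add(new Head(next, smallest.reader));
            }
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path p : chunkFiles) Files.deleteIfExists(p);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = List.of("pear", "apple", "orange", "banana", "kiwi");
        System.out.println(sortExternally(lines, 2));
        // prints [apple, banana, kiwi, orange, pear]
    }
}
```

The sketch takes its input as an in-memory list only to keep the example short; for a real multi-gigabyte file the same two phases apply, with the input streamed line by line instead.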

The following code is a Java class with a very basic implementation of this sorting algorithm. The splitChunks method performs the initial sorting into temporary files; the mergeChunks method merges the temporary files into the final sorted output.




import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/**
 * Class that can be used to sort a large file by splitting said file into several temporary sorted files and
 * merging those files.
 * @author Greg Cope
 *
 */
public class StreamSorter {

    private final Comparator<String> sorter;
    private int maxChunkSize = 100000000;
    private List<File> outputs = new ArrayList<File>();
    private String tempDirectory = "";

    public StreamSorter(Comparator<String> sorter){
        this.sorter = sorter;
    }

    /**
     * Sets the temporary directory.
     * @param temp
     */
    public void setTempDirectory(String temp){
        File file = new File(temp);
        if ( !file.exists() || !file.isDirectory() ){
            throw new IllegalArgumentException("Parameter temp is not a directory or does not exist");
        }
        tempDirectory = temp;
    }

    /**
     * Sets the chunk size for temporary files.
     * @param size
     */
    public void setMaximumChunkSize(int size){
        this.maxChunkSize = size;
    }

    /**
     * Reads the input stream and splits it into sorted chunks, each of which is written to a temporary file.
     * @param in
     * @throws IOException
     */
    public void splitChunks(InputStream in) throws IOException{
        outputs.clear();
        BufferedReader br = null;
        List<String> lines = new ArrayList<String>();
        try{
            br = new BufferedReader(new InputStreamReader(in));
            String line = null;
            int currChunkSize = 0;
            while ( (line = br.readLine()) != null ){
                lines.add(line);
                currChunkSize += line.length() + 1;
                if ( currChunkSize >= maxChunkSize ){
                    currChunkSize = 0;
                    Collections.sort(lines, sorter);
                    //include the chunk index in the name: successive chunks could otherwise
                    //collide if written within the same millisecond
                    File file = new File(tempDirectory + "temp" + System.currentTimeMillis() + "_" + outputs.size());
                    outputs.add(file);
                    writeOut(lines, new FileOutputStream(file));
                    lines.clear();
                }
            }
            //write out the remaining chunk
            Collections.sort(lines, sorter);
            File file = new File(tempDirectory + "temp" + System.currentTimeMillis() + "_" + outputs.size());
            outputs.add(file);
            writeOut(lines, new FileOutputStream(file));
            lines.clear();
        }finally{
            if ( br != null ){
                try{ br.close(); }catch(Exception e){}
            }
        }
    }

    /**
     * Writes the list of lines out to the output stream, appending a newline after each line.
     * @param list
     * @param os
     * @throws IOException
     */
    private void writeOut(List<String> list, OutputStream os) throws IOException{
        BufferedWriter writer = null;
        try{
            writer = new BufferedWriter(new OutputStreamWriter(os));
            for ( String s : list ){
                writer.write(s);
                writer.write("\n");
            }
            writer.flush();
        }finally{
            if ( writer != null ){
                try{ writer.close(); }catch(Exception e){}
            }
        }
    }

    /**
     * Reads the temporary files created by the splitChunks method and merges them in sorted order into the output stream.
     * @param os
     * @throws IOException
     */
    public void mergeChunks(OutputStream os) throws IOException{
        Map<StringWrapper, BufferedReader> map = new HashMap<StringWrapper, BufferedReader>();
        List<BufferedReader> readers = new ArrayList<BufferedReader>();
        BufferedWriter writer = null;
        ComparatorDelegate delegate = new ComparatorDelegate();
        try{
            writer = new BufferedWriter(new OutputStreamWriter(os));
            //prime the map with the first line of each temporary file
            for ( int i = 0; i < outputs.size(); i++ ){
                BufferedReader reader = new BufferedReader(new FileReader(outputs.get(i)));
                readers.add(reader);
                String line = reader.readLine();
                if ( line != null ){
                    map.put(new StringWrapper(line), reader);
                }
            }
            //continue to loop until no more lines are left
            List<StringWrapper> sorted = new LinkedList<StringWrapper>(map.keySet());
            while ( map.size() > 0 ){
                Collections.sort(sorted, delegate);
                StringWrapper line = sorted.remove(0);
                writer.write(line.string);
                writer.write("\n");
                BufferedReader reader = map.remove(line);
                String nextLine = reader.readLine();
                if ( nextLine != null ){
                    StringWrapper sw = new StringWrapper(nextLine);
                    map.put(sw, reader);
                    sorted.add(sw);
                }
            }
        }finally{
            for ( int i = 0; i < readers.size(); i++ ){
                try{ readers.get(i).close(); }catch(Exception e){}
            }
            for ( int i = 0; i < outputs.size(); i++ ){
                outputs.get(i).delete();
            }
            if ( writer != null ){
                try{ writer.close(); }catch(Exception e){}
            }
        }
    }

    /**
     * Delegate comparator used to sort StringWrapper instances. Delegates its behavior to
     * the sorter field.
     * @author Greg Cope
     *
     */
    private class ComparatorDelegate implements Comparator<StringWrapper>{

        @Override
        public int compare(StringWrapper o1, StringWrapper o2) {
            return sorter.compare(o1.string, o2.string);
        }
    }

    /**
     * Wrapper class for a String. This is necessary because duplicate lines would otherwise
     * collide via String's equals/hashCode within the HashMap used in the file merge; the
     * wrapper retains identity-based equality, so each line maps to its own reader.
     * @author Greg Cope
     *
     */
    private class StringWrapper implements Comparable<StringWrapper>{

        private final String string;

        public StringWrapper(String line){
            this.string = line;
        }

        @Override
        public int compareTo(StringWrapper o) {
            return string.compareTo(o.string);
        }
    }
}

Another, more extensive implementation of this algorithm is available as an open source External Sort library.

*The file needed to be sorted on the values of one column, but that column itself contained delimited values, each of which was a sorting criterion. Using Perl, the header was removed and the sorting column was split into separate columns. Unix sort was then used to sort the file, and Perl was used again to splice the previously split columns back together. Lastly, cat was used to reinsert the header into the file.






© 2008-2017 Greg Cope