If you're a data analyst or data scientist or work with data in some way, shape, or form, then at some point you've had to deal with random text files full of data that needs to be extracted. And more than likely, you end up forgetting how to do this each and every time in whichever language you use in your endeavors. So today I'm going to cover all the possible files that you may need to read into your programs in the hopes that this can be a one-stop shop for your data extraction needs.
Table of contents
1. Text files
This is the easiest, and one of the more common files that you will encounter on your journey. The fastest way to get this done, is to use the System.IO File object, which comes with a handy method ReadAllText, which does exactly what its name implies.
System.IO.File
var fileContents = System.IO.File.ReadAllText(Server.MapPath(@"~/data/important.txt"));
This is the quickest way to read in the entire contents of a file, however, that's also it's weakness. If you're loading in massive files, then you're going to have a bad time. Loading an object in memory with a ton of data that you will have to parse and traverse is probably not going to go well. But if you're dealing with a file in the Kilobytes perhaps, then this would be a good approach.
Note that the ReadAllText method requires a drive directory path in order to retrieve the files contents. And because we don't know the exact drive path normally, as that could result in a security concern, we can use the built in .NET Server method MapPath which takes in a relative file path, and returns the physical location on that particular server.
You can also read the entire contents of a file into an array of strings. More convenient for CSV files, however the same resource limitations apply as well. In order to accomplish this, you can use the ReadAllLines method of the File class.
string[] lines = System.IO.File.ReadAllLines(Server.MapPath(path));
If performance becomes an issue however, you can use the following StreamReader approach.
StreamReader
The StreamReader implementation requires a bit more work, however, it is more resource friendly, as it will open a stream to your file in which you can read the contents bytes at a time as oppose to all at once.
note: You won't be able to use the file while it is read from. This includes deletions or edits to the file.
try
{
using (StreamReader sr = new StreamReader(Server.MapPath("~/TestFile.txt")))
{
string line;
while ((line = sr.ReadLine()) != null)
{
// do what you will
}
}
}
catch (Exception e)
{
// log the error
}
2. Read in XML
This is a bit more work, and there are a few more ways to go about doing this in .NET. There are a few built in .NET classes that help work with XML, but you always have option of parsing the content yourself. There is no right or wrong here. Many times, I need data in a hurry, and don't normally have time to make sure that my XML is well formatted and such.
Using XmlTextReader
This is the most straightforward way to read in an XML file, however, it requires the most hands on work.
XmlTextReader reader = new XmlTextReader(strXmlFile);
while(reader.Read())
{
}
This is also the most resource friendly method to read in an XML file because it is using a stream of data. However, it is also the most hands on and somewhat tedious approaches. Each and every node, attribute, element etc will have to be parsed and detected by your code, so knowing the schema beforehand becomes important.
You also have the option to use the XMLDocument object however. Which gives much more built-in support as it loads the XML document into a strongly typed object.
Using XmlDocument
This is the easiest way to handle XMl files in .NET, however, it is also more resource intensive. I've seen many examples of its use with 10 maybe 20 records as sample data. In real life however, you're going to be loading in large sized XML files, and that, is noticeable on your hardware.
XmlDocument doc = new XmlDocument();
doc.Load(serverpath);
Once you load up the XML document you have some options to explore. You can get the entire contents of the file using the following:
string contents = doc.OuterXml;
Or you can traverse through the document node by node. But again, many ways you can go about doing this. For example, you can get elements by their tag names as such:
var ele = doc.GetElementsByTagName("book");
var book1 = doc.GetElementById("book1");
I find this to be the most useful approach as I have much more control over how I manipulate the data. I can get all of the "book" elements, and for each use the same GetElementsByTagName method to extract data from that particular node. You will have to play around with the many built in functions to see which is the most beneficial to you.
You also have the option of loading XML content directly into the Load method as well. But of course, since we're thinking more big data, this probably won't apply to you.
3. Read in JSON
This one is yet more work than the XML, but definitely comes in handy for many many things. For one, you can load data directly into a strongly-typed object with very minimal work. Here is a sample JSON file that you can use for your testing pleasure.
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
Using JSON.net
The simplest way to accomplish reading in JSON data is to use Json.Net by newtonsoft. It can accomplish two major things for you.
1. Serialize data into JSON
Product product = new Product();
product.Name = "Apple";
product.Expiry = new DateTime(2008, 12, 28);
product.Sizes = new string[] { "Small" };
string json = JsonConvert.SerializeObject(product);
This will get converted into the following string:
{
"Name": "Apple",
"Expiry": "2008-12-28T00:00:00",
"Sizes": [
"Small"
]
}
2. Deserialize data into a strongly-typed object
string json = @"{
'Name': 'Bad Boys',
'ReleaseDate': '1995-4-7T00:00:00',
'Genres': [
'Action',
'Comedy'
]
}";
Movie m = JsonConvert.DeserializeObject(json);
string name = m.Name;
And you are 1 DLL away from having full functionality of it in your project. Just head on over to newtonsoft's website to download the source package.
Excel files
.NET 4+ enables C# to read and also manipulate Excel files for PCs that have Excel installed. You will have to add Microsoft Excel to your project to access the object from Visual C# .NET. Create a new project in your application and in your C# Form, add a Command Button.
However, if your application needs to read data from an Excel file and you don’t have Microsoft Office installed in your system, your application will fail to read the file and send an error while loading the Excel dll. The easy way out is to use a c# excel API.
Read an Excel Sheet
Excel sheets are still around. And despite their antiquity, they are probably holding some very important data right now in someone's 10 year old desktop. The cool thing about Excel sheets, is that they are built using the ODBM data standard, and in .NET that means that we can use SQL Queries to parse and retrieve data.
string json = @"{
'Name': 'Bad Boys',
'ReleaseDate': '1995-4-7T00:00:00',
'Genres': [
'Action',
'Comedy'
]
}";
Movie m = JsonConvert.DeserializeObject<movie>(json);
string name = m.Name;
</movie>
Data extraction is pretty fun for the most part. You get to turn raw strings into meaningful structures and you get to play around with it as you wish. Having said that, there is almost always a performance problem when doing so. Or a timing issue. Depending on the amount of processing that you'll be doing, you could end up with a process that takes 1 second per record to process, which sounds great, just to realize that you have 1 million records to process. So if you have 277 hours of processing time available then have at it. But if you don't, then a big part of data analysis will also have to deal with how well your application performs.