A few days ago, I was asked to do a recap of a book's content. It is really funny and a bit depressing though because the content looks like this.
Yea, it is depressing because there is a lot of separate text and must be copied one by one and the most depressing part is the content is from page 70 until north than page 450. Think about it, how long is it going to take to copy all the content?!
Because I love my life and I do not want to waste it on something boring and repetitive, I started to look for any structured file that summarizes that content. I genuinely hoped I can get a kind of excel or csv.
But in the end, I found an XML file that defines the contents. I thought really hard about how to extract the information I want from this XML file without doing it one by one. Below is an example of the XML file.
Working on XML
Okay, this is good because XML is a structured document that can be extracted by code. Now, in order to code a program to extract the document, we need to find a pattern.
Always find the pattern before you code something — Joshuanatan
Just a little thought, you can skip until further notice
When you are getting bored when doing something, maybe it is because you are doing a repetitive task. Every repetitive task has a pattern. A pattern can be programmed. That is why every repetitive task can be replaced by a machine. Keep in mind that the standard of “repetitive” for machines is increasing. That is why there is an auto-drive technology. I do not belittle any of this technology, I just want to tell that at the end of the day, when we can extract any repetitive pattern, the machine can do it better than us the human-being. When you drive a car, you will accelerate when you have more space and will break when there are objects inside your safe space. With certain techniques that can help the machine to decide whether there are obstacles inside the safe space, the machine can drive.
You can continue here
Back to the topic, I need to get the pattern in order to retrieve the information from the document. These are my thinking results:
- Every text that I want to retrieve is inside the “decision” tag.
- Every “decision” tag, is always the child or under the “objective” tag.
- ….. (to be continued)
Now, we get the pattern, but the next problem is, the depth is random.
From the image above, I want to take the value of AC-2(i)1 which is 3 levels deep (AC-2 / AC-2(i) / AC-2(i)1) but with the same code I must be able to take the value of AC-2(1) inside the “Assessment Objective” section which is 1 level deep (AC-2(1)).
Because of the uncertainty of depth, the first thing that popped inside my head is recursive programming. With recursive programming, I can make the program go as deep as they need for each case (with a program I can dig into 3 levels deep but also can dig only into 2 levels deep). This is when the real fun starts.
Working with Recursive Programming (Basic of Recursive)
I really hate recursive programming when I was in university. Just because I got sleepy in that class and I missed the whole explanation (lol, sorry, my bad). But still, though, for me, this is just confusing, especially for the first time.
What is recursion? recursion is…
The process in which a function calls itself directly or indirectly is called recursion and the corresponding function is called as recursive function — https://www.geeksforgeeks.org/recursion/
On this site, there is a great visual explanation of how recursive is going.
To simply summarize, a function calls itself inside it hence it creates a “stacked” process I guess. The parent function has not finished yet but started a new process of that function and so on.
An important thing that I remember from my lecture is, there is a need for a base condition. Base condition is simply a condition that ends the recursive process or instead it will go forever.
Remember, the else condition only executed by the last one. When the last loop happens and it is not bigger than 1, it jumps to the else and returns. When it returns, it comes back to the function beforehand which is in line 6 (that calls the last loop). Because the function is done, it continues to execute, and apparently, there is no more code to be executed, then the function is done and is back to the function beforehand, and so forth until the first recursive calls (the top one) and finishes the whole function. (I hope you understand, read it slowly, imagine it, write it down on paper, you will understand it, it is not hard, just a little hard to digest)
The Final Solution
After I prepare the recursive concept and understand the pattern, I started to craft the code. I am using xml.etree.ElementTree module to parse the XML. I will use the code to explain what I am doing.
In Snippet 1, I import both modules I need which are xml.etree.ElementTree for XML parsing and pandas to help me working with the array. After that, I open the XML document in “rb” mode which is for reading mode. Now we parse the XML into the document tree and then get the root node.
I ran several tests and get some valuable things that can be used.
Now, we are entering the recursive function
In the “recursive” function, it takes a parameter of the current node. After that, I prepare an empty text that will be appended with some values later. The next step is to get the number of child nodes that this current node has. Then it loops all the way to the children nodes, every time the tag equals to objective, we know that it has more depth, then we will do the recursion and append the recursion return value to the “text” variable. On the other hand, if it is a “decision” tag, it will take the value and append it to the “text” variable. If all have done, finish the function and return the text variable
So, the main algorithm is, loop every child of current nodes, if it meets the “objective” tag, then go deeper (do the recursive) and append the return value to the variable, else if it meets the “decision” tag, get the value and append it to the variable. Return the value after finish the function.
Using this way, we can extract any value no matter how deep as long as it is inside the “objective” tag.
Now I will be talking about the main flow.
I am planning on taking some variables (not only the content) such as the code (AC-2), the name (Account Management), etc. So, I prepare an array that will keep the data. After that, I get the children node under the root node and loop over them one by one. I move the currently examined child node to the “child” variable.
Again, I loop every children node inside the currently examined node. In this section, I gather the information like the ID, the name, the family name, and etc including the “objective” tag. If we meet the “objective” tag, we will call the “recursive” function. I use an assumption that the family name, number, title, and objective has been set after the “recursive” function is called, hence I put it in the final_array.
But, there is one thing that I forget. There are some items that have an additional tag which is the control-enhancements tag. The control-enhancements tag contains information that needs to be extracted as well.
Because of that, we need special treatment whenever the current node has the “control-enhancements” tag. That is why, whenever the tag calls “control-enhancements”, it sets the control_enhancements_flag to true. If the flag is true, it will collect all the required data like id, title, and if it meets the objective tag, it calls the recursive function.
That is how I think about the solution, I really glad that I finally find how recursive is actually useful (when we are working on undecided depth). I also glad that I finally work with XML that benefits me.
That is all about XML, recursive, and how both of them work together. Feel free to give any suggestion at the most important is for GOD be all the glory! Soli Deo Gloria.