XML Recursive

Story Telling

Example of the Content

Yes, it is depressing: there are many separate pieces of text that must be copied one by one, and worst of all, the content runs from page 70 to beyond page 450. Think about it: how long would it take to copy all of that?!

Because I love my life and do not want to waste it on something boring and repetitive, I started looking for a structured file that summarizes that content. I genuinely hoped to find an Excel or CSV file.

But in the end, I found an XML file that defines the contents. I thought really hard about how to extract the information I want from this XML file without doing it one by one. Below is an example of the XML file.

A snippet of the XML Document

Working on XML

Always find the pattern before you code something — Joshuanatan

Just a little thought, you can skip until further notice

You can continue here

  1. Every text that I want to retrieve is inside the “decision” tag.
  2. Every “decision” tag is always a child of (nested under) an “objective” tag.
  3. ….. (to be continued)
A Snapshot to Make the Thinking Result Clearer

Now we have the pattern, but the next problem is that the depth varies from item to item.

Example of Depth Differences

From the image above, I want to take the value of AC-2(i)1 which is 3 levels deep (AC-2 / AC-2(i) / AC-2(i)1) but with the same code I must be able to take the value of AC-2(1) inside the “Assessment Objective” section which is 1 level deep (AC-2(1)).

Because the depth is uncertain, the first thing that popped into my head was recursive programming. With recursion, the program can go as deep as each case needs: the same code can dig 3 levels deep in one place and only 2 levels deep in another. This is when the real fun starts.

Working with Recursive Programming (Basic of Recursive)

What is recursion? Recursion is…

The process in which a function calls itself directly or indirectly is called recursion and the corresponding function is called as recursive function — https://www.geeksforgeeks.org/recursion/

This site has a great visual explanation of how recursion works.

Visual Explanation of Recursive by https://www.geeksforgeeks.org/recursion/

To summarize simply: a function calls itself from inside its own body, which creates a stack of calls. The parent call has not finished yet when it starts a new call of the same function, and so on.

An important thing I remember from my lectures is that recursion needs a base condition. The base condition is simply the condition that ends the recursive process; without it, the recursion would never stop on its own.

A Simple Snippet of Recursive

Remember, the else branch is executed only by the last call. When the last call happens and the value is not bigger than 1, it jumps to the else and returns. When it returns, control goes back to the call before it, at line 6 (the line that made the last call). Because the inner call is done, that function continues executing; since there is no more code to execute, it finishes too and returns to the call before it, and so forth until the first (topmost) call finishes and the whole function is done. (I hope this makes sense. Read it slowly, imagine it, write it down on paper; it is not hard, just a little hard to digest.)
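
As a minimal sketch of the same idea (not necessarily the exact snippet described above), here is a small recursive function with a base condition. The name countdown and the trace list are my own assumptions for illustration; the trace records the order in which each call starts and resumes.

```python
def countdown(n, trace):
    """Recurse while n is bigger than 1, recording every step in trace."""
    if n > 1:
        trace.append(f"enter {n}")
        countdown(n - 1, trace)        # the function calls itself
        trace.append(f"back in {n}")   # runs only after the inner call returns
    else:
        # Base condition: n is not bigger than 1, so the recursion stops here.
        trace.append("base condition reached")
    return trace


steps = countdown(3, [])
print("\n".join(steps))
```

Reading the printed trace from top to bottom shows the "stacked" behaviour: the outer calls start first, the base condition is reached once, and then each outer call resumes in reverse order.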

The Final Solution

Snippet 1 (Preparing)

In Snippet 1, I import the two modules I need: xml.etree.ElementTree for XML parsing and pandas to help me work with the array. After that, I open the XML document in “rb” mode, which reads the file as binary. Then I parse the XML into a document tree and get the root node.

I ran several tests and found some valuable things that can be used.
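
The preparation step can be sketched roughly like this. The sample XML, its tag names (controls, control, number, title) and the use of io.BytesIO in place of a real file opened with open(path, "rb") are all assumptions made so the example runs on its own; pandas is left out because it is not needed for parsing.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real document; the structure is assumed.
SAMPLE_XML = b"""<controls>
  <control>
    <number>AC-2</number>
    <title>Account Management</title>
  </control>
</controls>"""

# With a real file this would be: with open(path, "rb") as xml_file: ...
# BytesIO behaves like a file that was opened in binary read mode.
with io.BytesIO(SAMPLE_XML) as xml_file:
    tree = ET.parse(xml_file)   # parse the XML into a document tree

root = tree.getroot()           # the top-most node of the tree
print(root.tag, len(root))
```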

Snippet 2 (Tests)
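
The kind of quick tests meant here might look like the sketch below. It only uses standard Element features (tag, text, len, indexing, iteration); the sample XML and its tag names are assumptions, not the real document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file is much larger and may differ in structure.
SAMPLE_XML = """<controls>
  <control>
    <number>AC-2</number>
    <title>Account Management</title>
  </control>
</controls>"""

root = ET.fromstring(SAMPLE_XML)
first = root[0]                             # children can be reached by index

print(root.tag)                             # tag name of the root node
print(len(root))                            # number of direct children
print([child.tag for child in first])       # iterating a node yields its children
print(first[0].text)                        # text stored inside a node
```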

Now we enter the recursive function.

Snippet 3 (Recursive)

The “recursive” function takes the current node as its parameter. First, I prepare an empty string, “text”, that values will be appended to later. The next step is to get the number of child nodes that the current node has. Then it loops over the child nodes. Every time a child's tag equals “objective”, we know there is more depth, so we recurse and append the return value of the recursion to the “text” variable. If instead it is a “decision” tag, the function takes its value and appends it to the “text” variable. When all children have been processed, the function finishes and returns the “text” variable.

So, the main algorithm is: loop over every child of the current node; if the child is an “objective” tag, go deeper (recurse) and append the return value to the variable; else if it is a “decision” tag, get its value and append it to the variable. Return the variable when the loop finishes.

This way, we can extract any value no matter how deep it is, as long as it is inside an “objective” tag.
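
The algorithm described above can be sketched as follows. The sample XML is an assumption with placeholder decision texts, and this sketch iterates over the children directly instead of counting them first; the tag names “objective” and “decision” come from the pattern found earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two different depths under one "objective".
SAMPLE_XML = """<objective>
  <number>AC-2</number>
  <objective>
    <number>AC-2(i)</number>
    <objective>
      <number>AC-2(i)1</number>
      <decision>first decision text</decision>
    </objective>
  </objective>
  <objective>
    <number>AC-2(ii)</number>
    <decision>second decision text</decision>
  </objective>
</objective>"""


def recursive(node):
    """Collect the text of every "decision" found at any depth under node."""
    text = ""
    for child in node:
        if child.tag == "objective":
            text += recursive(child)                     # more depth: go deeper
        elif child.tag == "decision":
            text += (child.text or "").strip() + "\n"    # the value we want
    return text


top = ET.fromstring(SAMPLE_XML)
print(recursive(top))
```

There is no explicit else branch here: a node without any “objective” child simply makes no further call, and that acts as the base condition.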

Now I will be talking about the main flow.

Snippet 4 (Main Flow)

I plan to take several variables, not only the content, such as the code (AC-2), the name (Account Management), etc. So, I prepare an array to hold the data. After that, I get the child nodes under the root node and loop over them one by one, storing the node currently being examined in the “child” variable.

Snippet 4 (Main Flow Contd)

Again, I loop over every child node inside the node currently being examined. In this section, I gather information such as the ID, the name, and the family name, along with the “objective” tag. When we meet the “objective” tag, we call the “recursive” function. I assume that the family name, number, title, and objective have all been set by the time the “recursive” function has been called, so at that point I put them into final_array.

But there is one thing I forgot. Some items have an additional tag, control-enhancements, which contains information that needs to be extracted as well.
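
Put together, the main flow might look roughly like the sketch below. The tag names family, number and title, the sample XML, and the second placeholder control are assumptions; only final_array, child, and the “objective” and “decision” tags are taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; XX-1 is a placeholder, not a real control.
SAMPLE_XML = """<controls>
  <control>
    <family>ACCESS CONTROL</family>
    <number>AC-2</number>
    <title>Account Management</title>
    <objective>
      <objective>
        <decision>nested decision text</decision>
      </objective>
      <decision>top-level decision text</decision>
    </objective>
  </control>
  <control>
    <family>PLACEHOLDER FAMILY</family>
    <number>XX-1</number>
    <title>Placeholder Title</title>
    <objective>
      <decision>another decision text</decision>
    </objective>
  </control>
</controls>"""


def recursive(node):
    """Collect the text of every "decision" found at any depth under node."""
    text = ""
    for child in node:
        if child.tag == "objective":
            text += recursive(child)
        elif child.tag == "decision":
            text += (child.text or "").strip() + "\n"
    return text


root = ET.fromstring(SAMPLE_XML)
final_array = []
for child in root:                       # one node per control
    family = number = title = ""
    for item in child:                   # the fields inside that control
        if item.tag == "family":
            family = item.text
        elif item.tag == "number":
            number = item.text
        elif item.tag == "title":
            title = item.text
        elif item.tag == "objective":
            objective = recursive(item)
            # Assumes family, number and title appear before "objective".
            final_array.append([family, number, title, objective])

for row in final_array:
    print(row)
```

A list of rows like this can then be handed to pandas, for example with pandas.DataFrame(final_array, columns=[...]), if a table or CSV export is the goal.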

control-enhancements Tag
Snippet 5 (Main Flow Contd)

Because of that, we need special treatment whenever the current node has the “control-enhancements” tag. That is why, whenever the tag is “control-enhancements”, the code sets control_enhancements_flag to true. While the flag is true, it collects all the required data, such as the id and title, and when it meets an “objective” tag, it calls the recursive function.

That is how I thought through the solution. I am really glad that I finally found a case where recursion is actually useful (when working with an unknown depth). I am also glad that I finally worked with XML in a way that benefited me.
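
One possible way to handle this is sketched below. The nesting (a control-enhancement element inside control-enhancements), the field names, and the rows list are assumptions for illustration; only the “control-enhancements” tag, control_enhancements_flag, and the recursive function come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the enhancement title and decision text are placeholders.
SAMPLE_XML = """<controls>
  <control>
    <number>AC-2</number>
    <title>Account Management</title>
    <control-enhancements>
      <control-enhancement>
        <number>AC-2(1)</number>
        <title>Placeholder enhancement title</title>
        <objective>
          <decision>enhancement decision text</decision>
        </objective>
      </control-enhancement>
    </control-enhancements>
  </control>
</controls>"""


def recursive(node):
    """Collect the text of every "decision" found at any depth under node."""
    text = ""
    for child in node:
        if child.tag == "objective":
            text += recursive(child)
        elif child.tag == "decision":
            text += (child.text or "").strip() + "\n"
    return text


root = ET.fromstring(SAMPLE_XML)
rows = []
for control in root:
    for item in control:
        # The flag marks the nodes that need the special treatment.
        control_enhancements_flag = item.tag == "control-enhancements"
        if control_enhancements_flag:
            for enhancement in item:
                row = {"number": "", "title": "", "objective": ""}
                for field in enhancement:
                    if field.tag in ("number", "title"):
                        row[field.tag] = field.text
                    elif field.tag == "objective":
                        row["objective"] = recursive(field)
                rows.append(row)

print(rows)
```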

Closing

A humble learner of everything around IT especially in IT implementation, governance, risk management, and cybersecurity.
