(Unicode C) Efficiently Process a Huge XML File

Demonstrates a technique for processing a huge XML file (can be any size, even many gigabytes).

Note: This example requires Chilkat v9.5.0.80 or greater.

#include <stdio.h>
#include <wchar.h>
#include <C_CkFileAccessW.h>
#include <C_CkXmlW.h>
#include <C_CkStringBuilderW.h>

void ChilkatSample(void)
    {
    HCkFileAccessW fac;
    BOOL success;
    HCkXmlW xml;
    HCkStringBuilderW sb;
    BOOL firstIteration;
    int retval;
    int numTransactions;
    const wchar_t *beginMarker;
    const wchar_t *endMarker;

    // This example shows a way to efficiently process a gigantic XML file -- one that may be too large
    // to fit in memory.  
    // 
    // Two types of XML parsers exist: DOM parsers and SAX parsers.

    // A DOM (Document Object Model) parser loads the entire XML document into memory,
    // which gives the application convenient, random-access interaction with the XML.
    // The Chilkat Xml class is a DOM parser.  Because the entire document must fit in memory,
    // huge XML files (on the order of gigabytes) usually cannot be loaded.

    // A SAX parser processes the XML file as an input stream.  No DOM exists.
    // Using a SAX parser is generally less convenient than using a DOM parser, because the
    // application must keep track of its own position and state as the stream is parsed.
    // 
    // The technique described here is a hybrid.  It streams the XML file as unstructured text
    // to extract fragments that are individually treated as separate XML documents loaded into
    // the Chilkat Xml parser.
    // 
    // For example, imagine your XML file is several GBs in size, but has a relatively simple structure, such as:
    // 
    // <Transactions>
    //     <Transaction id="1">
    //          ...
    //     </Transaction>
    //     <Transaction id="2">
    //          ...
    //     </Transaction>
    //     <Transaction id="3">
    //          ...
    //     </Transaction>
    // ...
    // </Transactions>

    // In the following code, each <Transaction ...> ... </Transaction> fragment
    // is extracted and loaded separately into an Xml object, where it can be manipulated
    // independently.  The XML file as a whole is never loaded into memory.

    fac = CkFileAccessW_Create();

    success = CkFileAccessW_OpenForRead(fac,L"qa_data/xml/transactions.xml");
    if (success == FALSE) {
        wprintf(L"%s\n",CkFileAccessW_lastErrorText(fac));
        CkFileAccessW_Dispose(fac);
        return;
    }

    xml = CkXmlW_Create();
    sb = CkStringBuilderW_Create();
    firstIteration = TRUE;
    retval = 1;
    numTransactions = 0;

    // The begin marker is "XML tag aware".  If the begin marker begins with "<"
    // and ends with ">", it is assumed to be an XML tag, and it also matches
    // text where a whitespace char appears in place of the ">".  This is how
    // "<Transaction>" matches an opening tag with attributes, such as <Transaction id="1">.
    beginMarker = L"<Transaction>";
    endMarker = L"</Transaction>";

    while (retval == 1) {
        CkStringBuilderW_Clear(sb);
        // The retval can have the following values:
        // 0: No more fragments exist.
        // 1: Captured the next fragment.  The text from beginMarker to endMarker, including the markers, is returned in sb.
        // -1: Error.
        retval = CkFileAccessW_ReadNextFragment(fac,firstIteration,beginMarker,endMarker,L"utf-8",sb);
        firstIteration = FALSE;

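        if (retval == 1) {
            numTransactions = numTransactions + 1;
            success = CkXmlW_LoadSb(xml,sb,TRUE);
            if (success == FALSE) {
                // The fragment could not be parsed as XML.  Report the error and stop.
                wprintf(L"%s\n",CkXmlW_lastErrorText(xml));
                break;
            }
            // Your application may now do what it needs with this particular XML fragment...
        }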

    }

    if (retval < 0) {
        wprintf(L"%s\n",CkFileAccessW_lastErrorText(fac));
    }

    wprintf(L"numTransactions: %d\n",numTransactions);


    CkFileAccessW_Dispose(fac);
    CkXmlW_Dispose(xml);
    CkStringBuilderW_Dispose(sb);

    }
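
To make the hybrid technique more concrete, the following is a minimal sketch, in plain C with only the standard library, of how a "tag aware" fragment reader could work. It is not Chilkat's implementation: the function name read_next_fragment, its parameters, and its return codes are hypothetical and only loosely mirror the sample above. It assumes single-byte (ASCII or utf-8) input, markers in which "<" appears only as the first character, and fragments that do not nest inside one another.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a Chilkat function).
 * Scans fp forward for the next fragment that starts with beginMarker and
 * ends with endMarker, and copies it (markers included) into out.
 * The begin marker is matched "tag aware": its final '>' may instead be a
 * whitespace char in the file, so "<Transaction>" also matches
 * "<Transaction id=...>", but not "<Transactions>".
 * Returns 1 if a fragment was captured, 0 if no more fragments exist,
 * -1 on error (out is too small, or a fragment is never terminated). */
static int read_next_fragment(FILE *fp, const char *beginMarker,
                              const char *endMarker, char *out, size_t outSize)
{
    size_t stemLen, endLen, matched = 0, n;
    int c;

    /* Assumption: the begin marker looks like "<Name>". */
    assert(beginMarker[0] == '<' && strlen(beginMarker) >= 3);
    stemLen = strlen(beginMarker) - 1;   /* marker without its trailing '>' */
    endLen = strlen(endMarker);

    /* Phase 1: skip text until the stem is followed by '>' or whitespace. */
    while ((c = fgetc(fp)) != EOF) {
        if (matched == stemLen) {
            if (c == '>' || isspace(c)) break;
            matched = 0;                 /* e.g. "<Transactions": not our tag */
        }
        if (c == beginMarker[matched]) matched++;
        else matched = (c == beginMarker[0]) ? 1 : 0;
    }
    if (c == EOF) return 0;

    /* Start the output with the stem plus the char that terminated it. */
    if (stemLen + 1 >= outSize) return -1;
    memcpy(out, beginMarker, stemLen);
    n = stemLen;
    out[n++] = (char)c;

    /* Phase 2: copy chars until the output ends with the end marker. */
    while ((c = fgetc(fp)) != EOF) {
        if (n + 1 >= outSize) return -1;
        out[n++] = (char)c;
        if (n >= endLen && memcmp(out + n - endLen, endMarker, endLen) == 0) {
            out[n] = '\0';
            return 1;
        }
    }
    return -1;   /* reached end of file inside a fragment */
}
```

A caller would invoke this in a loop, much like the while loop in the sample, and hand each captured fragment to a DOM parser. Only one fragment is held in memory at a time, which is what keeps memory use independent of the total file size.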