My question is an add on from here, but I'm not meant to use the answer section for add-on questions.

If I have part of an XML file like this:

<eligibility>
  <criteria>
    <textblock>
      Inclusion Criteria:
      - women undergoing cesarean section for any indication
      - literate in german language
      Exclusion Criteria:
      - history of keloids
      - previous transversal suprapubic scars
      - known patient hypersensitivity to any of the suture materials used in the protocol
      - a medical disorder that could affect wound healing (eg, diabetes mellitus, chronic
      corticosteroid use)
    </textblock>
  </criteria>
  <gender>Female</gender>
  <minimum_age>18 Years</minimum_age>
  <maximum_age>45 Years</maximum_age>
  <healthy_volunteers>No</healthy_volunteers>
</eligibility>

I want to pull out all of the strings in this eligibility section (i.e. the string in the textblock section and the gender, minimum age, maximum age and healthy volunteers sections).

Using the code from that answer, I did this:

import sys

from bs4 import BeautifulSoup

soup = BeautifulSoup(open(sys.argv[1], 'r'), 'lxml')

eligibi = []
for eligibility in soup.find_all('eligibility'):
    d = {'other_name': eligibility.criteria.textblock.string, 'gender': eligibility.gender.string}
    eligibi.append(d)

print(eligibi)

My problem is I have many files. Sometimes the structure of the XML file might be:

eligibility -> criteria -> textblock -> text

eligibility -> other things (e.g. gender as above) -> text

eligibility -> text

Is there a way to just 'take all of the sub-headings and their texts'?

So in the above example, the list/dictionary would contain:

{criteria textblock: inclusion and exclusion criteria, gender: xxx, minimum_age: xxx, maximum_age: xxx, healthy_volunteers: xxx}

My problem is that, in reality, I am not going to know all the specific sub-tags of the eligibility tag, as each experiment could be different (e.g. maybe some say 'pregnant women accepted', 'drug history of XXX accepted' etc.).

So what I want is this: if I give it a tag name, it gives me all the sub-tags and the text of those sub-tags in a dictionary.

Extended XML for comment:

Subcutaneous Adaption and Cosmetic Outcome Following Caesarean Delivery

Klinikum Klagenfurt am Wörthersee

...and then the eligibility XML section above.

Solution

Since you have lxml installed, you can try the following (this code assumes that the leaf elements within a given element, i.e. eligibility, have unique tag names):

import sys

from lxml import etree

tree = etree.parse(sys.argv[1])
root = tree.getroot()

eligibi = []
for eligibility in root.xpath('//eligibility'):
    d = {}
    # Leaf elements: descendants that have no child elements of their own
    for e in eligibility.xpath('.//*[not(*)]'):
        d[e.tag] = e.text
    eligibi.append(d)

print(eligibi)
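If the uniqueness assumption does not hold for some files, a later leaf silently overwrites an earlier one with the same tag. A minimal sketch of one way around that, using only the standard library: key each leaf by its path below eligibility and collect the texts in a list. The helper name `leaf_texts`, the "/"-joined keys, and the repeated `<note>` element in the sample are illustrative assumptions, not part of the original answer or of the real files.

```python
import xml.etree.ElementTree as ET


def leaf_texts(element, prefix="", result=None):
    """Map each leaf's path below `element` to a list of its stripped texts."""
    if result is None:
        result = {}
    for child in element:
        path = prefix + "/" + child.tag if prefix else child.tag
        if len(child) == 0:
            # No child elements, so this is a leaf; keep repeats in a list.
            result.setdefault(path, []).append((child.text or "").strip())
        else:
            # Descend, carrying the path so nested leaves get distinct keys.
            leaf_texts(child, path, result)
    return result


# Hypothetical sample; <note> is made up to show a repeated leaf tag.
SAMPLE = """<eligibility>
  <criteria><textblock>Inclusion Criteria: ...</textblock></criteria>
  <gender>Female</gender>
  <minimum_age>18 Years</minimum_age>
  <note>first</note>
  <note>second</note>
</eligibility>"""

record = leaf_texts(ET.fromstring(SAMPLE))
print(record)
```

Here the nested textblock ends up under the key `criteria/textblock`, and both `<note>` values are kept under `note` instead of the second replacing the first.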

XPath explanation :

.//* : find all elements within the current eligibility element, at any depth (//) and with any tag name (*)

[not(*)] : filter the elements found by the previous part down to those that don't have any child element, i.e. leaf elements
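To see the effect of the two parts of the expression separately, here is a small sketch that runs each one against an inline document. The sample XML is a cut-down, hypothetical version of the eligibility section from the question, and the variable names are only for illustration.

```python
from lxml import etree

# Hypothetical sample shaped like the eligibility section in the question.
SAMPLE = """
<clinical_study>
  <eligibility>
    <criteria>
      <textblock>Inclusion Criteria: ...</textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
  </eligibility>
</clinical_study>
""".strip()

root = etree.fromstring(SAMPLE)
eligibility = root.find("eligibility")

# .//* on its own: every descendant element, including the container <criteria>.
all_descendants = [e.tag for e in eligibility.xpath(".//*")]

# Adding [not(*)]: only the elements that have no child element.
leaves = [e.tag for e in eligibility.xpath(".//*[not(*)]")]

print(all_descendants)
print(leaves)
```

The first list includes `criteria` because it is a descendant of eligibility; the second drops it because it contains a child element (`textblock`), which is why only tags that directly hold text end up in the dictionary.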
